[Language models and brains align due to more than next-word prediction and word-level information 🔗](https://arxiv.org/abs/2212.00596)

More Than Just Prediction: Why Language Models and Human Brains Actually Align

In recent years, a fascinating intersection has emerged between artificial intelligence and neuroscience. Language Models (LMs)—the technology behind systems like GPT—have demonstrated an uncanny ability to predict human brain activity. When a human reads a book inside an fMRI scanner, the internal activations of an LM processing that same text can be mapped surprisingly well to the biological signals in the human’s brain. This phenomenon has sparked a major debate: Why do they align? ...
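
A common way to test this alignment is a linear encoding model: regress each fMRI voxel’s response onto the LM’s hidden states for the same words. Below is a minimal sketch, assuming ridge regression and pre-extracted features; the array names and shapes are hypothetical placeholders, not data from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one row per word the subject read.
# lm_activations: (n_words, d_model) hidden states from the LM
# brain_responses: (n_words, n_voxels) fMRI signals for the same words
lm_activations = np.random.randn(1000, 768)
brain_responses = np.random.randn(1000, 200)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    lm_activations, brain_responses, test_size=0.2, random_state=0
)

# Fit one ridge-regularized linear map from LM space to voxel space.
encoder = Ridge(alpha=1.0).fit(X_tr, Y_tr)

# "Brain score": per-voxel correlation between predicted and true responses.
pred = encoder.predict(X_te)
scores = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(Y_te.shape[1])]
print(f"mean brain score: {np.mean(scores):.3f}")
```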

2022-12 · 9 min · 1737 words
[Language is Scary when Over-Analyzed: Unpacking Implied Misogynistic Reasoning with Argumentation Theory-Driven Prompts 🔗](https://arxiv.org/abs/2409.02519)

Reading Between the Lines: Can LLMs Detect Implicit Misogyny?

Content moderation has come a long way. If you post a slur or a blatantly violent threat on social media today, there is a high probability that an automated system will flag it and remove it within hours. The algorithms trained to spot explicit keywords are efficient. However, hate speech is evolving. It is becoming quieter, subtler, and more insidious. Consider the difference between a direct insult and a sarcastic remark that relies on a shared, negative stereotype. The former is easy for a machine to catch; the latter requires cultural context and reasoning capabilities that most models lack. This is the domain of implicit hate speech. ...

2024-09 · 7 min · 1337 words
[Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models 🔗](https://arxiv.org/abs/2404.02575)

How Thinking Like a Compiler Boosts AI Reasoning

Large Language Models (LLMs) like GPT-4 and Claude are incredibly proficient at generating human-like text, writing poetry, and even explaining complex historical events. However, there is a specific domain where these models often stumble: algorithmic reasoning. Algorithmic reasoning isn’t just about answering a question; it’s about adhering to a strict set of rules, decomposing a complex problem into a sequence of steps, and maintaining the state of variables throughout the process. A classic example is a logic puzzle involving multiple people lying or telling the truth (a “Web of Lies”), or navigating a grid based on a sequence of turns and steps. ...
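
To make “maintaining the state of variables” concrete, here is a tiny “Web of Lies”-style puzzle solved the way a program would: track one boolean per person and update it statement by statement. This is a generic illustration of the target reasoning style, not the paper’s prompting pipeline.

```python
# "Web of Lies": each statement asserts whether the previous person
# tells the truth. Track the state explicitly, one step at a time.
def web_of_lies(first_is_truthful: bool, says_previous_truthful: list[bool]) -> bool:
    state = first_is_truthful
    for claim in says_previous_truthful:
        # A claim matching reality means the speaker is truthful;
        # a mismatched claim means the speaker lies.
        state = (state == claim)
    return state

# "Alice tells the truth. Bob says Alice lies. Carol says Bob tells the truth."
# Bob contradicts a truthful Alice, so Bob lies; Carol vouches for Bob,
# so Carol lies too.
print(web_of_lies(True, [False, True]))  # -> False (Carol lies)
```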

2024-04 · 9 min · 1835 words
[Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs 🔗](https://arxiv.org/abs/2403.19827)

How AI Learns the Unseen: The Mystery of the 'Beautiful Five Days'

Imagine you are reading a book and you come across the phrase: “a beautiful five days.” To a native English speaker, this sounds perfectly natural. You might say, “We spent a beautiful five days in Rome.” But if you pause and look at the grammar, something strange is happening. The word “a” is a singular article (used for one thing, like “a dog”). The phrase “five days” is plural. In strict grammatical terms, combining a singular article with a plural noun phrase should be a disaster. We don’t say “a days” or “a five dogs.” Yet, the “Article + Adjective + Numeral + Noun” (AANN) construction is perfectly acceptable English. ...

2024-03 · 8 min · 1560 words
[Language Concept Erasure for Language-invariant Dense Retrieval 🔗](https://aclanthology.org/2024.emnlp-main.736.pdf)

Breaking the Language Barrier: How LANCER Teaches Models to Ignore Language Identity

Imagine you are looking for instructions on how to bake a specific type of pastry. You type your query into a search engine. Somewhere out there, the perfect recipe exists, written by a master baker. However, that baker wrote the recipe in Italian, and you searched in English. Ideally, a modern semantic search engine should find that document. After all, “flour” means “farina,” and the semantic intent of baking is the same regardless of the language used to describe it. If we have a translation tool, the language of the document shouldn’t matter—only the meaning should. ...

10 min · 1997 words
[Ladder: A Model-Agnostic Framework Boosting LLM-based Machine Translation to the Next Level 🔗](https://arxiv.org/abs/2406.15741)

Can We Fix Broken Translations? Introducing MT-Ladder: The "Spell-Check" for LLM Translators

Language barriers are arguably the biggest obstacle to global communication, and for a long time, Machine Translation (MT) has been the battering ram trying to break them down. In recent years, Large Language Models (LLMs) like GPT-4 have revolutionized this field, offering translations that are not just accurate but contextually rich. However, there is a catch. To get top-tier translation performance, you typically have two options:

1. Use a massive, general-purpose LLM (like GPT-4): This yields excellent results but comes with exorbitant infrastructure and deployment costs.
2. Train a translation-specific LLM (like ALMA): This involves pre-training on billions of tokens and fine-tuning on millions of high-quality, human-annotated translation pairs. This is resource-intensive and expensive due to the need for human labor.

This creates a significant gap. Is it possible to take a smaller, open-source model and boost its translation capabilities to rival the giants, without spending a fortune on human annotation or massive compute? ...

2024-06 · 9 min · 1785 words
[Label Confidence Weighted Learning for Target-level Sentence Simplification 🔗](https://aclanthology.org/2024.emnlp-main.999.pdf)

Making Sense of Noise: How Label Confidence Weighted Learning Revolutionizes Text Simplification

Imagine trying to explain a complex scientific concept to a 5-year-old, then to a 10-year-old, and finally to a high schooler. You would change your vocabulary, sentence structure, and tone for each “target” audience. This is the essence of Target-level Sentence Simplification. While humans do this naturally, teaching machines to generate text at specific complexity levels (like “Grade 3” vs. “Grade 8”) is notoriously difficult. The primary bottleneck isn’t the model architecture; it’s the data. We simply don’t have enough high-quality parallel datasets—pairs of complex sentences aligned with their simplified versions across multiple grade levels. ...

6 min · 1263 words
[LUQ: Long-text Uncertainty Quantification for LLMs 🔗](https://arxiv.org/abs/2403.20279)

When LLMs Ramble: Measuring Uncertainty in Long-Form Text Generation

Large Language Models (LLMs) like GPT-4 and Gemini have transformed how we interact with information. We ask them to write emails, summarize complex topics, and even generate biographies of historical figures. But there is a well-known catch: hallucinations. An LLM can speak with absolute confidence while fabricating facts entirely. For simple “Yes/No” questions or multiple-choice classifications, determining if a model is uncertain is relatively straightforward. We can look at the probability scores (logits) of the output tokens. But how do we measure confidence when the model generates a 300-word biography? If the model writes three paragraphs about a disease, how do we know which sentences are factual and which are creative fiction? ...
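
For the short-answer case the excerpt mentions, confidence really can be read off the logits. Here is a minimal sketch using Hugging Face transformers (the model and prompt are placeholders), computing the average log-probability of a candidate answer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: Is Paris the capital of France? A:"
answer = " Yes"

# Score the answer tokens under the model.
ids = tok(prompt + answer, return_tensors="pt").input_ids
n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
with torch.no_grad():
    logits = model(ids).logits

# Log-prob of each token given everything before it (shift by one).
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
token_lp = log_probs[torch.arange(len(targets)), targets][n_prompt - 1:]

# Higher average log-prob = a more confident short answer.
print(f"avg answer log-prob: {token_lp.mean().item():.3f}")
```

This works for “Yes/No” answers precisely because there are only a few tokens to score; the post goes on to ask what replaces it when the output is a 300-word biography.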

2024-03 · 8 min · 1543 words
[LONGAGENT: Achieving Question Answering for 128k-Token-Long Documents through Multi-Agent Collaboration 🔗](https://aclanthology.org/2024.emnlp-main.912.pdf)

How Multi-Agent Collaboration Beats GPT-4 on Long Documents: Inside LONGAGENT

In the rapidly evolving world of Large Language Models (LLMs), the race for “context window” supremacy has been fierce. We’ve gone from models that could barely hold a conversation history to behemoths like GPT-4 and Claude 2, which boast context windows of 128k and 200k tokens respectively. Ideally, this means you should be able to feed a model an entire novel, a legal repository, or a massive technical manual, and ask it any question. ...

8 min · 1693 words
[LLoCO: Learning Long Contexts Offline 🔗](https://arxiv.org/abs/2404.07979)

The Cheat Sheet Strategy: How LLoCO Masters Long Contexts Efficiently

Imagine you are a student preparing for a grueling final exam covering an entire textbook. You have three ways to tackle this. First, the “Open Book” approach: you bring the entire textbook into the exam hall. You have all the information, but flipping through thousands of pages to find one specific answer takes forever. Second, the “Closed Book” approach: you rely solely on what you memorized. It’s fast, but if the exam asks about specific details from page 342, you’re out of luck. ...

2024-04 · 9 min · 1746 words
[LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training 🔗](https://arxiv.org/abs/2406.16554)

Turning Giants into Specialists: How to Build a Mixture-of-Experts Model from LLaMA

In the current landscape of Artificial Intelligence, scaling laws have ruled supreme: if you want a smarter model, you make it bigger. Models have ballooned from millions to billions, and now trillions of parameters. However, we are hitting a wall. The sheer computational cost of running inference on these massive dense models is becoming unsustainable for many researchers and applications. Enter the Mixture-of-Experts (MoE) architecture. MoE promises the best of both worlds: it decouples the model’s total size (capacity) from its computational cost (inference latency). By only activating a tiny fraction of the network for each token it processes, an MoE model can have the knowledge capacity of a giant model while running with the speed of a much smaller one. ...
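
The “activate only a tiny fraction” idea comes down to a router picking the top-k experts per token. Here is a self-contained toy sketch of such a layer; the dimensions, expert count, and k are illustrative, not LLaMA-MoE’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy MoE layer: route each token to its top-k experts only."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)           # (n_tokens, n_experts)
        weights, idx = gate.topk(self.k, dim=-1)           # keep only top-k experts
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize the gate
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TopKMoE()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts ran per token
```

Total capacity scales with `n_experts`, while per-token compute scales only with `k`, which is exactly the capacity/latency decoupling the excerpt describes.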

2024-06 · 8 min · 1668 words
[LLMs learn governing principles of dynamical systems, revealing an in-context neural scaling law 🔗](https://aclanthology.org/2024.emnlp-main.842.pdf)

Can LLMs Learn Physics? Uncovering In-Context Neural Scaling Laws

When we think of Large Language Models (LLMs) like LLaMA or GPT-4, we usually think of them as masters of language. They write poetry, summarize emails, and debug code. But at their core, these models are sequence predictors—they look at a stream of tokens and predict what comes next. This raises a fascinating question: If the sequence isn’t text, but data from a physical system, can the LLM learn the laws of physics just by looking at the numbers? ...
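
The experimental setup implied here is simple to sketch: serialize a trajectory from a dynamical system into plain text and let the model continue it. A hedged illustration follows; the system, precision, and prompt format are placeholders, not the paper’s exact protocol.

```python
import numpy as np

# Simulate a noisy damped oscillator and serialize it as text.
t = np.linspace(0, 10, 50)
traj = np.exp(-0.1 * t) * np.cos(2 * t) + np.random.normal(0, 0.01, t.shape)

# Fixed-precision serialization so each value is a short token sequence.
series = ",".join(f"{x:.2f}" for x in traj[:40])
prompt = series + ","  # ask the LLM to continue the sequence

# Conceptually: completion = llm.generate(prompt), then parse the
# completion back into floats and compare against the held-out traj[40:].
print(prompt[:60], "...")
```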

8 min · 1704 words
[LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing 🔗](https://arxiv.org/abs/2406.16253)

Can AI Grade AI? The Truth About LLMs in Peer Review

The world of academic research is facing a crisis of scale. Every year, the number of paper submissions to top-tier Artificial Intelligence conferences skyrockets. For the researchers on the receiving end, this means an ever-growing pile of papers to read, critique, and review. It is a workload that is becoming unsustainable. Enter Large Language Models (LLMs). We know they can write poetry, debug code, and pass the bar exam. Naturally, the question arises: Can LLMs help relieve the burden of peer review? ...

2024-06 · 9 min · 1844 words
[LLMs Are Zero-Shot Context-Aware Simultaneous Translators 🔗](https://arxiv.org/abs/2406.13476)

Context is King: How Off-the-Shelf LLMs are Mastering Simultaneous Translation

Introduction Imagine you are a simultaneous interpreter at a high-stakes medical conference. A speaker rushes to the podium and begins talking rapidly about cardiology. They mention a patient suffering from “PVC.” If you are just translating word-for-word, you might stumble. Is it Polyvinyl Chloride (a plastic)? No, in this context, it stands for Premature Ventricular Contraction. To make that distinction instantly, you need context. You need to know the topic is cardiology. You might even have a glossary prepared beforehand. ...

2024-06 · 8 min · 1634 words
[LLMEdgeRefine: Enhancing Text Clustering with LLM-Based Boundary Point Refinement 🔗](https://aclanthology.org/2024.emnlp-main.1025.pdf)

Taming the Outliers: How LLMEdgeRefine Revolutionizes Text Clustering

Imagine you are a librarian tasked with organizing a massive pile of books into genres. Most books are easy: the ones with spaceships go into Science Fiction, and the ones with dragons go into Fantasy. But what do you do with a book about a dragon flying a spaceship? Or a book with a torn cover and a vague title? In the world of data science, this is the problem of Text Clustering. It is a fundamental task where we group similar documents together without having any prior labels. While it sounds simple, it is notoriously difficult to get right. Traditional mathematical methods often choke on “edge points”—the ambiguous data points or outliers that don’t fit neatly into a circle. On the other hand, modern Large Language Models (LLMs) like GPT-4 are brilliant at understanding these nuances, but using them to read every single document in a massive dataset is prohibitively expensive and slow. ...
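
The expensive-vs-cheap trade-off suggests a hybrid: let a classical clusterer handle the easy points and spend LLM calls only on the ambiguous ones. Here is a minimal sketch of picking those boundary points by their embedding-space margin; this is a heuristic illustration, and LLMEdgeRefine’s actual selection criterion may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

# embeddings: one vector per document (random placeholders here)
embeddings = np.random.randn(500, 64)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embeddings)

# Margin = distance to 2nd-closest centroid minus distance to closest.
# A small margin means the point sits near a cluster boundary.
dists = np.sort(km.transform(embeddings), axis=1)
margin = dists[:, 1] - dists[:, 0]
edge_points = np.argsort(margin)[:25]  # the 25 most ambiguous documents

# Only these few documents would be sent to the LLM for re-assignment,
# keeping API cost low while fixing the hardest cases.
print(edge_points[:10])
```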

9 min · 1723 words
[LLM4Decompile: Decompiling Binary Code with Large Language Models 🔗](https://arxiv.org/abs/2403.05286)

From Binary to C: How LLM4Decompile is Revolutionizing Reverse Engineering

Imagine you find an old executable file on a server. It’s a critical piece of legacy software for your company, but there’s one problem: the source code is gone. No GitHub repo, no backup zip files. Just the raw binary. To update or fix this software, you need to perform decompilation—the process of reversing the compilation steps to turn machine code back into a human-readable programming language like C. For decades, this has been the realm of specialized tools like Ghidra or IDA Pro. While powerful, these tools often produce output that looks more like a mathematical riddle than clean code. ...

2024-03 · 8 min · 1604 words
[LLM-based Code-Switched Text Generation for Grammatical Error Correction 🔗](https://arxiv.org/abs/2410.10349)

Fixing the Unfixable: How LLMs are Teaching AI to Correct Code-Switched Text

Imagine you are an English learner whose native language is Japanese. You are chatting with a friend and you type: “According to the test, my shortcomings are 靴下 and ご主人様.” To a bilingual speaker, this sentence makes perfect sense. It’s a classic example of Code-Switching (CSW)—the fluid alternation between two or more languages in a single conversation. It is a sign of linguistic competence, not confusion. However, if you feed that sentence into a standard Grammatical Error Correction (GEC) system (like the ones powering your favorite writing assistants), it will likely fail. It might flag the Japanese characters as “spelling errors,” try to delete them, or hallucinate English words in their place. ...

2024-10 · 4 min · 1887 words
[LLM-Evolve: Evaluation for LLM's Evolving Capability on Benchmarks 🔗](https://aclanthology.org/2024.emnlp-main.940.pdf)

Can LLMs Learn from Experience? Inside the LLM-Evolve Framework

Imagine you are taking a difficult math exam. On the first question, you struggle, make a guess, and get it wrong. But immediately after, you are shown the correct solution. When you encounter a similar problem five questions later, you recall that solution, apply the logic, and get it right. You are learning from experience. Now, consider how we evaluate Large Language Models (LLMs). We typically use benchmarks like MMLU or GSM8K, where models answer thousands of questions. However, in these standard evaluations, every question is treated as an isolated event. The model doesn’t “remember” solving question #1 when it tackles question #100. It doesn’t get to learn from its successes or failures during the test. ...
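
The evaluation twist described above is easy to picture in code: keep a memory of past questions with their revealed solutions, and prepend the most similar ones as demonstrations for the next question. A schematic sketch follows; the retrieval method and prompt format are illustrative, not necessarily LLM-Evolve’s exact design.

```python
from difflib import SequenceMatcher

memory: list[tuple[str, str]] = []  # (question, revealed solution)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def build_prompt(question: str, k: int = 3) -> str:
    # Retrieve the k most similar past questions as demonstrations.
    demos = sorted(memory, key=lambda qa: similarity(qa[0], question), reverse=True)[:k]
    demo_text = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in demos)
    return demo_text + f"Q: {question}\nA:"

def record(question: str, solution: str) -> None:
    # After each benchmark item, the correct solution is added to memory.
    memory.append((question, solution))

record("What is 12 * 11?", "132")
print(build_prompt("What is 13 * 11?"))
```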

7 min · 1469 words
[LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay 🔗](https://arxiv.org/abs/2310.14985)

Can AI Lie, Lead, and Deceive? Inside the Social Minds of LLM Agents Playing Avalon

In recent years, we have witnessed a paradigm shift in Artificial Intelligence. Large Language Models (LLMs) like GPT-4 and LLaMA have moved beyond simple text generation to becoming the brains of autonomous agents—digital entities capable of perceiving environments, making decisions, and taking actions. We have seen agents simulate software development companies and inhabit virtual “Sims-like” towns. However, most of these simulations have focused on positive, cooperative behaviors. But human society isn’t just about holding hands and working together. It is a complex web of negotiation, confrontation, deception, and trust. To truly understand the potential (and risks) of LLM-based societies, we need to see how they handle conflict and incomplete information. ...

2023-10 · 9 min · 1752 words
[LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History 🔗](https://arxiv.org/abs/2402.18216)

The Distracted AI: How Task-Switching Derails Large Language Models

Imagine you are deep in a conversation with a friend about the nuances of 19th-century literature. You are analyzing themes, tone, and character development. Suddenly, without warning, your friend asks you to solve a complex algebraic equation. For a moment, your brain stumbles. The cognitive context you built up for literature doesn’t translate to math; in fact, it might even get in the way. It turns out, Large Language Models (LLMs) suffer from a very similar phenomenon. ...

2024-02 · 9 min · 1889 words