[ADASWITCH: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning 🔗](https://arxiv.org/abs/2410.13181)

The Best of Both Worlds: How ADASWITCH Combines Tiny Local Models with Giant Cloud Brains

The current landscape of Artificial Intelligence presents a frustrating dichotomy for engineers and users alike. On one side, we have Cloud-based Large Language Models (LLMs) like GPT-4 or Claude 3 Opus. They are incredibly smart, capable of complex reasoning, and hold vast amounts of knowledge. However, they are expensive to run, depend on an internet connection (with the latency that entails), and raise data privacy concerns. On the other side, we have Local LLMs—smaller models like Llama-3-8B or Phi-3 that can run directly on your laptop or even a phone. They are fast, free to run after deployment, and private. The catch? They often struggle with complex reasoning. Ask them a multi-step logic puzzle, and they are prone to “hallucinating” or losing the thread of logic halfway through. ...

2024-10 · 9 min · 1717 words
[ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities 🔗](https://arxiv.org/abs/2410.03907)

Can AI Really Clean Your Kitchen? Benchmarking VLM Planning with ActPlan-1K

Imagine asking a robot to “assemble gift baskets” in your living room. A standard Large Language Model (LLM) might give you a perfect textual list of instructions: find the basket, put in the cookies, add the cheese. But what if the robot looks at the table and sees that the cookies are burnt? What if the water meant for the plants is shut off? This is the frontier of Embodied AI—moving beyond generating text to generating actionable plans based on what an agent actually sees. While LLMs have demonstrated incredible reasoning abilities, we are still figuring out how well Vision Language Models (VLMs) handle complex, multi-modal procedural planning. Can they integrate visual cues with textual goals? Can they handle “counterfactual” scenarios where things go wrong? ...

2024-10 · 7 min · 1428 words
[Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree 🔗](https://arxiv.org/abs/2410.12217)

Beyond Majority Voting: Predicting Individual Toxicity Ratings with Personal Context

In the world of Natural Language Processing (NLP), we often treat data labeling as a search for a single truth. If we ask five people to label a comment as “toxic” or “not toxic,” and three say it is, we typically take the majority vote and discard the dissenting opinions as noise. But is that disagreement really noise? Consider the phrase: “You’re an idiot.” To a close friend in a gaming chat, this might be playful banter. To a stranger in a political debate, it is an insult. What one group considers acceptable, another might find deeply offensive. By aggregating these distinct perspectives into a single “ground truth,” we strip the data of its inherent social variance. We lose the “who” behind the label. ...

2024-10 · 8 min · 1703 words
[Academics Can Contribute to Domain-Specialized Language Models 🔗](https://aclanthology.org/2024.emnlp-main.293.pdf)

Why Academics Should Stop Chasing Leaderboards and Start Specializing

The David vs. Goliath Problem in NLP: If you are a student or researcher in Natural Language Processing (NLP) today, you are likely feeling the pressure of “The Scale.” A few years ago, a university lab could train a state-of-the-art model on a few GPUs. Today, the leaderboard is dominated by commercial giants—OpenAI, Google, Anthropic, and Meta. These organizations train massive general-purpose Large Language Models (LLMs) using computational resources and datasets that are simply out of reach for academic institutions. ...

8 min · 1686 words
[ATM: Adversarial Tuning Multi-agent System Makes a Robust Retrieval-Augmented Generator 🔗](https://arxiv.org/abs/2405.18111)

How to Tame the Noise: Making RAG Systems Robust with Adversarial Multi-Agent Tuning

In the current landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4 and Llama are incredibly powerful. However, they suffer from a well-known flaw: hallucinations. When an LLM doesn’t know an answer, it often makes one up. To solve this, the industry adopted Retrieval-Augmented Generation (RAG). The premise of RAG is simple: before the model answers a question, it searches a database (like Wikipedia or a company archive) for relevant documents and uses that information to generate an accurate response. It’s like letting a student take an open-book exam. ...

2024-05 · 9 min · 1716 words
[ATAP: Automatic Template-Augmented Commonsense Knowledge Graph Completion via Pre-Trained Language Models 🔗](https://aclanthology.org/2024.emnlp-main.919.pdf)

Bridging the Gap: How ATAP Automates Commonsense Reasoning with Continuous Prompts

Imagine teaching a computer that “umbrellas are used for rain.” To a human, this is obvious—it’s commonsense. To a computer, however, this relationship must be explicitly taught or inferred. We often store this kind of information in Commonsense Knowledge Graphs (CKGs), which structure data into triples like (Umbrella, UsedFor, Rain). While these graphs are powerful, they are inherently incomplete. There are millions of potential commonsense facts, and manually cataloging them all is impossible. This leads to the challenge of Commonsense Knowledge Graph Completion (CKGC): given a head entity and a relation, such as (Oregon, AtLocation, ?), can the model predict the missing tail entity (USA)? ...

8 min · 1618 words
[ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles 🔗](https://arxiv.org/abs/2411.05783)

Bridging the Gap: AI, ASL, and the Challenge of STEM Education

Imagine trying to learn advanced quantum physics or organic chemistry, but every time a technical term like “electromagnetism” or “photosynthesis” comes up, your teacher stops speaking and slowly spells out the word, letter by letter. This is the reality for many Deaf and Hard-of-Hearing (DHH) students. While American Sign Language (ASL) is a rich and expressive language, it faces a significant bottleneck in STEM education: a lack of standardized signs for technical concepts. ...

2024-11 · 8 min · 1534 words
[ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings 🔗](https://arxiv.org/abs/2402.16006)

Breaking the Mold: How Translating Embeddings Creates Fluent LLM Jailbreaks

The rapid rise of Large Language Models (LLMs) like ChatGPT, Llama, and Vicuna has revolutionized automated text generation. However, with great power comes great vulnerability. These models are trained with safety guardrails to refuse harmful instructions—a process known as alignment. For security researchers, the goal is to test these guardrails through “jailbreak” attacks, probing the model to see if it can be tricked into generating dangerous content. For a long time, jailbreaking was a manual art form. Users would craft complex role-playing scenarios (like the infamous “Do Anything Now” prompts) to bypass safety filters. Recently, automated methods like GCG (Greedy Coordinate Gradient) have attempted to use optimization algorithms to find these jailbreaks automatically. While effective, these methods have two major flaws: they are computationally expensive to run, and they produce “gibberish” suffixes—random strings of characters that are easily caught by simple defenses such as perplexity filters. ...

2024-02 · 7 min · 1471 words
[ARM: An Alignment-and-Replacement Module for Chinese Spelling Check Based on LLMs 🔗](https://aclanthology.org/2024.emnlp-main.567.pdf)

Taming the LLM: How ARM Integrates Large Language Models into Chinese Spelling Checks

Imagine you are trying to text a friend in Chinese. You want to say, “I slept so deeply,” but you accidentally type a character that sounds similar but means something completely different. In Chinese, where thousands of characters share similar pronunciations (homophones) or visual structures, this is a constant struggle. This is the domain of Chinese Spelling Check (CSC). For years, the standard approach to solving this problem involved “traditional” deep learning models like BERT. These models are excellent at following rules—specifically, they know that if you input a 10-character sentence, you expect a 10-character correction back. They treat the task as a sequence labeling problem: look at character \(A\), decide if it’s wrong, and if so, replace it with character \(B\). ...

5 min · 2117 words
[APPLS: Evaluating Evaluation Metrics for Plain Language Summarization 🔗](https://arxiv.org/abs/2305.14341)

Can AI Judge Simplicity? Inside the APPLS Testbed for Plain Language Summarization

Science has a communication problem. While researchers are making breakthroughs at an unprecedented pace, the resulting papers are often dense, jargon-filled, and inaccessible to the general public. This gap has given rise to the task of Plain Language Summarization (PLS)—rewriting complex scientific abstracts into clear, accessible language that a non-expert can understand. With the rise of Large Language Models (LLMs), automating this process seems within reach. But there is a catch: How do we know if a summary is actually “plain”? ...

2023-05 · 7 min · 1314 words
[AMR-Evol: Adaptive Modular Response Evolution Elicits Better Knowledge Distillation for Large Language Models in Code Generation 🔗](https://arxiv.org/abs/2410.00558)

Building Better Code Models: How AMR-Evol Fixes Knowledge Distillation

In the current landscape of Artificial Intelligence, proprietary Large Language Models (LLMs) like GPT-4, Gemini, and Claude dominate the leaderboard, particularly in code generation. Their ability to write complex Python scripts or debug software is impressive. However, their closed-source nature raises concerns regarding data privacy, cost, and accessibility. This has fueled a race to develop open-source alternatives (like CodeLlama, DeepSeek-Coder, or StarCoder) that can match the performance of these proprietary giants. The primary method for doing this is Knowledge Distillation. In this process, a powerful “teacher” model (like GPT-4) generates synthetic training data—specifically, instruction-response pairs—which are then used to train a smaller “student” model. ...

2024-10 · 8 min · 1512 words
[AMPO: Automatic Multi-Branched Prompt Optimization 🔗](https://arxiv.org/abs/2410.08696)

Beyond Linear Thinking: How AMPO Revolutionizes Prompt Engineering with Multi-Branched Logic

The Problem with “One Size Fits All”: In the rapidly evolving world of Large Language Models (LLMs), prompt engineering has become an art form. We spend hours crafting the perfect instructions, tweaking adjectives, and adding “Let’s think step by step” to squeeze better performance out of models like GPT-4. However, there is a fundamental limitation in how we currently approach this: Linearity. Most automatic prompt optimization methods—and even human engineers—tend to create a “single flow” of instructions. We try to write one coherent paragraph that covers every possible scenario. But real-world tasks are messy. A medical diagnosis problem requires a different logical path than a general knowledge question. When we force an LLM to follow a single rigid path for diverse inputs, performance suffers. ...

2024-10 · 8 min · 1565 words
[ALVIN: Active Learning Via INterpolation 🔗](https://arxiv.org/abs/2410.08972)

Breaking the Shortcut: How ALVIN Revolutionizes Active Learning with Interpolation

In the era of Large Language Models (LLMs), we often marvel at zero-shot capabilities. Yet, for critical applications, supervised fine-tuning remains the gold standard. The challenge, as always, is the data. Collecting high-quality, labeled data is expensive, slow, and labor-intensive. This “annotation bottleneck” is the primary driver behind Active Learning (AL). The goal of Active Learning is simple: maximize model performance while minimizing the amount of data we need to label. Instead of labeling a random 10% of a massive dataset, an AL algorithm scans the unlabeled pool and tells us, “Label these specific examples; they are the most important.” Typically, these algorithms look for instances where the model is uncertain or confused. ...

2024-10 · 8 min · 1695 words
[AKEW: Assessing Knowledge Editing in the Wild 🔗](https://arxiv.org/abs/2402.18909)

Knowledge Editing in the Wild: Why LLM Surgery is Harder Than We Thought

Large Language Models (LLMs) are frozen in time. When a model like GPT-4 or Llama 2 finishes training, its knowledge of the world is locked to that specific moment. But the world doesn’t stop. Presidents change, companies merge, and scientific discoveries overturn old theories. So, how do we keep these models up to date? The obvious answer is to retrain them, but that is prohibitively expensive and slow. This has given rise to a fascinating sub-field called Knowledge Editing. The goal is simple: surgically alter the model’s parameters (or its behavior) to inject a specific new fact without breaking everything else it knows. ...

2024-02 · 8 min · 1704 words
[AGRAME: Any-Granularity Ranking with Multi-Vector Embeddings 🔗](https://arxiv.org/abs/2405.15028)

Zooming In: How AGRAME Enables Multi-Granularity Search with Single-Level Encoding

Search engines have evolved dramatically, but they often suffer from a “resolution” problem. Imagine you are looking for a specific needle in a haystack. Most modern retrieval systems are great at handing you the haystack (the document or passage) but struggle to pinpoint the needle (the specific sentence or fact) without expensive re-indexing. In the world of Information Retrieval (IR), this is known as the granularity problem. Do you index your data by document? By paragraph? By sentence? Usually, you have to choose one level of granularity and stick with it. If you choose passages, finding specific sentences becomes hard. If you choose sentences, you lose the broader context of the passage. ...

2024-05 · 8 min · 1518 words
[ACE: A LLM-based Negotiation Coaching System 🔗](https://arxiv.org/abs/2410.01555)

Mastering the Deal: How AI is Learning to Teach Negotiation Skills

Negotiation is one of the most stressful yet essential skills in modern life. Whether you are buying a car, discussing a salary, or negotiating your rent, the ability to advocate for yourself directly impacts your financial well-being. Unfortunately, effective negotiation is rarely taught in schools. It is a “reflexive behavioral habit” usually refined only through expensive MBA seminars involving role-playing and expert coaching. This exclusivity creates a gap: populations that could benefit most from these skills—such as women and minorities, who statistically tend to be less accustomed to self-advocacy—lack access to high-quality training. ...

2024-10 · 7 min · 1466 words
[ABSEval: An Agent-based Framework for Script Evaluation 🔗](https://aclanthology.org/2024.emnlp-main.691.pdf)

Can LLMs Plan? Introducing ABSEval, the Multi-Agent Framework for Evaluating Script Generation

Large Language Models (LLMs) have conquered conversation. They can write poetry, debug code, and summarize history. But ask an LLM to plan a sequence of actions—like “How do I open a can with a spoon without making a mess?”—and you enter a different territory entirely. This is the domain of Script Planning. While LLMs often seem confident, their step-by-step plans can be riddled with subtle logic errors, missing steps, or physical impossibilities. Even harder than generating these scripts is evaluating them. How do we automatically measure if a plan makes sense without a human reading every single one? ...

7 min · 1481 words
[ABLE: Personalized Disability Support with Politeness and Empathy Integration 🔗](https://aclanthology.org/2024.emnlp-main.1252.pdf)

Beyond Generic Chatbots: How ABLE Uses RL to Bring Empathy and Personalization to Disability Support

Imagine navigating a world not designed for you. For the more than one billion people worldwide living with some form of physical disability, this is a daily reality. Whether it’s finding wheelchair-accessible housing, managing chronic pain, or dealing with the social isolation that often accompanies physical limitations, the need for reliable support is massive. In the age of AI, conversational agents (chatbots) offer a promising solution. They are available 24/7 and can provide immediate information. However, there is a glaring problem with most current systems: they are robotic. If a user expresses frustration about losing mobility, a standard chatbot might output a sterile list of medical definitions. It lacks the “human” touch—the ability to understand the user’s specific personality, age, and gender, and to respond with genuine empathy and politeness. ...

8 min · 1557 words
[A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks 🔗](https://arxiv.org/abs/2501.17569)

Beyond the Leaderboard: Why Large Language Models Still Fail at Reading Comprehension

In the fast-paced world of Natural Language Processing (NLP), we are often dazzled by the sheer scale of modern models. From GPT-4 to LLaMA, the headlines focus on parameter counts—billions, even trillions—and dominance on standardized leaderboards. But there is a quiet, persistent problem in the field: the “black box” nature of evaluation. We know that models fail. We see them hallucinate, miss obvious details, or misinterpret simple questions. However, looking at a global accuracy score on a benchmark like SQuAD or SuperGLUE doesn’t tell us why they fail. Is it the syntax? Is it the vocabulary? Is it the ambiguity of meaning? ...

2025-01 · 8 min · 1632 words
[A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models 🔗](https://arxiv.org/abs/2404.13940)

Beyond Standard Tests: Evaluating LLMs Based on What Users Actually Want

Imagine a student who aces every written exam in history, mathematics, and computer science but struggles to hold a conversation, offer advice to a friend, or brainstorm a creative gift idea. In the world of Artificial Intelligence, this is a common paradox. We have Large Language Models (LLMs) that score near-perfect marks on standardized tests like the Bar Exam or math Olympiad questions, yet they sometimes fail to satisfy simple, messy, real-world user requests. ...

2024-04 · 8 min · 1518 words