[Effective Demonstration Annotation for In-Context Learning via Language Model-Based Determinantal Point Process 🔗](https://arxiv.org/abs/2408.02103)

Smart Annotation: Optimizing In-Context Learning with Limited Data using LM-DPP

In the era of Large Language Models (LLMs), the paradigm of how we teach machines has shifted dramatically. We no longer always fine-tune models by updating millions of parameters; instead, we often rely on In-Context Learning (ICL). This involves feeding the model a few input-output examples (demonstrations) in the prompt, allowing it to “learn” the pattern on the fly. However, there is a catch. For ICL to work well, the demonstrations you choose matter, a lot. Typically, finding the best ones means retrieving them from a massive pool of already-labeled data. But what if you don’t have a massive labeled dataset? What if you have a huge pile of raw text and only enough budget to manually label 50 or 100 examples? ...
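To make the setting concrete, here is a minimal sketch (not from the paper) of what an ICL prompt looks like once a handful of demonstrations have been annotated; LM-DPP's job is deciding which raw instances deserve those scarce labels. The task, example texts, and helper function below are illustrative assumptions.

```python
# Minimal sketch of in-context learning: a few labeled demonstrations
# are prepended to the test input, and the frozen model infers the
# pattern from the prompt alone. Task and examples are hypothetical.

demonstrations = [
    {"text": "The plot was gripping from start to finish.", "label": "positive"},
    {"text": "I walked out halfway through.", "label": "negative"},
]

def build_icl_prompt(demos, query):
    """Concatenate input-output demonstrations, then the unlabeled query."""
    blocks = [f"Review: {d['text']}\nSentiment: {d['label']}" for d in demos]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

print(build_icl_prompt(demonstrations, "A forgettable, by-the-numbers sequel."))
```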

2024-08 · 8 min · 1641 words
[EXPLORA: Efficient Exemplar Subset Selection for Complex Reasoning 🔗](https://arxiv.org/abs/2411.03877)

How to Pick the Perfect Prompts: Inside EXPLORA's Efficient Exemplar Selection

Introduction Imagine you are trying to teach a brilliant but literal-minded student how to solve complex physics problems. You don’t have time to teach them the entire textbook. Instead, you can only show them five specific examples of how to solve a problem before giving them the final exam. Which five examples do you choose? Do you pick five simple ones? Five incredibly hard ones? Or a mix of different reasoning styles? If you pick the wrong ones, the student fails. If you pick the right ones, the student excels. ...

2024-11 · 9 min · 1740 words
[EVEDIT: Event-based Knowledge Editing for Deterministic Knowledge Propagation 🔗](https://aclanthology.org/2024.emnlp-main.282.pdf)

Why Your LLM is Confused: The Case for Event-Based Knowledge Editing

Imagine you are updating a Wikipedia page. You need to change a key fact: “Lionel Messi is now a Dutch citizen.” If you simply update that single data point in a database, it’s fine. But Large Language Models (LLMs) aren’t databases; they are reasoning engines built on a web of correlations. If you force an LLM to believe Messi is Dutch without context, a ripple effect of confusion occurs. When you ask, “Where was Messi born?”, the model might now hallucinate “Amsterdam” because, in its statistical view of the world, Dutch citizens are usually born in the Netherlands. ...

8 min · 1630 words
[ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers 🔗](https://arxiv.org/abs/2404.19441)

Transformers Take Over Audio: Understanding the Efficient Speech Codec (ESC)

In the age of real-time communication—think Zoom calls, Discord chats, and streaming services—the way we compress audio data is critical. We demand high fidelity, low latency, and minimal data usage. For years, the industry has relied on traditional Digital Signal Processing (DSP) codecs like Opus or MP3. However, recently, Neural Audio Codecs have taken center stage, using deep learning to compress audio far more efficiently than hand-engineered rules ever could. ...

2024-04 · 8 min · 1551 words
[ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models 🔗](https://arxiv.org/abs/2406.14952)

Can AI Really Be Your Therapist? Introducing ESC-Eval, A Framework for Testing Digital Empathy

Introduction In the last few years, we have witnessed a seismic shift in how humans interact with machines. We aren’t just asking Siri for the weather anymore; we are venting to ChatGPT about our stressful days, asking Claude for relationship advice, and seeking comfort from Llama when we feel isolated. This specific domain is known as Emotional Support Conversation (ESC). The promise of ESC is immense. In a world where mental health resources are often scarce or expensive, an always-available AI companion that can reduce stress and offer guidance sounds like a utopian dream. But there is a massive hurdle standing between us and that reality: Evaluation. ...

2024-06 · 10 min · 1975 words
[ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments 🔗](https://arxiv.org/abs/2410.06420)

Can AI Doctors See? Benchmarking Vision Language Models in the ER

Introduction Imagine a busy emergency room. Doctors and nurses are rushing between patients, machines are beeping, and decisions need to be made in split seconds. Now, imagine an AI assistant in the corner, observing the scene through a camera, ready to alert the staff if a patient pulls out an IV line or if a ventilator setting looks wrong. This sounds like the future of healthcare, doesn’t it? With a global shortage of over 6 million physicians, the promise of autonomous medical agents is alluring. But before we hand over the reins to Artificial Intelligence, we have to ask a critical question: Do these models actually understand what they are looking at? ...

2024-10 · 8 min · 1536 words
[EPO: Hierarchical LLM Agents with Environment Preference Optimization 🔗](https://arxiv.org/abs/2408.16090)

Training Robot Agents Without Human Labels: A Deep Dive into Environment Preference Optimization (EPO)

Imagine asking a robot to “heat up a cup of coffee.” To you, this is a simple request. To a robot (or an embodied AI agent), this is a logistical nightmare. It involves navigation, object detection, grasping, opening a microwave, and understanding the concept of “heating.” Large Language Models (LLMs) like GPT-4 or Llama have shown incredible reasoning capabilities, but applying them to long-horizon physical tasks remains a massive hurdle. The standard approach requires feeding the model thousands of human-annotated examples of exactly how to perform a task. But human annotation is slow, expensive, and unscalable. ...

2024-08 · 8 min · 1639 words
[EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records 🔗](https://arxiv.org/abs/2401.07128)

Bridging the Gap Between Doctors and Data: How EHRAgent Turns Medical Questions into Code

Imagine a busy clinician in an Intensive Care Unit. They need to know something specific, and they need to know it now: “How many patients were prescribed aspirin within two months of receiving a venous catheter procedure?” In the world of modern medicine, the answer exists. It is sitting in the Electronic Health Record (EHR) system. However, getting that answer out is surprisingly difficult. The doctor cannot simply ask the database. Instead, they usually have to ask a data engineer, who then translates the request into a complex SQL query, executes it, checks the data, and sends it back. This loop is slow and inefficient, creating a bottleneck that separates medical professionals from their own data. ...
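As a concrete illustration of that translation step, here is a hedged sketch of the kind of query the clinician's question compiles down to. The table and column names are hypothetical stand-ins for MIMIC-style EHR tables, not EHRAgent's actual schema or generated code.

```python
# Toy end-to-end run of the aspirin/catheter question against a
# hypothetical two-table EHR schema (not EHRAgent's real interface).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE procedures    (patient_id INT, name TEXT, date TEXT);
CREATE TABLE prescriptions (patient_id INT, drug TEXT, date TEXT);
INSERT INTO procedures VALUES    (1, 'venous catheterization', '2024-01-10');
INSERT INTO prescriptions VALUES (1, 'aspirin', '2024-02-01');
INSERT INTO prescriptions VALUES (2, 'aspirin', '2024-02-01');
""")

# "How many patients were prescribed aspirin within two months
#  of receiving a venous catheter procedure?"  (two months ~ 60 days)
query = """
SELECT COUNT(DISTINCT rx.patient_id)
FROM prescriptions rx
JOIN procedures pr ON pr.patient_id = rx.patient_id
WHERE rx.drug = 'aspirin'
  AND pr.name LIKE '%venous cathet%'
  AND julianday(rx.date) - julianday(pr.date) BETWEEN 0 AND 60;
"""
print(conn.execute(query).fetchone()[0])  # -> 1 (patient 2 had no procedure)
```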

2024-01 · 9 min · 1745 words
[EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning 🔗](https://arxiv.org/abs/2410.13179)

Leveling Up: How Easy-to-Hard Masking Teaches AI to Understand Speech

Imagine you are preparing for a difficult exam. If you simply flip through your textbook and read random pages—some of which are blank, some containing trivial information you already know, and only a few containing complex concepts—your study session won’t be very efficient. A better strategy would be to identify the topics you find most difficult and focus your energy there. Furthermore, you wouldn’t start with the hardest problems on day one; you would start with the basics and progressively tackle harder questions as you get smarter. ...

2024-10 · 9 min · 1733 words
[EFUF: Efficient Fine-Grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models 🔗](https://arxiv.org/abs/2402.09801)

How to Make AI "Unlearn" Hallucinations: A Deep Dive into EFUF

Introduction In the rapidly evolving landscape of Artificial Intelligence, Multimodal Large Language Models (MLLMs) like LLaVA and MiniGPT-4 represent a massive leap forward. These models don’t just read text; they “see” images and can converse about them. However, for all their impressive capabilities, MLLMs suffer from a persistent and frustrating glitch: hallucination. Hallucination occurs when the model confidently describes objects that simply aren’t there. Imagine showing an AI a picture of a living room, and it describes a “vintage red telephone” on the table when the table is empty. ...

2024-02 · 7 min · 1351 words
[ECON: On the Detection and Resolution of Evidence Conflicts 🔗](https://arxiv.org/abs/2410.04068)

When Facts Collide: How LLMs Detect and Resolve Conflicting Evidence

In the age of Retrieval-Augmented Generation (RAG), we often view Large Language Models (LLMs) as sophisticated search engines that summarize the truth. We ask a question, the system retrieves documents from the web, and the LLM synthesizes an answer. But what happens when the internet disagrees with itself? Imagine asking, “Is coffee good for you?” One retrieved article cites a study claiming it reduces heart disease risk; another claims it causes hypertension. These are inter-evidence conflicts. They aren’t hallucinations by the model; they are contradictions inherent in the data sources. ...

2024-10 · 8 min · 1693 words
[ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos 🔗](https://arxiv.org/abs/2410.09776)

Beyond "What is this?": Teaching AI to Ask Deep Questions About Videos

If you have ever searched for a tutorial on “how to fix a leaking faucet” or “history of the Roman Senate,” you have likely encountered the “People Also Ask” section on search engines. Increasingly, these suggestions point not just to text articles, but to specific chapters within videos. This feature is incredibly useful, but it presents a massive challenge for Artificial Intelligence: How can a machine watch a video and automatically generate meaningful, deep questions about the specific entities (people, places, concepts) discussed within it? ...

2024-10 · 7 min · 1452 words
[ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? 🔗](https://arxiv.org/abs/2407.14044)

The Optimization Paradox: Why LLMs Struggle to Write Code That is Both Fast and Correct

Introduction In the rapidly evolving world of software development, Large Language Models (LLMs) like GPT-4, CodeLlama, and DeepSeek have become indispensable assistants. They can generate boilerplate code, debug errors, and even translate between programming languages. We have reached a point where generating functionally correct code—code that produces the right output for a given input—is a baseline expectation for these models. But any experienced developer knows that correctness is only half the battle. In real-world applications, especially in high-frequency trading, embedded systems, or large-scale data processing, efficiency is king. A sorting algorithm that works but takes three hours to run is often as useless as one that doesn’t work at all. ...

2024-07 · 8 min · 1700 words
[EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees 🔗](https://arxiv.org/abs/2406.16858)

Breaking the Speed Limit: How EAGLE-2 Accelerates LLMs with Dynamic Draft Trees

If you have ever stared at a blinking cursor while ChatGPT or LLaMA writes a response word by word, you have experienced the inherent bottleneck of Large Language Models (LLMs). This sluggishness stems from autoregressive generation: the model must generate token A before it can generate token B, and token B before C. It is a strictly serial process that leaves modern GPUs—which thrive on parallel computation—largely underutilized. To solve this, researchers developed Speculative Sampling, a technique that allows models to “draft” several future tokens cheaply and verify them in parallel. However, most current speculative methods rely on rigid, static structures. They assume that predicting the next word is always equally difficult, regardless of the context. ...
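For intuition, here is a toy sketch of the draft-then-verify loop that speculative sampling performs. The two "models" are stand-in functions over integer tokens, and EAGLE-2's actual contribution, the context-aware dynamic draft tree, is deliberately left out of this linear version.

```python
# Minimal sketch of greedy speculative decoding: a cheap draft model
# proposes k tokens, the target model checks the whole stretch in one
# (conceptually parallel) pass, and we keep the longest agreeing prefix.

def draft_model(context):      # cheap guesser: next "token" = last + 1
    return context[-1] + 1

def target_model(context):     # expensive model: same rule, except on multiples of 4
    nxt = context[-1] + 1
    return nxt if nxt % 4 else nxt + 1

def speculative_step(context, k=4):
    # 1) draft k tokens autoregressively with the cheap model
    draft = []
    for _ in range(k):
        draft.append(draft_model(context + draft))
    # 2) verify against the target; stop at the first disagreement
    accepted = []
    for tok in draft:
        expected = target_model(context + accepted)
        if tok == expected:
            accepted.append(tok)       # draft agreed: keep it for free
        else:
            accepted.append(expected)  # take the target's token and stop
            break
    return context + accepted          # several tokens per target pass

print(speculative_step([1]))  # -> [1, 2, 3, 5]: two drafts accepted, one corrected
```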

2024-06 · 10 min · 1931 words
[DynamicER: Resolving Emerging Mentions to Dynamic Entities for RAG 🔗](https://arxiv.org/abs/2410.11494)

When "The Angels' Superstar" Becomes "The Dodgers' Number 17": Solving Dynamic Entity Resolution in LLMs

Language is a living, breathing thing. It changes constantly, often faster than our digital systems can keep up. Consider the baseball superstar Shohei Ohtani. A few years ago, calling him “The Angels’ Ace” was accurate. Today, referencing him requires new language like “The Dodgers’ number 17.” For humans, this mental update is automatic. For Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems, it is a significant point of failure. If a user asks about “The Dodgers’ number 17,” but the knowledge base only recognizes Ohtani as an Angels player, the retrieval system fails to find the relevant documents. The result? The LLM hallucinates or provides outdated information. ...

2024-10 · 8 min · 1622 words
[Dynamically rewarding with prompt optimization enables tuning-free self-alignment of language models 🔗](https://arxiv.org/abs/2411.08733)

Aligning LLMs Without Training: A Deep Dive into Dynamic Rewarding and Prompt Optimization (DRPO)

Introduction The rapid evolution of Large Language Models (LLMs) has brought us closer to artificial general intelligence, but raw intelligence isn’t enough. We need models that are aligned—helpful, harmless, and honest. Traditionally, achieving this alignment has been a resource-heavy endeavor. It typically involves Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from Human Feedback (RLHF). While effective, this pipeline is expensive, computationally demanding, and reliant on vast amounts of human-annotated preference data. ...

2024-11 · 9 min · 1871 words
[Dynamic Multi-granularity Attribution Network for Aspect-based Sentiment Analysis 🔗](https://aclanthology.org/2024.emnlp-main.611.pdf)

Beyond Attention: Why Attribution is the Future of Aspect-Based Sentiment Analysis

In the world of Natural Language Processing (NLP), context is everything. Consider the sentence: “The food is pretty good, but the service is so horrific.” If you ask a standard sentiment analysis model to classify this sentence, it might get confused. Is it positive? Negative? Neutral? The reality is that it’s both—it depends entirely on what you are asking about. If you care about the food, it’s positive. If you care about the service, it’s negative. ...

8 min · 1591 words
[Dynamic Multi-Reward Weighting for Multi-Style Controllable Generation 🔗](https://arxiv.org/abs/2402.14146)

Mastering the Mix: How to Teach LLMs to Speak in Multiple Styles at Once

Imagine you are writing a performance review for a colleague. You want the feedback to be professional (formal), but you also want to be encouraging (positive). Now, imagine you are texting a close friend about a terrible movie you just saw. You want to be casual (informal) and critical (negative). As humans, we blend these stylistic dimensions effortlessly. We switch our “voice” based on the context, mixing sentiment, formality, humor, and politeness to suit the situation. Large Language Models (LLMs), however, often struggle with this nuance. While they are great at generating generic text, getting them to adhere to multiple specific stylistic constraints simultaneously—like “be formal AND negative AND ironic”—is a complex engineering challenge. ...

2024-02 · 8 min · 1658 words
[DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models 🔗](https://arxiv.org/abs/2407.01009)

Replicating Human Intuition in LLMs: A Deep Dive into DynaThink

Introduction In his seminal book *Thinking, Fast and Slow*, Nobel laureate Daniel Kahneman describes two primary modes of human thought: “System 1,” which is fast, instinctive, and emotional; and “System 2,” which is slower, more deliberative, and more logical. When someone asks you “What is 2 + 2?”, you engage System 1. You don’t calculate it; you just know it. However, if asked “What is 17 × 24?”, you engage System 2. You pause, retrieve a mental algorithm, and process the problem step-by-step. ...
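To make the analogy concrete, here is a toy fast-or-slow router in the spirit of DynaThink: answer with a single cheap pass when repeated fast samples agree, and escalate to deliberate solving when they do not. The stand-in solvers and the simple consensus rule are my assumptions, not the paper's actual routing criteria.

```python
# Hypothetical System 1 / System 2 router. Fast samples that agree get
# returned directly; low consensus triggers the slow, exact path.
import itertools
from collections import Counter

_noise = itertools.cycle([-1, 0, 1])   # simulates sampling variance on hard inputs

def fast_answer(question):
    """Stand-in for one cheap LLM pass: solid on easy facts, noisy otherwise."""
    if question == "2 + 2":
        return 4
    return 17 * 24 + next(_noise)

def slow_answer(question):
    """Stand-in for chain-of-thought: slower, but works the problem out."""
    a, op, b = question.split()
    return eval(f"{a} {'*' if op == '×' else op} {b}")  # toy-only use of eval

def dynathink(question, k=5, threshold=0.8):
    samples = [fast_answer(question) for _ in range(k)]
    answer, votes = Counter(samples).most_common(1)[0]
    if votes / k >= threshold:               # high consensus: trust the fast path
        return answer, "fast"
    return slow_answer(question), "slow"     # low consensus: think step by step

print(dynathink("2 + 2"))    # -> (4, 'fast')
print(dynathink("17 × 24"))  # -> (408, 'slow')
```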

2024-07 · 8 min · 1651 words
[DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities 🔗](https://arxiv.org/abs/2410.07722)

Bridging the Gap: How DyVo Injects World Knowledge into Neural Search

Search engines have come a long way from simply matching keywords, but they still struggle with a fundamental problem: ambiguity. When a user searches “Is the US a member of WHO?”, a traditional system sees the word “us” (the pronoun) and “who” (the question word), potentially missing the crucial entities “United States” and “World Health Organization.” This disconnection happens because many modern retrieval models rely on tokenization—breaking words down into smaller fragments called “word pieces.” While this helps computers handle rare words, it often shatters meaningful concepts into nonsensical syllables. ...

2024-10 · 8 min · 1510 words