EMNLP 2024

[Bootstrapped Policy Learning for Task-oriented Dialogue through Goal Shaping 🔗](https://aclanthology.org/2024.emnlp-main.263.pdf)

Building the Ladder as You Climb: How Bootstrapped Policy Learning Solves Hard Dialogue Tasks

Introduction Imagine you are trying to teach a computer how to handle complex customer service calls—for example, booking a multi-leg flight while simultaneously reserving a hotel and buying tickets for a local attraction. In the world of Artificial Intelligence, specifically Task-Oriented Dialogue (ToD) systems, this is a massive challenge. The standard approach is Reinforcement Learning (RL). The AI talks to a user simulator, tries to fulfill the request, and gets a “reward” (a positive signal) only if it completes the entire task perfectly. If it fails, it gets nothing or a penalty. This is known as the sparse reward problem. It is akin to trying to learn how to play a piano concerto by hitting random keys and only being told “good job” if you accidentally play the whole piece perfectly on the first try. ...

[Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models? 🔗](https://arxiv.org/abs/2406.11375)

Can AI Teach AI? Using Analogies to Boost Scientific Understanding in Language Models

Introduction Imagine trying to explain the structure of an atom to someone who has never taken a physics class. You could recite a textbook definition about protons, neutrons, and electron shells. Or, you could say: “The atom is like a solar system. The nucleus is the sun in the center, and the electrons are planets orbiting around it.” For most learners, the second explanation—the analogy—is the one that clicks. Analogical reasoning is a cornerstone of human cognition. It allows us to map the familiar (the solar system) onto the unfamiliar (the atom), building a bridge to new understanding. ...

[Boosting Logical Fallacy Reasoning in LLMs via Logical Structure Tree 🔗](https://arxiv.org/abs/2410.12048)

Unmasking Bad Logic - How Structural Trees Help LLMs Detect Fallacies

Introduction In the age of information overload, the ability to distinguish between a sound argument and a deceptive one is more critical than ever. We often rely on Large Language Models (LLMs) to summarize news, analyze debates, or verify facts. However, while LLMs are incredibly fluent in generating text, they frequently struggle with the nuance of logical reasoning. They can be easily swayed by arguments that sound coherent but are structurally flawed. ...

[BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering 🔗](https://arxiv.org/abs/2402.11129)

Beyond Simple RAG: Mastering Complex Queries with BlendFilter

Large Language Models (LLMs) have revolutionized how we process information, acting as capable assistants for summarization, dialogue, and question answering. However, anyone who has used them extensively knows their Achilles’ heel: they don’t know everything. Their knowledge is frozen in time at the moment of training, and they can confidently “hallucinate” incorrect facts. To solve this, the AI community adopted Retrieval-Augmented Generation (RAG). The idea is simple: before the LLM answers, it searches a database (like Wikipedia), finds relevant documents, and uses that information to generate an answer. ...
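The retrieve-then-generate idea described above can be sketched in a few lines. This is only a toy illustration, not BlendFilter's method: the word-overlap scorer, the function names `retrieve` and `build_prompt`, and the three sample documents are all assumptions made for the example, and the final call to an LLM is left out.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (toy stand-in for a real retriever)."""
    query_tokens = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_tokens & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Place the retrieved documents in front of the question, as a RAG prompt would."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical mini knowledge base, for illustration only.
docs = [
    "Paris is the capital of France.",
    "The Nile flows through Africa.",
    "France is a country in Europe.",
]
prompt = build_prompt("What is the capital of France?", docs)
print(prompt)
```

In a real system the overlap scorer would typically be replaced by a sparse or dense retriever, and `prompt` would be passed to the language model.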

[Birdie: Advancing State Space Language Modeling with Dynamic Mixtures of Training Objectives 🔗](https://aclanthology.org/2024.emnlp-main.541.pdf)

Teaching State Space Models to Remember: How 'Birdie' Closes the Retrieval Gap with Transformers

Introduction In the current landscape of Natural Language Processing (NLP), the Transformer architecture reigns supreme. From ChatGPT to Llama, the mechanism of self-attention has unlocked incredible capabilities in generation and reasoning. However, this power comes with a significant computational cost. Attention scales quadratically with sequence length, and the Key-Value (KV) cache grows linearly, making the processing of massive contexts increasingly expensive for both training and deployment. This scaling bottleneck has reignited interest in efficient alternatives, specifically State Space Models (SSMs). Models like Mamba, S4, and Hawk promise the “holy grail” of sequence modeling: linear scaling and a fixed-size state that allows for constant-cost inference. In theory, they are the perfect solution for long-context applications. ...
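To make the scaling argument above concrete, here is a small back-of-the-envelope sketch. It assumes standard multi-head attention, where each layer stores one key and one value vector per token; the layer count and model width are hypothetical values chosen for illustration, not figures from the paper.

```python
def attention_score_entries(seq_len):
    """Number of pairwise attention scores per head: one per (query, key) pair."""
    return seq_len * seq_len

def kv_cache_entries(seq_len, num_layers, d_model):
    """Cached values: a key and a value vector of width d_model per token per layer."""
    return 2 * num_layers * seq_len * d_model

# Hypothetical model size, for illustration only.
NUM_LAYERS, D_MODEL = 24, 2048

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_entries(n), kv_cache_entries(n, NUM_LAYERS, D_MODEL))
```

Doubling the sequence length quadruples the number of attention scores but only doubles the KV cache. A fixed-size recurrent state, by contrast, would not grow with the sequence length at all.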

[Bio-RFX: Refining Biomedical Extraction via Advanced Relation Classification and Structural Constraints 🔗](https://aclanthology.org/2024.emnlp-main.588.pdf)

Can AI Read Medical Journals Better Than Us? Understanding Bio-RFX

Introduction The rate at which biomedical literature is published is staggering. Every day, thousands of new papers are released, detailing novel drug interactions, genetic discoveries, and disease mechanisms. For researchers and clinicians, keeping up with this flood of information is impossible. Yet, hidden within these unstructured texts are the keys to new cures and therapies. To manage this, we rely on Information Extraction (IE)—using AI to automatically parse text and convert it into structured databases. This typically involves two steps: Named Entity Recognition (NER) (identifying distinct items like “Aspirin” or “TP53”) and Relation Extraction (RE) (determining how they interact, e.g., “Aspirin inhibits TP53”). ...

[BiasWipe: Mitigating Unintended Bias in Text Classifiers through Model Interpretability 🔗](https://aclanthology.org/2024.emnlp-main.1172.pdf)

BiasWipe: How to Surgically Remove Bias from LLMs Without Retraining

Introduction In the age of social media, automated content moderation is not just a luxury; it is a necessity. Platforms rely on sophisticated AI models to filter out toxic speech, harassment, and hate speech to keep online communities safe. However, these guardians of digital safety have a hidden flaw: they often become prejudiced themselves. Imagine a scenario where a user types, “I am a proud gay man.” A toxic content classifier might flag this as “Toxic” or “Hate Speech.” Why? Not because the sentiment is hateful, but because the model has learned a spurious correlation between the word “gay” and toxicity during its training. This phenomenon is known as unintended bias or false positive bias. ...

[BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs 🔗](https://arxiv.org/abs/2407.10241)

Can AI Police Itself? Inside BiasAlert, a New Framework for Detecting Social Bias in LLMs

Large Language Models (LLMs) like GPT-4 and Llama-2 have revolutionized how we interact with technology. They draft our emails, debug our code, and answer our most complex questions. However, these models are mirrors of the data they were trained on—data that reflects the internet, which unfortunately includes historical prejudices, stereotypes, and social biases. For researchers and developers, ensuring these models are fair is a massive priority. But there is a technical bottleneck: how do we actually measure fairness? Traditional methods rely on rigid templates or statistical probabilities that don’t reflect how we use AI today. We don’t use ChatGPT to fill in the blank of a sentence; we engage in open-ended conversation. ...

[Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models 🔗](https://arxiv.org/abs/2406.15718)

Breaking the Silence: How Duplex Models Are Ending Turn-Based AI Chat

Have you ever tried to interrupt a voice assistant? It usually goes something like this: you ask a question, realize you made a mistake mid-sentence, but the AI ignores your correction and continues to process your first request. You have to wait for it to finish a long monologue, or frantically hit a “stop” button, before you can try again. This awkward dance happens because almost all current Large Language Models (LLMs) operate on a turn-based mechanism. You speak, the model waits for you to finish, it processes, and then it speaks. It is the digital equivalent of using a walkie-talkie (“Over and out”). ...

[Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents 🔗](https://arxiv.org/abs/2409.15594)

Breaking the Silence: How SyncLLM Teaches AI to Interrupt, Listen, and Speak All at Once

Introduction: The “Walkie-Talkie” Problem If you have ever conversed with a voice assistant like Alexa, Siri, or current iterations of ChatGPT Voice, you have experienced a “half-duplex” interaction. Much like using a walkie-talkie, the protocol is rigid: you speak, you stop, the machine detects silence, processes your request, and finally responds. This turn-based exchange is functional, but it is distinctly non-human. Real human conversation is “full-duplex.” It is a chaotic, synchronized dance. We interrupt each other to clarify points. We offer verbal “backchannels” (like “uh-huh,” “right,” or “yeah”) while the other person is still talking to signal engagement. We anticipate what the other person is about to say before they finish their sentence. ...

[Beyond Reference: Evaluating High Quality Translations Better than Human References 🔗](https://aclanthology.org/2024.emnlp-main.294.pdf)

When the Machine Beats the Master: Fixing Reference Bias in Translation Metrics with RESUME

In the world of Machine Translation (MT), we have reached a fascinating tipping point. For decades, the goal of a translation system was to match human performance. Today, with the advent of Large Language Models (LLMs) like GPT-4, machine-generated translations often exceed the quality of human-written references. This creates a paradox in evaluation. Traditional metrics work by comparing the machine’s output (the “candidate”) to a human translation (the “reference”). If the reference is treated as the “Gold Standard,” how can a metric possibly reward a machine for writing something better? ...

[Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning 🔗](https://arxiv.org/abs/2411.00173)

Decoding the Black Box: How Dictionary Learning Makes Medical AI Transparent

In the high-stakes world of healthcare, Artificial Intelligence is rapidly becoming an indispensable tool. One of the most critical back-office tasks in medicine is medical coding—the translation of unstructured clinical text (like doctor’s notes) into standardized International Classification of Diseases (ICD) codes. These codes are vital for billing, epidemiology, and treatment tracking. While Large Language Models (LLMs) have shown incredible prowess in automating this task, they suffer from a significant “black box” problem. When an AI assigns a code for “postoperative wound infection,” a doctor needs to know why. If the model cannot explain its reasoning, it cannot be trusted in a clinical setting. ...

[Beyond Embeddings: The Promise of Visual Table in Visual Reasoning 🔗](https://arxiv.org/abs/2403.18252)

Visual Tables: Teaching AI to 'Read' Images Like a Database

Introduction Imagine showing an AI a picture of a U.S. five-dollar bill. A standard computer vision model looks at the pixels and recognizes patterns: it sees paper, a face, and numbers. It can tell you “this is a banknote.” But what if you ask, “Who is the person in the portrait, and what specific historical event did he lead the country through?” To answer this, the model needs more than just visual pattern matching. It needs world knowledge. It needs to know that the face belongs to Abraham Lincoln, that Lincoln was the 16th U.S. President, and that he led the U.S. through the Civil War. Standard visual embeddings—the vector representations models use to “see”—often fail to capture this depth of instance-level knowledge. ...

[Beyond Correlation: Interpretable Evaluation of Machine Translation Metric 🔗](https://arxiv.org/abs/2410.05183)

Decoding the Score: A New Framework for Interpretable Machine Translation Metrics

Introduction In the world of Machine Translation (MT), we have witnessed a massive shift from heuristic-based evaluation metrics (like BLEU) to neural-based metrics (like COMET and MetricX). These newer models are significantly better at aligning with human judgments. However, they come with a “black box” problem. When a neural metric hands you a score—say, 0.86—what does that actually mean? Is it a perfect translation? Is it just “okay”? If another metric gives the same sentence a -1.49, how do you compare them? ...

[Benchmarking Vision Language Models for Cultural Understanding 🔗](https://arxiv.org/abs/2407.10920)

Can AI Understand Culture? A Deep Dive into the CulturalVQA Benchmark

Introduction In recent years, Multimodal Vision-Language Models (VLMs) like GPT-4V and Gemini have demonstrated an astonishing ability to interpret images. They can identify objects, read text within photos, and describe complex scenes. However, recognizing a “wedding” is one thing; understanding the specific rituals, clothing, and traditions associated with a wedding in rural India versus one in Ethiopia is a different challenge entirely. As digital interactions become increasingly global, AI models must move beyond general object recognition to grasp cultural values—the shared beliefs, rituals, and traditions that define human societies. ...

[Belief Revision: The Adaptability of Large Language Models Reasoning 🔗](https://arxiv.org/abs/2406.19764)

Can AI Change Its Mind? Exploring Belief Revision in Large Language Models

Imagine you are told that “Tweety is a bird.” Based on your general knowledge, you logically infer that “Tweety flies.” But a moment later, you receive a new piece of information: “Tweety is a penguin.” What happens in your brain? You immediately revise your belief. You retract the conclusion that Tweety flies, but you maintain the premise that he is a bird. You have just performed belief revision—the cognitive ability to update your understanding when new evidence contradicts or contextualizes what you previously thought was true. ...

[Be Helpful but Don't Talk too Much - Enhancing Helpfulness in Conversations through Relevance in Multi-Turn Emotional Support 🔗](https://aclanthology.org/2024.emnlp-main.118.pdf)

The Goldilocks Principle of AI Therapy: Balancing Helpfulness and Cognitive Load

Imagine you are having a terrible day. You turn to a friend to vent about your stress. In response, they give you a single-word reply: “Okay.” You feel unheard, and your friend seems indifferent. Now, imagine the opposite scenario. You share your problem, and that same friend responds with a ten-minute, breathless monologue, analyzing every micro-factor of your situation, citing historical precedents, and offering fifteen different solution paths simultaneously. You feel overwhelmed. Instead of feeling supported, you are now exhausted. ...

[Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities 🔗](https://arxiv.org/abs/2404.14716)

Flipping the Script: How Bayesian Inverse Inference Supercharges In-Context Learning

Introduction In the rapidly evolving world of Artificial Intelligence, Large Language Models (LLMs) have gained a reputation for being quick learners. Specifically, they excel at In-Context Learning (ICL). This is the ability to adapt to a new task simply by seeing a few examples in the prompt, without requiring any updates to the model’s weights. Imagine you want an AI to translate English slang into formal text. You don’t need to retrain it; you just provide a few pairs: “Gonna -> Going to” and “Wanna -> Want to”, and the model figures out the pattern for the next input. ...
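The slang example above amounts to nothing more than string assembly: the demonstrations go into the prompt and the model's weights stay untouched. A minimal sketch, assuming a simple `source -> target` line format; the helper name `build_few_shot_prompt` and the query "Gotta" are hypothetical additions for illustration.

```python
def build_few_shot_prompt(examples, query):
    """Format demonstration pairs, then leave the final target blank for the model to complete."""
    lines = [f"{source} -> {target}" for source, target in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)

# The two demonstration pairs from the text, plus a hypothetical new input.
examples = [("Gonna", "Going to"), ("Wanna", "Want to")]
prompt = build_few_shot_prompt(examples, "Gotta")
print(prompt)
```

Which demonstrations go into `examples` is the only thing that changes between prompts, which is why example selection matters for in-context learning.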

[Bayesian Calibration of Win Rate Estimation with LLM Evaluators 🔗](https://arxiv.org/abs/2411.04424)

Judging the Judges—How Bayesian Statistics Fixes LLM Evaluation

If you have played with ChatGPT, Claude, or Llama, you know that evaluating these models is tricky. Unlike a math test, there is no single “correct” answer for writing a poem, summarizing a news article, or chatting about philosophy. For a long time, the gold standard was human evaluation. You would generate two responses and ask a human, “Which one is better?” But human evaluation is slow, expensive, and not scalable. This led to the rise of LLM-as-a-judge: using a strong model (like GPT-4) to evaluate weaker models. It’s fast, cheap, and far more scalable. ...

[BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting 🔗](https://aclanthology.org/2024.emnlp-main.877.pdf)

Hook, Line, and Sinker: How 'BaitAttack' Manipulates LLMs into Breaking Their Own Rules

The rapid adoption of Large Language Models (LLMs) like GPT-4 and Llama-2 has brought with it a continuous arms race between safety alignment and adversarial attacks. We know LLMs are trained to refuse harmful instructions: if you ask a model “How do I build a bomb?”, it will politely decline. Finding a way to bypass these safety filters is the “jailbreak” problem. Most research in this area focuses on disguise. Attackers wrap harmful queries in elaborate role-playing scenarios or logical puzzles to trick the model. However, a new paper titled “BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting” highlights a critical flaw in current jailbreak methods: Intention Shift. ...