[Dual-oriented Disentangled Network with Counterfactual Intervention for Multimodal Intent Detection 🔗](https://aclanthology.org/2024.emnlp-main.972.pdf)

Unraveling Intent: How Causal Inference and Disentanglement Improve Multimodal AI

In human communication, what we say is often less important than how we say it. A phrase like “Great job” can be a genuine compliment or a sarcastic critique depending on the speaker’s tone of voice and facial expression. For Artificial Intelligence, distinguishing between these nuances is the holy grail of Multimodal Intent Detection. To build systems that truly understand us—whether it’s a customer service bot or a smart home assistant—we need models that can process text, audio, and video simultaneously. While recent advances have improved how these modalities are fused, two significant problems remain: ...

10 min · 1919 words
[Dual-Space Knowledge Distillation for Large Language Models 🔗](https://arxiv.org/abs/2406.17328)

Bridging the Gap: How Dual-Space Knowledge Distillation Unifies Teacher and Student LLMs

The current era of Artificial Intelligence is defined by the “Scaling Law.” We have seen that increasing the parameter count of Large Language Models (LLMs) consistently yields better generalization and reasoning capabilities. However, this pursuit of intelligence comes with a hefty price tag. Models like LLaMA-70B or GPT-4 are massive, making them incredibly expensive to deploy and slow to run in real-world scenarios. ...

2024-06 · 9 min · 1830 words
[Don't Just Say "I don't know"! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations 🔗](https://aclanthology.org/2024.emnlp-main.757.pdf)

Beyond 'I Don't Know': Teaching LLMs to Explain the Unknown

We have all experienced that moment when interacting with a Large Language Model (LLM): you ask a question, and the model answers with absolute, unwavering confidence. It sounds plausible, the grammar is perfect, and the logic seems sound. But then you realize—it’s completely made up. This phenomenon, often called “hallucination,” is particularly dangerous when the model is faced with Unknown Questions. These are questions that don’t actually have a definitive answer. They might be based on false premises, ask about the future, or be linguistically ambiguous. ...

10 min · 2002 words
[Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration 🔗](https://aclanthology.org/2024.emnlp-main.976.pdf)

Beyond Ranking: Why Your LLM Should Care About the Magnitude of Rewards

If you have played around with Large Language Models (LLMs) like ChatGPT or Claude, you know that “alignment” is the secret sauce. A base model trained on the internet is a chaotic completion engine; it takes Reinforcement Learning from Human Feedback (RLHF) to turn that chaos into a helpful assistant. For a long time, the standard recipe for alignment was Proximal Policy Optimization (PPO). But PPO is complex, unstable, and computationally expensive. Recently, the field has shifted toward simpler, “order-based” methods like Direct Preference Optimization (DPO). These methods look at two answers—one good, one bad—and tell the model, “Prefer A over B.” ...
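
The distinction between ranking and magnitude is easiest to see in a toy loss. Below is a minimal, simplified Python sketch of a generic order-based (DPO-style) pairwise objective; the function names, β value, and log-probabilities are illustrative assumptions, not the paper's value-based calibration method.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def order_based_loss(logp_chosen: float, logp_rejected: float, beta: float = 0.1) -> float:
    # The model only receives the binary signal "chosen beats rejected";
    # the reward values that produced that label never enter the loss.
    return -math.log(sigmoid(beta * (logp_chosen - logp_rejected)))

# Two preference pairs whose underlying rewards differed by very different
# amounts (say 9.0 vs 1.0, and 5.1 vs 5.0) are reduced to the same binary
# label, so they drive the policy with the same kind of update.
print(order_based_loss(-12.0, -15.0))
```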

9 min · 1785 words
[Domain adapted machine translation: What does catastrophic forgetting forget and why? 🔗](https://arxiv.org/abs/2412.17537)

Catastrophic Forgetting in NMT: Why Your Medical Translator Forgot How to Say "Hello"

Imagine you have a brilliant translator who speaks fluent, general-purpose German and English. You want them to specialize in medical texts, so you send them to medical school (or, in machine learning terms, you fine-tune them on a medical dataset). They come back an expert in “myocardial infarctions” and “intravenous drips.” But then, you ask them to translate a simple news article about a football game. Suddenly, they start translating “game” as “experiment match,” “player” as “subject,” and they seem to have completely forgotten common words they knew just weeks ago. ...

2024-12 · 9 min · 1853 words
[DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging 🔗](https://arxiv.org/abs/2407.01470)

DogeRM: How to Teach Reward Models New Tricks Without New Data

In the rapidly evolving world of Large Language Models (LLMs), we have witnessed titans like GPT-4 and Gemini perform incredible feats, from writing poetry to solving complex coding problems. But raw intelligence isn’t enough; these models need to be aligned with human intent. We want them to be helpful, harmless, and honest. The standard way to achieve this is through Reinforcement Learning from Human Feedback (RLHF). Central to this process is a component called the Reward Model (RM)—a digital judge that scores the AI’s responses. ...
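
The “model merging” in the title refers to combining model weights directly in parameter space. As rough intuition only, here is a minimal Python sketch of interpolating two checkpoints that share an architecture; the dictionary layout and alpha value are illustrative assumptions, not DogeRM's exact recipe.

```python
from typing import Dict
import numpy as np

def merge_checkpoints(reward_model: Dict[str, np.ndarray],
                      domain_model: Dict[str, np.ndarray],
                      alpha: float = 0.5) -> Dict[str, np.ndarray]:
    # Element-wise interpolation of two checkpoints with identical parameter
    # names and shapes: alpha = 1.0 keeps the general reward model, alpha = 0.0
    # keeps the domain-tuned model, and values in between blend the two without
    # any new preference data or further training.
    assert reward_model.keys() == domain_model.keys()
    return {name: alpha * reward_model[name] + (1.0 - alpha) * domain_model[name]
            for name in reward_model}

# Toy one-layer "checkpoints" just to show the operation.
rm = {"layer.weight": np.array([0.2, -1.0])}
dm = {"layer.weight": np.array([0.8, 1.0])}
print(merge_checkpoints(rm, dm, alpha=0.5))  # {'layer.weight': array([0.5, 0. ])}
```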

2024-07 · 8 min · 1618 words
[Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? 🔗](https://arxiv.org/abs/2406.14492)

Busting the Myth: Does Object Grounding Actually Fix LVLM Hallucinations?

Imagine asking an advanced AI to describe a picture of a living room. The AI confidently tells you about a “sleeping black cat on the sofa.” You look at the image. There is a sofa, but there is absolutely no cat. This phenomenon is known as object hallucination. It is one of the most persistent and frustrating hurdles in the development of Large Vision-Language Models (LVLMs). These models, which power tools like GPT-4V or LLaVA, have demonstrated incredible capabilities in understanding visual scenes. Yet, their tendency to “invent” objects erodes user trust and limits their deployment in critical fields like robotics or medical imaging. ...

2024-06 · 10 min · 1960 words
[Does Large Language Model Contain Task-Specific Neurons? 🔗](https://aclanthology.org/2024.emnlp-main.403.pdf)

The Brain Within the Machine: Hunting for Task-Specific Neurons in LLMs

When we think about the human brain, we often think in terms of specialization. Neuroscience has long established that specific regions of our brain are responsible for distinct functions—the frontal lobe handles reasoning and decision-making, while other areas manage language processing or motor skills. For years, researchers have wondered if Large Language Models (LLMs) operate on a similar principle. We know LLMs are incredibly versatile; a single model like Llama-2 can translate French, summarize a legal document, and analyze the sentiment of a tweet. But how does it manage this switching? Does the entire neural network fire for every request, or are there specific “circuits” dedicated to specific tasks? ...

8 min · 1586 words
[DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models 🔗](https://arxiv.org/abs/2410.03061)

Teaching Small Models to Read: How DocKD Distills LLM Knowledge for Document Understanding

In the world of Artificial Intelligence, document understanding—the ability for a machine to read, interpret, and extract data from scanned PDFs, forms, and invoices—is a massive bottleneck. While we have powerful Large Language Models (LLMs) like GPT-4 or Claude, they are computationally expensive and slow to run on millions of documents. Ideally, we want smaller, faster models (Student models) that can do the job just as well. However, training these smaller models usually requires massive datasets labeled by humans. This is slow, expensive, and rigid. If you train a model on receipts, it fails on medical forms. This is the Open-World Document Understanding problem: how do we create models that can handle document types they’ve never seen before, without needing a human to label thousands of new examples? ...

2024-10 · 9 min · 1854 words
[DocHieNet: A Large and Diverse Dataset for Document Hierarchy Parsing 🔗](https://aclanthology.org/2024.emnlp-main.65.pdf)

Taming the Document Jungle: How DocHieNet and DHFormer Unlock Hierarchical Structure in PDFs

In the modern digital landscape, we are swimming in a sea of documents. Every day, millions of PDFs, scanned images, and slides are generated. To a human, these documents have a clear structure: a title at the top, followed by sections, subsections, paragraphs, and figures. We intuitively understand that a “Section Header” is the parent of the “Paragraph” beneath it. To a computer, however, a scanned PDF is often just a bag of pixels or, at best, a disorganized stream of text and bounding boxes. ...
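
To make “hierarchical structure” concrete, here is a toy Python sketch of the kind of parent-child tree a hierarchy parser is expected to recover from a flat stream of text boxes; the node labels, fields, and example text are illustrative assumptions, not DocHieNet's actual annotation schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DocNode:
    """One detected region of a page, linked to its structural children."""
    label: str   # e.g. 'Title', 'Section Header', 'Paragraph', 'Figure'
    text: str
    children: List["DocNode"] = field(default_factory=list)

# The section header is the parent of the paragraph beneath it.
doc = DocNode("Title", "Annual Report", children=[
    DocNode("Section Header", "1. Revenue", children=[
        DocNode("Paragraph", "Revenue grew year over year..."),
    ]),
])

def show(node: DocNode, depth: int = 0) -> None:
    print("  " * depth + f"{node.label}: {node.text}")
    for child in node.children:
        show(child, depth + 1)

show(doc)
```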

8 min · 1542 words
[DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding 🔗](https://arxiv.org/abs/2410.16472)

Beyond Pixel Editing: How DocEdit-v2 Mastered Document Structure with LMMs

Have you ever tried to edit a scanned document or a PDF where the source file was lost? It is often a frustrating experience. You might want to move a paragraph, change a header level, or update a table value. In a word processor, this is trivial. But in a document image, these elements are just pixels. Recent advances in Generative AI, particularly diffusion models, have revolutionized image creation. However, they struggle significantly with document editing. If you ask a standard image generator to “change the date in the header,” it might smudge the text, hallucinate new letters, or destroy the table alignment. Documents are not just collections of pixels; they are structured information—text, layout, styling, and hierarchy. ...

2024-10 · 7 min · 1448 words
[DocCGen: Document-based Controlled Code Generation 🔗](https://arxiv.org/abs/2406.11925)

Taming the Hallucinations: Mastering Domain-Specific Code Generation with DocCGen

If you have ever used tools like GitHub Copilot or Amazon CodeWhisperer, you know the magic of watching a Large Language Model (LLM) turn a simple comment into a functioning Python function or a complex Java class. These models, trained on massive repositories of general-purpose code, have revolutionized software development. However, the magic often fades when you step away from mainstream languages like Python or C++ and into the world of Domain-Specific Languages (DSLs). ...

2024-06 · 9 min · 1811 words
[Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA 🔗](https://arxiv.org/abs/2410.06524)

Beyond the Hype: Dissecting Human vs. AI Intelligence with CAIMIRA

The narrative of Artificial Intelligence in recent years has been dominated by a single, loud proclamation: supremacy. We hear that Large Language Models (LLMs) like GPT-4 are passing bar exams, acing medical boards, and crushing SATs. The implication is that AI has not only caught up to human intelligence but has begun to lap it. But is this actually true? Or are we mistaking memorization for reasoning? While an AI might defeat a human on a standardized test, does it solve the problem in the same way a human does? To answer this, we need to look beyond simple accuracy scores. We need to understand the latent skills required to answer questions and measure how humans and AIs differ in possessing them. ...

2024-10 · 8 min · 1622 words
[Do You Know What You Are Talking About? Characterizing Query-Knowledge Relevance For Reliable Retrieval Augmented Generation 🔗](https://arxiv.org/abs/2410.08320)

RAG's Reality Check: How Statistical Testing Can Stop Hallucinations

Large Language Models (LLMs) have a reputation for being confident, articulate, and, occasionally, completely wrong. This phenomenon, known as hallucination, is a significant barrier to deploying AI in safety-critical fields like healthcare or finance. To combat this, the industry has largely adopted Retrieval Augmented Generation (RAG). The premise of RAG is simple: instead of relying solely on the LLM’s internal memory, we give the model an open-book test. We retrieve relevant documents from a trusted external database and ask the model to answer based on that information. Theoretically, this anchors the model in reality. ...
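
The “open-book test” described above is the standard retrieve-then-read loop. Below is a minimal Python sketch of that loop with a query-knowledge relevance gate added as a placeholder; `retrieve`, `generate`, and `is_relevant` are hypothetical callables standing in for an embedding search, an LLM call, and whatever relevance test one chooses, not the paper's statistical procedure.

```python
from typing import Callable, List

def rag_answer(query: str,
               retrieve: Callable[[str, int], List[str]],
               generate: Callable[[str], str],
               is_relevant: Callable[[str, List[str]], bool],
               k: int = 5) -> str:
    # Classic retrieve-then-read, with a relevance gate in the middle:
    # if the knowledge base does not actually cover the query, refuse
    # instead of letting the model improvise from weak context.
    docs = retrieve(query, k)
    if not is_relevant(query, docs):
        return "The knowledge base does not appear to cover this question."
    context = "\n\n".join(docs)
    return generate(f"Answer using only the context below.\n\n{context}\n\nQ: {query}")
```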

2024-10 · 9 min · 1872 words
[Do We Need Language-Specific Fact-Checking Models? The Case of Chinese 🔗](https://arxiv.org/abs/2401.15498)

Lost in Translation: Why We Need Native AI for Fact-Checking Chinese

In the fight against the global “infodemic,” automated fact-checking has become an essential tool. We rely on these systems to sift through massive amounts of data, identifying misinformation faster than any human could. However, there is a significant imbalance in the current landscape: the vast majority of research, datasets, and models are built for English. This raises a critical question for the AI community: Can we simply translate claims from other languages into English to use our existing robust tools? Or, alternatively, can we rely on massive multilingual Large Language Models (LLMs) like GPT-4 to handle verification across all languages? ...

2024-01 · 8 min · 1687 words
[Do Text-to-Vis Benchmarks Test Real Use of Visualisations? 🔗](https://arxiv.org/abs/2407.19726)

Are We Testing AI Visualizations Wrong? A Reality Check for Text-to-Vis Benchmarks

In the rapidly evolving world of Large Language Models (LLMs), the ability to turn a simple text prompt into a visual graph is a “killer app.” Imagine typing “Show me the sales trend over the last five years compared to marketing spend,” and having an AI instantly generate the perfect Python code to render that chart. This task is known as Text-to-Vis. To build these systems, researchers rely on benchmarks—standardized datasets used to train and evaluate performance. But here is the critical question: Do these benchmarks actually reflect how human beings create visualizations in the real world? ...
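
To picture what a Text-to-Vis system is asked to produce, here is a hand-written Python/matplotlib snippet of the sort such a system might generate for the prompt quoted above; the data values are invented purely for illustration and this is not output from any benchmarked model.

```python
import matplotlib.pyplot as plt

# Plausible generated code for: "Show me the sales trend over the last
# five years compared to marketing spend." (all numbers made up)
years = [2019, 2020, 2021, 2022, 2023]
sales = [120, 135, 150, 170, 195]       # sales, in $k
marketing = [30, 32, 40, 45, 55]        # marketing spend, in $k

plt.plot(years, sales, marker="o", label="Sales")
plt.plot(years, marketing, marker="s", label="Marketing spend")
plt.xlabel("Year")
plt.ylabel("Amount ($k)")
plt.title("Sales vs. marketing spend")
plt.legend()
plt.show()
```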

2024-07 · 7 min · 1471 words
[Do Large Language Models Know How Much They Know? 🔗](https://arxiv.org/abs/2502.19573)

The Metacognition of AI: Do LLMs Know When to Stop Talking?

Large Language Models (LLMs) are famously knowledgeable. Ask them about the capital of France, the history of the Roman Empire, or the syntax of Python, and they will likely give you a correct answer. However, a lingering question in the field of AI safety and reliability is not just about what models know, but whether they know what they know. More specifically: Does an LLM understand the scope of its own knowledge? ...

2025-02 · 7 min · 1471 words
[Do LLMs suffer from Multi-Party Hangover? A Diagnostic Approach to Addressee Recognition and Response Selection in Conversations 🔗](https://arxiv.org/abs/2409.18602)

Diagnosing the Multi-Party Hangover: Can LLMs Handle Complex Group Chats?

We have all been there: a chaotic group chat on WhatsApp, Slack, or Discord. Multiple conversations happen simultaneously, people reply to messages from three hours ago, and users jump in and out of the discussion. Navigating this web of interactions requires more than just understanding language; it requires understanding structure. You need to know who is talking to whom to make sense of the “what.” Large Language Models (LLMs) like GPT-4 or Llama-2 excel at one-on-one dialogues. But do they struggle when the room gets crowded? ...

2024-09 · 8 min · 1503 words
[Do LLMs learn a true syntactic universal? 🔗](https://aclanthology.org/2024.emnlp-main.950.pdf)

The Basque Problem: Do AI Models Actually Understand Universal Grammar?

The debate over Artificial Intelligence and language is often framed as a battle between “nature” and “nurture.” On one side, you have the nativist view, championed historically by linguists like Noam Chomsky. This view argues that human beings are born with an innate “Universal Grammar”—a set of hard-wired constraints that allow children to learn complex languages from relatively little data. On the other side, you have the empiricist view, currently dominating the field of Deep Learning. This view posits that general-purpose learning algorithms (like Transformers), given enough data, can learn anything, including the complex rules of syntax, without any pre-wired grammatical knowledge. ...

10 min · 2041 words
[Do LLMs Plan Like Human Writers? Comparing Journalist Coverage of Press Releases with LLMs 🔗](https://aclanthology.org/2024.emnlp-main.1216.pdf)

The Creativity Gap: Why LLMs Struggle to Plan Like Human Journalists

In the age of generative AI, it is easy to be impressed by a Large Language Model’s ability to write fluent text. Ask ChatGPT to write a news article, and it will churn out grammatically correct, structurally sound paragraphs in seconds. But journalism is not just about writing; it is about reporting. Before a single sentence is written, a journalist engages in a complex planning process. They decide on an “angle” (the specific narrative focus) and figure out which “sources” (people or documents) to consult. This is especially critical when covering press releases—corporate announcements often designed to “spin” a story in a positive light. A good journalist doesn’t just repeat the spin; they challenge it. ...

8 min · 1647 words