![Cover image](https://deep-paper.org/en/paper/file-3216/images/cover.png)
# Why CLIP Can't Read Between the Lines: Fixing Compositional Reasoning in Vision-Language Models
## Introduction

Imagine showing a picture of a horse riding on a person (a strange image, granted) to a state-of-the-art AI model. Then, you ask the model to pick the correct caption between two options: “a person riding a horse” and “a horse riding a person.” Ideally, this should be easy. The nouns are the same, but the relationship is flipped. However, most modern Vision-Language Models (VLMs), including the famous CLIP, struggle significantly with this. They act like “Bag-of-Words” models: they see “horse,” they see “person,” and they declare a match, completely ignoring the syntax or the relationship described by the verb “riding.” ...
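To make that caption-matching setup concrete, here is a minimal sketch using the Hugging Face Transformers CLIP API. The model and processor names are the standard public checkpoints, but the image path is a hypothetical placeholder, and this is an illustrative sketch rather than the evaluation code from any particular paper. For a bag-of-words-style model, the two captions often receive nearly identical scores, since they contain the same nouns.

```python
# Minimal sketch: score two word-order-swapped captions against one image with CLIP.
# "horse_riding_person.jpg" is a hypothetical placeholder image path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("horse_riding_person.jpg")
captions = ["a person riding a horse", "a horse riding a person"]

# Encode the image and both captions in one batch.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each caption.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs):
    print(f"{p:.3f}  {caption}")
```

If the model truly understood the relation expressed by “riding,” the probability mass would concentrate on the caption that matches the image; near-50/50 scores are the signature of bag-of-words matching.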