[INTERINTENT: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context 🔗](https://arxiv.org/abs/2406.12203)

Can AI Keep a Secret? Testing Social Intelligence in the Game of Avalon

Large Language Models (LLMs) have mastered the art of conversation. They can write poetry, debug code, and summarize history. But can they lie strategically? Can they deduce who among their friends is a traitor? Can they understand the subtle difference between what someone says and what they actually intend? These capabilities fall under the umbrella of Social Intelligence. While we have plenty of benchmarks for math and coding, evaluating whether an AI can navigate complex social dynamics is much harder. Most current tests are static—multiple-choice questions that don’t reflect the fluid, high-stakes nature of real human interaction. ...

2024-06 · 7 min · 1424 words
[Integrating Structural Semantic Knowledge for Enhanced Information Extraction Pre-training 🔗](https://aclanthology.org/2024.emnlp-main.129.pdf)

Beyond Plain Text — How SKIE Revolutionizes Information Extraction with Semantic Graphs

In the world of Natural Language Processing (NLP), understanding who did what to whom is the holy grail. This process, known as Information Extraction (IE), turns unstructured text—like a news article or a medical report—into structured data tables. For years, the standard approach has been to train massive language models on raw text. While models like BERT or RoBERTa are incredible at predicting missing or next words, they often treat sentences as linear sequences. They miss the hidden “skeleton” of language: the structural relationships between concepts. To fix this, researchers typically rely on heavily annotated datasets where humans manually label entities and relations. But this is expensive, slow, and hard to scale. ...

8 min · 1572 words
[Integrating Plutchik’s Theory with Mixture of Experts for Enhancing Emotion Classification 🔗](https://aclanthology.org/2024.emnlp-main.50.pdf)

When Psychology Meets AI: Teaching Models to Feel Using Plutchik’s Wheel and Mixture of Experts

In the world of Natural Language Processing (NLP), sentiment analysis is often treated as a solved problem. Determining whether a movie review is positive or negative is something even basic models can handle with high accuracy. However, human experience is rarely just “positive” or “negative.” It is a kaleidoscope of joy, grief, anticipation, remorse, and awe. Detecting these fine-grained emotions in text remains a massive hurdle. For example, while models like RoBERTa can crush sentiment analysis benchmarks, their accuracy often plummets when asked to classify specific emotions in tweets or social media posts. Why? Because emotions are subjective, complex, and often overlapping. ...

9 min · 1725 words
[Integrating Argumentation and Hate-Speech-based Techniques for Counteracting Misinformation 🔗](https://aclanthology.org/2024.emnlp-main.622.pdf)

Beyond Fact-Checking: How AI Can Use Argumentation Strategies to Fight Misinformation

In the digital age, misinformation is a hydra. Cut off one head by flagging a post or banning a user, and two more appear in its place. We are witnessing a proliferation of false information that is not only annoying but potentially life-threatening, particularly in contexts like public health or crisis management. The current standard for dealing with this—content moderation—is largely reactive. Platforms wait for a report, check the content, and remove it. While this might stop the immediate spread, it does little to address the root cause: the perception of the person sharing the misinformation. If a user believes a falsehood and is simply silenced, their belief often hardens. They retreat to echo chambers, convinced of a conspiracy to silence the “truth.” ...

9 min · 1742 words
[IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning 🔗](https://arxiv.org/abs/2406.13683)

Beyond Black Boxes: How IntCoOp Teaches AI to 'Describe' Before It 'Classifies'

In the rapidly evolving landscape of Artificial Intelligence, Vision-Language Models (VLMs) like CLIP have emerged as powerful foundation models. They can recognize objects, understand scenes, and even zero-shot classify categories they have never seen before. However, unlocking the full potential of these giants often requires a “magic spell”—a carefully crafted text prompt. Manual prompt engineering is tedious. While researchers have developed methods to automate this process (a technique known as prompt tuning), these methods often turn the model into a “black box,” learning abstract vectors that work mathematically but make no sense to humans. ...

2024-06 · 10 min · 2067 words
[Language Models are Supervised Multitask Learners 🔗](https://arxiv.org/abs/2406.14491)

Rethinking Pre-Training: How Supervised Instruction Synthesis is Changing the LLM Landscape

The history of Large Language Models (LLMs) over the last few years has been dominated by a specific recipe: take a massive amount of raw text from the internet, train a model to predict the next token (unsupervised learning), and then, at the very end, fine-tune it to follow instructions (supervised learning). This recipe, popularized by models like GPT-2 and GPT-3, is known as “Vanilla Pre-Training.” It relies on the sheer scale of data. But there is a lingering hypothesis in the AI community: supervised multitask learning—where the model is explicitly told what task to perform—is actually a more efficient way to learn. The problem has always been scaling. We have petabytes of raw web text, but we don’t have petabytes of high-quality, human-labeled instruction-response pairs. ...

2024-06 · 9 min · 1715 words
[Optimized Instruction Tuning of Specific Tasks 🔗](https://arxiv.org/abs/2404.16418)

Less is More: How Instruction-Only Task Selection Optimizes LLM Specialist Training

In the rapidly evolving landscape of Large Language Models (LLMs), we have seen a massive shift towards Instruction Tuning. Models like FLAN-T5 and T0 have demonstrated that training a model on a massive mixture of tasks—formatted as natural language instructions—unlocks incredible “zero-shot” capabilities. The prevailing wisdom has often been “the more tasks, the better.” The logic follows that a generalist model trained on thousands of tasks will be better equipped to handle a new, unseen task. ...

2024-04 · 11 min · 2153 words
[Instruction Fine-Tuning: Does Prompt Loss Matter? 🔗](https://arxiv.org/abs/2401.13586)

The Forgotten Hyperparameter: Why Prompt Loss Matters in LLM Fine-Tuning

In the rapidly evolving world of Large Language Models (LLMs), “best practices” are often established not through rigorous ablation studies, but through community consensus and default library settings. One such standard in Supervised Instruction Fine-Tuning (SIFT) is prompt masking. When we fine-tune a model to follow instructions, the standard approach is to calculate the loss (the error) only on the completion (the model’s answer). We typically mask the prompt (the instruction and input), telling the model, “Don’t worry about predicting these tokens; just focus on generating the answer.” ...
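
To make the default concrete, here is a minimal sketch (my own illustration, not code from the paper) of how prompt masking is commonly implemented for causal-LM fine-tuning in PyTorch: prompt positions in the label tensor are set to -100, the ignore index of `CrossEntropyLoss`, so only completion tokens contribute to the loss. The `build_labels` helper and the toy token IDs are hypothetical.

```python
# Minimal illustration (assumed, not from the paper): mask the prompt so that
# cross-entropy loss is computed only on the completion tokens.
import torch

IGNORE_INDEX = -100  # targets with this value are skipped by CrossEntropyLoss


def build_labels(prompt_ids: list[int], completion_ids: list[int]) -> dict:
    """Concatenate prompt and completion; mask prompt positions in the labels."""
    input_ids = torch.tensor(prompt_ids + completion_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = IGNORE_INDEX  # prompt tokens carry no loss
    return {"input_ids": input_ids, "labels": labels}


# Toy example: a 4-token prompt followed by a 3-token answer (IDs are made up)
batch = build_labels([101, 2054, 2003, 1029], [3437, 2182, 102])
print(batch["labels"])  # tensor([-100, -100, -100, -100, 3437, 2182, 102])
```

The post goes on to ask whether discarding the prompt loss entirely is actually the right default.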

2024-01 · 10 min · 2028 words
[Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes 🔗](https://arxiv.org/abs/2410.05052)

Taming the Spike - How WeSaR Stabilizes LLM Training by Scaling Weights

Training Large Language Models (LLMs) is an expensive, high-stakes endeavor. Imagine allocating thousands of GPUs and millions of dollars to train a model like LLaMA or GPT, only to have the training run diverge halfway through. The loss value shoots up suddenly, a phenomenon known as a loss spike, and weeks of progress can be ruined. Loss spikes are a fundamental issue in deep learning, particularly for Transformers. While engineers have developed various “band-aids”—like restarting training from a previous checkpoint or skipping data batches—the root causes remain only partially understood. ...

2024-10 · 8 min · 1591 words
[Information Flow Routes: Automatically Interpreting Language Models at Scale 🔗](https://arxiv.org/abs/2403.00824)

Mapping the Mind of an LLM: How Information Flow Routes Reveal Model Inner Workings

The inner workings of Large Language Models (LLMs) often feel like a black box. We feed a prompt into one end, and a coherent response magically appears at the other. We know the architecture—Transformers, attention heads, feed-forward networks—but understanding exactly how a specific input token influences a specific output prediction remains one of the hardest challenges in AI research. Traditionally, researchers have tried to reverse-engineer these models using “circuits”—subgraphs of the model responsible for specific tasks. However, finding these circuits is usually a manual, labor-intensive process that requires human intuition to design specific test cases. ...

2024-03 · 11 min · 2142 words
[InfiniPot: Infinite Context Processing on Memory-Constrained LLMs 🔗](https://arxiv.org/abs/2410.01518)

InfiniPot: How to Fit Infinite Context into Finite Memory

The promise of Large Language Models (LLMs) often feels boundless, but in practice, it is strictly limited by memory. Whether you are summarizing a massive legal contract, analyzing a full-length novel, or maintaining a chat history that spans weeks, you eventually hit a wall: the context window. For cloud-based giants like GPT-4 or Claude 3, simply throwing more GPUs at the problem can extend this window to 100K or even 1M tokens. But what happens when we want to bring this intelligence to the “edge”—to our laptops and mobile phones? In these memory-constrained environments, we cannot simply add more RAM. When the input sequence grows too long, the application crashes or slows to a crawl. ...

2024-10 · 10 min · 1929 words
[Inference Helps PLMs' Conceptual Understanding: Improving the Abstract Inference Ability with Hierarchical Conceptual Entailment Graphs 🔗](https://aclanthology.org/2024.emnlp-main.1233.pdf)

Beyond Words: How HiCon-EG Teaches AI to Understand Concept Hierarchies

Imagine you read the sentence: “Mrs. Thompson gives her children some pasta.” As a human, your brain instantly performs a feat of abstraction. You understand that “pasta” is a type of “food.” Because you know she is giving “food,” you can infer a consequence: “The children are well-fed.” However, if you only viewed “pasta” as a physical object (an “entity”), the inference changes. If Mrs. Thompson gives an “entity,” the only safe inference is that the children “received an entity.” The nuance of feeding and being full is lost. ...

7 min · 1431 words
[InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance 🔗](https://arxiv.org/abs/2401.11206)

Can We Make AI Safe Without Retraining? Meet InferAligner

The explosion of Large Language Models (LLMs) has shifted the landscape of artificial intelligence. We have moved past the era where only tech giants could run these models. Today, open-source models like LLaMA and Vicuna are readily available, allowing developers to fine-tune them for specific domains—be it finance, medicine, or mathematics. However, this democratization comes with a significant catch: Safety. When you take a base model and fine-tune it on a specific dataset (say, medical records), you run the risk of “catastrophic forgetting” regarding its safety protocols. A model that was once polite and harmless might, after fine-tuning, be tricked into generating malware code or hate speech. Traditionally, fixing this requires training-time alignment—processes like Reinforcement Learning from Human Feedback (RLHF). But RLHF is expensive, complex, and computationally heavy. ...

2024-01 · 8 min · 1601 words
[Inductive-Deductive Strategy Reuse for Multi-Turn Instructional Dialogues 🔗](https://arxiv.org/abs/2404.11095)

How LLMs Can Learn to Ask Better Questions: The IDEAS Framework

In the rapidly evolving world of Large Language Models (LLMs), we often focus on how well a model answers a question. But there is another side to the coin that is equally critical for training these models: how well can a model ask questions? To align LLMs with human expectations, developers need massive datasets of high-quality, multi-turn dialogues. Manually collecting these conversations is expensive and slow. The solution? Use LLMs to generate the data themselves. One LLM plays the “System Agent” (the chatbot), and another plays the “User Simulator” (the human). ...

2024-04 · 10 min · 2048 words
[INDUCT-LEARN: Short Phrase Prompting with Instruction Induction 🔗](https://aclanthology.org/2024.emnlp-main.297.pdf)

Stop Writing Long Prompts: How INDUCT-LEARN Automates Prompt Engineering

If you have ever spent hours tweaking a prompt for a Large Language Model (LLM)—changing a word here, adding a constraint there, trying to get the model to “think” correctly—you have experienced the bottleneck of prompt engineering. We know that LLMs are capable of incredible reasoning, but their performance is often highly sensitive to the instructions they receive. While techniques like “Chain-of-Thought” (CoT) prompting significantly improve performance, they usually require humans to manually write out detailed reasoning steps. This is time-consuming and requires expertise. ...

7 min · 1363 words
[Incubating Text Classifiers Following User Instructions with Nothing but LLM 🔗](https://arxiv.org/abs/2404.10877)

Building Custom Text Classifiers from Scratch: How 'Incubator' Turns LLMs into Data Generators

Imagine you need to build a text classifier for a very specific task. Perhaps you need to filter emails that are both “urgent” and “related to shipping,” or identify social media posts that are “sarcastic” versus “genuinely angry.” Traditionally, you had two difficult options. First, you could collect thousands of examples and label them by hand to train a model—a slow, expensive process. Second, you could try “zero-shot” classification using raw text mining, which involves searching massive databases for keywords. However, this often fails when concepts are complex or nuanced. ...

2024-04 · 8 min · 1516 words
[Incomplete Utterance Rewriting with Editing Operation Guidance and Utterance Augmentation 🔗](https://arxiv.org/abs/2503.16043)

Teaching AI to Fill in the Blanks: A Graph-Based Approach to Incomplete Utterance Rewriting

Imagine you are texting a friend about a movie. Friend: “Have you seen Oppenheimer yet?” You: “Who is the director?” Friend: “Nolan.” You: “Oh, I love him.” To a human, this conversation is crystal clear. When you say “him,” you mean Christopher Nolan. When your friend says “Nolan,” they actually mean “Christopher Nolan is the director.” We constantly omit words (ellipsis) or use pronouns (coreference) because the context makes the meaning obvious. ...

2025-03 · 8 min · 1590 words
[In-context Contrastive Learning for Event Causality Identification 🔗](https://arxiv.org/abs/2405.10512)

How Contrastive Learning is Revolutionizing Event Causality Identification

Causality is the bedrock of how humans understand the world. If we see a glass fall, we anticipate it might break. If we read that a heavy rainstorm occurred, we understand why the flight was delayed. For Artificial Intelligence, however, making these connections—specifically determining if one event explicitly caused another based on text—is a significant challenge. This task is known as Event Causality Identification (ECI). ...

2024-05 · 8 min · 1700 words
[In-Context Compositional Generalization for Large Vision-Language Models 🔗](https://aclanthology.org/2024.emnlp-main.996.pdf)

Beyond Simple Similarity: How to Teach Vision-Language Models to Generalize Compositionally

Imagine you are teaching a child what a “red apple” is. You show them a picture of a red apple. Now, you want them to understand a “green chair.” You show them a green chair. Finally, you present them with a “green apple”—an object they haven’t explicitly studied before, but which is composed of concepts they already know (“green” and “apple”). If the child recognizes it, they have demonstrated Compositional Generalization. ...

10 min · 2121 words
[In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search 🔗](https://arxiv.org/abs/2311.07237)

When LLMs Fail: Exploring the Long-Tail of Knowledge with Logic and Search

Large Language Models (LLMs) like GPT-4 and Llama 2 have dazzled the world with their ability to write code, compose poetry, and answer complex questions. But there is a catch: these models perform best when they are on “familiar ground.” When you ask an LLM about popular topics—like the iPhone or major historical events—it shines. But what happens when you push the model into the obscure corners of knowledge, known as the long-tail distribution? ...

2023-11 · 7 min · 1325 words