[COMPACT: Compressing Retrieved Documents Actively for Question Answering 🔗](https://arxiv.org/abs/2407.09014)

Squeezing the Truth: How COMPACT Makes RAG Smarter and Faster

Introduction In the rapidly evolving world of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become the gold standard for grounding AI responses in reality. By fetching relevant documents from an external database, we can reduce hallucinations and give models access to up-to-date information. However, there is a catch: the “context window.” While modern models boast about handling 100k or even 1 million tokens, filling that context comes with significant downsides. It is expensive, increases latency, and, paradoxically, often confuses the model. In what is known as the “Lost in the Middle” phenomenon, LLMs struggle to find specific needles in massive haystacks of retrieved text. ...

2024-07 · 8 min · 1689 words
[COMMUNITY-CROSS-INSTRUCT: Unsupervised Instruction Generation for Aligning Large Language Models to Online Communities 🔗](https://arxiv.org/abs/2406.12074)

Building Digital Twins: How to Align LLMs with Online Communities Without Human Supervision

Imagine you are a social scientist trying to understand how different political groups feel about a new tax policy, or a public health official tracking emerging diet trends. Traditionally, you have two options: run a survey or conduct focus groups. Both are slow, expensive, and plagued by biases. People might lie to look better (social desirability bias) or simply refuse to participate (non-response bias). ...

2024-06 · 8 min · 1696 words
[Communicating with Speakers and Listeners of Different Pragmatic Levels 🔗](https://arxiv.org/abs/2410.05851)

Can You Hear What I Didn't Say? Modeling Pragmatic Reasoning in AI Communication

Imagine you are sitting at a table with three objects: a red circle, a red square, and a gray circle. Someone points toward the table and says, “The red one!” Strictly speaking, this sentence is ambiguous. There are two red objects. However, most humans would immediately reach for the red circle. Why? Because if the speaker wanted the red square, they likely would have said “The square,” since that is a unique feature. The fact that they used color instead points to the red circle, the object that shape alone could not have singled out. ...
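To make that inference concrete, here is a minimal sketch (mine, not the paper's model) of how a literal listener and a pragmatic listener differ on this example; the object names and candidate utterances are invented for illustration.

```python
# Minimal sketch (illustrative only, not the paper's model) of literal vs.
# pragmatic interpretation for the table example above.
objects = {
    "red_circle": {"red", "circle"},
    "red_square": {"red", "square"},
    "gray_circle": {"gray", "circle"},
}
utterances = ["red", "gray", "circle", "square"]

def literal_listener(utterance):
    """Every object consistent with the utterance, with no further reasoning."""
    return {name for name, feats in objects.items() if utterance in feats}

def pragmatic_listener(utterance):
    """Discount candidates the speaker could have named more precisely.

    If some other utterance would have picked a candidate out uniquely,
    a cooperative speaker would have used it, so that candidate is unlikely.
    """
    candidates = literal_listener(utterance)
    kept = {
        obj for obj in candidates
        if not any(literal_listener(u) == {obj} for u in utterances if u != utterance)
    }
    return kept or candidates

print(literal_listener("red"))    # both red objects: the utterance is ambiguous
print(pragmatic_listener("red"))  # only the red circle: "square" was available for the other
```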

2024-10 · 8 min · 1519 words
[Commonsense Knowledge Editing Based on Free-Text in LLMs 🔗](https://arxiv.org/abs/2410.23844)

Teaching Common Sense to LLMs—Why Fact-Based Editing Isn't Enough

Introduction Large Language Models (LLMs) like GPT-4 and LLaMA are impressive, but they are not perfect. They can hallucinate, rely on outdated information, or simply lack specific context. In recent years, researchers have developed “Knowledge Editing” techniques—surgical methods to update a model’s weights to fix a specific error without retraining the entire network. Traditionally, this has been applied to factual knowledge. For example, if the Prime Minister of a country changes, we can edit the model to associate the country with the new leader. However, the real world isn’t just made of static facts. It is filled with commonsense knowledge—the intuitive understanding of how people act, how physics works, and how social norms operate. ...

2024-10 · 7 min · 1485 words
[CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions 🔗](https://arxiv.org/abs/2410.03077)

Stop Confusing Your LLM: How Grouping Data Enhances Instruction Tuning

Introduction The rise of Large Language Models (LLMs) like ChatGPT and LLaMa has shifted the focus of AI research from merely creating architectures to refining how these models learn. We know that “Pre-training” gives a model its vast knowledge base, but “Instruction Tuning” (IT) is what makes it helpful. IT is the process that teaches the model to follow specific user commands, transforming it from a text predictor into a capable assistant. ...

2024-10 · 7 min · 1480 words
[CommVQA: Situating Visual Question Answering in Communicative Contexts 🔗](https://arxiv.org/abs/2402.15002)

Why Context Matters: Reimagining Visual Question Answering with CommVQA

Imagine you are looking at a picture of a mountain range. If you found this image on a travel blog, you might ask: “Where is this located?” or “How difficult is the hike?” However, if you encountered the exact same image in a science magazine, your questions would likely shift to: “Is this a volcanic range?” or “How were these peaks formed?” This simple thought experiment highlights a fundamental aspect of human communication: our questions are rarely generated in a vacuum. They are shaped by our goals, our environment, and the information we already possess. ...

2024-02 · 8 min · 1626 words
[Collective Critics for Creative Story Generation 🔗](https://arxiv.org/abs/2410.02428)

Can AI Be Creative? How 'Collective Critics' Are Teaching LLMs to Write Better Stories

If you have ever asked ChatGPT or Llama to write a story, you have likely encountered a specific problem. The output is usually coherent; the grammar is perfect, the sequence of events makes sense, and the characters do what they are supposed to do. But it is often… boring. It lacks the spark, the clever twist, or the vivid imagery that makes human writing gripping. In the field of Natural Language Processing (NLP), this is a known trade-off. We have become very good at coherence (logic and flow), but we are still struggling with creativity (novelty, surprise, and emotional resonance). ...

2024-10 · 9 min · 1750 words
[Collaborative Performance Prediction for Large Language Models 🔗](https://arxiv.org/abs/2407.01300)

Beyond Scaling Laws: How Netflix-Style Algorithms Can Predict LLM Performance

Introduction In the rapidly evolving landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4, Llama 3, and Claude have become the engines driving innovation. However, a significant bottleneck hampers the progress of researchers and engineers alike: the exorbitant cost of evaluation. To truly understand a model’s capabilities, it must be tested against massive benchmarks—suites of tasks ranging from coding problems to complex reasoning and creative writing. Running a single LLM through a comprehensive benchmark can cost upwards of $10,000 and consume thousands of GPU hours. When you consider the sheer number of models being released and the variations in training configurations, the evaluation matrix becomes impossibly large and expensive to fill. ...
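To make the “Netflix-style” intuition concrete, here is a rough sketch (not the paper's actual method) of filling in a sparse model-by-benchmark score matrix with a low-rank factorization; all numbers are made up.

```python
import numpy as np

# Toy model-by-benchmark score matrix; np.nan marks evaluations nobody has run.
# All numbers are invented. This is only a sketch of the collaborative-filtering
# intuition (predicting missing cells from a low-rank factorization), not the
# paper's actual method.
scores = np.array([
    [0.82, 0.61, np.nan, 0.74],
    [0.79, np.nan, 0.55, 0.71],
    [np.nan, 0.58, 0.49, 0.66],
])
observed = ~np.isnan(scores)

rank, lr, steps = 2, 0.05, 5000
rng = np.random.default_rng(0)
M = rng.normal(scale=0.1, size=(scores.shape[0], rank))  # latent model factors
B = rng.normal(scale=0.1, size=(scores.shape[1], rank))  # latent benchmark factors

for _ in range(steps):
    pred = M @ B.T
    err = np.where(observed, scores - pred, 0.0)  # only observed cells drive updates
    M += lr * err @ B
    B += lr * err.T @ M

# The previously missing cells now contain predicted scores.
print(np.round(M @ B.T, 2))
```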

2024-07 · 9 min · 1875 words
[COFFEE-GYM: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code 🔗](https://arxiv.org/abs/2409.19715)

Debugging the debugger: How COFFEE-GYM uses RL to teach AI how to give better coding advice

Introduction We are living in the golden age of AI-assisted programming. Large Language Models (LLMs) like GPT-4 and DeepSeekCoder have become indispensable tools for developers, capable of generating complex functions and boilerplate code in seconds. However, anyone who has used these tools knows a painful truth: they are not perfect. When an LLM generates buggy code, the natural next step is to ask it to fix it. This process—code editing—is crucial. But simply generating a fixed version of the code isn’t always enough, especially for educational purposes or complex debugging. We need the model to explain what went wrong and how to fix it. We need high-quality Natural Language (NL) Feedback. ...

2024-09 · 9 min · 1735 words
[CODEJUDGE: Evaluating Code Generation with Large Language Models 🔗](https://arxiv.org/abs/2410.02184)

Beyond Test Cases: How CODEJUDGE Uses Slow Thinking to Evaluate AI Code

Introduction In the rapidly evolving landscape of Artificial Intelligence, code generation has become one of the “killer apps.” Tools like GitHub Copilot and ChatGPT have transformed how developers write software, churning out functions, classes, and even entire applications in seconds. But this capability introduces a critical, often overlooked bottleneck: Evaluation. How do we know if the code an AI writes is actually good? Historically, we’ve relied on two main methods: running the code against unit tests (which requires writing those tests first) or comparing the text of the code to a “correct” reference solution (using metrics like BLEU). Both methods have severe limitations. Real-world tasks often lack test cases, and correct code can be written in a thousand different ways, making text comparison unreliable. ...
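A toy example (mine, not the paper's) makes the second limitation concrete: two snippets can behave identically while sharing almost no tokens, so a reference-based text metric would penalize a perfectly correct candidate.

```python
# Toy illustration (not from the paper): two snippets that behave identically
# but share almost no surface tokens, which is why reference-based text metrics
# like BLEU are unreliable judges of code correctness.
reference = "def add_numbers(a, b):\n    return a + b"
candidate = "add_numbers = lambda x, y: sum((x, y))"

# Behavioral check: both definitions pass the same test.
ns_ref, ns_cand = {}, {}
exec(reference, ns_ref)
exec(candidate, ns_cand)
assert ns_ref["add_numbers"](2, 3) == ns_cand["add_numbers"](2, 3) == 5

# Surface check: token overlap is essentially zero despite identical behavior.
ref_tokens, cand_tokens = set(reference.split()), set(candidate.split())
overlap = len(ref_tokens & cand_tokens) / len(ref_tokens | cand_tokens)
print(f"Jaccard token overlap: {overlap:.2f}")
```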

2024-10 · 8 min · 1658 words
[CodeAgent: Autonomous Communicative Agents for Code Review 🔗](https://arxiv.org/abs/2402.02172)

Meet Your New AI Code Review Team: Inside the CodeAgent Framework

Code review is the backbone of high-quality software engineering. It’s the process where developers check each other’s work to spot bugs, ensure stylistic consistency, and verify that the code actually does what the commit message says it does. However, if you have ever worked in a software team, you know the reality: code review is labor-intensive, time-consuming, and prone to human error. Naturally, researchers have turned to Large Language Models (LLMs) to automate this. But there is a snag. Most existing AI tools treat code review as a simple “input-output” task—you feed in code, and the AI spits out a critique. This ignores a fundamental truth: Code review is an interactive, collaborative process. It involves understanding context, checking formatting against legacy files, and ensuring security—tasks that often require different “mindsets.” ...

2024-02 · 8 min · 1621 words
[Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs 🔗](https://arxiv.org/abs/2401.10065)

Why Thinking in Python Makes LLMs Smarter — The Power of Code Prompting

If you have ever tried to navigate a complex legal document or determine your eligibility for a visa, you know that the logic involved is rarely straightforward. It is a maze of conditional statements: “If you are over 18, AND you have lived here for 5 years, OR you are married to a citizen, THEN…” This is known as conditional reasoning, and it is a fundamental component of human intelligence. For Large Language Models (LLMs), however, it can be a significant stumbling block. While models like GPT-4 are impressive, they often hallucinate or lose track of logic when faced with long chains of conditions buried in natural language text. ...
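As a rough illustration (the rule and its thresholds are invented for this example, not taken from the paper), the visa-style condition above might be rendered as code like this; notice that doing so also forces the ambiguous AND/OR grouping to be resolved explicitly.

```python
# The excerpt's eligibility rule, restated as explicit code. The rule and its
# grouping are invented for this illustration; the point is that code makes the
# conditional structure (and its AND/OR precedence) unambiguous and traceable.
def is_eligible(age: int, years_resident: int, married_to_citizen: bool) -> bool:
    return age >= 18 and (years_resident >= 5 or married_to_citizen)

print(is_eligible(age=30, years_resident=2, married_to_citizen=True))   # True
print(is_eligible(age=30, years_resident=6, married_to_citizen=False))  # True
print(is_eligible(age=17, years_resident=6, married_to_citizen=True))   # False
```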

2024-01 · 7 min · 1453 words
[CoTKR: Chain-of-Thought Enhanced Knowledge Rewriting for Complex Knowledge Graph Question Answering 🔗](https://arxiv.org/abs/2409.19753)

Bridging the Gap: How Chain-of-Thought Rewriting Optimizes Knowledge Graphs for LLMs

Introduction Large Language Models (LLMs) like GPT-4 and Llama have revolutionized how we interact with information. They can write poetry, code websites, and answer questions on a vast array of topics. However, for all their brilliance, they have a notorious flaw: “hallucination.” When an LLM doesn’t know a specific fact—or when that fact is obscure or outdated—it often makes things up with supreme confidence. To combat this, researchers rely on Retrieval-Augmented Generation (RAG). The idea is simple: before the LLM answers a question, we retrieve relevant data from an external source (like a database or a document) and feed it to the LLM as context. ...
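The retrieve-then-read loop described here can be sketched generically (this is not CoTKR itself; `retrieve` and `llm` are hypothetical stand-ins for whatever retriever and model you plug in):

```python
# Generic retrieve-then-read sketch of the RAG loop described above; this is not
# CoTKR itself. `retrieve` and `llm` are hypothetical stand-ins for a real
# retriever (vector store, knowledge graph) and a real model call.
def retrieve(question: str, k: int = 3) -> list[str]:
    return ["(Neil Armstrong, commander_of, Apollo 11)"]  # placeholder facts

def llm(prompt: str) -> str:
    return "Neil Armstrong"  # placeholder generation

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)

print(answer("Who commanded Apollo 11?"))
```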

2024-09 · 9 min · 1902 words
[CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference 🔗](https://arxiv.org/abs/2406.17626)

When "It" Becomes Dangerous: Exposing Safety Gaps in LLM Conversations

Large Language Models (LLMs) like LLaMA, GPT-4, and Claude have become incredibly adept at refusing harmful requests. If you explicitly ask a modern, safety-aligned model, “How do I make a bomb?” or “Write a hateful slur,” it will almost certainly refuse, citing ethical guidelines. This is the result of extensive “red teaming”—a process where researchers attack the model to find flaws and then patch them. However, most of this safety training focuses on single-prompt attacks. The user asks a bad question; the model says no. But real-world interactions are rarely single-turn queries. They are conversations. They involve context, back-and-forth dialogue, and linguistic shortcuts. ...

2024-06 · 9 min · 1825 words
[COGEN: Learning from Feedback with Coupled Comprehension and Generation 🔗](https://arxiv.org/abs/2408.15992)

The Virtuous Cycle: How Coupling Speaking and Listening Improves AI Learning

In human cognition, speaking and listening are not isolated islands. When we listen to someone, our brains actively predict what they are about to say. Conversely, when we speak, we often simulate how our words will be received by the listener to ensure clarity. This bidirectional relationship suggests that improving one skill should naturally help the other. However, in the world of Artificial Intelligence, these two capabilities—generation (speaking) and comprehension (listening)—are often trained and treated as separate tasks. ...

2024-08 · 7 min · 1487 words
[COEVOL: Constructing Better Responses for Instruction Finetuning through Multi-Agent Cooperation 🔗](https://arxiv.org/abs/2406.07054)

Beyond Data Selection: How Multi-Agent Debate Can Evolve Better LLM Responses

If you have been following the evolution of Large Language Models (LLMs), you are likely familiar with the concept of Instruction Fine-Tuning (IFT). It is the crucial step that turns a raw, text-predicting base model into a helpful assistant capable of following user commands. Recently, the research community has shifted its focus from “how much data do we need?” to “how good does the data need to be?” Papers like LIMA (Less Is More for Alignment) demonstrated that a small set of high-quality data often beats massive amounts of noisy data. This led to a gold rush of data selection methods—algorithms designed to sift through datasets and pick the “cherry” samples while discarding the “lemons.” ...

2024-06 · 7 min · 1368 words
[CoCoST: Automatic Complex Code Generation with Online Searching and Correctness Testing 🔗](https://arxiv.org/abs/2403.13583)

Teaching LLMs to Code Like Humans: The CoCoST Framework

The promise of Large Language Models (LLMs) in software engineering is dazzling. You type a prompt, and the model spits out working code. For simple tasks—like writing a Fibonacci sequence or a basic SQL query—current models like GPT-4 are incredibly proficient. However, the reality of professional software development is rarely that simple. Real-world coding involves intricate libraries (like TensorFlow or Pandas), complex logic, and specific data structures. When LLMs face these “complex code generation” tasks, they often hallucinate non-existent libraries, write code that runs but produces the wrong answer, or fail to handle edge cases. ...

2024-03 · 8 min · 1517 words
[CoCoLoFa: A Dataset of News Comments with Common Logical Fallacies Written by LLM-Assisted Crowds 🔗](https://arxiv.org/abs/2410.03457)

How to Build a Better Troll: Using LLMs and Crowdsourcing to Teach AI Logical Fallacies

Introduction: The Art of Bad Arguments If you have ever ventured into the comments section of a controversial news article, you have likely encountered them: arguments that sound convincing on the surface but crumble under the slightest scrutiny. A commenter might claim that implementing a small tax increase will inevitably lead to a totalitarian communist state (a Slippery Slope). Another might argue that because a specific politician is corrupt, all politicians must be criminals (a Hasty Generalization). ...

2024-10 · 8 min · 1640 words
[CoBa: Convergence Balancer for Multitask Finetuning of Large Language Models 🔗](https://arxiv.org/abs/2410.06741)

Balancing Act: How CoBa Solves the Multi-Task Fine-Tuning Puzzle for LLMs

In the rapidly evolving landscape of Large Language Models (LLMs), we have moved past the initial awe of “it can speak” to the logistical nightmare of “how do we use this in production?” Imagine you are an engineer at a tech giant. You need your LLM to perform code completion in Python, translate Java to C++, and generate unit tests. The traditional approach is to fine-tune a separate model for each task. But deploying a separate 13-billion-parameter model for every task is incredibly resource-heavy and inefficient. ...

2024-10 · 7 min · 1425 words
[CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research 🔗](https://arxiv.org/abs/2411.01176)

Decoding the Language of Hackers: How CmdCaliper Brings Semantic Understanding to the Command Line

If you have ever stared at a terminal window during a security incident, you know that the command line is the battlefield of modern cybersecurity. For attackers, the command line interface (CLI) is the ultimate tool for execution, persistence, and privilege escalation. For defenders, it is a crime scene full of fingerprints. However, there is a significant problem in how we analyze these fingerprints. Attackers are masters of disguise. They can rewrite the same malicious logic in a dozen different ways—changing argument orders, using aliases, or obfuscating strings—to evade detection systems that rely on simple pattern matching or signature detection. ...

2024-11 · 8 min · 1664 words