[I Could've Asked That: Reformulating Unanswerable Questions 🔗](https://aclanthology.org/2024.emnlp-main.242.pdf)

Beyond 'I Don't Know': Teaching AI to Fix Our Unanswerable Questions

Imagine you are reading a dense legal contract or a complex medical journal. You aren’t an expert, so you turn to an AI assistant—like ChatGPT or a specialized document reader—to help you understand it. You ask a question based on your limited understanding: “What is the penalty if the tenant paints the walls?” The AI scans the document and replies: “The document does not mention penalties for painting walls.” ...

9 min · 1761 words
[Humans or LLMs as the Judge? A Study on Judgement Bias 🔗](https://aclanthology.org/2024.emnlp-main.474.pdf)

Who Watches the Watchmen? Uncovering Bias in Human and AI Judges

The explosion of Large Language Models (LLMs) like GPT-4, Claude, and Gemini has brought us remarkable capabilities in natural language processing. But with great power comes a difficult question: How do we know if these models are actually doing a good job? Evaluating an LLM isn’t like checking a math test. In open-ended tasks—like writing an essay, summarizing a story, or providing therapy-like advice—there is no single “correct” answer. Historically, we relied on humans to grade these responses. Recently, however, the field has shifted toward “LLM-as-a-judge,” where powerful models like GPT-4 are used to grade the outputs of other models. It’s faster, cheaper, and scalable. ...

8 min · 1642 words
[Human-LLM Hybrid Text Answer Aggregation for Crowd Annotations 🔗](https://arxiv.org/abs/2410.17099)

Beyond the Wisdom of Crowds: How Human-LLM Hybrid Frameworks Are Revolutionizing Text Annotation

The “Wisdom of Crowds” is a concept as old as statistics itself. The idea is simple: if you ask enough people to guess the number of jellybeans in a jar, the average of their guesses is often startlingly close to the truth—closer, in fact, than the guess of any single expert. In the world of Machine Learning (ML) and Natural Language Processing (NLP), we rely heavily on this principle. We use platforms like Amazon Mechanical Turk or Lancers to gather labeled data. When the task is simple, like clicking a button to say whether an image contains a cat, aggregating the answers is easy: just take the majority vote. ...
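For categorical labels, that aggregation step really is trivial; the hard part the paper addresses is aggregating free-text answers, where no two annotations match exactly. As a point of reference, here is a minimal sketch of the classical majority-vote baseline (the function name and example labels are illustrative, not taken from the paper):

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label among crowd annotations."""
    return Counter(labels).most_common(1)[0][0]

# Five workers label the same image; the majority wins.
print(majority_vote(["cat", "cat", "dog", "cat", "bird"]))  # -> "cat"
```

This works only because identical labels can be counted; free-text answers ("a tabby cat sitting on a fence" vs. "cat on fence") defeat simple counting, which is where the hybrid human-LLM aggregation comes in.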

2024-10 · 10 min · 2080 words
[How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective 🔗](https://arxiv.org/abs/2410.10093)

Beyond SFT: Aligning LLMs with Generalized Self-Imitation Learning (GSIL)

Large Language Models (LLMs) are impressive, but raw pre-trained models are like brilliant but unruly students. They know a lot about the world, but they don’t always know how to behave, follow instructions, or solve complex problems step-by-step. To fix this, we perform a process called alignment. Currently, the standard recipe for alignment has two main stages: (1) Supervised Fine-Tuning (SFT), where you show the model examples of good prompts and responses and it learns to copy them; and (2) Preference Fine-Tuning (RLHF/DPO), where you show the model two responses (one good, one bad) and teach it to prefer the good one. The second stage is powerful but expensive. It requires collecting human preference data (“Response A is better than Response B”), which is costly and slow to scale. What if we could achieve the high performance of preference learning using only the demonstration data from the first stage? ...
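The paper’s GSIL objective itself is not shown in this excerpt, so as context here is a minimal sketch, assuming PyTorch, of the two standard losses the recipe above refers to: plain next-token cross-entropy for SFT, and a DPO-style preference loss that needs paired good/bad responses. Function names and the beta value are illustrative.

```python
import torch.nn.functional as F

def sft_loss(logits, target_ids):
    """Stage 1 (SFT): next-token cross-entropy on demonstration data."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Stage 2 (DPO-style preference loss): push the policy to prefer the chosen
    response over the rejected one, measured relative to a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

The second function is exactly what requires the costly “A is better than B” pairs; the paper’s question is whether demonstration data alone can recover that benefit.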

2024-10 · 9 min · 1730 words
[How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning 🔗](https://arxiv.org/abs/2402.02872)

Deconstructing In-Context Learning: The Two-Tower Mechanism Hidden Inside LLMs

Large Language Models (LLMs) like GPT-4 and Llama have displayed a fascinating emergent ability known as In-Context Learning (ICL). This is the phenomenon where you provide a model with a few examples (demonstrations) in the prompt—like “English: Cat, French: Chat”—and the model instantly learns the pattern to complete a new example, all without any parameter updates or retraining. While we use ICL every day, the underlying mechanism remains somewhat of a “black box.” How exactly does the model move information from the demonstration examples to the final prediction? Does it actually “learn” the task, or is it just relying on pre-existing knowledge? ...

2024-02 · 9 min · 1817 words
[How Susceptible are Large Language Models to Ideological Manipulation? 🔗](https://arxiv.org/abs/2402.11725)

Brainwashing AI: How Easily Can LLMs Be Ideologically Manipulated?

Large Language Models (LLMs) like ChatGPT and Llama-2 have become our digital interlocutors, helping us draft emails, summarize news, and answer complex questions. But as we increasingly rely on them for information, a critical question arises: Does the model have an ideology? And if so, can that ideology be hijacked? We often think of AI alignment as preventing models from generating hate speech or building bombs. However, a subtler and perhaps more pervasive risk exists: ideological manipulation. Can a malicious actor take a neutral model and, with a tiny amount of data, turn it into a radical partisan? ...

2024-02 · 8 min · 1507 words
[How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics 🔗](https://arxiv.org/abs/2410.03429)

Stop Cheating: How to Find the "Real" Hard Questions in NLI Datasets

Imagine you are taking a multiple-choice history test. You don’t actually know the history, but you notice a pattern: every time the answer contains the word “never,” it’s the correct choice. You ace the test, scoring 100%. But have you learned history? No. You’ve just learned a statistical shortcut. This scenario describes a massive problem in current Artificial Intelligence, specifically in Natural Language Inference (NLI). Models like BERT and RoBERTa achieve superhuman scores on benchmark datasets, but they often fail when faced with real-world, nuanced language. Why? Because the datasets they are tested on are full of “spurious correlations”—linguistic shortcuts that allow models to guess the right answer without understanding the logic. ...

2024-10 · 9 min · 1728 words
[How Far Can We Extract Diverse Perspectives from Large Language Models? 🔗](https://arxiv.org/abs/2311.09799)

Breaking the Echo Chamber: Can LLMs Simulate Diverse Human Perspectives?

In the world of Artificial Intelligence, Large Language Models (LLMs) are often described as “compressed knowledge.” They have devoured varied texts from millions of human authors, encompassing a vast spectrum of beliefs, cultures, and values. Yet, when we chat with a model like GPT-4, we often receive a single, polished, “majority-vote” answer. This raises a fascinating research question: If these models were trained on diverse perspectives, can we reverse-engineer them to extract that diversity? Can an LLM step out of its default persona and simulate a crowd of people with disagreeing opinions? ...

2023-11 · 7 min · 1308 words
[How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? 🔗](https://arxiv.org/abs/2404.12866)

Unlocking the Power of Text in Multimodal In-Context Learning

In the rapidly evolving world of Artificial Intelligence, Multimodal Large Language Models (MLLMs)—models that can understand both text and images—have become the new frontier. A key capability of these models is In-Context Learning (ICL). This is the ability of a model to learn a new task simply by looking at a few examples provided in the prompt, without requiring any updates to its weights (no fine-tuning necessary). For example, if you want an MLLM to write a funny caption for an image, you might first show it three examples of images with funny captions. The model “gets the idea” and applies that pattern to your new image. ...

2024-04 · 9 min · 1755 words
[How Does the Disclosure of AI Assistance Affect the Perceptions of Writing? 🔗](https://arxiv.org/abs/2410.04545)

The Bias of Disclosure: How Knowing AI Helped You Write Changes How You Are Judged

We have entered a new era of digital composition. Gone are the days when “writing assistance” simply meant a red squiggly line under a misspelled word. With the advent of Large Language Models (LLMs) like GPT-4, writing has evolved into a co-creative process. Humans prompt, AI drafts, humans refine, and AI polishes. This paradigm shift raises profound questions about authorship, creativity, and quality. However, a critical psychological question remains unanswered: How do readers react when they know a piece of text was co-written by an AI? ...

2024-10 · 9 min · 1743 words
[How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with Really Good Data 🔗](https://aclanthology.org/2024.emnlp-main.777.pdf)

The Illusion of Competence: Why Your Code LLM Might Be Cheating (And How to Fix It)

If you have been following the explosion of Large Language Models (LLMs) specialized for coding, you have likely seen the leaderboards. Every week, a new open-source model claims to rival GPT-4 on benchmarks like HumanEval. It seems we are in a golden age of automated programming. But there is a catch. If you take these high-flying models and test them on newer, fresher problems from competitive programming sites, their performance often collapses. Why? ...

7 min · 1468 words
[How Do Humans Write Code? Large Models Do It the Same Way Too 🔗](https://arxiv.org/abs/2402.15729)

Thinking Before Coding: How 'Human-Think Language' Fixes LLM Math Errors

If you have ever asked a Large Language Model (LLM) like ChatGPT or Llama to solve a complex math word problem, you might have noticed a frustrating pattern. Sometimes, it understands the logic perfectly but fails at simple arithmetic (hallucinating that \(25 \times 14 = 300\)). Other times, it writes a Python script to solve the problem, but the script solves the wrong equation entirely. This inconsistency highlights a major divide in AI reasoning. On one hand, we have Chain-of-Thought (CoT), where the model explains its reasoning in natural language. This is great for logic but prone to calculation errors. On the other hand, we have Program-of-Thought (PoT), where the model writes code to calculate the answer. This solves the calculation issue but introduces a new one: the model often fails to translate the word problem into the correct code logic. ...
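To make the CoT/PoT contrast concrete, here is a minimal sketch of the Program-of-Thought pattern: the model emits a short program instead of doing the arithmetic in prose, and the host executes it, so the calculation itself cannot be hallucinated. The program string and numbers are illustrative; this is not the paper’s Human-Think Language method.

```python
# Program-of-Thought in miniature: the model writes code, Python does the math.
generated_program = """
boxes = 25            # boxes of apples (illustrative word-problem quantities)
apples_per_box = 14
answer = boxes * apples_per_box
"""

namespace = {}
exec(generated_program, namespace)   # execution replaces error-prone mental arithmetic
print(namespace["answer"])           # 350, not a hallucinated 300
```

The residual risk, as the excerpt notes, is on the other side: the generated program may compute the wrong quantity perfectly, which is the translation failure the paper targets.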

2024-02 · 10 min · 1931 words
[Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective 🔗](https://arxiv.org/abs/2409.10053)

Spinning the Truth: How Rotating LLM Activations Is Better Than Steering Them

Large Language Models (LLMs) like Llama and Mistral are incredible feats of engineering, capable of fluent reasoning and creativity. However, they are also prone to hallucinations, biases, and toxic outputs. When we want to fix these behaviors, our traditional toolkit—like fine-tuning—can be computationally expensive and sometimes compromises the model’s general capabilities. Recently, a technique called Activation Editing (or Representation Engineering) has emerged as a surgical alternative. Instead of retraining the model weights, we intervene during inference, tweaking the internal “thoughts” (activations) of the model to guide it toward honesty or safety. ...
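The excerpt describes activation editing in general terms; a minimal sketch of the common additive “steering vector” baseline, the kind of approach the paper’s rotation-based method is contrasted with, might look like the following. The hook, layer index, and alpha scale are illustrative assumptions, not the paper’s Householder Pseudo-Rotation.

```python
import torch

def make_steering_hook(direction, alpha=4.0):
    """Generic additive activation steering: shift each hidden state along a
    fixed direction (e.g., an 'honesty' direction) at inference time."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction   # note: this also changes the activation's magnitude
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on a Hugging Face-style decoder (layer index is arbitrary):
# direction = torch.randn(model.config.hidden_size)
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(direction))
```

The commented magnitude change is the crux: additive edits perturb both direction and norm of the activation, which is the direction-magnitude issue the paper’s rotation approach is designed to avoid.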

2024-09 · 7 min · 1467 words
[Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries 🔗](https://arxiv.org/abs/2406.12775)

The Clock is Ticking: Why LLMs Fail at Multi-Step Reasoning

Large Language Models (LLMs) often feel indistinguishable from magic. They can write poetry, code in Python, and summarize history. Yet, for all their prowess, they frequently stumble on questions that require simple, sequential logic—what researchers call “multi-hop queries.” Consider the question: “The spouse of the performer of Imagine is…” For a human, this is a straightforward two-step process. Hop 1: Who performed Imagine? -> John Lennon. Hop 2: Who is the spouse of John Lennon? -> Yoko Ono. Ideally, an LLM should perform this exact sequence internally. However, models often get this wrong. They might hallucinate an answer or simply fail to make the connection, despite “knowing” both individual facts (that Lennon sang Imagine and that Yoko Ono is his spouse). ...
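The “ideal” internal procedure the excerpt describes is easy to write down explicitly. A toy sketch over a hand-built fact table (the two facts come straight from the example above; the relation names are illustrative):

```python
# Toy two-hop lookup mirroring the example in the excerpt.
facts = {
    ("Imagine", "performer"): "John Lennon",
    ("John Lennon", "spouse"): "Yoko Ono",
}

def two_hop(entity, first_relation, second_relation):
    bridge = facts[(entity, first_relation)]      # hop 1: resolve the bridge entity
    return facts[(bridge, second_relation)]       # hop 2: query the bridge entity

print(two_hop("Imagine", "performer", "spouse"))  # -> "Yoko Ono"
```

The paper’s question is whether, and at which layers, the transformer actually performs this sequential resolution internally, or whether the first hop arrives too late for the second to use it.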

2024-06 · 8 min · 1702 words
[Holistic Evaluation for Interleaved Text-and-Image Generation 🔗](https://arxiv.org/abs/2406.14643)

Beyond Text-to-Image: How Do We Evaluate AI That Tells Stories with Pictures?

Imagine you ask an AI to write a tutorial on “How to bake sourdough bread.” You don’t just want a wall of text; you want step-by-step instructions interleaved with photos of the dough rising, the scoring pattern, and the final golden loaf. Or perhaps you want an AI to generate a children’s book where the text and illustrations flow together naturally on every page. This capability is known as interleaved text-and-image generation. While we have mastered text generation (thanks to LLMs like GPT-4) and made massive strides in image generation (Stable Diffusion, DALL-E), combining them into a single, coherent narrative stream is a frontier challenge. ...

2024-06 · 8 min · 1579 words
[Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction 🔗](https://arxiv.org/abs/2409.16783)

Breaking to Build: How HARM Automates Red Teaming for Safer LLMs

As Large Language Models (LLMs) become integrated into everything from code generation to legal advice, the stakes for safety have never been higher. We know these models are trained on the vast, unfiltered internet, meaning they inherently “know” how to generate hate speech, instructions for illegal acts, or biased content. The challenge lies in preventing them from ever outputting it. The industry standard for safety testing is Red Teaming—a practice adopted from cybersecurity where a group of testers (the “red team”) actively attacks the system to find vulnerabilities. In the context of LLMs, this means trying to trick the model into saying something it shouldn’t. ...

2024-09 · 8 min · 1615 words
[Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Utilization 🔗](https://arxiv.org/abs/2406.19502)

Can AI Really Reason? Deconstructing LLM Knowledge from Foundation to Strategy

Large Language Models (LLMs) like GPT-4 and LLaMA have become eerily good at answering complex scientific and mathematical questions. If you ask an LLM, “Why does ReLU activation train faster than sigmoid?”, it will likely give you a coherent, textbook-quality paragraph about gradients and saturation. But this capability triggers a nagging question for researchers and students alike: Is the model actually reasoning, or is it just parroting a memorized block of text? ...

2024-06 · 8 min · 1587 words
[Hidden Persuaders: LLMs' Political Leaning and Their Influence on Voters 🔗](https://arxiv.org/abs/2410.24190)

The AI Ballot: How LLMs Lean Left and Persuade Voters

Artificial Intelligence has rapidly transitioned from a novelty to a daily utility. We use Large Language Models (LLMs) to draft emails, summarize news, and explain complex concepts. Implicit in this usage is a presumption of neutrality—we often treat these models as objective synthesizers of information. However, a recent study titled “Hidden Persuaders: LLMs’ Political Leaning and Their Influence on Voters” by researchers at UC Berkeley and the University of Chicago challenges this assumption. The paper investigates a critical question for modern democracy: Do LLMs hold political biases, and if so, can they unintentionally sway the electorate? ...

2024-10 · 7 min · 1363 words
[HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy 🔗](https://arxiv.org/abs/2401.15207)

Breaking the GPU Memory Wall: How HiFT Enables Full-Parameter Fine-Tuning on Consumer Hardware

If you have ever tried to fine-tune a Large Language Model (LLM) like LLaMA or RoBERTa, you have likely run into the “memory wall.” You download the model, set up your PyTorch training loop, and hit run, only to be immediately greeted by the dreaded CUDA Out of Memory (OOM) error. The culprit is usually Full-Parameter Fine-Tuning (FPFT). While FPFT is the gold standard for adapting a model to a new task—allowing the model to adjust every single weight to learn new patterns—it is exorbitantly expensive. It requires storing not just the model weights, but also the gradients and, crucially, the optimizer states (like momentum in AdamW) for every single parameter simultaneously. ...
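A quick back-of-envelope calculation shows why the wall appears. With vanilla AdamW, every parameter drags along a gradient plus two optimizer moments. The sketch below assumes a 7B-parameter model and fp32 storage for everything, which is a simplification (mixed precision changes the constants, not the conclusion):

```python
# Rough memory estimate for full-parameter fine-tuning of a 7B model (fp32, AdamW).
params = 7e9
bytes_per_value = 4                            # fp32

weights   = params * bytes_per_value           # the model itself
gradients = params * bytes_per_value           # one gradient per parameter
optimizer = params * bytes_per_value * 2       # AdamW keeps two moments (m and v)

total_gb = (weights + gradients + optimizer) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB, far beyond a consumer GPU
```

That total excludes activations and any batch overhead, which is why FPFT on a single consumer GPU fails long before training begins.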

2024-01 · 9 min · 1732 words
[Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models 🔗](https://arxiv.org/abs/2401.06432)

Taming the Edge: How HETLORA Adapts Foundation Models to Heterogeneous Devices

We are living in the era of Foundation Models (FMs). From chatbots to code assistants, Large Language Models (LLMs) have demonstrated incredible capabilities in zero-shot and few-shot learning. However, there is a massive friction point in the current AI ecosystem: privacy. Most powerful models reside in centralized data centers. To fine-tune these models on sensitive private data—like medical records, legal documents, or personal chat history—users typically have to upload their data to the cloud. This is a privacy nightmare that Federated Learning (FL) aims to solve. FL allows models to be trained across distributed devices (clients) without the data ever leaving the device. ...

2024-01 · 9 min · 1766 words