[On the Role of Context in Reading Time Prediction 🔗](https://arxiv.org/abs/2409.08160)

Is Context Overrated? Rethinking Surprisal Theory in Reading Time Prediction

If you have ever caught yourself finishing someone else’s sentence, you intuitively understand that language processing is predictive. When we read or listen, we don’t just passively receive words; our brains actively anticipate what comes next based on the context. In the field of psycholinguistics, this phenomenon is formalized as Surprisal Theory. The core tenet is simple yet powerful: the processing effort required for a word (often measured by how long our eyes linger on it) is proportional to its “surprisal”—or how unexpected it is given the preceding context. A highly predictable word is processed quickly; a surprising word causes a stutter in our cognitive flow, leading to longer reading times. ...
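For readers who want the formal statement behind that tenet, here is the standard textbook formulation of surprisal (a generic sketch, not an equation quoted from the paper):

```latex
% Surprisal of word w_t given its preceding context (standard definition)
s(w_t) = -\log p(w_t \mid w_1, \dots, w_{t-1})

% Surprisal theory: processing effort, e.g. reading time, is assumed to grow
% with s(w_t); analyses typically fit it as a (roughly linear) predictor
% alongside baseline covariates such as word length and frequency.
\mathrm{RT}(w_t) \approx \alpha + \beta \, s(w_t) + \text{baseline covariates}
```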

2024-09 · 9 min · 1745 words
[On the Robustness of Editing Large Language Models 🔗](https://arxiv.org/abs/2402.05827)

The Fragile Memory of AI: Why Editing LLMs is Harder Than It Looks

Imagine you are training a new employee. You tell them, “The project manager is no longer Alice; it’s now Bob.” A human employee immediately updates their mental model. They won’t accidentally call Alice the manager during a lunch break, nor will they get confused if you ask, “Who is the person in charge of the project?” using slightly different phrasing. Now, consider Large Language Models (LLMs). We often view them as static repositories of information trained on massive datasets. But facts change. Prime ministers resign, companies rebrand, and scientific theories evolve. Retraining an entire multi-billion parameter model for every minor update is computationally impossible. ...

2024-02 · 8 min · 1685 words
[On the Relationship between Truth and Political Bias in Language Models 🔗](https://arxiv.org/abs/2409.05283)

The Truth Paradox: Why Teaching AI to Be Honest Might Make It Partisan

Introduction: The Alignment Trilemma. In the world of Artificial Intelligence, researchers are constantly chasing the “Holy Grail” of alignment. We want Large Language Models (LLMs) like ChatGPT or Claude to possess three core attributes: we want them to be helpful, we want them to be harmless, and we want them to be truthful. On the surface, these seem like complementary goals. A truthful assistant is surely a helpful one, right? However, a fascinating new research paper from the MIT Center for Constructive Communication and the MIT Media Lab suggests that these objectives might actually be in tension with one another. Specifically, the researchers investigate a startling correlation: optimizing a model for truthfulness seems to inadvertently pull it toward a left-leaning political bias. ...

2024-09 · 8 min · 1621 words
[Marginalizing Out Tokenization in Surprisal-Based Psycholinguistic Predictive Modeling 🔗](https://arxiv.org/abs/2410.02691)

Beyond the Token: Rethinking How We Model Human Reading with AI

Language models (LMs) like GPT-4 or Llama have revolutionized natural language processing, but they have also become indispensable tools for a completely different field: Computational Psycholinguistics. Researchers use these models to test theories about how the human brain processes language. The dominant theory in this space is Surprisal Theory, which posits that the difficulty of processing a word is proportional to how “surprised” the brain is to see it. If a language model assigns a low probability to a word, it has high surprisal, and—theory holds—a human will take longer to read it. ...
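To make the tokenization issue concrete, here is a toy sketch (with made-up log-probabilities) of two ways a word's surprisal can be computed from subword tokens: the common practice of scoring one canonical segmentation versus marginalizing over alternative segmentations, which is the direction the paper's title points at:

```python
import math

def word_surprisal_canonical(logprobs):
    """Common practice: a word's surprisal is the sum of its subword tokens'
    surprisals under the single canonical tokenization (chain rule in log space).
    `logprobs` are hypothetical per-token log-probabilities from a language model."""
    return -sum(logprobs)

def word_surprisal_marginalized(tokenization_logprobs):
    """Alternative: marginalize over *all* candidate tokenizations of the word by
    summing their probabilities before taking the negative log.
    `tokenization_logprobs` maps each candidate tokenization to its total log-probability."""
    total_prob = sum(math.exp(lp) for lp in tokenization_logprobs.values())
    return -math.log(total_prob)

# Toy numbers (made up): two ways the same word could be segmented.
canonical_logprobs = [-2.1, -0.7]                        # e.g. ["sur", "prisal"]
all_tokenizations = {"sur+prisal": -2.8, "surpris+al": -5.0}

print(word_surprisal_canonical(canonical_logprobs))      # ≈ 2.8
print(word_surprisal_marginalized(all_tokenizations))    # ≈ 2.70, slightly lower
```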

2024-10 · 9 min · 1860 words
[On the Influence of Gender and Race in Romantic Relationship Prediction from Large Language Models 🔗](https://arxiv.org/abs/2410.03996)

What's in a Name? How LLMs Reveal Heteronormative and Racial Biases in Relationship Prediction

“What’s in a name?” is a question that has echoed through literature for centuries. In the context of human interaction, names often carry signals about gender, race, and ethnicity—signals that humans use, sometimes subconsciously, to make assumptions about the people behind the names. As Large Language Models (LLMs) become increasingly integrated into social computing tasks, a critical question arises: do these models mirror our societal biases when interpreting these signals? ...

2024-10 · 8 min · 1498 words
[On the In-context Generation of Language Models 🔗](https://aclanthology.org/2024.emnlp-main.568.pdf)

Decoding In-Context Generation: How LLMs Learn to Create Novel Patterns

If you have played with Large Language Models (LLMs) like GPT-4 or Llama, you are intimately familiar with their ability to follow patterns. You provide a few examples—say, a list of movie titles followed by emojis—and the model picks up on the vibe, generating new examples that fit the pattern perfectly. This phenomenon is often grouped under In-Context Learning (ICL). But there is a nuance here that often goes overlooked. LLMs don’t just classify things (learning labels); they can generate complex, structured sequences that continue a specific “topic” or format defined in your prompt. The researchers behind the paper “On the In-context Generation of Language Models” call this In-Context Generation (ICG). ...
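As a concrete (made-up) illustration of the kind of prompt the post is describing, here is the movie-titles-to-emojis pattern written out; the model's job is simply to continue it in the same format:

```python
# A minimal illustration of an in-context generation prompt: a few
# (movie title -> emoji) demonstrations define a pattern, then a new
# title is left open for the model to continue.
demonstrations = [
    ("Jaws", "🦈"),
    ("Titanic", "🚢💔"),
    ("The Matrix", "🕶️💊"),
]
prompt = "\n".join(f"{title} -> {emoji}" for title, emoji in demonstrations)
prompt += "\nFinding Nemo -> "
print(prompt)  # the LLM is asked to generate the continuation of this pattern
```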

9 min · 1728 words
[On the Fragility of Active Learners for Text Classification 🔗](https://arxiv.org/abs/2403.15744)

Is Active Learning Actually Worth It? A Reality Check for Text Classification

If you have ever worked on a supervised machine learning project in a professional setting, you have likely encountered the labeling bottleneck. You have access to a massive amount of raw text data—customer reviews, medical abstracts, or news articles—but your budget for human annotation is painfully small. You simply cannot afford to label 100,000 examples. Enter Active Learning (AL). The promise of Active Learning is seductive. Instead of labeling random data points, the algorithm acts like a smart student, explicitly asking the teacher (the human annotator) to label only the most confusing or informative examples. The theory is that by labeling the “right” data, you can reach high model accuracy with a fraction of the budget. ...
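For readers new to the idea, a single round of the classic pool-based setup looks roughly like this (a generic least-confidence sketch, assuming a fitted scikit-learn-style text classification pipeline; not the specific acquisition strategies benchmarked in the paper):

```python
import numpy as np

def uncertainty_sampling_round(model, unlabeled_texts, budget=10):
    """One round of pool-based active learning with least-confidence sampling:
    score every unlabeled example by how unsure the model is, then send the
    `budget` most uncertain ones to the human annotator."""
    probs = model.predict_proba(unlabeled_texts)        # shape: (n_examples, n_classes)
    confidence = probs.max(axis=1)                      # probability of the predicted class
    most_uncertain = np.argsort(confidence)[:budget]    # lowest confidence first
    return [unlabeled_texts[i] for i in most_uncertain]
```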

2024-03 · 10 min · 1975 words
[On Training Data Influence of GPT Models 🔗](https://arxiv.org/abs/2404.07840)

Unlocking the Black Box: How Specific Training Data Shapes GPT Performance

The capabilities of Large Language Models (LLMs) like GPT-4, Llama, and Mistral have exploded in recent years. We marvel at their ability to write code, summarize diverse texts, and answer complex questions. Yet, for all their power, the training process remains largely a “black box.” We know that we feed these models massive datasets—trillions of tokens—and they emerge as intelligent agents. But if a model is particularly good at summarizing legal documents, which specific training examples caused that skill to emerge? Conversely, if a model hallucinates facts about history, which bad data points are to blame? ...

2024-04 · 8 min · 1522 words
[On Sensitivity of Learning with Limited Labelled Data to the Effects of Randomness: Impact of Interactions and Systematic Choices 🔗](https://arxiv.org/abs/2402.12817)

The Butterfly Effect in NLP: Disentangling Randomness in Few-Shot Learning

In the world of Machine Learning, particularly Natural Language Processing (NLP), we often chase the highest accuracy score on a benchmark. But there is a ghost in the machine: randomness. Imagine you are training a model with very limited data—perhaps a few-shot classification task. You run the experiment and get an F1 score of 85%. You are ecstatic. But then, you change the “random seed”—a simple integer that controls how data is shuffled or how weights are initialized—and run it again. This time, the score drops to 60%. ...

2024-02 · 9 min · 1767 words
[On Mitigating Performance Disparities in Multilingual Speech Recognition 🔗](https://aclanthology.org/2024.emnlp-main.323.pdf)

Having Your Cake and Eating It Too: Balancing Accuracy and Fairness in ASR with Adapter Fusion

Imagine using a voice assistant that understands your brother perfectly but struggles to comprehend a single sentence you say. For millions of users, this isn’t a hypothetical scenario—it is the reality of interacting with modern AI. Automatic Speech Recognition (ASR) systems have become ubiquitous, powering everything from virtual assistants like Siri and Alexa to automated customer service lines and dictation software. However, despite their widespread adoption, these systems often suffer from significant performance disparities. They might work flawlessly for male speakers of English but struggle with female speakers or speakers of lower-resource languages. ...

9 min · 1816 words
[On Fake News Detection with LLM Enhanced Semantics Mining 🔗](https://aclanthology.org/2024.emnlp-main.31.pdf)

Can LLMs Catch Fake News? Why Semantics Matter More Than Style

In the digital age, the rapid dissemination of information is a double-edged sword. While we have instant access to news, we are also bombarded by misinformation. Detecting fake news has become one of the most critical challenges in computer science and social media analysis. For a long time, researchers relied heavily on social context—who retweeted whom, how quickly a post spread, and user comments—to identify fake news. But this approach faces two glaring problems: privacy restrictions and the need for early detection. Often, social context data is unavailable, incomplete, or arrives too late. We need methods that can look at the content of the news itself and determine its authenticity. ...

8 min · 1640 words
[On Eliciting Syntax from Language Models via Hashing 🔗](https://arxiv.org/abs/2410.04074)

Hacking Syntax: How Hashing and Contrastive Learning Reveal Grammar in LLMs

If you have ever played with Word2Vec or early language models, you are likely familiar with the famous algebraic miracle of NLP: King - Man + Woman = Queen. This vector arithmetic suggested that language models (LMs) don’t just memorize text; they implicitly learn semantic and syntactic structures. However, extracting that structure explicitly—drawing the actual grammar tree of a sentence—usually requires supervised training on expensive, hand-labeled datasets like the Penn Treebank. But what if we could get a pre-trained Large Language Model (LLM) to confess its grammatical knowledge without ever showing it a labeled parse tree? ...
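Here is a tiny, self-contained illustration of that analogy arithmetic with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions; this only shows the mechanics):

```python
import numpy as np

def most_similar(query_vec, vocab_vecs, exclude=()):
    """Return the vocabulary word whose embedding has the highest cosine
    similarity to `query_vec`, skipping any excluded words."""
    best_word, best_sim = None, -1.0
    for word, vec in vocab_vecs.items():
        if word in exclude:
            continue
        sim = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Toy 3-d embeddings (made up) that roughly encode "royalty" and "gender" directions.
vocab_vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.05, 0.2, 0.3]),   # distractor word
}
query = vocab_vecs["king"] - vocab_vecs["man"] + vocab_vecs["woman"]
print(most_similar(query, vocab_vecs, exclude=("king", "man", "woman")))  # "queen"
```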

2024-10 · 10 min · 2051 words
[On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning 🔗](https://arxiv.org/abs/2406.11823)

Breaking the Resolution Curse: How ELVA Makes Vision-Language Models Efficient

Introduction: The Cost of Seeing Clearly. In the rapidly evolving world of Artificial Intelligence, Multimodal Large Language Models (MLLMs)—models that can see and talk—have become the new frontier. Systems like GPT-4V have demonstrated incredible capabilities, describing complex scenes and answering questions about images. However, a significant bottleneck remains: efficiency. For a model to understand text inside an image (like reading a receipt or analyzing a chart), it typically needs high-resolution inputs. High resolution means dividing the image into thousands of small patches (tokens). For a standard Transformer architecture, more tokens result in quadratically higher computational costs. This creates a barrier for real-world applications where latency and memory are limited. ...
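A quick back-of-the-envelope calculation shows why resolution is so expensive (generic Vision Transformer arithmetic with an assumed 14-pixel patch size, not figures from the paper):

```latex
% Self-attention over n image tokens costs O(n^2) in time and memory.
% With 14x14-pixel patches (a common ViT choice, assumed here for illustration):
%   448 x 448 image  -> (448/14)^2 = 32^2 = 1024 tokens
%   896 x 896 image  -> (896/14)^2 = 64^2 = 4096 tokens
% Doubling the resolution multiplies the token count by 4 and the
% pairwise attention cost by roughly 16:
\frac{4096^2}{1024^2} = 16
```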

2024-06 · 7 min · 1350 words
[OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer 🔗](https://arxiv.org/abs/2406.16620)

How OmAgent Solves the "Long Video" Problem with Divide-and-Conquer and Recursive Rewinding

Imagine you are trying to find a specific detail in a 24-hour CCTV recording or a dense three-hour film—perhaps the exact moment a character dropped a cigarette or the license plate of a car that appeared for two seconds. As a human, you wouldn’t memorize every pixel of the video. Instead, you would watch it, form a general impression of the plot, and when asked a specific question, you would scrub through the timeline, “rewinding” to the relevant section to inspect the details. ...

2024-06 · 9 min · 1833 words
[Oddballs and Misfits: Detecting Implicit Abuse in Which Identity Groups are Depicted as Deviating from the Norm 🔗](https://aclanthology.org/2024.emnlp-main.132.pdf)

When "Normal" is a Weapon: Detecting Implicit Abuse and Norm Deviation in NLP

In the early days of content moderation, detecting abusive language was largely a game of keyword matching. If a comment contained a racial slur, a curse word, or an explicit threat, it was flagged. But as Natural Language Processing (NLP) has advanced, so too has the subtlety of online abuse. Consider the difference between these two sentences: “You are a stupid idiot.” “Gays sprinkle flour over their gardens for good luck.” The first is explicitly abusive; it uses clear, negative vocabulary. The second sentence, however, is perplexing. It contains no slurs. It contains no angry words. Grammatically, it is a neutral, declarative statement. Yet, if you encountered this sentence on social media, you would likely recognize it as a form of abuse. It is painting a specific identity group as “weird,” “other,” or fundamentally different from the rest of society. ...

10 min · 2002 words
[ORPO: Monolithic Preference Optimization without Reference Model 🔗](https://arxiv.org/abs/2403.07691)

One Step to Align Them All: Understanding ORPO

Large Language Models (LLMs) are impressive, but raw pre-trained models are like unpolished gems. They can predict the next token, but they often struggle to follow instructions or adhere to human safety standards. To fix this, we typically rely on a multi-stage training pipeline: Pre-training, Supervised Fine-Tuning (SFT), and finally, Preference Alignment (using methods like RLHF or DPO). While effective, this pipeline is complex, resource-intensive, and brittle. In this post, we are diving deep into a paper from KAIST AI that challenges this status quo. The paper, “ORPO: Monolithic Preference Optimization without Reference Model,” introduces a method to merge Supervised Fine-Tuning and Preference Alignment into a single, efficient process. If you are a student of NLP or machine learning, understanding ORPO offers a fascinating look into how we can make model alignment cheaper, faster, and surprisingly, more effective. ...
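For a sense of how the merge works, the objective couples the usual SFT loss on the preferred response with an odds-ratio penalty on the rejected one (a sketch paraphrased from the paper; notation and details may differ):

```latex
% Sketch of ORPO's single-stage objective. For a prompt x with a chosen
% response y_w and a rejected response y_l:
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}

\mathcal{L}_{OR} = -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right)

\mathcal{L}_{ORPO} = \mathcal{L}_{SFT}(y_w \mid x) + \lambda \, \mathcal{L}_{OR}

% L_SFT is the standard next-token negative log-likelihood on the chosen
% response, so no reference model and no separate alignment stage are needed.
```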

2024-03 · 8 min · 1575 words
[OATH-Frames: Characterizing Online Attitudes Towards Homelessness with LLM Assistants 🔗](https://arxiv.org/abs/2406.14883)

Beyond Sentiment: How LLMs Help Uncover the Nuances of Online Discourse on Homelessness

Social media platforms have become the de facto town squares of the 21st century. They are repositories of public opinion, offering researchers a massive dataset on how society feels about critical issues. However, for social scientists, this scale presents a paradox: the data is abundant, but understanding it at a granular level is incredibly difficult. Take the issue of homelessness in the United States. It is a complex, sensitive topic that evokes a wide spectrum of emotions—from sympathy and calls for aid to anger and resentment. Traditional Natural Language Processing (NLP) tools, like sentiment analysis (positive vs. negative) or toxicity detection, are often too blunt for this job. A tweet criticizing the government’s housing policy might be “negative,” but it’s not necessarily “toxic.” Conversely, a tweet making a subtle, harmful stereotype about people experiencing homelessness (PEH) might slip past a toxicity filter entirely. ...

2024-06 · 6 min · 1264 words
[NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning 🔗](https://arxiv.org/abs/2404.00459)

Why LLMs Can't Count: Fixing Numerical Reasoning with NumeroLogic

It is one of the great ironies of modern Artificial Intelligence: a Large Language Model (LLM) like GPT-4 can write a sonnet in the style of Shakespeare, debug complex Python code, and pass the Bar exam, yet it often stumbles when asked to multiply two three-digit numbers. For students and researchers exploring the architecture of Transformers, this behavior can be baffling. Computers are, at their core, calculators. Why is the most advanced “computer brain” we’ve ever built so bad at basic arithmetic? ...

2024-04 · 8 min · 1670 words
[Null-Shot Prompting: Rethinking Prompting Large Language Models With Hallucination 🔗](https://aclanthology.org/2024.emnlp-main.740.pdf)

The Pinocchio Strategy: Boosting LLM Performance by Encouraging Hallucination

In the world of Large Language Models (LLMs), “hallucination” is usually a dirty word. It refers to the moment an AI confidently asserts that the moon is made of green cheese or invents a historical event that never happened. Researchers spend millions of dollars and countless hours trying to stop models from hallucinating. But what if hallucination isn’t just a bug? What if it’s a feature that, when manipulated correctly, can actually make a model smarter? ...

8 min · 1610 words
[NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data 🔗](https://arxiv.org/abs/2402.15343)

How to Train a Tiny NER Model to Rival LLMs: Inside NuNER

Named Entity Recognition (NER) is one of the bread-and-butter tasks of Natural Language Processing. Whether it is extracting stock tickers from financial news, identifying proteins in biomedical papers, or parsing dates from legal contracts, NER is everywhere. For years, the standard workflow for building a custom NER model has been rigid: take a pre-trained foundation model like BERT or RoBERTa, hire humans to annotate thousands of examples for your specific entities, and fine-tune the model. This process is slow, expensive, and inflexible. ...

2024-02 · 8 min · 1505 words