[Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs 🔗](https://arxiv.org/abs/2406.11695)

Beyond Manual Prompting: Automating Multi-Stage LM Programs with MIPRO

As Language Models (LMs) evolve, we are moving past simple, single-turn “chat” interfaces. Today, the cutting edge of NLP involves Language Model Programs: sophisticated pipelines where multiple LM calls are chained together to solve complex tasks. Imagine a system that retrieves information from Wikipedia, summarizes it, reasons about that summary, and then formulates a final answer. Each step is a distinct “module” requiring its own prompt. However, this complexity introduces a massive bottleneck. How do you design prompts for a pipeline with five different stages? If the final answer is wrong, which prompt is to blame? Tuning these systems by hand (“prompt engineering”) is a tedious, trial-and-error process that becomes combinatorially intractable as the number of modules grows. ...
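
To make the shape of such a pipeline concrete, here is a minimal sketch of a multi-stage LM program in plain Python. The helpers `call_llm` and `search_wikipedia` are placeholders I am assuming for illustration, and the prompts are made up; they are not the modules or prompts MIPRO actually optimizes.

```python
# Minimal sketch of a multi-stage LM program (placeholders, not the paper's API).
# Each stage is its own "module" with its own prompt; an optimizer like MIPRO
# would tune these prompts jointly against an end-to-end metric.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to your LLM provider."""
    raise NotImplementedError

def search_wikipedia(query: str) -> str:
    """Placeholder for a retrieval backend returning passage text."""
    raise NotImplementedError

def answer_question(question: str) -> str:
    passages = search_wikipedia(question)                                              # stage 1: retrieve
    summary = call_llm(f"Summarize these passages for '{question}':\n{passages}")      # stage 2: summarize
    rationale = call_llm(f"Reason step by step.\nQuestion: {question}\nNotes: {summary}")  # stage 3: reason
    return call_llm(f"Question: {question}\nReasoning: {rationale}\nFinal answer:")    # stage 4: answer
```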

2024-06 · 7 min · 1469 words
[Optimizing Code Retrieval: High-Quality and Scalable Dataset Annotation through Large Language Models 🔗](https://aclanthology.org/2024.emnlp-main.123.pdf)

Query4Code: Teaching LLMs to Annotate Code for Better Search Engines

If you are a software developer, your browser history is likely filled with searches like “how to reverse a list in Python” or “pandas dataframe drop duplicates.” This process—finding the right code snippet based on a natural language description—is known as Code Retrieval. For AI models to get good at this, they need massive amounts of training data consisting of pairs: a user query (Natural Language) and the corresponding code snippet (Programming Language). But where does this data come from? ...
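
As a rough illustration of what one such training pair looks like (this example is invented for this post, not drawn from the paper's dataset):

```python
# A single (natural-language query, code snippet) pair of the kind a code
# retriever is trained on. Example made up for illustration.
training_pair = {
    "query": "how to reverse a list in Python",
    "code": "items = [1, 2, 3]\nreversed_items = items[::-1]",
}
```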

8 min · 1603 words
[Optimizing Chinese Lexical Simplification Across Word Types: A Hybrid Approach 🔗](https://aclanthology.org/2024.emnlp-main.849.pdf)

Can Small Models Beat GPT-4? A Hybrid Approach to Chinese Lexical Simplification

Have you ever read a sentence that felt like a brick wall because of a single, obscure word? In English, you might stumble over “esoteric” and wish the writer had just used “mysterious.” In Chinese, the challenge is often compounded by Chengyu (four-character idioms) or rapidly evolving internet slang. This process of making text easier to read by swapping difficult words for simpler equivalents is called Lexical Simplification (LS). It’s a vital tool for language learners, children, and people with cognitive impairments. ...

4 min · 1664 words
[Optimized Speculative Sampling for GPU Hardware Accelerators 🔗](https://arxiv.org/abs/2406.11016)

Breaking the Memory Wall: Optimizing Speculative Sampling on GPUs

If you have ever stared at a blinking cursor waiting for a Large Language Model (LLM) to finish a paragraph, you have experienced the inherent bottleneck of autoregressive generation. These models generate text one token at a time, and for every single token, the model must move massive amounts of parameters from memory to the compute units. In the world of GPU acceleration, this is known as being “memory bandwidth bound.” The speed of generation isn’t determined by how fast the GPU can do math, but by how fast it can move data. ...
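
A back-of-envelope calculation shows why. Assuming a 7B-parameter model in fp16 and roughly 2 TB/s of HBM bandwidth (illustrative numbers of mine, not figures from the paper), the weight traffic alone caps single-stream decoding speed:

```python
# Rough upper bound on decode speed when every token requires streaming all
# weights from GPU memory (illustrative numbers, not from the paper).
params = 7e9                  # assumed 7B-parameter model
bytes_per_param = 2           # fp16 / bf16 weights
hbm_bandwidth = 2.0e12        # assumed ~2 TB/s of GPU memory bandwidth

bytes_per_token = params * bytes_per_param
max_tokens_per_sec = hbm_bandwidth / bytes_per_token
print(f"~{max_tokens_per_sec:.0f} tokens/s, no matter how many FLOPs the GPU has")
```

Speculative sampling attacks exactly this bound: by verifying several drafted tokens in a single pass over the weights, it amortizes the cost of moving the model through memory.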

2024-06 · 9 min · 1723 words
[OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation 🔗](https://arxiv.org/abs/2409.19270)

Hearing the Unseen: How OpenSep Uses LLMs to Automate Audio Separation

Imagine standing in the middle of a bustling city street. You hear a cacophony of sounds: a car honking, a child yelling, footsteps on the pavement, and perhaps a distant siren. As a human, your brain performs a miraculous feat called the “cocktail party effect”—you can focus on the child yelling while tuning out the car horn. You can isolate specific sounds from a complex mixture almost instantly. For machines, however, this task—known as audio source separation—is notoriously difficult. While deep learning has made strides in separating specific sounds (like vocals from a music track), the “open world” presents a much harder challenge. Real-world audio mixtures contain a variable number of sources, many of which a model may never have encountered during training. ...

2024-09 · 9 min · 1868 words
[Open-world Multi-label Text Classification with Extremely Weak Supervision 🔗](https://arxiv.org/abs/2407.05609)

Mapping the Unknown: How X-MLClass Solves Multi-Label Classification Without Labels

Imagine you are handed a massive library of books. Your job is to organize them into categories. But there’s a catch: you don’t know what the categories are yet, and many books belong to multiple categories simultaneously (e.g., a book could be about “History,” “War,” and “Biography”). You have no list of genres, no Dewey Decimal System, and no labeled examples. All you have is a vague instruction: “Organize these by topic.” ...

2024-07 · 9 min · 1754 words
[Ontologically Faithful Generation of Non-Player Character Dialogues 🔗](https://arxiv.org/abs/2212.10618)

Can AI Write Video Game Characters? Inside the KNUDGE Dataset and Dialogue Generation

If you have ever played a massive open-world Role-Playing Game (RPG) like Skyrim, The Witcher, or The Outer Worlds, you know that the immersion depends heavily on the people you meet. Non-Player Characters (NPCs) are the lifeblood of these worlds. They give you quests, explain the history of the land, and react to your decisions. But creating these interactions is incredibly expensive. A Triple-A game might have thousands of NPCs, each requiring unique dialogue trees that must remain consistent with the game’s lore and the current state of the player’s quest. It is a logistical nightmare that costs millions of dollars and years of writing time. ...

2022-12 · 8 min · 1700 words
[OneNet: A Fine-Tuning Free Framework for Few-Shot Entity Linking via Large Language Model Prompting 🔗](https://arxiv.org/abs/2410.07549)

How OneNet Solves Entity Linking with LLMs Without Fine-Tuning

Imagine reading a news headline: “Jordan played a magnificent game last night.” As a human, you immediately look for context. Are we talking about Michael Jordan, the basketball legend? Jordan, the country in the Middle East? Or perhaps Jordan, a local high school player? This process of mapping a mention in text (like “Jordan”) to a specific, unique identity in a knowledge base (like a Wikipedia page) is called Entity Linking (EL). ...

2024-10 · 9 min · 1746 words
[ONE2SET + Large Language Model: Best Partners for Keyphrase Generation 🔗](https://arxiv.org/abs/2410.03421)

The Dynamic Duo: How Combining Set Generation with LLMs Solves Keyphrase Prediction

In the vast ocean of digital information, finding exactly what you need often relies on a very small set of words: Keyphrases. Keyphrase Generation (KPG) is the task of automatically reading a document and producing a concise list of phrases that capture its core concepts. Ideally, these keyphrases act as an index, helping with information retrieval, text summarization, and categorization. However, KPG is deceptively difficult. It requires two competing capabilities: ...

2024-10 · 13 min · 2563 words
[One-to-Many Communication and Compositionality in Emergent Communication 🔗](https://aclanthology.org/2024.emnlp-main.1157.pdf)

Beyond Private Chats: How Broadcasting Shapes the Evolution of Language

Language is rarely a private affair. When a sergeant shouts a command to a squad, or an advertiser broadcasts a commercial to millions, a single message must be understood by multiple people simultaneously. Yet, in the world of Artificial Intelligence and “Emergent Communication,” we have mostly studied language as a one-on-one game: one speaker, one listener. If we want AI agents to develop languages that look and behave like human language, we need to replicate the pressures under which human language evolved. One of the most critical properties of human language is compositionality—the ability to combine simple units (like words) according to rules (grammar) to create complex meanings. “Blue square” is compositional; a unique, unrelated sound for every possible colored shape is not. ...

8 min · 1582 words
[One Thousand and One Pairs: A 'novel' challenge for long-context language models 🔗](https://arxiv.org/abs/2406.16264)

Beyond the Haystack: Why Long-Context LLMs Struggle to Read a Novel

In the rapid evolution of Large Language Models (LLMs), one metric has become a major bragging right: the context window. We have moved from models that could remember a few paragraphs to behemoths like Gemini 1.5 Pro and GPT-4o, which claim to process hundreds of thousands, if not millions, of tokens at once. In theory, you can now feed an entire novel into an AI and ask questions about it. ...

2024-06 · 7 min · 1480 words
[On the Universal Truthfulness Hyperplane Inside LLMs 🔗](https://arxiv.org/abs/2407.08582)

Inside the Mind of an LLM: Hunting for the Universal Truthfulness Hyperplane

Large Language Models (LLMs) like GPT-4 and LLaMA-2 have revolutionized how we interact with information. They can write code, summarize novels, and answer complex queries. Yet, they have a notorious flaw: hallucination. An LLM can confidently state that the Eiffel Tower is in Berlin or invent court cases that never happened. This raises a fascinating, almost philosophical question for AI researchers: Does the model know it is lying? When an LLM outputs a falsehood, is it because the model truly believes that information is correct? Or does the model contain the correct information deep within its internal representations, but somehow fails to output it? ...

2024-07 · 8 min · 1562 words
[On the Role of Context in Reading Time Prediction 🔗](https://arxiv.org/abs/2409.08160)

Is Context Overrated? Rethinking Surprisal Theory in Reading Time Prediction

If you have ever caught yourself finishing someone else’s sentence, you intuitively understand that language processing is predictive. When we read or listen, we don’t just passively receive words; our brains actively anticipate what comes next based on the context. In the field of psycholinguistics, this phenomenon is formalized as Surprisal Theory. The core tenet is simple yet powerful: the processing effort required for a word (often measured by how long our eyes linger on it) is proportional to its “surprisal”—or how unexpected it is given the preceding context. A highly predictable word is processed quickly; a surprising word causes a stutter in our cognitive flow, leading to longer reading times. ...
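
For reference, the standard definition of surprisal used throughout this literature (notation mine, not copied from the paper):

```latex
% Surprisal of word w_t given its preceding context w_{<t}:
s(w_t) = -\log p(w_t \mid w_{<t})
```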

2024-09 · 9 min · 1745 words
[On the Robustness of Editing Large Language Models 🔗](https://arxiv.org/abs/2402.05827)

The Fragile Memory of AI: Why Editing LLMs is Harder Than It Looks

Imagine you are training a new employee. You tell them, “The project manager is no longer Alice; it’s now Bob.” A human employee immediately updates their mental model. They won’t accidentally call Alice the manager during a lunch break, nor will they get confused if you ask, “Who is the person in charge of the project?” using slightly different phrasing. Now, consider Large Language Models (LLMs). We often view them as static repositories of information trained on massive datasets. But facts change. Prime ministers resign, companies rebrand, and scientific theories evolve. Retraining an entire multi-billion-parameter model for every minor update is prohibitively expensive. ...

2024-02 · 8 min · 1685 words
[On the Relationship between Truth and Political Bias in Language Models 🔗](https://arxiv.org/abs/2409.05283)

The Truth Paradox: Why Teaching AI to Be Honest Might Make It Partisan

In the world of Artificial Intelligence, researchers are constantly chasing the “Holy Grail” of alignment. We want Large Language Models (LLMs) like ChatGPT or Claude to possess three core attributes: we want them to be helpful, we want them to be harmless, and we want them to be truthful. On the surface, these seem like complementary goals. A truthful assistant is surely a helpful one, right? However, a fascinating new research paper from the MIT Center for Constructive Communication and the MIT Media Lab suggests that these objectives might actually be in tension with one another. Specifically, the researchers investigate a startling correlation: optimizing a model for truthfulness seems to inadvertently pull it toward a left-leaning political bias. ...

2024-09 · 8 min · 1621 words
[Marginalizing Out Tokenization in Surprisal-Based Psycholinguistic Predictive Modeling 🔗](https://arxiv.org/abs/2410.02691)

Beyond the Token: Rethinking How We Model Human Reading with AI

Language models (LMs) like GPT-4 or Llama have revolutionized natural language processing, but they have also become indispensable tools for a completely different field: Computational Psycholinguistics. Researchers use these models to test theories about how the human brain processes language. The dominant theory in this space is Surprisal Theory, which posits that the difficulty of processing a word is proportional to how “surprised” the brain is to see it. If a language model assigns a low probability to a word, it has high surprisal, and—theory holds—a human will take longer to read it. ...
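
The wrinkle the paper targets is that LMs assign probabilities to token sequences, not to words directly, so scoring a word string properly means summing over every tokenization that decodes to it rather than only the canonical one. In notation (mine, hedged from the paper's framing):

```latex
% Marginal probability of a character string w: sum over all token sequences t
% whose decoding equals w (common practice instead keeps only the canonical t).
p(\mathbf{w}) = \sum_{\mathbf{t}\,:\,\mathrm{decode}(\mathbf{t}) = \mathbf{w}} p(\mathbf{t})
```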

2024-10 · 9 min · 1860 words
[On the Influence of Gender and Race in Romantic Relationship Prediction from Large Language Models 🔗](https://arxiv.org/abs/2410.03996)

What's in a Name? How LLMs Reveal Heteronormative and Racial Biases in Relationship Prediction

“What’s in a name?” is a question that has echoed through literature for centuries. In the context of human interaction, names often carry signals about gender, race, and ethnicity—signals that humans use, sometimes subconsciously, to make assumptions about the people behind the names. As Large Language Models (LLMs) become increasingly integrated into social computing tasks, a critical question arises: do these models mirror our societal biases when interpreting these signals? ...

2024-10 · 8 min · 1498 words
[On the In-context Generation of Language Models 🔗](https://aclanthology.org/2024.emnlp-main.568.pdf)

Decoding In-Context Generation: How LLMs Learn to Create Novel Patterns

If you have played with Large Language Models (LLMs) like GPT-4 or Llama, you are intimately familiar with their ability to follow patterns. You provide a few examples—say, a list of movie titles followed by emojis—and the model picks up on the vibe, generating new examples that fit the pattern perfectly. This phenomenon is often grouped under In-Context Learning (ICL). But there is a nuance here that often goes overlooked. LLMs don’t just classify things (learning labels); they can generate complex, structured sequences that continue a specific “topic” or format defined in your prompt. The researchers behind the paper “On the In-context Generation of Language Models” call this In-Context Generation (ICG). ...
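
A toy prompt makes the distinction concrete (example invented here, not taken from the paper):

```python
# Few-shot prompt whose continuation requires *generating* a novel, pattern-
# consistent sequence rather than choosing from a fixed label set.
prompt = """Movie -> Emojis
Titanic -> 🚢💔🌊
Jaws -> 🦈🌊😱
The Matrix ->"""
# Completing this prompt is in-context generation (ICG): the output format and
# "topic" are defined entirely by the examples inside the prompt.
```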

9 min · 1728 words
[On the Fragility of Active Learners for Text Classification 🔗](https://arxiv.org/abs/2403.15744)

Is Active Learning Actually Worth It? A Reality Check for Text Classification

If you have ever worked on a supervised machine learning project in a professional setting, you have likely encountered the labeling bottleneck. You have access to a massive amount of raw text data—customer reviews, medical abstracts, or news articles—but your budget for human annotation is painfully small. You simply cannot afford to label 100,000 examples. Enter Active Learning (AL). The promise of Active Learning is seductive. Instead of labeling random data points, the algorithm acts like a smart student, explicitly asking the teacher (the human annotator) to label only the most confusing or informative examples. The theory is that by labeling the “right” data, you can reach high model accuracy with a fraction of the budget. ...
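
To make “asking for the most confusing examples” concrete, here is a minimal margin-based uncertainty sampling step, one of the standard query strategies such evaluations consider (a generic sketch of mine, not the paper's exact setup):

```python
import numpy as np

def pick_most_uncertain(probs: np.ndarray, budget: int) -> np.ndarray:
    """Select `budget` unlabeled examples with the smallest top-2 margin.

    probs: (n_unlabeled, n_classes) class probabilities from the current model.
    """
    top2 = np.sort(probs, axis=1)[:, -2:]   # two highest probabilities per row
    margin = top2[:, 1] - top2[:, 0]        # small margin = model is torn between classes
    return np.argsort(margin)[:budget]

# One AL round: query = pick_most_uncertain(model.predict_proba(X_pool), budget=50),
# send those examples to the annotator, add them to the labeled set, and retrain.
```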

2024-03 · 10 min · 1975 words
[On Training Data Influence of GPT Models 🔗](https://arxiv.org/abs/2404.07840)

Unlocking the Black Box: How Specific Training Data Shapes GPT Performance

The capabilities of Large Language Models (LLMs) like GPT-4, Llama, and Mistral have exploded in recent years. We marvel at their ability to write code, summarize diverse texts, and answer complex questions. Yet, for all their power, the training process remains largely a “black box.” We know that we feed these models massive datasets—trillions of tokens—and they emerge as intelligent agents. But if a model is particularly good at summarizing legal documents, which specific training examples caused that skill to emerge? Conversely, if a model hallucinates facts about history, which bad data points are to blame? ...

2024-04 · 8 min · 1522 words