[PARIKSHA: Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data 🔗](https://arxiv.org/abs/2406.15053)

PARIKSHA: Uncovering the Truth About Multilingual LLM Evaluation

In the rapidly evolving world of Large Language Models (LLMs), benchmarks are the compass by which we navigate progress. We look at leaderboards to see which model is “smarter,” “faster,” or “safer.” However, there is a glaring blind spot in this landscape: linguistic and cultural diversity. Most standard benchmarks are English-centric. When multilingual benchmarks do exist, they often suffer from two critical flaws. First, test set contamination: because popular benchmarks are available on the web, models often ingest the questions during training, effectively memorizing the answers. Second, lack of cultural nuance: many benchmarks are simply English questions translated into other languages, losing the local context, idioms, and cultural values that define true fluency. ...

2024-06 · 7 min · 1421 words
[PANDA: Persona Attributes Navigation for Detecting and Alleviating Overuse Problem in Large Language Models 🔗](https://aclanthology.org/2024.emnlp-main.670.pdf)

TMI! Why LLMs Share Too Much and How the PANDA Framework Fixes It

Imagine you are chatting with a new acquaintance. You mention that you enjoy reading mystery novels. A normal response might be, “Oh, I love those too! Who is your favorite author?” Now imagine the acquaintance responds: “I love reading too! I am a 35-year-old accountant living in Chicago. I have three cats named Mittens, Oreo, and Luna. I suffer from anxiety and I go to the gym every Tuesday at 6 PM.” ...

9 min · 1723 words
[PALM: Few-Shot Prompt Learning for Audio Language Models 🔗](https://arxiv.org/abs/2409.19806)

Beyond Hand-Crafted Prompts: Optimizing Audio-Language Models with PALM

In the rapidly evolving world of Artificial Intelligence, multimodal models—systems that can understand and process multiple types of data like text, images, and audio—are breaking new ground. Just as Vision-Language Models (VLMs) like CLIP revolutionized computer vision by connecting images to natural language, Audio-Language Models (ALMs) are doing the same for sound. These models allow for Zero-Shot Audio Recognition. Imagine playing a sound clip of a dog barking to an AI model that has never been explicitly trained to classify “dog barks.” Instead, you simply provide the text “A recording of a dog,” and the model matches the audio features to the text features, correctly identifying the sound. ...
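The CLIP-style matching the excerpt describes can be sketched as a similarity search over embeddings: encode the audio, encode a text prompt per candidate class, and pick the prompt whose embedding is closest. A minimal sketch with toy vectors standing in for real encoder outputs (the function names and embeddings here are illustrative assumptions, not the PALM API):

```python
import math

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(audio_emb, prompts, text_embs):
    # Score the audio embedding against every class-prompt embedding
    # and return the best-matching prompt.
    scores = [cosine_sim(audio_emb, t) for t in text_embs]
    return prompts[max(range(len(scores)), key=scores.__getitem__)]

# Toy embeddings standing in for real audio/text encoder outputs.
audio_emb = [0.9, 0.1, 0.0]
prompts = ["A recording of a dog", "A recording of a siren"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(zero_shot_classify(audio_emb, prompts, text_embs))  # → "A recording of a dog"
```

No class is ever trained explicitly; swapping in new prompt strings (and their embeddings) is all it takes to recognize new sounds.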

2024-09 · 7 min · 1373 words
[Overcome Noise and Bias: Segmentation-Aided Multi-Granularity Denoising and Debiasing for Enhanced Quadruples Extraction in Dialogue 🔗](https://aclanthology.org/2024.emnlp-main.49.pdf)

Taming the Chaos: How to Extract Sentiment Quadruples from Messy Dialogues without Noise and Bias

Sentiment analysis has come a long way from simply classifying a movie review as “positive” or “negative.” In the era of granular data analytics, we are interested in Aspect-Based Sentiment Analysis (ABSA). We don’t just want to know if a user is happy; we want to know what they are happy about, which specific feature they like, and what opinion words they used. This brings us to the Sentiment Quadruple: A structured set of four elements: ...

9 min · 1770 words
[Outcome-Constrained Large Language Models for Countering Hate Speech 🔗](https://arxiv.org/abs/2403.17146)

Beyond Politeness—Teaching AI to De-escalate Hate Speech

If you have spent any time in the comment sections of social media platforms like Reddit or X (formerly Twitter), you know how quickly conversations can spiral into toxicity. Hate speech remains a persistent challenge in online communities, threatening healthy discourse and driving users away. For years, researchers have been developing automated methods to generate “counterspeech”—direct responses designed to refute or neutralize hate speech. The logic is simple: if we can automate the moderation process or assist human moderators with suggested replies, we can scale up the fight against toxicity. ...

2024-03 · 8 min · 1700 words
[Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding 🔗](https://arxiv.org/abs/2402.13720)

Ouroboros: Breaking the Speed Limit of LLMs with Phrase-Based Speculative Decoding

Large Language Models (LLMs) have revolutionized how we interact with information, but they suffer from a persistent bottleneck: latency. If you have ever watched ChatGPT type out an answer word by word, you have experienced the limitations of autoregressive decoding. Because every new token depends on the previous one, models must generate output sequentially. This process is slow and computationally inefficient, leaving expensive GPUs idle while waiting for memory access. ...
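The sequential bottleneck the excerpt describes is what speculative decoding attacks: a cheap drafter proposes several tokens ahead, and the large model verifies them, keeping the longest prefix it agrees with. A minimal greedy-verification sketch (a real implementation scores all draft positions in one batched forward pass; this toy version loops for clarity, and the toy "model" is an assumption):

```python
def speculative_step(target_next, draft_seq, context):
    # Keep the longest prefix of the draft that the target model agrees
    # with, then append the target's own next token, so even a fully
    # rejected draft still yields one token of progress.
    accepted = []
    for tok in draft_seq:
        if tok != target_next(context + accepted):
            break
        accepted.append(tok)
    accepted.append(target_next(context + accepted))
    return accepted

# Toy "target model": always continues the alphabet.
def target_next(seq):
    return chr(ord(seq[-1]) + 1) if seq else "a"

# Draft guessed "b", "c" correctly, "x" wrongly: we still gain 3 tokens.
print(speculative_step(target_next, ["b", "c", "x"], ["a"]))  # → ['b', 'c', 'd']
```

When the drafter is usually right, several tokens are committed per expensive target-model pass instead of one, which is exactly the latency win phrase-level drafting amplifies.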

2024-02 · 7 min · 1438 words
[Order of Magnitude Speedups for LLM Membership Inference 🔗](https://arxiv.org/abs/2409.14513)

Auditing LLMs for Privacy — How to Slash the Cost of Membership Inference Attacks

Large Language Models (LLMs) are trained on massive datasets scraped from the internet, often containing sensitive personal information, proprietary code, or copyrighted works. This creates a significant privacy risk: these models can “memorize” their training data. If an adversary can query an LLM and determine whether a specific document was part of its training set, they have successfully mounted a Membership Inference Attack (MIA). For organizations deploying LLMs, auditing these models for privacy leaks is crucial. However, the current “gold standard” for auditing—training “shadow models”—is prohibitively expensive. It requires training multiple copies of LLMs just to test one. ...
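The cheapest baseline for the attack the excerpt defines is loss thresholding: a model tends to assign lower loss to documents it memorized during training. A minimal sketch of that classic baseline (not the paper's speedup technique; the threshold and log-probs below are illustrative assumptions):

```python
def sequence_loss(token_logprobs):
    # Average negative log-likelihood the model assigns to a document's tokens.
    return -sum(token_logprobs) / len(token_logprobs)

def loss_threshold_mia(token_logprobs, threshold=2.0):
    # Classic membership-inference baseline: flag any document whose loss
    # falls below the threshold as a likely training-set member.
    return sequence_loss(token_logprobs) < threshold

# Toy per-token log-probs standing in for real model outputs.
memorized = [-0.1, -0.2, -0.1, -0.3]   # model is confident → low loss
unseen = [-2.5, -3.1, -2.8, -3.4]      # model is surprised → high loss
print(loss_threshold_mia(memorized), loss_threshold_mia(unseen))  # → True False
```

Shadow-model auditing calibrates that threshold per example by training reference models with and without the document, which is precisely the step that makes the gold standard so expensive.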

2024-09 · 8 min · 1676 words
[Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning 🔗](https://aclanthology.org/2024.emnlp-main.565.pdf)

Juggling Act: How 'Fast RL' Balances Conflicting Goals in LLM Training

Reinforcement Learning from Human Feedback (RLHF) is the secret sauce behind the success of modern Large Language Models (LLMs) like ChatGPT and Llama. It’s the process that turns a raw, text-predicting engine into a helpful assistant. But there is a hidden complexity in this process: we rarely want an AI to do just one thing. We want models to be helpful and harmless. We want them to be creative and factual. We want them to be concise and complete. ...

7 min · 1426 words
[Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs 🔗](https://arxiv.org/abs/2406.11695)

Beyond Manual Prompting: Automating Multi-Stage LM Programs with MIPRO

As Language Models (LMs) evolve, we are moving past simple, single-turn “chat” interfaces. Today, the cutting edge of NLP involves Language Model Programs: sophisticated pipelines where multiple LM calls are chained together to solve complex tasks. Imagine a system that retrieves information from Wikipedia, summarizes it, reasons about that summary, and then formulates a final answer. Each step is a distinct “module” requiring its own prompt. However, this complexity introduces a massive bottleneck. How do you design prompts for a pipeline with five different stages? If the final answer is wrong, which prompt was to blame? Tuning these systems by hand (“prompt engineering”) is a tedious, trial-and-error process that becomes combinatorially intractable as the number of modules grows. ...

2024-06 · 7 min · 1469 words
[Optimizing Code Retrieval: High-Quality and Scalable Dataset Annotation through Large Language Models 🔗](https://aclanthology.org/2024.emnlp-main.123.pdf)

Query4Code: Teaching LLMs to Annotate Code for Better Search Engines

If you are a software developer, your browser history is likely filled with searches like “how to reverse a list in Python” or “pandas dataframe drop duplicates.” This process—finding the right code snippet based on a natural language description—is known as Code Retrieval. For AI models to get good at this, they need massive amounts of training data consisting of pairs: a user query (Natural Language) and the corresponding code snippet (Programming Language). But where does this data come from? ...

8 min · 1603 words
[Optimizing Chinese Lexical Simplification Across Word Types: A Hybrid Approach 🔗](https://aclanthology.org/2024.emnlp-main.849.pdf)

Can Small Models Beat GPT-4? A Hybrid Approach to Chinese Lexical Simplification

Have you ever read a sentence that felt like a brick wall because of a single, obscure word? In English, you might stumble over “esoteric” and wish the writer had just used “mysterious.” In Chinese, the challenge is often compounded by Chengyu (four-character idioms) or rapidly evolving internet slang. This process of making text easier to read by swapping difficult words for simpler equivalents is called Lexical Simplification (LS). It’s a vital tool for language learners, children, and people with cognitive impairments. ...

4 min · 1664 words
[Optimized Speculative Sampling for GPU Hardware Accelerators 🔗](https://arxiv.org/abs/2406.11016)

Breaking the Memory Wall: Optimizing Speculative Sampling on GPUs

If you have ever stared at a blinking cursor waiting for a Large Language Model (LLM) to finish a paragraph, you have experienced the inherent bottleneck of autoregressive generation. These models generate text one token at a time, and for every single token, the model must move massive amounts of parameters from memory to the compute units. In the world of GPU acceleration, this is known as being “memory bandwidth bound.” The speed of generation isn’t determined by how fast the GPU can do math, but by how fast it can move data. ...
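The "memory bandwidth bound" claim admits a simple back-of-envelope ceiling: each decoded token requires streaming roughly every weight from memory once, so throughput is capped by bandwidth divided by model size in bytes. A sketch with illustrative numbers (the 7B/fp16/1000 GB/s figures are assumptions, not measurements from the paper):

```python
def decode_tokens_per_sec(params_billions, bytes_per_param, bandwidth_gb_s):
    # Upper bound on autoregressive decoding speed: every token must
    # stream (roughly) all model weights from memory, so the ceiling is
    # memory bandwidth divided by model size in bytes.
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb

# Illustrative: a 7B-parameter model in fp16 (2 bytes/param) on a GPU
# with ~1000 GB/s of memory bandwidth.
print(round(decode_tokens_per_sec(7, 2, 1000), 1))  # → 71.4 tokens/s at best
```

Note the GPU's arithmetic throughput never enters the formula; that is why speculative sampling, which verifies several tokens per weight-streaming pass, can beat this ceiling.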

2024-06 · 9 min · 1723 words
[OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation 🔗](https://arxiv.org/abs/2409.19270)

Hearing the Unseen: How OpenSep Uses LLMs to Automate Audio Separation

Imagine standing in the middle of a bustling city street. You hear a cacophony of sounds: a car honking, a child yelling, footsteps on the pavement, and perhaps a distant siren. As a human, your brain performs a miraculous feat called the “cocktail party effect”—you can focus on the child yelling while tuning out the car horn. You can isolate specific sounds from a complex mixture almost instantly. For machines, however, this task—known as audio source separation—is notoriously difficult. While deep learning has made strides in separating specific sounds (like vocals from a music track), the “open world” presents a much harder challenge. Real-world audio mixtures contain a variable number of sources, many of which a model may never have encountered during training. ...

2024-09 · 9 min · 1868 words
[Open-world Multi-label Text Classification with Extremely Weak Supervision 🔗](https://arxiv.org/abs/2407.05609)

Mapping the Unknown: How X-MLClass Solves Multi-Label Classification Without Labels

Imagine you are handed a massive library of books. Your job is to organize them into categories. But there’s a catch: you don’t know what the categories are yet, and many books belong to multiple categories simultaneously (e.g., a book could be about “History,” “War,” and “Biography”). You have no list of genres, no Dewey Decimal System, and no labeled examples. All you have is a vague instruction: “Organize these by topic.” ...

2024-07 · 9 min · 1754 words
[Ontologically Faithful Generation of Non-Player Character Dialogues 🔗](https://arxiv.org/abs/2212.10618)

Can AI Write Video Game Characters? Inside the KNUDGE Dataset and Dialogue Generation

If you have ever played a massive open-world Role-Playing Game (RPG) like Skyrim, The Witcher, or The Outer Worlds, you know that the immersion depends heavily on the people you meet. Non-Player Characters (NPCs) are the lifeblood of these worlds. They give you quests, explain the history of the land, and react to your decisions. But creating these interactions is incredibly expensive. A Triple-A game might have thousands of NPCs, each requiring unique dialogue trees that must remain consistent with the game’s lore and the current state of the player’s quest. It is a logistical nightmare that costs millions of dollars and years of writing time. ...

2022-12 · 8 min · 1700 words
[OneNet: A Fine-Tuning Free Framework for Few-Shot Entity Linking via Large Language Model Prompting 🔗](https://arxiv.org/abs/2410.07549)

How OneNet Solves Entity Linking with LLMs Without Fine-Tuning

Imagine reading a news headline: “Jordan played a magnificent game last night.” As a human, you immediately look for context. Are we talking about Michael Jordan, the basketball legend? Jordan, the country in the Middle East? Or perhaps Jordan, a local high school player? This process of mapping a mention in text (like “Jordan”) to a specific, unique identity in a knowledge base (like a Wikipedia page) is called Entity Linking (EL). ...

2024-10 · 9 min · 1746 words
[ONE2SET + Large Language Model: Best Partners for Keyphrase Generation 🔗](https://arxiv.org/abs/2410.03421)

The Dynamic Duo: How Combining Set Generation with LLMs Solves Keyphrase Prediction

In the vast ocean of digital information, finding exactly what you need often relies on a very small set of words: Keyphrases. Keyphrase Generation (KPG) is the task of automatically reading a document and producing a concise list of phrases that capture its core concepts. Ideally, these keyphrases act as an index, helping with information retrieval, text summarization, and categorization. However, KPG is deceptively difficult. It requires two competing capabilities: ...

2024-10 · 13 min · 2563 words
[One-to-Many Communication and Compositionality in Emergent Communication 🔗](https://aclanthology.org/2024.emnlp-main.1157.pdf)

Beyond Private Chats: How Broadcasting Shapes the Evolution of Language

Language is rarely a private affair. When a sergeant shouts a command to a squad, or an advertiser broadcasts a commercial to millions, a single message must be understood by multiple people simultaneously. Yet, in the world of Artificial Intelligence and “Emergent Communication,” we have mostly studied language as a one-on-one game: one speaker, one listener. If we want AI agents to develop languages that look and behave like human language, we need to replicate the pressures under which human language evolved. One of the most critical properties of human language is compositionality—the ability to combine simple units (like words) according to rules (grammar) to create complex meanings. “Blue square” is compositional; a unique, unrelated sound for every possible colored shape is not. ...

8 min · 1582 words
[One Thousand and One Pairs: A 'novel' challenge for long-context language models 🔗](https://arxiv.org/abs/2406.16264)

Beyond the Haystack: Why Long-Context LLMs Struggle to Read a Novel

In the rapid evolution of Large Language Models (LLMs), one metric has become a major bragging right: the context window. We have moved from models that could remember a few paragraphs to behemoths like Gemini 1.5 Pro and GPT-4o, which claim to process hundreds of thousands, if not millions, of tokens at once. In theory, you can now feed an entire novel into an AI and ask questions about it. ...

2024-06 · 7 min · 1480 words
[On the Universal Truthfulness Hyperplane Inside LLMs 🔗](https://arxiv.org/abs/2407.08582)

Inside the Mind of an LLM: Hunting for the Universal Truthfulness Hyperplane

Large Language Models (LLMs) like GPT-4 and LLaMA-2 have revolutionized how we interact with information. They can write code, summarize novels, and answer complex queries. Yet, they have a notorious flaw: hallucination. An LLM can confidently state that the Eiffel Tower is in Berlin or invent court cases that never happened. This raises a fascinating, almost philosophical question for AI researchers: Does the model know it is lying? When an LLM outputs a falsehood, is it because the model truly believes that information is correct? Or does the model contain the correct information deep within its internal representations, but somehow fails to output it? ...
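The "truthfulness hyperplane" idea the excerpt hints at is usually operationalized as a linear probe: fit a separating hyperplane on hidden-state vectors labeled true/false and see if it generalizes. A minimal perceptron sketch on toy 2-D "activations" (the data and probe here are illustrative assumptions, not the paper's setup):

```python
def train_probe(vectors, labels, lr=0.1, epochs=50):
    # Fit a linear "truthfulness" probe: find a hyperplane w·x + b = 0
    # separating hidden states of true statements from false ones.
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):  # y is +1 (true) or -1 (false)
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Toy 2-D "hidden states" standing in for real LLM activations.
states = [[1.0, 0.2], [0.9, 0.1], [-0.8, -0.3], [-1.1, 0.0]]
labels = [1, 1, -1, -1]
w, b = train_probe(states, labels)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
print([predict(x) for x in states])  # → [1, 1, -1, -1]
```

If such a probe trained on one task still separates true from false statements on unseen tasks, that is evidence the model internally represents truth even when its output hallucinates.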

2024-07 · 8 min · 1562 words