EMNLP 2024

[Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective 🔗](https://arxiv.org/abs/2406.17969)

Untangling the Black Box: Why Monosemanticity is Key to Better LLM Alignment

Imagine trying to understand how a complex alien brain works. You probe a single neuron, hoping it corresponds to a specific thought like “happiness” or “the color red.” Instead, that single neuron fires for a chaotic mix of concepts: a specific preposition, the mention of the French Revolution, and the closing bracket of a Python function. This is the reality of polysemanticity in Large Language Models (LLMs). For years, researchers in mechanistic interpretability have struggled with the fact that neural networks are “black boxes.” A major hurdle is that individual neurons often represent multiple, unrelated concepts simultaneously. The “holy grail” of interpretability is achieving monosemanticity, a state where one neuron (or feature) corresponds to exactly one understandable concept. ...

[Encoding and Controlling Global Semantics for Long-form Video Question Answering 🔗](https://arxiv.org/abs/2405.19723)

Beyond the Clip: Mastering Long-Form Video Understanding with Gated State Space Models

Imagine you are watching a superhero movie. In the first act, the protagonist realizes a specific component in their suit is poisoning them. An hour later, they discover a new element to replace it. In the final battle, that new element powers the suit to victory. Now, imagine I ask you: “What would have happened if the hero hadn’t replaced the component?” To answer this, you need to connect the poisoning event from hour 0 to the victory in hour 2. You need the global context—the entire narrative arc. ...

[Encoding Spreadsheets for Large Language Models 🔗](https://aclanthology.org/2024.emnlp-main.1154.pdf)

Beyond the Grid: How SHEETENCODER Teaches LLMs to Read Excel

In the world of data, the spreadsheet is king. From small businesses to Fortune 500 companies, Microsoft Excel and Google Sheets are the default operating systems for structured data. Yet, for all their ubiquity, spreadsheets remain a massive blind spot for today’s most powerful Artificial Intelligence tools. While Large Language Models (LLMs) like GPT-4 and Llama 3 have mastered prose, code, and even poetry, they struggle significantly with the two-dimensional grid of a spreadsheet. Why? Because LLMs read text sequentially: left to right, top to bottom. A spreadsheet, however, is spatial. A cell at Z100 might be mathematically dependent on A1, but they are miles apart in a linear text stream. Furthermore, the sheer volume of tokens required to represent a sparse, formatted grid often creates a “context overflow,” exceeding the model’s token limit before it can even begin to reason. ...
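
To make the “miles apart” point concrete, here is a minimal sketch that assumes a naive row-major flattening of a sheet 26 columns wide (A to Z). This serialization, the helper names, and the sheet width are illustrative assumptions, not the encoding proposed in the paper.

```python
def column_index(letters: str) -> int:
    """Convert an Excel-style column label (A, B, ..., Z, AA, ...) to a 0-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def linear_position(cell: str, n_cols: int) -> int:
    """0-based position of a cell when the grid is flattened row by row."""
    letters = "".join(c for c in cell if c.isalpha())
    row = int("".join(c for c in cell if c.isdigit()))
    return (row - 1) * n_cols + column_index(letters)


N_COLS = 26  # assumed sheet width: columns A..Z

# Number of cells separating A1 from Z100 in the flattened stream.
gap = linear_position("Z100", N_COLS) - linear_position("A1", N_COLS)
print(gap)  # 2599
```

Under these assumptions, two cells that are directly linked by a formula end up 2,599 cells apart in the text the model reads, and every cell in between costs at least one token.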

[Empowering Multi-step Reasoning across Languages via Program-Aided Language Models 🔗](https://aclanthology.org/2024.emnlp-main.678.pdf)

Breaking the Language Barrier in AI Math: An Introduction to Cross-PAL

Mathematics is often called the universal language. A calculation like \(20 - 12 + 5\) yields the same result whether you describe the problem in English, Chinese, or Swahili. However, for Large Language Models (LLMs), this universality is not a given. While models like GPT-4 exhibit impressive reasoning capabilities in English, their performance often degrades significantly when prompted in low-resource languages. The challenge lies in multi-step reasoning. Solving a word problem requires understanding the narrative, planning a logical sequence of steps, and executing calculations. When an LLM is forced to do this in a language it wasn’t heavily trained on, the cognitive load is often too high, leading to errors. ...

[Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting 🔗](https://arxiv.org/abs/2410.00771)

Can AI Remember What It Watched? Solving Catastrophic Forgetting in VideoQA

If you spend any time on the internet, you know that video content is exploding. From YouTube tutorials to TikTok trends, the volume of video data generated daily is staggering. For Artificial Intelligence, specifically Video Question Answering (VideoQA) models, this presents a massive challenge. We typically train these models on static datasets. Once trained, they are frozen. But the world isn’t static. If we want an AI to understand new types of content or answer new kinds of questions, we usually have to fine-tune it. Here lies the problem: when you teach a Large Language Model (LLM) new tricks, it often forgets the old ones. This phenomenon is known as Catastrophic Forgetting. ...

[Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training 🔗](https://arxiv.org/abs/2410.04439)

Teaching Diffusion Models to Spell: A Deep Dive into Input Granularity and Glyph-Aware Training

If you have experimented with text-to-image diffusion models like Stable Diffusion or Midjourney, you have likely encountered the “gibberish text” phenomenon. You ask for a sign that says “Welcome Home,” and the model generates a beautiful living room with a sign that reads “Wleom Hmeo.” While diffusion models have mastered lighting, texture, and composition, they notoriously struggle with visual text generation. The letters are often distorted, words are misspelled, or the text is ignored entirely. While commercial models like DALL-E 3 are improving, open-source backbone models still lag behind, particularly when it comes to languages other than English, such as Chinese. ...

[EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models 🔗](https://arxiv.org/abs/2312.14069)

More Than Just Words: Benchmarking Emphasis in Speech-to-Speech AI

Have you ever had a text message misunderstood because the recipient couldn’t hear how you said it? The sentence “I never said he stole my bag” has seven different meanings depending on which of the seven words you emphasize. “*I* never said he stole my bag” implies someone else said it. “I never *said* he stole my bag” implies you might have thought it, but didn’t voice it. “I never said he *stole* my bag” implies he might have borrowed it. This nuance is called prosody, often described as the “music of speech.” It includes rhythm, pitch, and loudness. While modern AI has made massive strides in Speech-to-Speech (S2S) translation and generation (think real-time translation devices or voice cloning), it often fails to capture this critical layer of meaning. Most models are “flat”: they get the words right, but the intent wrong. ...

[Emotion Granularity from Text: An Aggregate-Level Indicator of Mental Health 🔗](https://arxiv.org/abs/2403.02281)

Decoding Mental Health: How Social Media Text Reveals Emotion Granularity

Imagine you are having a terrible day. When a friend asks you how you are feeling, what do you say? Do you reply that you are “frustrated” because a project stalled, “anxious” about an upcoming deadline, and “disappointed” in a colleague? Or do you simply say you feel “bad” or “stressed”? This distinction, the ability to identify and label emotions with specificity, is what psychologists call Emotion Granularity (EG). ...

[EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control 🔗](https://arxiv.org/abs/2410.00316)

EmoKnob: How Vector Arithmetic Can Give AI Voices a Soul

Consider the famous Shakespearean line: “To be, or not to be.” Read that text again. How did it sound in your head? Was it whispered in despair? Was it spoken with philosophical contemplation? Or perhaps shouted in defiance? The text itself is ambiguous. In human communication, the words are only half the story; the vocal inflection (the prosody) carries the rest. This is the current frontier in Text-to-Speech (TTS) and voice cloning technology. While modern models like those from OpenAI or ElevenLabs can generate voices that sound hyper-realistic, they suffer from a significant limitation: rigidity. ...

[Embedding and Gradient Say Wrong: A White-Box Method for Hallucination Detection 🔗](https://aclanthology.org/2024.emnlp-main.116.pdf)

Inside the Black Box: Using Gradients and Embeddings to Catch LLM Hallucinations

Large Language Models (LLMs) like GPT-4 and LLaMA have transformed how we interact with information. They write code, compose poetry, and answer complex questions. But they have a notorious flaw: hallucinations. We’ve all seen it: an LLM confidently asserts a “fact” that is completely made up, citing non-existent court cases or inventing historical events. For students and researchers in NLP, solving hallucination is the “holy grail” of current reliability research. Most existing solutions treat the model as a “black box,” simply asking it, “Are you sure about that?” via prompting. But what if we could look under the hood? ...

[Embedded Named Entity Recognition using Probing Classifiers 🔗](https://arxiv.org/abs/2403.11747)

Streaming NER: How to Extract Entities from LLMs in Real-Time Without Fine-Tuning

When we interact with modern Large Language Models (LLMs) like GPT-4 or Llama, we usually experience them in a “streaming” format. Words appear one by one, creating the illusion of a conversation. But for developers and researchers building complex applications—like automated fact-checkers or knowledge graph builders—this streaming text presents a challenge. How do you extract structured data, such as names, locations, or dates (Named Entity Recognition, or NER), from a stream of text that hasn’t finished generating yet? ...

[Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence 🔗](https://arxiv.org/abs/2406.10957)

Quality Over Quantity: Fixing the Verbosity Trap in LLM Alignment with SamPO

In the rapidly evolving world of Large Language Models (LLMs), bigger isn’t always better, especially when it comes to the length of the model’s response. If you have interacted with modern chatbots, you might have noticed a peculiar habit: they love to ramble. Ask a simple question, and you often get a wall of text. This phenomenon is known as verbosity. While we want models to be thorough, we don’t want them to confuse “long” with “correct.” ...

[EfficientRAG: Efficient Retriever for Multi-Hop Question Answering 🔗](https://arxiv.org/abs/2408.04259)

EfficientRAG: Solving Multi-Hop QA Without Breaking the Bank

In the rapidly evolving landscape of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become the gold standard for grounding AI responses in reality. By fetching relevant data from external sources, RAG reduces hallucinations and enables models to answer questions about specific, private, or up-to-the-minute data. However, there is a class of questions that continues to stump standard RAG systems: multi-hop questions. These are complex queries that require multiple steps of reasoning to answer, such as “Who is the director of the movie that starred the lead actor of ‘Titanic’?” ...

[Efficient Unseen Language Adaptation for Multilingual Pre-Trained Language Models 🔗](https://aclanthology.org/2024.emnlp-main.1057.pdf)

Breaking the Language Barrier: How Soft Prompts Enable AI to Learn Unseen Languages Efficiently

In the world of Natural Language Processing (NLP), Multilingual Pre-trained Language Models (mPLMs) like mBERT and XLM-R are the polyglots of the AI world. They are trained on text from approximately 100 different languages, allowing them to perform tasks such as sentiment analysis or topic classification across borders. However, there is a catch. There are over 7,000 languages spoken worldwide. What happens when we need to use these models on a “low-resource” language that wasn’t included in their original training data? This is the problem of unseen language adaptation. ...

[Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge 🔗](https://arxiv.org/abs/2402.16050)

TGB: Bridging the Gap in Long-Form Video Understanding for MLLMs

Imagine asking an AI to watch a two-hour movie and answer the question: “Why did the protagonist hesitate before opening the door in the second act?” For a human, this is a trivial task of perception and memory. For a Multimodal Large Language Model (MLLM), this is a computational nightmare. While MLLMs have made incredible strides in understanding static images, applying them to long-form videos presents a massive hurdle. Videos contain thousands of frames. Feeding all of them into a standard MLLM would overflow the “context window” (the limit on how much information a model can process at once) and bring even the most powerful GPUs to their knees. ...

[Efficient Sequential Decision Making with Large Language Models 🔗](https://arxiv.org/abs/2406.12125)

The Best of Both Worlds: Combining LLMs and Contextual Bandits for Efficient Decision Making

In the rapidly evolving landscape of Artificial Intelligence, Large Language Models (LLMs) have established themselves as the undisputed kings of knowledge and reasoning. From writing code to summarizing history, their capabilities are vast. However, a significant gap remains between generating text and taking optimal actions in a dynamic environment. Consider a recommendation system, a healthcare triage bot, or a digital assistant. These are sequential decision-making problems. The agent observes a context, selects an action, and receives feedback (a reward or loss). Traditionally, we solve these with contextual bandit algorithms. These algorithms are computationally cheap and learn well over the long term, but they suffer from the “cold start” problem: they begin with zero knowledge and must flail around randomly to learn what works. ...
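
The observe, act, and receive-feedback loop described above can be sketched with a small epsilon-greedy bandit over discrete contexts. This is an illustrative baseline only, not the method from the paper; the class name, the toy environment, and its reward rule are all hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    """Keeps a running mean reward per (context, action) pair."""

    def __init__(self, n_actions, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(lambda: [0] * n_actions)
        self.values = defaultdict(lambda: [0.0] * n_actions)

    def select(self, context):
        # Cold start: with no data for this context, the only option is a random guess.
        unseen = sum(self.counts[context]) == 0
        if unseen or self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        values = self.values[context]
        return max(range(self.n_actions), key=values.__getitem__)

    def update(self, context, action, reward):
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.counts[context][action] += 1
        n = self.counts[context][action]
        self.values[context][action] += (reward - self.values[context][action]) / n


# Hypothetical environment: each context has one action that pays reward 1.0.
BEST_ACTION = {"morning": 0, "evening": 2}


def run(rounds=500):
    bandit = EpsilonGreedyBandit(n_actions=3)
    env_rng = random.Random(1)
    total = 0.0
    for _ in range(rounds):
        context = env_rng.choice(sorted(BEST_ACTION))            # observe
        action = bandit.select(context)                          # act
        reward = 1.0 if action == BEST_ACTION[context] else 0.0  # feedback
        bandit.update(context, action, reward)
        total += reward
    return bandit, total / rounds


bandit, average_reward = run()
print(f"average reward over 500 rounds: {average_reward:.2f}")
```

The early rounds are where the random guessing happens, which is the cold-start cost the paragraph refers to; prior knowledge about which action suits which context would shorten that phase.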

[Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards 🔗](https://arxiv.org/abs/2409.12656)

Can AI Track Its Own Progress? Automating Scientific Leaderboards with LLMs

We are living through an explosion of scientific research. In the field of Computation and Language alone, approximately 100 new papers are uploaded to arXiv every single day. For a researcher, student, or practitioner, keeping up with this torrent of information is no longer just difficult; it is humanly impossible. The core question everyone asks is: “What is currently the state-of-the-art?” To answer this, the community relies on Scientific Leaderboards. These are ranked lists that track how well different models perform on specific tasks (like translation or summarization) using specific datasets. Platforms like Papers With Code or NLP-progress have become the de facto homepages for researchers trying to benchmark their work. ...

[Efficient Overshadowed Entity Disambiguation by Mitigating Shortcut Learning 🔗](https://aclanthology.org/2024.emnlp-main.855.pdf)

Breaking the Habit - How Counterfactual Training Fixes Entity Disambiguation

If you read the sentence, “Michael Jordan published a new paper on machine learning,” who do you picture? If you are like most people—and more importantly, like most Machine Learning models—you probably immediately thought of the basketball legend, #23 of the Chicago Bulls. But you would be wrong. The sentence refers to Michael I. Jordan, a renowned computer science professor at UC Berkeley. This specific problem is known as overshadowing in Natural Language Processing (NLP). When an ambiguous name (like “Michael Jordan”) is shared by a very popular entity and a less common one, models almost exclusively predict the popular one, ignoring the context clues that suggest otherwise. ...

[Efficient LLM Comparative Assessment: A Product of Experts Framework for Pairwise Comparisons 🔗](https://arxiv.org/abs/2405.05894)

LLMs Judging LLMs: How to Rank Texts Efficiently with a Product of Experts Framework

As Large Language Models (LLMs) continue to dominate the landscape of Natural Language Processing, a secondary, equally difficult problem has emerged: How do we evaluate them? When an LLM generates a summary, a story, or a line of dialogue, there is rarely a single “correct” answer. Traditional metrics like BLEU or ROUGE, which rely on word overlap with a reference text, often fail to capture nuances like coherence, creativity, or helpfulness. This has led to the rise of LLM-as-a-judge, where we use a stronger model (like GPT-4 or Llama-2-Chat) to grade the outputs of other models. ...
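
To see why word overlap can mislead, here is a minimal sketch using a simplified unigram-overlap F1 score. It is a stand-in for illustration, not actual BLEU or ROUGE (which add n-grams, brevity penalties, and other details), and the example sentences are made up.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over shared word counts between a candidate and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"  # same meaning, different words
scrambled = "mat the on sat cat the"         # same words, no coherence

print(round(unigram_f1(paraphrase, reference), 3))  # 0.167
print(round(unigram_f1(scrambled, reference), 3))   # 1.0
```

A faithful paraphrase scores near zero while an incoherent shuffle scores perfectly, which is the kind of blind spot that motivates using an LLM as the judge instead.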

[Effective Synthetic Data and Test-Time Adaptation for OCR Correction 🔗](https://aclanthology.org/2024.emnlp-main.862.pdf)

Fixing History’s Typos—How Synthetic Data and Self-Correction Are Revolutionizing OCR

Imagine walking into a library that contains every book ever written. Now, imagine that for millions of those books, the pages are riddled with gibberish. “The cat sat on the mat” might read as “The c@t s4t on tbe mAt.” This is the current reality of Digital Humanities. While Optical Character Recognition (OCR) technology has allowed us to digitize vast archives of historical texts (from 19th-century novels to ancient newspapers), it is far from perfect. Faded ink, complex layouts, and unusual typefaces often confuse OCR engines, resulting in “noisy” text that is difficult for humans to read and even harder for computers to analyze. ...