[MULTI-NEWS+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation 🔗](https://arxiv.org/abs/2404.09682)

Cleaning Up the Mess: How LLMs Can Fix Noisy Datasets Automatically

Introduction: The “Garbage In, Garbage Out” Dilemma. In the world of Machine Learning, there is an old adage that every student learns in their first semester: “Garbage In, Garbage Out.” No matter how sophisticated your neural network architecture is—whether it’s a state-of-the-art Transformer or a massive Large Language Model (LLM)—it cannot learn effectively if the data it is fed is flawed. For years, the gold standard for solving this problem was human annotation. If a dataset was messy, you hired humans to read it, label it, and clean it. But as datasets have exploded in size, reaching millions of examples, relying on human labor has become prohibitively expensive and slow. This leaves researchers in a bind: do we accept noisy data and lower performance, or do we burn through budgets cleaning it? ...

2024-04 · 10 min · 1938 words
[Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models 🔗](https://arxiv.org/abs/2406.17169)

Can AI Think Deeply? Unpacking Multi-LogiEval and the Limits of LLM Logical Reasoning

Introduction: The Illusion of Intelligence. Large Language Models (LLMs) like GPT-4 and Gemini have captivated the world with their ability to write code, compose poetry, and pass standardized tests. When you chat with these models, their fluency can easily be mistaken for deep understanding. They seem to reason, argue, and deduce. But are they actually performing logical reasoning, or are they simply excellent pattern matchers mimicking the structure of an argument? ...

2024-06 · 8 min · 1597 words
[Multi-Level Cross-Modal Alignment for Speech Relation Extraction 🔗](https://aclanthology.org/2024.emnlp-main.668.pdf)

Bridging the Gap between Speech and Knowledge: A Multi-Level Alignment Approach

In the world of Natural Language Processing (NLP), extracting structured knowledge—like relations between entities—from unstructured text is a well-established field. We have sophisticated models that can read a sentence like “Steve Jobs co-founded Apple” and extract the triplet (Steve Jobs, Founder, Apple). But what about speech? A massive amount of human knowledge is exchanged via podcasts, meetings, phone calls, and news broadcasts. Historically, extracting relations from speech (SpeechRE) has been treated as a two-step pipeline: transcribe the audio to text using Automatic Speech Recognition (ASR), and then run a text-based relation extraction model. While functional, this approach is prone to “error propagation”—if the ASR mishears a name, the relation extraction fails immediately. ...
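For readers unfamiliar with the cascade setup, here is a minimal, purely illustrative Python sketch of that two-step pipeline (the `transcribe` and `extract_relations` functions are hypothetical placeholders, not the paper's models); it shows how a single ASR slip wipes out the downstream triplet.

```python
# Illustrative cascade for SpeechRE: ASR first, then text-based relation extraction.
# Both functions are hypothetical stand-ins for real models.

def transcribe(audio_path: str) -> str:
    """Hypothetical ASR step: audio file -> text transcript."""
    # A real system would run a speech recognizer here; we fake a mishearing.
    return "Steve Jops co-founded Apple"  # ASR error: "Jobs" became "Jops"

def extract_relations(text: str) -> list:
    """Hypothetical text-based extractor: text -> (head, relation, tail) triplets."""
    triplets = []
    if "Steve Jobs" in text and "Apple" in text:
        triplets.append(("Steve Jobs", "Founder", "Apple"))
    return triplets

# Error propagation in action: the ASR mistake means no triplet is recovered.
transcript = transcribe("interview.wav")
print(extract_relations(transcript))  # [] -- the relation is lost downstream
```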

10 min · 1961 words
[Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges 🔗](https://arxiv.org/abs/2410.03458)

Decoding the Sound of Vietnam: A Deep Dive into the ViMD Dataset and Multi-Dialect AI

Language is rarely a monolith. If you have ever tried to build a speech recognition system, you know that a “standard” language model often falls apart when faced with the rich tapestry of real-world accents and dialects. This is particularly true for Vietnamese, a tonal language where meaning shifts with the pitch of your voice, and where regional variations can be drastic. For years, research into Vietnamese Speech Recognition (SR) and Dialect Identification (DI) has operated under a simplified assumption: that the country is divided into three broad dialect regions—Northern, Central, and Southern. While linguistically useful, this generalization glosses over the nuanced reality that each of Vietnam’s 63 provinces has its own unique “provincial dialect.” ...

2024-10 · 9 min · 1861 words
[MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning 🔗](https://arxiv.org/abs/2405.07551)

Bridging the Gap: How MuMath-Code Teaches LLMs to Think and Code Simultaneously

If you have ever asked a standard Large Language Model (LLM) to solve a complex math problem, you might have noticed a frustrating pattern. The model often writes a beautiful, confident explanation, but then stumbles on the actual arithmetic, delivering a wrong answer with absolute conviction. Conversely, models designed to write code can calculate perfectly but often struggle to understand the nuances of a word problem. In the race to achieve GPT-4 level performance with open-source models, researchers have generally split into two camps: those who teach models to “think” better through reasoning (Chain-of-Thought), and those who teach models to “use tools” (like a Python interpreter). ...

2024-05 · 8 min · 1544 words
[More Than Catastrophic Forgetting: Integrating General Capabilities For Domain-Specific LLMs 🔗](https://arxiv.org/abs/2405.17830)

Beyond Forgetting: How ALoRA Teaches LLMs to Think Like Experts Without Losing Their Common Sense

Large Language Models (LLMs) are the polymaths of the AI world. They can write code, solve math problems, summarize history, and chat about ethics—all in the same session. But when we need an LLM to be an expert—say, a legal consultant or a medical diagnostic tool—we have to send it back to school. This process is called Supervised Fine-Tuning (SFT). Here lies a classic problem in machine learning: when you teach a model too much about a specific domain, it often overwrites what it learned before. It might become a brilliant lawyer but suddenly forget how to do basic arithmetic or lose its ability to chat naturally. This phenomenon is famously known as Catastrophic Forgetting (CF). ...

2024-05 · 9 min · 1827 words
[More Insightful Feedback for Tutoring: Enhancing Generation Mechanisms and Automatic Evaluation 🔗](https://aclanthology.org/2024.emnlp-main.605.pdf)

Beyond "Try Again": Teaching AI to Give Better Feedback

Introduction: Imagine you are learning a new language or studying for a history exam using an online platform. You encounter a question about a text you just read: “Why did the protagonist stay home?” You confidently answer: “Because he was sick.” The system responds: “Incorrect. Try again.” This is the verification stage. It tells you that you are wrong, but it doesn’t tell you why, nor does it help you find the right answer. Now, imagine a better system. One that responds: “Actually, the text mentions he was feeling fine, but look closely at what happened to his car.” ...

10 min · 2061 words
[More DWUGs: Extending and Evaluating Word Usage Graph Datasets in Multiple Languages 🔗](https://aclanthology.org/2024.emnlp-main.796.pdf)

Building Better Word Graphs: How More Data and Denser Connections Reveal True Meaning

Language is a moving target. Words like “plane” or “mouse” mean something very different today than they did two hundred years ago. To teach computers how to understand these shifts—a field known as Lexical Semantic Change Detection (LSCD)—researchers need high-quality data. They need a way to map how a word is used in thousands of different contexts. Enter the Word Usage Graph (WUG). This innovative approach moves away from rigid dictionary definitions and instead relies on how individual usages of a word relate to one another across actual sentences. ...
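As a rough sketch of the data structure (my own toy example, not the DWUG tooling): in a Word Usage Graph each node is a single sentence containing the target word, edges carry human judgments of how related two usages are, and tightly connected clusters correspond to senses.

```python
# Toy Word Usage Graph for "mouse", built with networkx (illustrative only).
import networkx as nx

wug = nx.Graph()

# Nodes: individual usages (sentences) of the target word.
usages = {
    "u1": "The mouse scurried under the floorboards.",
    "u2": "A field mouse nested in the barn.",
    "u3": "Click the left mouse button to select the file.",
}
wug.add_nodes_from(usages)  # node ids u1..u3; sentences kept in the dict

# Edges: annotator relatedness judgments between usage pairs (e.g., 1 = unrelated, 4 = identical).
wug.add_edge("u1", "u2", weight=4)  # both are the "animal" sense
wug.add_edge("u1", "u3", weight=1)  # animal vs. computer device
wug.add_edge("u2", "u3", weight=1)

# Clustering by edge weight groups usages into senses; comparing clusters
# across time periods is what reveals lexical semantic change.
print(wug.edges(data=True))
```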

8 min · 1551 words
[Moral Foundations of Large Language Models 🔗](https://arxiv.org/abs/2310.15337)

Do Androids Dream of Moral Values? Analyzing the Hidden Ethics of LLMs

Introduction In the last few years, Large Language Models (LLMs) like GPT-3 and PaLM have moved from research labs to the center of our digital lives. We use them to write emails, debug code, and even seek life advice. But as we integrate these systems into society, a critical question arises: Do these models have a moral compass? We know that LLMs are trained on massive datasets scraped from the internet—a corpus of text that contains the best and worst of humanity. It is well-documented that these models can absorb toxic biases related to gender, race, and religion. But what about deeper, more abstract psychological frameworks? Do LLMs exhibit a consistent set of values? And if so, can those values be manipulated? ...

2023-10 · 8 min · 1567 words
[MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction 🔗](https://arxiv.org/abs/2408.01426)

Beyond SMILES: How MolTRES Revolutionizes Molecular Property Prediction

The pharmaceutical and materials science industries are currently undergoing a massive shift from traditional “wet lab” experiments to computational “dry lab” predictions. Deep Neural Networks (DNNs) are at the forefront of this revolution, promising to reduce the cost and time required to discover new drugs. A popular approach in this field is Chemical Language Representation Learning. Just as Large Language Models (LLMs) like GPT learn to understand English by reading billions of sentences, chemical models learn to understand molecules by reading billions of SMILES (Simplified Molecular-Input Line Entry System) strings. SMILES represents a 3D molecule as a 1D string of text (e.g., CCO for ethanol). ...
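To make the representation concrete, here is a tiny RDKit snippet (my own illustration, not the paper's code) showing how the one-line string CCO round-trips to a molecular graph:

```python
# Parse a SMILES string into a molecule and back, using RDKit (illustration only).
from rdkit import Chem

smiles = "CCO"                    # ethanol written as a 1D "chemical sentence"
mol = Chem.MolFromSmiles(smiles)  # string -> molecular graph

print(mol.GetNumAtoms())                              # 3 heavy atoms (hydrogens are implicit)
print([atom.GetSymbol() for atom in mol.GetAtoms()])  # ['C', 'C', 'O']
print(Chem.MolToSmiles(mol))                          # canonical SMILES: 'CCO'
```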

2024-08 · 7 min · 1364 words
[Modular Pluralism: Pluralistic Alignment via Multi-LLM Collaboration 🔗](https://arxiv.org/abs/2406.15951)

Beyond the Average: How Modular Pluralism Teaches LLMs to Represent Diverse Human Values

In the rapid evolution of Large Language Models (LLMs), “alignment” has become a buzzword. We want our AI assistants to be helpful, harmless, and honest. Typically, this is achieved through techniques like Reinforcement Learning from Human Feedback (RLHF), where models are trained to prefer responses that humans rate highly. But here is the catch: Who are these humans? ...

2024-06 · 8 min · 1517 words
[Modeling User Preferences with Automatic Metrics: Creating a High-Quality Preference Dataset for Machine Translation 🔗](https://arxiv.org/abs/2410.07779)

Bridging the Gap: How Automatic Metrics Can Create Better Human-Aligned Translation Models

Machine translation (MT) has come a long way from the clunky, word-for-word substitutions of the past. Today, Large Language Models (LLMs) can translate with impressive fluency. However, “fluent” doesn’t always mean “perfect.” In many cases, a translation can be grammatically correct but fail to capture the subtle tone, cultural nuance, or specific style a user prefers. This brings us to a significant challenge in modern AI: Alignment. How do we teach a model not just to predict the next word, but to choose the best translation among several valid options? ...

2024-10 · 7 min · 1481 words
[Modeling Nonnative Sentence Processing with L2 Language Models 🔗](https://aclanthology.org/2024.emnlp-main.283.pdf)

The Bilingual Brain of AI: Can Language Models Simulate Second Language Acquisition?

If you have ever tried to learn a second language (L2) as an adult, you know the struggle: you may know the vocabulary, yet still find yourself instinctively arranging words using the grammar rules of your native language (L1). This phenomenon is known as L1 transfer. For example, a native Spanish speaker might say “the car red” instead of “the red car” because adjectives follow nouns in Spanish. In the world of Natural Language Processing (NLP), researchers are increasingly asking: Can we simulate this cognitive process in machines? Can we build “L2 Language Models” that mimic how non-native speakers process English? ...

3 min · 590 words
[Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding 🔗](https://arxiv.org/abs/2409.19672)

Beyond Linear Text: Rethinking Reading Order in Document AI as a Graph

When we read a novel, the process is straightforward: left to right, top to bottom, line by line. But consider how you read a receipt, a newspaper with multiple columns, or a complex form. You might scan the header, jump to a specific table, read down a column, and then skip to the total at the bottom. This is the challenge of Visually-rich Documents (VrDs). In the field of Document AI, understanding the “Reading Order” is crucial. If a model reads a two-column document straight across the page (crossing the gutter), the sentences become nonsense. ...
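A minimal sketch (my own toy example, not the paper's formulation) of how reading order can be stored as pairwise ordering relations between layout blocks rather than one flat sequence:

```python
# Reading order as directed "read A before B" relations between layout blocks (toy example).
blocks = {
    "B1": "Column 1, paragraph 1",
    "B2": "Column 1, paragraph 2",
    "B3": "Column 2, paragraph 1",
    "B4": "Footer: invoice total",
}

# Note there is no B1 -> B3 edge: reading straight across the column gutter
# is exactly the mistake a purely left-to-right ordering would make.
reading_order = {("B1", "B2"), ("B2", "B3"), ("B3", "B4")}

def next_blocks(block_id: str) -> list:
    """Blocks that should be read immediately after the given one."""
    return [dst for src, dst in reading_order if src == block_id]

print(next_blocks("B1"))  # ['B2'] -- continue down the column, not across the page
```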

2024-09 · 8 min · 1590 words
[Model-based Preference Optimization in Abstractive Summarization without Human Feedback 🔗](https://arxiv.org/abs/2409.18618)

Hallucination Hunting: How LLMs Can Teach Themselves to Be More Faithful

Large Language Models (LLMs) are incredible writers. They are fluent, creative, and can summarize complex documents in seconds. However, anyone who has used LLMs extensively knows their fatal flaw: hallucination. They often generate text that sounds plausible but contains incorrect or contradictory information. For abstractive summarization—where a summary must stay concise while remaining faithful to the source—hallucination is a dealbreaker. The standard solution to align models with human intent is Reinforcement Learning from Human Feedback (RLHF). While effective, RLHF is expensive, slow, and paradoxically, not always reliable for fact-checking. Humans often prefer summaries that read well over those that are strictly accurate. ...

2024-09 · 6 min · 1145 words
[Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation 🔗](https://arxiv.org/abs/2406.13663)

Peeking Under the Hood: How to Verify RAG Citations Using Model Internals

Introduction: In the rapidly evolving landscape of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become the gold standard for Question Answering. By allowing models to fetch relevant documents from an external database before answering a query, we have significantly reduced—though not eliminated—the problem of “hallucinations.” RAG promises a world where AI responses are grounded in fact rather than statistical probability alone. However, a critical problem remains: Trust. When an LLM provides an answer accompanied by a citation (e.g., “[1]”), we instinctively trust it. We assume the model read Document [1] and derived the answer from it. But often, this is an illusion. Models can hallucinate citations just as easily as they hallucinate facts. They might generate a correct answer based on pre-training memory but slap a random citation on it to satisfy a prompt’s formatting requirements. This phenomenon is known as being “right for the wrong reasons.” ...

2024-06 · 10 min · 2073 words
[Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue 🔗](https://arxiv.org/abs/2401.04700)

The Hidden Cost of Knowledge: Why Model Editing Breaks LLMs and How to Fix It

Large Language Models (LLMs) like LLaMA and GPT have revolutionized how we interact with information. However, they have a persistent flaw: their knowledge is static. If a model was trained in 2020, it believes the world froze in that year. When the President of the United States changes, or a new scientific discovery is made, the model remains blissfully ignorant, often hallucinating outdated answers. Retraining these massive models from scratch for every new fact is prohibitively expensive and slow. Enter Model Editing—a technique designed to surgically update specific facts within a model’s neural network without a full retrain. It sounds like the perfect solution: efficient, targeted, and fast. ...

2024-01 · 8 min · 1553 words
[Model Balancing Helps Low-data Training and Fine-tuning 🔗](https://arxiv.org/abs/2410.12178)

Balancing Act — How Layer-Wise Learning Rates Rescue Low-Data Fine-Tuning

Introduction: In the current era of Artificial Intelligence, the paradigm of “pre-train, then fine-tune” has become the standard. We take massive Foundation Models (FMs)—whether they are Large Language Models (LLMs) like LLaMA or scientific models—and adapt them to specific tasks. Usually, this works remarkably well. However, there is a catch: fine-tuning typically requires a high-quality, curated dataset. But what happens when that dataset is tiny? In many real-world scenarios, data is scarce. This is particularly true in Scientific Machine Learning (SciML), where generating data points might involve running expensive physical simulations (like fluid dynamics) that take days to complete. When we try to fine-tune large models on just a handful of examples (a “low-data regime”), training often becomes unstable, performance collapses, or the resulting models fail to generalize. ...

2024-10 · 8 min · 1649 words
[MoDULA: Mixture of Domain-Specific and Universal LoRA for Multi-Task Learning 🔗](https://arxiv.org/abs/2412.07405)

Mastering Multi-Task LLMs: Inside the MoDULA Architecture

Introduction: In the current landscape of Artificial Intelligence, Large Language Models (LLMs) like LLaMA, Qwen, and Yi are becoming the bedrock of modern NLP. However, there is a persistent tension in the development of these models: the tug-of-war between generalization and specialization. We want models that can chat fluently about the weather (generalization) but also solve complex calculus problems or provide accurate medical diagnoses (specialization). Traditionally, achieving this required massive computational resources to fine-tune the entire model, often leading to “catastrophic forgetting”—where learning a new task (like coding) makes the model worse at old tasks (like creative writing). ...

2024-12 · 8 min · 1532 words
[MoCoKGC: Momentum Contrast Entity Encoding for Knowledge Graph Completion 🔗](https://aclanthology.org/2024.emnlp-main.832.pdf)

Bridging Text and Structure: How MoCoKGC Revolutionizes Knowledge Graph Completion

Introduction: Imagine trying to teach a computer about the world. You might tell it that “Steve Jobs founded Apple.” In a database, this is stored as a triple: (Steve Jobs, founded, Apple Inc.). This structured web of data is what we call a Knowledge Graph (KG). However, these graphs are rarely perfect. They are often missing connections. For example, the graph might know Steve Jobs founded Apple, but it might be missing the link (Apple Inc., headquarters location, Cupertino). ...
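As a quick illustration of the setup (toy code, not MoCoKGC itself): a knowledge graph is just a set of (head, relation, tail) triples, and completion means predicting the tail of a query like (Apple Inc., headquarters location, ?).

```python
# A knowledge graph as (head, relation, tail) triples, with one link missing (toy example).
triples = {
    ("Steve Jobs", "founded", "Apple Inc."),
    ("Apple Inc.", "industry", "Consumer electronics"),
    # Missing: ("Apple Inc.", "headquarters location", "Cupertino")
}

def known_tails(head: str, relation: str) -> list:
    """Return every tail the graph already stores for the query (head, relation, ?)."""
    return [t for h, r, t in triples if h == head and r == relation]

print(known_tails("Steve Jobs", "founded"))                # ['Apple Inc.']
print(known_tails("Apple Inc.", "headquarters location"))  # [] -- the link a KGC model must predict
```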

9 min · 1854 words