[Modeling User Preferences with Automatic Metrics: Creating a High-Quality Preference Dataset for Machine Translation 🔗](https://arxiv.org/abs/2410.07779)

Bridging the Gap: How Automatic Metrics Can Create Better Human-Aligned Translation Models

Machine translation (MT) has come a long way from the clunky, word-for-word substitutions of the past. Today, Large Language Models (LLMs) can translate with impressive fluency. However, “fluent” doesn’t always mean “perfect.” In many cases, a translation can be grammatically correct but fail to capture the subtle tone, cultural nuance, or specific style a user prefers. This brings us to a significant challenge in modern AI: Alignment. How do we teach a model not just to predict the next word, but to choose the best translation among several valid options? ...
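
To make the teaser concrete, here is a toy example (purely illustrative; the field names and the sentence pair are assumptions, not drawn from the paper’s dataset) of the kind of record a translation preference dataset contains: a source sentence plus a preferred and a dispreferred translation.

```python
# Hypothetical translation-preference record (illustrative only).
preference_example = {
    "source":   "Das Meeting wurde auf nächste Woche verschoben.",
    "chosen":   "The meeting has been postponed to next week.",  # preferred translation
    "rejected": "The meeting was moved on next week.",           # fluent-ish but awkward
}
```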

2024-10 · 7 min · 1481 words
[Modeling Nonnative Sentence Processing with L2 Language Models 🔗](https://aclanthology.org/2024.emnlp-main.283.pdf)

The Bilingual Brain of AI: Can Language Models Simulate Second Language Acquisition?

If you have ever tried to learn a second language (L2) as an adult, you know the struggle. You might know the vocabulary, but you still find yourself instinctively arranging words using the grammar rules of your native language (L1). This phenomenon is known as L1 transfer. For example, a native Spanish speaker might say “the car red” instead of “the red car” because adjectives follow nouns in Spanish. In the world of Natural Language Processing (NLP), researchers are increasingly asking: Can we simulate this cognitive process in machines? Can we build “L2 Language Models” that mimic how non-native speakers process English? ...

3 min · 590 words
[Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding 🔗](https://arxiv.org/abs/2409.19672)

Beyond Linear Text: Rethinking Reading Order in Document AI as a Graph

When we read a novel, the process is straightforward: left to right, top to bottom, line by line. But consider how you read a receipt, a newspaper with multiple columns, or a complex form. You might scan the header, jump to a specific table, read down a column, and then skip to the total at the bottom. This is the challenge of Visually-rich Documents (VrDs). In the field of Document AI, understanding the “Reading Order” is crucial. If a model reads a two-column document straight across the page (crossing the gutter), the sentences become nonsense. ...

2024-09 · 8 min · 1590 words
[Model-based Preference Optimization in Abstractive Summarization without Human Feedback 🔗](https://arxiv.org/abs/2409.18618)

Hallucination Hunting: How LLMs Can Teach Themselves to Be More Faithful

Large Language Models (LLMs) are incredible writers. They are fluent, creative, and can summarize complex documents in seconds. However, anyone who has used LLMs extensively knows their fatal flaw: hallucination. They often generate text that sounds plausible but contains incorrect or contradictory information. For abstractive summarization—where the goal is factual accuracy and conciseness—hallucination is a dealbreaker. The standard solution to align models with human intent is Reinforcement Learning from Human Feedback (RLHF). While effective, RLHF is expensive, slow, and, paradoxically, not always reliable for fact-checking. Humans often prefer summaries that read well over those that are strictly accurate. ...

2024-09 · 6 min · 1145 words
[Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation 🔗](https://arxiv.org/abs/2406.13663)

Peeking Under the Hood: How to Verify RAG Citations Using Model Internals

In the rapidly evolving landscape of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become the gold standard for Question Answering. By allowing models to fetch relevant documents from an external database before answering a query, we have significantly reduced—though not eliminated—the problem of “hallucinations.” RAG promises a world where AI responses are grounded in fact rather than statistical probability alone. However, a critical problem remains: Trust. When an LLM provides an answer accompanied by a citation (e.g., “[1]”), we instinctively trust it. We assume the model read Document [1] and derived the answer from it. But often, this is an illusion. Models can hallucinate citations just as easily as they hallucinate facts. They might generate a correct answer based on pre-training memory but slap a random citation on it to satisfy a prompt’s formatting requirements. This phenomenon is known as being “right for the wrong reasons.” ...

2024-06 · 10 min · 2073 words
[Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue 🔗](https://arxiv.org/abs/2401.04700)

The Hidden Cost of Knowledge: Why Model Editing Breaks LLMs and How to Fix It

Large Language Models (LLMs) like LLaMA and GPT have revolutionized how we interact with information. However, they have a persistent flaw: their knowledge is static. If a model was trained in 2020, it believes the world froze in that year. When the President of the United States changes, or a new scientific discovery is made, the model remains blissfully ignorant, often hallucinating outdated answers. Retraining these massive models from scratch for every new fact is prohibitively expensive and slow. Enter Model Editing—a technique designed to surgically update specific facts within a model’s neural network without a full retrain. It sounds like the perfect solution: efficient, targeted, and fast. ...

2024-01 · 8 min · 1553 words
[Model Balancing Helps Low-data Training and Fine-tuning 🔗](https://arxiv.org/abs/2410.12178)

Balancing Act: How Layer-Wise Learning Rates Rescue Low-Data Fine-Tuning

In the current era of Artificial Intelligence, the paradigm of “pre-train, then fine-tune” has become the standard. We take massive Foundation Models (FMs)—whether they are Large Language Models (LLMs) like LLaMA or scientific models—and adapt them to specific tasks. Usually, this works remarkably well. However, there is a catch: fine-tuning typically requires a high-quality, curated dataset. But what happens when that dataset is tiny? In many real-world scenarios, data is scarce. This is particularly true in Scientific Machine Learning (SciML), where generating data points might involve running expensive physical simulations (like fluid dynamics) that take days to complete. When we try to fine-tune large models on just a handful of examples (a “low-data regime”), training often collapses or becomes unstable, and the resulting models fail to generalize. ...
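
Since the headline names layer-wise learning rates, here is a minimal PyTorch sketch of that general knob: giving each layer its own learning rate through optimizer parameter groups. The grouping rule and the decay factor below are illustrative assumptions, not the paper’s specific balancing criterion.

```python
import torch
import torch.nn as nn

# Toy model standing in for a fine-tuned network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Assign each layer its own learning rate via optimizer parameter groups.
param_groups = []
for depth, module in enumerate(model):
    params = [p for p in module.parameters() if p.requires_grad]
    if not params:
        continue  # e.g., ReLU has no parameters
    # Illustrative rule: shrink the base learning rate with depth.
    param_groups.append({"params": params, "lr": 1e-3 * (0.5 ** depth)})

optimizer = torch.optim.AdamW(param_groups)
```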

2024-10 · 8 min · 1649 words
[MoDULA: Mixture of Domain-Specific and Universal LoRA for Multi-Task Learning 🔗](https://arxiv.org/abs/2412.07405)

Mastering Multi-Task LLMs: Inside the MoDULA Architecture

In the current landscape of Artificial Intelligence, Large Language Models (LLMs) like LLaMA, Qwen, and Yi are becoming the bedrock of modern NLP. However, there is a persistent tension in the development of these models: the tug-of-war between generalization and specialization. We want models that can chat fluently about the weather (generalization) but also solve complex calculus problems or provide accurate medical diagnoses (specialization). Traditionally, achieving this required massive computational resources to fine-tune the entire model, often leading to “catastrophic forgetting”—where learning a new task (like coding) makes the model worse at old tasks (like creative writing). ...

2024-12 · 8 min · 1532 words
[MoCoKGC: Momentum Contrast Entity Encoding for Knowledge Graph Completion 🔗](https://aclanthology.org/2024.emnlp-main.832.pdf)

Bridging Text and Structure: How MoCoKGC Revolutionizes Knowledge Graph Completion

Imagine trying to teach a computer about the world. You might tell it that “Steve Jobs founded Apple.” In a database, this is stored as a triple: (Steve Jobs, founded, Apple Inc.). This structured web of data is what we call a Knowledge Graph (KG). However, these graphs are rarely perfect. They are often missing connections. For example, the graph might know Steve Jobs founded Apple, but it might be missing the link (Apple Inc., headquarters location, Cupertino). ...
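
To see why this is framed as a ranking problem, here is a toy sketch of the completion task itself: score candidate tail entities for the missing triple (Apple Inc., headquarters location, ?). The stand-in encoder and dot-product scorer are assumptions for illustration, not MoCoKGC’s momentum-contrast entity encoder.

```python
import random

# Candidate tail entities for the missing link (Apple Inc., headquarters location, ?).
candidates = ["Cupertino", "Seattle", "Redmond"]

def embed(text, dim=8):
    # Stand-in encoder: a deterministic pseudo-random vector per string.
    rng = random.Random(text)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def score(head, relation, tail):
    # Dot product between a (head + relation) query vector and the tail vector.
    query = [h + r for h, r in zip(embed(head), embed(relation))]
    return sum(q * t for q, t in zip(query, embed(tail)))

# Knowledge graph completion = rank the candidates by how well they fit the query.
ranked = sorted(candidates,
                key=lambda e: score("Apple Inc.", "headquarters location", e),
                reverse=True)
print(ranked)
```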

9 min · 1854 words
[Mixture-of-Subspaces in Low-Rank Adaptation 🔗](https://arxiv.org/abs/2406.11909)

Unlocking Hidden Potential in LoRA: The Mixture-of-Subspaces Approach

The scale of modern Large Language Models (LLMs) like GPT-4 and LLaMA 3 is staggering. While their performance is impressive, adapting these giants to specific downstream tasks is a computational nightmare. You simply cannot afford to update all parameters for every new task. This challenge gave rise to Parameter-Efficient Fine-Tuning (PEFT). Among PEFT methods, LoRA (Low-Rank Adaptation) has become the industry standard. It freezes the pre-trained weights and injects trainable low-rank matrices, drastically reducing the number of parameters you need to update. ...
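
For readers new to LoRA, the mechanism described above looks roughly like this in PyTorch: the pre-trained weight stays frozen while a small low-rank update is trained alongside it. This is a minimal sketch of plain LoRA, not the Mixture-of-Subspaces variant the paper proposes; the class and hyperparameter names are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Plain LoRA wrapper: y = W x + (alpha / r) * B A x, with W frozen."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pre-trained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus trainable low-rank update.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Wrap one projection of a frozen model and train only A and B.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
```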

2024-06 · 7 min · 1464 words
[MIXTURE-OF-SKILLS: Learning to Optimize Data Usage for Fine-Tuning Large Language Models 🔗](https://arxiv.org/abs/2406.08811)

Beyond Heuristics: How Reinforcement Learning Optimizes LLM Fine-Tuning with Mixture-of-Skills

Training a Large Language Model (LLM) is a bit like preparing a meal for a very picky eater. You have a massive pantry of ingredients—datasets containing math problems, coding challenges, medical literature, casual chat logs, and more. The goal is to cook up a model that is proficient in all of these skills. But here lies the challenge: how much of each ingredient do you add? If you add too much coding data, the model might forget how to write poetry. If you drown it in medical texts, it might lose its ability to solve basic math. ...
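
To make the “how much of each ingredient” question concrete, here is a toy sketch of the static baseline it implies: sampling each training batch from several skill datasets with fixed, hand-picked proportions. The datasets and weights are made up for illustration; the paper’s approach learns these proportions during fine-tuning rather than fixing them.

```python
import random

# Hand-picked mixing proportions over skill datasets (illustrative numbers).
datasets = {
    "math": ["2 + 2 = 4", "The derivative of x^2 is 2x."],
    "code": ["def add(a, b): return a + b"],
    "chat": ["Hello! How can I help you today?"],
}
mixture_weights = {"math": 0.4, "code": 0.4, "chat": 0.2}

def sample_batch(batch_size=4):
    names = list(mixture_weights)
    weights = [mixture_weights[n] for n in names]
    # Draw each example's source dataset according to the fixed proportions.
    return [random.choice(datasets[random.choices(names, weights=weights, k=1)[0]])
            for _ in range(batch_size)]

print(sample_batch())
```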

2024-06 · 7 min · 1414 words
[Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules 🔗](https://arxiv.org/abs/2407.06677)

Breaking the Stack: How Mixture-of-Modules Reinvents the Transformer

The Transformer architecture has become the undisputed king of natural language processing. From the original “Attention Is All You Need” paper to the massive Large Language Models (LLMs) of today like GPT-4, the fundamental recipe has remained largely unchanged: a deep stack of identical layers. Data enters at the bottom and is processed sequentially, layer by layer, until it exits at the top. This design relies on a strict “depth-ordered convention.” Layer 5 must always wait for Layer 4, which must wait for Layer 3. But is this rigid hierarchy actually necessary? ...

2024-07 · 10 min · 2037 words
[Mitigating the Language Mismatch and Repetition Issues in LLM-based Machine Translation via Model Editing 🔗](https://arxiv.org/abs/2410.07054)

Performing Brain Surgery on LLMs to Fix Translation Glitches

Large Language Models (LLMs) like LLaMA and GPT have revolutionized how we approach machine translation (MT). Unlike traditional translation systems that are trained specifically to convert language A to language B, LLMs are “polyglots” by nature. You can simply ask them to translate a sentence, and they usually do a decent job. This capability, known as In-Context Learning (ICL), allows models to translate based on just a few examples or even a simple instruction. ...

2024-10 · 9 min · 1710 words
[Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics 🔗](https://arxiv.org/abs/2410.10867)

Breaking the Chains of Reference: A Robust, Reference-Free Metric for AI Summarization

In the rapidly evolving world of Natural Language Processing (NLP), abstractive summarization—the ability of an AI to read a document and write a concise, original summary—remains a “holy grail” task. However, building these systems is only half the battle. The other half, often more treacherous, is evaluating them. How do we know if a summary is actually good? ...

2024-10 · 11 min · 2321 words
[Mitigating the Alignment Tax of RLHF 🔗](https://arxiv.org/abs/2309.06256)

The Price of Manners: How to Align LLMs Without Making Them Forget

Large Language Models (LLMs) like GPT-4 and Claude are remarkable not just for their ability to generate text, but for their ability to follow instructions and adhere to human values—a process known as alignment. However, there is a hidden cost to this alignment. When we use Reinforcement Learning from Human Feedback (RLHF) to teach a model to be “helpful, honest, and harmless,” it often suffers from catastrophic forgetting. It might become polite, but it suddenly performs worse on translation, reading comprehension, or common sense reasoning. ...

2023-09 · 7 min · 1477 words
[Mitigating Training Imbalance in LLM Fine-Tuning via Selective Parameter Merging 🔗](https://arxiv.org/abs/2410.03743)

Does Data Order Matter? Improving LLMs with Parameter-Selection Merging

In the world of Large Language Models (LLMs), Supervised Fine-Tuning (SFT) is the standard procedure for adapting a pre-trained base model to a specific task, whether it’s mathematical reasoning, coding, or following instructions. The general consensus has long been that as long as we shuffle our training data and run enough epochs, the model will learn effectively. But what if the order in which the model sees the data matters more than we thought? What if the samples seen at the very beginning of training are consistently learned “worse” than those seen later, creating a hidden imbalance in your model’s performance? ...

2024-10 · 9 min · 1897 words
[Mitigating Open-Vocabulary Caption Hallucinations 🔗](https://arxiv.org/abs/2312.03631)

Trust Issues in Vision-Language Models: How MOCHa and OpenCHAIR Tackle AI Hallucinations

Image captioning is one of the most fundamental intersections of Computer Vision and Natural Language Processing (NLP). It requires a machine to look at an image and describe it in human language. In recent years, Vision-Language Models (VLMs) like BLIP and GIT have become incredibly fluent, generating detailed and grammatically correct descriptions. But they have a lying problem. In the field of AI, we call this hallucination. This occurs when a model generates details—objects, actions, or attributes—that simply aren’t present in the image. This isn’t just a quirk; it is a critical reliability issue. If an AI describes a “man holding a gun” when he is holding a drill, or a “child on a skateboard” when they are jumping on stairs, the consequences range from user frustration to dangerous misinformation. ...

2023-12 · 8 min · 1680 words
[Mitigating Matthew Effect: Multi-Hypergraph Boosted Multi-Interest Self-Supervised Learning for Conversational Recommendation 🔗](https://aclanthology.org/2024.emnlp-main.86.pdf)

Breaking the Echo Chamber: How HiCore Tackles the Matthew Effect in Conversational AI

Have you ever noticed that the more you use a streaming service or a shopping app, the more it seems to recommend the same few popular things? You watch one blockbuster, and suddenly your entire feed is dominated by the “Top 10” list, pushing niche indie films or unique products into obscurity. This phenomenon is known as the Matthew Effect, derived from the biblical adage: “For to every one who has will more be given… but from him who has not, even what he has will be taken away.” In the context of Artificial Intelligence, it means the popular items get more exposure, while the unpopular ones (the long-tail items) get buried. ...

10 min · 2019 words
[Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing 🔗](https://arxiv.org/abs/2410.11462)

Can Syntactic Smoothing Fix the Rare Word Problem in LLMs?

Imagine reading the sentence: “The Golden Gate Bridge has been obnebulated every morning this week, limiting visibility.” Unless you are an avid reader of 19th-century literature, you probably haven’t encountered the word obnebulated before. Yet, you likely understood the sentence perfectly. You know it’s a verb (thanks to the “-ed” suffix and its position after “has been”), and context clues about “visibility” suggest it means something like “clouded” or “fogged.” ...

2024-10 · 8 min · 1691 words
[Mitigate Extrinsic Social Bias in Pre-trained Language Models via Continuous Prompts Adjustment 🔗](https://aclanthology.org/2024.emnlp-main.620.pdf)

Beyond Manual Word Lists: Debiasing AI with Continuous Prompts

Pre-trained Language Models (PLMs) like BERT and RoBERTa have revolutionized Natural Language Processing (NLP). They act as the backbone for everything from sentiment analysis to hate speech detection. However, these models have a significant skeleton in the closet: they inherit human biases present in their massive training datasets. When we deploy these models, they often exhibit “extrinsic social bias”—unfair behavior in specific downstream tasks. For instance, a model might be more likely to classify a tweet as “offensive” simply because it contains African American English (AAE), or associate certain professions more strongly with one gender. ...

9 min · 1838 words