[MoCoKGC: Momentum Contrast Entity Encoding for Knowledge Graph Completion 🔗](https://aclanthology.org/2024.emnlp-main.832.pdf)

Bridging Text and Structure: How MoCoKGC Revolutionizes Knowledge Graph Completion

Imagine trying to teach a computer about the world. You might tell it that “Steve Jobs founded Apple.” In a database, this is stored as a triple: (Steve Jobs, founded, Apple Inc.). This structured web of data is what we call a Knowledge Graph (KG). However, these graphs are rarely perfect. They are often missing connections. For example, the graph might know Steve Jobs founded Apple, but it might be missing the link (Apple Inc., headquarters location, Cupertino). ...
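To make the completion task concrete, here is a minimal Python sketch of a knowledge graph stored as triples and the ranking question a model like MoCoKGC has to answer. The `score_fn` argument is a hypothetical stand-in for whatever scoring model is plugged in; it is not the paper's actual interface.

```python
# A toy knowledge graph stored as (head, relation, tail) triples.
triples = {
    ("Steve Jobs", "founded", "Apple Inc."),
    ("Apple Inc.", "industry", "Consumer electronics"),
}

# Completion asks: given (Apple Inc., headquarters location, ?),
# which entity fills the blank? A model scores every candidate entity,
# and we hope "Cupertino" comes out on top.
def rank_candidates(head, relation, candidates, score_fn):
    """Rank candidate tail entities for an incomplete triple (higher score = better)."""
    return sorted(candidates, key=lambda tail: score_fn(head, relation, tail), reverse=True)
```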

9 min · 1854 words
[Mixture-of-Subspaces in Low-Rank Adaptation 🔗](https://arxiv.org/abs/2406.11909)

Unlocking Hidden Potential in LoRA: The Mixture-of-Subspaces Approach

The scale of modern Large Language Models (LLMs) like GPT-4 and LLaMA 3 is staggering. While their performance is impressive, adapting these giants to specific downstream tasks is a computational nightmare. You simply cannot afford to update all parameters for every new task. This challenge gave rise to Parameter-Efficient Fine-Tuning (PEFT). Among PEFT methods, LoRA (Low-Rank Adaptation) has become the industry standard. It freezes the pre-trained weights and injects trainable low-rank matrices, drastically reducing the number of parameters you need to update. ...
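As a refresher on the mechanism the paper builds on, here is a minimal PyTorch sketch of a vanilla LoRA layer (not the mixture-of-subspaces variant itself, and under the usual assumptions about initialization and scaling): the pre-trained weight is frozen and only the two low-rank matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style layer: frozen base weight plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, zero-init
        self.scaling = alpha / r

    def forward(self, x):
        # y = base(x) + (x A^T) B^T * scaling; only A and B receive gradients.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

The low-rank update adds only r × (in_features + out_features) trainable parameters per layer, which is where LoRA's efficiency comes from.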

2024-06 · 7 min · 1464 words
[MIXTURE-OF-SKILLS: Learning to Optimize Data Usage for Fine-Tuning Large Language Models 🔗](https://arxiv.org/abs/2406.08811)

Beyond Heuristics: How Reinforcement Learning Optimizes LLM Fine-Tuning with Mixture-of-Skills

Training a Large Language Model (LLM) is a bit like preparing a meal for a very picky eater. You have a massive pantry of ingredients—datasets containing math problems, coding challenges, medical literature, casual chat logs, and more. The goal is to cook up a model that is proficient in all of these skills. But here lies the challenge: how much of each ingredient do you add? If you add too much coding data, the model might forget how to write poetry. If you drown it in medical texts, it might lose its ability to solve basic math. ...

2024-06 · 7 min · 1414 words
[MIXTURE-OF-MODULES: REINVENTING TRANSFORMERS AS DYNAMIC ASSEMBLIES OF MODULES 🔗](https://arxiv.org/abs/2407.06677)

Breaking the Stack: How Mixture-of-Modules Reinvents the Transformer

The Transformer architecture has become the undisputed king of natural language processing. From the original “Attention Is All You Need” paper to the massive Large Language Models (LLMs) of today like GPT-4, the fundamental recipe has remained largely unchanged: a deep stack of identical layers. Data enters at the bottom and is processed sequentially, layer by layer, until it exits at the top. This design relies on a strict “depth-ordered convention.” Layer 5 must always wait for Layer 4, which must wait for Layer 3. But is this rigid hierarchy actually necessary? ...

2024-07 · 10 min · 2037 words
[Mitigating the Language Mismatch and Repetition Issues in LLM-based Machine Translation via Model Editing 🔗](https://arxiv.org/abs/2410.07054)

Performing Brain Surgery on LLMs to Fix Translation Glitches

Large Language Models (LLMs) like LLaMA and GPT have revolutionized how we approach machine translation (MT). Unlike traditional translation systems that are trained specifically to convert language A to language B, LLMs are “polyglots” by nature. You can simply ask them to translate a sentence, and they usually do a decent job. This capability, known as In-Context Learning (ICL), allows models to translate based on just a few examples or even a simple instruction. ...

2024-10 · 9 min · 1710 words
[Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics 🔗](https://arxiv.org/abs/2410.10867)

Breaking the Chains of Reference: A Robust, Reference-Free Metric for AI Summarization

In the rapidly evolving world of Natural Language Processing (NLP), abstractive summarization—the ability of an AI to read a document and write a concise, original summary—remains a “holy grail” task. However, building these systems is only half the battle. The other half, often more treacherous, is evaluating them. How do we know if a summary is actually good? ...

2024-10 · 11 min · 2321 words
[Mitigating the Alignment Tax of RLHF 🔗](https://arxiv.org/abs/2309.06256)

The Price of Manners: How to Align LLMs Without Making Them Forget

Large Language Models (LLMs) like GPT-4 and Claude are remarkable not just for their ability to generate text, but for their ability to follow instructions and adhere to human values—a process known as alignment. However, there is a hidden cost to this alignment. When we use Reinforcement Learning from Human Feedback (RLHF) to teach a model to be “helpful, honest, and harmless,” it often suffers from catastrophic forgetting. It might become polite, but it suddenly performs worse on translation, reading comprehension, or common sense reasoning. ...

2023-09 · 7 min · 1477 words
[Mitigating Training Imbalance in LLM Fine-Tuning via Selective Parameter Merging 🔗](https://arxiv.org/abs/2410.03743)

Does Data Order Matter? Improving LLMs with Parameter-Selection Merging

In the world of Large Language Models (LLMs), Supervised Fine-Tuning (SFT) is the standard procedure for adapting a pre-trained base model to a specific task, whether it’s mathematical reasoning, coding, or following instructions. The general consensus has long been that as long as we shuffle our training data and run enough epochs, the model will learn effectively. But what if the order in which the model sees the data matters more than we thought? What if the samples seen at the very beginning of training are consistently learned “worse” than those seen later, creating a hidden imbalance in your model’s performance? ...

2024-10 · 9 min · 1897 words
[Mitigating Open-Vocabulary Caption Hallucinations 🔗](https://arxiv.org/abs/2312.03631)

Trust Issues in Vision-Language Models: How MOCHa and OpenCHAIR Tackle AI Hallucinations

Image captioning is one of the most fundamental intersections of Computer Vision and Natural Language Processing (NLP). It requires a machine to look at an image and describe it in human language. In recent years, Vision-Language Models (VLMs) like BLIP and GIT have become incredibly fluent, generating detailed and grammatically correct descriptions. But they have a lying problem. In the field of AI, we call this hallucination. This occurs when a model generates details—objects, actions, or attributes—that simply aren’t present in the image. This isn’t just a quirk; it is a critical reliability issue. If an AI describes a “man holding a gun” when he is holding a drill, or a “child on a skateboard” when they are jumping on stairs, the consequences range from user frustration to dangerous misinformation. ...

2023-12 · 8 min · 1680 words
[Mitigating Matthew Effect: Multi-Hypergraph Boosted Multi-Interest Self-Supervised Learning for Conversational Recommendation 🔗](https://aclanthology.org/2024.emnlp-main.86.pdf)

Breaking the Echo Chamber: How HiCore Tackles the Matthew Effect in Conversational AI

Have you ever noticed that the more you use a streaming service or a shopping app, the more it seems to recommend the same few popular things? You watch one blockbuster, and suddenly your entire feed is dominated by the “Top 10” list, pushing niche indie films or unique products into obscurity. This phenomenon is known as the Matthew Effect, derived from the biblical adage: “For to every one who has will more be given… but from him who has not, even what he has will be taken away.” In the context of Artificial Intelligence, it means the popular items get more exposure, while the unpopular ones (the long-tail items) get buried. ...

10 min · 2019 words
[Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing 🔗](https://arxiv.org/abs/2410.11462)

Can Syntactic Smoothing Fix the Rare Word Problem in LLMs?

Imagine reading the sentence: “The Golden Gate Bridge has been obnebulated every morning this week, limiting visibility.” Unless you are an avid reader of 19th-century literature, you probably haven’t encountered the word obnebulated before. Yet, you likely understood the sentence perfectly. You know it’s a verb (thanks to the “-ed” suffix and its position after “has been”), and context clues about “visibility” suggest it means something like “clouded” or “fogged.” ...

2024-10 · 8 min · 1691 words
[Mitigate Extrinsic Social Bias in Pre-trained Language Models via Continuous Prompts Adjustment 🔗](https://aclanthology.org/2024.emnlp-main.620.pdf)

Beyond Manual Word Lists: Debiasing AI with Continuous Prompts

Pre-trained Language Models (PLMs) like BERT and RoBERTa have revolutionized Natural Language Processing (NLP). They act as the backbone for everything from sentiment analysis to hate speech detection. However, these models have a significant skeleton in the closet: they inherit human biases present in their massive training datasets. When we deploy these models, they often exhibit “extrinsic social bias”—unfair behavior in specific downstream tasks. For instance, a model might be more likely to classify a tweet as “offensive” simply because it contains African American English (AAE), or associate certain professions more strongly with one gender. ...

9 min · 1838 words
[MisinfoEval: Generative AI in the Era of 'Alternative Facts' 🔗](https://arxiv.org/abs/2410.09949)

Can AI Fix Fake News? Inside MisinfoEval and the Power of Personalized Fact-Checking

In the span of a single decade, the architecture of information consumption has fundamentally changed. We have moved from an era of curated news broadcasts to one of algorithmic “filter bubbles,” where social media feeds reinforce our existing beliefs and insulate us from opposing viewpoints. This environment has proven to be a fertile breeding ground for misinformation—sensational, often false stories that spread faster and farther than the truth. The consequences are not merely academic; they threaten democratic processes, public health, and economic stability. Traditionally, platforms have tried to combat this using what researchers call a “knowledge deficit” model. The assumption is simple: if you give people the facts, they will correct their views. Platforms apply “False” tags or link to Snopes articles, hoping that critical thinking will kick in. ...

2024-10 · 8 min · 1678 words
[MIRRORSTORIES: Reflecting Diversity through Personalized Narrative Generation with Large Language Models 🔗](https://arxiv.org/abs/2409.13935)

Can AI Write Your Life Story? How MIRRORSTORIES Is Personalizing Literature

Maya Angelou once wrote, “There is no greater agony than bearing an untold story inside you.” For millions of readers, this agony is compounded by a lack of representation. When you open a book, you are looking for a reflection—a character who looks like you, lives like you, and faces struggles you understand. These are called “mirror books.” They validate identity, foster belonging, and significantly improve reading engagement, especially in education. ...

2024-09 · 7 min · 1462 words
[MiniConGTS: A Near Ultimate Minimalist Contrastive Grid Tagging Scheme for Aspect Sentiment Triplet Extraction 🔗](https://arxiv.org/abs/2406.11234)

Less is More: How MiniConGTS Revolutionizes Sentiment Analysis with Minimalism and Contrastive Learning

In the world of Natural Language Processing (NLP), sentiment analysis has evolved far beyond simply classifying a movie review as “positive” or “negative.” Today, we deal with complex sentences where multiple opinions about different things exist simultaneously. Consider the sentence: “The food was delicious, but the service was terrible.” A simple “neutral” label would be misleading. We need to know what was good (food) and what was bad (service). ...
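Concretely, the task the paper targets, Aspect Sentiment Triplet Extraction (ASTE), turns that sentence into structured triplets. Here is a small illustrative sketch of the expected input and output, not the paper's tagging code:

```python
# Aspect Sentiment Triplet Extraction (ASTE): for every opinion expressed,
# extract an (aspect term, opinion term, sentiment polarity) triplet.
sentence = "The food was delicious, but the service was terrible."

expected_triplets = [
    ("food", "delicious", "positive"),    # what was good, and why
    ("service", "terrible", "negative"),  # what was bad, and why
]
```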

2024-06 · 8 min · 1569 words
[MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents 🔗](https://arxiv.org/abs/2404.10774)

MiniCheck: GPT-4 Level Fact-Checking for a Fraction of the Cost

Large Language Models (LLMs) have revolutionized how we interact with information, from summarizing complex reports to answering open-ended questions. However, they suffer from a persistent and well-known flaw: hallucination. An LLM can confidently generate a statement that sounds plausible but is factually incorrect. To mitigate this, the industry has largely adopted Retrieval-Augmented Generation (RAG). In a RAG setup, the model is provided with “grounding documents”—trusted sources of evidence—and asked to generate an answer based solely on that evidence. While this helps, it does not solve the problem entirely. Models can still misinterpret the documents, blend information incorrectly, or hallucinate details not found in the text. ...
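The fact-checking setup studied here can be sketched generically: split the model's answer into sentences and verify each one against the grounding document. In this hypothetical outline, `entailment_fn` and `my_checker` stand in for any sentence-level checker (an NLI model, GPT-4, or a MiniCheck-style classifier); this is not MiniCheck's actual API.

```python
from typing import Callable, List

def check_against_document(
    claims: List[str],
    grounding_doc: str,
    entailment_fn: Callable[[str, str], bool],
) -> List[bool]:
    """Return, for each claim, whether the grounding document supports it."""
    return [entailment_fn(grounding_doc, claim) for claim in claims]

# Usage sketch (names are placeholders):
#   sentences = split_into_sentences(llm_answer)
#   supported = check_against_document(sentences, retrieved_passage, my_checker)
```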

2024-04 · 9 min · 1856 words
[Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments 🔗](https://arxiv.org/abs/2402.14672)

Why LLMs Need Middleware: Bridging the Gap Between Agents and Massive Data

We have entered an era where Large Language Models (LLMs) like GPT-4 possess a human-like mastery over text. They can summarize articles, write code, and chat fluently. However, the ambition of Artificial Intelligence researchers extends far beyond processing text. The ultimate goal is to create generalist agents: AI that can not only talk but act within the real world to solve complex tasks. Imagine asking an AI to “find the average revenue of all tech companies founded after 2010 based on our internal database” or “map the relationships between every Nobel Prize winner in Physics and their doctoral advisors.” ...

2024-02 · 9 min · 1744 words
[MiTTenS: A Dataset for Measuring Gender Mistranslation Harms 🔗](https://arxiv.org/abs/2401.06935)

Lost in Translation: How We Measure Gender Bias in the Age of Foundation Models

Imagine reading a story in Bengali about your aunt. The text says, “Sarah is my aunt. I really like her jokes.” You paste this into a translation tool to share it with an English-speaking friend. The output reads: “Sarah is my aunt. I really like his jokes.” In an instant, the identity of the subject is erased and replaced. While this might seem like a minor grammatical slip, these errors—known as gender mistranslations—can cause significant representational harm. They reinforce stereotypes (e.g., assuming all doctors are male) and can misgender individuals in sensitive contexts. ...

2024-01 · 8 min · 1627 words
[Metrics for What, Metrics for Whom: Assessing Actionability of Bias Evaluation Metrics in NLP 🔗](https://aclanthology.org/2024.emnlp-main.1207.pdf)

Bias Metrics in NLP Are Broken: Why Actionability Is the Missing Piece

Imagine you are a Machine Learning engineer responsible for deploying a large language model (LLM) for a hiring platform. You run a standard bias evaluation script, and it returns a score: 0.42. What do you do now? Is 0.42 good? Is it terrible? Does it mean the model hates women, or that it slightly prefers Western names? If you fix the data and the score drops to 0.38, is the model safe to deploy? ...

10 min · 2068 words
[Methods for Automatic Matrix Language Determination of Code-Switched Speech 🔗](https://arxiv.org/abs/2410.02521)

Decoding the Matrix - How AI Determines the Dominant Grammar in Code-Switched Speech

Imagine you are listening to a conversation in Singapore. You might hear a sentence like: “I thought all trains 都是 via Jurong East 去到 Pasir Ris.” To a monolingual speaker, this is chaos. To a bilingual speaker, it’s perfectly natural. This phenomenon is known as Code-Switching (CS)—the fluid alternation between two or more languages in a single conversation. ...

2024-10 · 4 min · 1964 words