[Lifelong Knowledge Editing for LLMs with Retrieval-Augmented Continuous Prompt Learning 🔗](https://arxiv.org/abs/2405.03279)

Can LLMs Learn Forever? Inside RECIPE, the New Standard for Lifelong Model Editing

Imagine you have trained a state-of-the-art Large Language Model (LLM). It speaks fluent English, codes in Python, and understands complex reasoning. But there is a problem: it believes the Prime Minister of the UK is still Boris Johnson, or it doesn’t know about a major geopolitical event that happened yesterday. This is the “static knowledge” problem. Once an LLM is trained, its knowledge is frozen in time. Retraining these massive models from scratch every time a fact changes is financially and computationally impossible. This has led to the rise of Model Editing—techniques designed to surgically update specific facts in an LLM without breaking its general capabilities. ...

2024-05 · 10 min · 2100 words
[Lifelong Event Detection via Optimal Transport 🔗](https://arxiv.org/abs/2410.08905)

How Optimal Transport Stops AI from Forgetting - A Deep Dive into LEDOT

Imagine you are trying to learn a new language. You spend months mastering French. Then, you decide to learn Spanish. But here is the catch: as soon as you start conjugating Spanish verbs, you inexplicably forget every French word you ever learned. This phenomenon is known as Catastrophic Forgetting, and it is one of the biggest hurdles in Artificial Intelligence today. In the world of Natural Language Processing (NLP), we want models that can learn continuously—picking up new tasks without erasing their memory of old ones. This is especially tricky in Continual Event Detection (CED), where a model must identify specific types of events (like “Attacks,” “Elections,” or “Transactions”) in text streams that change over time. ...

2024-10 · 9 min · 1813 words
[Lexically Grounded Subword Segmentation 🔗](https://arxiv.org/abs/2406.13560)

Bringing Meaning Back to Tokenization: A Lexically Grounded Approach

In the world of Natural Language Processing (NLP), we often marvel at the sophisticated architectures of Large Language Models (LLMs) like the Transformer. We analyze attention mechanisms, feed-forward networks, and massive parameter counts. Yet, we frequently overlook the humble “front door” of these models: Tokenization. Standard tokenization methods, like Byte-Pair Encoding (BPE) or SentencePiece (Unigram), are the industry standard. They are statistical powerhouses, designed to compress text efficiently and limit vocabulary size. However, they have a major flaw: they don’t actually understand the words they are breaking apart. They split words based on frequency, not meaning. ...
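
To see this frequency-driven behavior in practice, here is a quick illustration (mine, not the paper's) using the GPT-2 BPE tokenizer from Hugging Face transformers; the exact splits vary by vocabulary, but they rarely line up with morpheme boundaries:

```python
# Quick illustration (not from the paper): BPE splits are driven by
# corpus frequency, so they often cut across morpheme boundaries.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # standard BPE vocabulary

for word in ["unhappiness", "rethinking", "tokenization"]:
    print(f"{word!r} -> {tok.tokenize(word)}")

# The pieces reflect what was frequent in the training corpus, not the
# morphemes (un + happy + ness), which is the flaw the paper targets.
```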

2024-06 · 8 min · 1570 words
[Leveraging pre-trained language models for linguistic analysis: A case of argument structure constructions 🔗](https://aclanthology.org/2024.emnlp-main.415.pdf)

Can AI Solve the Ambiguity of Language? RoBERTa vs. GPT-4 in Linguistic Analysis

Language is a tricky beast. Consider these two sentences: “She ran to the mountains.” “She ran in the mountains.” Syntactically, they look almost identical. They both follow a “Subject + Verb + Prepositional Phrase” structure. A basic parser might look at these and see the exact same tree: a noun, a verb, and a modifier. But as a human reader, you know they mean fundamentally different things. The first sentence describes motion toward a goal; the prepositional phrase “to the mountains” is an argument required to complete the meaning of the movement. The second sentence describes an activity happening in a location; “in the mountains” just sets the scene. ...

7 min · 1393 words
[Leveraging Large Language Models for NLG Evaluation: Advances and Challenges 🔗](https://arxiv.org/abs/2401.07103)

The Judge is an AI: How LLMs are Revolutionizing Text Evaluation

In the world of Artificial Intelligence, we have witnessed a massive shift in how machines write. From the early days of clunky chatbots to the fluent, creative prose of models like GPT-4 and LLaMA, Natural Language Generation (NLG) has advanced at breakneck speed. But this progress has birthed a new, perplexing problem: How do we know if what the AI wrote is actually “good”? For years, researchers relied on rigid metrics that counted how many words overlapped between an AI’s output and a human’s reference text. If the AI used the word “happy” and the human used “joyful,” traditional metrics penalized the AI. This approach fails to capture the nuance, creativity, and semantic depth of modern language models. ...
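
To make that failure concrete, here is a minimal sketch (my own, not from the survey) of a unigram-precision metric in the BLEU family; because it matches surface forms, the synonym gets zero credit:

```python
# Minimal sketch of a unigram-precision metric (BLEU-1-like):
# overlap is counted on surface forms, so synonyms score zero.
def unigram_precision(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), set(reference.lower().split())
    matches = sum(1 for w in cand if w in ref)
    return matches / len(cand) if cand else 0.0

print(unigram_precision("the child looks happy", "the child looks joyful"))
# 0.75 -- "happy" gets no credit despite meaning the same as "joyful",
# which is exactly the failure mode LLM-based evaluators aim to fix.
```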

2024-01 · 9 min · 1758 words
[Leveraging Estimated Transferability Over Human Intuition for Model Selection in Text Ranking 🔗](https://arxiv.org/abs/2409.16198)

Beyond Intuition: How AiRTran Solves the Model Selection Crisis in Text Ranking

In the modern era of Natural Language Processing (NLP), we are spoiled for choice. If you open the Hugging Face model hub today, you are greeted with hundreds of thousands of models. For a student or a practitioner trying to build a text ranking system—like a search engine or a RAG (Retrieval-Augmented Generation) pipeline—this abundance creates a paradox. Which model should you choose? Should you use a BERT variant? A RoBERTa clone? A specialized biomedical model? ...

2024-09 · 8 min · 1686 words
[Leveraging Context-Aware Prompting for Commit Message Generation 🔗](https://aclanthology.org/2024.emnlp-main.749.pdf)

Beyond the Diff: How Graph-Based Context Improves Auto-Generated Commit Messages

If you are a software developer, or studying to become one, you are likely familiar with the “Friday afternoon commit.” You’ve just finished a complex bug fix, you’re tired, and the last thing you want to do is write a detailed explanation of why you changed those ten lines of code. You type `git commit -m "fix bug"` and call it a day. While understandable, this habit creates a nightmare for code maintainability. Commit messages are the historical record of a software project. They explain the intent behind changes, making it possible for future developers (or your future self) to understand the evolution of the codebase without re-reading every line of code. ...

9 min · 1831 words
[Leveraging Conflicts in Social Media Posts: Unintended Offense Dataset 🔗](https://aclanthology.org/2024.emnlp-main.259.pdf)

Beyond Slurs: Teaching AI to Detect Unintended Offense in Social Media

Imagine you are scrolling through Twitter (now X). You see a thread where User A makes a comment about diet and exercise. It seems harmless enough. But then, User B replies angrily, claiming User A is body-shaming them. User A, confused, replies, “I didn’t mean to offend you; I was just sharing what my doctor told me.” In the world of Natural Language Processing (NLP), detecting the offense in User A’s original post is incredibly difficult. It doesn’t contain swear words, racial slurs, or explicit threats. The offense is unintended and implicit, relying entirely on the context and the receiver’s interpretation. ...

9 min · 1909 words
[Leveraging BERT and TFIDF Features for Short Text Clustering via Alignment-Promoting Co-Training 🔗](https://aclanthology.org/2024.emnlp-main.828.pdf)

Best of Both Worlds: Unifying BERT and TFIDF for Superior Short Text Clustering

In the world of Natural Language Processing (NLP), we often view progress as a straight line: we move from Bag-of-Words to Word2Vec, and then to Transformers like BERT. The assumption is usually that the newer model renders the older technique obsolete. Why count words with TFIDF when BERT can understand deep contextual semantics? However, when it comes to Short Text Clustering—grouping tweets, news headlines, or Q&A titles without labels—BERT has a blind spot. While it is excellent at understanding general language, it often misses the significance of rare, domain-specific keywords. Conversely, the “outdated” TFIDF method is excellent at spotting these keywords but fails to grasp context. ...
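
As a quick illustration of the TFIDF side of this complementarity (a toy example, not the paper's co-training pipeline), scikit-learn's TfidfVectorizer upweights exactly those rare, discriminative keywords:

```python
# Illustrative only: TFIDF assigns high weight to rare, discriminative
# keywords that a general-purpose encoder may underweight.
from sklearn.feature_extraction.text import TfidfVectorizer

headlines = [
    "fed raises interest rates again",
    "fed signals more rate hikes",
    "new graphene battery doubles phone life",
]
vec = TfidfVectorizer()
X = vec.fit_transform(headlines)

# Weight of the rare keyword "graphene" in the last headline:
idx = vec.vocabulary_["graphene"]
print(X[2, idx])  # high score: it appears in only one document
```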

8 min · 1548 words
[Let’s discuss! Quality Dimensions and Annotated Datasets for Computational Argument Quality Assessment 🔗](https://aclanthology.org/2024.emnlp-main.1155.pdf)

Decoding Persuasion: A Deep Dive into Computational Argument Quality Assessment

In democratic societies, argumentation is the bedrock of decision-making. Whether it is a politician advocating for policy change, a student writing a persuasive essay, or a user on a forum trying to change another’s view, the ability to argue effectively is a key competence. For years, the field of Natural Language Processing (NLP) has focused heavily on Argument Mining (AM)—the task of teaching computers to simply find arguments within a text. AM algorithms can scan a document and identify premises and conclusions. But identifying an argument is only half the battle. The far more complex challenge is determining how good that argument actually is. ...

9 min · 1873 words
[Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models 🔗](https://arxiv.org/abs/2407.01906)

Precision Surgery for LLMs: How Expert-Specialized Fine-Tuning Revolutionizes MoE Adaptation

The landscape of Large Language Models (LLMs) is currently defined by two conflicting forces: the drive for massive scale and the constraint of limited computational resources. We want models that know everything, but we don’t always have the hardware to train them. This has led to the explosion of Parameter-Efficient Fine-Tuning (PEFT) techniques. Methods like LoRA (Low-Rank Adaptation) have become household names for students and practitioners, allowing us to adapt giant dense models like Llama-3 or Mistral on consumer-grade hardware. ...
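
For contrast with the expert-specialized approach, here is a minimal PyTorch sketch of the LoRA idea the paragraph mentions: freeze the pretrained weight and learn a low-rank update (layer names and hyperparameters are illustrative, not the paper's):

```python
# Minimal LoRA-style layer (the dense-model PEFT baseline the paper
# contrasts with; ESFT itself fine-tunes selected MoE experts instead).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_f: int, out_f: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_f, out_f)       # stands in for a frozen
        self.base.weight.requires_grad_(False)   # pretrained projection
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(out_f, r))        # trainable
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus low-rank update: W x + (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

x = torch.randn(2, 64)
print(LoRALinear(64, 64)(x).shape)  # torch.Size([2, 64])
```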

2024-07 · 9 min · 1855 words
[Let Me Teach You: Pedagogical Foundations of Feedback for Language Models 🔗](https://arxiv.org/abs/2307.00279)

From Training to Teaching: Applying Pedagogical Science to LLM Feedback

The way we train Large Language Models (LLMs) is evolving. In the early days, it was all about next-token prediction on massive datasets. Then came the era of alignment, where we started telling models what we actually wanted them to do, primarily through Reinforcement Learning from Human Feedback (RLHF). But if you look closely at how we “teach” these models, it feels surprisingly primitive compared to how humans teach each other. In RLHF, we often treat the model like a black box that spits out two answers, and we simply tell it, “Answer A is better than Answer B.” ...
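
That “Answer A is better than Answer B” signal usually boils down to a single pairwise loss. Here is a minimal sketch of the standard Bradley-Terry reward-model objective (shown for contrast; the paper argues for richer, pedagogy-inspired feedback):

```python
# The standard pairwise-preference loss behind "A is better than B":
# push the reward of the preferred answer above the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(reward_a: torch.Tensor, reward_b: torch.Tensor):
    # Minimizing this maximizes the margin by which A outscores B.
    return -F.logsigmoid(reward_a - reward_b).mean()

print(preference_loss(torch.tensor([1.2]), torch.tensor([0.3])))
```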

2023-07 · 8 min · 1608 words
[Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning 🔗](https://arxiv.org/abs/2410.15148)

Solving the Transfer Learning Paradox: How Embedding Space Maps Find the Perfect Task in Seconds

In the world of Natural Language Processing (NLP), we are currently living in an era of abundance. We have massive pre-trained models like BERT and RoBERTa, and we have platforms like the HuggingFace Hub hosting hundreds of thousands of datasets. Theoretically, this is a goldmine. If you are building a model to detect emotions in tweets but have very little labeled data, you shouldn’t just fine-tune a raw BERT model. Instead, you should look for a “stepping stone”—an intermediate task. Perhaps fine-tuning BERT on a movie review sentiment dataset first, and then fine-tuning on your tweet emotion data, would yield better results. ...

2024-10 · 9 min · 1705 words
[Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA 🔗](https://arxiv.org/abs/2406.17419)

Beyond the Haystack: Why Current Long-Context LLMs Fail at Real-World Multi-Document Tasks

The race for longer context windows in Large Language Models (LLMs) has been one of the defining trends of the last year. We have moved rapidly from models that could read a few pages to models like Gemini-1.5-Pro and GPT-4o, which boast context windows of 128k, 200k, or even 1 million tokens. Theoretically, this allows an AI to ingest hundreds of financial reports, legal contracts, or academic papers simultaneously and answer complex questions about them. ...

2024-06 · 8 min · 1498 words
[Learning to Write Rationally: How Information Is Distributed in Non-Native Speakers’ Essays 🔗](https://arxiv.org/abs/2411.03550)

Decoding the Non-Native Mind: How Do We Distribute Information When Learning a New Language?

The Invisible Rhythm of Communication: Imagine you are trying to explain a complex concept to a friend. You don’t just blurt out a random string of high-density keywords. Instead, you pace yourself. You mix complicated terms with simpler explanations; you structure your sentences so that the listener can predict where you are going. This instinctive pacing is what linguists call Information Distribution. In our native language, we do this naturally. We smooth out the “bumps” in conversation to make sure we are understood. But what happens when we write in a language we are still learning? Do we lose this rhythm? Do we overwhelm the reader, or do we play it too safe? ...

2024-11 · 3 min · 558 words
[Learning to Retrieve Iteratively for In-Context Learning 🔗](https://arxiv.org/abs/2406.14739)

Beyond Top-K: Building Better Prompts with Iterative Retrieval and Reinforcement Learning

In the era of Large Language Models (LLMs), In-Context Learning (ICL) has become a dominant paradigm. The idea is deceptively simple: instead of fine-tuning a model’s weights, you simply provide a few examples (exemplars) in the prompt, and the model learns the pattern. For example, if you want an LLM to translate English to SQL, your prompt might look like this: `Input: Show me users over 20. Output: SELECT * FROM users WHERE age > 20;` ...
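
Here is a minimal sketch (mine, not the paper's code) of the standard top-k prompt assembly this work moves beyond: retrieved exemplars are simply concatenated in front of the query, with no awareness of how they interact:

```python
# Sketch of standard ICL prompt assembly (the top-k baseline the paper
# improves on): retrieved exemplars are concatenated before the query.
exemplars = [
    ("Show me users over 20.", "SELECT * FROM users WHERE age > 20;"),
    ("List all orders from 2023.", "SELECT * FROM orders WHERE year = 2023;"),
]

def build_prompt(query: str) -> str:
    shots = "\n".join(f"Input: {q}\nOutput: {a}" for q, a in exemplars)
    return f"{shots}\nInput: {query}\nOutput:"

print(build_prompt("Count the active accounts."))
```

The paper's iterative retriever instead selects each exemplar conditioned on the ones already chosen, trained with reinforcement learning rather than fixed top-k similarity.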

2024-06 · 9 min · 1775 words
[Learning to Rank Salient Content for Query-focused Summarization 🔗](https://arxiv.org/abs/2411.00324)

Ranking Matters - How Learning-to-Rank Improves Query-Focused Summarization

Imagine you are handed a 50-page transcript of a corporate meeting and asked a single question: “Why did the marketing team disagree with the engineering team about the budget?” To answer this, you wouldn’t summarize the entire meeting. You wouldn’t care about the opening pleasantries, the coffee break discussions, or the unrelated IT updates. You would scan the document, identify the specific segments where marketing and engineering discussed finances, rank them by importance, and then synthesize an answer. ...

2024-11 · 10 min · 2013 words
[Learning to Extract Structured Entities Using Language Models 🔗](https://arxiv.org/abs/2402.04437)

Beyond Triplets: Revolutionizing Information Extraction with Structured Entities and MuSEE

In the vast ocean of unstructured text on the internet—Wikipedia pages, news articles, financial reports—lies a treasure trove of data waiting to be organized. For years, the field of Information Extraction (IE) has been the miner of this digital age, digging through paragraphs to find relationships between things. Traditionally, this has been done by hunting for “triplets”: a Subject, a Relation, and an Object (e.g., Bill Gates, Co-founder, Microsoft). While effective, this approach has limits. It treats information as a bag of disconnected facts rather than a cohesive profile of an entity. ...
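
The shift the paper motivates is easy to see as a data-structure change. A toy contrast (field names below are illustrative, not the paper's schema):

```python
# Illustrative contrast: triplets scatter facts across disconnected
# tuples, while a structured entity groups them into one profile.
from dataclasses import dataclass, field

Triplet = tuple[str, str, str]
triplets: list[Triplet] = [
    ("Bill Gates", "Co-founder", "Microsoft"),
    ("Bill Gates", "Born", "1955"),
]

@dataclass
class Entity:
    name: str
    properties: dict[str, str] = field(default_factory=dict)

gates = Entity("Bill Gates", {"Co-founder": "Microsoft", "Born": "1955"})
print(gates)  # one cohesive record instead of a bag of facts
```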

2024-02 · 10 min · 1982 words
[Learning to Correct for QA Reasoning with Black-box LLMs 🔗](https://arxiv.org/abs/2406.18695)

CoBB: How to Fix Black-Box LLM Errors Without Accessing Weights

In the current landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4, Claude, and Gemini have become ubiquitous. They possess incredible reasoning capabilities, yet they remain prone to hallucinations, biases, and reasoning errors. For researchers and engineers, the standard solution to these errors is usually fine-tuning or steering the model. However, a significant barrier exists: most state-of-the-art models are black boxes. We interact with them via APIs, sending a prompt and receiving text. We do not have access to the model’s weights, gradients, or often even the output token probabilities (logits). This opacity makes traditional adaptation methods—which rely on accessing internal model states—impossible. ...

2024-06 · 8 min · 1681 words
[Learning from Natural Language Explanations for Generalizable Entity Matching 🔗](https://arxiv.org/abs/2406.09330)

Can Explanations Teach Small Models to Generalize? A Deep Dive into Entity Matching

Imagine you are a data scientist at a massive e-commerce aggregator. You have a database of products from Amazon and another from Google Shopping. Your task is to merge them. On one side, you have a record: iPhone 13, 128GB, Midnight. On the other side: Apple iPhone 13 - Black - 128 GB Storage. To a human, these are obviously the same product. But to a machine, they are just strings of characters. This is the problem of Entity Matching (EM)—identifying when different records refer to the same real-world entity. ...
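
A tiny sketch of why naive matching fails on exactly this pair (illustration mine, not the paper's method):

```python
# Why EM is hard: to a machine these records are just different strings.
a = "iPhone 13, 128GB, Midnight"
b = "Apple iPhone 13 - Black - 128 GB Storage"

def norm(s: str) -> set[str]:
    # Crude normalization: lowercase, strip punctuation, split on spaces.
    return set(s.lower().replace(",", " ").replace("-", " ").split())

print(a == b)          # False: exact match fails outright
print(norm(a) & norm(b))  # token overlap is only {'iphone', '13'}
# Even normalized tokens miss that "128GB" equals "128 GB" and that
# "Midnight" and "Black" describe the same product -- the semantic gap
# the paper's explanation-based training is meant to close.
```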

2024-06 · 8 min · 1642 words