![Cover image](https://deep-paper.org/en/paper/2411.08553/images/cover.png)
Solving the Diversity Crisis in Synthetic Data: A Deep Dive into CorrSynth
The era of Large Language Models (LLMs) has revolutionized how we approach machine learning. We have moved from a scarcity mindset, in which labeled data was expensive and rare, to an abundance mindset, in which models like GPT-4 or Mixtral can generate effectively unlimited amounts of text. This shift has given rise to Knowledge Distillation: using a massive "Teacher" LLM to generate synthetic datasets, which are then used to train smaller, efficient "Student" models (such as BERT or DistilBERT) for specific tasks. ...
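To make the Teacher/Student workflow concrete, here is a minimal, hypothetical sketch of the distillation loop. The `teacher_generate` function is a placeholder for a real LLM call (e.g., prompting GPT-4 with "Write a {label} movie review"), and a TF-IDF + logistic regression classifier stands in for a fine-tuned student such as DistilBERT; none of these names or choices come from the paper itself.

```python
# Minimal sketch of the teacher -> student distillation loop described above.
# The teacher call is a stub (in practice you would prompt an LLM API);
# the student is a small scikit-learn classifier standing in for a
# fine-tuned BERT/DistilBERT.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

LABELS = ["positive", "negative"]

def teacher_generate(label: str, n: int) -> list[str]:
    """Hypothetical stand-in for a teacher-LLM call that returns
    n synthetic examples for the requested class label."""
    canned = {
        "positive": ["A joyful, moving film.", "Loved every minute of it."],
        "negative": ["A dull, lifeless slog.", "I want those two hours back."],
    }
    return (canned[label] * n)[:n]

# 1. The teacher generates a synthetic labeled dataset.
texts, labels = [], []
for label in LABELS:
    samples = teacher_generate(label, n=4)
    texts.extend(samples)
    labels.extend([label] * len(samples))

# 2. The student trains on the synthetic pairs.
student = make_pipeline(TfidfVectorizer(), LogisticRegression())
student.fit(texts, labels)

print(student.predict(["An absolute delight from start to finish."]))
```

In practice the student would be a fine-tuned transformer, and the generation step is exactly where the diversity problem the article examines appears: left to itself, the teacher tends to repeat similar phrasings across samples.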