When the Teacher is Biased: How Spurious Correlations Break Uncertainty Evaluation in LLMs
Large Language Models (LLMs) have a well-known tendency to “hallucinate”—producing fluent but factually incorrect information. To mitigate this, researchers rely on Uncertainty Quantification (UQ). The goal of UQ is simple: we want the model to tell us when it is unsure, so we can flag those responses for human review or discard them entirely. But how do we know whether a UQ method is actually working? We have to test it. Typically, we generate an answer, ask the UQ method for a confidence score, and then check whether the answer is actually correct. If the UQ method assigns low confidence to wrong answers and high confidence to right answers, it works. ...
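The evaluation loop described above is usually scored with a ranking metric such as AUROC: the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen incorrect one. A minimal stdlib-only sketch (the function name `uq_auroc` and the toy data are illustrative, not from the paper):

```python
def uq_auroc(confidences, correct):
    """Probability that a correct answer outranks an incorrect one.

    1.0 means the UQ method separates right from wrong perfectly;
    0.5 means it is no better than chance. Ties count as half a win.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect answers")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: confident on right answers, unsure on wrong ones.
conf = [0.9, 0.8, 0.3, 0.2]
right = [True, True, False, False]
print(uq_auroc(conf, right))  # → 1.0 (perfect separation)
```

Note that this score is only meaningful if the correctness labels themselves are trustworthy — which is exactly the assumption the article goes on to question.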