[Learning Sparsity for Effective and Efficient Music Performance Question Answering 🔗](https://arxiv.org/abs/2506.01319)

Less is More: How Sparsity Solves the Complexity of Music Audio-Visual QA

Imagine standing in the middle of a crowded jazz club. The drummer is keeping a complex beat, the bassist is walking through a progression, the pianist is improvising, and the crowd is murmuring. If someone asked you, “How many instruments are playing?” or “Is the saxophone playing right now?”, your brain wouldn’t process every single photon of light or every microsecond of sound pressure. Instead, you would filter out the noise. You would focus on key visual cues—the glint of the saxophone, the movement of the drummer’s sticks—and isolate specific audio frequencies. You intuitively discard the redundancy to answer the question. ...

2025-06 · 8 min · 1663 words
[Learning Auxiliary Tasks Improves Reference-Free Hallucination Detection in Open-Domain Long-Form Generation 🔗](https://arxiv.org/abs/2505.12265)

Can LLMs Detect Their Own Lies? Introducing RATE-FT for Long-Form Hallucination Detection

Large Language Models (LLMs) have transformed the way we interact with information, writing everything from code to creative essays. However, they suffer from a persistent and dangerous flaw: hallucination. This occurs when a model generates content that sounds plausible and authoritative but conflicts with real-world facts. While detecting hallucinations in short answers (like “Capital of France?”) is relatively well-studied, the challenge grows exponentially with open-domain long-form generation. When an LLM writes a biography, a historical summary, or a complex explanation, it might weave three truths with one subtle lie. Detecting that single fabrication within a paragraph of accurate text is incredibly difficult. ...

2025-05 · 8 min · 1583 words
[LLMs syntactically adapt their language use to their conversational partner 🔗](https://arxiv.org/abs/2503.07457)

Do AI Models Subconsciously Copy Your Grammar? Inside Syntactic Adaptation in LLMs

Have you ever noticed that after hanging out with a specific friend for a while, you start talking like them? You might pick up their slang, match their speaking speed, or even start structuring your sentences the way they do. In linguistics and psychology, this is known as alignment. It is a fundamental part of human communication—we subconsciously adapt our language to our conversational partners to build rapport and ensure we are understood. ...

2025-03 · 8 min · 1616 words
[LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks 🔗](https://arxiv.org/abs/2406.18403)

Can We Trust AI to Grade AI? A Deep Dive into JUDGE-BENCH

In the rapidly evolving world of Natural Language Processing (NLP), we are facing a bottleneck. We can generate text faster than ever before, but evaluating the quality of that text remains a slow, expensive, and difficult process. Traditionally, the gold standard for evaluation has been human judgment. If you want to know if a translation is good or if a chatbot is helpful, you ask a human. ...

2024-06 · 8 min · 1594 words
[LLM as Entity Disambiguator for Biomedical Entity-Linking 🔗](https://aclanthology.org/2025.acl-short.25.pdf)

Can LLMs Fix Biomedical Entity Linking? A New State-of-the-Art Approach

The biomedical field is notoriously difficult when it comes to text processing. Consider the term “diabetes.” In a general conversation, we know what this means. But in a medical paper, does it refer to Diabetes Mellitus? Diabetes Insipidus? Nephrogenic Diabetes Insipidus? Or perhaps a specific experimental induction of the disease in a lab rat? This is the challenge of Biomedical Entity Linking (EL). It isn’t enough to just find the word (Named Entity Recognition); we have to map that word to a specific, unique identifier (CUI) in a massive knowledge base like UMLS or MeSH. ...

8 min · 1506 words
[LAMB: A Training-Free Method to Enhance the Long-Context Understanding of SSMs via Attention-Guided Token Filtering 🔗](https://aclanthology.org/2025.acl-short.96.pdf)

Curing the Amnesia of State Space Models: Inside the LAMB Architecture

The landscape of Large Language Models (LLMs) is currently dominated by Transformers. However, as anyone who has tried to feed a textbook into a standard chatbot knows, Transformers have a weakness: the “quadratic bottleneck.” As the length of the input text increases, the computational cost explodes. This has led to a surge of interest in State Space Models (SSMs), such as Mamba. SSMs promise a “sub-quadratic” alternative, theoretically allowing models to process massive sequences efficiently. ...

9 min · 1885 words
[KNOWSHIFTQA: How Robust are RAG Systems when Textbook Knowledge Shifts in K-12 Education? 🔗](https://arxiv.org/abs/2412.08985)

When Textbooks Change: Can AI Unlearn What It Knows?

Imagine a student asks an AI tutor, “Which country has the largest population in the world?” If the AI relies solely on its internal training data (likely cut off around 2022 or 2023 for many models), it might confidently answer, “China.” However, India surpassed China in population in mid-2023. If that student is studying from an up-to-date geography textbook that explicitly states “India is the most populous country,” the AI’s answer is now wrong within the context of the classroom. ...

2024-12 · 8 min · 1500 words
[Internal and External Impacts of Natural Language Processing Papers 🔗](https://arxiv.org/abs/2505.16061)

Beyond the Ivory Tower: How NLP Research Impacts the Real World

“Is ACL an AI conference?” This question, recently posed by opinion leaders in the field, highlights an ongoing identity crisis within Natural Language Processing (NLP). As Large Language Models (LLMs) like GPT-4 and Claude dominate the headlines, the line between computational linguistics and general artificial intelligence has blurred. But there is a more pressing question than how researchers define themselves: How does the world define NLP? For students and aspiring researchers entering this field, it is easy to view academia as an “Ivory Tower”—a closed loop where researchers write papers only to be cited by other researchers. However, the reality is far more dynamic. NLP research bleeds into technology patents, influences government policy, and sparks debates in the media. ...

2025-05 · 8 min · 1593 words
[Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar 🔗](https://arxiv.org/abs/2505.19599)

Lost in Translation? Why Better Tokenization is Key to Japanese Grammar in LLMs

We often assume that as Large Language Models (LLMs) grow larger and train on more multilingual data, their grasp of grammar naturally improves across all languages. We look at benchmarks like MMLU and see impressive reasoning scores, leading us to believe the basics are solved. But are they? It turns out that even state-of-the-art models like GPT-4o and Llama 3 struggle with specific, nuanced grammatical rules in languages other than English. This isn’t just about hallucinating facts; it is about failing to adhere to the fundamental structural rules of a language. ...

2025-05 · 7 min · 1426 words
[Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution's Characteristics 🔗](https://arxiv.org/abs/2506.00637)

Reading the Room: How the Shape of Probability Distributions Reveals Model Confidence

We have all experienced the “hallucination” problem with Large Language Models (LLMs). You ask a model for a fact, and it confidently states something completely incorrect. This isn’t just annoying; in fields like medicine, law, or automated decision-making, it can be dangerous. To mitigate this, we rely on confidence scores. Ideally, when a model is right, it should report a high confidence score. When it is unsure or likely to be wrong, it should report a low score. This relationship is known as calibration. If a model says it is 90% confident, it should be correct 90% of the time. ...
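To make the idea concrete, here is a minimal, hypothetical sketch (not the paper's method) of one common way to score calibration, Expected Calibration Error (ECE): predictions are grouped into confidence bins, and each bin's average confidence is compared with its empirical accuracy.

```python
# Minimal illustrative sketch (not the paper's method): measuring calibration
# with Expected Calibration Error (ECE) over hypothetical confidence scores.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by how many predictions fall into each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy data: a model that claims ~90% confidence but is right only 70% of the
# time is poorly calibrated, which shows up as a larger ECE.
conf = [0.90, 0.92, 0.88, 0.91, 0.90, 0.89, 0.93, 0.90, 0.91, 0.90]
hits = [1,    1,    0,    1,    0,    1,    1,    0,    1,    1]
print(f"ECE: {expected_calibration_error(conf, hits):.3f}")
```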

2025-06 · 8 min · 1574 words
[Improving Parallel Sentence Mining for Low-Resource and Endangered Languages 🔗](https://aclanthology.org/2025.acl-short.17.pdf)

Breaking the Data Bottleneck: How to Mine Parallel Sentences for Endangered Languages

Imagine trying to learn a language that has no dictionary, no textbook, and no Google Translate support. Now, imagine trying to teach a computer to translate that language. This is the reality for thousands of “low-resource” and endangered languages around the world. Modern Machine Translation (MT) systems, like the ones behind the translation tools we use daily, are data-hungry beasts. They learn by looking at millions of examples of translated sentences—known as parallel data. For example, to learn English-to-Spanish translation, the model analyzes huge datasets containing English sentences paired with their Spanish equivalents. ...

10 min · 1950 words
[Improving Fairness of Large Language Models in Multi-document Summarization 🔗](https://arxiv.org/abs/2506.07479)

FairPO: Teaching LLMs to Summarize Diverse Opinions Fairly

Imagine you are shopping online for a new laptop. You scroll down to the reviews to gauge public opinion. There are 50 reviews: 25 praise the battery life, and 25 complain about the screen resolution. You don’t have time to read them all, so you ask an AI assistant to summarize them. The AI returns a summary: “Users report that the screen resolution is disappointing and grainy.” Technically, the AI didn’t lie—people did say that. However, by omitting the 25 positive reviews about the battery, the summary is fundamentally unfair. It misrepresents the collective opinion of the document set. When Large Language Models (LLMs) perform Multi-Document Summarization (MDS), this type of bias can significantly impact decision-making in e-commerce, political analysis, and media monitoring. ...

2025-06 · 9 min · 1834 words
[Human Alignment: How Much Do We Adapt to LLMs? 🔗](https://aclanthology.org/2025.acl-short.47.pdf)

Are We Thinking Like Machines? How Humans Subconsciously Adapt to AI

In the rapidly evolving landscape of artificial intelligence, a massive amount of energy is spent on “alignment.” Researchers, ethicists, and engineers are constantly working to ensure that Large Language Models (LLMs) like GPT-4 align with human values, instructions, and safety guidelines. We want the AI to understand us, speak like us, and serve our needs. But there is a flip side to this coin that is rarely explored: Are we aligning with them? ...

10 min · 1947 words
[Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress 🔗](https://arxiv.org/abs/2506.19571)

Beating the Gold Standard: Have AI Metrics Surpassed Human Evaluators?

In the field of Artificial Intelligence, specifically Natural Language Processing (NLP), we often view human performance as the ultimate “ceiling” to reach. Whether it is playing Chess, Go, or translating text, “Human Parity” is the holy grail. Once an AI system performs as well as a human, we consider the problem largely solved. But there is a paradox emerging in the sub-field of Machine Translation (MT) Evaluation. We use automated metrics (algorithms that grade translations) to speed up research because human evaluation is slow and expensive. To know if these metrics work, we compare them against a “Gold Standard”—human judgment. ...
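As a concrete, generic illustration of that comparison (a meta-evaluation sketch, not this paper's protocol), metric scores are commonly rank-correlated with human judgments using a statistic such as Kendall's tau:

```python
# Generic meta-evaluation sketch (not this paper's protocol): rank-correlate an
# automatic metric's scores with human judgments over the same translations.
from scipy.stats import kendalltau

# Hypothetical per-segment scores for six machine-translated sentences.
human_scores  = [0.90, 0.40, 0.75, 0.20, 0.60, 0.85]   # human judgments
metric_scores = [0.88, 0.35, 0.70, 0.30, 0.55, 0.80]   # automatic metric

tau, p_value = kendalltau(human_scores, metric_scores)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
# A tau close to 1.0 means the metric ranks translations much like humans do.
```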

2025-06 · 10 min · 2128 words
[Grounded, or a Good Guesser? A Per-Question Balanced Dataset to Separate Blind from Grounded Models for Embodied Question Answering 🔗](https://aclanthology.org/2025.acl-short.11.pdf)

When LLMs Cheat: Why We Need Per-Question Balancing in Embodied AI

Imagine you are designing a search-and-rescue robot. You send it into a collapsed building and ask, “Is there anyone behind that concrete slab?” If the robot answers “No” because it used its camera, scanned the area, and saw nothing, that is a success. But what if the robot answered “No” simply because its training data suggests that statistically, people are rarely found behind concrete slabs? The latter is a disaster waiting to happen. The robot isn’t looking; it’s guessing based on prior knowledge. ...

7 min · 1337 words
[GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction 🔗](https://arxiv.org/abs/2505.10939)

Less is More: How Subtracting General Knowledge Improves Modular LLMs

The current landscape of Artificial Intelligence is dominated by a “bigger is better” mentality. We train massive Large Language Models (LLMs) on trillions of tokens, hoping they learn everything from Python coding to French poetry. However, this monolithic approach has a downside: when we want the model to learn a new task, we often have to retrain or fine-tune the whole system—or at least a significant part of it. This is computationally expensive and rigid. ...

2025-05 · 8 min · 1650 words
[FocalPO: Enhancing Preference Optimization by Focusing on Correct Preference Rankings 🔗](https://arxiv.org/abs/2501.06645)

Why Being Easy Pays Off: How FocalPO Improves LLM Alignment by Focusing on Correct Rankings

Reinforcement Learning from Human Feedback (RLHF) has established itself as the standard for aligning Large Language Models (LLMs) with human intent. While the traditional PPO-based pipeline was effective, it was also computationally expensive and unstable. The arrival of Direct Preference Optimization (DPO) changed the landscape by treating the language model itself as the reward model, streamlining the alignment process significantly. However, DPO is not without its flaws. The standard intuition in machine learning is to focus on “hard” examples—the cases where the model makes mistakes. DPO’s loss function inherently places high weight on preference pairs where the model incorrectly ranks the rejected response higher than the chosen one. ...
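To make that weighting visible, here is a hedged PyTorch sketch of the standard DPO loss together with an illustrative focal-style modulating factor that emphasizes pairs the model already ranks correctly; the exact form of the factor here is an assumption for illustration, not FocalPO's published formula.

```python
# Hedged sketch: standard DPO loss plus an illustrative focal-style modulating
# factor. The precise factor used by FocalPO may differ; treat this as a
# conceptual illustration rather than the paper's implementation.
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps,
                   beta=0.1, gamma=2.0, focal=True):
    # Implicit reward margin: how much more the policy (relative to the
    # reference model) prefers the chosen response over the rejected one.
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    losses = -F.logsigmoid(margin)  # standard DPO loss per preference pair
    if focal:
        # Illustrative modulating factor (an assumption): up-weight pairs the
        # model already ranks correctly (sigmoid(margin) close to 1) and
        # down-weight incorrectly ranked pairs.
        losses = torch.sigmoid(margin).detach() ** gamma * losses
    return losses.mean()

# Toy usage with made-up per-sequence log-probabilities for two pairs.
t = torch.tensor
print(dpo_style_loss(t([-10.0, -12.0]), t([-11.0, -11.5]),
                     t([-10.5, -12.2]), t([-10.8, -11.4])))
```

The modulating-factor idea is borrowed from focal loss but applied in reverse: rather than letting mis-ranked pairs dominate the gradient, it shifts effort toward pairs whose ranking the model already gets right.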

2025-01 · 7 min · 1467 words
[Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering 🔗](https://arxiv.org/abs/2506.00806)

Thinking Fast and Slow in AI: How FOCUS Optimizes Visual Question Answering

Imagine you are looking at a picture of a clear blue sky. If I ask you, “What color is the sky?”, you answer instantly. You don’t need to squint, search, or think hard. It is intuitive. Now, imagine a picture of a crowded “Where’s Waldo?” scene. If I ask, “Is Waldo holding a cane?”, your brain shifts gears. You stop scanning the whole image generally and start looking for specific features—red stripes, a hat, glasses. You deliberately ignore the distractions and focus on the target. ...

2025-06 · 9 min · 1712 words
[FEAT: A Preference Feedback Dataset through a Cost-Effective Auto-Generation and Labeling Framework for English AI Tutoring 🔗](https://arxiv.org/abs/2506.19325)

Scaling AI Tutors: How FEAT Generates High-Quality Feedback without Breaking the Bank

Imagine a classroom where every student has a personal tutor—one that is infinitely patient, available 24/7, and knows exactly how to guide a student from a wrong answer to the right one without just giving it away. This has been the “North Star” of educational technology for decades. With the rise of Large Language Models (LLMs), this dream seems closer than ever. However, there is a catch. While LLMs are great at chatting, they aren’t naturally perfect teachers. To make them effective, they need to be trained on high-quality pedagogical data. Specifically, they need to know what good feedback looks like compared to bad feedback. ...

2025-06 · 7 min · 1411 words
[Enhancing Retrieval Systems with Inference-Time Logical Reasoning 🔗](https://arxiv.org/abs/2503.17860)

When Vector Search Fails: Teaching Retrieval Systems to Think Logically

If you have ever built a search engine or a RAG (Retrieval-Augmented Generation) pipeline, you are likely familiar with the magic of vector embeddings. You take a user’s query, squish it into a dense vector, and search for documents that are “close” to that vector in high-dimensional space. It is efficient, scalable, and generally works well for semantic similarity. But there is a catch. ...
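For readers less familiar with that retrieval step, here is a minimal, self-contained sketch of dense retrieval with cosine similarity over made-up embeddings (a real system would use a learned encoder); note how purely geometric “closeness” can rank a document that explicitly excludes what the query asks for almost as high as one that matches it.

```python
# Minimal dense-retrieval sketch with made-up embeddings. A real pipeline
# would encode queries and documents with a learned embedding model.
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k=2):
    """Return the indices and scores of the k documents whose embeddings are
    closest to the query embedding under cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k], scores

docs = ["Paper on transformers for vision",
        "Paper on transformers, excluding vision applications",
        "Survey of convolutional networks"]
# Toy 4-d embeddings: the first two documents land very close together even
# though the second explicitly excludes what the query is about.
doc_vecs = np.array([[0.90, 0.10, 0.00, 0.10],
                     [0.88, 0.12, 0.02, 0.10],
                     [0.10, 0.90, 0.10, 0.00]])
query_vec = np.array([0.85, 0.15, 0.00, 0.10])  # "transformers for vision"

top_k, scores = cosine_top_k(query_vec, doc_vecs)
for i in top_k:
    print(f"{scores[i]:.3f}  {docs[i]}")
```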

2025-03 · 9 min · 1806 words