[Unique Hard Attention: A Tale of Two Sides 🔗](https://arxiv.org/abs/2503.14615)

Left vs. Right: How a Trivial Tiebreaking Choice Defines Transformer Expressivity

If you have been following the explosion of theoretical research into Transformers, you know that understanding what these models can actually compute is just as important as watching their loss curves go down. We often idealize the Transformer architecture to study it mathematically. One common simplification is Unique Hard Attention (UHA). In standard “soft” attention (like in GPT-4), the model attends to all previous tokens with varying weights. In UHA, the model attends to exactly one token—the one with the highest attention score. ...
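To make the contrast concrete, here is a minimal NumPy sketch (my own illustration, not the paper's formalism) of soft attention versus unique hard attention, with the leftmost/rightmost tiebreak that the title alludes to made explicit:

```python
import numpy as np

def soft_attention(scores, values):
    # Standard attention: a softmax-weighted mixture of all value vectors.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def unique_hard_attention(scores, values, tiebreak="leftmost"):
    # UHA: attend to exactly one position, the argmax of the attention scores.
    # When several positions tie for the maximum, the tiebreak rule decides
    # which side wins; that seemingly trivial choice is what the paper studies.
    tied = np.flatnonzero(scores == scores.max())
    idx = tied[0] if tiebreak == "leftmost" else tied[-1]
    return values[idx]

scores = np.array([0.3, 0.9, 0.9, 0.1])   # positions 1 and 2 tie for the max
values = np.eye(4)                        # one-hot values, for readability
print(soft_attention(scores, values))                      # a blend of all rows
print(unique_hard_attention(scores, values, "leftmost"))   # exactly row 1
print(unique_hard_attention(scores, values, "rightmost"))  # exactly row 2
```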

2025-03 · 8 min · 1687 words
[TREECUT: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation 🔗](https://arxiv.org/abs/2502.13442)

When LLMs Can't Say "I Don't Know": Inside the TREECUT Dataset

If you have used recent Large Language Models (LLMs) like GPT-4o or o3-mini, you know they have become incredibly proficient at mathematics. On standard benchmarks like GSM8K (grade school math) or the more advanced MATH dataset, these models often achieve near-human or even superhuman performance. They can solve complex equations, reason through multi-step word problems, and write out their “chain of thought” to justify the answer. But there is a catch. These models are eager to please. In fact, they are often too eager. ...

2025-02 · 8 min · 1554 words
[Transferring Textual Preferences to Vision-Language Understanding through Model Merging 🔗](https://arxiv.org/abs/2502.13487)

Frankenstein's Judge: How Merging Models Creates Better Vision-Language Evaluators Without Training

In the rapidly evolving world of Artificial Intelligence, we have become accustomed to models that can write poetry, code, and even describe images with uncanny accuracy. Large Vision-Language Models (LVLMs), like GPT-4V or Llama-Vision, have revolutionized how machines perceive the world. However, there is a distinct gap between generating content and evaluating it. Creating a model that can generate a caption for an image is one thing; creating a model that can look at five different captions and robustly judge which one is the most helpful, accurate, and safe is entirely different. This is the domain of Reward Models (RMs), the silent engines behind the Reinforcement Learning from Human Feedback (RLHF) process that aligns AI with human intent. ...

2025-02 · 9 min · 1795 words
[Towards LLM-powered Attentive Listener: A Pragmatic Approach through Quantity Self-Repair 🔗](https://aclanthology.org/2025.acl-short.1.pdf)

Fixing the Hollow Bot: Teaching LLMs to Listen Like Humans via Self-Repair

We have all been there. You are venting to a chatbot—perhaps testing its capabilities or just looking for a sounding board—and you say, “I’m really stressed about my workload.” The bot replies, “I am sorry to hear you are stressed about your workload. Stress can be difficult.” Technically, the sentence is correct. Grammatically, it is perfect. But emotionally? It feels hollow. It feels like a template. It lacks the subtle “attentiveness” of a human listener who knows exactly when to ask for more detail and when to just offer a simple, “Man, that sounds rough.” ...

7 min · 1347 words
[Towards Geo-Culturally Grounded LLM Generations 🔗](https://arxiv.org/abs/2502.13497)

Can RAG Teach LLMs Culture? The Battle Between Knowledge Bases and Google Search

Large Language Models (LLMs) are often celebrated as universal tools, capable of translating languages and answering questions about the world. However, anyone who has used these models extensively knows that “universal” often really means “Western.” When you ask an LLM to tell a story about a family dinner, the default setting usually mirrors North American or Western European norms. The food, the etiquette, and the social dynamics often fail to resonate with users from Ethiopia, Indonesia, or Mexico. This isn’t just a matter of flavor; it’s a matter of utility and representation. LLMs are prone to stereotyping, erasing cultural nuances, or simply hallucinating facts about non-Western cultures because their training data is heavily skewed toward English-speaking, Western internet content. ...

2025-02 · 9 min · 1815 words
[TigerLLM - A Family of Bangla Large Language Models 🔗](https://arxiv.org/abs/2503.10995)

TigerLLM: How High-Quality Data Can Teach Small Models to Roar in Bangla

The current landscape of Artificial Intelligence is experiencing a massive linguistic disparity. While Large Language Models (LLMs) like GPT-4 and Claude have revolutionized how we interact with technology, their prowess is heavily skewed toward high-resource languages—primarily English. For the 237 million native speakers of Bangla—the fifth most spoken language in the world—this gap is palpable. While proprietary giants like GPT-4 perform reasonably well, they are closed systems. Meanwhile, open-source attempts to build “Bangla LLMs” have largely struggled, often failing to outperform even the base models they were built upon. ...

2025-03 · 7 min · 1305 words
[The Role of Abstract Representations and Observed Preferences in the Ordering of Binomials in Large Language Models 🔗](https://aclanthology.org/2025.acl-short.55.pdf)

Do LLMs Follow Rules or Just Statistics? Investigating Binomial Ordering

Have you ever stopped to wonder why you say “bread and butter” rather than “butter and bread”? Or why “ladies and gentlemen” sounds natural, while “gentlemen and ladies” feels slightly jarring? In linguistics, these pairings are called binomials. They consist of two nouns joined by a conjunction, usually “and.” While the meaning of “salt and pepper” is identical to “pepper and salt,” native English speakers have strong, often rigid preferences for the ordering of these words. ...

9 min · 1819 words
[That doesn't sound right: Evaluating speech transcription quality in field linguistics corpora 🔗](https://aclanthology.org/2025.acl-short.49.pdf)

Cleaning Up the Noise: How to Improve ASR for Endangered Languages with Automated Quality Control

Imagine trying to teach a computer to understand a language spoken by only a few hundred people. You don’t have millions of hours of perfectly transcribed YouTube videos or audiobooks. Instead, you have a hard drive full of field recordings collected by linguists over the last twenty years: interviews recorded in windy villages, story-telling sessions interrupted by roosters, and transcriptions that are often incomplete or mixed with research notes. ...

8 min · 1628 words
[SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement 🔗](https://arxiv.org/abs/2504.03561)

SynWorld: Teaching AI Agents to "Dream" Their Way to Mastery

Imagine you are trying to learn how to fly a plane. You could read the flight manual, memorize every switch and gauge, and hope for the best when you get in the cockpit. Or, you could spend hours in a flight simulator, facing storms, engine failures, and tricky landings before ever leaving the ground. For Large Language Models (LLMs) acting as autonomous agents, the “learning” process has historically looked a lot like the first option. Agents—AI systems designed to use tools, browse the web, and execute tasks—often rely on static text descriptions (documentation) to understand how to act. When they encounter a new environment or a complex tool they haven’t seen before, they struggle. The manual might be outdated, the task might require a sequence of steps not described in the text, or the agent simply might not “understand” the nuance of the tool until it tries it. ...

2025-04 · 8 min · 1625 words
[Subword models struggle with word learning, but surprisal hides it 🔗](https://arxiv.org/abs/2502.12835)

Do LLMs Know What a Word Is? The Hidden Flaw in Subword Tokenization

When a child learns a language, they don’t start by speaking in complex, grammatically correct sentences. They start with words. A baby learns to recognize “doggie” or “ball” as distinct, meaningful units long before they understand how to use them in a sentence like “The doggie plays with the ball.” In developmental psychology, word learning precedes syntax. But do Large Language Models (LLMs) learn the same way? We often treat LLMs as proxies for understanding human language acquisition, yet a recent paper titled “Subword models struggle with word learning, but surprisal hides it” by Bastian Bunzeck and Sina Zarrieß suggests we might be making a mistake. The researchers found that the most common way LLMs process text—subword tokenization—fundamentally changes how and when they learn words, making their learning process drastically different from humans. ...
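As a toy illustration of the mechanism in question (my own example, not the authors' setup), here is a greedy longest-match splitter over a made-up vocabulary, showing how a whole word dissolves into pieces the model treats as separate units:

```python
# A toy greedy longest-match subword tokenizer over a made-up vocabulary.
# Real BPE/WordPiece tokenizers are more involved, but the effect is similar:
# the model never sees "doggie" as a single unit, only as pieces.
def subword_tokenize(word, vocab):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])             # fall back to one character
            i += 1
    return pieces

vocab = {"dog", "gie", "play", "s", "ball"}
print(subword_tokenize("doggie", vocab))   # ['dog', 'gie']
print(subword_tokenize("plays", vocab))    # ['play', 's']
print(subword_tokenize("ball", vocab))     # ['ball'] (a whole-word token)
```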

2025-02 · 8 min · 1687 words
[State-offset Tuning: State-based Parameter-Efficient Fine-Tuning for State Space Models 🔗](https://arxiv.org/abs/2503.03499)

Why Prompts Fail on Mamba: Introducing State-offset Tuning

If you have been following the recent developments in sequence modeling, you have likely heard of Mamba and State Space Models (SSMs). These architectures have emerged as powerful alternatives to Transformers, promising to solve the dreaded quadratic computational cost that plagues standard Attention mechanisms. However, as we shift from Transformers to SSMs, we are discovering a friction point: our existing toolbox doesn’t always work. Specifically, the techniques we use to fine-tune Large Language Models (LLMs) efficiently—known as Parameter-Efficient Fine-Tuning (PEFT)—often fail when applied to Mamba. ...

2025-03 · 7 min · 1433 words
[Spurious Correlations and Beyond: Understanding and Mitigating Shortcut Learning in SDOH Extraction with Large Language Models 🔗](https://arxiv.org/abs/2506.00134)

When AI Jumps to Conclusions: Shortcut Learning and Bias in Clinical Text Analysis

Large Language Models (LLMs) are rapidly entering the healthcare space. We use them to summarize patient visits, answer medical questions, and extract structured data from messy clinical notes. The promise is enormous: automated systems that can read through thousands of history files to identify patients at risk due to Social Determinants of Health (SDOH). However, a recent study reveals a critical flaw in how these models “reason.” It turns out that models like Llama and Qwen often don’t read clinical notes the way a human would. Instead, they rely on shortcut learning—superficial patterns that allow them to guess the answer without understanding the context. ...

2025-06 · 8 min · 1581 words
[Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs 🔗](https://arxiv.org/abs/2505.19155)

Speeding Up Video-LLMs for Free: Understanding Sparse-to-Dense Decoding

The capabilities of Large Language Models (LLMs) have expanded dramatically in recent years, moving from text-only processing to multimodal understanding. Among these advancements, Video-LLMs stand out for their ability to watch, analyze, and answer questions about video content. However, this capability comes with a significant computational cost. Processing video is fundamentally different from processing static images. A single video can be decomposed into hundreds or thousands of frames, and when these frames are converted into tokens, the resulting sequence length can be massive—often exceeding 100,000 tokens for a short clip. ...
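A quick back-of-envelope calculation makes the scale concrete (the sampling rate and tokens-per-frame below are assumed, illustrative values, not figures from the paper):

```python
# Back-of-envelope token count for a short clip. The sampling rate and
# tokens-per-frame are illustrative assumptions; real Video-LLMs differ.
clip_seconds = 120         # a two-minute clip
frames_per_second = 2      # frame sampling rate
tokens_per_frame = 576     # e.g. a 24 x 24 patch grid per frame

frames = clip_seconds * frames_per_second        # 240 frames
visual_tokens = frames * tokens_per_frame        # 138,240 tokens
print(f"{frames} frames -> {visual_tokens:,} visual tokens before any text")
```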

2025-05 · 8 min · 1642 words
[Should I Believe in What Medical AI Says? A Chinese Benchmark for Medication Based on Knowledge and Reasoning 🔗](https://aclanthology.org/2025.acl-short.91.pdf)

Can AI Be Your Pharmacist? Deconstructing the ChiDrug Benchmark

Imagine you are feeling unwell. You have a headache, a slight fever, and a history of asthma. You open a chat window with a powerful AI assistant and ask, “What can I take for this?” The AI confidently recommends a specific combination of pills. Ideally, this interaction saves you a trip to the doctor. But realistically, we face a terrifying question: Is the AI hallucinating? Large Language Models (LLMs) have demonstrated incredible prowess in passing medical licensing exams and summarizing patient notes. However, medication management is a high-stakes game where “mostly correct” isn’t good enough. A hallucination here doesn’t just mean a weird sentence; it means recommending a dosage that is toxic or a drug interaction that could be fatal. ...

9 min · 1715 words
[Seeking Rational Demonstrations for Large Language Models: A Domain Generalization Approach to Unsupervised Cross-Domain Keyphrase Generation 🔗](https://aclanthology.org/2025.acl-short.31.pdf)

Bridging the Gap: How Domain Generalization Helps LLMs Master Keyphrases in New Fields

In the vast ocean of digital information, Keyphrase Generation (KPG) acts as a critical lighthouse. It condenses lengthy documents into a few punchy, representative phrases that summarize the core content. This technology powers search engines, document clustering, and recommendation systems. Traditionally, training these models required massive datasets of documents paired with human-annotated keyphrases. This works perfectly fine for academic papers, where datasets like KP20k are abundant. But what happens when you need to generate keyphrases for a completely different domain—say, biomedical reports or news articles—where you have zero labeled data? ...

8 min · 1563 words
[ScanEZ: Integrating Cognitive Models with Self-Supervised Learning for Spatiotemporal Scanpath Prediction 🔗](https://aclanthology.org/2025.acl-short.89.pdf)

How AI Learns to Read Like Humans: Inside ScanEZ

Reading feels like a continuous, fluid process. Your eyes glide across this sentence, absorbing meaning instantly—or at least, that is how it feels. In reality, human reading is a jerky, erratic ballet. Your eyes make rapid movements called saccades, stopping briefly at specific points called fixations. You might skip a common word like “the,” dwell longer on a complex word like “spatiotemporal,” or even jump backward (regress) to re-read a confusing phrase. ...

9 min · 1882 words
[SELF-PERCEPT: Introspection Improves Large Language Models' Detection of Multi-Person Mental Manipulation in Conversations 🔗](https://arxiv.org/abs/2505.20679)

Can AI Detect Gaslighting? Using Self-Perception Theory to Spot Manipulation in Group Chats

Human communication is a labyrinth of subtext. While we often say exactly what we mean, there are dark corners of interaction where words are used as weapons—not through overt insults, but through subtle psychological maneuvering. This is the realm of mental manipulation: gaslighting, guilt-tripping, feigning innocence, and strategic shaming. For years, Natural Language Processing (NLP) has become adept at spotting explicit toxicity, like hate speech or profanity. However, detecting manipulation is infinitely harder. It relies on context, relationship dynamics, and intent. It’s not just about what is said, but why it is said and how it influences others. ...

2025-05 · 8 min · 1494 words
[Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias 🔗](https://arxiv.org/abs/2504.13677)

When the Teacher is Biased: How Spurious Correlations Break Uncertainty Evaluation in LLMs

Large Language Models (LLMs) have a well-known tendency to “hallucinate”—producing fluent but factually incorrect information. To mitigate this, researchers rely on Uncertainty Quantification (UQ). The goal of UQ is simple: we want the model to tell us when it is unsure, so we can flag those responses for human review or discard them entirely. But how do we know if a UQ method is actually working? We have to test it. Typically, we generate an answer, ask the UQ method for a confidence score, and then check if the answer is actually correct. If the UQ method assigns low confidence to wrong answers and high confidence to right answers, it works. ...
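That evaluation loop is simple enough to sketch in a few lines (the numbers and the choice of AUROC below are illustrative assumptions, not the paper's exact protocol):

```python
# Sketch of the usual UQ evaluation loop: collect (confidence, correctness)
# pairs and check how well confidence ranks right answers above wrong ones.
from sklearn.metrics import roc_auc_score

records = [
    # (confidence reported by the UQ method, 1 if the generated answer was correct)
    (0.92, 1), (0.35, 0), (0.81, 1), (0.40, 1), (0.15, 0), (0.77, 0),
]
confidences = [c for c, _ in records]
correct = [y for _, y in records]

# AUROC of 1.0 means confidence perfectly separates correct from incorrect
# answers; 0.5 means it is no better than chance.
print("AUROC:", round(roc_auc_score(correct, confidences), 3))
```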

2025-04 · 8 min · 1563 words
[Revisiting LLMs as Zero-Shot Time-Series Forecaster: Small Noise Can Break Large Models 🔗](https://arxiv.org/abs/2506.00457)

When Small Noise Breaks Large Models: A Reality Check on LLM Forecasting

In the current era of Artificial Intelligence, Large Language Models (LLMs) like GPT-4 and LLaMA have become the hammer for every nail. From writing code to analyzing legal documents, their generalization capabilities are nothing short of extraordinary. Recently, this excitement has spilled over into the domain of time-series forecasting—the art of predicting future numerical values based on past data. The premise is seductive: If an LLM can predict the next word in a sentence, can’t it predict the next number in a sequence? This has given rise to “Zero-Shot Forecasting,” where pre-trained LLMs are used to predict stock prices, weather, or energy consumption without any domain-specific training. ...
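The basic recipe is easy to sketch (an illustrative serialization of my own, not the paper's protocol): flatten the numeric history into text and ask the model to continue it like any other string.

```python
# Illustrative sketch of the zero-shot forecasting setup: serialize a numeric
# history as plain text and ask the LLM to continue the sequence.
history = [112.0, 118.5, 121.2, 119.8, 125.3]

prompt = (
    "Continue this series with the next 3 values, comma-separated:\n"
    + ", ".join(f"{x:.1f}" for x in history)
)
print(prompt)

# The model's text completion is then parsed back into floats and scored
# against the held-out ground truth.
completion = "127.1, 129.4, 130.0"             # a hypothetical LLM reply
forecast = [float(x) for x in completion.split(",")]
print(forecast)
```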

2025-06 · 8 min · 1689 words
[Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty? 🔗](https://arxiv.org/abs/2505.24778)

Can We Trust LLMs When They Say 'I'm Fairly Certain'? A Deep Dive into Epistemic Markers

As Large Language Models (LLMs) like GPT-4 and Claude integrate deeper into high-stakes fields—medicine, law, and financial analysis—the question of reliability becomes paramount. It is not enough for a model to give an answer; we need to know how confident it is in that answer. Traditionally, researchers have looked at numerical confidence scores (like log-probabilities or explicit percentage outputs). But let’s be honest: that is not how humans communicate. If you ask a doctor for a diagnosis, they rarely say, “I am 87.5% confident.” They use epistemic markers—phrases like “I am fairly certain,” “It is likely,” or “I’m not sure.” ...

2025-05 · 8 min · 1631 words