ACL 2025

[Zero-Shot Text-to-Speech for Vietnamese 🔗](https://arxiv.org/abs/2506.01322)

PhoAudiobook: Bridging the Gap in Vietnamese Zero-Shot Text-to-Speech

In the rapidly evolving world of Generative AI, Text-to-Speech (TTS) has moved far beyond the robotic voices of the past. We have entered the era of Zero-Shot TTS. This technology allows a system to clone a speaker’s voice using only a few seconds of reference audio, without ever having been trained on that speaker’s voice. While models like VALL-E and XTTS have revolutionized this space for English, low-resource languages often get left behind. ...

[WinSpot: GUI Grounding Benchmark with Multimodal Large Language Models 🔗](https://aclanthology.org/2025.acl-short.85.pdf)

Taming the Desktop: How WinSpot Brings AI Agents to Windows

Imagine a digital assistant that doesn’t just chat with you but actually uses your computer. You tell it, “Open the settings and change my default browser to Edge,” and it navigates the menus, finds the right buttons, and clicks them—just like a human would. This is the promise of Graphical User Interface (GUI) automation. While we have seen rapid progress in AI agents that can browse the web or navigate mobile apps, the desktop environment—specifically Windows—remains a massive, largely unconquered frontier. ...

[WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging 🔗](https://arxiv.org/abs/2502.18316)

Why "None of the Above" is the Ultimate Test for LLMs: Introducing WiCkeD

In the rapidly evolving world of Large Language Models (LLMs), we have hit a peculiar wall: the students are becoming smarter than the tests. Benchmarks that were once considered difficult, covering everything from high school chemistry to professional law exams, are now being “saturated.” Models are scoring so high that it is becoming increasingly difficult to distinguish a good model from a great one. When a benchmark gets saturated, researchers usually have two options. The first is to build a brand new, harder dataset from scratch. This is expensive and time-consuming, and it requires expert human annotation. The second option is to take existing benchmarks and try to make them harder. Recent attempts have involved adding more “distractors” (wrong answers) to questions to lower the odds of guessing correctly. However, generating plausible distractors that don’t accidentally confuse the right answer is a massive challenge in itself. ...

[Using Subtext to Enhance Generative IDRR 🔗](https://aclanthology.org/2025.acl-short.35.pdf)

Reading Between the Lines: How Subtext Enhances Implicit Discourse Relation Recognition in LLMs

When we communicate, we rarely say exactly what we mean. We rely on the listener to fill in the gaps. If someone says, “The new rate is payable Feb. 15,” and follows it with, “A record date hasn’t been set,” a human immediately understands the connection. There is a conflict here: the payment date is set, but the necessary record date isn’t. We infer a relationship of Concession (e.g., “however”). ...

[Unique Hard Attention: A Tale of Two Sides 🔗](https://arxiv.org/abs/2503.14615)

Left vs. Right: How a Trivial Tiebreaking Choice Defines Transformer Expressivity

If you have been following the explosion of theoretical research into Transformers, you know that understanding what these models can actually compute is just as important as watching their loss curves go down. We often idealize the Transformer architecture to study it mathematically. One common simplification is Unique Hard Attention (UHA). In standard “soft” attention (like in GPT-4), the model attends to all previous tokens with varying weights. In UHA, the model attends to exactly one token—the one with the highest attention score. ...
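The contrast between soft attention and UHA, and the role of the tiebreaking rule named in the title, can be illustrated with a minimal Python sketch. The function names, the `tiebreak` parameter, and the toy scores below are assumptions made for illustration; they are not taken from the paper.

```python
import math
from typing import Sequence


def soft_attention(scores: Sequence[float], values: Sequence[float]) -> float:
    """Softmax-weighted average: every position contributes with some weight."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def unique_hard_attention(
    scores: Sequence[float],
    values: Sequence[float],
    tiebreak: str = "left",  # hypothetical parameter: "left" or "right"
) -> float:
    """Return the value at exactly one position: the one with the highest score.

    When several positions share the highest score, `tiebreak` decides
    whether the leftmost or the rightmost of them is selected.
    """
    if not scores or len(scores) != len(values):
        raise ValueError("scores and values must be non-empty and equal in length")
    if tiebreak not in ("left", "right"):
        raise ValueError("tiebreak must be 'left' or 'right'")
    best = max(scores)
    candidates = [i for i, s in enumerate(scores) if s == best]
    index = candidates[0] if tiebreak == "left" else candidates[-1]
    return values[index]


# Toy example: positions 1 and 2 tie for the highest score.
scores = [1.0, 3.0, 3.0, 0.5]
values = [10.0, 20.0, 30.0, 40.0]

left_result = unique_hard_attention(scores, values, tiebreak="left")    # picks position 1
right_result = unique_hard_attention(scores, values, tiebreak="right")  # picks position 2
soft_result = soft_attention(scores, values)                            # blends all positions
```

With tied scores, the two tiebreaking rules select different positions and therefore return different values, while soft attention returns a blend of all four.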

[TREECUT: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation 🔗](https://arxiv.org/abs/2502.13442)

When LLMs Can't Say "I Don't Know": Inside the TREECUT Dataset

If you have used recent Large Language Models (LLMs) like GPT-4o or o3-mini, you know they have become incredibly proficient at mathematics. On standard benchmarks like GSM8K (grade school math) or the more advanced MATH dataset, these models often achieve near-human or even superhuman performance. They can solve complex equations, reason through multi-step word problems, and write out their “chain of thought” to justify the answer. But there is a catch. These models are eager to please. In fact, they are often too eager. ...

[Transferring Textual Preferences to Vision-Language Understanding through Model Merging 🔗](https://arxiv.org/abs/2502.13487)

Frankenstein's Judge: How Merging Models Creates Better Vision-Language Evaluators Without Training

In the rapidly evolving world of Artificial Intelligence, we have become accustomed to models that can write poetry, code, and even describe images with uncanny accuracy. Large Vision-Language Models (LVLMs), like GPT-4V or Llama-Vision, have revolutionized how machines perceive the world. However, there is a distinct gap between generating content and evaluating it. Creating a model that can generate a caption for an image is one thing; creating a model that can look at five different captions and robustly judge which one is the most helpful, accurate, and safe is entirely different. This is the domain of Reward Models (RMs), the silent engines behind the Reinforcement Learning from Human Feedback (RLHF) process that aligns AI with human intent. ...

[Towards LLM-powered Attentive Listener: A Pragmatic Approach through Quantity Self-Repair 🔗](https://aclanthology.org/2025.acl-short.1.pdf)

Fixing the Hollow Bot: Teaching LLMs to Listen Like Humans via Self-Repair

We have all been there. You are venting to a chatbot (perhaps testing its capabilities or just looking for a sounding board) and you say, “I’m really stressed about my workload.” The bot replies, “I am sorry to hear you are stressed about your workload. Stress can be difficult.” Technically, the sentence is correct. Grammatically, it is perfect. But emotionally? It feels hollow. It feels like a template. It lacks the subtle “attentiveness” of a human listener who knows exactly when to ask for more detail and when to just offer a simple, “Man, that sounds rough.” ...

[Towards Geo-Culturally Grounded LLM Generations 🔗](https://arxiv.org/abs/2502.13497)

Can RAG Teach LLMs Culture? The Battle Between Knowledge Bases and Google Search

Large Language Models (LLMs) are often celebrated as universal tools, capable of translating languages and answering questions about the world. However, anyone who has used these models extensively knows that “universal” often really means “Western.” When you ask an LLM to tell a story about a family dinner, the default setting usually mirrors North American or Western European norms. The food, the etiquette, and the social dynamics often fail to resonate with users from Ethiopia, Indonesia, or Mexico. This isn’t just a matter of flavor; it’s a matter of utility and representation. LLMs are prone to stereotyping, erasing cultural nuances, or simply hallucinating facts about non-Western cultures because their training data is heavily skewed toward English-speaking, Western internet content. ...

[TigerLLM - A Family of Bangla Large Language Models 🔗](https://arxiv.org/abs/2503.10995)

TigerLLM: How High-Quality Data Can Teach Small Models to Roar in Bangla

The current landscape of Artificial Intelligence shows a massive linguistic disparity. While Large Language Models (LLMs) like GPT-4 and Claude have revolutionized how we interact with technology, their prowess is heavily skewed toward high-resource languages, primarily English. For the 237 million native speakers of Bangla, the fifth most spoken language in the world, this gap is palpable. While proprietary giants like GPT-4 perform reasonably well, they are closed systems. Meanwhile, open-source attempts to build “Bangla LLMs” have largely struggled, often failing to outperform even the base models they were built upon. ...

[The Role of Abstract Representations and Observed Preferences in the Ordering of Binomials in Large Language Models 🔗](https://aclanthology.org/2025.acl-short.55.pdf)

Do LLMs Follow Rules or Just Statistics? Investigating Binomial Ordering

Have you ever stopped to wonder why you say “bread and butter” rather than “butter and bread”? Or why “ladies and gentlemen” sounds natural, while “gentlemen and ladies” feels slightly jarring? In linguistics, these pairings are called binomials. They consist of two nouns joined by a conjunction, usually “and.” While the meaning of “salt and pepper” is identical to “pepper and salt,” native English speakers have strong, often rigid preferences for the ordering of these words. ...

[That doesn't sound right: Evaluating speech transcription quality in field linguistics corpora 🔗](https://aclanthology.org/2025.acl-short.49.pdf)

Cleaning Up the Noise: How to Improve ASR for Endangered Languages with Automated Quality Control

Imagine trying to teach a computer to understand a language spoken by only a few hundred people. You don’t have millions of hours of perfectly transcribed YouTube videos or audiobooks. Instead, you have a hard drive full of field recordings collected by linguists over the last twenty years: interviews recorded in windy villages, story-telling sessions interrupted by roosters, and transcriptions that are often incomplete or mixed with research notes. ...

[SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement 🔗](https://arxiv.org/abs/2504.03561)

SynWorld: Teaching AI Agents to "Dream" Their Way to Mastery

Imagine you are trying to learn how to fly a plane. You could read the flight manual, memorize every switch and gauge, and hope for the best when you get in the cockpit. Or, you could spend hours in a flight simulator, facing storms, engine failures, and tricky landings before ever leaving the ground. For Large Language Models (LLMs) acting as autonomous agents, the “learning” process has historically looked a lot like the first option. Agents—AI systems designed to use tools, browse the web, and execute tasks—often rely on static text descriptions (documentation) to understand how to act. When they encounter a new environment or a complex tool they haven’t seen before, they struggle. The manual might be outdated, the task might require a sequence of steps not described in the text, or the agent simply might not “understand” the nuance of the tool until it tries it. ...

[Subword models struggle with word learning, but surprisal hides it 🔗](https://arxiv.org/abs/2502.12835)

Do LLMs Know What a Word Is? The Hidden Flaw in Subword Tokenization

When a child learns a language, they don’t start by speaking in complex, grammatically correct sentences. They start with words. A baby learns to recognize “doggie” or “ball” as distinct, meaningful units long before they understand how to use them in a sentence like “The doggie plays with the ball.” In developmental psychology, word learning precedes syntax. But do Large Language Models (LLMs) learn the same way? We often treat LLMs as proxies for understanding human language acquisition, yet a recent paper titled “Subword models struggle with word learning, but surprisal hides it” by Bastian Bunzeck and Sina Zarrieß suggests we might be making a mistake. The researchers found that subword tokenization, the most common way LLMs process text, fundamentally changes how and when they learn words, making their learning process drastically different from that of humans. ...

[State-offset Tuning: State-based Parameter-Efficient Fine-Tuning for State Space Models 🔗](https://arxiv.org/abs/2503.03499)

Why Prompts Fail on Mamba: Introducing State-offset Tuning

If you have been following the recent developments in sequence modeling, you have likely heard of Mamba and State Space Models (SSMs). These architectures have emerged as powerful alternatives to Transformers, promising to solve the dreaded quadratic computational cost that plagues standard Attention mechanisms. However, as we shift from Transformers to SSMs, we are discovering a friction point: our existing toolbox doesn’t always work. Specifically, the techniques we use to fine-tune Large Language Models (LLMs) efficiently—known as Parameter-Efficient Fine-Tuning (PEFT)—often fail when applied to Mamba. ...

[Spurious Correlations and Beyond: Understanding and Mitigating Shortcut Learning in SDOH Extraction with Large Language Models 🔗](https://arxiv.org/abs/2506.00134)

When AI Jumps to Conclusions: Shortcut Learning and Bias in Clinical Text Analysis

Large Language Models (LLMs) are rapidly entering the healthcare space. We use them to summarize patient visits, answer medical questions, and extract structured data from messy clinical notes. The promise is enormous: automated systems that can read through thousands of history files to identify patients at risk due to Social Determinants of Health (SDOH). However, a recent study reveals a critical flaw in how these models “reason.” It turns out that models like Llama and Qwen often don’t read clinical notes the way a human would. Instead, they rely on shortcut learning—superficial patterns that allow them to guess the answer without understanding the context. ...

[Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs 🔗](https://arxiv.org/abs/2505.19155)

Speeding Up Video-LLMs for Free: Understanding Sparse-to-Dense Decoding

The capabilities of Large Language Models (LLMs) have expanded dramatically in recent years, moving from text-only processing to multimodal understanding. Among these advancements, Video-LLMs stand out for their ability to watch, analyze, and answer questions about video content. However, this capability comes with a significant computational cost. Processing video is fundamentally different from processing static images. A single video can be decomposed into hundreds or thousands of frames, and when these frames are converted into tokens, the resulting sequence length can be massive—often exceeding 100,000 tokens for a short clip. ...
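To get a feel for how quickly frame counts turn into long sequences, here is a back-of-the-envelope Python sketch. The clip length, sampling rate, and tokens-per-frame figure are hypothetical values chosen for illustration, not numbers from the paper.

```python
def video_sequence_length(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Estimate the token count for a video: sampled frames times tokens per frame."""
    frames = int(duration_s * fps)  # number of frames sampled from the clip
    return frames * tokens_per_frame


# Hypothetical setup: a 60-second clip sampled at 10 frames per second,
# with each frame encoded as 196 visual tokens.
total_tokens = video_sequence_length(duration_s=60, fps=10, tokens_per_frame=196)
print(total_tokens)  # 117600
```

Under these assumed values, a one-minute clip already produces more than 100,000 tokens before any text prompt is added.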

[Should I Believe in What Medical AI Says? A Chinese Benchmark for Medication Based on Knowledge and Reasoning 🔗](https://aclanthology.org/2025.acl-short.91.pdf)

Can AI Be Your Pharmacist? Deconstructing the ChiDrug Benchmark

Imagine you are feeling unwell. You have a headache, a slight fever, and a history of asthma. You open a chat window with a powerful AI assistant and ask, “What can I take for this?” The AI confidently recommends a specific combination of pills. Ideally, this interaction saves you a trip to the doctor. But realistically, we face a terrifying question: Is the AI hallucinating? Large Language Models (LLMs) have demonstrated incredible prowess in passing medical licensing exams and summarizing patient notes. However, medication management is a high-stakes game where “mostly correct” isn’t good enough. A hallucination here doesn’t just mean a weird sentence; it means recommending a dosage that is toxic or a drug interaction that could be fatal. ...

[Seeking Rational Demonstrations for Large Language Models: A Domain Generalization Approach to Unsupervised Cross-Domain Keyphrase Generation 🔗](https://aclanthology.org/2025.acl-short.31.pdf)

Bridging the Gap: How Domain Generalization Helps LLMs Master Keyphrases in New Fields

In the vast ocean of digital information, Keyphrase Generation (KPG) acts as a critical lighthouse. It condenses lengthy documents into a few punchy, representative phrases that summarize the core content. This technology powers search engines, document clustering, and recommendation systems. Traditionally, training these models required massive datasets of documents paired with human-annotated keyphrases. This works perfectly fine for academic papers, where datasets like KP20k are abundant. But what happens when you need to generate keyphrases for a completely different domain (say, biomedical reports or news articles) where you have zero labeled data? ...

[ScanEZ: Integrating Cognitive Models with Self-Supervised Learning for Spatiotemporal Scanpath Prediction 🔗](https://aclanthology.org/2025.acl-short.89.pdf)

How AI Learns to Read Like Humans: Inside ScanEZ

Reading feels like a continuous, fluid process. Your eyes glide across this sentence, absorbing meaning instantly, or at least that is how it feels. In reality, human reading is a jerky, erratic ballet. Your eyes make rapid movements called saccades, stopping briefly at specific points called fixations. You might skip a common word like “the,” dwell longer on a complex word like “spatiotemporal,” or even jump backward (regress) to re-read a confusing phrase. ...