[Enhancing Retrieval Systems with Inference-Time Logical Reasoning 🔗](https://arxiv.org/abs/2503.17860)

When Vector Search Fails: Teaching Retrieval Systems to Think Logically

If you have ever built a search engine or a RAG (Retrieval-Augmented Generation) pipeline, you are likely familiar with the magic of vector embeddings. You take a user’s query, squish it into a dense vector, and search for documents that are “close” to that vector in high-dimensional space. It is efficient, scalable, and generally works well for semantic similarity. But there is a catch. ...
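The dense-retrieval step described above boils down to a nearest-neighbor search over embeddings. Here is a minimal sketch with toy NumPy vectors; in a real system the vectors would come from a sentence encoder, which this example leaves out.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k documents closest to the query in embedding space."""
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                   # one similarity score per document
    return np.argsort(-sims)[:k]   # most similar documents first

# Toy 4-dimensional "embeddings" standing in for a real encoder's output.
docs = np.array([[0.1, 0.9, 0.0, 0.2],
                 [0.8, 0.1, 0.3, 0.0],
                 [0.2, 0.8, 0.1, 0.1]])
query = np.array([0.15, 0.85, 0.05, 0.15])
print(cosine_top_k(query, docs, k=2))  # nearest documents by cosine similarity
```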

2025-03 · 9 min · 1806 words
[Enhancing NER by Harnessing Multiple Datasets with Conditional Variational Autoencoders 🔗](https://aclanthology.org/2025.acl-short.87.pdf)

Bridging the Gap—How CVAEs Help Train NER Models Across Conflicting Datasets

In the world of Natural Language Processing (NLP), data is fuel. For tasks like Named Entity Recognition (NER)—where the goal is to identify and classify terms like chemicals, diseases, or genes—performance is tightly tied to the quantity of high-quality, labeled training data. While Large Language Models (LLMs) have shown impressive zero-shot capabilities, full fine-tuning or supervised learning remains the gold standard for achieving top-tier accuracy in specialized domains like biomedicine. ...
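For readers new to the task, NER is usually framed as token-level classification. A minimal illustration of the common BIO label format in a biomedical setting (the sentence and labels here are invented for illustration, not from the paper's datasets):

```python
# B- marks the beginning of an entity, I- continues it, O is outside any entity.
tokens = ["Aspirin", "reduces", "the", "risk", "of", "heart", "attack"]
labels = ["B-Chemical", "O", "O", "O", "O", "B-Disease", "I-Disease"]

for token, label in zip(tokens, labels):
    print(f"{token:10s} {label}")
```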

8 min · 1609 words
[Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding 🔗](https://arxiv.org/abs/2502.13738)

Stop Ignoring the Prompt: Boosting In-Context Learning with Contrastive Decoding

Large Language Models (LLMs) like GPT-4 and Llama-3 have revolutionized the way we approach Natural Language Processing (NLP). One of their most powerful features is In-Context Learning (ICL). Instead of fine-tuning a model for hours on a specific dataset, you simply provide a few examples (demonstrations) in the prompt, and the model figures out the pattern. It feels like magic. You give the model three examples of translating English to French, and it translates the fourth sentence perfectly. ...
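The contrastive-decoding idea in the title can be sketched in a few lines: score the next token under the full prompt (with demonstrations) and under a stripped prompt (without them), then amplify the difference. This is a toy formulation with made-up logit vectors; the paper's exact scoring rule may differ.

```python
import numpy as np

def contrastive_next_token(logits_with_demos, logits_without_demos, alpha=1.0):
    """Boost tokens whose probability rises when the demonstrations are
    present, i.e. tokens that actually use the input-label mapping."""
    log_p_full = logits_with_demos - np.logaddexp.reduce(logits_with_demos)
    log_p_bare = logits_without_demos - np.logaddexp.reduce(logits_without_demos)
    scores = log_p_full - alpha * log_p_bare
    return int(np.argmax(scores))

# Toy vocabulary of 4 tokens: the demonstrations shift mass toward token 2.
with_demos = np.array([1.0, 0.5, 3.0, 0.2])
without_demos = np.array([1.2, 0.5, 1.0, 0.2])
print(contrastive_next_token(with_demos, without_demos))  # 2
```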

2025-02 · 9 min · 1767 words
[Efficient Knowledge Editing via Minimal Precomputation 🔗](https://arxiv.org/abs/2506.04226)

FastMEMIT: How to Edit LLMs in Minutes, Not Hours

Imagine you have just deployed a massive Large Language Model (LLM). It works beautifully, until a user asks, “Who is the Prime Minister of the UK?” and the model confidently names a politician who left office three years ago. Retraining the model is out of the question—it costs too much money and takes too much time. This is where Knowledge Editing comes in. Techniques like MEMIT (Mass-Editing Memory in a Transformer) allow us to surgically alter the weights of a model to update specific facts without ruining the model’s general capabilities. ...
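The core trick behind this family of editors is easiest to see in its simplest form: a rank-one update that rewires one key-to-value mapping inside an MLP weight matrix. The sketch below shows that bare intuition with random vectors; the actual MEMIT procedure spreads edits across several layers and relies on precomputed covariance statistics, which is exactly the cost this paper attacks.

```python
import numpy as np

def rank_one_edit(W, k, v_star):
    """Minimally adjust W so the key vector k now maps to v_star.
    W' = W + (v_star - W k) k^T / (k^T k) changes W k while leaving
    directions orthogonal to k untouched."""
    residual = v_star - W @ k
    return W + np.outer(residual, k) / (k @ k)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
k = rng.standard_normal(3)        # hidden representation of the edited subject
v_star = rng.standard_normal(4)   # representation encoding the corrected fact
W_new = rank_one_edit(W, k, v_star)
print(np.allclose(W_new @ k, v_star))  # True: the fact now maps correctly
```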

2025-06 · 9 min · 1790 words
[Dynamical Order Template Prediction for Generative Aspect-Based Sentiment Analysis 🔗](https://arxiv.org/abs/2406.11130)

Beyond Static Prompts—Making Sentiment Analysis Efficient with Dynamic Order Templates

Imagine you are building an AI to analyze customer reviews for a restaurant. You receive the feedback: “The steak was incredible, but the service was agonizingly slow.” If you use standard sentiment analysis, the model might just output “Mixed” or “Neutral.” That isn’t very helpful. You need to know specifically that the food was positive and the service was negative. This is the domain of Aspect-Based Sentiment Analysis (ABSA). ...
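Concretely, ABSA turns a review into a set of (aspect, opinion, polarity) tuples. A minimal illustration of the target structure for the restaurant review above; the field names are illustrative, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class AspectSentiment:
    aspect: str     # what the opinion is about
    opinion: str    # the phrase expressing it
    polarity: str   # positive / negative / neutral

review = "The steak was incredible, but the service was agonizingly slow."

# The structured prediction an ABSA model should produce for this review:
target = [
    AspectSentiment(aspect="steak", opinion="incredible", polarity="positive"),
    AspectSentiment(aspect="service", opinion="agonizingly slow", polarity="negative"),
]
```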

2024-06 · 9 min · 1898 words
[Doc-React: Multi-page Heterogeneous Document Question-answering 🔗](https://aclanthology.org/2025.acl-short.6.pdf)

Beyond Simple RAG: How Doc-React Solves Complex Multimodal Question Answering

Imagine you are a financial analyst tasked with answering a specific question based on a 100-page annual report. The report isn’t just text; it is a chaotic mix of paragraphs, bar charts, scatter plots, and infographics spread across different pages. To answer the question, you can’t just find a single sentence. You might need to look at a chart on page 5 to identify a specific region, and then use that region to find a corresponding revenue figure in a table on page 12. ...

7 min · 1329 words
[Do Multimodal Large Language Models Truly See What We Point At? Investigating Indexical, Iconic, and Symbolic Gesture Comprehension 🔗](https://aclanthology.org/2025.acl-short.40.pdf)

The Pointing Problem: Why AI Struggles to Follow Your Finger

Imagine you are standing in a crowded museum. You point to a distant exhibit and say to your friend, “Look at that!” Your friend instantly turns their head, follows the line of your finger, identifies the specific object among dozens of others, and understands exactly what you mean. This interaction, which feels instantaneous and effortless to humans, is a masterpiece of multimodal processing. It involves integrating visual data, spatial reasoning, and language into a coherent understanding of the world. ...

8 min · 1557 words
[Diffusion Directed Acyclic Transformer for Non-Autoregressive Machine Translation 🔗](https://aclanthology.org/2025.acl-short.64.pdf)

Bridging Speed and Quality: How Diff-DAT Brings Diffusion to Non-Autoregressive Translation

In the world of Natural Language Processing (NLP), the Transformer architecture is king. Specifically, for tasks like machine translation, autoregressive (AR) Transformers have set the gold standard for quality. They generate translations one word at a time, using the previously generated words as context for the next. This sequential nature ensures high coherence but creates a significant bottleneck: latency. Generating a long sentence takes a long time because you cannot compute the 10th word until you have computed the 9th. ...
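The bottleneck is easiest to see side by side. In the sketch below, `predict_next` and `predict_all` are hypothetical stand-ins for a model's single-step and parallel decoding interfaces; the point is that the AR loop makes up to `max_len` dependent calls, while the NAR version makes one.

```python
def autoregressive_decode(model, src, max_len=32, bos=1, eos=2):
    """Sequential decoding: token t cannot be computed until token t-1 exists."""
    out = [bos]
    for _ in range(max_len):
        next_token = model.predict_next(src, out)  # hypothetical single-step API
        out.append(next_token)
        if next_token == eos:
            break
    return out

def non_autoregressive_decode(model, src, length=32):
    """One parallel pass: all target positions are predicted simultaneously."""
    return model.predict_all(src, length)  # hypothetical parallel API
```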

9 min · 1753 words
[Different Speech Translation Models Encode and Translate Speaker Gender Differently 🔗](https://arxiv.org/abs/2506.02172)

The Masculine Default: Why Modern Speech Translation Models Struggle with Gender

Imagine you are using a real-time translation app. You speak into the microphone: “I was born in London.” You are a woman. The app translates your sentence into French. In English, the sentence is neutral. But in French, grammar demands a choice. If the speaker is female, it should be “Je suis née à Londres.” If the speaker is male, it is “Je suis né à Londres.” How does the AI decide? In text-to-text translation, the system has no clue; it usually guesses (often defaulting to the masculine form). But in Speech Translation (ST), the model has access to your voice. Ideally, the AI should “hear” the acoustic features associated with your voice, encode that information, and use it to select the correct grammatical gender. ...

2025-06 · 8 min · 1638 words
[Decoder-Only LLMs can be Masked Auto-Encoders 🔗](https://aclanthology.org/2025.acl-short.57.pdf)

Unifying Generation and Embeddings: How UniMAE Transforms Decoder-Only LLMs

In the rapidly evolving landscape of Natural Language Processing (NLP), we currently face a “two-model problem.” If you are building a Retrieval-Augmented Generation (RAG) system, you typically need two distinct architectures: a retriever (usually a bidirectional encoder like BERT) to handle embeddings and search, and a generator (a decoder-only LLM like GPT or Llama) to synthesize the answer. This structural separation creates inefficiencies. It doubles deployment costs and prevents knowledge sharing between tasks. What if a single Large Language Model (LLM) could handle both high-quality text generation and high-quality sentence representation simultaneously? ...
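One common baseline for squeezing embeddings out of a decoder-only model (not UniMAE's actual training objective) is to pool its hidden states. A sketch using Hugging Face transformers, with `gpt2` as a stand-in for any decoder-only LLM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in decoder-only model
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

def embed(text: str) -> torch.Tensor:
    """Mean-pool the final hidden layer to get a sentence vector."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = lm(**inputs).hidden_states[-1]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# The same lm object (via its generate method) also produces text, so
# retrieval and generation can share one set of weights.
print(embed("What if one model could do both?").shape)
```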

3 min · 495 words
[Cross-Lingual Transfer of Cultural Knowledge: An Asymmetric Phenomenon 🔗](https://arxiv.org/abs/2506.01675)

The One-Way Street: Why Cultural Knowledge Doesn't Always Flow Freely in LLMs

Large Language Models (LLMs) have achieved remarkable proficiency in translating between languages. You can ask a model to translate a sentence from English to Tibetan, and it will often do a passable job. But language is more than just grammar and vocabulary; it is the vessel for culture. A critical question facing the AI research community is whether LLMs actually “understand” the culture associated with the languages they speak, or if they are merely mapping words. More specifically, how does cultural knowledge move between languages? If an LLM learns about a Korean festival while training on English text, does it automatically know about that festival when prompted in Korean? Conversely, does learning a low-resource language like Mongolian teach the model about Mongolian culture in English? ...

2025-06 · 8 min · 1615 words
[Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning 🔗](https://arxiv.org/abs/2505.13628)

Seeing is Translating: How Images Can Teach AI to Speak Low-Resource Languages

Imagine you are trying to learn a language that is completely foreign to you—perhaps Quechua or Swahili—and you have no dictionary. You do, however, have a photo album. You point to a picture of a dog, and a local speaker says “allqu” (in Quechua). You point to a picture of the sun, and they say “inti.” Eventually, without ever seeing a direct translation to English, you begin to understand the language through the shared reality of the visual world. ...
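The training signal here is a contrastive image-caption objective: captions, in any language, must land near their own image's embedding and away from everyone else's. A minimal CLIP-style symmetric InfoNCE sketch in PyTorch; the paper's exact setup (encoders, batching, language mix) is abstracted away.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temp: float = 0.07):
    """Symmetric InfoNCE: matching image/caption pairs attract, all other
    pairings in the batch repel. Captions in different languages that share
    an image are thereby pulled toward the same point in embedding space."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temp            # (batch, batch) similarity matrix
    labels = torch.arange(len(img))          # the diagonal holds the true pairs
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Toy batch of 4 image/caption embedding pairs.
print(clip_style_loss(torch.randn(4, 64), torch.randn(4, 64)))
```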

2025-05 · 10 min · 2068 words
[Counterfactual-Consistency Prompting for Relative Temporal Understanding in Large Language Models 🔗](https://arxiv.org/abs/2502.11425)

Fixing the Timeline: How Counterfactuals Teach LLMs to Understand Time

Large Language Models (LLMs) like GPT-4 and Llama-3 are impressive polymaths. They can write poetry, debug code, and summarize history. But for all their sophistication, they often struggle with a concept that a primary school student grasps intuitively: Time. Specifically, LLMs struggle with relative temporal understanding. If you tell a model that “John finished dinner before he went for a walk,” and then ask, “Did John go for a walk after dinner?”, the answer is immediately obvious to any human: “Yes.” However, LLMs frequently get confused by these logical entanglements. They suffer from temporal inconsistency—they might correctly answer one version of the question but contradict themselves when the question is phrased slightly differently. ...
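That inconsistency is mechanical to test: “A before B” and “B after A” are logically equivalent, so a model's answers to both should match. A tiny harness sketch, where `ask_model` is a hypothetical stand-in for any yes/no LLM query function:

```python
def consistent_before_after(ask_model, event_a: str, event_b: str) -> bool:
    """Check that logically equivalent temporal questions get the same answer.
    ask_model is assumed to return 'yes' or 'no' for a question string."""
    q1 = f"Did {event_a} happen before {event_b}?"
    q2 = f"Did {event_b} happen after {event_a}?"
    return ask_model(q1) == ask_model(q2)

# A perfectly consistent oracle passes; a real LLM often does not.
oracle = lambda q: "yes"
print(consistent_before_after(oracle, "the dinner", "the walk"))  # True
```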

2025-02 · 8 min · 1576 words
[ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT 🔗](https://arxiv.org/abs/2506.04929)

Beyond Text: How Images and Metadata Are Revolutionizing E-Commerce Translation

Imagine walking into a store and seeing a label that just says “Pen.” If you are standing in the stationery aisle, you immediately know it’s a writing instrument. But if you are standing in the farming supplies section, that same word—“pen”—likely refers to an enclosure for animals. The word hasn’t changed, but the context has shifted the meaning entirely. This ambiguity is the arch-nemesis of Machine Translation (MT). For years, Neural Machine Translation (NMT) systems, like the ones powering Google Translate or DeepL, have translated sentences in isolation. They treat text as a vacuum, ignoring the visual or categorical world surrounding it. While this works well for generic documents, it often fails in the high-stakes, nuance-heavy world of e-commerce. ...

2025-06 · 9 min · 1855 words
[Combining Domain and Alignment Vectors Provides Better Knowledge-Safety Trade-offs in LLMs 🔗](https://aclanthology.org/2025.acl-short.22.pdf)

MERGEALIGN: How to Build Safe Expert LLMs Without the Alignment Tax

In the rapidly evolving landscape of Large Language Models (LLMs), we are witnessing a shift from general-purpose chatbots to highly specialized “domain experts.” We now have models fine-tuned specifically for finance, medicine, coding, and law. These experts can pass board exams and analyze complex fiscal reports with accuracy that far surpasses a standard GPT-4 or Llama-3 model. However, specialization comes at a steep price. To create an expert, we typically take a base model and fine-tune it heavily on domain-specific data (like medical journals or case law). In doing so, we often break the model’s “safety alignment.” The resulting expert might be a genius at diagnosis, but it forgets the ethical guardrails that prevent it from generating harmful content, toxic responses, or dangerous advice. Conversely, if we try to train these experts to be safe, they often lose their edge—a phenomenon known as the alignment tax, where safety training degrades the model’s utility. ...
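The merging idea can be written as plain task-vector arithmetic over state dicts: subtract the base weights from the domain expert and from the safety-aligned model to get two deltas, then add both back. A sketch of that arithmetic; the weighting scheme MERGEALIGN actually uses may differ.

```python
import torch

def merge_align(base: dict, domain_expert: dict, aligned: dict,
                w_domain: float = 0.5, w_align: float = 0.5) -> dict:
    """Task-vector arithmetic: extract what domain fine-tuning and safety
    alignment each added to the base weights, then apply both deltas."""
    merged = {}
    for name, theta in base.items():
        domain_vec = domain_expert[name] - theta   # expertise delta
        align_vec = aligned[name] - theta          # safety delta
        merged[name] = theta + w_domain * domain_vec + w_align * align_vec
    return merged

# Toy one-layer "models" to show the shapes involved.
base = {"w": torch.zeros(2, 2)}
expert = {"w": torch.ones(2, 2)}
aligned = {"w": -torch.ones(2, 2)}
print(merge_align(base, expert, aligned)["w"])  # deltas cancel at equal weights
```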

9 min · 1832 words
[CoRet: Improved Retriever for Code Editing 🔗](https://aclanthology.org/2025.acl-short.62.pdf)

How CoRet Revolutionizes Code Navigation for AI Agents

Imagine you are a software engineer joining a massive, legacy codebase for the first time. You are assigned a ticket: “Fix the bug where the user login times out on the dashboard.” Your first challenge isn’t fixing the code; it is finding where that code lives among thousands of files and tens of thousands of functions. This “needle in a haystack” problem is exactly what AI coding agents face today. While Large Language Models (LLMs) are excellent at writing code, they struggle significantly with retrieval—locating the specific files and functions that need editing based on a natural language request. ...

7 min · 1438 words
[ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events 🔗](https://arxiv.org/abs/2501.03040)

Does AI Know What Time It Is? Unpacking ChronoSense and Temporal Reasoning in LLMs

Imagine you are reading a history book. You read that the “Fourth Cholera Pandemic” lasted from 1863 to 1875, and “World War II” occurred between 1939 and 1945. If someone asked you, “Did the pandemic happen before the war?” the answer is immediate and obvious. You don’t need to perform complex calculations; you simply compare the timelines. This intuitive grasp of time—understanding that events have durations, that they can overlap, start together, or follow one another—is fundamental to human cognition. ...
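That timeline comparison is exactly an Allen interval relation. Two of the thirteen relations, applied to the excerpt's events:

```python
def before(a: tuple, b: tuple) -> bool:
    """Allen's BEFORE relation: interval a ends before interval b starts."""
    return a[1] < b[0]

def overlaps(a: tuple, b: tuple) -> bool:
    """Allen's OVERLAPS relation: a starts first and ends inside b."""
    return a[0] < b[0] < a[1] < b[1]

cholera = (1863, 1875)   # Fourth Cholera Pandemic
ww2 = (1939, 1945)       # World War II
print(before(cholera, ww2))    # True: the pandemic happened before the war
print(overlaps(cholera, ww2))  # False: the intervals do not intersect
```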

2025-01 · 9 min · 1833 words
[Can Uniform Meaning Representation Help GPT-4 Translate from Indigenous Languages? 🔗](https://arxiv.org/abs/2502.08900)

Bridging the Gap: Can Semantic Graphs Teach GPT-4 Indigenous Languages?

In the era of Large Language Models (LLMs), it is easy to assume that artificial intelligence has “solved” language. We can open ChatGPT, type a sentence in English, and instantly receive a fluent translation in French, Spanish, or Japanese. However, this apparent mastery masks a significant digital divide. While models like GPT-4 excel at high-resource languages—those with billions of words of text available on the internet—they frequently fail when tasked with low-resource languages, and particularly indigenous languages. ...

2025-02 · 8 min · 1624 words
[Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs 🔗](https://arxiv.org/abs/2506.00304)

Reading Muscles with Language Models: How LLMs Are Decoding Silent Speech

Imagine trying to speak, but no sound comes out. You form the words with your mouth, your tongue moves, your jaw articulates, but the vocal cords remain silent. For millions of people suffering from speech impairments—such as those who have undergone laryngectomies—this is a daily reality. Technology has long sought to bridge this gap through Silent Speech Interfaces (SSIs). One of the most promising technologies in this field is surface electromyography (sEMG). By placing electrodes on the skin around the face and neck, sensors can detect the electrical activity of the muscles used for speech. Theoretically, if a computer can read these electrical signals, it can translate them into text. ...

2025-06 · 9 min · 1849 words
[Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure 🔗](https://arxiv.org/abs/2506.12278)

Can AI Hack Your Code? Introducing TestCase-Eval for LLM Test Generation

The rise of Large Language Models (LLMs) like GPT-4 and Qwen has revolutionized how we write code. We can now prompt a model to generate complex algorithms, solve competitive programming problems, and scaffold entire applications. But any experienced software engineer knows that writing code is only half the battle. The other half—often the harder half—is testing it. If an AI generates a sorting algorithm, how do we know it works for every edge case? Can the AI itself generate the test cases needed to verify that code? ...
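The evaluation hinges on fault exposure: a generated test case is valuable if it makes a buggy solution disagree with a known-correct one. A toy sketch of that criterion, using Python's built-in `sorted` as the reference and an invented faulty sort as the candidate:

```python
import random

def exposes_fault(test_input, reference_solution, candidate_solution) -> bool:
    """A test case 'exposes' a buggy candidate if the candidate's output
    disagrees with a known-correct reference on that input."""
    return candidate_solution(test_input) != reference_solution(test_input)

def buggy_sort(xs):
    return sorted(set(xs))  # toy fault: silently drops duplicate elements

random.seed(0)
tests = [[random.randint(0, 5) for _ in range(8)] for _ in range(20)]
exposed = sum(exposes_fault(t, sorted, buggy_sort) for t in tests)
print(f"{exposed}/{len(tests)} generated tests expose the bug")
```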

2025-06 · 8 min · 1623 words