[Can Community Notes Replace Professional Fact-Checkers? 🔗](https://arxiv.org/abs/2502.14132)

The Hidden Backbone - Why Community Notes Need Professional Fact-Checkers

In the ever-evolving landscape of social media, the battle against misinformation has taken a fascinating turn. For years, platforms like Facebook and Twitter (now X) relied on partnerships with professional fact-checking organizations—groups like Snopes, PolitiFact, and Reuters—to flag false claims. However, recently, the tide has shifted toward “community moderation.” The logic seems democratic and scalable: instead of paying a small group of experts to check a massive volume of content, why not empower the users themselves to police the platform? This is the philosophy behind Community Notes on X (formerly Twitter). The idea is that the “wisdom of the crowd” can identify falsehoods faster and more effectively than a newsroom. ...

2025-02 · 8 min · 1680 words
[Call for Rigor in Reporting Quality of Instruction Tuning Data 🔗](https://arxiv.org/abs/2503.04807)

The Hyperparameter Lottery: Why We Might Be Misjudging LLM Data Quality

Introduction In the current landscape of Large Language Model (LLM) development, data is the new gold. But not just any data—we are specifically obsessed with Instruction Tuning (IT) data. This is the dataset that turns a raw, text-predicting base model into a helpful chatbot that can answer questions, summarize emails, and write code. A prevailing trend in recent research is “Less is More.” Studies like LIMA (Less Is More for Alignment) have suggested that you don’t need millions of instructions to train a great model; you might only need 1,000 highly curated, high-quality examples. This has triggered a gold rush to find the perfect “high-quality” dataset. Every week, a new paper claims that “Dataset A is better than Dataset B” or that a specific filtering method selects the best data. ...

2025-03 · 10 min · 1954 words
[CHEER-Ekman: Fine-grained Embodied Emotion Classification 🔗](https://arxiv.org/abs/2506.01047)

Decoding the Body Language of Text: How LLMs Learned to Feel

Introduction When you read the phrase “Her heart was racing,” what do you understand? Depending on the context, she could be terrified of a spider, or she could be looking at the love of her life. This is the challenge of Embodied Emotion. Emotions aren’t just abstract concepts in our brains; they are physical experiences. We clench our fists in anger, our stomachs churn in disgust, and our eyes widen in surprise. In Natural Language Processing (NLP), detecting explicit emotions (e.g., “I am happy”) is a solved problem. However, detecting the subtle, physical manifestations of emotion—and correctly classifying them—remains a significant hurdle. ...

2025-06 · 8 min · 1497 words
[BQA: Body Language Question Answering Dataset for Video Large Language Models 🔗](https://arxiv.org/abs/2410.13206)

Can AI Read the Room? Decoding Body Language with the New BQA Dataset

Introduction We’ve all been there: a friend says “I’m fine,” but their crossed arms, avoided eye contact, and stiff posture scream the exact opposite. As humans, a massive chunk of our communication relies on these nonverbal cues. We interpret intent, emotion, and social dynamics not just by listening to words, but by “reading the room.” For Artificial Intelligence, specifically Video Large Language Models (VideoLLMs), this is a frontier that remains largely unconquered. While models like GPT-4o or Gemini are getting remarkably good at describing what objects are in a video, understanding the emotional subtext of human movement is a different beast entirely. Human body language lacks formal rules; it is fluid, culturally dependent, and often unconscious. ...

2024-10 · 9 min · 1891 words
[Automatic detection of dyslexia based on eye movements during reading in Russian 🔗](https://aclanthology.org/2025.acl-short.5.pdf)

Eyes on the Page—Using LSTMs to Detect Dyslexia Through Gaze Patterns

Dyslexia is one of the most common learning disabilities, affecting an estimated 9% to 12% of the population. It is not a visual problem, nor is it related to intelligence; rather, it is a difficulty with phonological decoding—mapping sounds to letters. While the condition is lifelong, early diagnosis is the single most critical factor in ensuring a child stays on track in the educational system. The problem, however, is logistics. Standard testing batteries for dyslexia are expensive, time-consuming, and require one-on-one administration by trained specialists who are not always available in schools. This creates a bottleneck where many children slip through the cracks. ...

7 min · 1447 words
[Are Optimal Algorithms Still Optimal? Rethinking Sorting in LLM-Based Pairwise Ranking with Batching and Caching 🔗](https://arxiv.org/abs/2505.24643)

Why Big O Notation Lies to You: Rethinking Sorting for LLM Re-Ranking

Introduction If you have ever taken a computer science algorithms course, you know the drill. When asked “What is the most efficient sorting algorithm?”, the answer is almost reflexively “Merge Sort,” “Heap Sort,” or “Quick Sort.” Why? Because of Big O notation. We are taught that \(O(n \log n)\) is the gold standard for comparison-based sorting, while algorithms like Bubble Sort (\(O(n^2)\)) are relegated to the “never use in production” bin. ...
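
The excerpt cuts off before the punchline, so here is a toy cost model of my own (an illustration, not the paper's setup): assume every batch of mutually independent comparisons costs one LLM call. Merge sort's comparisons are data-dependent and largely serialize, while odd-even transposition sort, nominally \(O(n^2)\), can issue each of its \(n\) rounds of disjoint adjacent comparisons as a single batched call.

```python
# Toy cost model (illustrative assumption: one LLM call per batch of
# independent pairwise comparisons; not the paper's exact setup).

def merge_sort_calls(n: int) -> int:
    """Worst-case comparisons for merge sort; each depends on the previous
    outcome, so they roughly serialize into this many individual calls."""
    if n <= 1:
        return 0
    half = n // 2
    return merge_sort_calls(half) + merge_sort_calls(n - half) + (n - 1)

def odd_even_batched_calls(n: int) -> int:
    """Odd-even transposition sort needs n rounds; every round compares
    disjoint adjacent pairs, so each round fits in ONE batched call."""
    return n

for n in (10, 100, 1000):
    print(f"n={n:4d}: merge sort ~{merge_sort_calls(n):6d} calls, "
          f"odd-even ~{odd_even_batched_calls(n):4d} batched calls")
```

Under this cost model, the “inefficient” quadratic algorithm wins on the number of round trips once batching is available, which is exactly the kind of inversion the post explores.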

2025-05 · 9 min · 1773 words
[An Effective Curriculum Learning for Sequence Labeling Incorporating Heterogeneous Knowledge 🔗](https://arxiv.org/abs/2402.13534)

Learning Like a Human: Accelerating Sequence Labeling with Dual-Stage Curriculum Learning

Introduction Imagine teaching a child to read. You wouldn’t start by handing them a complex legal contract or a page from Shakespeare. Instead, you begin with simple sentences: “The cat sat on the mat.” Once they master the basics, you gradually introduce more complex grammar, vocabulary, and sentence structures. This intuitive progression—learning the easy stuff before the hard stuff—is the foundation of Curriculum Learning (CL) in artificial intelligence. In the field of Natural Language Processing (NLP), however, we often ignore this intuition. We tend to train models by feeding them data in random batches, mixing simple phrases with incredibly complex, ambiguous sentences. ...
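
As a minimal sketch of that easy-to-hard intuition (my toy example: sentence length stands in for difficulty; the paper combines richer, heterogeneous knowledge signals):

```python
# Toy curriculum: order training data from "easy" to "hard" using sentence
# length as a crude difficulty proxy, then train in widening stages.

def difficulty(sentence: str) -> int:
    return len(sentence.split())  # shorter sentence ~ easier (proxy only)

corpus = [
    "The cat sat on the mat.",
    "Birds fly.",
    "Notwithstanding the foregoing, the indemnifying party shall assume "
    "sole control of the defense of any such claim.",
    "I like tea.",
]

ordered = sorted(corpus, key=difficulty)
stages = [ordered[: len(ordered) // 2], ordered]  # easy half, then everything

for i, stage in enumerate(stages, 1):
    print(f"Stage {i} ({len(stage)} examples):")
    for sent in stage:
        print("  train on:", sent)
```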

2024-02 · 8 min · 1582 words
[Acoustic Individual Identification of White-Faced Capuchin Monkeys Using Joint Multi-Species Embeddings 🔗](https://aclanthology.org/2025.acl-short.51.pdf)

Decoding the Jungle: How Bird and Human AI Models Team Up to Identify Monkeys

Imagine standing in the dense tropical forests of Costa Rica. The air is thick with humidity, and the soundscape is a chaotic symphony of insect buzzes, bird calls, wind rustling through leaves, and distant rumbles. In the middle of this acoustic storm (“the cocktail party problem”), a white-faced capuchin monkey calls out. For a human researcher, identifying which specific monkey just made that sound is an arduous task requiring years of training and intense focus. For a computer, it’s even harder. The lack of large, labeled datasets for wild animals has long been a bottleneck in bioacoustics. We have massive datasets for human speech and decent ones for bird calls, but for a specific species of Neotropical primate? The data is scarce. ...

8 min · 1525 words
[Accelerating Dense LLMs via L0-regularized Mixture-of-Experts 🔗](https://aclanthology.org/2025.acl-short.39.pdf)

How to Turn Heavy Dense LLMs into Fast Sparse Experts - A Deep Dive into L0-MoE

Introduction: The Efficiency Bottleneck We are currently living in the era of the “Scaling Law.” The logic that has driven AI progress for the last few years is simple: bigger models equal better performance. Whether it’s Llama-3, Qwen2, or Mistral, increasing the parameter count consistently unlocks new capabilities in reasoning, coding, and general knowledge. However, this intelligence comes with a steep price tag: Inference Latency. Running a massive 70B or even 8B parameter model is computationally expensive. Every time you ask a chatbot a question, the model has to activate all of its parameters to generate a response. This leads to slow generation speeds and high operational costs. ...
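
For intuition, here is a toy top-k mixture-of-experts forward pass in NumPy. It shows standard top-k gating, not the paper's L0-regularized construction: the point is simply that only k of E experts run per token, so most of the layer's parameters stay idle.

```python
import numpy as np

# Toy MoE layer: a router scores E expert networks and only the top-k run,
# so roughly k/E of the layer's parameters are active for any given token.
rng = np.random.default_rng(0)
d, E, k = 16, 8, 2                                  # hidden dim, experts, active experts
W_router = rng.standard_normal((d, E))              # router projection
experts = [rng.standard_normal((d, d)) for _ in range(E)]  # stand-in expert weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ W_router                           # score every expert
    top = np.argsort(logits)[-k:]                   # indices of the k best
    gates = np.exp(logits[top])
    gates /= gates.sum()                            # softmax over the survivors
    return sum(g * (experts[e] @ x) for g, e in zip(gates, top))

x = rng.standard_normal(d)                          # one token's hidden state
y = moe_forward(x)
print(f"output shape: {y.shape}, active parameter share ~ {k}/{E} = {k/E:.0%}")
```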

9 min · 1708 words
[A Variational Approach for Mitigating Entity Bias in Relation Extraction 🔗](https://arxiv.org/abs/2506.11381)

Breaking the Habit: How Variational Information Bottleneck Reduces Entity Bias in Relation Extraction

Introduction Imagine you are reading a financial news headline: “Microsoft invests $10 billion in…” Before you even finish the sentence, your brain probably fills in the blank with “OpenAI.” You didn’t need to read the rest of the text because you relied on your prior knowledge of the entities involved. While this heuristic is useful for humans, it is a significant problem for Artificial Intelligence. In the field of Natural Language Processing (NLP), this phenomenon is known as Entity Bias. Models like BERT or RoBERTa often memorize connections between specific entities (e.g., “Microsoft” and “invest”) rather than understanding the context of the sentence. If the sentence actually read “Microsoft sues OpenAI,” a biased model might still predict an “investment” relationship simply because it over-relies on the names. ...

2025-06 · 7 min · 1394 words
[A Measure of the System Dependence of Automated Metrics 🔗](https://arxiv.org/abs/2412.03152)

Is Your AI Metric Fair? Why We Need to Measure System Dependence in Machine Translation

Imagine you are a carpenter building tables. You have a ruler to measure the length of your work. But this ruler has a strange property: when you measure a table made of oak, an inch is exactly 2.54 centimeters. But when you measure a table made of pine, the ruler magically shrinks, and an “inch” becomes only 2 centimeters. As a result, your pine tables receive inflated measurements, while your oak tables are penalized. ...

2024-12 · 7 min · 1486 words
[A Little Human Data Goes A Long Way 🔗](https://arxiv.org/abs/2410.13098)

The 2.5% Rule: Why Synthetic Data Still Needs a Human Touch

Introduction In the current landscape of Artificial Intelligence, data is the new oil, but the wells are running dry. From the early days of BERT to the massive scale of GPT-4, the growth of Language Models (LMs) has been fueled by an exponential increase in training data. However, we are approaching a critical bottleneck: high-quality, human-annotated data is expensive, slow to produce, and difficult to scale for specialized tasks. Faced with this “data hunger,” researchers and engineers have turned to a promising alternative: Synthetic Data. If modern Large Language Models (LLMs) are so smart, why not ask them to generate the training data for the next generation of models? It sounds like the perfect perpetual motion machine—AI teaching AI, eliminating the need for costly human labor. ...

2024-10 · 10 min · 1929 words