[CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures 🔗](https://arxiv.org/abs/2410.05235)

Beyond the Diagnosis: Teaching AI to Argue Like a Doctor with CasiMedicos-Arg

Imagine you are a resident doctor in a busy emergency room. You examine a patient, review their vitals, and turn to your attending physician with a diagnosis. “It’s pneumonia,” you say. The attending looks at you and asks the most terrifying question in medical education: “Why?” It is not enough to get the answer right. In medicine, the reasoning process—the chain of evidence connecting symptoms to a diagnosis—is just as critical as the conclusion itself. ...

2024-10 · 8 min · 1526 words
[Casablanca: Data and Models for Multidialectal Arabic Speech Recognition 🔗](https://arxiv.org/abs/2410.04527)

Beyond Modern Standard: How 'Casablanca' is Revolutionizing Arabic Speech Recognition

Introduction: The “Speech Divide.” If you are reading this, chances are you have used a voice assistant like Siri, Alexa, or Google Assistant. You might have even marveled at how accurate automated subtitles on YouTube have become. For speakers of English, French, or Spanish, we are living in the golden age of Automatic Speech Recognition (ASR). Large language models and self-supervised learning (SSL) have solved the majority of transcription problems for these “resource-rich” languages. ...

2024-10 · 9 min · 1823 words
[CareCorpus+: Expanding and Augmenting Caregiver Strategy Data to Support Pediatric Rehabilitation 🔗](https://aclanthology.org/2024.emnlp-main.392.pdf)

Revolutionizing Pediatric Care: How Synthetic Data and LLMs Are Unlocking Caregiver Strategies

Globally, over 50 million children aged 0–5 years experience some form of disability. For these children and their families, pediatric rehabilitation is not just about clinical visits; it is about the daily grind of navigating life. It involves finding ways to participate in family dinners, play at the park, or manage school routines. In this context, caregivers—parents and guardians—are the unsung experts. They develop unique, personalized “strategies” to help their children succeed. ...

9 min · 1846 words
[Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! 🔗](https://arxiv.org/abs/2410.01023)

If AI Can Explain the Joke, Does It Understand? Testing Multimodal Literacy with Visual Puns

When a friend winks at you while saying, “I’m definitely going to stick to my diet today,” you immediately understand that they likely mean the opposite. You didn’t just process the text (the sentence); you integrated the visual cue (the wink) to resolve the ambiguity of their statement. This ability is known as multimodal literacy. It is the human capacity to actively combine information from different sources—text, images, gestures—to form a complete reasoning process. We do this intuitively when we look at a textbook illustration to understand a complex paragraph or when we read a caption to make sense of an abstract photo. ...

2024-10 · 9 min · 1911 words
[Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization? 🔗](https://arxiv.org/abs/2406.17274)

The Shaky Foundation of Trust: Why Evaluating Uncertainty in Text Summarization is Harder Than We Thought

In the rapidly evolving world of Natural Language Generation (NLG), we have witnessed Large Language Models (LLMs) perform feats that were considered science fiction only a decade ago. From summarizing complex financial reports to condensing medical records, abstractive text summarization is reshaping industries. However, there is a catch. LLMs hallucinate. They can generate summaries that sound fluent and confident but are factually incorrect. In high-stakes domains—like healthcare or finance—relying on a false summary can have catastrophic consequences. ...

2024-06 · 12 min · 2361 words
[Can Transformers Learn n-gram Language Models? 🔗](https://arxiv.org/abs/2410.03001)

Beyond the Hype—Are Transformers Actually Good at Learning Basic n-grams?

If you have been following the explosion of Natural Language Processing (NLP) in recent years, you know that the Transformer architecture is the engine behind the revolution. From GPT-4 to Claude, Transformers seem capable of mastering complex reasoning, coding, and creative writing. But in the research world, a fundamental question remains: Do we actually understand how they learn? There is a significant body of theoretical work exploring what Transformers can represent. For example, we know mathematically that a Transformer is capable of mimicking an n-gram language model (a simple model that predicts the next word based on the previous \(n-1\) words). But just because a neural network can represent a function doesn’t mean it will actually learn that function from data using gradient descent. ...

2024-10 · 2 min · 303 words
[Can Large Language Models Learn Independent Causal Mechanisms? 🔗](https://arxiv.org/abs/2402.02636)

Beyond Stochastic Parrots—Teaching LLMs to Think with Independent Causal Mechanisms

We are living in the golden age of Large Language Models (LLMs). Systems like GPT-4 and LLaMA have revolutionized how we interact with technology, demonstrating linguistic prowess that often feels like genuine intelligence. However, there is a “ghost in the machine.” Despite their fluency, these models often fail spectacularly when faced with tasks that require rigorous logical consistency or when the data distribution shifts slightly from what they saw during training. ...

2024-02 · 8 min · 1549 words
[Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? 🔗](https://arxiv.org/abs/2405.16908)

Why Your LLM Sounds So Confident Even When It's Wrong: The Challenge of Faithful Uncertainty

We have all experienced it: you ask a Large Language Model (LLM) a specific factual question—perhaps about an obscure historical figure or a specific coding error—and it responds with absolute conviction. The grammar is perfect, the tone is authoritative, and the delivery is decisive. There is just one problem: the answer is completely wrong. This phenomenon highlights a critical gap in modern Artificial Intelligence. LLMs are trained to generate fluent, persuasive text, often at the expense of accuracy. While we call these “hallucinations,” the danger isn’t just that the model is wrong; it is that the model is persuasively wrong. It mimics the cadence of an expert even when it is essentially guessing. ...

2024-05 · 9 min · 1818 words
[Can Large Language Models Enhance Predictions of Disease Progression? Investigating Through Disease Network Link Prediction 🔗](https://aclanthology.org/2024.emnlp-main.980.pdf)

ComLLM: How Large Language Models and Graphs Are Revolutionizing Disease Prediction

The digital transformation of healthcare has provided us with a staggering amount of data. Electronic health records (EHRs) track everything from routine checkups to critical diagnoses, creating a rich history of patient health. Yet, having data and effectively using it to predict the future are two very different things. One of the most critical challenges in modern medical AI is predicting disease progression and comorbidity—the likelihood that a patient with one condition (like diabetes) will develop another (like heart disease). ...

8 min · 1650 words
[Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? 🔗](https://arxiv.org/abs/2406.12809)

The Paradox of Intelligence—Why LLMs Fail at Easy Tasks While Acing Hard Ones

Imagine you are tutoring a student in calculus. They effortlessly solve a complex Gaussian integral, showing a deep understanding of advanced mathematical concepts. Impressed, you ask them a follow-up question: “What is 17 times 8?” The student stares blankly and answers, “106.” You would be baffled. In human cognition, capabilities are generally hierarchical; if you have mastered advanced calculus, it is taken for granted that you have mastered basic arithmetic. This is the essence of consistency. ...

2024-06 · 8 min · 1655 words
[Can Language Models Induce Grammatical Knowledge from Indirect Evidence? 🔗](https://arxiv.org/abs/2410.06022)

The "Wug" Test for AI: Do LLMs Learn Like Humans?

If you have ever taken an introductory linguistics class, you are likely familiar with the “Wug Test.” In 1958, Jean Berko Gleason showed children a picture of a bird-like creature and said, “This is a wug.” She then showed two of them and said, “Now there is another one. There are two of them. There are two…?” The children correctly answered “wugs.” ...

2024-10 · 8 min · 1571 words
[Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators 🔗](https://arxiv.org/abs/2409.14037)

The Hallucinating Professor: Why LLMs Might Not Be Ready to Teach Science

Imagine a world where every student, regardless of their location or resources, has access to a personal tutor with the knowledge of Neil deGrasse Tyson, the mathematical intuition of Terence Tao, and the chemical expertise of Marie Curie. This is the promise of Large Language Models (LLMs) like GPT-4 and Llama-3. We have rapidly transitioned from using chatbots for writing emails to relying on them for summarizing complex research papers and explaining scientific concepts. ...

2024-09 · 9 min · 1855 words
[Can LLMs Learn Uncertainty on Their Own? Expressing Uncertainty Effectively in a Self-Training Manner 🔗](https://aclanthology.org/2024.emnlp-main.1205.pdf)

Teaching LLMs to Doubt: How Self-Training Can Fix AI Overconfidence

Introduction: The Confidently Wrong Machine. Imagine asking an AI assistant for medical advice or a legal precedent. It gives you an answer immediately, with perfect grammar and an authoritative tone. But there is a problem: the answer is completely made up. This phenomenon, often called “hallucination,” is one of the biggest hurdles in deploying Large Language Models (LLMs) like LLaMA or GPT-4 in critical real-world applications. The core issue isn’t just that models make mistakes—humans do that too. The issue is that LLMs often lack the ability to know when they are making a mistake. They are frequently “confidently wrong.” ...

8 min · 1688 words
[Can LLM Generate Culturally Relevant Commonsense QA Data? Case Study in Indonesian and Sundanese 🔗](https://arxiv.org/abs/2402.17302)

Can AI Understand Culture? A Deep Dive into Synthetic Data for Low-Resource Languages

Introduction: The “Snow” Problem in AI. Imagine you are training an Artificial Intelligence to understand “commonsense.” You feed it thousands of questions to test its reasoning capabilities. One question asks: “The man needed to shovel his driveway. What season is it?” The answer, obviously, is winter. Now, imagine asking that same question to a student in Jakarta, Indonesia. They might look at you with confusion. Indonesia is a tropical country; people don’t shovel driveways, and it certainly doesn’t snow. The concept isn’t “commonsense”—it’s culturally irrelevant. ...

2024-02 · 8 min · 1673 words
[Can Automatic Metrics Assess High-Quality Translations? 🔗](https://arxiv.org/abs/2405.18348)

The "Good Enough" Trap—Why AI Metrics Fail at Evaluating High-Quality Translations

In the rapidly evolving world of Machine Translation (MT), we have reached a pivotal moment. A few years ago, the goal of translation systems was simply to produce understandable text. Today, systems like Google Translate, DeepL, and GPT-4 produce translations that are often indistinguishable from human output. We are no longer dealing with “word salad”; we are dealing with nuance, style, and high-fidelity accuracy. But this success brings a new, insidious problem. The tools we use to grade these systems—automatic metrics like BLEU, COMET, and BLEURT—were designed and validated in an era where the difference between a “good” and a “bad” translation was obvious. ...

2024-05 · 9 min · 1785 words
[Can Active Label Correction Improve LLM-based Modular AI Systems? 🔗](https://arxiv.org/abs/2401.05467)

Taming the Noise: How to Upgrade LLM Agents into Efficient, Fine-Tuned Systems

The rapid rise of Large Language Models (LLMs) like GPT-4 and LLaMA has popularized “Modular AI Systems.” Think of frameworks like LangChain, AutoGPT, or HuggingGPT. These systems chain together multiple LLM calls to perform complex tasks—planning a trip, writing code, or analyzing financial documents. They are incredibly powerful because they require zero training; you just write a prompt, and the system works. ...

2024-01 · 9 min · 1790 words
[Calibrating the Confidence of Large Language Models by Eliciting Fidelity 🔗](https://arxiv.org/abs/2404.02655)

Why LLMs Lie About Being Sure: Introducing UF Calibration

Large Language Models (LLMs) like GPT-4 and LLaMA-2 have revolutionized how we interact with information. They are incredibly helpful, harmless, and creative. However, they have a notorious flaw: they don’t know when to shut up. Or, more accurately, they don’t know how to admit they are unsure. You have likely experienced an LLM hallucinating a fact with the same supreme confidence it uses to state that \(2+2=4\). This is a failure of calibration. ...

2024-04 · 10 min · 2123 words
[Calibrating Language Models with Adaptive Temperature Scaling 🔗](https://arxiv.org/abs/2409.19817)

Fixing Overconfidence in LLMs with Adaptive Temperature Scaling

Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable fluency and reasoning capabilities. However, a persistent issue plagues even the most advanced models: calibration. Ideally, when an LLM says it is 80% confident in an answer, it should be correct 80% of the time. Unfortunately, this is rarely the case. Modern LLMs, particularly those fine-tuned with Reinforcement Learning from Human Feedback (RLHF), tend to be notoriously overconfident. They might hallucinate a completely incorrect fact but assign it a 99% probability score. In high-stakes fields like medicine, law, or autonomous coding, this disconnect between confidence and accuracy is dangerous. ...

2024-09 · 8 min · 1664 words
[CAT-BENCH: Benchmarking Language Model Understanding of Causal and Temporal Dependencies in Plans 🔗](https://arxiv.org/abs/2406.15823)

Do LLMs Actually Understand Plans? Benchmarking Causal Reasoning with CAT-BENCH

Large Language Models (LLMs) have become exceptionally good at generating procedural text. If you ask a state-of-the-art model for a recipe to bake a cake, it will likely produce a perfectly coherent list of steps: mix the dry ingredients, beat the eggs, combine them, and bake at a specific temperature. On the surface, it looks like the model understands the process. But there is a significant difference between memorizing a sequence of words and understanding the causal logic that binds those steps together. Does the model know why you must mix the flour before baking? Does it understand that you can chop the nuts while the oven preheats, but you can’t frost the cake before it cools? ...

2024-06 · 9 min · 1871 words
[CUTE: Measuring LLMs' Understanding of Their Tokens 🔗](https://arxiv.org/abs/2409.15452)

Do LLMs Actually Know How to Spell? Inside the CUTE Benchmark

When we interact with Large Language Models (LLMs) like GPT-4 or Llama 3, we often attribute human-like literacy to them. We assume that because a model can write a sonnet or debug Python code, it understands text the same way we do: letter by letter, word by word. However, this is a fundamental misconception. LLMs do not “read” text character by character. Instead, they process text through a tokenizer, which chunks characters into tokens. A common word like “the” is processed as a single atomic unit, not as the sequence t-h-e. To the model, the token the is just an integer ID in a massive list, distinct from The, theoretical, or lathe. ...

2024-09 · 9 min · 1869 words