EMNLP 2024

[Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation 🔗](https://arxiv.org/abs/2402.18191)

Less Data, Better Models: How 'Clustering and Ranking' Revolutionizes Instruction Tuning

Introduction: The Quality vs. Quantity Dilemma

In the current landscape of Large Language Model (LLM) development, there is a prevailing assumption that “more is better.” We often assume that to make a model smarter, we must feed it more tokens, more documents, and more instructions. This is generally true for the pre-training phase, where models learn the statistical structure of language. However, the rules change significantly during the Instruction Tuning (IT) phase, the final polish that teaches a model to act as a helpful assistant. ...

[Cluster-Norm for Unsupervised Probing of Knowledge 🔗](https://arxiv.org/abs/2407.18712)

Cleaning Up the Signal: How Cluster-Norm Improves Unsupervised Knowledge Discovery in LLMs

Large Language Models (LLMs) are impressive, but they are also black boxes. When an LLM outputs a statement, does it “believe” that statement is true, or is it merely simulating a persona that would say that statement? As we fine-tune models with human preferences, we risk teaching them to be sycophants—telling us what we want to hear rather than what is true. To build safer and more reliable AI, we need to look inside the black box. We need to extract the model’s internal “knowledge” directly from its activations, bypassing its text output. This field is known as knowledge elicitation. ...

[Closing the Loop: Learning to Generate Writing Feedback via Language Model Simulated Student Revisions 🔗](https://aclanthology.org/2024.emnlp-main.928.pdf)

Can AI Learn to Teach? Training Feedback Generators with Simulated Students

Introduction: The Teacher’s Dilemma

If you have ever taught a class or mentored a junior colleague, you know the struggle: giving good feedback is hard. Giving good feedback at scale is nearly impossible. In the world of education, feedback is the engine of improvement. A student writes an essay, receives comments, and (hopefully) revises the work to make it better. This cycle helps students develop critical thinking, self-assessment skills, and mastery of the subject. However, for educators, providing detailed, actionable, and personalized feedback for dozens or hundreds of students is an overwhelming time sink. ...

[ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures 🔗](https://arxiv.org/abs/2406.09818)

Why RAG Struggles with Climate Reports: Introducing ClimRetrieve

Introduction

Climate change is arguably the most pressing challenge of our time. To understand how the corporate world is adapting, stakeholders, from investors to regulators, rely heavily on corporate sustainability reports. These documents are massive, qualitative, and complex, often hiding crucial data regarding climate risks and strategies within dense textual narratives. To process this flood of information, the tech world has turned to Retrieval Augmented Generation (RAG). RAG systems use Artificial Intelligence to search through documents, find relevant paragraphs, and use them to generate answers to specific questions. It sounds like the perfect solution. However, there is a catch: we don’t actually know how good these systems are at the specific task of retrieval in the climate domain. ...

[CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios 🔗](https://arxiv.org/abs/2410.03502)

Beyond the Board Exam: Why Chinese Medical AI Needs Real-World Clinical Testing

We are living in an era where Artificial Intelligence is passing medical licensing exams with flying colors. Headlines frequently tout Large Language Models (LLMs) that can score passing grades on the USMLE or its Chinese equivalents. This has led to a surge of excitement (and hype) about the imminent arrival of “AI Doctors.” However, anyone who has been to medical school (or treated a patient) knows a fundamental truth: Passing an exam is not the same as practicing medicine. ...

[CLEANGEN: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models 🔗](https://arxiv.org/abs/2406.12257)

Unlocking Secure AI: How CLEANGEN Disarms Backdoor Attacks in LLMs

The capabilities of Large Language Models (LLMs) like GPT-4, Llama 3, and Claude 3 have revolutionized how we interact with technology. From writing code to acting as personal assistants, these models are becoming ubiquitous. However, this rapid adoption comes with a significant security blind spot. While model weights for LLMs like Llama or Mistral are often public, the massive datasets used to train or fine-tune them are usually opaque. This lack of transparency opens the door for backdoor attacks. An attacker can poison a small fraction of the training data, embedding a hidden “trigger” that forces the model to generate malicious content—like bad code, offensive speech, or biased advice—whenever that trigger appears in a user prompt. ...

[ChatRetriever: Adapting Large Language Models for Generalized and Robust Conversational Dense Retrieval 🔗](https://arxiv.org/abs/2404.13556)

Can LLMs Replace the Search Bar? Meet ChatRetriever

Imagine you are chatting with a friend about movies. You ask, “Who directed Inception?” They answer, “Christopher Nolan.” Then you ask, “What other movies did he make?” To a human, “he” clearly refers to Christopher Nolan. To a standard search engine, however, “he” is ambiguous. This is the fundamental challenge of conversational search. Users naturally use pronouns, ellipses, and context-dependent phrasing, assuming the system remembers the history of the conversation. ...

[ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context 🔗](https://aclanthology.org/2024.emnlp-main.363.pdf)

Why ChatGPT Might Ignore You: The Hidden Biases of AI Guardrails

Introduction

Imagine you are asking an AI assistant for advice on how to legally import a rare plant. If you tell the AI you are a fan of the Philadelphia Eagles, it gives you a helpful list of permits and regulations. But if you mention you support the Los Angeles Chargers, the AI shuts you down, claiming it cannot assist with that request. It sounds like a joke or a statistical anomaly, but according to recent research from Harvard University, this is a reproducible phenomenon. ...

[Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models 🔗](https://arxiv.org/abs/2311.09210)

When RAG Goes Wrong: How 'Chain-of-Note' Teaches AI to Ignore Bad Data

Introduction

We are currently living in the “Golden Age” of Retrieval-Augmented Generation (RAG). If you have worked with Large Language Models (LLMs) recently, you know the drill: LLMs are incredibly smart, but they can be forgetful, outdated, or prone to confident lies, a phenomenon known as hallucination. The industry standard solution has been RAG. The idea is simple: before the model answers a question, we let it “cheat” by looking up the answer in a digital library (like Wikipedia). We retrieve relevant documents, feed them to the model, and ask it to generate an answer based on that evidence. ...

[Chain-of-Dictionary Prompting Elicits Translation in Large Language Models 🔗](https://arxiv.org/abs/2305.06575)

Unlocking Low-Resource Translation - How Chain-of-Dictionary Prompting Supercharges LLMs

Introduction

We often think of Large Language Models (LLMs) like ChatGPT as universal translators. If you ask a modern LLM to translate English into French or Spanish, the results are often fluent and accurate. However, this performance is not distributed equally. When we step away from high-resource languages and attempt to translate into “low-resource” languages (those with significantly less training data available on the internet), the models often stumble. They hallucinate, miss key terms, or fail to generate coherent text entirely. ...

[Chain and Causal Attention for Efficient Entity Tracking 🔗](https://arxiv.org/abs/2410.05565)

Solving the Memory Maze: How Chain and Causal Attention (ChaCAL) Revolutionizes Entity Tracking in LLMs

Imagine you are reading a complex mystery novel. On page 10, the detective puts a key in his pocket. On page 50, he moves the key to a drawer. On page 200, he gives the contents of the drawer to his partner. Finally, on page 300, the partner unlocks a door. To understand that scene, you need to track the location of that key across hundreds of pages and several state changes. ...

[CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures 🔗](https://arxiv.org/abs/2410.05235)

Beyond the Diagnosis: Teaching AI to Argue Like a Doctor with CasiMedicos-Arg

Imagine you are a resident doctor in a busy emergency room. You examine a patient, review their vitals, and turn to your attending physician with a diagnosis. “It’s pneumonia,” you say. The attending looks at you and asks the most terrifying question in medical education: “Why?” It is not enough to get the answer right. In medicine, the reasoning process—the chain of evidence connecting symptoms to a diagnosis—is just as critical as the conclusion itself. ...

[Casablanca: Data and Models for Multidialectal Arabic Speech Recognition 🔗](https://arxiv.org/abs/2410.04527)

Beyond Modern Standard: How 'Casablanca' is Revolutionizing Arabic Speech Recognition

Introduction: The “Speech Divide”

If you are reading this, chances are you have used a voice assistant like Siri, Alexa, or Google Assistant. You might have even marveled at how accurate automated subtitles on YouTube have become. For speakers of English, French, or Spanish, this is the golden age of Automatic Speech Recognition (ASR). Large language models and self-supervised learning (SSL) have solved the majority of transcription problems for these “resource-rich” languages. ...

[CareCorpus+: Expanding and Augmenting Caregiver Strategy Data to Support Pediatric Rehabilitation 🔗](https://aclanthology.org/2024.emnlp-main.392.pdf)

Revolutionizing Pediatric Care: How Synthetic Data and LLMs Are Unlocking Caregiver Strategies

Introduction

Globally, over 50 million children aged 0–5 years experience some form of disability. For these children and their families, pediatric rehabilitation is not just about clinical visits; it is about the daily grind of navigating life. It involves finding ways to participate in family dinners, play at the park, or manage school routines. In this context, caregivers (parents and guardians) are the unsung experts. They develop unique, personalized “strategies” to help their children succeed. ...

[Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! 🔗](https://arxiv.org/abs/2410.01023)

If AI Can Explain the Joke, Does It Understand? Testing Multimodal Literacy with Visual Puns

When a friend winks at you while saying, “I’m definitely going to stick to my diet today,” you immediately understand that they likely mean the opposite. You don’t just process the text (the sentence); you integrate the visual cue (the wink) to resolve the ambiguity of their statement. This ability is known as multimodal literacy. It is the human capacity to actively combine information from different sources (text, images, gestures) to form a complete reasoning process. We do this intuitively when we look at a textbook illustration to understand a complex paragraph or when we read a caption to make sense of an abstract photo. ...

[Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization? 🔗](https://arxiv.org/abs/2406.17274)

The Shaky Foundation of Trust: Why Evaluating Uncertainty in Text Summarization is Harder Than We Thought

In the rapidly evolving world of Natural Language Generation (NLG), we have witnessed Large Language Models (LLMs) perform feats that were considered science fiction only a decade ago. From summarizing complex financial reports to condensing medical records, abstractive text summarization is reshaping industries. However, there is a catch. LLMs hallucinate. They can generate summaries that sound fluent and confident but are factually incorrect. In high-stakes domains—like healthcare or finance—relying on a false summary can have catastrophic consequences. ...

[Can Transformers Learn n-gram Language Models? 🔗](https://arxiv.org/abs/2410.03001)

Beyond the Hype—Are Transformers Actually Good at Learning Basic n-grams?

If you have been following the explosion of Natural Language Processing (NLP) in recent years, you know that the Transformer architecture is the engine behind the revolution. From GPT-4 to Claude, Transformers seem capable of mastering complex reasoning, coding, and creative writing. But in the research world, a fundamental question remains: Do we actually understand how they learn? There is a significant body of theoretical work exploring what Transformers can represent. For example, we know mathematically that a Transformer is capable of mimicking an n-gram language model (a simple model that predicts the next word based on the previous \(n-1\) words). But just because a neural network can represent a function doesn’t mean it will actually learn that function from data using gradient descent. ...

[Can Large Language Models Learn Independent Causal Mechanisms? 🔗](https://arxiv.org/abs/2402.02636)

Beyond Stochastic Parrots—Teaching LLMs to Think with Independent Causal Mechanisms

Introduction

We are living in the golden age of Large Language Models (LLMs). Systems like GPT-4 and LLaMA have revolutionized how we interact with technology, demonstrating linguistic prowess that often feels like genuine intelligence. However, there is a “ghost in the machine.” Despite their fluency, these models often fail spectacularly when faced with tasks that require rigorous logical consistency or when the data distribution shifts slightly from what they saw during training. ...

[Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? 🔗](https://arxiv.org/abs/2405.16908)

Why Your LLM Sounds So Confident Even When It's Wrong: The Challenge of Faithful Uncertainty

Introduction

We have all experienced it: you ask a Large Language Model (LLM) a specific factual question, perhaps about an obscure historical figure or a specific coding error, and it responds with absolute conviction. The grammar is perfect, the tone is authoritative, and the delivery is decisive. There is just one problem: the answer is completely wrong. This phenomenon highlights a critical gap in modern Artificial Intelligence. LLMs are trained to generate fluent, persuasive text, often at the expense of accuracy. While we call these “hallucinations,” the danger isn’t just that the model is wrong; it is that the model is persuasively wrong. It mimics the cadence of an expert even when it is essentially guessing. ...

[Can Large Language Models Enhance Predictions of Disease Progression? Investigating Through Disease Network Link Prediction 🔗](https://aclanthology.org/2024.emnlp-main.980.pdf)

ComLLM: How Large Language Models and Graphs Are Revolutionizing Disease Prediction

Introduction

The digital transformation of healthcare has provided us with a staggering amount of data. Electronic health records (EHRs) track everything from routine checkups to critical diagnoses, creating a rich history of patient health. Yet, having data and effectively using it to predict the future are two very different things. One of the most critical challenges in modern medical AI is predicting disease progression and comorbidity: the likelihood that a patient with one condition (like diabetes) will develop another (like heart disease). ...