[Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection 🔗](https://aclanthology.org/2024.emnlp-main.1099.pdf)

The Sanitized Web: Unpacking the Risks of Synthetic Data in Hate Speech Detection

The explosion of Generative AI has handed researchers and engineers a “magic wand” for data creation. Facing a shortage of labeled training data? Just ask a Large Language Model (LLM) to generate it for you. This promise of infinite, privacy-compliant, and low-cost data is revolutionizing Natural Language Processing (NLP). But when we move away from objective tasks—like summarizing a news article—and into the murky, subjective waters of hate speech detection, does this magic still hold up? ...

9 min · 1773 words
[Defining Knowledge: Bridging Epistemology and Large Language Models 🔗](https://arxiv.org/abs/2410.02499)

Does GPT-4 Actually "Know" Anything? Bridging AI and Epistemology

If you ask a Large Language Model (LLM) like GPT-4, “Is the Earth round?”, it will confidently reply, “Yes.” If you ask it for the capital of Germany, it will say “Berlin.” In the field of Natural Language Processing (NLP), we often say the model “knows” these facts. We measure this “knowledge” by testing how many questions the model answers correctly. But pause for a moment. Does the model truly know the Earth is round? Or is it simply predicting the next likely token based on statistical correlations found in its training data? ...

2024-10 · 9 min · 1869 words
[Defending Jailbreak Prompts via In-Context Adversarial Game 🔗](https://arxiv.org/abs/2402.13148)

Gaming the System: How Adversarial AI Agents Learn to Defend LLMs Without Fine-Tuning

Large Language Models (LLMs) have revolutionized how we interact with information. From writing code to summarizing history, their capabilities seem boundless. However, with great power comes a significant vulnerability: Jailbreaking. A jailbreak attack occurs when a user deliberately engineers a prompt to trick the LLM into bypassing its safety filters. You might have seen these as “DAN” (Do Anything Now) prompts or elaborate role-playing scenarios where the model is tricked into acting as a malicious character. While an LLM is trained to refuse a request like “How do I build a bomb?”, a clever jailbreak prompt can wrap that request in a complex narrative that slips past the model’s alignment training. ...

2024-02 · 8 min · 1526 words
[DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models 🔗](https://arxiv.org/abs/2410.05639)

DecorateLM: How "Decorating" Data Builds Better LLMs

In the rapidly evolving world of Large Language Models (LLMs), there is a saying that has become gospel: “Data is the new oil.” But anyone who works with engines knows that you cannot just pour crude oil into a Ferrari and expect it to win a race. The oil needs to be refined. For years, the strategy for training LLMs was simply “bigger is better.” Researchers scraped the entire internet—billions upon billions of words—and fed it into massive neural networks. But as models have grown, a bottleneck has emerged. The internet is noisy, unstructured, and full of low-quality information. Training on “crude” data leads to hallucinations, poor reasoning, and inefficient learning. ...

2024-10 · 8 min · 1592 words
[Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison 🔗](https://arxiv.org/abs/2407.07840)

Trust but Verify: How to Measure Reliability in Vision-Language Models with DeCC

Imagine you are using a state-of-the-art AI to analyze medical X-rays or navigate an autonomous vehicle. You ask the model a question about an image, and it gives you an immediate, confident answer. But here is the critical problem: How do you know if the model is actually right, or if it’s just confidently hallucinating? Vision-Language Models (VLMs) have made tremendous strides in understanding the world, but they are far from perfect. They suffer from overconfidence—they often sound just as certain when they are wrong as when they are right. For students and researchers entering the field of multimodal AI, solving this reliability puzzle is one of the most significant hurdles to deploying these models in the real world. ...

2024-07 · 9 min · 1707 words
[Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach 🔗](https://arxiv.org/abs/2311.09630)

Can AI Predict Who Falls for Fake News? Decoding Susceptibility with Machine Learning

We live in an era of information overload, but perhaps more dangerously, an era of information disorder. False claims, conspiracy theories, and pseudo-scientific advice—particularly regarding COVID-19—spread through social media networks like wildfire. While we often focus on the content of misinformation or the algorithms that amplify it, there is a third, critical component to this ecosystem: the human element. Why does one person scroll past a conspiracy theory while another pauses, believes it, and hits the “Retweet” button? This tendency to believe in unverifiable or false claims is known as susceptibility. ...

2023-11 · 10 min · 1971 words
[Decoding Matters: Addressing Amplification Bias and Homogeneity Issue for LLM-based Recommendation 🔗](https://aclanthology.org/2024.emnlp-main.589.pdf)

Broken by Design: Why Standard LLM Decoding Fails Recommender Systems (and How to Fix It)

In the rapidly evolving world of Artificial Intelligence, Large Language Models (LLMs) have become the hammer for every nail. Naturally, researchers have turned their attention to Recommender Systems (RecSys). The premise is exciting: instead of just predicting an ID for a product, why not let an LLM “generate” the recommendation, understanding user intent through natural language? However, simply grafting an LLM onto a recommendation task isn’t plug-and-play. Most research focuses on how to train or fine-tune these models. But a new paper, “Decoding Matters: Addressing Amplification Bias and Homogeneity Issue for LLM-based Recommendation,” argues that we are overlooking a critical component: the decoding strategy. ...

9 min · 1776 words
[Deciphering the Interplay of Parametric and Non-parametric Memory in Retrieval-augmented Language Models 🔗](https://arxiv.org/abs/2410.05162)

Inside the Brain of RAG: How Models Choose Between Memory and Context

In the world of Large Language Models (LLMs), a quiet battle is constantly being waged between two types of memory. On one side, there is the model’s internal training—the facts it memorized during its creation (Parametric Memory). On the other side, there is the new information provided to it in real-time via retrieved documents (Non-parametric Memory). Imagine asking a model, “Who is the CEO of Company X?” If the model was trained in 2021, its internal memory might say “Alice.” But if a retrieval system fetches a news article from 2024 saying “Bob is the new CEO,” the model faces a conflict. Does it trust what it “knows,” or what it is currently “reading”? ...

2024-10 · 8 min · 1572 words
[Deciphering Rumors: A Multi-Task Learning Approach with Intent-aware Hierarchical Contrastive Learning 🔗](https://aclanthology.org/2024.emnlp-main.256.pdf)

Beyond Fact-Checking: How AI Can Detect Rumors by Understanding User Intent

Social media has fundamentally changed how we consume information. It is a double-edged sword: while it democratizes information, it also serves as a breeding ground for rumors and misinformation. In the “post-truth” era, the challenge isn’t just about identifying false statements; it is about navigating a chaotic environment full of noise, subjectivity, and malicious intent. For researchers and data scientists, rumor detection is notoriously difficult. Why? Because belief is subjective. As psychologists Tversky and Kahneman noted with the “anchoring effect,” people are willing to believe information that aligns with their pre-existing cognitive anchors. A rumor spreads not just because it is sensational, but because it resonates with the disseminator’s intent. ...

10 min · 1950 words
[Deciphering Cognitive Distortions in Patient-Doctor Mental Health Conversations: A Multimodal LLM-Based Detection and Reasoning Framework 🔗](https://aclanthology.org/2024.emnlp-main.1256.pdf)

Beyond Text: How Multimodal AI Can Explain the 'Why' Behind Mental Health Struggles

Imagine a friend tells you, “I waved at my neighbor, but he didn’t wave back. He must hate me.” As a human, you likely recognize this as a leap in logic—perhaps the neighbor just didn’t see them. In psychology, this is known as a Cognitive Distortion (CoD)—an exaggerated or irrational thought pattern that perpetuates negative emotions, often associated with anxiety and depression. While identifying these distortions is a critical part of Cognitive Behavioral Therapy (CBT), understanding the reasoning behind them is what truly allows for a breakthrough. ...

7 min · 1476 words
[De-Identification of Sensitive Personal Data in Datasets Derived from IIT-CDIP 🔗](https://aclanthology.org/2024.emnlp-main.1198.pdf)

Hiding in Plain Sight: A Deep Dive into De-Identifying Sensitive Document Datasets

In the era of large language models and multimodal deep learning, data is the fuel that powers innovation. Researchers and students alike often rely on massive, publicly available datasets to benchmark new architectures. We assume these datasets are benign—collections of innocuous text and images curated for scientific progress. But what happens when we look closer? The IIT-CDIP test collection is a behemoth in the document understanding field. Containing over 7 million scanned documents (roughly 40 million pages) from legal proceedings against tobacco companies, it serves as the parent source for several critical benchmarks, including RVL-CDIP (document classification), DocVQA (visual question answering), and FUNSD (form understanding). ...

9 min · 1907 words
[DATATALES: A Benchmark for Real-World Intelligent Data Narration 🔗](https://arxiv.org/abs/2410.17859)

Can LLMs Read the Stock Market? Inside the DATATALES Benchmark

Large Language Models (LLMs) have mastered poetry, code generation, and summarizing emails. But if you hand an LLM a spreadsheet of raw stock market data and ask, “What is the story here?”, the results are often surprisingly lackluster. While models like GPT-4 are excellent at fluency, they struggle with data narration—the ability to transform complex, structured data into meaningful, analytical stories. In the business world, this is a critical skill. It’s not enough to say “Stock X went up”; an analyst needs to explain the trend, identify the cause, and predict the implication. ...

2024-10 · 8 min · 1549 words
[DATANARRATIVE: Automated Data-Driven Storytelling with Visualizations and Texts 🔗](https://arxiv.org/abs/2408.05346)

Can AI Write Data Stories? Inside the DATANARRATIVE Agentic Framework

We have all seen the magic of a great data storyteller. Think of Hans Rosling explaining global population growth with bubbling charts, or a detailed investigative piece in The New York Times where the text perfectly weaves through interactive visualizations. These narratives don’t just dump numbers on a page; they contextualize data, highlighting trends and causal relationships to deliver a clear message. However, crafting these stories is incredibly difficult. It requires a rare intersection of skills: data analysis, graphic design, and narrative writing. For business analysts, journalists, and educators, the process of identifying insights (“story pieces”), designing the right charts, and writing the accompanying text is a time-consuming, mentally taxing bottleneck. ...

2024-08 · 8 min · 1588 words
[Data, Data Everywhere: A Guide for Pretraining Dataset Construction 🔗](https://arxiv.org/abs/2407.06380)

The Missing Manual: How to Build a Trillion-Token Pretraining Dataset

If you look at the recent history of Large Language Models (LLMs)—from GPT-3 to Llama 3 to Mistral—you will notice something interesting. The model architectures haven’t changed all that much. They are mostly variants of the Transformer decoder. What has changed is the scale and, more importantly, the data. We have entered an era where “data engineering” is as critical as “model engineering.” Yet, the recipes for constructing the massive, multi-trillion token datasets required to train these models are often guarded as trade secrets. How exactly do you filter 100 terabytes of web text? How do you balance Python code against English literature? ...

2024-07 · 8 min · 1616 words
[Data Contamination Can Cross Language Barriers 🔗](https://arxiv.org/abs/2406.13236)

The Silent Leak: How Data Contamination Hides Behind Language Barriers

The race for State-of-the-Art (SOTA) in Large Language Models (LLMs) is relentless. Every few weeks, a new model climbs the leaderboard, boasting higher scores on benchmarks like MMLU (Massive Multitask Language Understanding) or GSM8K (math reasoning). But as these scores creep closer to 100%, a skeptical question looms over the AI community: Are these models actually getting smarter, or are they just memorizing the test answers? ...

2024-06 · 10 min · 1946 words
[DATA ADVISOR: Dynamic Data Curation for Safety Alignment of Large Language Models 🔗](https://arxiv.org/abs/2410.05269)

Beyond Random Generation: How DATA ADVISOR Fixes LLM Safety Training

In the race to build more capable Large Language Models (LLMs), data is the fuel. But high-quality, human-annotated data is a finite and expensive resource. To bypass this bottleneck, researchers have turned to a clever, somewhat recursive solution: using LLMs to generate the data to train other LLMs. This technique, often called “Self-Instruct,” allows for massive scalability. However, there is a catch. When an LLM generates data based solely on a few random examples, it tends to be repetitive. It mimics the patterns it sees but lacks the “awareness” to explore new, underrepresented concepts. In the context of safety alignment—teaching models to refuse harmful requests—this is a critical vulnerability. If your data generator only creates questions about physical violence, the resulting model might be perfectly safe against violent prompts but completely vulnerable to questions about financial fraud or cyberbullying. ...

2024-10 · 8 min · 1551 words
[Dancing in Chains: Reconciling Instruction Following and Faithfulness in Language Models 🔗](https://arxiv.org/abs/2407.21417)

The Hallucination Trade-off: Can LLMs Be Both Helpful and Honest?

In the current landscape of Large Language Models (LLMs), we are witnessing a tug-of-war between two highly desirable traits. On one side, we want models that are helpful and conversational—models that can follow open-ended instructions, write poetry, and chat like a human. On the other side, we desperately need models that are faithful and grounded—models that, when given a specific document or context, answer questions based only on that information without making things up. ...

2024-07 · 9 min · 1836 words
[DVD: Dynamic Contrastive Decoding for Knowledge Amplification in Multi-Document Question Answering 🔗](https://aclanthology.org/2024.emnlp-main.266.pdf)

Beyond Retrieval: How Dynamic Contrastive Decoding (DVD) Amplifies Knowledge in LLMs

Large Language Models (LLMs) have revolutionized how we interact with information, but they suffer from a persistent flaw: hallucinations. When an LLM doesn’t know an answer, it often makes one up. The standard industry solution to this is Retrieval-Augmented Generation (RAG). In a RAG system, the model retrieves relevant documents from an external database and uses them as context to answer the user’s question. However, RAG introduces a new problem. What happens when the retrieval system pulls in “noisy” documents—irrelevant text, outdated information, or conflicting facts? Standard LLMs often struggle to distinguish gold from garbage within the retrieved context. They can get distracted, leading to answers that are factually incorrect despite having access to the right information. ...

3 min · 593 words
[DKEC: Domain Knowledge Enhanced Multi-Label Classification for Diagnosis Prediction 🔗](https://arxiv.org/abs/2310.07059)

Bridging the Gap in AI Diagnosis: How Knowledge Graphs Empower Smaller Models to Handle Rare Diseases

In the world of medical Artificial Intelligence (AI), there is a persistent tension between “what the model sees” and “what the model knows.” Imagine a newly trained resident doctor in an emergency room. When a patient arrives with common flu symptoms, the resident diagnoses it immediately based on experience—they have seen it a hundred times. But what happens when a patient arrives with a rare set of symptoms indicating an obscure condition like crush syndrome? If the resident hasn’t seen that specific case during their rotation, they might miss it. ...

2023-10 · 8 min · 1588 words
[DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers 🔗](https://arxiv.org/abs/2410.22239)

Debugging AI with AI: How DISCERN Uses Language to Fix Classifier Biases

Imagine a student who consistently scores 95% on history exams. On the surface, they appear to be a master of the subject. However, a closer look reveals a strange pattern: they answer every single question about the 19th Century correctly, but fail every single question related to the Industrial Revolution. This student doesn’t just have a knowledge gap; they have a systematic bias. Machine learning models, particularly text classifiers, behave in much the same way. We often judge them by their aggregate accuracy metrics. If a sentiment analysis model achieves 92% accuracy, it is deemed ready for deployment. Yet, hidden within that remaining 8% error rate are often “systematic errors”—clusters of failures triggered by specific sentence structures, topics, or annotation artifacts in the training data. ...

2024-10 · 9 min · 1819 words