Why Your LLM Can't Keep a Secret: The Science of Verbatim Memorization
In the world of Large Language Models (LLMs), there is a ghost in the machine. Sometimes, models like GPT-4 or Claude don’t just generate novel text—they recite specific training data word-for-word. This phenomenon, known as verbatim memorization, ranges from the innocuous (reciting the Gettysburg Address) to the legally hazardous (reproducing copyrighted code or personally identifiable information). For years, researchers have treated this as a bug to be squashed. The prevailing assumption has been that specific “bad” weights or neurons hoard these memories, and that if we could just locate and prune them, the problem would vanish. ...
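To make "reciting training data word-for-word" concrete, here is a minimal sketch of the standard extractability test used in memorization research: prompt the model with a prefix from a training document and check whether greedy decoding reproduces the exact suffix. The `toy_generate` function below is a hypothetical stand-in for a real model's greedy decoder (it memorizes a single sequence), not any actual model API.

```python
def is_verbatim_memorized(generate, prefix_tokens, true_suffix_tokens):
    """Extractability check: does greedy decoding from `prefix_tokens`
    reproduce the training continuation `true_suffix_tokens` exactly?"""
    out = generate(prefix_tokens, max_new_tokens=len(true_suffix_tokens))
    return out[: len(true_suffix_tokens)] == true_suffix_tokens


# Hypothetical stand-in for a model: it has memorized one sequence verbatim.
TRAINING_SEQ = "four score and seven years ago".split()


def toy_generate(prefix_tokens, max_new_tokens):
    """Greedy 'decoder': continues the memorized sequence when the prefix
    matches it, otherwise falls back to a generic token."""
    n = len(prefix_tokens)
    if TRAINING_SEQ[:n] == list(prefix_tokens):
        return TRAINING_SEQ[n : n + max_new_tokens]
    return ["the"] * max_new_tokens


# The memorized prefix is extractable; an unseen prefix is not.
print(is_verbatim_memorized(toy_generate, ["four", "score"], ["and", "seven"]))
print(is_verbatim_memorized(toy_generate, ["hello"], ["and"]))
```

With a real model, `generate` would wrap greedy decoding (e.g. temperature 0), and the test would run over many sampled training prefixes to estimate how much of the corpus is extractable.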