[HalluMeasure: Fine-grained Hallucination Measurement Using Chain-of-Thought Reasoning 🔗](https://aclanthology.org/2024.emnlp-main.837.pdf)

How to Catch a Lying AI: Inside HalluMeasure's Chain-of-Thought Approach

Imagine a lawyer walking into a courtroom, confident in their case, only to be sanctioned by the judge because the legal precedents they cited didn’t exist. Or consider a company’s stock value dropping by $100 billion because their AI demo claimed the James Webb Space Telescope took the first picture of an exoplanet (it didn’t). These aren’t hypothetical scenarios; they are real-world consequences of Large Language Model (LLM) hallucinations. As LLMs become integrated into search engines, customer service bots, and professional workflows, the cost of “making things up” becomes increasingly high. ...

8 min · 1605 words
[HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding 🔗](https://arxiv.org/abs/2409.20429)

Curing Multimodal Hallucinations: A Deep Dive into HELPD

Imagine showing an AI a picture of a snowy forest and asking it to describe what it sees. The model confidently describes the snow, the trees, and then adds, “…and a squirrel is eating a nut on the branch.” You look closer. There is a squirrel, but it’s jumping, not eating. And it’s on the ground, not a branch. This phenomenon is known as hallucination. In Large Vision-Language Models (LVLMs)—the systems behind tools like GPT-4V or LLaVA—it is a persistent and critical issue. While these models have demonstrated incredible prowess in understanding visual data, they frequently generate content that either contradicts the image or invents objects that simply aren’t there. ...

2024-09 · 9 min · 1756 words
[HEART-felt Narratives: Tracing Empathy and Narrative Style in Personal Stories with LLMs 🔗](https://arxiv.org/abs/2405.17633)

How Storytelling Style Drives Empathy: Introducing the HEART Taxonomy

Why does one story leave you in tears while another, describing a similar tragedy, leaves you feeling indifferent? For psychologists and computer scientists alike, empathy is a fascinating, complex mechanism. It is the cornerstone of prosocial behavior—the engine that drives us to help others and build community. Traditionally, we assume empathy is triggered by content: the tragic loss, the triumphant win, or the relatable struggle. But intuitively, we know that the way a story is told—its narrative style—plays a massive role in how it lands. ...

2024-05 · 7 min · 1412 words
[GuardBench: A Large-Scale Benchmark for Guardrail Models 🔗](https://aclanthology.org/2024.emnlp-main.1022.pdf)

Guarding the AI: A Deep Dive into GuardBench and the State of LLM Safety

The rapid deployment of Large Language Models (LLMs) has revolutionized how we interact with technology, from coding assistants to creative writing partners. However, this capabilities boom comes with a significant “dark side.” Without proper alignment and safety measures, these powerful models can be misused to generate hate speech, provide instructions for illegal acts, or output harmful medical advice. To mitigate these risks, the industry has turned to guardrail models. These are specialized AI systems designed to act as input-output filters—monitoring what a user types into a chat and what the LLM types back. If a guardrail detects something unsafe, it blocks the interaction. ...
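
To make the filter pattern concrete, here is a minimal sketch of the input-output guardrail loop described above. The `is_unsafe` stub and `guarded_chat` wrapper are illustrative stand-ins: GuardBench benchmarks trained guardrail models, it does not ship this logic.

```python
# A minimal sketch of the guardrail pattern: a safety check wraps the LLM
# on both the input and the output side. Replace `is_unsafe` with a real
# trained classifier in practice.

UNSAFE_TOPICS = ("hate speech", "illegal acts", "harmful medical advice")

def is_unsafe(text: str) -> bool:
    """Toy guardrail: a real system would call a trained classifier here."""
    return any(topic in text.lower() for topic in UNSAFE_TOPICS)

def guarded_chat(user_prompt: str, llm) -> str:
    # Screen the user's input before it reaches the model.
    if is_unsafe(user_prompt):
        return "Request blocked by input guardrail."
    # Screen the model's output before it reaches the user.
    response = llm(user_prompt)
    if is_unsafe(response):
        return "Response blocked by output guardrail."
    return response
```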

8 min · 1600 words
[Grounding Language in Multi-Perspective Referential Communication 🔗](https://arxiv.org/abs/2410.03959)

Can You See What I See? Teaching AI to Communicate Across Perspectives

Imagine you are helping a friend find their lost keys. You are standing in the doorway, and they are behind the kitchen island. You see the keys on the counter, but from their angle, the keys are hidden behind a fruit bowl. If you simply say, “They’re on the counter,” they might not see them. If you say, “They’re to your left, behind the apples,” they find them immediately. This everyday interaction requires a complex cognitive ability known as perspective-taking (or Theory of Mind). You aren’t just describing what you see; you are modeling what your friend sees and adjusting your language accordingly. ...

2024-10 · 7 min · 1459 words
[Grasping the Essentials: Tailoring Large Language Models for Zero-Shot Relation Extraction 🔗](https://arxiv.org/abs/2402.11142)

Can AI Learn Relationships Just from Definitions? Inside the REPAL Framework

In the world of Natural Language Processing (NLP), teaching machines to read text is one thing; teaching them to understand the connections between entities is entirely another. This task is known as Relation Extraction (RE). Imagine you are building a system to analyze news articles. You don’t just want the computer to recognize the words “Steve Jobs” and “Apple.” You want it to extract the specific relationship: FounderOf. Traditionally, this requires training models on massive, human-labeled datasets where thousands of sentences are tagged with specific relationships. But what happens when you need to find a new type of relationship that you haven’t labeled yet? Collecting new data is expensive and slow. ...
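
As a rough sketch of what definition-only (zero-shot) relation extraction can look like, the snippet below prompts an LLM with a relation definition and asks for a yes/no judgment on an entity pair. The prompt template and the `call_llm` hook are assumptions made for illustration; they are not REPAL's actual multi-stage pipeline.

```python
# Zero-shot relation extraction from a definition alone: no labeled
# training sentences, only the relation's textual description.

RELATION_DEFINITION = (
    "FounderOf(person, organization): the person established the organization."
)

PROMPT = """Relation definition: {definition}

Sentence: {sentence}
Entity pair: ({head}, {tail})

Does the relation hold for this entity pair? Answer yes or no."""

def relation_holds(sentence: str, head: str, tail: str, call_llm) -> bool:
    prompt = PROMPT.format(
        definition=RELATION_DEFINITION, sentence=sentence, head=head, tail=tail
    )
    return call_llm(prompt).strip().lower().startswith("yes")

# e.g. relation_holds("Steve Jobs co-founded Apple in 1976.",
#                     "Steve Jobs", "Apple", call_llm)
```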

2024-02 · 8 min · 1648 words
[Granular Privacy Control for Geolocation with Vision Language Models 🔗](https://arxiv.org/abs/2407.04952)

When AI Knows Where You Are—Controlling Geolocation Privacy in Vision Language Models

Imagine you upload a photo of your lunch to social media. You want your friends to know you are enjoying a trip to Paris, but you definitely don’t want strangers figuring out the exact street corner you are standing on, let alone the specific restaurant that might reveal your hotel’s location. For years, privacy on the internet was often treated as binary: either you share something, or you don’t. However, with the rise of powerful Vision Language Models (VLMs) like GPT-4V, the line has blurred. These models possess an uncanny ability to analyze visual data—identifying landmarks, reading blurry text on background signs, and recognizing architectural styles—to pinpoint locations with frightening accuracy. ...

2024-07 · 9 min · 1858 words
[GOLD COIN: Grounding Large Language Models in Privacy Laws via Contextual Integrity Theory 🔗](https://arxiv.org/abs/2406.11149)

Teaching AI to Judge Privacy: How Contextual Integrity Grounds LLMs in Law

Privacy is rarely black and white. Consider a simple piece of information: a blood test result. If a doctor sends this result to a specialist for a second opinion, it is standard medical practice. However, if that same doctor sends the same result to a marketing firm, it is a severe privacy violation. The data (the blood test) didn’t change. The sender (the doctor) didn’t change. What changed was the context—specifically the recipient and the purpose of the transfer. ...
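
The contextual-integrity framing in this example reduces to a tuple of data, sender, recipient, and purpose, checked against a set of norms. Below is a minimal sketch of that idea; the `Flow` class and the norm table are invented for illustration and are not drawn from GOLD COIN or from any statute.

```python
# Contextual integrity in data-structure form: the same data can be
# appropriate or a violation depending on recipient and purpose.
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    data: str
    sender: str
    recipient: str
    purpose: str

# Norms: which information flows are appropriate in which context.
APPROPRIATE = {
    Flow("blood test", "doctor", "specialist", "second opinion"),
}

def violates_privacy(flow: Flow) -> bool:
    return flow not in APPROPRIATE

print(violates_privacy(Flow("blood test", "doctor", "marketing firm", "advertising")))  # True
print(violates_privacy(Flow("blood test", "doctor", "specialist", "second opinion")))   # False
```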

2024-06 · 8 min · 1629 words
[Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs 🔗](https://arxiv.org/abs/2410.01188)

Gold Panning in the Tokenizer: An Adaptive Approach to Domain-Specific LLMs

Large Language Models (LLMs) like GPT-4 and LLaMA are generalists. They can write a poem, solve a math problem, or summarize a history lesson with reasonable competence. However, when you drop these generalist models into a highly specialized environment—such as a law firm or a hospital—they often stumble. They lack the specific jargon and deep domain knowledge required to generate precise legal contracts or medical diagnoses. To bridge this gap, researchers typically turn to Supervised Fine-Tuning (SFT) on domain-specific data. But there is an often-overlooked bottleneck: the vocabulary. General models use a vocabulary optimized for general text. When they encounter specialized terms like “hemorrhoids” or complex legal statutes, they often break them down into inefficient, fragmented sub-tokens. ...
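
You can observe this fragmentation with any off-the-shelf tokenizer. The sketch below uses GPT-2's tokenizer via Hugging Face `transformers` (an assumption for illustration; the paper may use different base models) and shows the standard mechanics of expansion. Deciding which tokens are actually worth adding is the hard part and the paper's contribution, not shown here.

```python
# Fragmentation demo with an off-the-shelf tokenizer (requires `transformers`).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("hemorrhoids"))  # the term splits into several sub-word pieces

# Standard expansion mechanics: register the whole word as one token,
# then resize the model's embedding matrix before fine-tuning.
tok.add_tokens(["hemorrhoids"])
print(tok.tokenize("hemorrhoids"))  # now a single token
# model.resize_token_embeddings(len(tok))  # uncomment once a model is loaded
```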

2024-10 · 9 min · 1715 words
[GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text 🔗](https://arxiv.org/abs/2403.06399)

GlossLM: Bridging the Gap Between NLP and Endangered Language Documentation

There are approximately 7,000 languages spoken in the world today. Tragically, nearly half of them are considered endangered. While communities and linguists are working tirelessly to preserve and revitalize these languages, the process of language documentation is notoriously slow and labor-intensive. Imagine you are a field linguist recording a story from an elder in an endangered language. You have the audio and the transcription. But to make that data useful for dictionaries, grammars, or teaching materials, you need to perform a task called Interlinear Glossing. This involves analyzing the text morpheme-by-morpheme (the smallest units of meaning) and assigning grammatical labels to them. It is a task that requires deep expertise and immense time. ...
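
To make the task concrete, here is what a single interlinear glossed entry looks like, using a Turkish word as a hand-picked illustration (not necessarily an example from the GlossLM corpus).

```python
# One interlinear glossed text (IGT) entry, represented as a dictionary.
igt_entry = {
    "transcription": "evlerim",             # the word as spoken or written
    "segmentation":  "ev-ler-im",           # split into morphemes
    "gloss":         "house-PL-1SG.POSS",   # one grammatical label per morpheme
    "translation":   "my houses",
}
# A glossing model must produce the "gloss" line from the transcription
# (and, when available, the translation). This is the task GlossLM targets.
```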

2024-03 · 8 min · 1627 words
[Global Reward to Local Rewards: Multimodal-Guided Decomposition for Improving Dialogue Agents 🔗](https://aclanthology.org/2024.emnlp-main.881.pdf)

From a Head Nod to a Thumbs Up: How Multimodal Signals Teach AI to Hold Better Conversations

Imagine you are teaching a friend how to tell a story. If you stop them after every single sentence to say “good job” or “that was boring,” the flow of the conversation is ruined. It’s unnatural. Instead, you usually listen to the whole story and, at the end, give a reaction—perhaps a laugh, a sigh, or a compliment like, “That was a great story!” This dynamic represents a massive bottleneck in the development of Artificial Intelligence, specifically in training Large Language Models (LLMs) to be better conversationalists. ...

11 min · 2140 words
[Getting the Most Out of Your Training Data: Exploring Unsupervised Tasks for Morphological Inflection 🔗](https://aclanthology.org/2024.emnlp-main.1055.pdf)

Squeezing the Stone: How Unsupervised Tasks Boost Morphological Inflection with Limited Data

In the world of Natural Language Processing (NLP), we have become accustomed to the “bigger is better” paradigm. Massive models like BERT or GPT are trained on effectively the entire internet, learning the statistical patterns of language before they are ever shown a specific task. But what happens when we zoom in from the level of sentences and paragraphs to the level of individual characters? And more importantly, what happens when we don’t have the internet’s worth of data for a specific language? ...

8 min · 1565 words
[Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners 🔗](https://arxiv.org/abs/2405.13816)

Getting More from Less: How LLMs Spontaneously Become Multilingual Experts

Large Language Models (LLMs) like GPT-4, LLaMA, and Mistral have revolutionized natural language processing. If you speak English, these tools feel almost magical. However, if you switch to a low-resource language—say, Swahili or Bengali—the “magic” often fades. The performance gap between high-resource languages (like English and Chinese) and low-resource languages remains a massive hurdle for AI equity. Traditionally, fixing this required massive amounts of multilingual training data or complex translation pipelines. But what if LLMs already know how to handle these languages, and we just aren’t asking them correctly? ...

2024-05 · 8 min · 1658 words
[GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation 🔗](https://arxiv.org/abs/2406.11503)

Teaching AI Geometry: How GeoGPT4V Solves the Visual Math Problem

If you’ve ever sat in a geometry class, you know that the text of a problem is often useless without the diagram next to it. “Find the length of side \(AC\)” means nothing if you can’t see the triangle. This reliance on visual aids makes geometry one of the most challenging frontiers for Artificial Intelligence. While Large Language Models (LLMs) have become incredibly adept at solving text-based math word problems, they hit a wall when visual reasoning is required. Even state-of-the-art Multi-modal Large Language Models (MLLMs)—models that can “see” images and read text—often struggle to match human performance in geometry. They might misinterpret a diagram or fail to connect the visual angles to the numerical values in the text. ...

2024-06 · 8 min · 1676 words
[Generative Subgraph Retrieval for Knowledge Graph–Grounded Dialog Generation 🔗](https://arxiv.org/abs/2410.09350)

How to Stop LLM Hallucinations: A Deep Dive into DialogGSR and Generative Subgraph Retrieval

We are living in the golden age of Large Language Models (LLMs). From ChatGPT to Claude, these models can produce poetry, code, and casual conversation with frightening fluency. However, anyone who has used them for factual research knows their dirty secret: hallucinations. Because LLMs generate text based on statistical likelihood rather than a database of facts, they can confidently assert falsehoods. To fix this, researchers often turn to Knowledge Graphs (KGs). Instead of relying solely on the model’s internal memory, we ground the conversation in a structured graph of entities and relationships (e.g., Lionel Messi – plays for – Inter Miami). ...
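
For contrast with the paper's approach, here is conventional subgraph retrieval in miniature; DialogGSR's novelty is that it generates the relevant subgraph with the language model itself rather than looking it up. The toy triples and retrieval function below are assumptions for illustration.

```python
# Conventional KG grounding: fetch triples that mention an entity, then
# serialize them into the prompt so the LLM's answer is anchored in facts.
KG = [
    ("Lionel Messi", "plays for", "Inter Miami"),
    ("Lionel Messi", "born in", "Rosario"),
    ("Inter Miami", "based in", "Florida"),
]

def retrieve_subgraph(entity: str, kg=KG):
    """Return every triple whose head or tail is the entity."""
    return [t for t in kg if entity in (t[0], t[2])]

context = " ".join(f"({h}, {r}, {t})" for h, r, t in retrieve_subgraph("Lionel Messi"))
# `context` is prepended to the dialogue history before generation.
```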

2024-10 · 8 min · 1684 words
[Generative Models for Automatic Medical Decision Rule Extraction from Text 🔗](https://aclanthology.org/2024.emnlp-main.399.pdf)

From Textbooks to Treatment: Automating Medical Decision Trees with Generative AI

Imagine a doctor facing a patient with a complex set of symptoms. To prescribe the right medication, the doctor mentally traverses a flowchart: Is the condition mild or severe? If severe, is there a complication? If yes, use Drug A; otherwise, use Drug B. This logical roadmap is a Medical Decision Rule (MDR). These rules are the backbone of Clinical Decision Support Systems (CDSS), software that helps medical professionals make safe, evidence-based choices. ...
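
The flowchart in that example translates directly into branching code, which is essentially what a Clinical Decision Support System executes. Here is a minimal sketch; the drug names are the placeholders from the example above, and the mild branch, which the example leaves unspecified, is flagged rather than invented.

```python
# The example flowchart as executable logic. Placeholder values only.
def recommend(severity: str, has_complication: bool) -> str:
    if severity == "severe":
        # Severe cases branch on the presence of a complication.
        return "Drug A" if has_complication else "Drug B"
    # The example text does not specify treatment for the mild branch.
    return "consult guidelines"
```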

9 min · 1735 words
[Generation with Dynamic Vocabulary 🔗](https://arxiv.org/abs/2410.08481)

Beyond Static Tokens: Revolutionizing Language Models with Dynamic Vocabulary

In the rapidly evolving world of Large Language Models (LLMs), we often focus on the sheer size of the models—billions of parameters trained on trillions of words. However, there is a fundamental component of these models that has remained surprisingly rigid: the vocabulary. Think of a language model as a builder constructing a house (a sentence). The builder uses bricks (tokens) to create the structure. In the current paradigm, the size and shape of these bricks are determined before the builder even starts learning. Once the “static” vocabulary is defined by a tokenizer (like BPE or WordPiece), it is locked. The model must construct everything, from simple articles to complex technical terms, using this fixed set of bricks. ...
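
The cost of the fixed brick set is easy to observe: a common technical phrase gets assembled from several sub-word tokens, one generation step each. The probe below uses GPT-2's tokenizer as an arbitrary example (an assumption, not the paper's model); a dynamic vocabulary would let the model emit such a phrase as a single unit.

```python
# Counting the "bricks" a static vocabulary needs for one concept
# (requires `transformers`).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
pieces = tok.tokenize("retrieval-augmented generation")
print(len(pieces), pieces)  # several sub-word tokens for a single phrase
```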

2024-10 · 9 min · 1789 words
[Generating Demonstrations for In-Context Compositional Generalization in Grounded Language Learning 🔗](https://aclanthology.org/2024.emnlp-main.893.pdf)

Why Retrieve When You Can Create? Solving Compositional Generalization with Generated Demonstrations

Humans are masters of “compositional generalization.” If you know what “spinning” means, and you know what “pulling a red lever” means, you can instantly understand the command “pull the red lever while spinning,” even if you have never physically performed that specific combination of actions before. You don’t need to see a tutorial for every possible combination of words and actions; you understand the components and the rules for combining them. ...

9 min · 1756 words
[Generate-on-Graph: Treat LLM as both Agent and KG for Incomplete Knowledge Graph Question Answering 🔗](https://arxiv.org/abs/2404.14741)

Beyond Retrieval: How 'Generate-on-Graph' Solves the Missing Link in Knowledge Graph QA

Large Language Models (LLMs) like GPT-4 and Llama-3 have revolutionized how we interact with information. They are incredibly capable, but they have well-documented flaws: they hallucinate (make things up) and their knowledge is static—cut off at their training date. To fix this, the AI community has largely turned to Knowledge Graphs (KGs). By grounding an LLM’s responses in a structured database of facts (triples like Apple, Headquartered_In, Cupertino), we can theoretically get the best of both worlds: the reasoning of an LLM and the factual accuracy of a database. This field is known as Knowledge Graph Question Answering (KGQA). ...
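
The core move can be sketched as a fallback: consult the graph first, and only when the needed link is missing, ask the LLM to generate a candidate triple from its own parametric knowledge. The control flow and prompt below are a simplification of the paper's agent loop, not its implementation.

```python
# Generate-on-Graph in miniature: the LLM acts as an agent over the KG
# when facts exist, and as a stand-in KG when they are missing.
KG = {("Apple", "Headquartered_In"): "Cupertino"}  # deliberately incomplete

def answer(head: str, relation: str, call_llm) -> str:
    if (head, relation) in KG:
        # LLM as agent: the fact is stored, so retrieve it from the graph.
        return KG[(head, relation)]
    # LLM as KG: generate the missing link from parametric knowledge.
    return call_llm(f"Complete the fact: ({head}, {relation}, ?)")
```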

2024-04 · 8 min · 1693 words
[Generalizing Clinical De-identification Models by Privacy-safe Data Augmentation using GPT-4 🔗](https://aclanthology.org/2024.emnlp-main.1181.pdf)

Solving the Medical Data Bottleneck: Privacy-Safe Augmentation with GPT-4

In the era of big data, Electronic Health Records (EHRs) represent a treasure trove of information. They hold the keys to training AI models that can predict diseases, recommend treatments, and optimize hospital operations. However, this data is locked behind a massive ethical and legal gate: patient privacy. Regulations like HIPAA in the United States mandate that Protected Health Information (PHI)—names, dates, IDs, and locations—must be rigorously removed before data can be used for secondary research. ...
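
To make “de-identification” concrete, here is a toy scrubber that replaces PHI spans with placeholder tags. Real systems use trained sequence-labeling models rather than regexes, and these patterns are invented for illustration, not taken from the paper.

```python
# Toy PHI scrubber: dates, record IDs, and doctor names become tags.
import re

def deidentify(text: str) -> str:
    text = re.sub(r"\b\d{2}/\d{2}/\d{4}\b", "[DATE]", text)  # calendar dates
    text = re.sub(r"\bMRN[:#]?\s*\d+\b", "[ID]", text)       # record numbers
    text = re.sub(r"\bDr\. [A-Z][a-z]+\b", "[NAME]", text)   # clinician names
    return text

print(deidentify("Dr. Smith saw the patient on 03/14/2024, MRN: 88912."))
# -> "[NAME] saw the patient on [DATE], [ID]."
```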

8 min · 1664 words