[Grasping the Essentials: Tailoring Large Language Models for Zero-Shot Relation Extraction 🔗](https://arxiv.org/abs/2402.11142)

Can AI Learn Relationships Just from Definitions? Inside the REPAL Framework

In the world of Natural Language Processing (NLP), teaching machines to read text is one thing; teaching them to understand the connections between entities is entirely another. This task is known as Relation Extraction (RE). Imagine you are building a system to analyze news articles. You don’t just want the computer to recognize the words “Steve Jobs” and “Apple.” You want it to extract the specific relationship: FounderOf. Traditionally, this requires training models on massive, human-labeled datasets where thousands of sentences are tagged with specific relationships. But what happens when you need to find a new type of relationship that you haven’t labeled yet? Collecting new data is expensive and slow. ...

2024-02 · 8 min · 1648 words
[Granular Privacy Control for Geolocation with Vision Language Models 🔗](https://arxiv.org/abs/2407.04952)

When AI Knows Where You Are—Controlling Geolocation Privacy in Vision Language Models

Introduction Imagine you upload a photo of your lunch to social media. You want your friends to know you are enjoying a trip to Paris, but you definitely don’t want strangers figuring out the exact street corner you are standing on, let alone the specific restaurant that might reveal your hotel’s location. For years, privacy on the internet was often treated as binary: either you share something, or you don’t. However, with the rise of powerful Vision Language Models (VLMs) like GPT-4V, the line has blurred. These models possess an uncanny ability to analyze visual data—identifying landmarks, reading blurry text on background signs, and recognizing architectural styles—to pinpoint locations with frightening accuracy. ...

2024-07 · 9 min · 1858 words
[GOLD COIN: Grounding Large Language Models in Privacy Laws via Contextual Integrity Theory 🔗](https://arxiv.org/abs/2406.11149)

Teaching AI to Judge Privacy: How Contextual Integrity Grounds LLMs in Law

Privacy is rarely black and white. Consider a simple piece of information: a blood test result. If a doctor sends this result to a specialist for a second opinion, it is standard medical practice. However, if that same doctor sends the same result to a marketing firm, it is a severe privacy violation. The data (the blood test) didn’t change. The sender (the doctor) didn’t change. What changed was the context—specifically the recipient and the purpose of the transfer. ...

2024-06 · 8 min · 1629 words
[Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs 🔗](https://arxiv.org/abs/2410.01188)

Gold Panning in the Tokenizer: An Adaptive Approach to Domain-Specific LLMs

Introduction Large Language Models (LLMs) like GPT-4 and LLaMA are generalists. They can write a poem, solve a math problem, or summarize a history lesson with reasonable competence. However, when you drop these generalist models into a highly specialized environment—such as a law firm or a hospital—they often stumble. They lack the specific jargon and deep domain knowledge required to generate precise legal contracts or medical diagnoses. To bridge this gap, researchers typically turn to Supervised Fine-Tuning (SFT) on domain-specific data. But there is an often-overlooked bottleneck: the vocabulary. General models use a vocabulary optimized for general text. When they encounter specialized terms like “hemorrhoids” or complex legal statutes, they often break them down into inefficient, fragmented sub-tokens. ...

2024-10 · 9 min · 1715 words
[GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text 🔗](https://arxiv.org/abs/2403.06399)

GlossLM: Bridging the Gap Between NLP and Endangered Language Documentation

There are approximately 7,000 languages spoken in the world today. Tragically, nearly half of them are considered endangered. While communities and linguists are working tirelessly to preserve and revitalize these languages, the process of language documentation is notoriously slow and labor-intensive. Imagine you are a field linguist recording a story from an elder in an endangered language. You have the audio and the transcription. But to make that data useful for dictionaries, grammars, or teaching materials, you need to perform a task called Interlinear Glossing. This involves analyzing the text morpheme-by-morpheme (the smallest units of meaning) and assigning grammatical labels to them. It is a task that requires deep expertise and immense time. ...

2024-03 · 8 min · 1627 words
[Global Reward to Local Rewards: Multimodal-Guided Decomposition for Improving Dialogue Agents 🔗](https://aclanthology.org/2024.emnlp-main.881.pdf)

From a Head Nod to a Thumbs Up: How Multimodal Signals Teach AI to Hold Better Conversations

Introduction: The “Long Conversation” Problem Imagine you are teaching a friend how to tell a story. If you stop them after every single sentence to say “good job” or “that was boring,” the flow of the conversation is ruined. It’s unnatural. Instead, you usually listen to the whole story and, at the end, give a reaction—perhaps a laugh, a sigh, or a compliment like, “That was a great story!” This dynamic represents a massive bottleneck in the development of Artificial Intelligence, specifically in training Large Language Models (LLMs) to be better conversationalists. ...

11 min · 2140 words
[Getting the Most Out of Your Training Data: Exploring Unsupervised Tasks for Morphological Inflection 🔗](https://aclanthology.org/2024.emnlp-main.1055.pdf)

Squeezing the Stone: How Unsupervised Tasks Boost Morphological Inflection with Limited Data

Introduction In the world of Natural Language Processing (NLP), we have become accustomed to the “bigger is better” paradigm. Massive models like BERT or GPT are trained on effectively the entire internet, learning the statistical patterns of language before they are ever shown a specific task. But what happens when we zoom in from the level of sentences and paragraphs to the level of individual characters? And more importantly, what happens when we don’t have the internet’s worth of data for a specific language? ...

8 min · 1565 words
[Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners 🔗](https://arxiv.org/abs/2405.13816)

Getting More from Less: How LLMs Spontaneously Become Multilingual Experts

Large Language Models (LLMs) like GPT-4, LLaMA, and Mistral have revolutionized natural language processing. If you speak English, these tools feel almost magical. However, if you switch to a low-resource language—say, Swahili or Bengali—the “magic” often fades. The performance gap between high-resource languages (like English and Chinese) and low-resource languages remains a massive hurdle in AI equity. Traditionally, fixing this required massive amounts of multilingual training data or complex translation pipelines. But what if LLMs already know how to handle these languages, and we just aren’t asking them correctly? ...

2024-05 · 8 min · 1658 words
[GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation 🔗](https://arxiv.org/abs/2406.11503)

Teaching AI Geometry: How GeoGPT4V Solves the Visual Math Problem

Introduction If you’ve ever sat in a geometry class, you know that the text of a problem is often useless without the diagram next to it. “Find the length of side \(AC\)” means nothing if you can’t see the triangle. This reliance on visual aids makes geometry one of the most challenging frontiers for Artificial Intelligence. While Large Language Models (LLMs) have become incredibly adept at solving text-based math word problems, they hit a wall when visual reasoning is required. Even state-of-the-art Multi-modal Large Language Models (MLLMs)—models that can “see” images and read text—often struggle to match human performance in geometry. They might misinterpret a diagram or fail to connect the visual angles to the numerical values in the text. ...

2024-06 · 8 min · 1676 words
[Generative Subgraph Retrieval for Knowledge Graph–Grounded Dialog Generation 🔗](https://arxiv.org/abs/2410.09350)

How to Stop LLM Hallucinations - A Deep Dive into DialogGSR and Generative Subgraph Retrieval

Introduction We are living in the golden age of Large Language Models (LLMs). From ChatGPT to Claude, these models can produce poetry, code, and casual conversation with frightening fluency. However, anyone who has used them for factual research knows their dirty secret: hallucinations. Because LLMs generate text based on statistical likelihood rather than a database of facts, they can confidently assert falsehoods. To fix this, researchers often turn to Knowledge Graphs (KGs). Instead of relying solely on the model’s internal memory, we ground the conversation in a structured graph of entities and relationships (e.g., Lionel Messi – plays for – Inter Miami). ...

2024-10 · 8 min · 1684 words
[Generative Models for Automatic Medical Decision Rule Extraction from Text 🔗](https://aclanthology.org/2024.emnlp-main.399.pdf)

From Textbooks to Treatment - Automating Medical Decision Trees with Generative AI

Imagine a doctor facing a patient with a complex set of symptoms. To prescribe the right medication, the doctor mentally traverses a flowchart: Is the condition mild or severe? If severe, is there a complication? If yes, use Drug A; otherwise, use Drug B. This logical roadmap is a Medical Decision Rule (MDR). These rules are the backbone of Clinical Decision Support Systems (CDSS), software that helps medical professionals make safe, evidence-based choices. ...

9 min · 1735 words
[Generation with Dynamic Vocabulary 🔗](https://arxiv.org/abs/2410.08481)

Beyond Static Tokens: Revolutionizing Language Models with Dynamic Vocabulary

Introduction In the rapidly evolving world of Large Language Models (LLMs), we often focus on the sheer size of the models—billions of parameters trained on trillions of words. However, there is a fundamental component of these models that has remained surprisingly rigid: the vocabulary. Think of a language model as a builder constructing a house (a sentence). The builder uses bricks (tokens) to create the structure. In the current paradigm, the size and shape of these bricks are determined before the builder even starts learning. Once the “static” vocabulary is defined by a tokenizer (like BPE or WordPiece), it is locked. The model must construct everything, from simple articles to complex technical terms, using this fixed set of bricks. ...

2024-10 · 9 min · 1789 words
[Generating Demonstrations for In-Context Compositional Generalization in Grounded Language Learning 🔗](https://aclanthology.org/2024.emnlp-main.893.pdf)

Why Retrieve When You Can Create? Solving Compositional Generalization with Generated Demonstrations

Humans are masters of “compositional generalization.” If you know what “spinning” means, and you know what “pulling a red lever” means, you can instantly understand the command “pull the red lever while spinning,” even if you have never physically performed that specific combination of actions before. You don’t need to see a tutorial for every possible combination of words and actions; you understand the components and the rules for combining them. ...

9 min · 1756 words
[Generate-on-Graph: Treat LLM as both Agent and KG for Incomplete Knowledge Graph Question Answering 🔗](https://arxiv.org/abs/2404.14741)

Beyond Retrieval: How 'Generate-on-Graph' Solves the Missing Link in Knowledge Graph QA

Large Language Models (LLMs) like GPT-4 and Llama-3 have revolutionized how we interact with information. They are incredibly capable, but they have well-documented flaws: they hallucinate (make things up) and their knowledge is static—cut off at their training date. To fix this, the AI community has largely turned to Knowledge Graphs (KGs). By grounding an LLM’s responses in a structured database of facts (triples like Apple, Headquartered_In, Cupertino), we can theoretically get the best of both worlds: the reasoning of an LLM and the factual accuracy of a database. This field is known as Knowledge Graph Question Answering (KGQA). ...

2024-04 · 8 min · 1693 words
[Generalizing Clinical De-identification Models by Privacy-safe Data Augmentation using GPT-4 🔗](https://aclanthology.org/2024.emnlp-main.1181.pdf)

Solving the Medical Data Bottleneck: Privacy-Safe Augmentation with GPT-4

Introduction In the era of big data, Electronic Health Records (EHRs) represent a treasure trove of information. They hold the keys to training AI models that can predict diseases, recommend treatments, and optimize hospital operations. However, this data is locked behind a massive ethical and legal gate: patient privacy. Regulations like HIPAA in the United States mandate that Protected Health Information (PHI)—names, dates, IDs, and locations—must be rigorously removed before data can be used for secondary research. ...

8 min · 1664 words
[Game on Tree: Visual Hallucination Mitigation via Coarse-to-Fine View Tree and Game Theory 🔗](https://aclanthology.org/2024.emnlp-main.998.pdf)

Taming Hallucinations in Vision-Language Models with Game Theory and Decision Trees

Introduction Imagine showing an AI a picture of a man standing on a beach. You ask, “What is happening here?” The AI confidently responds, “A man is standing on the beach holding a surfboard.” There is just one problem: there is no surfboard. This phenomenon is known as Visual Hallucination (VH). It is one of the most persistent and frustrating challenges in Large Vision-Language Models (LVLMs) like LLaVA or MiniGPT-4. While these models are incredible at describing complex scenes, they often “dream up” objects, relationships, or attributes that simply aren’t there. They might rely on language habits (statistically, “man on beach” often appears with “surfboard”) rather than strictly adhering to the visual data provided. ...

8 min · 1555 words
[GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization 🔗](https://aclanthology.org/2024.emnlp-main.1061.pdf)

How GRIZAL Uses GenAI to Master Zero-Shot Video Understanding

Imagine you are trying to teach a computer to find specific moments in a video—like a “tennis swing” or a “penalty kick”—but you aren’t allowed to show the computer any video examples of those specific actions during training. You can only describe them with words. This is the challenge of Zero-Shot Temporal Action Localization (TAL). It is one of the hardest problems in computer vision today. Traditional deep learning models crave massive datasets of labeled videos. If you want a model to recognize “skydiving,” you typically need to show it thousands of clips of people jumping out of planes. But gathering and annotating these video datasets is expensive, time-consuming, and impossible to scale for every possible human action. ...

11 min · 2184 words
[GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation 🔗](https://arxiv.org/abs/2405.13077)

How GPT-4 Breaks Its Own Safety Rules: Understanding IRIS

Introduction Imagine you have a vault that is programmed to never open for a thief. However, this vault is also incredibly intelligent. If a thief walks up and asks, “Open the door,” the vault refuses. But what if the thief asks, “Why won’t you open the door?” and the vault helpfully replies, “Because you look like a thief; I would only open for a maintenance worker.” The thief then puts on a jumpsuit and says, “I am a maintenance worker.” The vault, satisfied by its own logic, opens wide. ...

2024-05 · 9 min · 1715 words
[GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning 🔗](https://arxiv.org/abs/2407.04528)

The Battle of Architectures: GPT vs. RETRO in the Age of Efficient Fine-Tuning

Introduction: The Quest for Efficient Adaptation In the current landscape of Artificial Intelligence, we are witnessing a massive collision of two dominant trends. On one side, we have Retrieval-Augmented Generation (RAG), a technique that allows Large Language Models (LLMs) to access external data (like your company’s wiki or a library of books) to answer questions accurately. On the other side, we have Parameter-Efficient Fine-Tuning (PEFT), a suite of methods designed to adapt these massive models to specific tasks without the exorbitant cost of retraining them from scratch. ...

2024-07 · 9 min · 1799 words
[Grounding-based Metaphor Binding With Conceptual Elaboration For Figurative Language Illustration 🔗](https://aclanthology.org/2024.emnlp-main.1028.pdf)

Why AI Can't Understand Poetry: Solving the "Over-Literalization" Problem in Text-to-Image Models

Introduction “Books are the ladder of human progress.” When you read that sentence, you don’t imagine a wooden ladder made of hardcover novels leaning against a wall. You imagine the concept of ascension, of improvement, perhaps a person standing on a stack of books reaching for a lightbulb. Your brain effortlessly processes the metaphor. You understand that “books” (the object) share a quality with “ladders” (the vehicle): they both allow you to climb higher. ...

9 min · 1751 words