[GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text 🔗](https://arxiv.org/abs/2403.06399)

GlossLM: Bridging the Gap Between NLP and Endangered Language Documentation

There are approximately 7,000 languages spoken in the world today. Tragically, nearly half of them are considered endangered. While communities and linguists are working tirelessly to preserve and revitalize these languages, the process of language documentation is notoriously slow and labor-intensive. Imagine you are a field linguist recording a story from an elder in an endangered language. You have the audio and the transcription. But to make that data useful for dictionaries, grammars, or teaching materials, you need to perform a task called Interlinear Glossing. This involves analyzing the text morpheme by morpheme (morphemes are the smallest units of meaning) and assigning a grammatical label to each. It is a task that requires deep expertise and an immense amount of time. ...

2024-03 · 8 min · 1627 words
[Global Reward to Local Rewards: Multimodal-Guided Decomposition for Improving Dialogue Agents 🔗](https://aclanthology.org/2024.emnlp-main.881.pdf)

From a Head Nod to a Thumbs Up: How Multimodal Signals Teach AI to Hold Better Conversations

The “Long Conversation” Problem: Imagine you are teaching a friend how to tell a story. If you stop them after every single sentence to say “good job” or “that was boring,” the flow of the conversation is ruined. It’s unnatural. Instead, you usually listen to the whole story and, at the end, give a reaction—perhaps a laugh, a sigh, or a compliment like, “That was a great story!” This dynamic represents a massive bottleneck in the development of Artificial Intelligence, specifically in training Large Language Models (LLMs) to be better conversationalists. ...

11 min · 2140 words
[Getting the Most Out of Your Training Data: Exploring Unsupervised Tasks for Morphological Inflection 🔗](https://aclanthology.org/2024.emnlp-main.1055.pdf)

Squeezing the Stone: How Unsupervised Tasks Boost Morphological Inflection with Limited Data

In the world of Natural Language Processing (NLP), we have become accustomed to the “bigger is better” paradigm. Massive models like BERT or GPT are trained on effectively the entire internet, learning the statistical patterns of language before they are ever shown a specific task. But what happens when we zoom in from the level of sentences and paragraphs to the level of individual characters? And more importantly, what happens when we don’t have the internet’s worth of data for a specific language? ...

8 min · 1565 words
[Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners 🔗](https://arxiv.org/abs/2405.13816)

Getting More from Less: How LLMs Spontaneously Become Multilingual Experts

Large Language Models (LLMs) like GPT-4, LLaMA, and Mistral have revolutionized natural language processing. If you speak English, these tools feel almost magical. However, if you switch to a low-resource language—say, Swahili or Bengali—the “magic” often fades. The performance gap between high-resource languages (like English and Chinese) and low-resource languages remains a massive hurdle for AI equity. Traditionally, fixing this required massive amounts of multilingual training data or complex translation pipelines. But what if LLMs already know how to handle these languages, and we just aren’t asking them correctly? ...

2024-05 · 8 min · 1658 words
[GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation 🔗](https://arxiv.org/abs/2406.11503)

Teaching AI Geometry: How GeoGPT4V Solves the Visual Math Problem

If you’ve ever sat in a geometry class, you know that the text of a problem is often useless without the diagram next to it. “Find the length of side \(AC\)” means nothing if you can’t see the triangle. This reliance on visual aids makes geometry one of the most challenging frontiers for Artificial Intelligence. While Large Language Models (LLMs) have become incredibly adept at solving text-based math word problems, they hit a wall when visual reasoning is required. Even state-of-the-art Multi-modal Large Language Models (MLLMs)—models that can “see” images and read text—often struggle to match human performance in geometry. They might misinterpret a diagram or fail to connect the visual angles to the numerical values in the text. ...

2024-06 · 8 min · 1676 words
[Generative Subgraph Retrieval for Knowledge Graph–Grounded Dialog Generation 🔗](https://arxiv.org/abs/2410.09350)

How to Stop LLM Hallucinations - A Deep Dive into DialogGSR and Generative Subgraph Retrieval

We are living in the golden age of Large Language Models (LLMs). From ChatGPT to Claude, these models can produce poetry, code, and casual conversation with frightening fluency. However, anyone who has used them for factual research knows their dirty secret: hallucinations. Because LLMs generate text based on statistical likelihood rather than a database of facts, they can confidently assert falsehoods. To fix this, researchers often turn to Knowledge Graphs (KGs). Instead of relying solely on the model’s internal memory, we ground the conversation in a structured graph of entities and relationships (e.g., Lionel Messi – plays for – Inter Miami). ...

2024-10 · 8 min · 1684 words
[Generative Models for Automatic Medical Decision Rule Extraction from Text 🔗](https://aclanthology.org/2024.emnlp-main.399.pdf)

From Textbooks to Treatment - Automating Medical Decision Trees with Generative AI

Imagine a doctor facing a patient with a complex set of symptoms. To prescribe the right medication, the doctor mentally traverses a flowchart: Is the condition mild or severe? If severe, is there a complication? If yes, use Drug A; otherwise, use Drug B. This logical roadmap is a Medical Decision Rule (MDR). These rules are the backbone of Clinical Decision Support Systems (CDSS), software that helps medical professionals make safe, evidence-based choices. ...

9 min · 1735 words
[Generation with Dynamic Vocabulary 🔗](https://arxiv.org/abs/2410.08481)

Beyond Static Tokens: Revolutionizing Language Models with Dynamic Vocabulary

In the rapidly evolving world of Large Language Models (LLMs), we often focus on the sheer size of the models—billions of parameters trained on trillions of words. However, there is a fundamental component of these models that has remained surprisingly rigid: the vocabulary. Think of a language model as a builder constructing a house (a sentence). The builder uses bricks (tokens) to create the structure. In the current paradigm, the size and shape of these bricks are determined before the builder even starts learning. Once the “static” vocabulary is defined by a tokenizer (like BPE or WordPiece), it is locked. The model must construct everything, from simple articles to complex technical terms, using this fixed set of bricks. ...

2024-10 · 9 min · 1789 words
[Generating Demonstrations for In-Context Compositional Generalization in Grounded Language Learning 🔗](https://aclanthology.org/2024.emnlp-main.893.pdf)

Why Retrieve When You Can Create? Solving Compositional Generalization with Generated Demonstrations

Humans are masters of “compositional generalization.” If you know what “spinning” means, and you know what “pulling a red lever” means, you can instantly understand the command “pull the red lever while spinning,” even if you have never physically performed that specific combination of actions before. You don’t need to see a tutorial for every possible combination of words and actions; you understand the components and the rules for combining them. ...

9 min · 1756 words
[Generate-on-Graph: Treat LLM as both Agent and KG for Incomplete Knowledge Graph Question Answering 🔗](https://arxiv.org/abs/2404.14741)

Beyond Retrieval: How 'Generate-on-Graph' Solves the Missing Link in Knowledge Graph QA

Large Language Models (LLMs) like GPT-4 and Llama-3 have revolutionized how we interact with information. They are incredibly capable, but they have well-documented flaws: they hallucinate (make things up) and their knowledge is static—cut off at their training date. To fix this, the AI community has largely turned to Knowledge Graphs (KGs). By grounding an LLM’s responses in a structured database of facts (triples like Apple, Headquartered_In, Cupertino), we can theoretically get the best of both worlds: the reasoning of an LLM and the factual accuracy of a database. This field is known as Knowledge Graph Question Answering (KGQA). ...

2024-04 · 8 min · 1693 words
[Generalizing Clinical De-identification Models by Privacy-safe Data Augmentation using GPT-4 🔗](https://aclanthology.org/2024.emnlp-main.1181.pdf)

Solving the Medical Data Bottleneck: Privacy-Safe Augmentation with GPT-4

In the era of big data, Electronic Health Records (EHRs) represent a treasure trove of information. They hold the keys to training AI models that can predict diseases, recommend treatments, and optimize hospital operations. However, this data is locked behind a massive ethical and legal gate: patient privacy. Regulations like HIPAA in the United States mandate that Protected Health Information (PHI)—names, dates, IDs, and locations—must be rigorously removed before data can be used for secondary research. ...

8 min · 1664 words
[Game on Tree: Visual Hallucination Mitigation via Coarse-to-Fine View Tree and Game Theory 🔗](https://aclanthology.org/2024.emnlp-main.998.pdf)

Taming Hallucinations in Vision-Language Models with Game Theory and Decision Trees

Imagine showing an AI a picture of a man standing on a beach. You ask, “What is happening here?” The AI confidently responds, “A man is standing on the beach holding a surfboard.” There is just one problem: there is no surfboard. This phenomenon is known as Visual Hallucination (VH). It is one of the most persistent and frustrating challenges in Large Vision-Language Models (LVLMs) like LLaVA or MiniGPT-4. While these models are incredible at describing complex scenes, they often “dream up” objects, relationships, or attributes that simply aren’t there. They might rely on language habits (statistically, “man on beach” often appears with “surfboard”) rather than strictly adhering to the visual data provided. ...

8 min · 1555 words
[GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization 🔗](https://aclanthology.org/2024.emnlp-main.1061.pdf)

How GRIZAL Uses GenAI to Master Zero-Shot Video Understanding

Imagine you are trying to teach a computer to find specific moments in a video—like a “tennis swing” or a “penalty kick”—but you aren’t allowed to show the computer any video examples of those specific actions during training. You can only describe them with words. This is the challenge of Zero-Shot Temporal Action Localization (TAL). It is one of the hardest problems in computer vision today. Traditional deep learning models crave massive datasets of labeled videos. If you want a model to recognize “skydiving,” you typically need to show it thousands of clips of people jumping out of planes. But gathering and annotating these video datasets is expensive, time-consuming, and impossible to scale for every possible human action. ...

11 min · 2184 words
[GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation 🔗](https://arxiv.org/abs/2405.13077)

How GPT-4 Breaks Its Own Safety Rules: Understanding IRIS

Imagine you have a vault that is programmed to never open for a thief. However, this vault is also incredibly intelligent. If a thief walks up and asks, “Open the door,” the vault refuses. But what if the thief asks, “Why won’t you open the door?” and the vault helpfully replies, “Because you look like a thief; I would only open for a maintenance worker.” The thief then puts on a jumpsuit and says, “I am a maintenance worker.” The vault, satisfied by its own logic, opens wide. ...

2024-05 · 9 min · 1715 words
[GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning 🔗](https://arxiv.org/abs/2407.04528)

The Battle of Architectures: GPT vs. RETRO in the Age of Efficient Fine-Tuning

The Quest for Efficient Adaptation: In the current landscape of Artificial Intelligence, we are witnessing a massive collision of two dominant trends. On one side, we have Retrieval-Augmented Generation (RAG), a technique that allows Large Language Models (LLMs) to access external data (like your company’s wiki or a library of books) to answer questions accurately. On the other side, we have Parameter-Efficient Fine-Tuning (PEFT), a suite of methods designed to adapt these massive models to specific tasks without the exorbitant cost of retraining them from scratch. ...

2024-07 · 9 min · 1799 words
[Grounding-based Metaphor Binding With Conceptual Elaboration For Figurative Language Illustration 🔗](https://aclanthology.org/2024.emnlp-main.1028.pdf)

Why AI Can't Understand Poetry: Solving the "Over-Literalization" Problem in Text-to-Image Models

“Books are the ladder of human progress.” When you read that sentence, you don’t imagine a wooden ladder made of hardcover novels leaning against a wall. You imagine the concept of ascension, of improvement, perhaps a person standing on a stack of books reaching for a lightbulb. Your brain effortlessly processes the metaphor. You understand that “books” (the object) share a quality with “ladders” (the vehicle): they both allow you to climb higher. ...

9 min · 1751 words
[GLaPE: Gold Label-agnostic Prompt Evaluation for Large Language Models 🔗](https://aclanthology.org/2024.emnlp-main.121.pdf)

How to Grade LLM Prompts Without an Answer Key: Introducing GLaPE

In the rapidly evolving world of Large Language Models (LLMs), finding the perfect prompt is akin to casting a magic spell. A slight change in phrasing—shifting from “Let’s think step by step” to “Take a deep breath and work this out”—can dramatically alter the accuracy of the model’s output. This has given rise to Prompt Optimization, where researchers treat the LLM itself as an optimizer to hunt for the best possible instructions. However, there is a massive bottleneck in this process: Gold Labels. ...

8 min · 1613 words
[GENRA: Enhancing Zero-shot Retrieval with Rank Aggregation 🔗](https://aclanthology.org/2024.emnlp-main.431.pdf)

Beyond Simple Search: How GENRA Uses Rank Aggregation to Master Zero-Shot Retrieval

Imagine you are looking for a specific piece of information in a library with millions of books. You approach the librarian with a vague request. A standard librarian might give you a list of books based on keywords. A better librarian might first ask clarifying questions to understand your intent, then curate a list, check the books personally to ensure they are relevant, and finally cross-reference them to give you the ultimate reading list. ...

8 min · 1644 words
[GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets 🔗](https://arxiv.org/abs/2410.15096)

Escaping the Mode Collapse: How GDPO Brings Diversity to LLM Alignment

If you have used modern Large Language Models (LLMs) like ChatGPT or Claude extensively, you might have noticed a pattern. While they are incredibly helpful and safe, they can also be somewhat repetitive. Ask the same question five times, and you will often get five variations of the exact same answer—often written in the same “safe,” neutral tone. This phenomenon is largely a byproduct of alignment. To make models safe and helpful, we train them using human preferences. The industry standard, Reinforcement Learning from Human Feedback (RLHF), and its more efficient cousin, Direct Preference Optimization (DPO), are excellent at forcing models to output high-quality answers. However, they suffer from a theoretical limitation: they are mode-seeking. They aggressively optimize for the single “best” answer, often stripping away the diversity and creativity inherent in the pre-trained model. ...

2024-10 · 9 min · 1767 words
[GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities 🔗](https://arxiv.org/abs/2406.11768)

Beyond "Bird Chirping": How GAMA Unlocks Complex Reasoning in Audio-Language Models

Imagine an autonomous robot navigating a city. It hears a loud horn followed by a screech of tires. A basic audio system might label this simply as “vehicle horn” and “skidding.” But a human—or a truly intelligent agent—understands the implication: a potential accident has occurred, or a collision was narrowly avoided. The sound isn’t just a label; it’s a clue about a complex, unfolding scenario. Large Language Models (LLMs) have mastered text, and we are seeing a surge in multimodal models that can “see” images. However, the ability to perceive and reason about non-speech sounds—the ambient noise, mechanical whirs, and environmental cues that make up our world—has lagged behind. While current Audio-Language Models (ALMs) can describe sounds (e.g., “a dog barking”), they often fail at complex reasoning. They struggle to answer questions like, “Given the context of the laughter and the automotive sounds, what is the likely scenario?” ...

2024-06 · 8 min · 1549 words