[Game on Tree: Visual Hallucination Mitigation via Coarse-to-Fine View Tree and Game Theory 🔗](https://aclanthology.org/2024.emnlp-main.998.pdf)

Taming Hallucinations in Vision-Language Models with Game Theory and Decision Trees

Imagine showing an AI a picture of a man standing on a beach. You ask, “What is happening here?” The AI confidently responds, “A man is standing on the beach holding a surfboard.” There is just one problem: there is no surfboard. This phenomenon is known as Visual Hallucination (VH). It is one of the most persistent and frustrating challenges in Large Vision-Language Models (LVLMs) like LLaVA or MiniGPT-4. While these models are incredible at describing complex scenes, they often “dream up” objects, relationships, or attributes that simply aren’t there. They might rely on language habits (statistically, “man on beach” often appears with “surfboard”) rather than strictly adhering to the visual data provided. ...

8 min · 1555 words
[GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization 🔗](https://aclanthology.org/2024.emnlp-main.1061.pdf)

How GRIZAL Uses GenAI to Master Zero-Shot Video Understanding

Imagine you are trying to teach a computer to find specific moments in a video—like a “tennis swing” or a “penalty kick”—but you aren’t allowed to show the computer any video examples of those specific actions during training. You can only describe them with words. This is the challenge of Zero-Shot Temporal Action Localization (TAL). It is one of the hardest problems in computer vision today. Traditional deep learning models crave massive datasets of labeled videos. If you want a model to recognize “skydiving,” you typically need to show it thousands of clips of people jumping out of planes. But gathering and annotating these video datasets is expensive, time-consuming, and impossible to scale for every possible human action. ...

11 min · 2184 words
[GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation 🔗](https://arxiv.org/abs/2405.13077)

How GPT-4 Breaks Its Own Safety Rules: Understanding IRIS

Imagine you have a vault that is programmed to never open for a thief. However, this vault is also incredibly intelligent. If a thief walks up and asks, “Open the door,” the vault refuses. But what if the thief asks, “Why won’t you open the door?” and the vault helpfully replies, “Because you look like a thief; I would only open for a maintenance worker.” The thief then puts on a jumpsuit and says, “I am a maintenance worker.” The vault, satisfied by its own logic, opens wide. ...

2024-05 · 9 min · 1715 words
[GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning 🔗](https://arxiv.org/abs/2407.04528)

The Battle of Architectures: GPT vs. RETRO in the Age of Efficient Fine-Tuning

In the current landscape of Artificial Intelligence, we are witnessing a massive collision of two dominant trends. On one side, we have Retrieval-Augmented Generation (RAG), a technique that allows Large Language Models (LLMs) to access external data (like your company’s wiki or a library of books) to answer questions accurately. On the other side, we have Parameter-Efficient Fine-Tuning (PEFT), a suite of methods designed to adapt these massive models to specific tasks without the exorbitant cost of retraining them from scratch. ...

2024-07 · 9 min · 1799 words
[Grounding-based Metaphor Binding With Conceptual Elaboration For Figurative Language Illustration 🔗](https://aclanthology.org/2024.emnlp-main.1028.pdf)

Why AI Can't Understand Poetry: Solving the "Over-Literalization" Problem in Text-to-Image Models

“Books are the ladder of human progress.” When you read that sentence, you don’t imagine a wooden ladder made of hardcover novels leaning against a wall. You imagine the concept of ascension, of improvement, perhaps a person standing on a stack of books reaching for a lightbulb. Your brain effortlessly processes the metaphor. You understand that “books” (the object) share a quality with “ladders” (the vehicle): they both allow you to climb higher. ...

9 min · 1751 words
[GLaPE: Gold Label-agnostic Prompt Evaluation for Large Language Models 🔗](https://aclanthology.org/2024.emnlp-main.121.pdf)

How to Grade LLM Prompts Without an Answer Key: Introducing GLaPE

In the rapidly evolving world of Large Language Models (LLMs), finding the perfect prompt is akin to casting a magic spell. A slight change in phrasing—shifting from “Let’s think step by step” to “Take a deep breath and work this out”—can dramatically alter the accuracy of the model’s output. This has given rise to Prompt Optimization, where researchers treat the LLM itself as an optimizer to hunt for the best possible instructions. However, there is a massive bottleneck in this process: Gold Labels. ...

8 min · 1613 words
[GENRA: Enhancing Zero-shot Retrieval with Rank Aggregation 🔗](https://aclanthology.org/2024.emnlp-main.431.pdf)

Beyond Simple Search: How GENRA Uses Rank Aggregation to Master Zero-Shot Retrieval

Imagine you are looking for a specific piece of information in a library with millions of books. You approach the librarian with a vague request. A standard librarian might give you a list of books based on keywords. A better librarian might first ask clarifying questions to understand your intent, then curate a list, check the books personally to ensure they are relevant, and finally cross-reference them to give you the ultimate reading list. ...

8 min · 1644 words
[GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets 🔗](https://arxiv.org/abs/2410.15096)

Escaping the Mode Collapse: How GDPO Brings Diversity to LLM Alignment

If you have used modern Large Language Models (LLMs) like ChatGPT or Claude extensively, you might have noticed a pattern. While they are incredibly helpful and safe, they can also be somewhat repetitive. Ask the same question five times, and you will often get five variations of the exact same answer—often written in the same “safe,” neutral tone. This phenomenon is largely a byproduct of alignment. To make models safe and helpful, we train them using human preferences. The industry-standard approaches, Reinforcement Learning from Human Feedback (RLHF) and its more efficient cousin, Direct Preference Optimization (DPO), are excellent at forcing models to output high-quality answers. However, they suffer from a theoretical limitation: they are mode-seeking. They aggressively optimize for the single “best” answer, often stripping away the diversity and creativity inherent in the pre-trained model. ...
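For context on the “mode-seeking” claim, the standard DPO objective (as defined in the original DPO paper, restated here rather than quoted from the post) trains the policy $\pi_\theta$ to raise the likelihood of the preferred response $y_w$ over the rejected $y_l$ relative to a frozen reference model $\pi_{\text{ref}}$:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Because this is equivalent to maximizing a reward under a reverse-KL constraint to the reference model, it tends to concentrate probability mass on a few high-reward answers; that concentration is the diversity loss GDPO targets.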

2024-10 · 9 min · 1767 words
[GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities 🔗](https://arxiv.org/abs/2406.11768)

Beyond "Bird Chirping": How GAMA Unlocks Complex Reasoning in Audio-Language Models

Imagine an autonomous robot navigating a city. It hears a loud horn followed by a screech of tires. A basic audio system might label this simply as “vehicle horn” and “skidding.” But a human—or a truly intelligent agent—understands the implication: a potential accident has occurred, or a collision was narrowly avoided. The sound isn’t just a label; it’s a clue about a complex, unfolding scenario. Large Language Models (LLMs) have mastered text, and we are seeing a surge in multimodal models that can “see” images. However, the ability to perceive and reason about non-speech sounds—the ambient noise, mechanical whirs, and environmental cues that make up our world—has lagged behind. While current Audio-Language Models (ALMs) can describe sounds (e.g., “a dog barking”), they often fail at complex reasoning. They struggle to answer questions like, “Given the context of the laughter and the automotive sounds, what is the likely scenario?” ...

2024-06 · 8 min · 1549 words
[FuseGen: PLM Fusion for Data-generation based Zero-shot Learning 🔗](https://arxiv.org/abs/2406.12527)

FuseGen: How Collaborative AI Agents Generate Superior Training Data

In the current landscape of Artificial Intelligence, we are witnessing a “David and Goliath” dynamic. On one side, we have the “Goliaths”—massive Pre-trained Language Models (PLMs) like GPT-4, Llama-2, and Claude. These models are incredibly capable but computationally expensive, slow, and difficult to deploy on edge devices or in privacy-sensitive environments. On the other side, we have the “Davids”—Small Task-specific Models (STMs). These are compact, efficient models (like BERT) that can run on a smartphone or a private server. The problem? Davids need training data—lots of it—to be effective. In many real-world scenarios, high-quality labeled data is scarce or non-existent. ...

2024-06 · 9 min · 1882 words
[Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion 🔗](https://arxiv.org/abs/2311.07682)

Can We "Average Out" AI Bias? How Fusing Models Helps Them Forget the Wrong Things

In the fast-paced world of Natural Language Processing (NLP), we usually obsess over what models learn. We want them to learn syntax, reasoning, coding, and facts about the world. But anyone who has played with a Large Language Model (LLM) knows that they often learn things we don’t want them to. They pick up social biases from the internet, they memorize sensitive training data (like phone numbers), and they learn “shortcuts”—lazy heuristics to solve problems without actually understanding them. ...

2023-11 · 9 min · 1911 words
[From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis 🔗](https://arxiv.org/abs/2406.19934)

Breaking the Vision Barrier: How Plug-and-Play Visual Reasoners Unlock Multi-Step Logic

Imagine showing a computer a photo of a messy kitchen and asking, “What year is displayed on the calendar attached to the refrigerator?” For a human, this is a multi-step process. First, you scan the room to find the refrigerator. Second, you look for the calendar on it. Third, you zoom in to read the text. Finally, you deduce the year based on the visible month and days. For a standard Vision-Language Model (VLM), however, this is a chaotic mess of pixels. Most current VLMs try to solve this in a single “glance,” often resulting in confident but incorrect hallucinations. They lack the ability to break a complex problem down into a logical chain of visual steps. ...

2024-06 · 9 min · 1829 words
[From RAG to RICHES: Retrieval Interlaced with Sequence Generation 🔗](https://arxiv.org/abs/2407.00361)

The End of the Retriever? How RICHES Fuses Search and Generation into One Model

The current standard for making Large Language Models (LLMs) factual is Retrieval Augmented Generation, or RAG. The premise is simple: before the LLM answers a question, a separate “retriever” system scans a database, finds relevant documents, and pastes them into the LLM’s context window. It works, but it is architecturally clunky. You have two distinct models—a dense retriever (like a dual-encoder) and a generator (the LLM)—that often don’t speak the same language. They have to be trained or tuned separately, and the pipeline requires handing off data between systems. ...
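To make the “architecturally clunky” baseline concrete, here is a minimal sketch of that two-component pipeline, with `embed` and `generate` as hypothetical stand-ins for the dense retriever and the generator (illustrative only, not the RICHES method):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a dense dual-encoder; a real retriever would use a trained model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def generate(prompt: str) -> str:
    """Stand-in for the separate generator LLM that receives the retrieved context."""
    return f"[answer conditioned on a {len(prompt)}-character prompt]"

def rag_answer(question: str, corpus: list[str], k: int = 2) -> str:
    # Component 1: the retriever ranks documents by embedding similarity.
    q = embed(question)
    top_docs = sorted(corpus, key=lambda d: float(q @ embed(d)), reverse=True)[:k]
    # Hand-off: retrieved passages are pasted into the generator's context window.
    prompt = "Context:\n" + "\n".join(top_docs) + f"\n\nQuestion: {question}\nAnswer:"
    # Component 2: the generator produces the final answer.
    return generate(prompt)

print(rag_answer("Who wrote Hamlet?", ["Shakespeare wrote Hamlet.", "Paris is in France."]))
```

Per the post’s framing, RICHES collapses this split so that retrieval and generation happen within a single model’s decoding; the sketch above is only the conventional baseline it replaces.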

2024-07 · 9 min · 1826 words
[From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models 🔗](https://arxiv.org/abs/2407.00263)

Is Your AI Culturally Blind? Inside GLOBALRG: A Benchmark for Multicultural Understanding in Vision-Language Models

In the last few years, Vision-Language Models (VLMs) like CLIP, BLIP-2, and GPT-4V have revolutionized how computers understand the world. They can caption photos, answer questions about visual scenes, and generate art from text. We often attribute their success to the massive scale of their training data—billions of image-text pairs scraped from the internet. But there is a hidden cost to this scale. The internet is not a perfect mirror of the world; it is heavily skewed toward Western cultures, particularly North America and Europe. ...

2024-07 · 8 min · 1626 words
[From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking 🔗](https://arxiv.org/abs/2406.14859)

Breaking the Guardrails: A Deep Dive into Multimodal Jailbreaking

The rise of Large Language Models (LLMs) like GPT-4 and Llama has revolutionized how we interact with technology. We use them for coding, writing, and analysis. However, as these models have grown in capability, so too has the cat-and-mouse game of security. Users and researchers alike have discovered ways to bypass the ethical safeguards hard-coded into these systems—a process known as jailbreaking. Initially, jailbreaking was a text-based challenge. Attackers would craft clever prompts to trick a model into generating hate speech, bomb-making instructions, or other prohibited content. But the landscape is shifting. We are now entering the era of Multimodal Large Language Models (MLLMs)—systems that can see, hear, and understand images alongside text. ...

2024-06 · 8 min · 1567 words
[From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP 🔗](https://arxiv.org/abs/2406.12618)

Is Interpretability Research Actually Useful? Quantifying the Impact of 'Why' in NLP

The current era of Natural Language Processing (NLP) is defined by a massive paradox. We have built models—Large Language Models (LLMs)—that possess capabilities we could barely imagine a decade ago. They write code, compose poetry, and reason through complex problems. Yet, for the most part, we have very little idea how they actually work. They are black boxes. This creates a tension in the field. On one side, you have the “builders” pushing for higher benchmarks and efficiency. On the other, you have the “analysts”—researchers in Interpretability and Analysis (IA)—who are trying to peer inside the black box to understand the mechanisms, limitations, and behaviors of these models. ...

2024-06 · 9 min · 1729 words
[From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment 🔗](https://arxiv.org/abs/2406.13912)

The Hidden Cost of Detail: How Richer Image Captions Amplify Bias and Hallucination

In the rapidly evolving world of Computer Vision, we often equate “more” with “better.” More data, more parameters, and—recently—more words. For years, image captioning models were trained on datasets like COCO, where a caption might be as simple as: “A dog sitting on a chair.” It’s accurate, but dry. With the rise of Large Language Models (LLMs) and Multimodal Models (like GPT-4V), researchers found a new trick: Generative Caption Enrichment (GCE). Instead of using short, human-written captions, we can ask an LLM to generate detailed, paragraph-long descriptions. ...

2024-06 · 8 min · 1495 words
[From Bottom to Top: Extending the Potential of Parameter Efficient Fine-Tuning 🔗](https://aclanthology.org/2024.emnlp-main.204.pdf)

Can We Ignore Half the Network? A New Approach to Efficient LLM Fine-Tuning

We are living in the golden age of Large Language Models (LLMs). From LLaMA to GPT-J, these models have demonstrated incredible generative capabilities. However, there is a massive catch: size. With parameter counts soaring into the billions, fine-tuning these behemoths for specific downstream tasks—like mathematical reasoning or specialized Q&A—requires computational resources that are out of reach for many researchers and students. To solve this, the community turned to Parameter Efficient Fine-Tuning (PEFT). Methods like LoRA (Low-Rank Adaptation) and Prefix Tuning freeze the massive pre-trained model and only train a tiny sliver of new parameters. These techniques have been game-changers. ...
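The low-rank trick behind LoRA, which the excerpt mentions, fits in a few lines. Below is a minimal PyTorch sketch (my own illustration of the general idea, not code from the paper): the pre-trained weight is frozen and only a small rank-r correction is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # trainable, starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen path + scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")  # ~65K of ~16.8M
```

Only `A` and `B` receive gradients; across a billion-parameter model the trainable fraction typically stays well below one percent, which is what makes these methods practical on modest hardware.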

8 min · 1667 words
[Free your mouse! Command Large Language Models to Generate Code to Format Word Documents 🔗](https://aclanthology.org/2024.emnlp-main.902.pdf)

Free Your Mouse: Automating Word Document Formatting with LLMs

We have all been there. You are finishing up a crucial essay, a business proposal, or a complex report in Microsoft Word. The content is golden, but the formatting is a mess. The font sizes are inconsistent, the indentation is slightly off, and for some reason, the third paragraph is in a different shade of black than the rest. You spend the next hour clicking through menus, dragging rulers, and hunting for the “remove spacing after paragraph” button. It is tedious, repetitive, and kills your creative flow. ...

8 min · 1670 words
[Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation 🔗](https://arxiv.org/abs/2407.10817)

Building a Better Critic: How FLAMe Tames LLMs for Automated Evaluation

In the rapidly evolving world of Artificial Intelligence, we have reached a point where generating text is easy. We have models that can write poetry, code in Python, and summarize legal documents in seconds. However, we have hit a new, arguably more difficult bottleneck: Evaluation. How do we know if the text the model generated is actually good? Traditionally, the gold standard for evaluation has been human judgment. You show a human two summaries and ask, “Which one is more accurate?” But as Large Language Models (LLMs) scale, human evaluation becomes prohibitively expensive, slow, and sometimes subjective. This has led to the rise of the “LLM-as-a-Judge” paradigm, where powerful models like GPT-4 are used to grade the work of smaller models. ...
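A minimal illustration of the pairwise “LLM-as-a-Judge” setup the excerpt describes (a generic sketch with a stubbed judge call, not FLAMe itself):

```python
import random

def call_judge_model(prompt: str) -> str:
    """Stand-in for a powerful judge LLM; a real system would call a model API here."""
    return random.choice(["A", "B"])

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    # The judge sees the task and both candidate outputs, then names the better one.
    prompt = (
        "You are grading two responses to the same instruction.\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is more accurate and helpful? Answer with 'A' or 'B'."
    )
    return call_judge_model(prompt)

verdict = judge_pair(
    "Summarize the article in one sentence.",
    "The study finds that caffeine improves recall in adults.",
    "The study is about various things related to memory.",
)
print(f"Judge prefers response {verdict}")
```

The stub only shows the interface such a judge fills; FLAMe’s focus, per the title, is building dedicated autoraters to occupy that role more reliably.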

2024-07 · 8 min · 1672 words