[GLaPE: Gold Label-agnostic Prompt Evaluation for Large Language Models 🔗](https://aclanthology.org/2024.emnlp-main.121.pdf)

How to Grade LLM Prompts Without an Answer Key: Introducing GLaPE

In the rapidly evolving world of Large Language Models (LLMs), finding the perfect prompt is akin to casting a magic spell. A slight change in phrasing—shifting from “Let’s think step by step” to “Take a deep breath and work this out”—can dramatically alter the accuracy of the model’s output. This has given rise to Prompt Optimization, where researchers treat the LLM itself as an optimizer to hunt for the best possible instructions. However, there is a massive bottleneck in this process: Gold Labels. ...

8 min · 1613 words
[GENRA: Enhancing Zero-shot Retrieval with Rank Aggregation 🔗](https://aclanthology.org/2024.emnlp-main.431.pdf)

Beyond Simple Search: How GENRA Uses Rank Aggregation to Master Zero-Shot Retrieval

Introduction: Imagine you are looking for a specific piece of information in a library with millions of books. You approach the librarian with a vague request. A standard librarian might give you a list of books based on keywords. A better librarian might first ask clarifying questions to understand your intent, then curate a list, check the books personally to ensure they are relevant, and finally cross-reference them to give you the ultimate reading list. ...

8 min · 1644 words
[GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets 🔗](https://arxiv.org/abs/2410.15096)

Escaping the Mode Collapse: How GDPO Brings Diversity to LLM Alignment

If you have used modern Large Language Models (LLMs) like ChatGPT or Claude extensively, you might have noticed a pattern. While they are incredibly helpful and safe, they can also be somewhat repetitive. Ask the same question five times, and you will often get five variations of the exact same answer—often written in the same “safe,” neutral tone. This phenomenon is largely a byproduct of alignment. To make models safe and helpful, we train them using human preferences. The industry standards, Reinforcement Learning from Human Feedback (RLHF) and its more efficient cousin Direct Preference Optimization (DPO), are excellent at forcing models to output high-quality answers. However, they suffer from a theoretical limitation: they are mode-seeking. They aggressively optimize for the single “best” answer, often stripping away the diversity and creativity inherent in the pre-trained model. ...
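The “mode-seeking” behavior mentioned above is commonly illustrated by the direction of the KL divergence an objective minimizes; the sketch below is a generic textbook contrast, not a formula taken from the GDPO paper.

```latex
% Reverse KL (the direction regularizing RLHF/DPO-style objectives) is minimized
% by concentrating the policy \pi on a single high-probability mode of the target p:
\mathrm{KL}(\pi \,\|\, p) = \mathbb{E}_{y \sim \pi}\!\left[\log \tfrac{\pi(y)}{p(y)}\right]
% Forward KL is minimized by spreading \pi over all modes of p (mass-covering):
\mathrm{KL}(p \,\|\, \pi) = \mathbb{E}_{y \sim p}\!\left[\log \tfrac{p(y)}{\pi(y)}\right]
```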

2024-10 · 9 min · 1767 words
[GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities 🔗](https://arxiv.org/abs/2406.11768)

Beyond "Bird Chirping": How GAMA Unlocks Complex Reasoning in Audio-Language Models

Introduction: Imagine an autonomous robot navigating a city. It hears a loud horn followed by a screech of tires. A basic audio system might label this simply as “vehicle horn” and “skidding.” But a human—or a truly intelligent agent—understands the implication: a potential accident has occurred, or a collision was narrowly avoided. The sound isn’t just a label; it’s a clue about a complex, unfolding scenario. Large Language Models (LLMs) have mastered text, and we are seeing a surge in multimodal models that can “see” images. However, the ability to perceive and reason about non-speech sounds—the ambient noise, mechanical whirs, and environmental cues that make up our world—has lagged behind. While current Audio-Language Models (ALMs) can describe sounds (e.g., “a dog barking”), they often fail at complex reasoning. They struggle to answer questions like, “Given the context of the laughter and the automotive sounds, what is the likely scenario?” ...

2024-06 · 8 min · 1549 words
[FuseGen: PLM Fusion for Data-generation based Zero-shot Learning 🔗](https://arxiv.org/abs/2406.12527)

FuseGen: How Collaborative AI Agents Generate Superior Training Data

In the current landscape of Artificial Intelligence, we are witnessing a “David and Goliath” dynamic. On one side, we have the “Goliaths”—massive Pre-trained Language Models (PLMs) like GPT-4, Llama-2, and Claude. These models are incredibly capable but computationally expensive, slow, and difficult to deploy on edge devices or in privacy-sensitive environments. On the other side, we have the “Davids”—Small Task-specific Models (STMs). These are compact, efficient models (like BERT) that can run on a smartphone or a private server. The problem? Davids need training data—lots of it—to be effective. In many real-world scenarios, high-quality labeled data is scarce or non-existent. ...

2024-06 · 9 min · 1882 words
[Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion 🔗](https://arxiv.org/abs/2311.07682)

Can We "Average Out" AI Bias? How Fusing Models Helps Them Forget the Wrong Things

In the fast-paced world of Natural Language Processing (NLP), we usually obsess over what models learn. We want them to learn syntax, reasoning, coding, and facts about the world. But anyone who has played with a Large Language Model (LLM) knows that they often learn things we don’t want them to. They pick up social biases from the internet, they memorize sensitive training data (like phone numbers), and they learn “shortcuts”—lazy heuristics to solve problems without actually understanding them. ...

2023-11 · 9 min · 1911 words
[From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis 🔗](https://arxiv.org/abs/2406.19934)

Breaking the Vision Barrier: How Plug-and-Play Visual Reasoners Unlock Multi-Step Logic

Introduction: Imagine showing a computer a photo of a messy kitchen and asking, “What year is displayed on the calendar attached to the refrigerator?” For a human, this is a multi-step process. First, you scan the room to find the refrigerator. Second, you look for the calendar on it. Third, you zoom in to read the text. Finally, you deduce the year based on the visible month and days. For a standard Vision-Language Model (VLM), however, this is a chaotic mess of pixels. Most current VLMs try to solve this in a single “glance,” often resulting in confident but incorrect hallucinations. They lack the ability to break a complex problem down into a logical chain of visual steps. ...

2024-06 · 9 min · 1829 words
[From RAG to RICHES: Retrieval Interlaced with Sequence Generation 🔗](https://arxiv.org/abs/2407.00361)

The End of the Retriever? How RICHES Fuses Search and Generation into One Model

The current standard for making Large Language Models (LLMs) factual is Retrieval Augmented Generation, or RAG. The premise is simple: before the LLM answers a question, a separate “retriever” system scans a database, finds relevant documents, and pastes them into the LLM’s context window. It works, but it is architecturally clunky. You have two distinct models—a dense retriever (like a dual-encoder) and a generator (the LLM)—that often don’t speak the same language. They have to be trained or tuned separately, and the pipeline requires handing off data between systems. ...
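For readers new to this setup, the handoff being criticized looks roughly like the toy sketch below; every name in it is illustrative rather than taken from the RICHES paper.

```python
# Toy illustration of a conventional two-stage RAG pipeline (not the RICHES method):
# a retriever picks documents, then a separate generator consumes them in its prompt.

CORPUS = [
    "The Eiffel Tower is located in Paris, France.",
    "Mount Everest is the highest mountain above sea level.",
    "Python was created by Guido van Rossum in 1991.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Score documents by naive token overlap and return the top-k (a stand-in for a dense retriever)."""
    q_tokens = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(q_tokens & set(doc.lower().split())))
    return scored[:k]

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; a real pipeline would hand the prompt to a separate generator model."""
    return f"[LLM answer conditioned on a prompt of {len(prompt)} characters]"

def rag_answer(query: str) -> str:
    docs = retrieve(query, CORPUS)                      # stage 1: the retrieval model
    context = "\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                             # stage 2: the generation model

print(rag_answer("Where is the Eiffel Tower?"))
```

RICHES, as its title suggests, replaces this two-model handoff with a single model that interleaves retrieval with sequence generation.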

2024-07 · 9 min · 1826 words
[From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models 🔗](https://arxiv.org/abs/2407.00263)

Is Your AI Culturally Blind? Inside GLOBALRG: A Benchmark for Multicultural Understanding in Vision-Language Models

In the last few years, Vision-Language Models (VLMs) like CLIP, BLIP-2, and GPT-4V have revolutionized how computers understand the world. They can caption photos, answer questions about visual scenes, and generate art from text. We often attribute their success to the massive scale of their training data—billions of image-text pairs scraped from the internet. But there is a hidden cost to this scale. The internet is not a perfect mirror of the world; it is heavily skewed toward Western cultures, particularly North America and Europe. ...

2024-07 · 8 min · 1626 words
[From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking 🔗](https://arxiv.org/abs/2406.14859)

Breaking the Guardrails: A Deep Dive into Multimodal Jailbreaking

Introduction: The rise of Large Language Models (LLMs) like GPT-4 and Llama has revolutionized how we interact with technology. We use them for coding, writing, and analysis. However, as these models have grown in capability, so too has the cat-and-mouse game of security. Users and researchers alike have discovered ways to bypass the ethical safeguards hard-coded into these systems—a process known as jailbreaking. Initially, jailbreaking was a text-based challenge. Attackers would craft clever prompts to trick a model into generating hate speech, bomb-making instructions, or other prohibited content. But the landscape is shifting. We are now entering the era of Multimodal Large Language Models (MLLMs)—systems that can see, hear, and understand images alongside text. ...

2024-06 · 8 min · 1567 words
[From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP 🔗](https://arxiv.org/abs/2406.12618)

Is Interpretability Research Actually Useful? Quantifying the Impact of 'Why' in NLP

The current era of Natural Language Processing (NLP) is defined by a massive paradox. We have built models—Large Language Models (LLMs)—that possess capabilities we could barely imagine a decade ago. They write code, compose poetry, and reason through complex problems. Yet, for the most part, we have very little idea how they actually work. They are black boxes. This creates a tension in the field. On one side, you have the “builders” pushing for higher benchmarks and efficiency. On the other, you have the “analysts”—researchers in Interpretability and Analysis (IA)—who are trying to peer inside the black box to understand the mechanisms, limitations, and behaviors of these models. ...

2024-06 · 9 min · 1729 words
[From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment 🔗](https://arxiv.org/abs/2406.13912)

The Hidden Cost of Detail: How Richer Image Captions Amplify Bias and Hallucination

In the rapidly evolving world of Computer Vision, we often equate “more” with “better.” More data, more parameters, and—recently—more words. For years, image captioning models were trained on datasets like COCO, where a caption might be as simple as: “A dog sitting on a chair.” It’s accurate, but dry. With the rise of Large Language Models (LLMs) and Multimodal Models (like GPT-4V), researchers found a new trick: Generative Caption Enrichment (GCE). Instead of using short, human-written captions, we can ask an LLM to generate detailed, paragraph-long descriptions. ...

2024-06 · 8 min · 1495 words
[From Bottom to Top: Extending the Potential of Parameter Efficient Fine-Tuning 🔗](https://aclanthology.org/2024.emnlp-main.204.pdf)

Can We Ignore Half the Network? A New Approach to Efficient LLM Fine-Tuning

Introduction: The Heavy Lift of Fine-Tuning. We are living in the golden age of Large Language Models (LLMs). From LLaMA to GPT-J, these models have demonstrated incredible generative capabilities. However, there is a massive catch: size. With parameter counts soaring into the billions, fine-tuning these behemoths for specific downstream tasks—like mathematical reasoning or specialized Q&A—requires computational resources that are out of reach for many researchers and students. To solve this, the community turned to Parameter Efficient Fine-Tuning (PEFT). Methods like LoRA (Low-Rank Adaptation) and Prefix Tuning freeze the massive pre-trained model and only train a tiny sliver of new parameters. These techniques have been game-changers. ...
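As a point of reference for the PEFT methods named in the excerpt, here is a bare-bones LoRA-style layer in PyTorch. It is a generic sketch of the idea (freeze the pretrained weight, train only a low-rank update), not code from this paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA-style adapter: y = base(x) + x A^T B^T * (alpha / r), with the base weight frozen."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                           # freeze the pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # trainable low-rank factor
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))         # trainable, initialized to zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # only the two small factors receive gradients
```

Because only the two low-rank factors are updated, the trainable parameter count stays a tiny fraction of the frozen model, which is what makes PEFT feasible on modest hardware.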

8 min · 1667 words
[Free your mouse! Command Large Language Models to Generate Code to Format Word Documents 🔗](https://aclanthology.org/2024.emnlp-main.902.pdf)

Free Your Mouse: Automating Word Document Formatting with LLMs

We have all been there. You are finishing up a crucial essay, a business proposal, or a complex report in Microsoft Word. The content is golden, but the formatting is a mess. The font sizes are inconsistent, the indentation is slightly off, and for some reason, the third paragraph is in a different shade of black than the rest. You spend the next hour clicking through menus, dragging rulers, and hunting for the “remove spacing after paragraph” button. It is tedious, repetitive, and kills your creative flow. ...

8 min · 1670 words
[Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation 🔗](https://arxiv.org/abs/2407.10817)

Building a Better Critic: How FLAMe Tames LLMs for Automated Evaluation

Introduction: In the rapidly evolving world of Artificial Intelligence, we have reached a point where generating text is easy. We have models that can write poetry, code in Python, and summarize legal documents in seconds. However, we have hit a new, arguably more difficult bottleneck: Evaluation. How do we know if the text the model generated is actually good? Traditionally, the gold standard for evaluation has been human judgment. You show a human two summaries and ask, “Which one is more accurate?” But as Large Language Models (LLMs) scale, human evaluation becomes prohibitively expensive, slow, and sometimes subjective. This has led to the rise of the “LLM-as-a-Judge” paradigm, where powerful models like GPT-4 are used to grade the work of smaller models. ...

2024-07 · 8 min · 1672 words
[Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge 🔗](https://arxiv.org/abs/2410.04784)

Why LLMs Trust Textbooks Over Tweets: Unraveling Learning Preferences in Conflicting Data

Imagine you are browsing the internet trying to find the birth date of a historical figure. You find two conflicting sources. One is a scanned PDF of an academic biography written by a historian. The other is a comment on a social media thread that is riddled with spelling errors. Which one do you trust? Almost instinctively, you trust the academic biography. You rely on heuristics—mental shortcuts—that tell you formal language, proper editing, and authoritative tone correlate with truth. ...

2024-10 · 9 min · 1784 words
[Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models 🔗](https://arxiv.org/abs/2410.04727)

Beyond Perplexity: Measuring LLM Memory with the Forgetting Curve

In the rapidly evolving landscape of Large Language Models (LLMs), there is a massive push for longer context windows. We’ve gone from models that could handle a few paragraphs to beasts claiming to process 128k, 200k, or even 1 million tokens. But here is the critical question: just because a model accepts a million tokens, does it actually remember them? For students and researchers entering this field, this is a tricky problem. We traditionally rely on metrics like Perplexity (PPL) or tasks like “Needle in a Haystack” to evaluate models. However, a new research paper titled “Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models” argues that these existing methods are fundamentally flawed when it comes to long-range memory. ...
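For context, perplexity is just the exponentiated average negative log-likelihood of a sequence: a low value means the model predicts typical next tokens well on average, which is not the same thing as recalling a specific token from hundreds of thousands of positions earlier.

```latex
\mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
```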

2024-10 · 7 min · 1418 words
[Fool Me Once? Contrasting Textual and Visual Explanations in a Clinical Decision-Support Setting 🔗](https://aclanthology.org/2024.emnlp-main.1051.pdf)

The Trap of Eloquence: Why Textual AI Explanations Can Fool Doctors

The integration of Artificial Intelligence into healthcare is no longer a futuristic concept; it is happening now. From diagnosing skin lesions to predicting patient outcomes, AI models are becoming powerful tools in the clinician’s arsenal. However, with great power comes the “black box” problem. Deep learning models, particularly in medical imaging, are notoriously opaque. We know what they decide, but we rarely know why. To bridge this gap, the field of Explainable AI (XAI) has exploded in popularity. The logic is sound: if an AI can explain its reasoning, doctors can trust it when it’s right and—crucially—catch it when it’s wrong. ...

9 min · 1727 words
[FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture 🔗](https://arxiv.org/abs/2406.11030)

Can AI Order Dinner? Testing Cultural Nuance with FoodieQA

Introduction: The Hotpot Dilemma. Imagine walking into a restaurant in Beijing. You order “hotpot.” You receive a copper pot with plain water and ginger, accompanied by thinly sliced mutton and sesame dipping sauce. Now, imagine doing the same thing in Chongqing. You receive a bubbling cauldron of beef tallow, packed with chili peppers and numbing peppercorns, served with duck intestines. Same name, entirely different cultural experiences. For humans, these distinctions are part of our cultural fabric. We understand that food isn’t just a collection of ingredients; it is tied to geography, history, and local tradition. But for Artificial Intelligence, specifically Vision-Language Models (VLMs), this level of fine-grained cultural understanding is a massive blind spot. ...

2024-06 · 8 min · 1522 words
[Focused Large Language Models are Stable Many-Shot Learners 🔗](https://arxiv.org/abs/2408.13987)

Why More Isn't Always Better: Fixing Attention Dispersion in Many-Shot In-Context Learning

Large Language Models (LLMs) have transformed the landscape of Artificial Intelligence, largely due to their ability to perform In-Context Learning (ICL). This is the capability where a model learns to solve a task simply by looking at a few examples (demonstrations) provided in the prompt, without any parameter updates. The prevailing wisdom—and the scaling law that governs much of deep learning—suggests that “more is better.” If giving an LLM five examples helps it understand a task, giving it a hundred examples should make it an expert. However, recent empirical studies have uncovered a baffling phenomenon: as the number of demonstrations increases from “few-shot” to “many-shot,” performance often plateaus or even degrades. ...
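Mechanically, in-context learning just means stacking labeled demonstrations ahead of the query in the prompt, with no weight updates; the snippet below is a minimal illustration of how such a many-shot prompt is assembled (the task and format are invented for the example, not drawn from the paper).

```python
# Minimal illustration of building a many-shot in-context-learning prompt:
# the model sees labeled demonstrations followed by the query, with no parameter updates.

demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
    ("A solid cast wasted on a lifeless script.", "negative"),
]  # a many-shot setting would extend this list to dozens or hundreds of examples

def build_icl_prompt(demos: list[tuple[str, str]], query: str) -> str:
    """Concatenate labeled demonstrations, then append the unlabeled query."""
    lines = [f"Review: {text}\nSentiment: {label}\n" for text, label in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

print(build_icl_prompt(demonstrations, "An unexpectedly moving finale."))
```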

2024-08 · 9 min · 1820 words