Introduction
In the rapidly evolving landscape of Artificial Intelligence, Multimodal Large Language Models (MLLMs) like LLaVA and MiniGPT-4 represent a massive leap forward. These models don’t just read text; they “see” images and can converse about them. However, for all their impressive capabilities, MLLMs suffer from a persistent and frustrating glitch: hallucination.
Hallucination occurs when the model confidently describes objects that simply aren’t there. Imagine showing an AI a picture of a living room, and it describes a “vintage red telephone” on the table when the table is empty.

As shown in Figure 1 above, the model sees an airplane flying. The generated text claims the “landing gear is visibly down.” However, a closer look at the image reveals no such thing. While this might seem like a minor error, in critical applications—like medical imaging or autonomous navigation—such fabrications can be dangerous.
Current methods to fix this involve expensive human annotation or computationally heavy reinforcement learning (RLHF). But what if there were a way to fix these hallucinations without human labelers and with a fraction of the computing power?
This post breaks down a fascinating paper, “EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models,” which proposes a novel method to force models to “unlearn” their bad habits efficiently.
The Problem with Current Fixes
To understand why EFUF is significant, we first need to look at how researchers currently handle hallucinations. The approaches generally fall into two buckets:
- Inference-based methods: These use external tools or “self-reflection” prompts to check the model’s work after it generates text. While effective, this slows down the chat experience significantly and increases inference costs.
- Finetuning-based methods: These involve retraining the model to align it better with reality. Techniques like RLHF (Reinforcement Learning from Human Feedback) or DPO (Direct Preference Optimization) are popular here.
The catch? Finetuning methods are resource-hungry. They typically require paired data—thousands of examples of “good” responses vs. “hallucinated” responses. Creating this dataset usually requires expensive human labor or advanced proprietary models like GPT-4. Furthermore, running these alignment algorithms requires massive computational resources (high GPU hours).
The Core Hypothesis: Can We Automate Detection?
The researchers behind EFUF asked a critical question: Can we distinguish between a real object and a hallucinated object without human help?
Their hypothesis was that the CLIP model—a neural network trained to understand the relationship between images and text—could serve as an automated judge. If an MLLM mentions an object (e.g., “landing gear”), and we compare that text to the image using CLIP, the similarity score should be high if the object is present and low if it is hallucinated.
To test this, they analyzed the distribution of CLIP scores for known hallucinated vs. non-hallucinated objects.

As Figure 2 illustrates, there is a distinct separation. The orange curves (Hallucinated) cluster around lower similarity scores, while the blue curves (Non-hallucinated) peak at higher scores. This statistical gap confirms that we can use CLIP scores to automatically categorize generated text as “true” or “hallucinated” by setting specific thresholds.
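To make the judging step concrete, here is a minimal sketch of how one could score an object mention against an image with an off-the-shelf CLIP model. The checkpoint, the prompt template, and the idea of thresholding the raw similarity logit are illustrative assumptions, not the paper’s exact configuration.

```python
# pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper's exact CLIP variant may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, object_phrase: str) -> float:
    """Similarity between the image and a short text mention of an object."""
    inputs = processor(text=[f"a photo of {object_phrase}"],
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the scaled cosine similarity between the
    # image embedding and the text embedding.
    return outputs.logits_per_image.item()

image = Image.open("airplane.jpg")  # hypothetical local file
print(clip_score(image, "landing gear"))  # a low score flags a likely hallucination
```

In practice, a score distribution like the one in Figure 2 is what tells you where to place the cutoffs.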
The EFUF Methodology
The Efficient Fine-Grained Unlearning Framework (EFUF) operates in two main stages: Dataset Formation and the Unlearning Process.

Figure 3 provides a high-level view of the workflow. Instead of retraining the whole model on general data, EFUF specifically targets the parts of the model’s generation that cause hallucinations.
Stage 1: Constructing the Dataset (Without Humans)
The framework starts by prompting the MLLM to describe images. It then extracts specific objects from the generated text and calculates the CLIP similarity score for each. Based on the thresholds discovered in the preliminary experiment (Figure 2), the data is split into three categories:
- Positive Subsentences (\(D^+\)): Parts of the text containing objects that truly exist in the image (high CLIP score).
- Negative Subsentences (\(D^-\)): Parts of the text containing hallucinated objects (low CLIP score).
- Sentence Samples (\(D^s\)): Full responses that are generally high-quality, used to maintain the model’s ability to speak fluently.
The researchers define the Positive Dataset (\(D^+\)) and Negative Dataset (\(D^-\)) mathematically as:
\[
D^{+} = \left\{\, x_i^{cur,j} \;\middle|\; S(o_i^j) > T_0 \,\right\}, \qquad
D^{-} = \left\{\, x_i^{cur,j} \;\middle|\; S(o_i^j) < T_1 \,\right\}
\]
Here, \(S(o_i^j)\) is the CLIP score for a specific object. If it exceeds threshold \(T_0\), it’s positive. If it falls below \(T_1\), it’s negative. The framework isolates the specific “subsentence” (cur) containing the object, rather than throwing out the whole response. This granularity is key to the method’s efficiency.
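Conceptually, the construction step is a single filtering pass over the model’s own captions. The sketch below assumes illustrative threshold values and field names; in the paper, \(T_0\) and \(T_1\) are chosen from the score distributions discussed above.

```python
# Assumed cutoffs chosen from the CLIP score distributions (Figure 2).
T0 = 30.0  # above this, treat the object as actually present
T1 = 24.0  # below this, treat the object as hallucinated

def build_datasets(samples):
    """samples: list of dicts with an image, the full generated caption, and,
    for each mentioned object, its subsentence, preceding context, and CLIP score.
    The field names here are hypothetical."""
    d_pos, d_neg, d_sent = [], [], []
    for s in samples:
        hallucinated = False
        for obj in s["objects"]:
            record = (s["image"], obj["prefix"], obj["subsentence"])
            if obj["clip_score"] > T0:
                d_pos.append(record)      # D+: object grounded in the image
            elif obj["clip_score"] < T1:
                d_neg.append(record)      # D-: likely hallucinated object
                hallucinated = True
        if not hallucinated:
            d_sent.append((s["image"], s["caption"]))  # D^s: clean full response
    return d_pos, d_neg, d_sent
```

In this sketch, scores falling between the two thresholds are simply left out, so only confidently grounded or confidently hallucinated subsentences influence training.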
Stage 2: Fine-Grained Unlearning
Now comes the actual “unlearning.” The goal is to update the model parameters so that it becomes less likely to generate the text in the negative dataset and more likely (or at least stable) regarding the positive dataset.
Standard training uses Gradient Descent to minimize loss (error). Unlearning uses Gradient Ascent on the bad data to maximize the loss—effectively telling the model, “Don’t do this.”
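A tiny PyTorch toy example makes the trick concrete: taking an ordinary optimizer step on the negated loss is the same as taking an ascent step on the original loss.

```python
import torch

# Toy illustration of unlearning via a sign flip (not the paper's code).
w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()   # stands in for the finetuning loss on a "bad" sample
(-loss).backward()      # backpropagate the inverted loss
optimizer.step()        # the update now *increases* the original loss

print(w.item())         # 1.2, i.e. the parameter moved away from the minimum at 0
```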
The researchers designed a composite loss function consisting of three parts.
1. The Negative Loss (\(L_{neg}\))
This is the unlearning component. We take the standard finetuning loss (\(L_{ft}\)) for the hallucinated samples and invert it (add a negative sign). By performing gradient descent on this inverted loss, we are actually performing gradient ascent on the original loss, which pushes down the probability the model assigns to the hallucinated text.
\[
L_{neg} = -\,L_{ft}(D^{-})
\]
2. The Positive Loss (\(L_{pos}\))
We don’t want the model to forget what real objects look like while it’s unlearning the fake ones. So, standard supervised finetuning is applied to the positive subsentences.
\[
L_{pos} = L_{ft}(D^{+})
\]
3. The Sentence Loss (\(L_{sent}\))
Unlearning can sometimes damage a model’s language capabilities, causing it to generate broken grammar or incoherent sentences. To prevent this “catastrophic forgetting” of language skills, the framework applies standard loss to high-quality full sentences (\(D^s\)).
\[
L_{sent} = L_{ft}(D^{s})
\]
The Total Loss
The final objective function combines these three elements, using weights (\(\lambda_1\) and \(\lambda_2\)) to balance the unlearning intensity with language preservation.
\[
L = L_{pos} + \lambda_1 L_{neg} + \lambda_2 L_{sent}
\]
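Putting the three pieces together, one training step might look like the following PyTorch sketch. The weight values, batch layout, and helper names are assumptions for illustration rather than the authors’ implementation.

```python
import torch.nn.functional as F

lambda_1, lambda_2 = 0.3, 1.0  # assumed weights balancing unlearning vs. fluency

def token_nll(model, batch):
    """Standard next-token cross-entropy, i.e. the finetuning loss L_ft.
    `model` and `batch` stand in for an MLLM and its (image, text) inputs."""
    logits = model(**batch["inputs"]).logits             # (batch, seq_len, vocab)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),      # predict token t+1 from t
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,                                # mask prompt/padding tokens
    )

def efuf_step(model, pos_batch, neg_batch, sent_batch, optimizer):
    l_pos = token_nll(model, pos_batch)     # keep grounded subsentences likely
    l_neg = -token_nll(model, neg_batch)    # unlearn hallucinated subsentences
    l_sent = token_nll(model, sent_batch)   # preserve fluent full-sentence output
    loss = l_pos + lambda_1 * l_neg + lambda_2 * l_sent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```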
Experiments and Results
Does this mathematical juggling actually work? The researchers tested EFUF on several leading MLLMs, including MiniGPT-4, LLaVA, mPLUG-Owl, and ShareGPT4V.
Reduction in Hallucinations
The primary metric for success is the CHAIR score (Caption Hallucination Assessment with Image Relevance), where a lower score is better. They also used POPE (Polling-based Object Probing Evaluation) and human evaluation.
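For intuition about what CHAIR measures, here is a simplified sketch of the two standard variants, instance-level and sentence-level. The real benchmark also maps object mentions onto a fixed vocabulary (e.g., COCO classes) and handles synonyms, which this toy version skips.

```python
def chair_scores(captions):
    """captions: list of (mentioned_objects, ground_truth_objects) pairs,
    with objects already normalized to a shared vocabulary (assumed here).

    CHAIR_i = hallucinated object mentions / all object mentions
    CHAIR_s = captions containing a hallucination / all captions
    """
    total_mentions = hallucinated_mentions = hallucinated_captions = 0
    for mentioned, truth in captions:
        bad = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        hallucinated_captions += bool(bad)
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_captions / max(len(captions), 1)
    return chair_i, chair_s
```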

Table 2 shows impressive results. For every model tested, adding EFUF significantly reduced the hallucination rate (CHAIR metrics dropped by ~15%). Crucially, the Generation Quality (measured by BLEU scores and Informativeness) actually increased. This is rare; usually, aligning a model to be “safer” makes it less creative or informative. EFUF manages to reduce errors while keeping the text rich and accurate.
Why “Fine-Grained” Matters (Ablation Study)
You might wonder: Why chop the text into subsentences? Why not just unlearn full bad sentences?
The researchers performed an ablation study to isolate the impact of the fine-grained approach and the sentence loss.

Table 3 reveals a critical insight.
- Vanilla unlearning (full sentences): Reduces hallucinations slightly but isn’t very effective.
- Fine-grained unlearning (no sentence loss): Achieves the lowest hallucination rate but drastically hurts fluency (high perplexity, denoted as ppl.). The model stops making grammatical sense.
- EFUF (Combined): Strikes the best balance. It drastically cuts hallucinations while maintaining low perplexity (high fluency).
Efficiency and Training Cost
One of the strongest selling points of EFUF is its name: Efficient. Because it automates dataset creation and targets specific subsentences, it is incredibly fast compared to RLHF or DPO.

As shown in Figure 4, EFUF (the red line) requires roughly 3 GPU hours to train on an A100. In contrast, RLHF requires roughly 20 hours, and DPO requires between 8 and 16 hours. This makes EFUF highly accessible for researchers and students who don’t have access to massive industrial compute clusters.
Conclusion
The EFUF paper presents a compelling step forward for Multimodal Large Language Models. By cleverly utilizing CLIP scores to automate data labeling and employing a fine-grained unlearning strategy, the researchers created a framework that:
- Eliminates the need for human annotation, solving the data bottleneck.
- Significantly reduces hallucinations across multiple state-of-the-art models.
- Preserves language fluency, avoiding the common pitfall of model degradation during unlearning.
- Operates with high efficiency, requiring a fraction of the time and compute of existing alignment methods.
For students and practitioners in the field, EFUF demonstrates that “more data” isn’t always the answer. Sometimes, the solution lies in smarter data utilization and targeted mathematical interventions—teaching the model not just what to learn, but specifically what to forget.