In the rapidly evolving world of Artificial Intelligence, Large Vision-Language Models (LVLMs) like LLaVA and InstructBLIP have become the superstars. They can look at an image, understand it, and describe it in fluent natural language. Ask them to describe a kitchen, and they will tell you about the fridge, the stove, and the fruit bowl.
But there is a catch. Sometimes, they tell you about a toaster that isn’t there.
This phenomenon is known as Object Hallucination. It’s one of the most persistent and dangerous problems in multimodal AI. If we can’t trust the model to accurately report what it sees, we can’t use it for critical tasks like autonomous navigation or medical imaging analysis.
Today, we are diving into a fascinating research paper titled “DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination.” This paper doesn’t just treat the symptoms; it diagnoses a specific, structural illness inside the model’s “brain”—specifically, how it pays attention—and proposes a clever, training-free cure.

The Root of the Problem: A Tale of Two Attentions
To understand why hallucinations happen, we first need to look at the anatomy of an LVLM. Typically, these models have two main parts:
- Visual Encoder (ViT): The “eyes.” It breaks an image into patches (tokens) and converts them into mathematical representations.
- LLM Decoder: The “mouth.” It takes those visual tokens and generates text.
For a long time, researchers tried to fix hallucinations by fine-tuning on curated data or tweaking the text-generation process. But the authors of DAMRO asked a deeper question: Is the Visual Encoder lying to the LLM?
The “Outlier Token” Phenomenon
The researchers discovered that standard Vision Transformers (ViTs) have a quirk. They tend to generate “high-norm outlier tokens.” In plain English, the model’s attention mechanism gets obsessed with specific patches of the image that often contain very little useful information—usually background noise or redundant areas.
Ideally, if you ask a model to describe a room, it should focus on the furniture. However, the attention maps tell a different story.

As seen in the figure above, look at the bright yellow spots in the heatmap. The model is fixating on random points in the background rather than the main objects. These are the outlier tokens.
The researchers quantified this imbalance. They found that a tiny fraction of the tokens hog almost all the attention.

In Figure 7, notice how the very first few tokens (on the far left) have massive attention values, while the hundreds of other tokens (the “long tail”) are barely noticed. The top 3 tokens alone can account for over 99% of the attention mass in some layers!
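To make the "a tiny fraction of tokens hogs the attention" claim concrete, here is a minimal sketch (my own illustration, not the paper's code) that measures how much of the [CLS] token's attention mass the top-k patches absorb. The tensor shapes and the 576-patch grid are assumptions matching a CLIP-style ViT:

```python
import torch

def topk_attention_share(cls_to_patch_attn: torch.Tensor, k: int = 3) -> float:
    """Fraction of the [CLS] token's attention mass absorbed by its top-k patches.

    cls_to_patch_attn: (num_heads, num_patches) attention weights from the
    [CLS] query to every patch key, taken from a single ViT layer.
    """
    attn = cls_to_patch_attn.mean(dim=0)   # average over heads
    attn = attn / attn.sum()               # normalize so patch weights sum to 1
    return attn.topk(k).values.sum().item()

# Toy illustration: 576 patches (24x24 grid), with three artificially
# dominant "outlier" positions standing in for high-norm tokens.
attn = torch.rand(16, 576) * 1e-3
attn[:, [7, 42, 300]] += 1.0
print(f"top-3 share: {topk_attention_share(attn, k=3):.3f}")  # ~0.9: three patches hog the mass
```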
The Domino Effect: When the LLM Follows the Leader
If the “eyes” (Visual Encoder) are staring at the wall instead of the sofa, what does the “mouth” (LLM) do?
It turns out the LLM blindly trusts the encoder. The researchers analyzed the attention distribution inside the LLM decoder and found it strikingly similar to that of the Visual Encoder.

This consistency is dangerous. When the LLM focuses on these “noisy” background tokens, it loses track of the actual visual details. Lacking clear information, it starts to guess—and that is when hallucinations happen.
The paper provides a compelling visual comparison to prove this. First, look at a case where the model does not hallucinate:

Here, when the model talks about a “plant,” its attention (the bright spots) is actually looking at the plant.
Now, look at a hallucination case:

When the model hallucinates a “clock,” it isn’t looking at a clock (because there isn’t one). Instead, it is staring at those high-attention outlier tokens in the background. The model is effectively “zoning out” and making things up.
Quantifying the Correlation
The researchers proposed a metric, \(H_i\), to measure the overlap between the Visual Encoder’s outliers and the LLM’s attention.

They found a clear correlation: Higher overlap = Higher Hallucination.

The graphs above show that when the model hallucinates (blue bars), the overlap between the Visual Encoder’s bad habits and the LLM’s focus is significantly higher than when it is accurate (orange bars). This confirmed the hypothesis: Outlier tokens are the carriers of hallucinations.
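The paper's exact formula for \(H_i\) is not reproduced here, but one reasonable stand-in, purely for illustration, is the share of the LLM decoder's image-token attention that lands on the encoder's top-k outlier positions. The function and variable names below are my own assumptions:

```python
import torch

def outlier_overlap(llm_attn_over_image: torch.Tensor,
                    outlier_indices: torch.Tensor) -> float:
    """Share of the LLM's image-token attention that falls on encoder outliers.

    llm_attn_over_image: (num_image_tokens,) attention weights a generated token
    places on the projected image tokens (e.g., averaged over heads and layers).
    outlier_indices: indices of the visual encoder's high-norm outlier tokens.
    """
    attn = llm_attn_over_image / llm_attn_over_image.sum()
    return attn[outlier_indices].sum().item()

# Hallucinated tokens would be expected to show higher overlap than accurate ones.
attn = torch.softmax(torch.randn(576), dim=0)
outliers = torch.tensor([7, 42, 300])
print(outlier_overlap(attn, outliers))
```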
The Solution: DAMRO
The diagnosis is clear: The Visual Encoder highlights garbage tokens, and the LLM treats them like gold, leading to made-up objects.
The cure is DAMRO (Dive into Attention Mechanism to Reduce Object Hallucination). The beauty of this method is that it is training-free. You don’t need to retrain the massive model; you just change how it decodes the answer.
The method has two steps:
- Identify the toxic outlier tokens.
- Neutralize them using Contrastive Decoding.
Step 1: Hunting the Outliers
How do we know which tokens are the bad ones? The researchers utilize the [CLS] token (Classification Token) from the Vision Transformer. In ViT architectures, the [CLS] token aggregates information from the whole image. The tokens that the [CLS] token pays the most attention to are usually the high-norm outliers.
The attention calculation for the [CLS] token is standard:

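In standard scaled dot-product notation, this is

\[
A_{[\mathrm{CLS}]} = \operatorname{softmax}\!\left(\frac{q_{[\mathrm{CLS}]}\,K^{\top}}{\sqrt{d}}\right),
\]

where \(q_{[\mathrm{CLS}]}\) is the [CLS] token's query vector, \(K\) is the matrix of key vectors for all image patches, and \(d\) is the attention head dimension (the symbols here follow the usual ViT convention rather than the paper's exact notation).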
DAMRO simply selects the top-\(k\) tokens with the highest attention values as the “Outliers.”

These identified tokens are now marked as “negative information.”
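Here is a minimal sketch of this selection step (an illustration under my own naming assumptions, such as reading the ViT's final-layer attention; it is not the authors' released code):

```python
import torch

def select_outlier_tokens(vit_attn: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Pick the top-k patch tokens that the [CLS] token attends to most.

    vit_attn: (num_heads, seq_len, seq_len) self-attention weights from one
    ViT layer, where position 0 is [CLS] and positions 1.. are image patches.
    Returns patch indices (0-based, excluding [CLS]) flagged as outliers.
    """
    cls_attn = vit_attn[:, 0, 1:].mean(dim=0)   # [CLS] -> patches, head-averaged
    return cls_attn.topk(k).indices

# Example with random weights standing in for a real ViT forward pass.
attn = torch.softmax(torch.randn(16, 577, 577), dim=-1)  # 576 patches + [CLS]
outlier_idx = select_outlier_tokens(attn, k=10)
print(outlier_idx)
```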
Step 2: Contrastive Decoding
Now comes the clever part. We want the LLM to generate text that relies on the image, but not on those specific outlier tokens.
The team uses a technique called Contrastive Decoding. The idea is to calculate two probabilities for the next word:
- Original Logits: The probability distribution using all visual tokens (including the bad ones).
- Negative Logits: The probability distribution using only the outlier tokens.
If the “Negative Logits” (based only on noise) strongly suggest a word (e.g., “clock”), but the rest of the image doesn’t support it, we want to penalize that word.
The formula for the final probability \(p_t\) looks like this:

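Reconstructed in the usual contrastive-decoding notation (with \(v\) the full set of visual tokens, \(v^{*}\) the outlier-only subset, \(x\) the text prompt, and \(y_{<t}\) the tokens generated so far; the symbols are chosen to match the intuition below rather than copied from the paper):

\[
p_t(y_t \mid v, x, y_{<t}) = \operatorname{softmax}\!\Big[(1+\alpha)\,\operatorname{logit}_\theta(y_t \mid v, x, y_{<t}) \;-\; \alpha\,\operatorname{logit}_\theta(y_t \mid v^{*}, x, y_{<t})\Big]
\]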
Here is the intuition behind the math:
- We take the Original prediction (weighted by \(1 + \alpha\)).
- We subtract the Negative prediction (weighted by \(\alpha\)).
- This amplifies the signal from the good parts of the image and suppresses the signal from the outlier parts.
Finally, to ensure the model doesn’t go too far and start generating nonsense, they apply an Adaptive Plausibility Constraint. This ensures that the final probability distribution isn’t too different from what the model originally thought was reasonable.
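Putting both pieces together, here is a compact sketch of one decoding step (an illustration under my own naming assumptions such as `logits_full` and `logits_outlier`, not the authors' implementation); the cutoff `beta` mirrors the adaptive plausibility constraint used in contrastive-decoding methods:

```python
import torch
import torch.nn.functional as F

def damro_style_step(logits_full: torch.Tensor,
                     logits_outlier: torch.Tensor,
                     alpha: float = 1.0,
                     beta: float = 0.1) -> torch.Tensor:
    """One contrastive decoding step over next-token logits.

    logits_full:    (vocab,) logits conditioned on the full set of visual tokens.
    logits_outlier: (vocab,) logits conditioned only on the outlier tokens.
    """
    # Contrastive combination: boost the full-image signal, subtract the noise-only signal.
    contrast = (1 + alpha) * logits_full - alpha * logits_outlier

    # Adaptive plausibility constraint: only keep tokens the original model
    # already considered reasonably likely (within beta of the top probability).
    probs_full = F.softmax(logits_full, dim=-1)
    plausible = probs_full >= beta * probs_full.max()
    contrast = contrast.masked_fill(~plausible, float("-inf"))

    return F.softmax(contrast, dim=-1)   # distribution to sample or argmax from

# Toy usage with random logits standing in for two forward passes of the LLM.
vocab = 32000
p = damro_style_step(torch.randn(vocab), torch.randn(vocab), alpha=1.0, beta=0.1)
next_token = p.argmax()
```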

The Results: Does it Work?
The researchers tested DAMRO on several popular LVLMs, including LLaVA-1.5, LLaVA-NeXT, and InstructBLIP. They used three major benchmarks: POPE, CHAIR, and MME.
1. POPE (Polling-based Object Probing Evaluation)
POPE asks the model “Is there a [object] in the image?” for objects that are and aren’t there. It’s a binary stress test for hallucination.

As shown in Table 2, DAMRO (bottom row for each model) consistently outperforms the “Original” model. For LLaVA-NeXT, the F1 score jumps from 83.07 to 87.60. It also remains competitive with or beats other methods like VCD and M3ID.
2. CHAIR (Caption Hallucination Assessment with Image Relevance)
CHAIR measures hallucination in open-ended caption generation. It counts how many objects mentioned in the text actually exist in the image.

There are two metrics:
- CHAIR_S (\(C_S\)): the percentage of sentences that contain at least one hallucinated object.
- CHAIR_I (\(C_I\)): the percentage of mentioned object instances that are hallucinated.
Lower scores are better here.
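As a rough illustration of how these two rates are computed, here is a simplified sketch (the real benchmark matches objects against COCO annotations and synonym lists, which are omitted here):

```python
def chair_scores(mentioned_objects, ground_truth_objects):
    """Simplified CHAIR. mentioned_objects[i] is the set of objects named in
    caption i; ground_truth_objects[i] is the set of objects actually in image i."""
    hallucinated_sentences = 0
    hallucinated_mentions = 0
    total_mentions = 0
    for mentioned, truth in zip(mentioned_objects, ground_truth_objects):
        fake = mentioned - truth                 # objects named but not present
        hallucinated_sentences += int(bool(fake))
        hallucinated_mentions += len(fake)
        total_mentions += len(mentioned)
    chair_s = hallucinated_sentences / len(mentioned_objects)
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    return chair_s, chair_i

# Example: two captions, one of which invents a "clock".
c_s, c_i = chair_scores(
    mentioned_objects=[{"plant", "table"}, {"clock", "sofa"}],
    ground_truth_objects=[{"plant", "table"}, {"sofa"}],
)
print(c_s, c_i)   # 0.5, 0.25
```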

The results in Table 3 are striking. For LLaVA-1.5, DAMRO reduced the sentence-level hallucination rate (\(C_S\)) from 12.4% down to 6.0%. That is a massive reduction in errors, effectively halving the number of hallucinated sentences.
3. MME (Multimodal Evaluation)
MME is a comprehensive benchmark. The researchers focused on the hallucination subset (Existence, Count, Position, Color).
(The corresponding MME charts for LLaVA-1.5 and LLaVA-NeXT show similar trends, with DAMRO balancing performance well against the baselines.)
Qualitative Proof: Seeing is Believing
Numbers are great, but let’s look at an actual example of text generation.

In this example:
- The Image: A luggage cart in a lobby.
- Original LLaVA: Hallucinates a TV and various chairs that are not visible.
- DAMRO: Correctly identifies the luggage cart, bags, and people, without inventing the TV or chairs.
The GPT-4 Evaluation score confirms that the DAMRO description is more accurate (less hallucination) while maintaining good detail.
Conclusion
The “DAMRO” paper provides a crucial insight into the inner workings of Vision-Language Models. It reminds us that bigger isn’t always better; sometimes, the model’s attention mechanisms are flawed, focusing on background noise rather than the signal.
By simply identifying these noisy “outlier tokens” and mathematically subtracting their influence during the text generation process, we can significantly clean up the model’s output.
Key Takeaways:
- Visual Encoders are flawed: They fixate on background outliers.
- LLMs inherit this flaw: They trust the encoder’s bad attention.
- Subtraction is powerful: You don’t always need to teach the model what to do; sometimes it helps to tell it what not to rely on.
- No Training Needed: DAMRO works on existing models immediately.
As we strive for safer, more reliable AI, techniques like DAMRO show that understanding the mechanism of the model—peeking under the hood—is just as important as feeding it more data.