Imagine asking an AI to describe a photo of a living room. It correctly identifies the sofa, the television, and the coffee table. But then, it confidently adds, “and there is a cat sleeping on the rug.” You look closely. There is no cat. There has never been a cat.

This phenomenon is known as Object Hallucination. It is one of the most persistent and dangerous problems in Large Vision-Language Models (LVLMs) like LLaVA or GPT-4V. In high-stakes fields like medical imaging or autonomous driving, a hallucinated tumor or a non-existent pedestrian can be catastrophic.

For a long time, researchers have pointed the finger at the “brain” of these systems—the Large Language Model (LLM) component. They assumed the LLM was making things up because of its training on vast amounts of text. But what if the problem isn’t the brain, but the eyes?

In this post, we are doing a deep dive into the paper “Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models.” We will explore how researchers discovered that the visual encoder (CLIP)—the very eyes of the AI—is often the source of these hallucinations. More importantly, we will break down the mathematical method they developed to perform “corrective eye surgery” on these models, significantly reducing hallucinations.

The Suspect: The Visual Encoder

To understand the solution, we first need to understand the architecture of a typical LVLM. These models generally consist of three parts:

  1. Visual Encoder: Usually a CLIP (Contrastive Language–Image Pre-training) model. This converts an image into a mathematical representation (embeddings).
  2. Modality Connector: A layer that translates visual embeddings into something the language model understands.
  3. Large Language Model (LLM): The component that generates the text response.

The researchers hypothesized that if the Visual Encoder (CLIP) itself suffers from hallucinations—meaning it thinks an image contains objects that aren’t there—the LLM downstream has no choice but to hallucinate in its text output. It is a classic case of “Garbage In, Garbage Out.”
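To keep this pipeline concrete, here is a toy PyTorch sketch of the data flow. The modules and dimensions are stand-ins chosen for illustration, not LLaVA's actual components:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three LVLM stages; shapes and module choices are illustrative only.
vision_encoder = nn.Linear(3 * 224 * 224, 768)   # stands in for CLIP: image -> visual embeddings
connector = nn.Linear(768, 4096)                 # modality connector: visual space -> LLM embedding space

image = torch.randn(1, 3, 224, 224)              # a dummy image tensor
visual_embeds = vision_encoder(image.flatten(1)) # stage 1: encode the image
llm_inputs = connector(visual_embeds)            # stage 2: translate for the language model
print(llm_inputs.shape)                          # torch.Size([1, 4096])

# Stage 3: a real LVLM would feed these embeddings to the LLM alongside the text prompt.
# If the visual embeddings already encode a non-existent object, the LLM has little
# chance of producing a faithful description.
```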

Investigating CLIP

CLIP is trained to match images with text descriptions. Ideally, if you show CLIP a picture of a lone tower, it should assign a high similarity score to the text “a tower” and a clearly lower score to “a tower and a cityscape,” because no cityscape is actually in the image.
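You can probe this behavior yourself with an off-the-shelf CLIP checkpoint. The sketch below uses the Hugging Face transformers API; the image path and captions are placeholders, not the paper's benchmark data:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a standard pretrained CLIP (the same family of encoders used inside many LVLMs).
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("tower.jpg")  # placeholder path: an image containing only a tower
captions = [
    "a photo of a tower",                  # faithful caption
    "a photo of a tower and a cityscape",  # hallucinated caption (no cityscape present)
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability = CLIP thinks that caption matches the image better.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
```

If the hallucinated caption wins, the encoder itself is hallucinating before any language model gets involved.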

To test if CLIP was failing at this, the authors created a new benchmark called OHD-Caps (Object Hallucination Detection).

The pipeline of the benchmark creation process showing how an image is processed to create positive and negative captions.

As shown in the pipeline above, the process works like this:

  1. Object Identification: They use a segmentation model (SEEM) to identify what is actually in the image (e.g., mountain, tree, sky).
  2. Negative Sample Generation: They use GPT-4 to generate “hallucinated” captions. They might insert a random object (e.g., “cityscape”) or remove an existing one.
  3. The Test: They feed the image, the correct caption, and the hallucinated captions into CLIP.
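As a rough illustration of step 2, the sketch below mimics the “add an object” and “remove an object” operations at the string level. The real benchmark uses SEEM for grounding and GPT-4 for fluent caption rewriting; the object lists here are made up:

```python
import random

present_objects = ["mountain", "tree", "sky"]   # what segmentation says is actually in the image
distractors = ["cityscape", "cat", "boat"]      # objects that are NOT in the image

positive = "A photo of a " + ", a ".join(present_objects) + "."

# Hallucinated negative: insert an object that is not actually there.
neg_add = "A photo of a " + ", a ".join(present_objects + [random.choice(distractors)]) + "."

# Deletion negative: drop an object that is present, so the caption no longer matches the scene.
dropped = random.choice(present_objects)
neg_del = "A photo of a " + ", a ".join(o for o in present_objects if o != dropped) + "."

print(positive)   # e.g. "A photo of a mountain, a tree, a sky."
print(neg_add)    # e.g. "A photo of a mountain, a tree, a sky, a cat."
print(neg_del)    # e.g. "A photo of a mountain, a sky."
```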

If CLIP is working correctly, it should assign the highest score to the correct caption. However, the results were startling. The standard CLIP ViT-L/14 model, widely used in many AI systems, only identified the correct caption 19.0% of the time. In the example above, CLIP actually preferred the caption claiming there was a “cityscape” in the background, even though there wasn’t one.

This confirmed the hypothesis: the visual encoder is prone to hallucinations. It struggles to distinguish between “a tower” and “a tower with a cityscape” because it likely learns a “bag-of-words” representation—capturing the general gist of the image rather than checking precisely which objects are present.

The Cure: Fine-Grained Object-Level Contrastive Loss

Since standard CLIP training isn’t enough to prevent hallucinations, the authors proposed a fine-tuning method. The core of this method is a new loss function designed to force the model to pay attention to specific objects.

Let’s break down the mathematics of this solution step-by-step.

1. The Standard Contrastive Loss (The Baseline)

First, let’s look at how CLIP is normally trained. The goal is to maximize the similarity between an image (\(I\)) and its correct text (\(T^+\)), while minimizing the similarity to incorrect texts (\(T^-\)).

The standard contrastive loss equations for image-to-text and text-to-image.
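Based on the terms described just below, the image-to-text direction takes the familiar InfoNCE form (a reconstruction; the paper's exact notation may differ):

\[
\mathcal{L}_{i2t} = -\log \frac{\exp(I \cdot T^{+} / \tau)}{\exp(I \cdot T^{+} / \tau) + \sum_{T^{-}} \exp(I \cdot T^{-} / \tau)}
\]

The text-to-image loss \(\mathcal{L}_{t2i}\) is defined symmetrically, matching each caption against its own image versus the other images in the batch.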

In this equation:

  • \(\mathcal{L}_{i2t}\) is the image-to-text loss.
  • The numerator \(\exp(I \cdot T^+ / \tau)\) represents the strength of the match between the image and the correct text.
  • The denominator sums up the match strength for the correct text and the negative texts (\(T^-\)).
  • \(\tau\) is a temperature parameter that scales the values.

In standard training, \(T^-\) usually just refers to captions of other images in the same batch. This is easy for the model to solve. It just needs to know that a caption about a “dog” doesn’t match a picture of a “car.” It doesn’t force the model to look closely for small details.

2. Injecting Hallucinations into the Loss

To fix this, the authors introduce the specific hallucinated captions (generated in the OHD-Caps benchmark) into the training process. Let’s call these “enhanced negative samples” or \(T^{neg}\).

They modify the loss function to explicitly include these hard negatives in the denominator:

The modified image-to-text loss function including negative samples.
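Concretely, the denominator now also sums over the hallucinated captions (a reconstruction from the description rather than the paper's exact typesetting):

\[
\mathcal{L}_{i2t} = -\log \frac{\exp(I \cdot T^{+} / \tau)}{\exp(I \cdot T^{+} / \tau) + \sum_{T^{-}} \exp(I \cdot T^{-} / \tau) + \sum_{T^{neg}} \exp(I \cdot T^{neg} / \tau)}
\]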

Now, the model is being punished if it thinks the image (\(I\)) is similar to the hallucinated text (\(T^{neg}\)). It forces the model to distinguish between “A photo of a dog” (Correct) and “A photo of a dog and a cat” (Hallucination), rather than just distinguishing against “A photo of a car.”

3. The Margin Loss (Forcing Separation)

Simply adding the negatives to the denominator might not be enough. The authors want to ensure there is a distinct “gap” or margin between the score of the correct text and the score of the hallucinated text.

They introduce a Margin Loss (\(\mathcal{L}_1\)). This loss acts like a wedge, forcing the similarity of the correct pair (\(I \cdot T^+\)) to be higher than any incorrect pair (\(I \cdot T^*\)) by at least a value of \(\tau_1\).

The equation for the first margin loss.
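A hinge formulation consistent with this description looks like the following, where \(T^{*}\) ranges over the incorrect captions for the image (sketched from the prose; the paper may aggregate the terms differently):

\[
\mathcal{L}_{1} = \sum_{T^{*}} \max\!\left(0,\; \tau_{1} - \left(I \cdot T^{+} - I \cdot T^{*}\right)\right)
\]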

If the difference between the correct score and the wrong score is already larger than \(\tau_1\), the loss is zero. If not, the model is penalized. This guarantees that the correct caption isn’t just “slightly better” than the hallucination—it is distinctively better.

4. Differentiating Hallucinations from Random Errors

Not all errors are equal. A caption describing a completely different image (in-batch negative, \(T^-\)) is very clearly wrong. A caption describing the current image but with one added object (\(T^{neg}\)) is “partially” correct.

The authors realized the model should recognize that the hallucinated caption (\(T^{neg}\)) is more similar to the image than a completely random caption (\(T^-\)), even though it is still wrong. This helps the model learn semantic nuance.

They added a second Margin Loss (\(\mathcal{L}_2\)) to enforce that the hallucinated text scores higher than the random text by a margin of \(\tau_2\):

The equation for the second margin loss comparing negative samples.
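In the same hinge style, this constraint can be sketched as:

\[
\mathcal{L}_{2} = \sum_{T^{neg},\, T^{-}} \max\!\left(0,\; \tau_{2} - \left(I \cdot T^{neg} - I \cdot T^{-}\right)\right)
\]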

This encourages the model to understand that \(T^{neg}\) shares content with the image (the correct objects), while \(T^-\) does not.

5. The Final Objective

Combining all these components, the final training objective becomes a weighted sum of the standard contrastive loss and the two new margin losses:

The final total loss function combining all components.
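In symbols, with \(\mathcal{L}_{cl}\) standing for the contrastive term (both directions) described above, the total objective has the form (again reconstructed from the description):

\[
\mathcal{L} = \mathcal{L}_{cl} + \lambda_{1} \mathcal{L}_{1} + \lambda_{2} \mathcal{L}_{2}
\]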

Here, \(\lambda_1\) and \(\lambda_2\) are weights that control how much importance is placed on the margin losses. By optimizing this function, the CLIP model learns to be extremely picky about object presence.
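For readers who prefer code, here is a minimal PyTorch-style sketch of the combined objective for a single image in the image-to-text direction, assuming the similarity scores have already been computed. The variable names and default hyperparameters are mine, not the paper's:

```python
import torch
import torch.nn.functional as F

def combined_loss(sim_pos, sim_batch_neg, sim_halluc_neg,
                  tau=0.07, tau1=0.2, tau2=0.2, lambda1=1.0, lambda2=1.0):
    """Sketch of the objective for one image (image-to-text direction only).

    sim_pos:        scalar tensor, similarity I . T+ (correct caption)
    sim_batch_neg:  (K,) tensor, similarities to in-batch negatives T-
    sim_halluc_neg: (M,) tensor, similarities to hallucinated captions T^neg
    """
    # Contrastive term: the correct caption competes against both kinds of negatives.
    logits = torch.cat([sim_pos.reshape(1), sim_batch_neg, sim_halluc_neg]) / tau
    target = torch.zeros(1, dtype=torch.long)  # index 0 is the correct caption
    contrastive = F.cross_entropy(logits.unsqueeze(0), target)

    # Margin 1: the correct caption must beat every incorrect caption by at least tau1.
    all_neg = torch.cat([sim_batch_neg, sim_halluc_neg])
    l1 = F.relu(tau1 - (sim_pos - all_neg)).mean()

    # Margin 2: hallucinated captions (partially correct) must still beat
    # random in-batch captions by at least tau2.
    l2 = F.relu(tau2 - (sim_halluc_neg.unsqueeze(1) - sim_batch_neg.unsqueeze(0))).mean()

    return contrastive + lambda1 * l1 + lambda2 * l2

# Example with dummy similarity scores:
loss = combined_loss(torch.tensor(0.8), torch.rand(8), torch.rand(4))
print(loss)
```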

Experiments and Results

The researchers fine-tuned CLIP models using this new loss function on the COCO and Flickr30k datasets. But did it actually work?

Improvements in Hallucination Detection

The primary test was the OHD-Caps benchmark—the same test where the original model failed miserably.

Table showing results on OHD-Caps. The proposed method significantly outperforms baselines.

The results in Table 2 are dramatic:

  • CLIP ViT-B/32: The score jumped from 14.3 (Vanilla) to 82.5 (Ours).
  • CLIP ViT-L/14: The score improved from 23.3 to 88.8.

The model went from being easily fooled by hallucinations to reliably picking out the faithful caption.

Zero-Shot Generalization

A common fear in fine-tuning is “catastrophic forgetting”—fixing one problem but breaking the model’s general abilities. To check this, the authors tested the model on standard zero-shot classification tasks (like identifying objects in CIFAR-10 or ImageNet).

Table showing zero-shot results on various datasets. The method maintains performance.

As shown in Table 3, the proposed method (Ours) maintains comparable performance to the original CLIP model. For CLIP-B/32, the average accuracy even increased slightly from 65.6% to 66.0%. This confirms that the “eye surgery” didn’t damage the model’s general vision.

Visual Examples

What does this improvement look like in practice?

Examples from the OHD-Caps benchmark showing how the model distinguishes captions.

In Figure 3, we see examples of the data used. The model has to distinguish between subtle changes.

  • Left: A snowboarder. The negative sample adds “a flag.”
  • Middle: A barber. The negative sample adds “a customer.”
  • Right: Energy drinks. The negative sample adds “food.”

The fine-tuned model successfully learned to reject the captions containing the red-highlighted words, showing it is actually looking at the image content rather than guessing.

Does it Help Large Vision-Language Models?

The ultimate goal was to fix hallucinations in LVLMs like LLaVA. The researchers took their improved CLIP encoder and swapped it into the LLaVA-1.5 architecture.

They evaluated the new LLaVA using the AMBER dataset, which checks for hallucinations in both generative (writing descriptions) and discriminative (answering Yes/No) tasks.

Table showing results on the AMBER dataset.

Table 6 shows clear improvements:

  • Accuracy: Increased from 74.3 to 80.2.
  • F1 Score: Increased from 77.2 to 84.9.
  • Hallucination Rate (\(C_I\)): In the generative task, the percentage of captions with hallucinations dropped from 35.4% to 31.7%.

This validates the core hypothesis: Improving the visual encoder’s ability to reject hallucinations leads to a more truthful LVLM.

How Much Data is Needed?

One surprising finding was the efficiency of this method. You might think you need millions of images to retrain CLIP.

Graph showing model performance versus training data volume.

Figure 2 shows that the model achieves significant gains with very little data. Even with just 1% of the training data (roughly 160 images), the performance jumps from ~20% to ~60%. This suggests that the model already had the capacity to see correctly; it just needed the right learning signal (the specific loss function) to unlock it.

Conclusion

The problem of object hallucination in AI has often been treated as a language generation issue—a case of the AI simply being too creative or “lying.” This research pivots that perspective, showing that the issue often starts with perception. The AI’s “eyes” (CLIP) were never trained to strictly verify the existence of objects; they were trained to associate images with loose textual themes.

By introducing Counterfactual Data Augmentation (creating negative samples with specific hallucinations) and a Fine-Grained Object-Level Contrastive Loss, the authors successfully taught CLIP to be precise.

The key takeaways are:

  1. Don’t Trust Vanilla CLIP: Off-the-shelf CLIP models are surprisingly bad at verifying if an object is actually present in an image.
  2. Hard Negatives Matter: Training with “hallucinated” captions is far more effective than training with random captions.
  3. Fix the Eyes, Fix the Brain: Improving the visual encoder is a highly effective way to reduce hallucinations in downstream multimodal models like LLaVA.

As we move toward autonomous agents that interact with the physical world, this kind of rigorous visual verification will be essential. We cannot have robots that hallucinate obstacles or medical AI that hallucinates symptoms. This paper provides a concrete mathematical blueprint for building AI that doesn’t just look, but truly sees.