Imagine asking an advanced AI to describe a picture of a living room. The AI confidently tells you about a “sleeping black cat on the sofa.” You look at the image. There is a sofa, but there is absolutely no cat.

This phenomenon is known as object hallucination. It is one of the most persistent and frustrating hurdles in the development of Large Vision-Language Models (LVLMs). These models, exemplified by GPT-4V and LLaVA, have demonstrated incredible capabilities in understanding visual scenes. Yet their tendency to “invent” objects erodes user trust and limits their deployment in critical fields like robotics or medical imaging.

For some time now, the research community has held a strong intuition about how to fix this: Grounding. The logic is simple: if we force the model to not just say “cat” but also point to where the cat is (using bounding boxes), the model effectively has to “prove” it sees the object. Therefore, training models on grounding tasks should reduce hallucinations.

It sounds perfect. But is it true?

In the paper “Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?”, researchers from the University of Würzburg and its Computer Vision Lab put this assumption to the test. Their findings challenge the conventional wisdom and suggest that the solution to AI hallucinations may be far more elusive than we thought.

In this post, we will dissect their methodology, explore the nuances of measuring hallucination, and analyze why “pointing at things” might not be the silver bullet we hoped for.

The Intuition: Why Grounding Should Work

To understand the researchers’ hypothesis, we first need to look at how most LVLMs are trained. Typically, these models are trained on massive datasets of image-caption pairs (e.g., an image of a dog and the text “A dog sitting on grass”). The model learns to associate visual features with text tokens.

However, this is a “global” association. The model knows the image contains a dog, but it doesn’t necessarily know which pixels represent the dog. This looseness is believed to contribute to hallucinations—the model learns statistical correlations (e.g., “sofas often have cats on them”) rather than precise visual evidence.

Grounding objectives aim to tighten this relationship. They require the model to perform finer-grained tasks:

  1. Referring Expressions (RE): Given a text description (e.g., “the black cat”), predict the bounding box coordinates [x1, y1, x2, y2].
  2. Grounded Captioning (GC): Generate a caption where every mentioned object is immediately followed by its coordinates (e.g., “A dog [0.1, 0.5, …] is sitting…”).
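
To make the two objectives concrete, here is a rough Python sketch of what such training targets can look like. The coordinate format, normalization, and phrasing are illustrative assumptions, not the paper’s exact specification.

```python
# Illustrative grounding-style training targets (formats are assumed, not the
# paper's exact specification). Coordinates are normalized to [0, 1].

def referring_expression_target(expression: str, box: tuple) -> str:
    """Referring Expressions (RE): text description in, bounding box out."""
    x1, y1, x2, y2 = box
    return f"{expression} -> [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"

def grounded_caption(objects: list) -> str:
    """Grounded Captioning (GC): every mentioned object is followed by its box."""
    parts = [f"{name} [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
             for name, (x1, y1, x2, y2) in objects]
    return "A photo of " + " and ".join(parts) + "."

print(referring_expression_target("the black cat", (0.12, 0.40, 0.55, 0.83)))
print(grounded_caption([("a dog", (0.10, 0.50, 0.45, 0.95)),
                        ("a ball", (0.60, 0.70, 0.72, 0.82))]))
```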

Recent literature has suggested that adding these tasks to the training mix reduces hallucination. The researchers of this paper argue that while this claim is intuitive, the evidence supporting it has been flawed.

The Flaws in Current Evaluation

Why do the authors claim previous evidence is flawed? They identify two major issues with how the field currently measures success:

  1. The “In-Distribution” Trap (MSCOCO): Most evaluations rely on the MSCOCO dataset. The problem is that MSCOCO is the “Hello World” of computer vision; its images and captions saturate LVLM training data. Testing a model on data it has effectively memorized doesn’t tell you whether it hallucinates less; it tells you how well it remembers its training data.
  2. The QA Shortcoming: Many benchmarks, most notably POPE, rely on Yes/No questions (e.g., “Is there a toaster in this image?”). While useful, answering “No” to a specific question is very different from describing an image from scratch. A model might correctly answer “No” when asked about a unicorn, yet still invent a unicorn when asked to “describe the image” (the sketch below contrasts the two settings).
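
To see how different these two evaluation modes are, here is a minimal Python sketch of a POPE-style probe next to an open-ended captioning call. The `generate(image, prompt)` function is a hypothetical stand-in for an LVLM interface, not a real API.

```python
# Minimal sketch contrasting POPE-style QA probing with open-ended captioning.
# `generate(image, prompt)` is a hypothetical stand-in for an LVLM call.

def pope_accuracy(generate, image, probes):
    """probes: list of (object_name, is_present) pairs; POPE samples the absent
    objects as negatives (randomly, by popularity, or adversarially)."""
    correct = 0
    for obj, is_present in probes:
        answer = generate(image, f"Is there a {obj} in the image? Answer yes or no.")
        predicted_yes = answer.strip().lower().startswith("yes")
        correct += int(predicted_yes == is_present)
    return correct / len(probes)

def open_caption(generate, image):
    """The harder setting: the model must decide on its own which objects to mention."""
    return generate(image, "Describe the image.")
```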

The Methodology: A Controlled Experiment

To rigorously test the grounding hypothesis, the authors constructed a controlled experimental setup. They didn’t just test existing models; they built their own LVLMs to isolate the specific variables of interest.

1. The Models

They utilized a standard LVLM architecture:

  • Image Encoder: CLIP (ViT-L/14) to process visuals.
  • LLM Backbone: They tested three different language models to ensure robustness: Vicuna 1.5, Llama-3, and Phi-3.
  • Connector: An alignment module (MLP) to bridge the vision and language components.
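
Schematically, this is the familiar LLaVA-style recipe: vision features are projected by a small MLP into the LLM’s embedding space and prepended to the text tokens. The sketch below is a simplified PyTorch outline under that assumption; the dimensions and module choices are illustrative, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class LVLMSketch(nn.Module):
    """Simplified LLaVA-style LVLM: CLIP ViT features -> MLP connector -> LLM.
    Dimensions and details are illustrative, not the paper's exact setup."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., CLIP ViT-L/14
        self.connector = nn.Sequential(           # alignment MLP
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                            # e.g., Vicuna 1.5, Llama-3, or Phi-3

    def forward(self, pixel_values, text_embeddings):
        patch_features = self.vision_encoder(pixel_values)       # (B, patches, vision_dim)
        visual_tokens = self.connector(patch_features)           # project into LLM space
        inputs = torch.cat([visual_tokens, text_embeddings], 1)  # visual tokens first
        # Assumes an LLM that accepts precomputed input embeddings (HF-style).
        return self.llm(inputs_embeds=inputs)
```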

2. The Training Mixes

They created different versions of each model, varying only the training data to see the specific effect of grounding:

  • Base: Trained on standard Image Captioning and Visual Question Answering (VQA).
  • +RE: Base training plus Referring Expressions data (RefCOCO, Visual Genome).
  • +GC: Base training plus Grounded Captioning data (Flickr30k-Entities).
  • +RE+GC: A combination of all the above.

This setup allows for a direct “apples-to-apples” comparison. If grounding reduces hallucination, the +RE and +GC models should significantly outperform the Base model.
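
In code, the experimental conditions amount to swapping the data mixture while holding everything else fixed. The grouping below mirrors the list above; the config structure itself is just an illustration, not the authors’ actual training configuration.

```python
# Illustrative training mixtures; dataset groupings follow the description above,
# but the config format is assumed for illustration.
BASE = ["image_captioning", "vqa"]

TRAINING_MIXES = {
    "Base":   BASE,
    "+RE":    BASE + ["refcoco", "visual_genome_refexp"],
    "+GC":    BASE + ["flickr30k_entities_grounded_captions"],
    "+RE+GC": BASE + ["refcoco", "visual_genome_refexp",
                      "flickr30k_entities_grounded_captions"],
}

# Architecture, hyperparameters, and LLM backbone are held constant, so any
# difference in hallucination metrics is attributable to the data mix.
```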

3. The Prompts

To ensure the models understood the tasks, specific prompts were used for training and inference.

Table 5: Prompts used for training and inference.

As shown in Table 5 above, the prompts are straightforward. For grounded captioning, the model is explicitly instructed to include bounding box coordinates.

Measuring Hallucination: A Multi-Pronged Approach

The researchers moved beyond simple Yes/No questions to evaluate hallucination in open-ended captioning. This is a much harder task. If you ask a model to “describe the image,” and it mentions an object that isn’t there, that is a true hallucination.

They employed two sophisticated metrics, visualized below:

Figure 1: CHAIR and FaithScore are used to measure hallucinations in open caption generation with LVLMs. CHAIR relies on human object annotations (over a fixed set of classes) to identify mentioned objects and check whether they are hallucinated. FaithScore first uses an LLM to convert captions into atomic facts, which are then verified by a VQA model.

CHAIR (Caption Hallucination Assessment with Image Relevance)

Look at the left side of Figure 1. The CHAIR metric works by taking the text generated by the LVLM (e.g., “A white hound and a cat…”) and extracting the objects mentioned. It then compares these objects against a “Gold Standard” list of objects actually present in the image (annotated by humans).

  • CHAIR_i: The percentage of mentioned objects that are hallucinations.
  • The Upgrade (CHAIR-MEN): Standard CHAIR uses simple string matching, so if the model says “hound” and the annotation list says “dog,” it can be wrongly counted as an error. The authors therefore introduce CHAIR-MEN, which matches synonyms via semantic embeddings, allowing for more accurate evaluation on diverse datasets (see the sketch below).
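
At its core, CHAIR is a set comparison between the objects a caption mentions and the objects annotators marked in the image. Below is a minimal sketch of CHAIR_i together with an embedding-based matching step in the spirit of CHAIR-MEN; the similarity threshold and the `embed` function are assumptions, not the paper’s settings.

```python
# Sketch of CHAIR_i plus an embedding-based synonym match in the spirit of
# CHAIR-MEN. The similarity threshold and embedding function are assumptions.
import numpy as np

def chair_i(mentioned, gold):
    """Fraction of mentioned objects that do not appear in the gold annotations."""
    hallucinated = [obj for obj in mentioned if obj not in gold]
    return len(hallucinated) / max(len(mentioned), 1)

def chair_i_semantic(mentioned, gold, embed, threshold=0.7):
    """Like chair_i, but an object counts as grounded if it is semantically close
    to any gold object (so "hound" can match "dog"). `embed` maps a word to a vector."""
    gold_vecs = [np.asarray(embed(g)) for g in gold]

    def is_grounded(obj):
        v = np.asarray(embed(obj))
        return any(
            float(np.dot(v, g)) / (np.linalg.norm(v) * np.linalg.norm(g) + 1e-8) >= threshold
            for g in gold_vecs
        )

    hallucinated = [obj for obj in mentioned if not is_grounded(obj)]
    return len(hallucinated) / max(len(mentioned), 1)

# Exact matching flags "hound" as a hallucination even though a dog is present:
print(chair_i(["hound", "cat"], {"dog", "sofa"}))  # 1.0
```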

FaithScore

Shown on the right side of Figure 1, FaithScore is a model-based metric. It doesn’t rely on a pre-defined list. Instead:

  1. Extract Facts: An LLM (like Llama-3) breaks the caption down into atomic facts (e.g., “There is a cat,” “The hound is white”).
  2. Verify: A separate Visual Question Answering (VQA) model looks at the image to verify each fact.
  3. Score: The percentage of verified facts is the score.
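
Here is a minimal sketch of that pipeline, with the LLM fact extractor and the VQA verifier abstracted as hypothetical callables; the real metric involves more careful prompting and filtering of facts than shown here.

```python
# Minimal FaithScore-style sketch. `extract_facts` (an LLM) and `verify` (a VQA
# model) are hypothetical callables standing in for the real components.

def faithscore(caption, image, extract_facts, verify):
    """Share of atomic facts in the caption that the VQA model confirms."""
    facts = extract_facts(caption)   # e.g., ["There is a cat.", "The hound is white."]
    if not facts:
        return 1.0                   # nothing claimed, nothing to refute
    confirmed = sum(1 for fact in facts if verify(image, fact))
    return confirmed / len(facts)
```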

The “Objects365” Curveball

Crucially, the researchers did not stop at MSCOCO. They introduced Objects365 as a test set. This dataset contains a much wider variety of objects (365 categories vs. MSCOCO’s 80) and, most importantly, the models were not trained on it. This effectively tests how the models handle hallucinations in the wild, on unseen data.

The Results: Busting the Myth

So, did the grounding training work?

First, let’s verify that the models actually learned the grounding tasks. It would be unfair to say “grounding doesn’t help” if the models never learned to ground properly in the first place.

Table 2: Precision @ 50 for expression grounding…

Table 2 (above) confirms that the models did learn. The +RE and +RE+GC models achieved high precision in referring expressions. The addition of grounded captioning (+GC) actually helped the referring expression task, suggesting the two objectives are compatible and mutually beneficial for learning spatial awareness.
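
For reference, Precision@50 counts a predicted box as correct when its intersection-over-union (IoU) with the gold box is at least 0.5. A quick sketch of that check:

```python
# Precision@50: a predicted box is a hit if IoU with the gold box is >= 0.5.

def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2). Returns intersection-over-union."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def precision_at_50(predicted_boxes, gold_boxes):
    hits = sum(1 for p, g in zip(predicted_boxes, gold_boxes) if iou(p, g) >= 0.5)
    return hits / len(predicted_boxes)
```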

The Hallucination Verdict

Now for the main event. Did this spatial awareness translate to fewer hallucinations in free-form captioning?

The data suggests: No.

Table 3: Results on standard image captioning…

Table 3 presents the results on standard image captioning for both MSCOCO and Objects365. Let’s look at the CHAIR_i column (where lower is better, as it measures hallucination percentage) and FaithScore (where higher is better).

  • Flat Performance: Compare the Base rows with the +RE, +GC, and +RE+GC rows. The differences are negligible. In some cases, the grounding models perform slightly worse.
  • Consistent Across Backbones: This isn’t a quirk of one model. Llama-3, Phi-3, and Vicuna all show the same trend.
  • Objects365 Reality Check: On the harder, out-of-distribution Objects365 dataset, hallucination rates skyrocket (CHAIR_i jumps from ~3.5 to ~13+). Even here, where the model is struggling, the grounding training provides no safety net.

The takeaway is clear: Training a model to point at objects does not automatically teach it to stop inventing them when describing a scene.

What About Inference?

The researchers tried one last trick. Instead of only training on grounding, what if the model is asked to generate grounded captions at inference time? Perhaps forcing it to output coordinates for every object it mentions would act as a built-in sanity check.

The results showed a slight improvement. When prompted to generate bounding boxes (Grounded Captioning), hallucinations dropped marginally. However, this came at a cost:

  1. Reduced Detail: The captions became shorter and less informative (lower coverage of objects).
  2. Trade-off: The model became more conservative, but not necessarily “smarter.”
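
The “reduced detail” side of this trade-off is usually quantified as object coverage: the share of annotated objects that the caption actually mentions. A minimal sketch, under that assumption:

```python
# Coverage sketch: what fraction of annotated (gold) objects does the caption mention?

def coverage(mentioned, gold):
    return sum(1 for g in gold if g in mentioned) / max(len(gold), 1)

# A shorter, more cautious caption hallucinates less but also covers less:
print(coverage({"dog"}, {"dog", "ball", "grass"}))          # ~0.33
print(coverage({"dog", "ball"}, {"dog", "ball", "grass"}))  # ~0.67
```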

Qualitative Analysis: Seeing is Believing

Why didn’t it work? Why can a model correctly point to a “cat” when asked, but still hallucinate a “woman” in a caption?

Qualitative analysis reveals the disconnect. The grounding mechanism and the generation mechanism seem to operate somewhat independently.

Figure 2: Qualitative examples of Vicuna +RE+GC…

Figure 2 provides some fascinating (and damning) examples of the Vicuna +RE+GC model:

  • Top Example (The “Woman”): The model generates a caption: “An artistic painting of a woman with a vase.” There is no woman. The model hallucinates the woman and generates a bounding box for her (the red box in the image). The model is hallucinating the object and its location simultaneously.
  • Middle Example (The “Rhino-Elephant”): The model describes “Two elephants.” There is one elephant and one rhino. The model draws a box around the rhino and labels it an elephant. The grounding didn’t prevent the misclassification.
  • Bottom Example (The “Peeled Apple”): The model describes a bird eating a “peeled apple.” It draws a box around an orange slice.

These examples illustrate that grounding is not a truth filter. If the language model decides to hallucinate an object due to statistical probability (e.g., “paintings with vases often have women”), the grounding head will dutifully try to find a spot for it, even if that spot is empty or contains something else.

Comparison with QA Metrics (POPE)

The paper also compared these findings with the popular POPE (QA-based) benchmark.

Table 1: POPE results (accuracy)…

Table 1 shows the accuracy on Yes/No questions.

  • Inconsistent Gains: While there are scattered improvements, there is no consistent pattern. For example, Llama-3 +RE+GC actually performs worse than the Base model on several splits of the Objects365 dataset.
  • The Disconnect: The fact that grounding sometimes helps slightly on QA yet fails on captioning highlights that the two setups probe different abilities. Answering “Is there a dog?” is fundamentally different from “Tell me what you see.”

Conclusion and Implications

This research serves as a vital “reality check” for the Computer Vision and NLP communities. It debunks the prevailing myth that simply adding object grounding objectives to the training mix is a cure-all for hallucinations.

Key Takeaways:

  1. No Causal Link: There is no strong evidence that object grounding capabilities transfer to reduced hallucination in open generation.
  2. Evaluation Matters: Evaluating on training data (MSCOCO) hides the severity of the problem. We need out-of-distribution datasets like Objects365 to see the true limits of our models.
  3. Grounding ≠ Factuality: A model can hallucinate an object and a bounding box simultaneously. Grounding provides spatial alignment, not necessarily factual alignment.

What Next? If grounding isn’t the answer, what is? The authors suggest we need to look elsewhere. Potential avenues might include:

  • Reinforcement Learning from Human Feedback (RLHF): Specifically penalizing hallucinations during the alignment phase.
  • New Architectures: Designing models where the visual perception is a hard constraint on language generation, rather than just a “soft” prompt.
  • Better Data: Moving away from noisy internet-scraped captions toward strictly factual, verified datasets.

By clearing away the misconception that “grounding fixes everything,” this paper paves the way for researchers to find the real solutions that will make Vision-Language Models trustworthy enough for the real world.