Imagine showing a photograph of a cat sleeping on a table to an Artificial Intelligence. You ask, “Is there a dog in this picture?” The AI confidently replies, “Yes, there is a dog sleeping on the table.”
This phenomenon is known as visual hallucination. It is one of the most persistent and perplexing challenges in the field of Multimodal Large Language Models (MLLMs)—systems like LLaVA, Qwen-VL, or GPT-4V that can see and speak. While these models have demonstrated incredible capabilities, they frequently fabricate objects, misinterpret attributes, or agree with false premises provided in the text prompt.
In this post, we will deep-dive into a research paper that attempts to solve the evaluation crisis of visual hallucination. The paper introduces PhD (Prompted hallucination Dataset), a massive, semi-automatically generated benchmark designed to not only catch models hallucinating but to diagnose why they do it.
The Problem: Why Do Models Hallucinate?
Before we can evaluate a model, we need to understand the root causes of its errors. Hallucination isn’t just random noise; it usually stems from structural weaknesses in how MLLMs process information. The authors of the PhD paper categorize visual hallucinations into three distinct causes:
- Visual Ambiguity (Cause I): The model’s “eyes” (the visual encoder) fail to capture sufficient detail. For example, the model might see a blurry shape and guess it’s a person when it’s actually a plant.
- Inconsistency in Multi-modal Input (Cause II): MLLMs are often fed both an image and a text prompt (context). If the text suggests something that isn’t in the image, the model—which is heavily trained on text—often biases towards the text, ignoring its visual input.
- Counter-Common-Sense Content (Cause III): LLMs possess vast “world knowledge” or priors (e.g., “cars have round wheels”). When an image contradicts this common sense (e.g., a generated image of a car with square wheels), the model often prioritizes its internal knowledge over what it actually sees.

As shown in Figure 1 above, existing evaluations often miss these nuances. Panel (a) shows visual ambiguity where a model hallucinates a toy. Panel (b) shows how text input can mislead the model. Panel (c) highlights how the PhD benchmark (the colored lines) is significantly harder and more revealing than previous benchmarks like POPE (the flat blue line at the top), which have largely reached saturation.
The Landscape of Evaluation
To place PhD in context, we must look at how visual hallucination is currently measured. The authors propose a taxonomy dividing benchmarks by task level (low vs. high) and evaluation method (objective vs. subjective).

As seen in Table 1, PhD focuses on Objective Evaluation (Yes/No questions) for Low-to-Middle level tasks. Why this focus?
- Objectivity: Subjective evaluation (asking an LLM “how good was this answer?”) is expensive and prone to its own hallucinations. Binary Yes/No questions are clear-cut.
- Task Level: If we ask a model to solve complex medical reasoning, it might fail simply because it lacks medical knowledge. By sticking to basic tasks like counting, attribute recognition, and object detection, errors can be safely attributed to hallucination rather than a lack of domain expertise.
The Core Method: Constructing the PhD Dataset
The brilliance of the PhD paper lies in its construction pipeline. Creating a dataset large enough to train or evaluate deep learning models usually requires thousands of hours of human labor. The authors instead devised a ChatGPT-assisted semi-automated pipeline.
They utilized the TDIUC dataset (which contains real-world images) as a base and expanded it using Generative AI. The pipeline follows a specific roadmap to target the three causes of hallucination mentioned earlier.

Let’s break down the pipeline illustrated in Figure 2:
1. The “Trap”: Task-Specific Hallucinatory Item Selection
To test if a model hallucinates, you need to ask it about something that isn’t there, but could be. This is called a Hallucinatory Item (hitem).
- Old Way: Randomly pick an object not in the image. (Too easy).
- PhD Way: Use ChatGPT and CLIP (a vision-language model); a code sketch follows this list.
- Step A: Look at an image of a black motorcycle.
- Step B: Ask ChatGPT for color candidates (red, blue, green).
- Step C: Use CLIP to check which candidate color is most visually plausible in the image without actually being present on the motorcycle. Perhaps there is a red sign nearby.
- Step D: Select “Red” as the hitem. This makes the question “Is there a red motorcycle?” a difficult visual trap.
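To make Step C concrete, here is a minimal sketch of CLIP-based hitem scoring, assuming Hugging Face's `transformers` CLIP implementation. The model name, image path, prompt template, and candidate list are illustrative assumptions; the paper's actual pipeline may score candidates differently.

```python
# Minimal sketch (not the paper's exact code): score ChatGPT-proposed color
# candidates against the image with CLIP and pick the most "trappy" one.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("black_motorcycle.jpg")   # hypothetical example image
ground_truth = "black"                       # attribute actually in the image
candidates = ["red", "blue", "green"]        # ChatGPT-suggested alternatives

prompts = [f"a {color} motorcycle" for color in candidates]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image[0]  # image-text similarity per prompt

# Choose the most visually plausible candidate that is not the ground truth:
# this becomes the hitem behind a hard question like "Is there a red motorcycle?"
scores = {c: s.item() for c, s in zip(candidates, logits)}
hitem = max((c for c in scores if c != ground_truth), key=scores.get)
print(f"hitem = {hitem}")
```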
2. The Four Modes of PhD
The dataset is structured into four “modes,” each designed to stress-test a specific weakness of MLLMs; a prompt-construction sketch follows the list.
- PhD-base (Testing Visual Ambiguity): Standard visual question answering.
- Example: “Is the motorcycle in the image red?” (Image shows a black motorcycle).
- Goal: Test if the vision encoder is accurate.
- PhD-sec (Testing Specious Context): This mode adds “specious” (misleading or noisy) text before the question.
- Example: The prompt includes a sentence like “Red motorcycles are common in this city.”
- Goal: Test if the model gets confused by text that hints at the hallucination.
- PhD-icc (Testing Incorrect Context): This mode adds text that explicitly contradicts the image.
- Example: The prompt falsely claims, “This picture shows a red motorcycle.”
- Goal: Test if the model blindly follows the user’s text prompt or trusts its own eyes.
- PhD-ccs (Testing Counter-Common-Sense): This is the most creative mode. The researchers used DALL-E 3 to generate images that defy physics or logic.
- Example: A car with square wheels, or trees growing underwater.
- Goal: Test if the model relies on its training priors (“cars have round wheels”) or visual reality.
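To see how the modes relate to one another, here is a rough sketch of how a single hitem question might be wrapped with different context in each mode. The `build_prompt` helper and the context sentences are hypothetical simplifications, not the paper's actual templates.

```python
# Hypothetical sketch: one hitem question, four PhD modes.
def build_prompt(mode: str, question: str, hitem: str, obj: str) -> str:
    if mode == "base":
        return question                                   # plain VQA, no extra context
    if mode == "sec":
        # Specious context: noisy text that merely hints at the hitem.
        return f"{hitem.capitalize()} {obj}s are common in this city. {question}"
    if mode == "icc":
        # Incorrect context: text that flatly contradicts the image.
        return f"This picture shows a {hitem} {obj}. {question}"
    if mode == "ccs":
        # Counter-common-sense: same question, but paired with a DALL-E 3
        # image that violates world knowledge (handled on the image side).
        return question
    raise ValueError(f"unknown mode: {mode}")

question = "Is the motorcycle in the image red?"
for mode in ("base", "sec", "icc", "ccs"):
    print(f"{mode}: {build_prompt(mode, question, hitem='red', obj='motorcycle')}")
```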

3. Scale and Diversity
The result is a massive dataset. Unlike previous benchmarks such as POPE (3,000 triplets) or AMBER (14,000 triplets), PhD offers over 102,000 VQA triplets.

As shown in Table 4, the dataset covers five distinct tasks:
- Object Recognition (Is there a cat?)
- Attribute Recognition (Is the car red?)
- Sentiment Recognition (Does the person look sad?)
- Positional Recognition (Is the cup behind the laptop?)
- Counting (Are there three birds?)
This granularity allows researchers to pinpoint exactly where a model is failing. A model might be great at object recognition but terrible at counting.
Experiments and Key Results
The authors evaluated 15 open-source MLLMs (including LLaVA, Qwen-VL, InternVL) and 3 proprietary models (GPT-4o, Claude 3.5, Gemini 1.5 Pro).
To measure success, they used the PhD Index. Since the dataset is binary (Yes/No), a model could cheat by just saying “Yes” to everything. The PhD Index is the harmonic mean of the “Yes” recall and “No” recall, ensuring that a high score requires accuracy on both positive and negative questions.
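In other words, if R_yes and R_no denote recall on questions whose ground-truth answer is “Yes” and “No” respectively, the index is 2 · R_yes · R_no / (R_yes + R_no). Here is a minimal sketch of that computation on hypothetical predictions:

```python
# Sketch: PhD Index = harmonic mean of yes-recall and no-recall,
# computed from hypothetical (ground_truth, model_answer) pairs.
def phd_index(results):
    yes_answers = [ans for gt, ans in results if gt == "yes"]
    no_answers = [ans for gt, ans in results if gt == "no"]
    r_yes = sum(a == "yes" for a in yes_answers) / len(yes_answers)  # recall on "yes" questions
    r_no = sum(a == "no" for a in no_answers) / len(no_answers)      # recall on "no" questions
    if r_yes + r_no == 0:
        return 0.0
    return 2 * r_yes * r_no / (r_yes + r_no)

# A "yes-man" model: 50% plain accuracy, but r_no = 0, so the index is 0.
always_yes = [("yes", "yes")] * 50 + [("no", "yes")] * 50
print(phd_index(always_yes))   # 0.0
```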
1. The Reality Check
The most striking result is how much open-source models struggle compared to proprietary giants.

Look at Table 5.
- Saturation of Old Benchmarks: On the POPE benchmark, almost every model scores between 0.80 and 0.88. It looks like the problem is solved.
- The PhD Drop: On the PhD benchmark, scores plummet. The best open-source model, LLaVA-OneVision, drops to 0.698. Older models like LLaVA-1.5 drop to a shocking 0.265.
- Proprietary Gap: GPT-4o maintains a score of 0.812, highlighting a significant gap in robustness that previous benchmarks failed to show.
2. Qualitative Failures
It is helpful to look at specific examples to understand these numbers.

In Figure 3, examine the third row regarding the “square tires.”
- The Image: A generated image of a car with square blocky tires.
- The Question: “Are the tires in the image circular?”
- The Truth: No.
- LLaVA-1.6-L: “Yes, the tires in the image are circle-shaped.”
- Why? The model sees a car. Its internal knowledge says “car tires are round.” It ignores the visual evidence of square blocks because the concept is too “counter-common-sense.”
3. The Paradox of Model Size
In deep learning, we usually assume “bigger is better.” A 13-billion parameter model should beat a 7-billion parameter model. The PhD paper reveals a fascinating nuance here.

Figure 5 shows a complex relationship:
- Context Modes (PhD-sec/icc): Larger models (13B) perform better. They are smarter at understanding instructions like “ignore the text if it conflicts with the image.”
- Base & CCS Modes: Larger models perform worse. Why? Larger language models contain stronger priors: they have read more text about the world, so they are more stubborn when visual reality contradicts that knowledge (e.g., the square tires). They are more likely to hallucinate what they expect to see.
4. The “Yes” Bias
Finally, the researchers analyzed the tendency of models to answer “Yes.”

Figure 6 shows a strong negative correlation (Spearman -0.92) between the “Say-Yes rate” (the fraction of questions a model answers “Yes”) and the PhD Index; a small computation sketch follows the list below.
- Low-performing models act like “Yes-men.” They agree with whatever the prompt suggests.
- High-performing models (like GPT-4o, bottom right) have a much lower Say-Yes rate. They have learned to disagree and say “No.”
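As a small illustration of that analysis (with made-up numbers rather than the paper's results), the Say-Yes rate is simply the fraction of questions a model answers “Yes,” and the correlation can be computed with `scipy.stats.spearmanr`:

```python
# Sketch with invented numbers: Say-Yes rate vs. PhD Index across models.
from scipy.stats import spearmanr

# (say_yes_rate, phd_index) per hypothetical model
models = {
    "model_a": (0.91, 0.30),
    "model_b": (0.78, 0.52),
    "model_c": (0.64, 0.70),
    "model_d": (0.55, 0.80),
}

say_yes_rates = [v[0] for v in models.values()]
phd_indices = [v[1] for v in models.values()]

rho, p_value = spearmanr(say_yes_rates, phd_indices)
print(f"Spearman rho = {rho:.2f}")   # strongly negative, echoing the paper's -0.92
```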
5. Task-Specific Diagnostics
Because PhD is broken down by task, we can see specific weaknesses in top models.

Table 6 provides a “zoom-in” analysis.
- LLaVA-OneVision is excellent at Object recognition (0.872) but struggles significantly with Counting (0.707) and Sentiment (0.691).
- Molmo shows a massive performance drop when incorrect context is introduced (PhD-icc): its Attribute recognition score falls from 0.842 to 0.556. This indicates it is highly susceptible to being misled by text prompts.
Conclusion and Implications
The PhD paper represents a significant step forward in the evaluation of Multimodal AI. By moving beyond simple object co-occurrence and tackling the psychological causes of hallucination—visual ambiguity, text bias, and prior knowledge conflicts—it provides a mirror that reflects the true limitations of current models.
Key Takeaways:
- Benchmarks Matter: We cannot improve what we cannot measure. Old benchmarks were letting models “pass” too easily. PhD raises the bar.
- Trust Your Eyes: The biggest failure mode for current AI is prioritizing training data (priors) or user prompts over visual evidence.
- One Size Does Not Fit All: Simply making models bigger doesn’t fix hallucination; in some cases (Counter-Common-Sense), it makes it worse.
For students and researchers entering this field, PhD offers a robust tool for testing new architectures. It suggests that future improvements won’t just come from more data, but from better alignment strategies that teach models to critically evaluate visual evidence against their internal expectations.