Introduction
Imagine showing an AI a picture of a man standing on a beach. You ask, “What is happening here?” The AI confidently responds, “A man is standing on the beach holding a surfboard.”
There is just one problem: there is no surfboard.
This phenomenon is known as Visual Hallucination (VH). It is one of the most persistent and frustrating challenges in Large Vision-Language Models (LVLMs) like LLaVA or MiniGPT-4. While these models are incredible at describing complex scenes, they often “dream up” objects, relationships, or attributes that simply aren’t there. They might rely on language habits (statistically, “man on beach” often appears with “surfboard”) rather than strictly adhering to the visual data provided.
In this post, we will dive deep into a fascinating new research paper titled “Game on Tree: Visual Hallucination Mitigation via Coarse-to-Fine View Tree and Game Theory.” The researchers propose a clever, training-free method called GTHM that combines hierarchical data structures with cooperative game theory to force the AI to look closer before it speaks.
The Problem: Fuzzy Vision and Language Priors
To understand the solution, we first need to understand why models hallucinate.
LVLMs generate text “autoregressively.” This means they predict the next word in a sentence based on the image and all the words they have generated so far. However, as the sentence gets longer, the model tends to pay more attention to the text history (its internal language patterns) and less attention to the actual image.
Furthermore, looking at an entire image at once can be overwhelming. If the model tries to verify a specific detail (like a small object) by looking at the whole “global” image, it might miss it. Conversely, if it looks too closely at a patch without context, it might misinterpret it.

As shown in Figure 1, when the model (LLaVA-1.5) fails to focus on the correct visual region (“view perception”), it starts hallucinating. In the example above, the model hallucinates a surfboard and a book. The graph on the right shows that these errors correlate with a low “Tree-based Shapley value”—a metric the authors invented to measure how well the model is looking at the right things.
The Empirical Evidence
The researchers didn’t just guess this; they measured it. They analyzed thousands of outputs and found a clear pattern:
- Hallucinations happen when visual attention drops.
- Hallucinations increase as the description gets longer. As generation goes on, the model relies more on what it has already said than on what it sees.

Figure 2 illustrates this analysis. The leftmost graph shows that non-hallucinatory captions (green) consistently have higher visual perception scores (TSV) than hallucinatory ones (purple). The middle graph shows that as the sentence index increases (i.e., for sentences later in the generated description), the visual attention score drops, increasing the risk of making things up.
The Solution: GTHM (Game and Tree Hallucination Mitigation)
To fix this, the researchers developed GTHM. It is a “plug-and-play” decoding algorithm, meaning you don’t need to retrain the massive AI model to use it. You simply change how the model selects words during generation.
The framework consists of three main components:
- CFTree: Structuring the image into a hierarchy.
- Game Theory: Using Shapley values to find the “best” view.
- Adaptive Contrastive Decoding: Adjusting the probability of words based on the best view.
Let’s look at the full architecture before breaking it down:

Component 1: The Coarse-to-Fine Visual View Tree (CFTree)
If you are looking for a specific book in a library, you don’t stare at the entire building. You enter the building (Event), go to the correct aisle (Relation), and look at a specific shelf (Entity).
The authors apply this logic to images. They use object detection tools (like GroundingDINO) to parse the image into a Coarse-to-Fine Visual View Tree (CFTree).
- Level 1: Event Layer (Root): The entire image. This captures the global context.
- Level 2: Relation Layer: Pairs of objects and the bounding boxes that encompass them. This captures interactions (e.g., “man holding cup”).
- Level 3: Entity Layer (Leaves): Specific objects (e.g., “cup,” “man”).
This structure organizes visual information so the model can “zoom in” or “zoom out” as needed.
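To make this concrete, here is a minimal sketch of such a tree in Python. The hard-coded detections stand in for what a detector like GroundingDINO would return, and the names (ViewNode, build_cftree, union_box) are illustrative, not taken from the paper's code:

```python
# Minimal sketch of a coarse-to-fine view tree built from detector output.
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class ViewNode:
    label: str
    box: Box
    children: List["ViewNode"] = field(default_factory=list)

def union_box(a: Box, b: Box) -> Box:
    """Smallest box enclosing both inputs (used for relation-level views)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def build_cftree(image_box: Box, detections: List[Tuple[str, Box]]) -> ViewNode:
    # Level 1: event layer, i.e. the whole image.
    root = ViewNode("event:whole_image", image_box)
    # Level 3: entity layer, one leaf per detected object.
    entities = [(label, ViewNode(f"entity:{label}", box)) for label, box in detections]
    # Level 2: relation layer, one node per object pair, covering both objects.
    for (la, a), (lb, b) in combinations(entities, 2):
        relation = ViewNode(f"relation:{la}+{lb}", union_box(a.box, b.box),
                            children=[a, b])
        root.children.append(relation)
    return root

# Example: "man" and "cup" produce a relation view that encloses both boxes.
tree = build_cftree((0, 0, 640, 480),
                    [("man", (100, 50, 300, 460)), ("cup", (280, 200, 330, 260))])
print([child.label for child in tree.children])  # ['relation:man+cup']
```

The relation layer is simply the pairwise union of entity boxes, which gives the model a "medium zoom" view of interactions like "man holding cup" without losing the global context stored at the root.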
Component 2: Game Theory on the Tree
Now that we have a tree of different “views” (the whole image, a specific region, or a tiny object), how does the model decide which view is the most useful for generating the next word?
The researchers treat this as a Coarse-to-Fine Cooperative Game.
- The Players: The different nodes (views) in the CFTree.
- The Goal: To maximize the similarity between the visual view and the text token being generated.
- The Reward: A score indicating how much a view contributes to the understanding of the scene.
To calculate the contribution of each view, they use the Shapley Value. In classical game theory, the Shapley value fairly distributes “payouts” to players based on their contribution to the team. Here, it calculates how much a specific visual region contributes to the probability of the correct word appearing.
The standard definition of the Shapley value for a player \(i\) in a cooperative game with player set \(N\) and value function \(v\) is:

\[
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr)
\]
However, computing this exactly means enumerating every possible coalition of visual regions, which is computationally prohibitive. Instead, the authors propose the Tree-based Shapley Value (TSV). This modified version respects the hierarchy of the tree: it measures the benefit of a specific “view path” (e.g., Image \(\rightarrow\) Person \(\rightarrow\) Hand) minus the benefit of its sub-components.

Intuitively, a high TSV means that looking at this specific “zoom level” (node) provides crucial visual evidence that justifies the next word.
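To build intuition, here is a brute-force toy implementation of the classical Shapley value. The value function is an invented stand-in for "how well this coalition of views supports the next word"; the paper's Tree-based Shapley Value additionally restricts coalitions to paths in the CFTree, which this toy version does not attempt:

```python
# Toy, brute-force Shapley value over a tiny set of "views".
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, Sequence

def shapley_value(players: Sequence[str],
                  value: Callable[[FrozenSet[str]], float],
                  target: str) -> float:
    """Average marginal contribution of `target` over all coalitions excluding it."""
    others = [p for p in players if p != target]
    n = len(players)
    total = 0.0
    for k in range(len(others) + 1):
        for coalition in combinations(others, k):
            s = frozenset(coalition)
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            total += weight * (value(s | {target}) - value(s))
    return total

# Made-up value function: the global view helps a little, the "hand" view a lot,
# and together they help slightly more than either alone.
def toy_value(coalition: FrozenSet[str]) -> float:
    score = 0.0
    if "image" in coalition:
        score += 0.2
    if "hand" in coalition:
        score += 0.5
    if {"image", "hand"} <= coalition:
        score += 0.1
    return score

views = ["image", "person", "hand"]
for v in views:
    print(v, round(shapley_value(views, toy_value, v), 3))
# image ~0.25, person 0.0, hand ~0.55
```

Even in this toy game, the zoomed-in "hand" view earns the largest payout because it contributes the most marginal evidence, and that is exactly the signal GTHM uses to decide which view to trust at each decoding step.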
Component 3: Vision-Aware Contrastive Decoding
Once the system calculates the TSV for different views, it identifies the “Best Player”—the view that offers the most clarity.
Standard LVLMs use a probability distribution \(p_\theta\) to pick the next word. GTHM modifies this distribution using Contrastive Decoding. It contrasts the distribution derived from the optimal visual view (the winner of the game) against a less optimal view.
In schematic form, the adjusted distribution follows the general contrastive-decoding recipe:

\[
p_{\mathrm{GTHM}}(y_t \mid x, y_{<t}) \propto \operatorname{softmax}\!\Bigl[(1 + \lambda_{\phi})\,\operatorname{logit}_{\theta}(y_t \mid v_i, x, y_{<t}) - \lambda_{\phi}\,\operatorname{logit}_{\theta}(y_t \mid v_j, x, y_{<t})\Bigr]
\]
Here is what is happening in this equation:
- It boosts the probability of tokens that are supported by the optimal view (\(v_i\)).
- It penalizes tokens that are supported by the sub-optimal view (\(v_j\)).
- The factor \(\lambda_{\phi}\) is adaptive. It is based on the ratio of the Shapley values. If the optimal view is much better than the other view, the model applies a stronger correction.
This forces the model to choose words that are actually grounded in the best visual evidence, rather than words that just sound grammatically correct.
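A small numerical sketch of this step is shown below. It assumes we already have next-token logits conditioned on the optimal view \(v_i\) and on a weaker view \(v_j\), together with their TSV scores; the specific rule mapping the Shapley-value ratio to \(\lambda_{\phi}\) is an illustrative guess rather than the paper's exact formula:

```python
# Sketch of vision-aware contrastive decoding over a toy three-word vocabulary.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def contrastive_distribution(logits_best: np.ndarray,
                             logits_weak: np.ndarray,
                             tsv_best: float,
                             tsv_weak: float) -> np.ndarray:
    # Adaptive strength: the larger the gap between the two views' Shapley
    # scores, the harder we push toward the better-grounded view.
    lam = max(tsv_best / max(tsv_weak, 1e-6) - 1.0, 0.0)
    contrasted = (1.0 + lam) * logits_best - lam * logits_weak
    return softmax(contrasted)

# Toy vocabulary: ["surfboard", "sand", "wave"]. The weak (global) view slightly
# prefers "surfboard"; the best (zoomed-in) view prefers "sand".
p = contrastive_distribution(np.array([0.2, 1.5, 0.3]),
                             np.array([1.2, 0.4, 0.3]),
                             tsv_best=0.8, tsv_weak=0.3)
print(p.round(3))  # probability mass shifts away from the hallucinated "surfboard"
```

The important design choice is that the correction strength is not a fixed hyperparameter: when the two views have similar TSV scores, \(\lambda\) stays close to zero and the distribution from the best view is barely touched.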
Experiments and Results
Does adding game theory to a visual view tree actually work? The researchers tested GTHM on several popular benchmarks, including CHAIR (which counts object hallucinations) and POPE (which asks Yes/No questions about object existence).
Quantitative Success
The results on the MSCOCO dataset were impressive. In the table below, CHAIRs (sentence level) measures the percentage of captions containing at least one hallucinated object, and CHAIRi (instance level) measures the percentage of mentioned objects that are hallucinated; lower is better for both.

As shown in Table 1, GTHM significantly outperforms standard Greedy decoding and Beam Search, as well as other state-of-the-art methods like VCD and HALC. For example, using the LLaVA-1.5 model, GTHM reduced sentence-level hallucinations (CHAIRs) from 22.17 (Greedy) down to 12.67.
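For reference, the CHAIR metrics are straightforward to compute once you know which objects each caption mentions and which objects are actually in the image. A minimal sketch (ignoring the synonym mapping used by the official MSCOCO-based implementation):

```python
# Minimal CHAIR computation: sentence-level (CHAIRs) and instance-level (CHAIRi).
from typing import List, Set, Tuple

def chair_scores(samples: List[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """samples: (mentioned_objects, ground_truth_objects) per caption.
    Returns (CHAIRs, CHAIRi) as percentages; lower is better."""
    hallucinated_captions = 0
    hallucinated_mentions = 0
    total_mentions = 0
    for mentioned, truth in samples:
        fake = mentioned - truth          # mentioned but not actually present
        hallucinated_captions += int(bool(fake))
        hallucinated_mentions += len(fake)
        total_mentions += len(mentioned)
    chair_s = 100.0 * hallucinated_captions / max(len(samples), 1)
    chair_i = 100.0 * hallucinated_mentions / max(total_mentions, 1)
    return chair_s, chair_i

# Example: one clean caption, one caption hallucinating a "surfboard".
print(chair_scores([({"man", "beach"}, {"man", "beach"}),
                    ({"man", "surfboard"}, {"man", "beach"})]))  # (50.0, 25.0)
```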
They also tested on the POPE benchmark, where the model answers questions like “Is there a dining table in the image?”

In Table 2, GTHM achieves the highest accuracy and F1 score across almost all models, showing that it helps the model correctly judge whether objects exist or not.
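POPE scoring itself is simple: because every question has a yes/no answer, accuracy and F1 reduce to standard binary classification metrics, with "yes" (the object exists) as the positive class. A minimal sketch:

```python
# Minimal POPE-style scoring over yes/no answers.
from typing import List, Tuple

def pope_scores(pred: List[str], gold: List[str]) -> Tuple[float, float]:
    tp = sum(p == "yes" and g == "yes" for p, g in zip(pred, gold))
    fp = sum(p == "yes" and g == "no" for p, g in zip(pred, gold))
    fn = sum(p == "no" and g == "yes" for p, g in zip(pred, gold))
    accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return accuracy, f1

print(pope_scores(["yes", "no", "yes", "no"],
                  ["yes", "no", "no", "no"]))  # (0.75, ~0.667)
```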
Qualitative Examples: Seeing is Believing
The numbers are great, but the visual examples really highlight the difference.
Example 1: The “Mona Lisa” Dog
In this example, the input image is a parody of the Mona Lisa featuring a dachshund.
- Greedy Decoding (Standard): Hallucinates a “large, flowing dress” (because the original Mona Lisa wears a dress).
- GTHM (Ours): Correctly identifies the “Renaissance garb” and, crucially, captures the dog’s specific details without inventing a dress.

Example 2: Anime Character Details
Here, the model looks at an anime character.
- Greedy: Hallucinates “long hair” and calls the character a “protagonist of a cartoon.”
- VCD (Baseline): Weirdly suggests the character is “Dracula” and hallucinates a “suitcase.”
- GTHM: Accurately describes the blue suit, red bow tie, and the fact that he is posing.

Example 3: The Tea Party
In a complex scene with multiple animals, identifying who is doing what is difficult.
- Greedy: Hallucinates “two teddy bears.”
- GTHM: Correctly identifies the rabbit, cat, and bear, and the food items on the table.

Conclusion
The GTHM framework represents a significant step forward in making Multimodal Large Language Models trustworthy. By acknowledging that “one view doesn’t fit all,” the researchers successfully applied a hierarchical tree structure to organize visual data.
The true innovation, however, lies in applying Game Theory. By treating visual views as players in a cooperative game, the model can mathematically determine which part of the image is most important for the current word. This allows for adaptive decoding—dynamically adjusting the model’s confidence based on actual visual evidence.
Key Takeaways:
- Hallucinations occur when models rely on language priors or imprecise visual views.
- CFTree organizes image data from coarse (global) to fine (local).
- Tree-based Shapley Values identify the most valuable visual view for decoding.
- GTHM is a training-free solution that significantly reduces hallucinations on major benchmarks.
As LVLMs become more integrated into our daily lives—from describing photos for the visually impaired to analyzing medical imaging—reducing hallucinations is not just a technical improvement; it is a safety requirement. GTHM shows that sometimes, to see the truth, you have to play the game.