Introduction

Imagine asking an AI to describe a picture of a soccer field. The model confidently replies, “A player in a green jersey is kicking the ball toward the goal.” It sounds perfect, except for one problem: there is no ball in the picture.

This phenomenon is known as hallucination. Large Vision-Language Models (LVLMs), despite their incredible ability to understand images and text, frequently fabricate objects, attributes, or relationships that simply don’t exist. For casual use, this is annoying. For critical applications like medical imaging analysis or autonomous driving, it is dangerous.

In this post, we will deep-dive into a fascinating research paper titled “Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding.” The researchers propose a novel framework that treats hallucination not as a single problem, but as a dynamic, shifting challenge. Instead of using a one-size-fits-all solution, they introduce “Octopus”—a method that adaptively switches strategies on the fly to keep the model grounded in reality.

The Problem with “One-Size-Fits-All”

To understand why Octopus is necessary, we first need to look at how researchers currently fight hallucinations. There are generally two schools of thought:

  1. Retraining: This involves curating high-quality datasets and fine-tuning the model to be more honest. While effective, it is expensive, time-consuming, and computationally heavy.
  2. Contrastive Decoding (CD): This is a post-hoc method (applied after training) that manipulates the model’s output probabilities during generation. It typically compares the model’s standard output against a “distorted” input to cancel out biases.

Figure 1. Paradigm comparison of different hallucination alleviation methods. (a) Retraining method. (b) Contrastive Decoding. (c) Octopus.

As shown in Figure 1, Retraining (a) updates the model weights, while Contrastive Decoding (b) compares logits from original and distorted inputs. The paper introduces Octopus (c), which adds a decision-making layer to dynamically select the best strategy.

The Limits of Static Strategies

Contrastive Decoding (CD) has been a popular fix because it doesn’t require retraining the massive model. Several specific CD strategies exist:

  • VCD (Visual Contrastive Decoding): Adds noise to the image to stop the model from relying too much on language priors (e.g., assuming “sofa” is followed by “living room” regardless of the image).
  • M3ID: Masks the text to force the model to look at the image more, reducing visual information loss.
  • AVISC: Uses “blind tokens” to minimize attention bias.

However, the authors of Octopus posed a critical question: Does one strategy work for every image?

They ran diagnostic experiments using these existing strategies. The results, visualized below, were telling.

Figure 2. The proportion of effective samples using different CD methods for Generative and Discriminative Tasks.

Figure 2 reveals a crucial insight: existing CD strategies are complementary, not redundant.

  • In the AMBER dataset (generative task), roughly 48% of samples were improved by only one specific strategy.
  • Only about 10% of samples benefitted from all strategies.

This implies that if you pick just one method (e.g., VCD) and apply it to everything, you leave a large fraction of samples uncorrected, or even make them worse.

Hallucinations Happen Token-by-Token

The researchers went deeper. They discovered that hallucination isn’t just a “per-image” problem; it’s a “per-token” problem. In a single sentence generated by an AI, one word might be hallucinated because of language bias, while the next word is hallucinated because of attention bias.

Figure 4. Token-level hallucination qualitative analysis showing hybrid causes in a sample.

In Figure 4, look at the attention map. The word “sitting” was triggered by attention bias (Strategy 3), while “lying” was caused by visual information loss (Strategy 2).

This discovery is the foundation of the Octopus method. If the cause of hallucination changes word by word, the solution must also change word by word.

The Octopus Framework

The Octopus framework is designed to sit on top of an existing LVLM (like LLaVA). It does not change the LVLM’s weights. Instead, it acts as a conductor, deciding which decoding strategy to use for every single token generated.

The Metaphor: Eye and Tentacles

The authors use a biological metaphor to describe the architecture:

  • The Eye: A mechanism to perceive the current state and detect what kind of hallucination might occur.
  • The Tentacles: The various Contrastive Decoding strategies available to fix the problem.

Figure 5. Overview of our method. Octopus consists of the decision token “eye” and its tentacles (strategies).

As illustrated in Figure 5, the workflow involves inputting the image and text. An “Eye” token processes the hidden states, and a decision is made to activate a specific “Tentacle” (Strategy k) to correct the output.

Step-by-Step Architecture

Let’s break down the math and mechanics of how Octopus works during the generation process.

1. The Standard LVLM Process

Normally, an LVLM generates a token \(y_t\) based on the image \(v\), the query \(q\), and the previously generated words \(y_{<t}\). The model turns its logits into a probability distribution over the vocabulary and samples the next token from it:

\[
p_{\theta}(y_t \mid v, q, y_{<t}) = \mathrm{softmax}\big(\mathrm{logit}_{\theta}(y_t \mid v, q, y_{<t})\big), \qquad y_t \sim p_{\theta}(y_t \mid v, q, y_{<t})
\]
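To make the notation concrete, here is a minimal PyTorch sketch of this sampling step. The `logits` vector is assumed to come from a forward pass of the LVLM conditioned on \(v\), \(q\), and \(y_{<t}\); the names are illustrative rather than the paper’s code.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Turn the LVLM's logits for step t into a probability distribution
    and sample the next token y_t from it."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Toy usage with a made-up vocabulary of 10 tokens; in practice the logits
# come from the LVLM's forward pass on the image, query, and history.
next_token = sample_next_token(torch.randn(10))
```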

2. The Contrastive Decoding (CD) Basis

In a standard CD approach, the model contrasts the logits produced from the original input with those produced from a “distorted” input (such as a noised image). This penalizes tokens the model would predict even without reliable visual evidence:

\[
p_{cd}(y_t \mid v, v', q, y_{<t}) = \mathrm{softmax}\big[(1+\alpha)\,\mathrm{logit}_{\theta}(y_t \mid v, q, y_{<t}) - \alpha\,\mathrm{logit}_{\theta}(y_t \mid v', q, y_{<t})\big]
\]

where \(v'\) is the distorted input and \(\alpha\) controls how strongly the distorted prediction is penalized.
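As a minimal sketch of this logit-level contrast, assume we already have two vocabulary-sized logit vectors from the same LVLM, one for the original input and one for the distorted input; the weighting mirrors the equation above, and the names are illustrative.

```python
import torch

def contrastive_logits(logits_original: torch.Tensor,
                       logits_distorted: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Boost what the original input supports and penalize what the
    distorted input also predicts, following the (1+alpha)/alpha weighting."""
    return (1 + alpha) * logits_original - alpha * logits_distorted

# Hypothetical usage: the contrasted logits replace the standard ones before
# the softmax/sampling step shown earlier.
logits_clean = torch.randn(32000)   # e.g. from the clean image
logits_noisy = torch.randn(32000)   # e.g. from a noised image (VCD-style)
probs = torch.softmax(contrastive_logits(logits_clean, logits_noisy), dim=-1)
```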

3. The Octopus “Eye” (Detection)

The core innovation is a small Transformer-based block, denoted \(O_{\phi}\). At each step \(t\), it takes the hidden states of the LVLM (visual, query, and history) together with a learnable token called eye:

\[
h_t^{eye} = O_{\phi}\big(H_v,\, H_q,\, H_{<t},\, e_{eye}\big)
\]

where \(H_v\), \(H_q\), and \(H_{<t}\) are the visual, query, and history hidden states, and \(e_{eye}\) is the learnable eye token.

This eye token aggregates all the context needed to decide if the model is currently hallucinating and why.
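A minimal sketch of what such an “eye” block could look like in PyTorch is shown below. The layer sizes, the number of layers, and the way the hidden states are concatenated are illustrative assumptions; the paper’s \(O_{\phi}\) may differ in these details.

```python
import torch
import torch.nn as nn

class OctopusEye(nn.Module):
    """Illustrative 'eye': a small Transformer encoder that reads the LVLM's
    visual, query, and history hidden states plus a learnable eye token and
    returns the eye token's contextualized representation h_t^eye."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        self.eye_token = nn.Parameter(torch.randn(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, h_visual, h_query, h_history):
        # Append the learnable eye token to the concatenated hidden states.
        eye = self.eye_token.expand(h_visual.size(0), -1, -1)
        seq = torch.cat([h_visual, h_query, h_history, eye], dim=1)
        return self.encoder(seq)[:, -1]  # the contextualized eye token

# Toy usage: batch of 1, made-up sequence lengths, d_model=512 (a real LVLM
# such as LLaVA would use its own hidden size, e.g. 4096).
h_eye = OctopusEye()(torch.randn(1, 576, 512),
                     torch.randn(1, 32, 512),
                     torch.randn(1, 8, 512))
```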

4. The Decision (Action)

The output of the eye token is passed through a Multi-Layer Perceptron (MLP) to create an action probability vector:

\[
\pi_t = \mathrm{softmax}\big(\mathrm{MLP}(h_t^{eye})\big)
\]

This vector \(\pi_t\) gives the probability of choosing each strategy. The model then forms a one-hot vector \(a_t\) that selects the best strategy (or “tentacle”) for this specific step:

\[
a_t = \operatorname{one\text{-}hot}\big(\arg\max_{k}\, \pi_{t,k}\big)
\]

The available actions in this paper are listed below, followed by a small code sketch of this decision step:

  1. VCD (Targeting language priors)
  2. M3ID (Targeting vision loss)
  3. AVISC (Targeting attention bias)
  4. Null (Do nothing / standard generation)
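Putting Equations 11 and 12 together with the action list, a per-token decision step might look like the sketch below. The MLP shape, the strategy names, and the `logits_by_strategy` dictionary (logits precomputed by each tentacle for the current step) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STRATEGIES = ["vcd", "m3id", "avisc", "null"]  # the four "tentacles"

class ActionHead(nn.Module):
    """Illustrative decision head: MLP over h_t^eye -> strategy probabilities
    (Eq. 11) -> one-hot action a_t (Eq. 12)."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, len(STRATEGIES)))

    def forward(self, h_eye: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(self.mlp(h_eye), dim=-1)
        return F.one_hot(probs.argmax(dim=-1), num_classes=len(STRATEGIES))

# Hypothetical per-token dispatch (batch size 1): the chosen tentacle decides
# which logits are used for the next token; "null" means plain decoding.
def decode_step(action: torch.Tensor, logits_by_strategy: dict) -> torch.Tensor:
    return logits_by_strategy[STRATEGIES[int(action.argmax(dim=-1))]]
```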

Optimizing the Octopus

Here lies a significant challenge: We don’t have labels for this. There is no dataset that says, “For the 5th word of image #103, you should use VCD.”

To solve this, the researchers use Direct Preference Optimization (DPO).

Instead of needing explicit labels, DPO allows the model to learn from preferences. The researchers generate multiple outputs using different random combinations of strategies and then measure the hallucination rate of these outputs (using a metric like CHAIR); a sketch of this pair construction follows the definitions below.

  • Positive Sample (\(A^+\)): A sequence of strategies that resulted in a truthful caption.
  • Negative Sample (\(A^-\)): A sequence that resulted in hallucinations.
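A minimal sketch of how such preference pairs could be assembled, assuming the surrounding pipeline provides a `generate_caption(strategy_sequence)` function and a `score_hallucination(caption)` metric (lower is better); both names are hypothetical.

```python
import random

STRATEGIES = ["vcd", "m3id", "avisc", "null"]

def build_preference_pair(generate_caption, score_hallucination,
                          seq_len: int, n_rollouts: int = 8):
    """Sample random strategy sequences, score the resulting captions, and
    return the best sequence as A+ and the worst as A-."""
    rollouts = []
    for _ in range(n_rollouts):
        seq = [random.choice(STRATEGIES) for _ in range(seq_len)]
        rollouts.append((score_hallucination(generate_caption(seq)), seq))
    rollouts.sort(key=lambda pair: pair[0])
    positive, negative = rollouts[0][1], rollouts[-1][1]
    return positive, negative
```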

The Octopus module is trained to increase the likelihood of the strategy choices that appear in the Positive Sample and to decrease the likelihood of those in the Negative Sample:

\[
\mathcal{L}_{DPO} = -\,\mathbb{E}\Big[\log \sigma\Big(\beta \log \frac{\pi_{\phi}(A^{+} \mid x)}{\pi_{ref}(A^{+} \mid x)} - \beta \log \frac{\pi_{\phi}(A^{-} \mid x)}{\pi_{ref}(A^{-} \mid x)}\Big)\Big]
\]

where \(\pi_{\phi}\) is the trainable Octopus policy, \(\pi_{ref}\) a frozen reference copy, \(\sigma\) the sigmoid function, and \(\beta\) a temperature hyperparameter.

This allows the “Eye” to learn intuitively which situations require which “Tentacle” without ever needing a human to manually label the strategies.
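Assuming the paper follows the standard DPO formulation above, the objective over strategy sequences could be sketched as follows; the \(\beta\) value and the log-probability inputs (summed over the per-token strategy choices, under the trainable policy and a frozen reference copy) are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos: torch.Tensor, logp_neg: torch.Tensor,
             ref_logp_pos: torch.Tensor, ref_logp_neg: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Prefer the strategy sequence A+ over A-, measured relative to a
    frozen reference policy (standard DPO objective)."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities for one preference pair.
loss = dpo_loss(torch.tensor([-3.2]), torch.tensor([-4.1]),
                torch.tensor([-3.5]), torch.tensor([-3.6]))
```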

Experiments and Results

Does this dynamic switching actually work? The authors tested Octopus against state-of-the-art methods across several benchmarks.

Generative Tasks (Image Captioning)

The primary test is the Generative Task: asking the model to describe an image. The metrics used include CHAIR (Caption Hallucination Assessment with Image Relevance), where a lower score is better because it means fewer hallucinations.
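As a rough illustration of the instance-level CHAIR idea (not the official implementation, which also handles object extraction and synonym matching), a minimal sketch:

```python
def chair_instance_score(mentioned_objects, ground_truth_objects):
    """Fraction of objects mentioned in a caption that are not actually in
    the image; lower is better."""
    if not mentioned_objects:
        return 0.0
    hallucinated = [obj for obj in mentioned_objects
                    if obj not in ground_truth_objects]
    return len(hallucinated) / len(mentioned_objects)

# Toy example: the caption mentions a ball that is not in the image.
print(chair_instance_score(["player", "goal", "ball"], {"player", "goal", "field"}))
```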

Table 1. Comparison with the state-of-the-art methods for the generative task.

Table 1 shows the results on the AMBER, Object HalBench, and MMHalBench datasets.

  • Performance: Octopus achieves the lowest hallucination rates (CHAIR scores) across the board.
  • Comparison: It significantly outperforms single-strategy methods like VCD and AVISC.
  • Magnitude: On the AMBER dataset, Octopus improved the CHAIR score from 8.0 (Base LLaVA) down to 6.1, a massive reduction in fabrication.

Discriminative Tasks (Yes/No Questions)

They also tested Discriminative Tasks (e.g., POPE benchmark), where the model answers “Yes” or “No” to questions like “Is there a dining table in the image?”

Table 2. Comparison with the state-of-the-art methods for the discriminative tasks.

As seen in Table 2, Octopus achieves higher accuracy and F1 scores than any individual strategy. This indicates that dynamically switching strategies helps the model verify facts more accurately.

Why not just combine them randomly?

You might wonder if the complex “Eye” mechanism is necessary. Could we just pick a random strategy or combine them all at once?

Table 3. Ablation study showing the effectiveness of combining strategies intelligently.

Table 3 provides the answer via an ablation study:

  • Rows 2-5: Using pairs of strategies helps, but not as much as using all of them.
  • Row 6 (Octopus): The full Octopus framework yields the best results (CHAIR score of 4.8 in this specific ablation run). This confirms that intelligent selection is superior to random selection or static combinations.

Qualitative Proof

Finally, seeing is believing. Let’s revisit the soccer example.

Figure 6. Comparison of generated description with different CD strategies and our method.

In Figure 6, we see the output for the soccer player image:

  • LLaVA (Base): Hallucinates a “soccer ball” (red text).
  • VCD / M3ID / AVISC: All continue to hallucinate the ball or other details, as their specific bias-correction wasn’t right for this specific visual confusion.
  • Octopus: Correctly identifies that the player is “kicking [action]… on a field,” but crucially, it does not hallucinate the ball. It accurately describes the scene.

Conclusion

The “Octopus” paper presents a significant shift in how we handle errors in Large Vision-Language Models. By acknowledging that hallucinations are hybrid—caused by a mix of language priors, visual loss, and attention biases—the researchers moved away from static solutions.

Key Takeaways:

  1. Hallucinations are Dynamic: The cause of error changes from sample to sample and word to word.
  2. The Octopus Solution: A lightweight module (the “Eye”) can learn to predict these errors and assign the correct fix (the “Tentacle”) in real-time.
  3. Deployment Friendly: Because it uses Contrastive Decoding and a small add-on module, it improves performance without the massive cost of retraining the foundational model.

This framework is highly extensible. As researchers invent new decoding strategies (new “Tentacles”), they can simply be added to the Octopus’s toolkit, potentially making LVLMs even more reliable in the future.