Introduction
Imagine asking an AI to describe a picture of a soccer field. The model confidently replies, “A player in a green jersey is kicking the ball toward the goal.” It sounds perfect, except for one problem: there is no ball in the picture.
This phenomenon is known as hallucination. Large Vision-Language Models (LVLMs), despite their incredible ability to understand images and text, frequently fabricate objects, attributes, or relationships that simply don’t exist. For casual use, this is annoying. For critical applications like medical imaging analysis or autonomous driving, it is dangerous.
In this post, we will deep-dive into a fascinating research paper titled “Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding.” The researchers propose a novel framework that treats hallucination not as a single problem, but as a dynamic, shifting challenge. Instead of using a one-size-fits-all solution, they introduce “Octopus”—a method that adaptively switches strategies on the fly to keep the model grounded in reality.
The Problem with “One-Size-Fits-All”
To understand why Octopus is necessary, we first need to look at how researchers currently fight hallucinations. There are generally two schools of thought:
- Retraining: This involves curating high-quality datasets and fine-tuning the model to be more honest. While effective, it is expensive, time-consuming, and computationally heavy.
- Contrastive Decoding (CD): This is a post-hoc method (applied after training) that manipulates the model’s output probabilities during generation. It typically compares the model’s standard output against a “distorted” input to cancel out biases.

As shown in Figure 1, Retraining (a) updates the model weights, while Contrastive Decoding (b) compares logits from original and distorted inputs. The paper introduces Octopus (c), which adds a decision-making layer to dynamically select the best strategy.
The Limits of Static Strategies
Contrastive Decoding (CD) has been a popular fix because it doesn’t require retraining the massive model. Several specific CD strategies exist:
- VCD (Visual Contrastive Decoding): Adds noise to the image to stop the model from relying too much on language priors (e.g., assuming “sofa” is followed by “living room” regardless of the image).
- M3ID: Masks the text to force the model to look at the image more, reducing visual information loss.
- AVISC: Uses “blind tokens” to minimize attention bias.
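To make these differences concrete, here is a minimal sketch of the kind of "distorted" input each strategy contrasts against, following the one-line descriptions above. The function names and the specific distortions are illustrative simplifications, not the original implementations:

```python
import torch

def vcd_distort(image: torch.Tensor, noise_scale: float = 0.5) -> torch.Tensor:
    """VCD-style distortion (illustrative): add noise to the image so the distorted
    branch leans on language priors, which can then be contrasted away."""
    return image + noise_scale * torch.randn_like(image)

def m3id_distort(text_ids: torch.Tensor, mask_id: int = 0) -> torch.Tensor:
    """M3ID-style distortion (as described above, simplified): mask the text so the
    contrasted branch is forced to rely less on the prompt."""
    return torch.full_like(text_ids, mask_id)

def avisc_distort(image_tokens: torch.Tensor) -> torch.Tensor:
    """AVISC-style distortion (illustrative): swap visual tokens for uninformative
    "blind" tokens to expose attention bias."""
    return torch.zeros_like(image_tokens)
```

Each of these yields a second, "distorted" forward pass whose output the decoder can compare against the original one.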
However, the authors of Octopus posed a critical question: Does one strategy work for every image?
They ran diagnostic experiments using these existing strategies. The results, visualized below, were telling.

Figure 2 reveals a crucial insight: existing CD strategies are complementary, not redundant.
- In the AMBER dataset (generative task), roughly 48% of samples were improved by only one specific strategy.
- Only about 10% of samples benefitted from all strategies.
This implies that if you pick just one method (e.g., VCD) and apply it to everything, you are failing to correct—or potentially worsening—a huge chunk of the data.
Hallucinations Happen Token-by-Token
The researchers went deeper. They discovered that hallucination isn’t just a “per-image” problem; it’s a “per-token” problem. In a single sentence generated by an AI, one word might be hallucinated because of language bias, while the next word is hallucinated because of attention bias.

In Figure 4, look at the attention map. The word “sitting” was triggered by attention bias (Strategy 3), while “lying” was caused by visual information loss (Strategy 2).
This discovery is the foundation of the Octopus method. If the cause of hallucination changes word by word, the solution must also change word by word.
The Octopus Framework
The Octopus framework is designed to sit on top of an existing LVLM (like LLaVA). It does not change the LVLM’s weights. Instead, it acts as a conductor, deciding which decoding strategy to use for every single token generated.
The Metaphor: Eye and Tentacles
The authors use a biological metaphor to describe the architecture:
- The Eye: A mechanism to perceive the current state and detect what kind of hallucination might occur.
- The Tentacles: The various Contrastive Decoding strategies available to fix the problem.

As illustrated in Figure 5, the workflow involves inputting the image and text. An “Eye” token processes the hidden states, and a decision is made to activate a specific “Tentacle” (Strategy k) to correct the output.
Step-by-Step Architecture
Let’s break down the math and mechanics of how Octopus works during the generation process.
1. The Standard LVLM Process
Normally, an LVLM generates a token \(y_t\) based on the image \(v\), the query \(q\), and the previous words \(y_{<t}\).
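In its simplest form, this is just sampling from the model's conditional distribution (standard notation; the paper's exact formulation may differ slightly):

\[ y_t \sim p_{\theta}(y_t \mid v, q, y_{<t}) \]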
2. The Contrastive Decoding (CD) Basis
In a standard CD approach, the model subtracts the output distribution obtained from a "distorted" input (like a blurred image) from the distribution obtained from the original input. This penalizes the model for guessing based on bias rather than on evidence in the image.
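A common way to write this contrast, used for example by VCD, is at the logit level. The weighting and the choice of distorted input \(v'\) vary by strategy, so treat this as a representative form rather than the paper's exact equation:

\[ \tilde{p}_{\theta}(y_t \mid v, q, y_{<t}) \;\propto\; \exp\!\Big[ (1+\alpha)\,\mathrm{logit}_{\theta}(y_t \mid v, q, y_{<t}) \;-\; \alpha\,\mathrm{logit}_{\theta}(y_t \mid v', q, y_{<t}) \Big] \]

Here \(v'\) is the distorted image and \(\alpha\) controls how strongly the biased branch is penalized.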
3. The Octopus "Eye" (Detection)
The core innovation is a small Transformer-based block, denoted as \(O_{\phi}\). At each step \(t\), it takes the hidden states of the LVLM (visual, query, and history) and a learnable token called eye. This eye token aggregates all the context needed to decide if the model is currently hallucinating and why.
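A minimal sketch of what such a block could look like, assuming the eye token is simply appended to the LVLM's hidden states and processed by a small Transformer encoder (module names and dimensions are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class OctopusEye(nn.Module):
    """Small Transformer block O_phi that reads LVLM hidden states
    plus a learnable 'eye' token and summarizes the current context."""

    def __init__(self, hidden_dim: int = 4096, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        # Learnable eye token, appended to the hidden-state sequence.
        self.eye_token = nn.Parameter(torch.randn(1, 1, hidden_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) — visual, query, and
        # history hidden states taken from the frozen LVLM at step t.
        batch = hidden_states.size(0)
        eye = self.eye_token.expand(batch, -1, -1)
        x = torch.cat([hidden_states, eye], dim=1)
        x = self.encoder(x)
        # Return the eye token's output embedding as the context summary.
        return x[:, -1]
```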
4. The Decision (Action)
The output of the eye token is passed through a Multi-Layer Perceptron (MLP) to create an action vector, which represents the probability of choosing each strategy. From this, the model creates a "one-hot" vector \(a_t\) that selects the best strategy (or "tentacle") for this specific moment. The available actions correspond to the decoding strategies introduced earlier, such as VCD, M3ID, and AVISC.
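Putting the pieces together, a sketch of the per-token decision might look like the following, reusing the illustrative `OctopusEye` above and assuming each strategy is exposed as a function that adjusts the logits (the LVLM interface here is hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StrategySelector(nn.Module):
    """MLP head that turns the eye embedding into a one-hot action a_t."""

    def __init__(self, hidden_dim: int, num_strategies: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, num_strategies),
        )

    def forward(self, eye_embedding: torch.Tensor) -> torch.Tensor:
        probs = self.mlp(eye_embedding).softmax(dim=-1)     # strategy probabilities
        action = probs.argmax(dim=-1)                       # pick one strategy
        return F.one_hot(action, probs.size(-1)).float()    # one-hot a_t

def generate_step(lvlm, eye, selector, strategies, inputs):
    """One decoding step: detect (Eye), decide (a_t), correct (Tentacle)."""
    hidden, logits = lvlm(inputs)                # hypothetical frozen-LVLM interface
    a_t = selector(eye(hidden))                  # one-hot strategy choice for this token
    idx = int(a_t.argmax(dim=-1)[0])
    adjusted = strategies[idx](logits, inputs)   # e.g. a VCD / M3ID / AVISC adjustment
    return adjusted.argmax(dim=-1)               # greedy pick of the next token
```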
Optimizing the Octopus
Here lies a significant challenge: we don't have labels for this. There is no dataset that says, "For the 5th word of image #103, you should use VCD."
To solve this, the researchers use Direct Preference Optimization (DPO). Instead of needing explicit labels, DPO allows the model to learn from preferences. The researchers generate multiple outputs using different random combinations of strategies and then measure the hallucination rate of these outputs (using a metric like CHAIR). The output with fewer hallucinations serves as the positive sample and the more hallucinated one as the negative sample. Octopus is trained to maximize the likelihood of selecting the strategies used in the positive sample and to minimize that of the negative sample. This allows the "Eye" to learn which situations require which "Tentacle" without ever needing a human to manually label the strategies.
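For reference, the standard DPO objective, adapted so that the "policy" being optimized is the Eye's strategy-selection distribution (this adaptation is a sketch of the idea rather than the paper's exact loss), is:

\[ \mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left( \beta \left[ \log \frac{\pi_{\phi}(a^{+} \mid x)}{\pi_{\mathrm{ref}}(a^{+} \mid x)} - \log \frac{\pi_{\phi}(a^{-} \mid x)}{\pi_{\mathrm{ref}}(a^{-} \mid x)} \right] \right) \]

where \(a^{+}\) and \(a^{-}\) are the strategy choices behind the positive and negative samples, \(\pi_{\phi}\) is the Eye's selection policy, \(\pi_{\mathrm{ref}}\) is a frozen reference policy, and \(\beta\) is a temperature.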
Experiments and Results
Does this dynamic switching actually work? The authors tested Octopus against state-of-the-art methods across several benchmarks.
Generative Tasks (Image Captioning)
The primary test is the generative task: asking the model to describe an image. The metrics used include CHAIR (Caption Hallucination Assessment with Image Relevance), where a lower score is better because it means fewer hallucinations. Table 1 shows the results on the AMBER, Object HalBench, and MMHalBench datasets.
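As a concrete illustration, an instance-level CHAIR score can be computed as the fraction of mentioned objects that are not actually in the image. This follows the commonly used definition; object extraction and matching details vary by benchmark:

```python
def chair_instance(mentioned_objects: list[str], ground_truth_objects: set[str]) -> float:
    """Instance-level CHAIR: share of mentioned objects that are hallucinated
    (lower is better)."""
    if not mentioned_objects:
        return 0.0
    hallucinated = [obj for obj in mentioned_objects if obj not in ground_truth_objects]
    return len(hallucinated) / len(mentioned_objects)

# Example: the caption mentions a ball that is not present in the image.
print(chair_instance(["player", "jersey", "ball"], {"player", "jersey", "goal"}))  # ~0.33
```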
Discriminative Tasks (Yes/No Questions)
They also tested discriminative tasks (e.g., the POPE benchmark), where the model answers "Yes" or "No" to questions like "Is there a dining table in the image?" As seen in Table 2, Octopus achieves higher accuracy and F1 scores than any individual strategy. This shows that dynamically switching strategies helps the model verify facts more accurately.
Why not just combine them randomly?
You might wonder if the complex "Eye" mechanism is necessary. Could we just pick a random strategy or combine them all at once? Table 3 provides the answer via an ablation study.
Qualitative Proof
Finally, seeing is believing. Let's revisit the soccer example from the introduction. Figure 6 shows the output for the soccer player image.
Conclusion
The "Octopus" paper presents a significant shift in how we handle errors in Large Vision-Language Models. By acknowledging that hallucinations are hybrid, caused by a mix of language priors, visual information loss, and attention biases, the researchers moved away from static solutions.
Key Takeaways:
- Hallucination is not a single problem: different tokens in the same sentence can be corrupted by language priors, visual information loss, or attention bias.
- Existing contrastive decoding strategies are complementary, so no single static strategy corrects every sample.
- Octopus adds a lightweight "Eye" that detects which bias is at play and dispatches the matching "Tentacle" for each token, trained with DPO instead of manual strategy labels.
- This dynamic selection outperforms individual strategies on both generative and discriminative benchmarks.
This framework is also highly extensible. As researchers invent new decoding strategies (new "Tentacles"), they can simply be added to the Octopus's toolkit, potentially making LVLMs even more reliable in the future.