Introduction

Imagine showing an AI a picture of a snowy forest and asking it to describe what it sees. The model confidently describes the snow, the trees, and then adds, “…and a squirrel is eating a nut on the branch.” You look closer. There is a squirrel, but it’s jumping, not eating. And it’s on the ground, not a branch.

This phenomenon is known as hallucination. In Large Vision-Language Models (LVLMs)—the systems behind tools like GPT-4V or LLaVA—it is a persistent and critical issue. While these models have demonstrated incredible prowess in understanding visual data, they frequently generate content that either contradicts the image or invents objects that simply aren’t there.

In this post, we are exploring a fascinating research paper titled “HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding.” The authors propose a comprehensive framework that tackles hallucination from two angles: how the model is trained and how the model “thinks” when it generates text.

Figure 1: A case of LVLM hallucination. The parts marked in red are, in fact, hallucinations. The parts marked in blue would be mistaken for hallucinations by detection methods that focus only on objects.

As shown in Figure 1, hallucination isn’t black and white. If a detection method only checks for the existence of objects (Object-level), it might flag “predators” or “food” as hallucinations because those specific words don’t label an object in the image. However, semantically, they make sense in the context of a squirrel’s life. The authors of HELPD argue that we need a smarter way to detect and penalize these errors—one that understands both objects and semantics.

Background: Why Do LVLMs Hallucinate?

To understand the solution, we first need to diagnose the problem. LVLMs generally consist of a visual encoder (like a Vision Transformer) connected to a Large Language Model (LLM). The goal is to align visual features with text embeddings so the LLM can “see.”

However, previous research identified a phenomenon called “Over-trust.” This occurs when the model relies too heavily on the text it has already generated, ignoring the visual input. It creates a self-fulfilling prophecy: if the model generates the word “table,” it becomes statistically biased to generate “chair” next, even if there is no chair in the image.

Figure 2: Attention visualization of LVLMs. For the same input, each image represents the attention matrix of a specific LVLM generation instance. Red indicates attention to the image, while green represents the phenomenon of “Over-trust” in the generated text.

As seen in Figure 2, the authors analyzed the attention matrices of popular models like InstructBLIP and MiniGPT-4.

  • The red boxes show where the model is actually paying attention to the image.
  • The green boxes show the “Over-trust” in the generated text (the vertical bars on the right).

The authors realized that existing solutions often penalize the text over-trust but fail to reinforce the visual attention. This observation led to the development of HELPD, a two-pronged framework:

  1. Hierarchical Feedback Learning: A training method that uses Reinforcement Learning (RL) to punish hallucinations at both the object and sentence levels.
  2. Vision-enhanced Penalty Decoding: A sampling strategy used during generation that balances text penalties with visual bonuses.

The Core Method: HELPD

Let’s break down this architecture. The brilliance of HELPD is that it attacks the problem at two different stages of the model’s lifecycle: fine-tuning and inference.

Figure 3: This diagram illustrates the framework of HELPD. The Hierarchical Feedback Learning detects hallucination by obtaining object-level feedback from comparing object sets extracted from sampled and label sentences, and sentence-level feedback through semantic comparison using GPT-4’s few-shot inference capabilities. To improve the effectiveness of sampling, the Vision-enhanced Penalty Decoding augments the over-trust penalty score with a vision-enhanced penalty score, making the final logits closer to the image.

Figure 3 provides the high-level roadmap. On the left, we see the feedback learning loop (training), and in the center/right, we see the decoding penalty strategy (inference).

Part 1: Hierarchical Feedback Learning

Standard training uses Cross-Entropy loss, which only checks whether the model predicted the exact next word correctly. It doesn’t care whether the sentence is factually wrong, as long as the word-level probabilities mimic the training data.

HELPD introduces a fine-tuning stage using Reinforcement Learning (RL). The model generates a sentence, and the system calculates a “Reward” based on how accurate the sentence is. If the reward is high, the model is encouraged to repeat that behavior; if low, it is discouraged.

The “Hierarchical” part means the reward comes from two levels:

1. Object-Level Feedback (\(R_{obj}\))

This checks if the specific nouns generated by the model actually exist in the ground truth labels.

  • \(S_{sam}\): The set of objects in the model’s sampled sentence.
  • \(S_{lab}\): The set of objects in the ground truth label.

The authors calculate the Precision and Recall between these two sets.

\[ \text{Precision} = \frac{\lvert S_{sam} \cap S_{lab} \rvert}{\lvert S_{sam} \rvert} \]

\[ \text{Recall} = \frac{\lvert S_{sam} \cap S_{lab} \rvert}{\lvert S_{lab} \rvert} \]

The final Object-Level Reward (\(R_{obj}\)) is the F1 score, which balances precision and recall:

\[ R_{obj} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

This metric ensures the model isn’t hallucinating random objects (Precision) and isn’t missing key objects (Recall).
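
To make the object-level reward concrete, here is a minimal Python sketch, assuming the object sets have already been extracted as lowercase strings. The extraction step and this exact implementation are simplifications for illustration, not the authors’ code:

```python
def object_level_reward(sampled_objects, label_objects):
    """F1-style object-level reward between a sampled sentence's objects
    and the ground-truth objects (a sketch, not the authors' implementation)."""
    s_sam, s_lab = set(sampled_objects), set(label_objects)
    if not s_sam or not s_lab:
        return 0.0
    overlap = len(s_sam & s_lab)
    precision = overlap / len(s_sam)   # how many generated objects are real
    recall = overlap / len(s_lab)      # how many real objects were mentioned
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model invents "branch" and misses "grass".
print(object_level_reward(["squirrel", "nut", "branch"],
                          ["squirrel", "nut", "grass"]))  # ≈ 0.667
```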

2. Sentence-Level Feedback (\(R_{sen}\))

Objects aren’t everything. As we saw in the squirrel example, the relationship between objects matters. To capture this, the authors utilize GPT-4. They use few-shot prompting to feed the generated sentence and the ground truth to GPT-4, asking it to score the semantic accuracy on a scale of 0 to 1. This provides the Sentence-Level Reward (\(R_{sen}\)).
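
The paper’s exact prompt isn’t reproduced here, but the interaction might look roughly like the sketch below, assuming the OpenAI chat API. `FEW_SHOT_EXAMPLES` and the scoring instruction are hypothetical placeholders, not the authors’ prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical placeholder: a few scored (generated, reference) example pairs.
FEW_SHOT_EXAMPLES = "..."

def sentence_level_reward(generated: str, reference: str) -> float:
    """Ask GPT-4 to rate semantic agreement on a 0-1 scale (illustrative only)."""
    prompt = (
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Generated: {generated}\n"
        f"Reference: {reference}\n"
        "Rate how semantically faithful the generated sentence is to the "
        "reference, from 0 to 1. Reply with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())
```

In practice the returned string would need validation (a judge model can reply outside the 0–1 range or add commentary), but the flow is the same.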

3. Combining Rewards via RL

The total reward \(R_i\) is a weighted sum of the semantic and object scores, controlled by a hyperparameter \(\sigma\):

Equation for Total Reward

Since these rewards (the F1 score and the GPT-4 output) are non-differentiable (you can’t backpropagate a gradient through GPT-4), the authors use the REINFORCE algorithm, a standard policy gradient method in RL.

First, they look at the log probabilities of the specific actions (words) the model chose:

Equation for Log Probability

Then, they compute the RL loss. This loss encourages the model to increase the probability of sequences that yielded high rewards (\(R_i\)):

Equation for RL Loss
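
In code, the REINFORCE update amounts to weighting the sampled sequence’s log-probability by its reward. Below is a minimal PyTorch sketch; the reward combination `sigma * r_sen + (1 - sigma) * r_obj` is an assumed form of the weighted sum from the previous step, not necessarily the paper’s exact formula:

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor,
                   r_obj: float, r_sen: float, sigma: float = 0.5) -> torch.Tensor:
    """REINFORCE-style loss for one sampled sentence (a sketch).

    token_logprobs: log-probabilities of the tokens the model actually sampled,
                    shape (seq_len,), still attached to the computation graph.
    """
    # Assumed form of the weighted total reward; the paper's exact use of
    # sigma may differ.
    reward = sigma * r_sen + (1.0 - sigma) * r_obj
    # The reward is a constant w.r.t. the parameters, so gradients flow only
    # through the log-probabilities of the chosen tokens.
    return -reward * token_logprobs.sum()

# Toy usage with made-up probabilities for three sampled tokens:
logp = torch.log(torch.tensor([0.6, 0.4, 0.7], requires_grad=True))
loss = reinforce_loss(logp, r_obj=0.67, r_sen=0.80)
loss.backward()
```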

4. The Curriculum Strategy

You cannot start training a model with RL immediately; it needs to understand the basics of language first. Therefore, HELPD applies a curriculum. For the first portion of training steps (defined by ratio \(c\)), it uses standard Cross-Entropy (\(\mathcal{L}_{CE}\)). Once the model is stable, it switches to a combined loss of Cross-Entropy and Reinforcement Learning:

Equation for Total Loss with Curriculum

This ensures the model maintains linguistic quality while learning to be factually accurate.
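
Here is a hedged sketch of how such a schedule might be wired into a training loop; the switch condition and the `lambda_rl` weighting of the combined loss are illustrative assumptions, not the paper’s exact setup:

```python
def training_loss(step: int, total_steps: int, ce_loss, rl_loss,
                  c: float = 0.3, lambda_rl: float = 1.0):
    """Curriculum schedule: pure Cross-Entropy for the first c * total_steps
    steps, then a combined CE + RL objective (illustrative weighting)."""
    if step < c * total_steps:
        return ce_loss                      # warm-up: language modelling only
    return ce_loss + lambda_rl * rl_loss    # later: add hallucination feedback
```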


Part 2: Vision-enhanced Penalty Decoding

Training takes time and compute. What if we could also reduce hallucinations instantly, just by changing how we select the next word? This is where Vision-enhanced Penalty Decoding comes in.

Existing methods like OPERA use an Over-trust Penalty. They examine the attention matrix over the generated text: if the model is staring too hard at its own previous words (a sign of a loop or hallucination), the method subtracts a penalty from the logits (the scores from which the next word is sampled) to force the model to look elsewhere.

The authors of HELPD argue this is only half the solution. If you penalize text reliance, you should simultaneously boost image reliance.

Figure 4: The illustration of Vision-enhanced Penalty Decoding. The total penalty is composed of the vision penalty and the over-trust penalty. The over-trust penalty is computed based on the generated text (the upper region), while the vision penalty is computed from the vision attention window (the lower area).

Figure 4 illustrates this dual pathway:

  1. Top path: Calculates the standard Over-trust score based on text tokens.
  2. Bottom path: Calculates a Vision-enhanced score based on image tokens.

The Math of Vision Attention

The system looks at a specific window of the attention matrix corresponding to the image features (\(\mathbf{W}_l^h\)):

Equation defining the Vision Window

It calculates a score (\(\psi\)) representing how much attention the model is paying to the visual input. This is done by aggregating the attention weights in that window:

Equation for Vision-enhanced Score
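
As a rough illustration, the aggregation could look like the following sketch, assuming we have the attention weights of the current decoding step and know which positions hold the image tokens. Averaging over the window is my simplification of the paper’s aggregation:

```python
import torch

def vision_score(attn: torch.Tensor, img_start: int, img_end: int) -> torch.Tensor:
    """Aggregate how much the current decoding step attends to the image tokens.

    attn: attention weights of the current query position over the whole
          sequence, shape (seq_len,).
    img_start, img_end: the slice of positions holding the visual tokens.
    """
    window = attn[img_start:img_end]   # the vision attention window
    return window.mean()               # simplified aggregation into a scalar psi

# Toy example: 4 image tokens followed by 4 text tokens.
attn = torch.tensor([0.10, 0.15, 0.05, 0.10, 0.20, 0.15, 0.15, 0.10])
print(vision_score(attn, 0, 4))  # tensor(0.1000)
```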

Calculating the Final Penalty

The final penalty term \(\rho\) isn’t just a subtraction. It balances the text over-trust penalty (\(\phi\)) and the vision score (\(\psi\)).

Equation for Overall Penalty Weight

Here, \(\beta\) acts as a balancing term. The logic is elegant: We want to penalize tokens that have high text over-trust unless they also have high visual support. By subtracting the vision score (weighted by \(\beta\)) from the text penalty, the model is “forgiven” for trusting the text if that trust is backed up by the image.

Finally, this adjusted penalty is applied to the probability distribution before sampling the next word:

Equation for Final Logit Adjustment

This forces the model to choose words that are not just linguistically probable, but visually grounded.
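
Putting the two signals together, a single decoding step might look like the sketch below. The form \(\rho = \phi - \beta\psi\) follows the description above, but applying it per candidate token in this way is a simplified reading, not the authors’ exact procedure:

```python
import torch

def penalized_scores(cand_logits: torch.Tensor, phi: torch.Tensor,
                     psi: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Adjust candidate-token scores with the combined penalty (a sketch).

    cand_logits: logits of the candidate next tokens, shape (k,).
    phi:         per-candidate over-trust penalty scores, shape (k,).
    psi:         per-candidate vision-enhanced scores, shape (k,).
    """
    rho = phi - beta * psi        # text over-trust, offset by visual support
    return cand_logits - rho      # ungrounded over-trust demotes a candidate

# Toy example with three candidate tokens:
cand_logits = torch.tensor([2.0, 1.8, 1.5])
phi = torch.tensor([0.9, 0.2, 0.1])   # first candidate leans on prior text
psi = torch.tensor([0.1, 0.3, 0.6])   # third candidate is backed by the image
print(penalized_scores(cand_logits, phi, psi))  # tensor([1.2000, 1.9000, 2.0000])
```

Note how the toy example flips the ranking: the candidate with strong visual support overtakes the one that merely continues the text.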

Experiments and Results

The researchers tested HELPD on several standard benchmarks using models like LLaVA-1.5 and mPLUG-Owl2.

Hallucination Benchmarks

  1. POPE (Polling-based Object Probing Evaluation): Asks “Is there a [object] in the image?” questions.
  2. CHAIR (Caption Hallucination Assessment with Image Relevance): Measures the percentage of generated objects that aren’t actually in the image (a minimal sketch of the metric follows this list).
  3. GAVIE & MMHal-Bench: Use GPT-4 as a judge of relevance and accuracy.
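
Because CHAIR scores appear throughout the results, here is a minimal sketch of how the two variants are typically computed. These are the standard CHAIR definitions; the synonym matching against MSCOCO object categories is omitted for brevity:

```python
def chair_scores(captions_objects, image_objects):
    """Compute CHAIR_i and CHAIR_s over a set of captions (simplified sketch).

    captions_objects: list of object sets mentioned by each generated caption.
    image_objects:    set of objects actually annotated in the image(s).
    """
    total_mentions = hallucinated_mentions = hallucinated_captions = 0
    for objs in captions_objects:
        hallucinated = {o for o in objs if o not in image_objects}
        total_mentions += len(objs)
        hallucinated_mentions += len(hallucinated)
        hallucinated_captions += bool(hallucinated)
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_captions / max(len(captions_objects), 1)
    return chair_i, chair_s

# Example: one of two captions mentions a "chair" that isn't in the image.
print(chair_scores([{"table", "chair"}, {"table"}], {"table", "cup"}))
# (0.333..., 0.5)
```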

Key Results

1. Improvements on POPE

On the POPE benchmark (Table 1 below), we see consistent improvements. The F1 scores for models equipped with HELPD (“w/ ours”) are higher across the board compared to the base models.

Table 1: Results of LVLMs under three evaluation settings of POPE on the validation set of MSCOCO. “Yes” denotes the proportion of answering “Yes” to the given question. “w/ ours” means the application of HELPD.

2. Reducing Hallucinations in Captions (CHAIR)

The CHAIR metric is critical because it measures free-form generation. Lower scores are better here (meaning fewer hallucinations).

Table 2: CHAIR hallucination evaluation results. “w/ ours” means the use of hierarchical feedback learning, and “Vep” is the vision-enhanced penalty decoding.

As shown in Table 2:

  • \(CHAIR_s\) (Sentence level hallucination) dropped significantly. For mPLUG-Owl2, it went from 46.6 down to 22.4. That is a massive reduction in errors.
  • Crucially, looking at the “Len” column, the length of the generated sentences didn’t drop. The model isn’t just saying less to be safe; it’s saying the same amount, but accurately.

3. Holistic Performance (MMHal-Bench)

The radar chart in Figure 5 visualizes performance across eight categories (such as counting, spatial relationships, and attributes). The red line (LLaVA + HELPD) consistently envelops the orange line (LLaVA base), showing broad improvements.

Figure 5: Detailed performance of LVLMs on the eight categories in MMHal-Bench, where “Overall” indicates the averaged performance across all categories. “w/ ours” means the application of HELPD.

Ablation Study: Do we need both levels of feedback?

A skeptic might ask: “Is the complicated GPT-4 feedback necessary? Can’t we just check for objects?”

Table 4: Ablation results on different levels of feedback on CHAIR and GAVIE. \(R_{obj}\) and \(R_{sen}\) represent object-level and sentence-level feedback, respectively.

Table 4 answers this.

  • Row 1: No feedback (Base model).
  • Row 2: Only Object feedback (\(R_{obj}\)).
  • Row 3: Only Sentence feedback (\(R_{sen}\)).
  • Row 4: Both.

Using both consistently yields the lowest hallucination scores (\(C_s\) and \(C_i\)). Interestingly, if you had to pick just one, sentence-level feedback (\(R_{sen}\)) appears slightly more powerful than object-level feedback, likely because it captures context better, but the combination is superior.

Qualitative Examples

Let’s look at a real-world correction to see HELPD in action.

Figure 8: An illustrative case is presented to compare the output of mPLUG-Owl2 and mPLUG-Owl2 with HELPD.

In Figure 8, the base model (mPLUG-Owl2) hallucinates that the lion is “standing over a kill,” likely because lions in photos are often doing that. It ignores the visual evidence of the chase. The model with HELPD correctly identifies the action: the lion is “chasing a wildebeest.” The generated text is grounded in the actual visual dynamics of the image.

Conclusion

The HELPD framework represents a significant step forward in making Multimodal AI reliable. By recognizing that hallucination stems from both how the model learns (training incentives) and how the model speaks (decoding strategies), the authors provide a robust solution.

The key takeaways for students and researchers are:

  1. Metrics matter in training: Moving beyond simple Cross-Entropy to Reinforcement Learning allows us to optimize for complex goals like “factual consistency,” not just “word probability.”
  2. Decoding is active: We don’t have to accept the raw output of a model. By inspecting the attention mechanisms during inference (Vision-enhanced Penalty Decoding), we can steer the model back towards the image and away from its own hallucinations.
  3. Hierarchy is essential: Evaluating truth requires checking both the atomic facts (objects) and the holistic picture (semantics).

HELPD achieves all this with only a marginal amount of additional training, making it a practical and efficient addition to the LVLM toolkit.