Image captioning is one of the most fundamental intersections of Computer Vision and Natural Language Processing (NLP). It requires a machine to look at an image and describe it in human language. In recent years, Vision-Language Models (VLMs) like BLIP and GIT have become incredibly fluent, generating detailed and grammatically correct descriptions.

But they have a lying problem.

In the field of AI, we call this hallucination. This occurs when a model generates details—objects, actions, or attributes—that simply aren’t present in the image. This isn’t just a quirk; it is a critical reliability issue. If an AI describes a “man holding a gun” when he is holding a drill, or a “child on a skateboard” when they are jumping on stairs, the consequences range from user frustration to dangerous misinformation.

Today, we are diving deep into a research paper titled “Mitigating Open-Vocabulary Caption Hallucinations.” The authors propose a two-pronged solution: a new benchmark called OpenCHAIR to better measure these errors, and a reinforcement learning framework called MOCHa to fix them.

Hallucinated details in modern image captioning models.

As shown in Figure 1, even state-of-the-art models like BLIP-2 can hallucinate a “skateboard” simply because the context (people jumping) is statistically correlated with skateboarding in the training data. The proposed MOCHa framework aims to correct this, generating the factually accurate caption: “Several people jumping up and down a flight of stairs.”

The Problem: Closed vs. Open Vocabularies

To solve hallucinations, we first need to measure them. For years, the gold standard metric has been CHAIR (Caption Hallucination Assessment with Image Relevance).

CHAIR works by checking captions against a fixed list of 80 common objects from the MS-COCO dataset (e.g., “dog,” “chair,” “car”). If the model mentions a “dog” but the ground truth annotations for that image don’t list a “dog,” it counts as a hallucination.
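To make this concrete, here is a minimal sketch of CHAIR-style counting. It assumes a toy synonym map and naive word matching; the real metric uses the full MS-COCO synonym lists and more careful parsing.

```python
# Minimal sketch of CHAIR-style hallucination counting (toy synonym map,
# naive word matching; the real metric uses the full MS-COCO synonym lists).

COCO_SYNONYMS = {           # maps surface words to one of the 80 COCO categories
    "dog": "dog", "puppy": "dog",
    "chair": "chair",
    "car": "car", "sedan": "car",
    "bird": "bird", "duck": "bird", "goose": "bird",   # coarse bucket from Fig. 6
}

def extract_coco_objects(caption: str) -> set[str]:
    """Return the set of COCO categories mentioned in a caption."""
    words = caption.lower().replace(".", "").replace(",", "").split()
    return {COCO_SYNONYMS[w] for w in words if w in COCO_SYNONYMS}

def chair_instance(caption: str, ground_truth_objects: set[str]) -> float:
    """Fraction of mentioned COCO objects that are hallucinated (CHAIR_i-style)."""
    mentioned = extract_coco_objects(caption)
    if not mentioned:
        return 0.0
    hallucinated = mentioned - ground_truth_objects
    return len(hallucinated) / len(mentioned)

# The image contains a goose; the caption hallucinates a duck, but both map
# to the "bird" bucket, so nothing is counted as hallucinated.
print(chair_instance("A duck next to a chair.", {"bird", "chair"}))  # 0.0
```

Anything the model mentions that falls outside the synonym map is simply invisible to the metric, which is exactly the limitation the next section discusses.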

However, the real world contains more than 80 objects. The authors of this paper argue that existing methods ignore the long-tailed nature of hallucinations. Modern generative models are “open-vocabulary”—they can talk about anything from “grandfathers” to “pinecones” to “space shuttles.” If we only test them on 80 categories, we are missing the vast majority of errors they make.

CHAIR Limitations vs OpenCHAIR.

Figure 6 illustrates the limitations of the legacy CHAIR metric.

  1. Limited Vocabulary (Left): The model predicts “Scissors,” “Pencil,” “Spool,” and “Thread.” CHAIR only knows “Scissors.” It ignores the other predictions entirely, failing to evaluate whether the “Pencil” is real or hallucinated.
  2. Coarse Synonyms (Right): CHAIR groups objects into broad categories. It considers “Goose” and “Duck” to be synonyms for “Bird.” If the image contains a goose, but the model hallucinates a duck, CHAIR marks this as correct because they fall in the same bucket.

To build trustworthy AI, we need a benchmark that operates in the open-vocabulary setting.

Contribution 1: The OpenCHAIR Benchmark

The researchers introduce OpenCHAIR, a benchmark designed to evaluate hallucinations across a massive variety of objects, not just a pre-defined list.

Creating a dataset with diverse objects and accurate labels is expensive and time-consuming if done manually. The authors devised a clever automated pipeline leveraging Generative AI to build this benchmark.

The OpenCHAIR Benchmark construction pipeline.

How OpenCHAIR is Built

As outlined in Figure 2, the process involves two phases: Dataset Construction and Evaluation.

  1. Dataset Construction (Left):
  • Seed: They take existing captions from the COCO dataset.
  • LLM Expansion: A Large Language Model (LLM) rewrites these captions to include diverse, specific, and rare objects (e.g., changing “A dog near a tree” to “A dragon near a castle”).
  • Image Synthesis: These new, rich captions are fed into a text-to-image diffusion model (Stable Diffusion XL) to generate synthetic images that match the text perfectly.
  • Result: A dataset of images paired with ground-truth captions containing over 2,400 unique object types—a 30x increase over CHAIR’s 80 objects.
  2. Evaluation (Right):
  • A captioning model (the model being tested) looks at the synthetic image and generates a description.
  • The system parses the objects in that description.
  • LLM as a Judge: Instead of checking against a hard-coded list, an LLM compares the predicted object against the ground truth caption. It asks: Does the ground truth caption imply the presence of this object? (A sketch of both phases of this pipeline follows the list below.)
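Below is a rough sketch of both phases under some assumptions: `query_llm` is a hypothetical helper standing in for whatever LLM you call, the prompts are illustrative rather than the paper’s exact wording, and image synthesis uses the public Stable Diffusion XL checkpoint via the `diffusers` library.

```python
import torch
from diffusers import DiffusionPipeline

# --- Dataset construction (hypothetical LLM helper; SDXL via diffusers) ---

def query_llm(prompt: str) -> str:
    """Hypothetical helper wrapping whatever LLM/API you have access to."""
    raise NotImplementedError

def expand_caption(seed_caption: str) -> str:
    """Ask the LLM to rewrite a COCO caption so it mentions diverse, less common objects."""
    return query_llm(
        "Rewrite this image caption so it mentions different, less common "
        f'objects, keeping it one sentence: "{seed_caption}"'
    )

sdxl = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def synthesize_image(caption: str):
    """Generate a synthetic image whose ground truth is, by construction, the caption."""
    return sdxl(caption).images[0]

# --- Evaluation: LLM as a judge over open-vocabulary objects ---

JUDGE_PROMPT = (
    'Ground-truth caption: "{gt}"\n'
    'Does this caption imply that the object "{obj}" appears in the image? '
    "Answer with one word: yes, no, or unsure."
)

def judge_objects(ground_truth: str, predicted_objects: list[str]) -> dict[str, str]:
    """Label each object parsed from the tested model's generated caption."""
    verdicts = {}
    for obj in predicted_objects:
        answer = query_llm(JUDGE_PROMPT.format(gt=ground_truth, obj=obj)).strip().lower()
        # "no" means the ground truth rules the object out -> hallucination.
        verdicts[obj] = "hallucinated" if answer == "no" else ("ok" if answer == "yes" else "unsure")
    return verdicts

# Example in the spirit of Figure 4: the tested captioner says
# "A man playing the guitar" for an image whose ground truth is a child playing drums.
# judge_objects("A child playing drums in a garage", ["man", "guitar"])
```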

OpenCHAIR vs. CHAIR evaluation logic.

This evaluation method allows for much finer granularity. As shown in Figure 4, if a model predicts “A man playing the guitar” for an image of a child playing drums:

  • CHAIR might miss the error if “guitar” isn’t in its list, or excuse “man” as a synonym for “child.”
  • OpenCHAIR correctly identifies that “Man” and “Guitar” are hallucinations because they contradict the specific ground truth of the scene.

Contribution 2: The MOCHa Framework

Having established a way to measure the problem, the authors introduce MOCHa (Multi-Objective reinforcement learning for Caption Hallucinations) to fix it.

Why not just standard training?

Most image captioning models are trained using “Teacher Forcing” (minimizing cross-entropy loss). This teaches the model to predict the next token in a sequence. However, factual groundedness is a sequence-level property, not a token-level one. Predicting a word that is statistically probable but factually wrong (like “skateboard” in the jumping context) keeps the token-level loss low while ruining fidelity.
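For contrast, here is what the standard token-level objective looks like as a minimal PyTorch sketch; note that nothing in it ever evaluates the sentence as a whole.

```python
import torch
import torch.nn.functional as F

# Toy teacher-forcing step: `logits` has shape (batch, seq_len, vocab_size) and
# `targets` holds the ground-truth caption token ids, shape (batch, seq_len).
def teacher_forcing_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Shift so position t predicts token t+1, as in standard language modeling.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    gold = targets[:, 1:].reshape(-1)
    # Cross-entropy is averaged over individual tokens: swapping one
    # plausible-but-wrong word (e.g. "skateboard") for the right one pays only
    # a small, local penalty, even though the whole sentence is now false.
    return F.cross_entropy(pred, gold)
```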

MOCHa uses Reinforcement Learning (RL), specifically Proximal Policy Optimization (PPO), to fine-tune the model. This allows the system to reward the model for the quality of the entire sentence rather than just the probability of the next word.
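Conceptually, the fine-tuning loop looks like the simplified sketch below. The paper uses PPO, which adds a clipped objective and a value baseline; this bare policy-gradient version, with placeholder `model`, `tokenizer`, and `reward_fn` standing in for a BLIP-style captioner, only illustrates the key shift: the reward is computed once for the entire sampled caption, and every token shares the credit.

```python
import torch

def rl_caption_step(model, tokenizer, pixel_values, reward_fn, optimizer):
    """One simplified sequence-level policy-gradient update (the paper uses PPO,
    which adds clipping and a value baseline on top of this basic idea)."""
    # 1. Sample a complete caption from the current policy.
    sampled_ids = model.generate(pixel_values=pixel_values, do_sample=True, max_new_tokens=40)
    caption = tokenizer.decode(sampled_ids[0], skip_special_tokens=True)

    # 2. Score the whole sentence with a sequence-level reward
    #    (in MOCHa: fidelity + adequacy, minus a KL penalty).
    reward = reward_fn(caption)  # a single scalar for the full caption

    # 3. Scale the log-probability of the sampled tokens by that single reward,
    #    so every token shares credit for the sentence-level outcome.
    #    (Placeholder forward call; adapt to your captioner's interface.)
    logits = model(pixel_values=pixel_values, input_ids=sampled_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    chosen = log_probs.gather(-1, sampled_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    loss = -reward * chosen.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return caption, reward
```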

MOCHa scheme.

The Multi-Objective Reward Function

The core innovation of MOCHa is its reward function. If you only punish hallucinations, the model might become too scared to say anything specific, generating vague captions like “An image of a thing.” If you only reward detail, it might start hallucinating again.

MOCHa balances these competing needs using three distinct reward components, as shown in the diagram above (and sketched in code after the list):

  1. Fidelity (The Truth Monitor):
  • They use a Natural Language Inference (NLI) model.
  • This model checks if the generated caption logically contradicts the ground truth caption.
  • If the model says “cat” and the image contains a “dog,” the NLI model flags a contradiction. This is the primary defense against hallucinations.
  2. Adequacy (The Detail Enforcer):
  • They use BERTScore, a metric that compares the semantic similarity of the generated caption to the reference.
  • This ensures the model covers the necessary details and remains descriptive.
  3. KL-Penalty (The Anchor):
  • This is a regularization term. It forces the model not to deviate too wildly from its original learned probability distribution.
  • This prevents the model from “gaming” the reward system (producing garbage text that happens to trick the NLI model) and maintains the fluency of the language.
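A rough sketch of how these three signals can be combined is shown below, using an off-the-shelf NLI checkpoint and the `bert_score` package. The single-\(\alpha\) weighting and the KL coefficient \(\beta\) follow the description above but simplify the paper’s exact formulation; the model names here are just reasonable choices, not the authors’ exact setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from bert_score import score as bertscore

# Off-the-shelf NLI checkpoint used purely as an illustration; label names
# and indices depend on the checkpoint you pick.
NLI_NAME = "roberta-large-mnli"
nli_tok = AutoTokenizer.from_pretrained(NLI_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_NAME)

def fidelity_reward(generated: str, reference: str) -> float:
    """1 - P(contradiction): high when the caption does not contradict the reference."""
    inputs = nli_tok(reference, generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**inputs).logits.softmax(dim=-1)[0]
    return 1.0 - probs[nli_model.config.label2id["CONTRADICTION"]].item()

def adequacy_reward(generated: str, reference: str) -> float:
    """BERTScore F1: high when the caption covers the reference's content."""
    _, _, f1 = bertscore([generated], [reference], lang="en")
    return f1.item()

def combined_reward(generated, reference, kl_term, alpha=0.5, beta=0.1) -> float:
    # Weighted fidelity/adequacy mix, minus a KL penalty that keeps the policy
    # close to its pretrained distribution (kl_term comes from the RL loop).
    return (alpha * fidelity_reward(generated, reference)
            + (1 - alpha) * adequacy_reward(generated, reference)
            - beta * kl_term)
```

Shifting \(\alpha\) toward fidelity or adequacy is exactly the knob explored in the trade-off analysis later in the post.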

Ablation: Why we need all three

The synergy between these objectives is crucial. The authors provide a compelling visual ablation study demonstrating what happens when you remove parts of the reward function.

Ablating the multi-objective reward function.

  • No Optimization (\(\emptyset\)): The base model hallucinates a “surfer” or “woman.”
  • No Fidelity (\(-r_f\)): Without the truth monitor, the model hallucinates wildly (e.g., seeing a “surfer” in a field).
  • No Adequacy (\(-r_a\)): Without the pressure to be descriptive, the model becomes overly cautious and vague, outputting sentences like “Spectators could enjoy the old fashions,” which says almost nothing about the visual content.
  • Full MOCHa (\(r\)): The model produces specific, accurate captions: “A vintage car parked on a field.”

Experiments and Results

The researchers applied MOCHa to several state-of-the-art models, including BLIP, BLIP-2, and GIT. The results show a consistent improvement across the board.

Quantitative Success

Looking at Figure 7 below, we can see the relative improvement percentages.

Quantitative results graph.

  • Fidelity (Left): Metrics that track truthfulness, such as NLI scores and OpenCHAIR (OCH), show significant improvement: the models hallucinate markedly less.
  • Quality (Right): Crucially, standard caption quality metrics like BERTScore and CIDEr also improve. This confirms that MOCHa isn’t just making the models “safer” by making them boring; it is actually making them better captioners overall.

Qualitative Success

The numbers are backed up by visual evidence. In the figure below, we see a comparison between the baseline (B) and the MOCHa-tuned version (B+M).

Qualitative results of MOCHa.

  • Left Image: The baseline sees a man in a “suit and tie.” MOCHa correctly identifies the “military uniform.”
  • Middle Image: The baseline guesses “apples.” MOCHa generalizes correctly to “pan of food,” realizing it cannot identify the specific baked goods with certainty.
  • Right Image: The baseline mentions a “cell phone” (likely a statistical guess based on the hand pose). MOCHa corrects this to “using a laptop computer,” which is actually visible.

The Fidelity-Adequacy Trade-off

One of the most interesting findings is the “Pareto frontier” created by tuning the weighting of the rewards (the \(\alpha\) parameter).

Fidelity-Adequacy graphs.

As shown in Figure 9, there is a trade-off. As you increase the weight of the Fidelity reward, Adequacy might drop (the model gets more concise). However, the MOCHa optimization curve (the connected dots) bulges outward from the initial model (red dot). This means there are configurations where both fidelity and adequacy are better than the original model. The optimization pushes the model toward a “sweet spot” of descriptive accuracy.

The Broader Landscape

To understand where MOCHa fits, it helps to look at the taxonomy of hallucination research.

VLM Caption Hallucination Taxonomy.

Previous work has largely focused on Closed Vocabulary approaches (the red chairs in the diagram), trying to fix hallucinations for specific lists of objects. MOCHa and OpenCHAIR represent a shift toward Open Vocabulary (the green chairs), acknowledging that in the era of Generative AI, our models need to be robust enough to handle the infinite variety of the real world.

Conclusion

The “Mitigating Open-Vocabulary Caption Hallucinations” paper represents a maturing of the Vision-Language field. We are moving past the phase of simply being impressed that a computer can write a sentence, and entering a phase where we demand reliability and factual grounding.

By creating OpenCHAIR, the authors have given the community a much-needed yardstick to measure hallucinations in a realistic, open-ended context. With MOCHa, they have demonstrated that Reinforcement Learning, guided by a clever multi-objective reward function, can effectively rein in the imagination of these models without stifling their descriptiveness.

For students and practitioners, the key takeaway is that accuracy is not just about training data; it’s about the objective function. Standard language modeling objectives are insufficient for factual grounding. To build AI we can trust, we must explicitly reward the model for telling the truth, not just for predicting the most probable next word.