In the rapidly evolving world of Computer Vision, we often equate “more” with “better.” More data, more parameters, and—recently—more words.
For years, image captioning models were trained on datasets like COCO, where a caption might be as simple as: “A dog sitting on a chair.” It’s accurate, but dry. With the rise of Large Language Models (LLMs) and Multimodal Models (like GPT-4V), researchers found a new trick: Generative Caption Enrichment (GCE). Instead of using short, human-written captions, we can ask an LLM to generate detailed, paragraph-long descriptions.
The logic is sound: if a model trains on richer, more descriptive text, it should understand images better, right?
According to a fascinating new paper by Hirota, Hachiuma, Yang, and Nakashima, the answer is “Yes, but at a steep cost.” Their research, titled “From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment,” reveals that while these enriched captions cover more visual details, they also introduce significant side effects: societal bias and hallucination.
In this post, we will break down their evaluation framework, analyze the trade-off between descriptiveness and accuracy, and discover why models trained on this “rich” data might actually be learning to be more prejudiced and less truthful.

The Shift to Generative Caption Enrichment (GCE)
To understand the problem, we first need to understand the technique. Traditional datasets rely on human annotators who write concise sentences. However, collecting millions of detailed human captions is expensive and slow.
Generative Caption Enrichment (GCE) solves this by using powerful AI models to rewrite or generate new captions for existing images. The paper examines three leading GCE methods:
- ShareGPT4V: Uses GPT-4 Vision to generate highly detailed captions.
- FuseCap: Uses object detectors to find items in an image, then uses an LLM (ChatGPT) to fuse those items into a coherent sentence.
- CapsFusion: Takes a caption from a standard model (BLIP) and fuses it with the original concise caption using ChatGPT.
These methods produce captions that are semantically rich. But as the authors ask: Are there negative side effects?
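To make the shared recipe concrete, here is a minimal sketch of the fusion step these methods have in common: a concise caption plus detector output goes into an LLM, which returns a richer caption. The `enrich_caption` function, its prompt wording, and the stub LLM below are illustrative assumptions, not code or prompts from the papers.

```python
from typing import Callable, List

def enrich_caption(
    original_caption: str,
    detected_objects: List[str],
    llm: Callable[[str], str],
) -> str:
    """Fuse a concise caption with detector output into a richer caption.

    `llm` is any text-in/text-out callable (e.g., a wrapper around ChatGPT);
    the prompt wording here is illustrative, not taken from the papers.
    """
    prompt = (
        "Rewrite the caption so that it naturally mentions the listed objects, "
        "without describing anything that is not supported by them.\n"
        f"Caption: {original_caption}\n"
        f"Detected objects: {', '.join(detected_objects)}\n"
        "Enriched caption:"
    )
    return llm(prompt).strip()

# Stub LLM for demonstration; a real pipeline would call an actual model here.
if __name__ == "__main__":
    stub = lambda _prompt: "A brown dog sitting on a wooden chair next to a red ball."
    print(enrich_caption("A dog sitting on a chair.", ["dog", "chair", "ball"], stub))
```

Note that nothing in this step forces the LLM to stick to the detected objects, which is exactly where the paper finds bias and hallucination creeping in.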
The Evaluation Framework
The researchers set up a rigorous experiment to compare standard captions (COCO) against the enriched captions generated by the three methods above. They measured performance across three pillars: Quality, Societal Bias, and Hallucination.
1. Measuring Quality
Quality isn’t just about grammar; it’s about content. The primary metric here is Recall. This measures how many objects actually present in the image are mentioned in the text.
\[
\text{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{r_i}{o_i}
\]

Here, \(r_i\) is the number of objects actually present in image \(i\) that the caption mentions, \(o_i\) is the total number of objects in the image, and \(N\) is the number of images. A higher recall means the caption is more descriptive.
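A minimal sketch of how this object-level recall could be computed, assuming we already have the annotated objects per image and the objects each caption mentions (extracting those mentions from the text is a separate step):

```python
from typing import Dict, Set

def object_recall(
    gt_objects: Dict[str, Set[str]],        # image id -> objects present in the image
    mentioned_objects: Dict[str, Set[str]]  # image id -> objects mentioned in its caption
) -> float:
    """Average per-image recall: r_i / o_i, averaged over images with annotations."""
    scores = []
    for image_id, objects in gt_objects.items():
        if not objects:
            continue  # skip images with no annotated objects
        r_i = len(objects & mentioned_objects.get(image_id, set()))
        scores.append(r_i / len(objects))
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: the caption mentions 2 of the 3 annotated objects.
print(object_recall({"img1": {"dog", "chair", "ball"}},
                    {"img1": {"dog", "chair", "frisbee"}}))  # 0.666...
```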
2. Measuring Societal Bias
The authors focused specifically on gender bias, as gender terms appear frequently in captions. They used three distinct metrics:
- Gender Error: Does the caption call a woman a “man” (or vice versa)?
- LIC: This leakage-based bias metric compares a gender classifier trained on the generated captions to one trained on the human-written (ground-truth) captions. If the caption-based classifier is “too accurate,” i.e., it can predict gender from stereotyped wording rather than visual evidence, the LIC score goes up. A high LIC indicates the model is relying on gender stereotypes (e.g., assuming a person in a kitchen is a woman).
- Recall Disparity: This checks whether the model notices objects differently depending on the gender of the person in the image. For example, does the model mention a “tie” more often when a man is wearing it than when a woman is? (A small sketch of these checks follows this list.)
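As a rough illustration, here is how the gender-error check and a per-object recall disparity could be computed from caption text. The word lists and the disparity formula are simplified stand-ins, not the paper's exact implementation.

```python
from typing import Dict

MASCULINE = {"man", "men", "boy", "boys", "he", "his", "him"}
FEMININE = {"woman", "women", "girl", "girls", "she", "her", "hers"}

def gender_error(caption: str, true_gender: str) -> bool:
    """Flag a caption that uses only the wrong gender's words (simplified check)."""
    words = set(caption.lower().split())
    says_m, says_f = bool(words & MASCULINE), bool(words & FEMININE)
    if true_gender == "female":
        return says_m and not says_f
    return says_f and not says_m

def recall_disparity(mention_rates: Dict[str, Dict[str, float]], obj: str) -> float:
    """Difference in how often `obj` is mentioned for images of men vs. women."""
    return mention_rates["male"].get(obj, 0.0) - mention_rates["female"].get(obj, 0.0)

# Toy example: handbags mentioned in 50% of images with men, 25% with women.
rates = {"male": {"handbag": 0.5}, "female": {"handbag": 0.25}}
print(gender_error("A man cooking in a kitchen.", "female"))  # True -> misgendered
print(recall_disparity(rates, "handbag"))                      # 0.25 (skewed toward men)
```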

3. Measuring Hallucination
Hallucination in AI refers to the generation of content that isn’t grounded in reality. In image captioning, this means describing objects that aren’t in the picture. The authors used the CHAIR (Caption Hallucination Assessment with Image Relevance) metrics.
CHAIR\(_i\) (Instance): What percentage of the objects mentioned in the caption don’t actually exist in the image?

\[
\text{CHAIR}_i = \frac{|\{\text{hallucinated objects mentioned}\}|}{|\{\text{all objects mentioned}\}|}
\]

CHAIR\(_s\) (Sentence): What percentage of captions contain at least one hallucinated object?

\[
\text{CHAIR}_s = \frac{|\{\text{captions with at least one hallucinated object}\}|}{|\{\text{all captions}\}|}
\]

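Both scores are easy to compute once we know, for each caption, which objects it mentions and which objects are actually in the image. A minimal sketch, with that extraction step assumed done:

```python
from typing import List, Set, Tuple

def chair_scores(samples: List[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Return (CHAIR_i, CHAIR_s).

    Each sample is (mentioned_objects, ground_truth_objects) for one caption.
    CHAIR_i: fraction of all mentioned objects that are not in the image.
    CHAIR_s: fraction of captions containing at least one such object.
    """
    total_mentions, hallucinated_mentions, hallucinated_captions = 0, 0, 0
    for mentioned, present in samples:
        bad = mentioned - present
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        hallucinated_captions += bool(bad)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / len(samples) if samples else 0.0
    return chair_i, chair_s

# Toy example: the second caption mentions a "frisbee" that is not in the image.
print(chair_scores([({"dog", "chair"}, {"dog", "chair"}),
                    ({"dog", "frisbee"}, {"dog"})]))  # (0.25, 0.5)
```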
The Experiment: Upstream vs. Downstream
This is the most critical part of the paper’s methodology. The authors didn’t just look at the text; they looked at the impact of the text.
- Upstream Analysis: Analyzing the datasets themselves. How biased are the captions generated by GPT-4V or FuseCap?
- Downstream Analysis: Training a new Image Captioning model (BLIP) on these enriched datasets and testing it. Does the model learn the bad habits of the dataset?
Results: The Cost of Richness
The results paint a clear picture of a trade-off. As we pursue richer descriptions, we inadvertently invite bias and inaccuracies.
Observation 1: More Descriptive = More Biased
The data shows a strong correlation: as captions become more descriptive (higher Recall), gender bias increases.
Take a look at the chart below. The x-axis represents Recall (descriptiveness), and the y-axis represents LIC (bias). You can see a near-perfect linear relationship (\(R^2 = 0.99\)). The standard COCO captions (blue dot) are low on recall but also near-zero on bias. ShareGPT4V (red dot), which is the most descriptive, has the highest bias score.
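A fit like that is straightforward to reproduce once you have one (Recall, LIC) point per caption source. The values below are placeholders, not the paper's numbers:

```python
import numpy as np

# Hypothetical (recall, LIC) pairs for four caption sources -- placeholder values.
recall = np.array([0.45, 0.55, 0.62, 0.78])
lic    = np.array([0.5, 4.0, 7.5, 14.0])

slope, intercept = np.polyfit(recall, lic, deg=1)  # least-squares line
predictions = slope * recall + intercept
ss_res = np.sum((lic - predictions) ** 2)          # residual sum of squares
ss_tot = np.sum((lic - lic.mean()) ** 2)           # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(f"LIC ~ {slope:.1f} * recall + {intercept:.1f}, R^2 = {r_squared:.2f}")
```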

Why does this happen? LLMs are trained on vast amounts of internet text, which contain societal stereotypes. When an LLM “enriches” a caption, it doesn’t just look at the pixels; it relies on its internal statistical probabilities. If it sees a kitchen, it might statistically infer a female presence or use stereotypically feminine adjectives, even if the image doesn’t support it.
This bias is further highlighted by Recall Disparity. The chart below shows how often specific objects are mentioned depending on the gender of the person in the image.

Notice the “Handbag” category. In standard COCO captions (striped bars), the disparity is low. In ShareGPT4V (solid bars), there is a massive disparity skewed toward men: how often a handbag gets mentioned depends heavily on the gender of the person in the image, whether because the model selectively notices it or because it hallucinates it from gendered context. The enriched captions are treating objects differently based on who is holding them.
Observation 2: More Descriptive = More Hallucinations
The trend continues with hallucinations. The more an LLM writes, the more likely it is to make things up.

Standard COCO captions rarely hallucinate (CHAIR\(_s\) is near 0). ShareGPT4V, however, has a hallucination rate of over 20%. When an LLM tries to write a “story” about an image, it often embellishes details—adding colors, emotions, or background objects that simply aren’t there.
Observation 3: The Amplifier Effect (Downstream)
Perhaps the most concerning finding is what happens when we train new models on this data. You might hope a model would filter out the noise. Instead, the downstream models amplify the problems.
Let’s look at the raw numbers.

In the table above, look at the Downstream section. Models trained on ShareGPT4V inherited high bias (LIC 14.3) and high hallucination (CHAIR\(_s\) 21.5).
The amplification is even clearer in the difference table below. Positive numbers (Red) mean the problem got worse after training.

Models trained on enriched captions amplified gender bias by an average of 30.9% and increased hallucination by 59.5%. This suggests that these models are not just memorizing the data; they are latching onto the biases and hallucinations as “patterns” to be learned and repeated.
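Those percentages are relative changes between the upstream (dataset) scores and the downstream (trained-model) scores. A quick sanity check with placeholder numbers, assuming the amplification is reported as a relative increase:

```python
def relative_change(upstream: float, downstream: float) -> float:
    """Percent change from the dataset's score to the trained model's score."""
    return (downstream - upstream) / upstream * 100

# Hypothetical scores: if a dataset had LIC 10.0 and the model trained on it
# reached LIC 13.1, bias was amplified by ~31% -- the same order of magnitude
# as the averages quoted above (placeholder numbers, not the paper's table).
print(f"{relative_change(10.0, 13.1):.1f}%")  # 31.0%
```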
Qualitative Examples: Seeing the Failure Modes
To make this concrete, let’s look at some examples where the enriched captions failed.
In Figure 5 below, comparing COCO to ShareGPT4V, we see the “richness” in action. The ShareGPT4V caption is a paragraph long. However, look at the yellow highlighted text. It describes a “building with a gray roof” and “intricate architectural details.” Looking at the image (bottom), it’s a blurry background that barely supports that description. The model is hallucinating details to make the scene sound more “complete.”

In Figure 6 (FuseCap), the failures are even more bizarre. In the top image, FuseCap hallucinates a “red wall” and a “multi-colored tie” for a person waterskiing. In the bottom image, it hallucinates a “man in black glasses” standing nearby, who simply does not exist in the photo. This is a classic case of the model probabilistically predicting what usually appears in restaurant scenes, rather than sticking to what is actually visible.

Conclusion: The Double-Edged Sword
This research serves as a crucial “check engine” light for the computer vision community. We are currently in an era where we assume that using LLMs to expand our datasets is a free win.
The authors conclude that while Generative Caption Enrichment (GCE) does improve the descriptiveness of captions (higher recall), it introduces a “dark side”:
- Exacerbated Gender Bias: LLMs inject stereotypes into image descriptions.
- Increased Hallucination: The drive for detail leads to fabrication.
- Downstream Amplification: Models trained on this data don’t just copy these errors; they magnify them.
What should we do?
The authors suggest that we cannot rely solely on automated enrichment. We need:
- Human-in-the-loop systems: To verify and correct LLM outputs.
- Better Metrics: We need to measure bias and hallucination as standard practice, not just recall or accuracy.
- Balance: We must strike a balance between descriptive richness and factual integrity.
As we build the next generation of Vision-Language Models, we must remember: a picture is worth a thousand words, but only if those words are true.