Artificial Intelligence has a bias problem. If you ask a standard Text-to-Image (T2I) model to generate a picture of a “CEO,” it will likely show you a white man in a suit. If you ask for a “nurse,” it will likely show you a woman. To combat these stereotypes, developers have introduced “diversity interventions”—hidden instructions that prompt the model to inject gender and racial variety into its outputs.
But what happens when we apply these modern diversity standards to specific historical events?
Imagine asking an AI to depict the “Founding Fathers of the United States.” A historically accurate image would depict a group of white men. However, a model operating under a blanket diversity intervention might generate an image featuring women and people of color signing the Declaration of Independence. While well-intentioned, this output is historically nonfactual.
This phenomenon is what researchers from UCLA call the “Factuality Tax.” In their recent paper, The Factuality Tax of Diversity-Intervened Text-to-Image Generation, they explore the trade-off between promoting social diversity and maintaining historical accuracy. They introduce a new benchmark, DoFaiR, to measure this problem, and a new method, Fact-Augmented Intervention (FAI), to solve it.
In this deep dive, we will unpack how diversity prompts distort historical reality, how the researchers measured this distortion, and the clever retrieval-augmented solution they proposed to fix it.
The Core Problem: Diversity vs. Factuality
The tension lies between two desirable goals: representation and accuracy.
Text-to-Image models like DALL-E 3 and Stable Diffusion are trained on massive datasets from the internet, which contain deep-seated social biases. Without intervention, these models reinforce stereotypes. The industry standard solution has been to append “diversity prompts” to user queries behind the scenes—instructions like “ensure diverse racial and gender representation.”
However, these interventions are often “blind.” They apply the same logic to a prompt about “a group of modern engineers” as they do to “Dutch colonial forces in the 19th century.”
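To make the mechanism concrete, here is a minimal sketch of how such a behind-the-scenes intervention might be wired up. The suffix wording and function name are illustrative assumptions, not the actual hidden instructions used by any production system.

```python
# Minimal sketch of a "blind" diversity intervention (illustrative only).
# DIVERSITY_SUFFIX and apply_blind_intervention are assumptions, not the
# actual hidden instructions used by DALL-E 3 or Stable Diffusion.

DIVERSITY_SUFFIX = (
    " Depict the people irrespective of their skin color, race, or gender, "
    "ensuring diverse demographic representation."
)

def apply_blind_intervention(user_prompt: str) -> str:
    """Append the same diversity instruction to every prompt,
    regardless of whether the subject is generic or historical."""
    return user_prompt + DIVERSITY_SUFFIX

# The intervention treats both of these prompts identically:
print(apply_blind_intervention("A group of modern engineers at work"))
print(apply_blind_intervention("Dutch colonial forces in the 19th century"))
```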
As illustrated in Figure 1, this leads to a dilemma.

In the top example, the default model generates a historically accurate (factual) depiction of the Founding Fathers. In the bottom example, a diversity intervention (“irrespective of their skin color or race”) causes the model to generate a nonfactual image. The researchers pose a critical question: Does achieving diversity require us to pay a tax in factuality?
Building a Ruler: The DoFaiR Benchmark
To answer that question quantitatively, the researchers needed a dataset. They couldn’t simply rely on anecdotal evidence; they needed a rigorous benchmark containing historical events where the demographics are known facts.
They created DoFaiR (DemOgraphic FActualIty Representation), a benchmark consisting of 756 meticulously curated entries. Each entry contains a historical event, a participant class (e.g., “Founding Fathers”), and the ground-truth demographic distribution.
The Data Construction Pipeline
Creating this dataset manually would be prohibitively slow. Instead, the authors designed a sophisticated automated pipeline involving Large Language Models (LLMs) and web retrieval.

The pipeline operates in three stages:
- Event & Participant Generation: An LLM generates a list of historical events across different time periods (from the 1700s to present) and cultures (Africa, Asia, Europe, Americas).
- Fact Retrieval: The system generates search queries to find the demographic makeup of the participants in those events. It retrieves documents from Wikipedia to ensure reliability.
- Fact Labeling: Another LLM reads the retrieved documents to determine the Dominant group (e.g., White, Male) and the Involved groups (e.g., White, Black, Male).
This process ensured the dataset was balanced across different regions and time periods, preventing the benchmark itself from being biased toward Western history.
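The sketch below outlines how these three stages might fit together in code. The helper calls (`generate_events`, `search_wikipedia`, `label_demographics`) are hypothetical placeholders standing in for the LLM and retrieval components, not the authors' actual implementation.

```python
# Sketch of the three-stage DoFaiR construction pipeline described above.
# `llm` and `retriever` are placeholder objects; their methods are assumptions.

from dataclasses import dataclass, field

@dataclass
class DoFaiREntry:
    event: str                                     # e.g. "Establishment of US"
    participants: str                              # e.g. "Founding Fathers"
    dominant: dict = field(default_factory=dict)   # e.g. {"race": "White", "gender": "Male"}
    involved: dict = field(default_factory=dict)   # e.g. {"race": ["White"], "gender": ["Male"]}

def build_benchmark(llm, retriever) -> list[DoFaiREntry]:
    entries = []
    # Stage 1: an LLM proposes events across time periods and regions.
    for event, participants in llm.generate_events(
        periods="1700s-present",
        regions=["Africa", "Asia", "Europe", "Americas"],
    ):
        # Stage 2: retrieve Wikipedia passages about the participants' demographics.
        docs = retriever.search_wikipedia(f"demographics of {participants} in {event}")
        # Stage 3: a second LLM reads the passages and labels the facts.
        dominant, involved = llm.label_demographics(event, participants, docs)
        entries.append(DoFaiREntry(event, participants, dominant, involved))
    return entries
```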

As shown in Figure 4, the resulting dataset covers a wide spectrum of racial and gender distributions, ensuring that the evaluation tests the model’s ability to handle diverse facts, not just diverse outputs. To double-check the quality, human experts verified a subset of the data, finding a high degree of accuracy (>92%) in the automated labels.

The Evaluation Pipeline
With the dataset in hand, the researchers needed a way to score the T2I models. How do you measure the “factuality” of an image?
The evaluation process, visualized in Figure 2, involves generating an image for a specific historical event and then using computer vision to analyze the faces in that image.

The pipeline works as follows:
- Prompt: Input the historical event (e.g., “Establishment of US”) and participant class.
- Generation: The T2I model (e.g., DALL-E 3) generates the image.
- Face Detection: A system identifies all human faces in the image.
- Demographic Classification: The FairFace classifier analyzes each face to predict its race (White, Black, Indian, East Asian, etc.) and gender.
- Comparison: The predicted distribution is compared against the ground truth from the DoFaiR dataset.
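A minimal sketch of this evaluation loop is shown below. The callables `generate_image`, `detect_faces`, and `classify_face` are stand-ins for the T2I model, a face detector, and a FairFace-style demographic classifier; their signatures are assumptions for illustration.

```python
# Sketch of the per-image evaluation step: generate, detect faces,
# classify demographics, and compare against the DoFaiR ground truth.

from collections import Counter

def evaluate_image(prompt: str, ground_truth_involved: set[str],
                   generate_image, detect_faces, classify_face) -> dict:
    image = generate_image(prompt)                          # e.g. DALL-E 3
    faces = detect_faces(image)                             # crop every detected face
    predictions = [classify_face(face) for face in faces]   # e.g. "White", "Black", ...

    counts = Counter(predictions)
    predicted_dominant = counts.most_common(1)[0][0] if counts else None
    predicted_involved = set(counts)

    return {
        "predicted_dominant": predicted_dominant,
        "predicted_involved": predicted_involved,
        "matches_ground_truth": predicted_involved == ground_truth_involved,
    }
```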
The Metrics: Calculating the Tax
To quantify the “tax,” the authors defined four specific mathematical metrics. These equations help translate visual errors into hard numbers.
1. Dominant Demographic Accuracy (DDA): This measures whether the generated image correctly depicts the majority group. For the Founding Fathers, if the image shows mostly white men, the DDA is high.

2. Involved Demographic Accuracy (IDA): This is stricter. It asks: Did the model include all the groups that were present, and exclude the ones that weren’t?

3. Involved Demographic F-1 Score (IDF): This is a balanced score that accounts for both precision (not generating wrong groups) and recall (not missing right groups).

4. Factual Diversity Divergence (FDD): This metric is unique to this study. It measures the difference between the diversity in the image and the diversity in reality. A high FDD score means the model is “forcing” diversity where it doesn’t historically exist.
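The paper's exact equations are not reproduced here, but the following sketch captures the intuition behind the four metrics as described above. These simplified, per-image set-based versions are my own reading of the definitions, not the authors' formulas.

```python
# Illustrative implementations of the four metrics, based on the descriptions
# above. Simplified per-image versions, not the paper's exact equations.

def dda(predicted_dominant: str, true_dominant: str) -> float:
    """Dominant Demographic Accuracy: 1 if the majority group is correct."""
    return float(predicted_dominant == true_dominant)

def ida(predicted: set[str], truth: set[str]) -> float:
    """Involved Demographic Accuracy: 1 only if the predicted set of groups
    exactly matches the ground truth (no missing and no extra groups)."""
    return float(predicted == truth)

def idf(predicted: set[str], truth: set[str]) -> float:
    """Involved Demographic F1: balances precision (no wrong groups)
    and recall (no missing groups)."""
    overlap = len(predicted & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)

def fdd(predicted: set[str], truth: set[str]) -> float:
    """Factual Diversity Divergence: how far the diversity in the image
    departs from the diversity in reality, here the gap in group counts."""
    return abs(len(predicted) - len(truth))
```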

The Results: The Cost of Intervention
The researchers tested two major models: Stable Diffusion v2.0 and DALL-E 3. They compared a baseline (no intervention) against prompts specifically designed to boost diversity (e.g., “irrespective of skin color”).
The quantitative results were striking.

Key Takeaways from the Data:
- The Tax is Real: Looking at the table, when diversity interventions are applied (the rows labeled “Diversity Intervention”), the DDA (Accuracy) drops significantly compared to the Baseline. For DALL-E 3 on racial data, accuracy dropped from roughly 77% to 72%.
- Diversity Explodes: The FDD (Divergence) score spikes. This confirms that the models are inserting demographic groups that were not historically present.
- Race is Harder than Gender: The models struggled more with racial factuality than gender factuality. The error rates for racial distributions were consistently higher.
Qualitative analysis (looking at the actual images) reinforces these numbers. In Figure 5, the charts show that in many cases where diversity increased, correctness decreased.

Figure 6 provides visceral examples of these failures. In the bottom left, the “Battle of Carabobo” involved predominantly Black participants, but the diversity intervention inserted white and other non-Black soldiers, actually diluting the historical diversity of the event. This is a crucial irony: blind diversity interventions can sometimes erase actual historical diversity.

The Solution: Fact-Augmented Intervention (FAI)
Simply turning off diversity interventions isn’t a good solution, because we still want to avoid bias in non-historical or generic contexts. The researchers proposed a smarter approach: Fact-Augmented Intervention (FAI).
The idea is to give the model a “history lesson” before it draws the picture.
Why “Think Step-by-Step” Failed
First, they tried a standard technique called “Chain-of-Thought” (CoT), prompting the model to “think step by step” about the demographics. As seen in Figure 7 (second column), this failed. The model might reason correctly that a group was all-male, but the strong diversity instruction would override that logic, resulting in a nonfactual image anyway.

The FAI Method
The researchers developed two variations of Fact-Augmented Intervention:
FAI-VK (Verbalized Knowledge): An LLM (such as GPT-4) is prompted to verbalize the factual demographic distribution from its own parametric knowledge and turn it into a specific instruction for the image generator.
FAI-RK (Retrieved Knowledge): This is the more robust version. The system retrieves actual Wikipedia documents about the event, summarizes the demographic facts, and then constructs the prompt for the image generator.
The prompts used for these interventions are explicit and grounded in evidence.
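Below is a minimal sketch of the FAI-RK idea, assuming a retriever and a summarizing LLM are available as placeholder objects. The prompt template is illustrative, not the paper's exact wording.

```python
# Sketch of Fact-Augmented Intervention with Retrieved Knowledge (FAI-RK).
# `retriever` and `llm` are placeholders; the prompt template is an assumption.

def fai_rk_prompt(event: str, participants: str, retriever, llm) -> str:
    # 1. Retrieve evidence about the event's participants.
    docs = retriever.search_wikipedia(f"{participants} in {event} demographics")
    # 2. Summarize the demographic facts from the retrieved documents.
    facts = llm.summarize(
        f"Based only on these documents, state the racial and gender "
        f"makeup of the {participants} in the {event}:\n{docs}"
    )
    # 3. Ground the image prompt in those facts instead of appending
    #    a blind diversity instruction.
    return (
        f"Generate an image of the {participants} during the {event}. "
        f"Historical facts about their demographics: {facts} "
        f"Reflect these facts faithfully in the depiction."
    )
```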

Did it work?
Yes. Looking back at Table 2, the rows labeled + FAI-RK show the highest accuracy scores, often beating even the baseline.
- DALL-E 3 (Race): The FAI-RK method achieved an Involved Demographic Accuracy (IDA) of 69.46%, compared to 56.39% for the standard diversity intervention.
- Factual Diversity: Crucially, the FDD (Divergence) score dropped to nearly zero or became very low, meaning the diversity in the image matched the diversity of the actual history.
Figure 8 shows the side-by-side improvement. In the column for “Dutch Colonial Forces,” the diversity intervention incorrectly inserts women. The FAI-RK method correctly identifies that this was a male-only military force and generates a historically accurate image, while still maintaining high visual quality.

A Note on Model Limitations
Interestingly, the researchers could only apply FAI to DALL-E 3. When they tried to use these long, complex, fact-based prompts with Stable Diffusion, the model collapsed. As shown in Figure 9, Stable Diffusion struggled to process the long context, resulting in distorted, nightmare-fuel imagery. This highlights that for “smart” interventions to work, the underlying image model needs robust language understanding capabilities.

Conclusion and Implications
The “Factuality Tax” is a fascinating case study in the Law of Unintended Consequences in AI alignment. In our rush to fix the social biases of T2I models—to ensure they don’t only show us white male doctors—we inadvertently broke their ability to accurately represent history.
This paper makes three significant contributions to the field:
- DoFaiR: A benchmark that finally allows us to measure demographic factuality systematically.
- Quantification: Proof that current diversity interventions degrade historical accuracy by over 11%.
- FAI: A proven method using Retrieval Augmented Generation (RAG) principles to align image generation with historical fact.
The broader implication is clear: context matters. A “one-size-fits-all” diversity prompt is too blunt an instrument for models that need to navigate the complexities of human history. By coupling image generators with knowledge retrieval systems (FAI-RK), we can move toward AI that is both socially responsible and historically faithful. We can have diversity where it reflects reality, and accuracy where history demands it.