Introduction: The Motorcycle Problem

Imagine showing an AI model a picture of a person riding a motorcycle. You ask the model to describe what it sees. It replies: “A man riding a motorcycle.”

Now, imagine that the rider is actually a woman. Why did the AI get it wrong?

The answer lies in spurious correlations. In the vast datasets used to train these models, motorcycles appear significantly more often with men than with women. The model stops looking at the person and starts relying on the context: if there is a motorcycle, the model bets it’s a man.

This is a classic example of societal bias in machine learning. While researchers have spent years trying to fix this by re-balancing datasets (a technique called resampling), a new paper titled “Resampled Datasets Are Not Enough: Mitigating Societal Bias Beyond Single Attributes” argues that traditional methods are hitting a wall. They fail because they only look at labeled attributes (like “motorcycle”), ignoring the thousands of subtle, unlabeled clues—like background colors, spatial positioning, or clothing styles—that models use to cheat.

In this post, we will explore how researchers are now using generative AI (specifically text-guided inpainting) to create “counterfactual” synthetic datasets. We will break down their novel pipeline, explain why mixing real and fake data can backfire, and see how this approach creates fairer models for everyone.


The Limitations of Resampling

Before diving into the solution, we need to understand why the current standard—resampling—is insufficient.

In a typical dataset, you might have an imbalance. For example:

  • 90% of images with “kitchens” feature women.
  • 10% of images with “kitchens” feature men.

A model trained on this learns that Kitchen = Woman. To fix this, researchers use resampling: they over-sample the minority case (show the “men in kitchens” images multiple times) or under-sample the majority case (show fewer “women in kitchens”).

The goal is to make the probability of the attribute (kitchen) independent of the protected group (gender).
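To make this concrete, here is a minimal sketch of inverse-frequency resampling with pandas. The tiny dataframe, the column names, and the 4x over-sampling factor are all illustrative; they are not taken from the paper.

```python
import pandas as pd

# Hypothetical metadata: one row per image, with a labeled attribute
# ("kitchen") and a protected group ("gender").
df = pd.DataFrame({
    "kitchen": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "gender":  ["woman", "woman", "woman", "woman", "man",
                "man", "man", "man", "woman", "woman"],
})

# How often does each (attribute, group) combination occur?
counts = df.groupby(["kitchen", "gender"]).size()

# Inverse-frequency weights: rare combinations (men in kitchens) get
# sampled more often, common combinations less often.
df["weight"] = df.apply(
    lambda row: 1.0 / counts[(row["kitchen"], row["gender"])], axis=1
)

# Draw a resampled training set in which, within each attribute,
# the two groups now appear roughly equally often.
balanced = df.sample(n=4 * len(df), replace=True,
                     weights=df["weight"], random_state=0)
print(balanced.groupby(["kitchen", "gender"]).size())
```

The catch, as the next subsection explains, is that this only works for attributes you actually have labels for, like `kitchen`.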

The “Hidden Attribute” Trap

Resampling works great if you have labels for everything. But what about the attributes you didn’t label?

Consider color. Perhaps in the dataset, images of women tend to have warmer color palettes, while images of men have cooler tones. Even if you balance the “kitchen” label perfectly, the model might simply switch to using color statistics to guess gender. Because “color palette” isn’t a labeled category in the dataset, you can’t resample based on it.

This is the core problem the researchers address: How do we decorrelate protected groups from all attributes, even the ones we don’t know about?


The Solution: Text-Guided Inpainting

The researchers propose a method that moves beyond curating existing data to generating new data. If the real world doesn’t provide enough pictures of men in contexts typically associated with women (or vice versa), we can use modern diffusion models to create them.

The core idea is Text-Guided Inpainting.

  1. Take an image with a person.
  2. “Mask” (erase) the person.
  3. Use a text prompt to fill the empty space with a specific demographic (e.g., “A man…” or “A woman…”).

By doing this for every image in the dataset, you theoretically ensure that the background context (the kitchen, the motorcycle, the colors) remains exactly the same, while the gender changes. This breaks the correlation between the context and the group.
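As a rough sketch of the mechanics (not the authors’ exact setup), here is how the three steps look with an off-the-shelf inpainting pipeline from the diffusers library. The checkpoint name, file paths, and prompts are illustrative assumptions.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Any inpainting-capable diffusion model would do; this public
# checkpoint is just a common example.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("motorcycle_scene.jpg").convert("RGB")  # original photo
mask = Image.open("person_mask.png").convert("RGB")        # white = erased person region

# Step 3: fill the masked region with a specific demographic while the
# background (motorcycle, lighting, colors) comes from the original.
for prompt in ["a man riding a motorcycle", "a woman riding a motorcycle"]:
    counterfactual = pipe(prompt=prompt, image=image, mask_image=mask).images[0]
    counterfactual.save(f"inpainted_{prompt.split()[1]}.jpg")
```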

The Pipeline Overview

The researchers developed a comprehensive pipeline to operationalize this idea.

Figure 2: Overview of the pipeline showing input images, inpainting, filtering, and dataset creation.

As shown in Figure 2 above, the process works as follows:

  1. Input: An original image \(x\).
  2. Inpainting: The system generates multiple variations of the person using different prompts (e.g., \(t_{man}\) and \(t_{woman}\)).
  3. Filtering & Ranking: Generative models aren’t perfect. They sometimes create monsters or ignore the prompt. The system generates \(m\) candidates per image and ranks them to keep the best ones (a sketch of this ranking loop follows the list).
  4. Dataset Creation: The best synthetic images are compiled into a new training set.
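Putting steps 2 and 3 together, the per-image loop can be sketched as follows. The three scoring functions are placeholders for the filters described in the next section, and combining them with a plain sum is an assumption; the paper’s exact ranking rule may differ.

```python
def best_counterfactual(original, candidates, prompt,
                        prompt_score, object_score, color_score):
    """Pick the best of m inpainted candidates for one image.

    prompt_score, object_score, and color_score are stand-ins for the
    three quality filters (prompt adherence, object consistency, color
    fidelity). Each should return "higher is better" values.
    """
    scored = []
    for candidate in candidates:
        total = (prompt_score(candidate, prompt)
                 + object_score(candidate, original)
                 + color_score(candidate, original))
        scored.append((total, candidate))
    # Keep the candidate with the highest combined score.
    return max(scored, key=lambda pair: pair[0])[1]
```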

Quality Control: The Three Filters

One of the major contributions of this paper is the acknowledgment that you cannot simply trust a diffusion model blindly. If you ask it to generate “A woman with a motorcycle,” it might generate a woman but accidentally remove the motorcycle, or change the lighting so drastically that the image looks fake.

To solve this, the authors introduce a rigorous ranking system based on three specific equations.

1. Prompt Adherence

First, we must ensure that the generated image actually matches the text description (e.g., did it really generate a woman?). They use CLIPScore, which measures the semantic similarity between the image and the text.

Equation for prompt adherence using CLIP embeddings.

Here, \(\phi\) is the image encoder and \(\psi\) is the text encoder. A higher score means the image better reflects the prompt “A woman with a motorcycle.”
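CLIPScore itself is a standard metric: embed the image with \(\phi\), embed the prompt with \(\psi\), and take their cosine similarity. A sketch using the Hugging Face transformers CLIP implementation (the checkpoint name is an illustrative choice, not necessarily the one used in the paper):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prompt_adherence(image: Image.Image, text: str) -> float:
    """Cosine similarity between the CLIP image embedding (phi) and the
    text embedding (psi); higher means the image matches the prompt."""
    inputs = processor(text=[text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(image_emb, text_emb).item()

# e.g. prompt_adherence(Image.open("inpainted_woman.jpg"),
#                       "a woman with a motorcycle")
```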

2. Object Consistency

Second, we need to ensure the model didn’t hallucinate new objects or delete existing ones. If the original image had a surfboard, the synthetic image must also have a surfboard. They use a pre-trained object detector to compare the objects found in the synthetic image (\(x_{synthetic}\)) vs. the original image (\(x_{original}\)).

Equation for object consistency using F1 score.

They calculate the F1 score between the detected objects in both images. If the score is high, the semantic content of the scene has been preserved.
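Assuming the detector has already been run on both images and returned sets of class labels, the consistency check reduces to an F1 score over those sets. A sketch (how the paper handles duplicate detections or confidence thresholds is not shown here):

```python
def object_consistency(original_labels, synthetic_labels):
    """F1 between the object classes detected in the original and the
    inpainted image: 1.0 means the same objects appear in both, lower
    values mean objects were hallucinated or deleted."""
    orig, synth = set(original_labels), set(synthetic_labels)
    if not orig and not synth:
        return 1.0
    true_pos = len(orig & synth)        # objects preserved
    false_pos = len(synth - orig)       # hallucinated objects
    false_neg = len(orig - synth)       # deleted objects
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# The surfboard survived, but a "dog" was hallucinated -> score drops to 0.8.
print(object_consistency({"person", "surfboard"}, {"person", "surfboard", "dog"}))
```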

3. Color Fidelity

Finally, generative models often introduce their own biases in color (e.g., making images of women brighter or more pastel). To prevent the model from learning these new color-based biases, the researchers compare the color statistics of the original and synthetic images.

Equation for color fidelity using Frobenius norm.

They down-sample the images to \(14 \times 14\) pixels (to focus on general color palette rather than detail) and calculate the difference using the Frobenius norm. We want the difference to be minimized (or the inverse score maximized).
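Following that description, a sketch of the color check looks like this; treating the negated distance as the score to maximize is our reading of the “inverse score,” not necessarily the paper’s exact formulation:

```python
import numpy as np
from PIL import Image

def color_fidelity(original_path, synthetic_path, size=(14, 14)):
    """Negated Frobenius-norm distance between heavily down-sampled
    copies of the two images. At 14x14 only the rough color palette
    survives, so a value near 0 means the palette was preserved."""
    orig = np.asarray(Image.open(original_path).convert("RGB").resize(size),
                      dtype=np.float32)
    synth = np.asarray(Image.open(synthetic_path).convert("RGB").resize(size),
                       dtype=np.float32)
    # Sum the Frobenius norms of the per-channel difference matrices.
    distance = sum(np.linalg.norm(orig[..., c] - synth[..., c], ord="fro")
                   for c in range(3))
    return -distance  # closer to 0 = better color fidelity

# score = color_fidelity("original.jpg", "inpainted_woman.jpg")
```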

Seeing the Filters in Action

Why are these filters necessary? The image below shows what happens when inpainting goes wrong versus when it goes right.

Figure 4: Comparison of the best and worst inpainted images across the different filter criteria.

In Figure 4, look at the “Object Consistency” column. The “Worst” image completely loses the object the person was holding. If we trained on that data, the model might learn that “Object X disappears when a woman is present,” introducing a new bias. The “Best” row, selected by the filters, preserves the pose, object, and lighting accurately.


Building the Dataset: Synthetic vs. Augmented

Once the images are generated and filtered, how do we use them? The paper explores two strategies.

Strategy 1: Augmentation (\(S_{\text{augment}}\))

The intuitive approach is to keep the real data and add the synthetic data to balance it.

Equation for \(S_{\text{augment}}\) combining real and synthetic data.

Here, the dataset includes the original real images \(\mathcal{D}\) plus the synthetic counterfactuals for the groups that weren’t present.

Strategy 2: Pure Synthetic (\(S_{\text{synthetic}}\))

The radical approach is to throw away the real images of people entirely.

Equation for \(S_{\text{synthetic}}\) using only synthetic data.

In this strategy, every image used for training is a synthetic generation (derived from a real image). If you have a photo of a man on a bike, you generate a synthetic man on a bike and a synthetic woman on a bike, and train on those. You do not use the original photo.
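In set notation, a plausible reading of these two constructions (a sketch in our own notation, not necessarily the paper’s exact definitions, where \(\mathcal{G}\) is the set of protected groups and \(\hat{x}_{g}\) is the inpainted counterfactual of image \(x\) for group \(g\)) is:

\[
S_{\text{augment}} = \mathcal{D} \,\cup\, \{\, \hat{x}_{g} : x \in \mathcal{D},\ g \in \mathcal{G},\ g \text{ not depicted in } x \,\},
\qquad
S_{\text{synthetic}} = \{\, \hat{x}_{g} : x \in \mathcal{D},\ g \in \mathcal{G} \,\}.
\]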

Why would you discard real data? Keep reading to find out—it is the paper’s most surprising finding.


Experiments and Results

The researchers tested their methods on two major tasks: Multi-label Classification (identifying objects in a scene) and Image Captioning (describing the scene). They used the COCO dataset, a standard benchmark in computer vision.

Visual Results

First, let’s look at the qualitative difference.

Figure 1: Comparison of predictions by baseline models versus the proposed method.

In Figure 1:

  • Image 1 (Bench): The baseline model sees a woman and predicts “Handbag” even though there isn’t one. The handbag is hallucinated purely from bias. The proposed method correctly sees only the person and the bench.
  • Image 3 (Frisbee): The baseline sees a person jumping and predicts “Skateboarding,” a sport heavily correlated with men in the dataset. The proposed method correctly identifies the “Frisbee.”

Quantitative Results

The table below details the performance on classification tasks. They measured mAP (mean Average Precision, a measure of recognition performance) and two bias metrics: Ratio (how far the predictions skew toward one gender; the ideal value is 1) and Leakage (how much the model’s outputs reveal about the protected group when they shouldn’t).

Table 1 showing classification performance and bias scores.

Key Takeaways from Table 1:

  1. Original Data: High bias. The Ratio is 6.3, meaning it is heavily skewed.
  2. Over-sampling (Traditional method): Reduces bias somewhat (Ratio 3.8) but hurts performance (mAP drops from 66.4 to 62.6).
  3. \(S_{\text{synthetic}}\) (Ours): This is the winner. The Ratio drops to 1.1 (almost perfect fairness), and the mAP stays high at 66.0.

The purely synthetic dataset achieved state-of-the-art bias reduction without destroying the model’s utility.
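A quick note on how Leakage is typically measured in this line of work: a small “attacker” classifier is trained to predict the protected group from the task model’s output scores, and its accuracy is the leakage. The sketch below illustrates the idea with scikit-learn; the logistic-regression attacker, the split, and the random example data are assumptions, not the paper’s exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def leakage(prediction_scores: np.ndarray, gender_labels: np.ndarray) -> float:
    """Train an attacker to predict gender from the task model's output
    scores. With two groups, accuracy near 0.5 means the outputs leak
    little group information; higher accuracy means more leakage."""
    X_train, X_test, y_train, y_test = train_test_split(
        prediction_scores, gender_labels, test_size=0.3, random_state=0
    )
    attacker = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return attacker.score(X_test, y_test)

# With random scores the attacker should land near chance (~0.5).
rng = np.random.default_rng(0)
print(leakage(rng.normal(size=(1000, 80)), rng.integers(0, 2, size=1000)))
```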


The “Synthetic Artifact” Trap

You might have noticed in Table 1 that \(S_{\text{augment}}\) (mixing real and fake) performed worse on bias metrics than \(S_{\text{synthetic}}\). Why?

This is a critical lesson in using Generative AI.

Even with great filters, synthetic images have subtle “artifacts”—tiny pixel-level glitches or smoothness that distinguish them from real photos.

If you create a dataset with:

  • Real photos of Men on motorcycles
  • Synthetic photos of Women on motorcycles

The AI model is smart (and lazy). It will realize: “I don’t need to look for a woman. I just need to look for synthetic pixel artifacts. If the image looks fake, predict ‘Woman’. If it looks real, predict ‘Man’.”

The researchers proved this hypothesis by testing the models on a special test set.

Figure 3: Original vs. inpainted test images showing inconsistent predictions.

In Figure 3, look at the predictions under \(S_{\text{augment}}\).

  • On the Original image (left), it predicts “A man.”
  • On the Inpainted version of the same image (right), it predicts “A woman,” even though the visual content implies a man.

The model trained on mixed data (\(S_{\text{augment}}\)) learned to associate the “inpainted look” with the minority group.

The Solution: By using \(S_{\text{synthetic}}\), all images—men and women—have the same synthetic artifacts. The model can no longer use “fake-ness” as a shortcut to guess gender. This levels the playing field.


Conclusion and Implications

This paper presents a significant shift in how we think about “data cleaning.” Instead of just re-weighting the data we have, we are entering an era where we can manufacture the data we need to reflect the fairness we want.

Key Takeaways for Students:

  1. Labels aren’t enough: You can’t fix bias just by balancing labels because bias hides in unlabeled features (backgrounds, colors).
  2. Generative AI is a tool for fairness: Inpainting allows us to change specific attributes while freezing the context, effectively isolating the variable we want to de-bias.
  3. Beware of Artifacts: Mixing real and synthetic data is dangerous. It creates a new spurious correlation where “synthetic” becomes a proxy for the minority class.
  4. Filters are mandatory: You cannot trust raw generative output. Rigorous mathematical filtering (CLIPScore, etc.) is essential for data quality.

As generative models improve, this “synthetic data” approach will likely become a standard part of the ML pipeline, allowing us to train models that are not just accurate, but socially responsible.