Introduction
In the rapidly evolving landscape of Artificial Intelligence, Vision-Language Models (VLMs) have become superstars. Models like CLIP and GLIP can match an image against a text description, or take a text description and locate the corresponding object in a picture. They are powerful tools, pre-trained on massive datasets of image-text pairs scraped from the internet.
However, this power comes with a significant catch: societal bias. Because these models learn from human-generated data, they often inherit our stereotypes. For example, a model might be more likely to associate a “kitchen” with a woman or a “workshop” with a man, regardless of who is actually in the picture.
Traditionally, researchers have tried to measure this bias by treating the model as a “black box.” They tweak the input (e.g., changing “man” to “woman” in a caption) and see how the output probability changes. While this tells us that bias exists, it fails to explain how or where that bias is generated inside the neural network.
This brings us to a fascinating paper titled “Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective.” The researchers in this study didn’t just want to measure the output; they wanted to open the hood. They proposed a framework using Causal Mediation Analysis to map the pathways of bias generation.
Their findings are surprising: contrary to what one might expect from text-heavy biases, the image features contribute significantly more to the model’s bias than the text features. In this post, we will tear down their methodology, explore the internal mechanics of bias, and look at their proposed solution for mitigation.
Background: The Problem with Black-Box Bias Detection
Before diving into the new method, we need to understand the status quo. Most current bias evaluation methods in VLMs are derived from Natural Language Processing (NLP).
For instance, a common technique is “counterfactual testing.” If a model predicts a person is a “doctor” with 90% confidence when the text says “he,” but drops to 60% confidence when the text says “she,” we identify a gender bias.
While useful, this approach has two major limitations:
- Lack of Causality: It shows a correlation between input and output but doesn’t explain the causal mechanism.
- Opaque Internal Flow: It doesn’t tell us which part of the model is responsible. Is it the image encoder? The text encoder? Or the fusion layer where they mix?
Without knowing where the bias lives, fixing it is a guessing game. The authors of this paper argue that to effectively mitigate bias, we first need to understand the causal roles of specific model components.
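To make the black-box approach concrete, here is a minimal sketch of a counterfactual probe of the kind described above. The `score_caption` function is a stand-in for whatever scoring call a given VLM exposes; it is an assumption for illustration, not an API from the paper or from a specific library.

```python
# Minimal sketch of black-box counterfactual probing (illustrative only).
# `score_caption` is a hypothetical stand-in for any VLM scoring call.

def score_caption(image, caption: str) -> float:
    """Placeholder: return the model's confidence that `caption` matches `image`."""
    raise NotImplementedError

def counterfactual_gap(image, caption: str, swaps: dict) -> float:
    """Score the original caption and a gender-swapped version; return the gap."""
    swapped = caption
    for src, dst in swaps.items():
        swapped = swapped.replace(src, dst)
    return score_caption(image, caption) - score_caption(image, swapped)

# Example: a large gap for {"he": "she"} on "a photo of a doctor, he is smiling"
# would flag a gender association, but says nothing about which module produced it.
```

This is exactly the limitation the authors highlight: the probe compares two outputs, but everything between input and output stays hidden.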
The Core Method: Causal Mediation Analysis
The heart of this paper is the application of Causal Mediation Analysis to Vision-Language Models. This is a statistical framework used to understand how an independent variable (the treatment) affects a dependent variable (the outcome) through an intermediate variable (the mediator).
The Intuition
To understand this, let’s look at a real-world analogy provided by the authors.

As shown in Figure 1, imagine an athlete doing strength training (Treatment \(X\)) to improve athletic performance (Outcome \(Y\)).
- Direct Effect: The training directly improves performance.
- Indirect Effect (via Mediator): The training also triggers muscle relaxation and recovery (Mediator \(Z\)), which in turn affects performance.
The researchers apply this logic to VLMs, mapping the variables as follows (the resulting effects are formalized in the short sketch after this list):
- Treatment (\(X\)): Altering the gender information in the input (e.g., masking a person in an image or swapping gender words in text).
- Mediator (\(Z\)): Specific components inside the AI model, such as attention heads or specific layers.
- Outcome (\(Y\)): The change in the model’s gender bias score.
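To pin down the two pathways, here is the standard mediation decomposition in generic textbook notation (not necessarily the paper's exact formulation): \(x\) is the original input, \(x^{*}\) the gender-altered input, \(z_{x}\) the mediator's value under input \(x\), and \(y(x, z)\) the bias score when the model receives input \(x\) with the mediator forced to value \(z\).

\[
\begin{aligned}
\text{Total Effect:}\quad & TE = y(x^{*}, z_{x^{*}}) - y(x, z_{x}) \\
\text{Direct Effect:}\quad & DE = y(x^{*}, z_{x}) - y(x, z_{x}) \\
\text{Indirect Effect:}\quad & IE = y(x, z_{x^{*}}) - y(x, z_{x})
\end{aligned}
\]

The direct effect changes the input while holding the mediator at its original behavior; the indirect effect does the reverse. Together they account for the total effect (exactly so when the two pathways do not interact).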
Defining the Bias Metric
Before measuring causal effects, the authors needed a concrete way to quantify bias. They defined a metric called \(BIAS_{VL}\) (Vision-Language Bias).

This equation measures the difference in how strongly an object (like a “stove” or “motorcycle”) correlates with each gender. Specifically, it is built on the False Positive Rate (FPR): if a model incorrectly detects a “hair dryer” more often when a woman is in the picture than when a man is, that discrepancy contributes to the \(BIAS_{VL}\) score.
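The exact normalization of \(BIAS_{VL}\) is not reproduced here, but a minimal sketch of an FPR-gap score in this spirit might look like the following. The simple mean of absolute gaps is an assumption for illustration, not the paper's exact formula.

```python
import numpy as np

def fpr(false_positives: int, true_negatives: int) -> float:
    """False Positive Rate = FP / (FP + TN)."""
    return false_positives / max(false_positives + true_negatives, 1)

def bias_vl_sketch(counts_female: dict, counts_male: dict) -> float:
    """Aggregate per-object FPR gaps between woman-present and man-present images.

    counts_* map object name -> (false_positives, true_negatives).
    NOTE: a plain mean of absolute gaps; the paper's BIAS_VL may normalize differently.
    """
    objects = counts_female.keys() & counts_male.keys()
    gaps = [abs(fpr(*counts_female[o]) - fpr(*counts_male[o])) for o in objects]
    return float(np.mean(gaps)) if gaps else 0.0

# Toy example: "hair dryer" is hallucinated in 12/100 woman-present images
# but only 2/100 man-present images; the reverse pattern holds for "motorcycle".
score = bias_vl_sketch(
    {"hair dryer": (12, 88), "motorcycle": (3, 97)},
    {"hair dryer": (2, 98), "motorcycle": (15, 85)},
)
print(f"sketch bias score: {score:.3f}")
```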
The Experimental Setup: GLIP
The researchers used the GLIP (Grounded Language-Image Pre-training) model for their experiments. GLIP is an object detection model that takes two inputs:
- An image.
- A text prompt (a list of object categories).

As seen in Figure 3, the model processes the image (green box) and the text (red box) separately before merging them in a “Deep Fusion” module (purple box). This architecture is perfect for this study because it has distinct modules for vision and language, allowing the researchers to isolate where the bias is coming from.
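Since GLIP's text input is essentially a list of candidate category names, the two-input setup can be sketched as below. The `GroundedDetector.detect` call is a hypothetical placeholder, not GLIP's actual API; the " . "-joined prompt format mirrors the example given later in this post.

```python
# Sketch of the two-input setup used by GLIP-style detectors.
# `GroundedDetector` is a hypothetical placeholder, not the real GLIP API.
from typing import List

CATEGORIES: List[str] = ["person", "hair dryer", "motorcycle", "dining table"]

def build_prompt(categories: List[str]) -> str:
    """GLIP-style prompts are commonly the category names joined by ' . '."""
    return " . ".join(categories)

prompt = build_prompt(CATEGORIES)  # "person . hair dryer . motorcycle . dining table"

# detector = GroundedDetector.load("glip_tiny")           # hypothetical
# boxes, labels, scores = detector.detect(image, prompt)  # hypothetical
```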
Measuring Direct and Indirect Effects
Here is where the methodology gets clever. The authors used three types of interventions:
- Null: Original image and text.
- Replace-gender: Swapping “man” with “person” (or similar) in the text.
- Mask-gender: Blacking out the pixels corresponding to the person in the image.
By freezing specific parts of the model (the mediators) while changing the inputs, they could mathematically separate the Direct Effect (how much the input change affects the output directly) from the Indirect Effect (how much the input change affects the output through a specific internal component).
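A minimal sketch of the text and image interventions is shown below. The word-substitution list, helper names, and the assumption that a person bounding box is already available are all illustrative choices, not details taken from the paper.

```python
import re
import numpy as np

# Illustrative substitution list; the paper's exact word set may differ.
GENDER_WORDS = {"man": "person", "woman": "person", "he": "they", "she": "they"}

def replace_gender(prompt: str) -> str:
    """Replace-gender intervention: swap gendered words for neutral ones in the text."""
    pattern = r"\b(" + "|".join(GENDER_WORDS) + r")\b"
    return re.sub(pattern, lambda m: GENDER_WORDS[m.group(1)], prompt)

def mask_gender(image: np.ndarray, person_box: tuple) -> np.ndarray:
    """Mask-gender intervention: black out the pixels of the detected person."""
    x0, y0, x1, y1 = person_box
    masked = image.copy()
    masked[y0:y1, x0:x1] = 0
    return masked

# Null intervention: pass the original image and prompt through unchanged.
```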

Figure 2 visualizes this flow.
- (c) Direct Effect: We change the input, but we force the internal component (\(z\)) to behave as if the input hadn’t changed.
- (d) Indirect Effect: We keep the input the same, but we force the internal component (\(z\)) to behave as if the input had changed.
This allows the researchers to ask: “Does the bias come from the raw input data, or is it being amplified and propagated by the Image Encoder layers specifically?”
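In practice, "freezing" a mediator means caching a component's activations from one run and forcing them back in during another. Here is a minimal sketch using PyTorch forward hooks; which module serves as the mediator and how the resulting output is turned into a bias score are assumptions left open.

```python
import torch

def run_with_frozen_mediator(model: torch.nn.Module,
                             mediator: torch.nn.Module,
                             capture_inputs,
                             eval_inputs):
    """Sketch: record the mediator's output under one input, then replay it
    while the rest of the model sees a different input.

    - Direct effect:   capture under the ORIGINAL input, evaluate on the ALTERED input.
    - Indirect effect: capture under the ALTERED input, evaluate on the ORIGINAL input.
    """
    cached = {}

    def save_hook(module, inputs, output):
        cached["out"] = output  # remember how the mediator behaved

    def replay_hook(module, inputs, output):
        return cached["out"]    # override the mediator with the cached activation

    with torch.no_grad():
        handle = mediator.register_forward_hook(save_hook)
        model(*capture_inputs)                 # first pass: record mediator behavior
        handle.remove()

        handle = mediator.register_forward_hook(replay_hook)
        frozen_output = model(*eval_inputs)    # second pass: mediator is "frozen"
        handle.remove()

    return frozen_output
```

Comparing bias scores computed from such frozen runs against unconstrained runs is what lets the framework attribute effects to individual layers or attention heads.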
Experiments & Results
The team tested this framework on two major datasets: MSCOCO and PASCAL-SENTENCE. They focused on 66 objects that frequently appear with humans.
1. Confirming the Bias Exists
First, they established a baseline. Does the GLIP model actually exhibit gender bias? The answer is a resounding yes.

Figure 4 shows the False Positive Rates for the PASCAL-SENTENCE dataset.
- Blue bars (Female): Notice the spikes for “dining table” and “chair.” The model is hallucinating these objects more often simply because a woman is present.
- Orange bars (Male): Notice the spikes for “motorbike,” “bus,” and “car.” The model associates men with vehicles.
This confirms that the model correlates females with indoor objects/pets and males with outdoor objects/vehicles.
2. The Verdict: Images are the Primary Culprit
This is the most significant finding of the paper. When they ran the Causal Mediation Analysis, they compared the impact of the Image Encoder versus the Text Encoder.

Let’s look closely at Figure 5, specifically panels (a) and (b).
- Panel (a) - Image Module: When the researchers intervened on the image module (masking gender), the Direct Effect (DE) drops significantly, and the Indirect Effect (IE) rises as they include more layers. This means the image encoder layers are actively mediating a large portion of the bias.
- Panel (b) - Text Module: The effects here are much smaller (note the scale of the Y-axis).
The Quantified Difference: On the MSCOCO dataset, image features accounted for approximately 33% of the bias, whereas text features only accounted for 13%.
Why? The authors suggest that because the text input in object detection is often just a list of words (e.g., “teddy bear . handbag . fork”), the semantic structure is simple. Images, however, contain rich, complex pixel data where gender cues are deeply entangled with object cues.
3. The Deep Fusion Module
The researchers also analyzed the Deep Fusion encoder, where image and text features interact. They found that even though this module does not extract features directly from the raw inputs, the interaction process itself generates a large amount of bias, accounting for roughly 56% of the contribution attributed to the encoders.
4. Do Vision and Language Agree?
A critical question in multimodal learning is whether the vision part and the language part of the model are fighting each other or reinforcing each other.

Figure 6 shows the results of intervening on Language (L), Vision (V), and both (L+V).
- The Red Line (L+V) shows the highest bias score when layers are fixed (meaning no intervention), and it drops the most when interventions are applied.
- The fact that combining them reduces bias more than either alone suggests that the biases are aligned. The text and image encoders are “conspiring” to reinforce the same gender stereotypes.
Mitigation: Fixing the Source
Since the causal analysis proved that the Image Encoder is the biggest contributor to bias, the authors proposed a targeted mitigation strategy called ImageFair.
The Strategy: “Blurring” Gender
Instead of retraining the whole model (which is expensive) or only adjusting text prompts (which the causal analysis showed is far less effective), they modified the Image Encoder pipeline:
- Face Detection: They integrated a lightweight network (MTCNN) to find faces in the input image.
- Gender Classification: They used MobileNet to classify the gender of the face.
- The Swap: If a male face is detected, they blend it with a “counterfactual” female face (and vice versa).
This process “blurs” the gender features in the image representation before the model can form a biased association.
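A minimal sketch of the blending step is given below. The face box and gender label are treated as given (in the paper they come from MTCNN and MobileNet), and the 50/50 blend weight is an assumption for illustration, not the paper's setting.

```python
import numpy as np

def blend_counterfactual_face(image: np.ndarray,
                              face_box: tuple,
                              counterfactual_face: np.ndarray,
                              alpha: float = 0.5) -> np.ndarray:
    """Blend a detected face with a counterfactual face of the opposite gender.

    `face_box` would come from a face detector such as MTCNN, and the choice of
    `counterfactual_face` from a gender classifier such as MobileNet; both are
    assumed to be available here. The counterfactual face is assumed to be
    pre-resized to the box dimensions.
    """
    x0, y0, x1, y1 = face_box
    out = image.copy().astype(np.float32)
    patch = counterfactual_face.astype(np.float32)
    out[y0:y1, x0:x1] = alpha * out[y0:y1, x0:x1] + (1 - alpha) * patch
    return out.astype(np.uint8)
```

The blended face is what the Image Encoder sees, so the gender signal is weakened before any object association can form.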
Did it work?
The results were impressive.

As shown in Table 2:
- GLIP (Original): Bias score of 1.434 on MSCOCO.
- GLIP_TextFair: Reduced bias by ~7.8%.
- GLIP_ImageFair: Reduced bias by ~22.03%.
Crucially, look at the AP (Average Precision) column. The performance dropped very slightly (from 46.6 to 46.2). This indicates that the model became significantly fairer without losing its ability to detect objects accurately.
Conclusion and Implications
This research paper provides a crucial step forward in responsible AI. By moving beyond “black box” testing and using Causal Mediation Analysis, the authors demonstrated that images speak louder than words when it comes to bias in Vision-Language Models.
Key Takeaways:
- Traceability Matters: We can mathematically trace which layers and modalities are generating bias.
- Vision is Dominant: In object detection tasks, the visual features carry the bulk of the gender/object correlation.
- Targeted Mitigation: Because we know where the bias is (the Image Encoder), we can fix it efficiently (blurring gender faces) without breaking the rest of the model.
This framework is not limited to gender bias. The authors note that it could be adapted for age, race, or other societal biases, provided we have the right data. As we continue to integrate Multimodal AI into society, tools like this will be essential for ensuring these systems are fair and equitable.