Introduction
Imagine you are using a generative AI tool to edit a photo of a young person. You adjust the “Age” slider to make them look older. The model successfully adds wrinkles and greys the hair, but strangely, it also puts a pair of glasses on the person’s face. You didn’t ask for glasses. You try again, and the same thing happens.
Why does this happen?
The answer lies in spurious correlations. In many training datasets (like CelebA), older people are statistically more likely to wear glasses than younger people. A standard deep learning model doesn’t understand biology or optics; it simply memorizes patterns. It learns that “Old” and “Glasses” often go together, so when you ask for one, it gives you both.
This is a fundamental problem in Controllable Image Editing. We want to modify specific attributes (like age, gender, or expression) independently, without inadvertently changing unrelated features.
In this post, we will dive deep into a paper titled “Visual Representation Learning through Causal Intervention for Controllable Image Editing,” which proposes a novel framework called CIDiffuser. This method bridges the gap between Causal Inference—the study of cause and effect—and Diffusion Models, the current state-of-the-art in image generation.
We will explore how the authors use structural causal models (SCMs) to disentangle these complex relationships, ensuring that when you edit “Age,” the “Glasses” stay put.
The Problem: Correlation vs. Causation
To understand the solution, we first need to visualize the problem. In the field of representation learning, we try to map an image to a set of latent variables (representations) that describe its content.
Ideally, these variables should be independent. The variable for “Age” should not interact with the variable for “Glasses.” However, real-world data is messy.

As shown in Figure 1 (a) above, there is a “Confounder” (represented by \(U_{12}\) or \(U_{34}\)). This is an underlying bias in the dataset.
- Fact: In the dataset, older people often wear glasses.
- Result: The model learns a spurious correlation (dashed line).
- Consequence: When you perform an edit (like in the red box in Figure 1.a.2), the model hallucinates glasses because it mistakenly treats them as being caused by age.
Existing methods like VAEs (Variational Autoencoders) or GANs (Generative Adversarial Networks) often fail here because they focus on fitting the data distribution, which includes these biases. While recent diffusion models generate beautiful images, they are also prone to these entanglement issues if the underlying representation isn’t carefully structured.
The Solution: The CIDiffuser Framework
The researchers propose CIDiffuser, a framework that marries the image quality of diffusion models with the logic of causal inference.
The high-level architecture is displayed below. It might look complex, but we will break it down step-by-step.

The pipeline operates in three main stages:
- Causal Modeling: Encoding the image into “Semantic” (meaningful) and “Stochastic” (random) parts.
- Direct Causal Effect Learning: Using causal interventions to learn how attributes truly affect one another, ignoring dataset bias.
- Causal Representation-Diffusion: Generating the final image from the cleaned-up representations (a high-level sketch of these three stages follows below).
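Before diving into each stage, here is a minimal pseudocode sketch of how the pieces could fit together during an edit. The component names (`semantic_encoder`, `scm`, `ddim`) are illustrative placeholders based on the description above, not the authors' actual API.

```python
# Hypothetical, simplified view of an edit with CIDiffuser.
# Component names are illustrative only, not the authors' API.

def edit_image(x0, attr_index, new_value, semantic_encoder, scm, ddim):
    # 1. Causal modeling: split the image into semantic and stochastic parts.
    eps_semantic = semantic_encoder(x0)        # high-level attributes
    x_T = ddim.encode(x0, eps_semantic)        # low-level stochastic detail

    # 2. Map the raw semantics onto the learned causal graph.
    z_causal = scm(eps_semantic)

    # 3. Intervene on one attribute, let the SCM propagate its true effects,
    #    then decode back to pixels conditioned on the edited representation.
    z_edited = scm.intervene(z_causal, index=attr_index, value=new_value)
    return ddim.decode(x_T, condition=z_edited)
```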
1. Decomposing the Image
Not everything in an image is a high-level attribute. The shape of a nose is a semantic attribute; the specific grain of the skin texture might be stochastic noise.
CIDiffuser starts by decomposing the input image \(x_0\) into two distinct representations:
- High-level Semantic Representation (\(\varepsilon_{semantic}\)): These encode the core attributes we care about (Age, Gender, Smile).
- Low-level Stochastic Representation (\(x_T\)): These encode texture, noise, and unstructured details.
This separation allows the model to manipulate the “logic” of the image (the semantics) without destroying the realistic “look” (the stochastic details).
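As a rough illustration, the semantic half of this split could be produced by a small convolutional encoder, in the spirit of diffusion autoencoders; the architecture below is an assumption for exposition, not the paper's exact network.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Maps an image to a compact semantic vector (illustrative architecture)."""

    def __init__(self, z_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, z_dim),
        )

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        # x0: (B, 3, H, W) image -> (B, z_dim) semantic code.
        return self.backbone(x0)

# The stochastic representation x_T is obtained separately, e.g. by running
# the deterministic DDIM forward (inversion) process on x0, conditioned on
# the semantic code -- analogous to how diffusion autoencoders work.
```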
2. Structural Causal Models (SCM)
Once we have the semantic representation, we cannot just use it directly, or we fall back into the correlation trap. We need to structure it.
The authors employ a Structural Causal Model (SCM). In an SCM, variables are nodes in a graph, and edges represent causal links. If node \(A\) points to node \(B\), then \(A\) causes \(B\).
The model learns a set of functions \(F\) and a causal adjacency matrix \(\mathbf{A}\). The matrix \(\mathbf{A}\) tells us the strength of the causal relationship between attributes. For example, if \(A_{ij} = 0\), there is no causal link between attribute \(i\) and attribute \(j\).
To make this mathematically robust, the authors define the relationship between the latent noise terms and the causal representations \(z_i\) using a non-linear function:

Here, \(f(\cdot)\) is a piece-wise linear function that transforms the raw latent variables into causally structured ones. Although each piece is linear, the function as a whole is non-linear, which matters because real-world causal relationships are rarely simple straight lines.
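The paper's exact parameterization isn't reproduced in this post, but a common way to build such a layer in causal representation learning is the linear-SCM identity \(z = (I - \mathbf{A}^{\top})^{-1}\varepsilon\) followed by an element-wise non-linearity. The sketch below follows that assumption, with a LeakyReLU standing in for the piece-wise linear \(f\).

```python
import torch
import torch.nn as nn

class CausalSCMLayer(nn.Module):
    """Latent noise eps -> causally structured representation z (a sketch).

    Uses the linear-SCM identity z = (I - A^T)^{-1} eps followed by an
    element-wise non-linearity f. The paper's exact parameterization may
    differ; LeakyReLU is only a stand-in for the piece-wise linear f.
    """

    def __init__(self, n_attrs: int):
        super().__init__()
        # Learnable causal adjacency matrix: A[i, j] != 0 means "i causes j".
        self.A = nn.Parameter(torch.zeros(n_attrs, n_attrs))
        self.f = nn.LeakyReLU(0.2)  # piece-wise linear stand-in

    def forward(self, eps: torch.Tensor) -> torch.Tensor:
        n = self.A.shape[0]
        eye = torch.eye(n, device=eps.device)
        # Row-vector form of z = (I - A^T)^{-1} eps for a batch of shape (B, n).
        z = eps @ torch.linalg.inv(eye - self.A)
        return self.f(z)
```

Methods in this family typically also regularize \(\mathbf{A}\) with an acyclicity constraint so the learned graph remains a DAG.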
3. Direct Causal Effect Learning (DCEL)
This is the core innovation of the paper. How do we teach the model to ignore the fact that “Old people usually wear glasses” and learn that “Age does not cause Glasses”?
The tool for this is causal intervention, formalized by Judea Pearl's do-operator.
The “Do” Operator
Standard probability asks: “Given that I see someone is old, what is the probability they have glasses?” (\(P(Glasses | Age)\)). This includes all the dataset bias.
Causal intervention asks: “If I force this person to be old (intervene), what is the probability they have glasses?” (\(P(Glasses | do(Age))\)). By forcing the value, we cut the link to the confounding history.
The authors quantify this using the Direct Causal Effect (DCE). The DCE measures the difference in the outcome when we intervene on a variable versus when we don’t, while keeping confounding factors constant.
\[
\mathrm{DCE}_{ij} \;=\; \hat{Y}_{y_i}^{j} \;-\; \hat{Y}_{\bar{y}_i}^{j}
\]
In this equation:
- \(\hat{Y}_{y_i}^j\) is the prediction when attribute \(i\) is set to its observed value.
- \(\hat{Y}_{\bar{y}_i}^j\) is the prediction when attribute \(i\) is intervened upon (set to a different value, \(\bar{y}_i\)).
- The difference between these two isolates the direct effect, stripping away the confounding bias \(U_{ij}\).
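In code, estimating this quantity with a hypothetical attribute classifier could look like the sketch below; the paper's exact estimator may differ, but the structure (observe, intervene, take the difference) is the same.

```python
import torch

def direct_causal_effect(classifier, z, i, j, flipped_value):
    """Illustrative estimate of the direct effect of attribute i on attribute j.

    `classifier(z)[:, j]` is a hypothetical predictor for attribute j.
    """
    # Prediction with attribute i at its observed value: \hat{Y}^j_{y_i}.
    y_observed = classifier(z)[:, j]

    # Intervene: force attribute i to a counterfactual value \bar{y}_i while
    # keeping everything else fixed -- the do-operator in action.
    z_do = z.clone()
    z_do[:, i] = flipped_value
    y_intervened = classifier(z_do)[:, j]     # \hat{Y}^j_{\bar{y}_i}

    # The difference isolates the direct effect, with confounders held constant.
    return (y_observed - y_intervened).mean()
```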
The Causal Loss Function
To train the model to respect these causal rules, the researchers design a specific loss function. They train a classifier \(C_{ij}\) to predict attribute relationships.

The first term (a binary cross-entropy) ensures the model can predict attribute relationships accurately. The second, subtracted term rewards a change in the predicted effect when we flip (intervene on) the cause, so that a genuine causal link \(A_{ij}\) does not collapse to zero.
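A hedged sketch of this two-term structure, using a hypothetical pairwise classifier `classifier_ij`, is shown below; the paper's precise formulation may differ.

```python
import torch
import torch.nn.functional as F

def causal_loss_ij(classifier_ij, z, z_intervened, y_j):
    """Two-term pairwise causal loss (illustrative only).

    classifier_ij predicts attribute j from the representation; z_intervened
    is z with attribute i flipped via the do-operator.
    """
    # Term 1: binary cross-entropy -- predict attribute j correctly from z.
    pred = classifier_ij(z)
    bce = F.binary_cross_entropy_with_logits(pred, y_j)

    # Term 2 (subtracted): how much the prediction for j shifts when we
    # intervene on i. A larger shift signals a genuine causal link A_ij.
    pred_do = classifier_ij(z_intervened)
    effect_shift = (torch.sigmoid(pred) - torch.sigmoid(pred_do)).abs().mean()

    return bce - effect_shift
```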
Handling Imbalanced Data
There is one more hurdle: Class Imbalance. In CelebA, there are far more “Non-Chubby” people than “Chubby” people. This imbalance can skew causal estimation.
To fix this, the authors upgrade their loss function using Influence Functions and class-wise re-weighting.

- \(\gamma_k\) is a re-weighting term that gives more importance to rare classes.
- The denominator \(IB_{ij}\) (Influence Balance) measures how much a training sample influences the classifier, preventing easy, majority-class examples from dominating the training.
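As a rough sketch, the class-wise weights and the influence term could be combined along the following lines; the exact definitions of \(\gamma_k\) and \(IB_{ij}\) in the paper are not reproduced here.

```python
import torch

def reweighted_causal_loss(per_sample_loss, labels, influence, n_classes):
    """Class-wise re-weighting with an influence-balance term (a sketch).

    per_sample_loss: unweighted causal loss, one value per sample.
    influence:       per-sample influence scores (how strongly each sample
                     sways the classifier); their computation is paper-specific.
    """
    # gamma_k: inverse class frequency, so rare classes (e.g. "Chubby") get
    # larger weights than majority classes.
    counts = torch.bincount(labels, minlength=n_classes).float().clamp(min=1)
    gamma = (counts.sum() / counts)[labels]

    # Divide by the influence term so easy, majority-class samples that
    # dominate the classifier are down-weighted.
    weights = gamma / influence.clamp(min=1e-8)
    return (weights * per_sample_loss).mean()
```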
4. Diffusion-Based Decoding
We now have a clean, causally disentangled representation \(z_{causal}\). How do we turn this back into an image?
This is where the Diffusion Model shines. While VAEs (used in previous causal works) often produce blurry images due to the information bottleneck, Diffusion models iteratively denoise a signal to produce sharp, high-fidelity results.
The CIDiffuser uses a Denoising Diffusion Implicit Model (DDIM). It takes the stochastic representation \(x_T\) and gradually removes noise, conditioned on our clean causal representation \(z_{causal}\).
The reverse diffusion process (generating the image) is defined as:
\[
p_{\theta}(x_{0:T} \mid z_{causal}) \;=\; p(x_T)\prod_{t=1}^{T} p_{\theta}(x_{t-1} \mid x_t, z_{causal})
\]
And the specific denoising step is governed by:
\[
x_{t-1} \;=\; \sqrt{\alpha_{t-1}}\left(\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_{\theta}(x_t, t, z_{causal})}{\sqrt{\alpha_t}}\right) \;+\; \sqrt{1-\alpha_{t-1}}\,\epsilon_{\theta}(x_t, t, z_{causal})
\]
Here, \(\epsilon_{\theta}\) is a U-Net that predicts the noise to be removed at each step. Crucially, this U-Net is guided by \(z_{causal}\) via Adaptive Group Normalization (AdaGN) layers. This fuses the causal information into the image generation process at every level of detail.
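Putting the decoding loop together, a minimal deterministic DDIM sampler conditioned on \(z_{causal}\) might look like the following. Here `eps_model` stands in for the conditional U-Net, and the AdaGN conditioning lives inside it; this follows the standard DDIM update with \(\eta = 0\), not the authors' exact code.

```python
import torch

@torch.no_grad()
def ddim_decode(eps_model, x_T, z_causal, alphas_cumprod, timesteps):
    """Deterministic DDIM sampling conditioned on the causal representation.

    eps_model(x_t, t, z_causal): conditional U-Net noise predictor.
    alphas_cumprod:              tensor of cumulative alpha-bar values.
    timesteps:                   timestep indices in descending order.
    """
    x_t = x_T
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        eps = eps_model(x_t, t, z_causal)

        # Predict x_0 from the current noisy sample and the predicted noise.
        x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()

        # Deterministic (eta = 0) DDIM step toward the less noisy timestep.
        x_t = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x_t
```

Because the \(\eta = 0\) sampler is deterministic, editing only \(z_{causal}\) while reusing the same \(x_T\) leaves the low-level stochastic details of the image untouched.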
The Total Learning Objective
The model is trained end-to-end. The total loss function combines the standard diffusion loss (ELBO), a supervised loss to ensure attributes are predicted correctly, and the causal imbalance loss we discussed earlier.
\[
\mathcal{L}_{total} \;=\; \mathcal{L}_{diff} \;+\; \mathcal{L}_{s} \;+\; \mathcal{L}_{d}^{imb}
\]
This composite loss ensures three things simultaneously:
- High Fidelity: The image looks real (\(\mathcal{L}_{diff}\)).
- Semantic Correctness: The attributes match the labels (\(\mathcal{L}_{s}\)).
- Causal Independence: The attributes are disentangled and free of spurious correlations (\(\mathcal{L}_{d}^{imb}\)).
Experiments and Results
Does this complex architecture actually work? The authors tested CIDiffuser on two datasets: CelebA (faces) and Pendulum (a synthetic physics dataset).
Qualitative Results: The Eye Test
Let’s look at the editing results on CelebA. The goal here is to change one attribute without breaking others.

In Figure 3, look at the column “DiffAE” (a standard diffusion autoencoder without causal reasoning).
- When editing “Age” (Row 1), DiffAE adds glasses (Red box). The spurious correlation strikes again.
- When editing “Gender” (Row 2), other methods struggle to maintain identity or introduce artifacts.
Now look at the “CIDiffuser” column (far right). When editing age, the person becomes older, but no glasses appear. The causal intervention successfully broke the link. The edits are clean and precise.
The results on the Pendulum dataset are equally telling. This dataset involves a pendulum casting a shadow. The shadow's position is physically determined by the light position and the pendulum angle.

In Figure 4, you can see that other methods (like DEAR) often fail to move the shadow correctly when the light moves (Green/Blue circles), or they blur the image. CIDiffuser maintains the physical laws governing the scene because it learned the causal graph of the mechanism.
Quantitative Results: The Metrics
The authors used several metrics to measure success:
- TAD (Total AUROC Difference): A metric for disentanglement. Higher is better.
- MIC and TIC: Measures of how much attribute information the learned representation captures (these are the scores reported in Table 1). Higher is better.
- FID (Fréchet Inception Distance): A metric for image realism. Lower is better.

In Table 1, looking at the CelebA-smile and CelebA-age columns, CIDiffuser achieves the highest MIC and TIC scores (measures of information capture), significantly outperforming purely generative models like DiffAE and purely causal models like CFI-VAE.
Furthermore, let’s look at the trade-off between representation quality and dimensions.

The chart on the right of Figure 5 is interesting. It shows that as the dimension of the causal representation increases (x-axis), the disentanglement (TAD, blue line) improves up to a point (around 64 dimensions) before noise takes over. This confirms that the model is learning a compact, meaningful representation of the visual world.
Conclusion
The CIDiffuser paper represents a significant step forward in making AI tools more reliable and controllable. By refusing to accept “correlation” as “causation,” the model overcomes the biases inherent in training data.
Here are the key takeaways:
- Decomposition: Splitting images into semantic and stochastic components allows for precise editing without losing texture details.
- Causal Intervention: Using the “do-operator” during training allows the model to unlearn dataset biases (like “Old = Glasses”).
- Diffusion Power: Integrating this causal framework with diffusion models ensures the final output is photorealistic, fixing the blurriness issues of previous causal VAEs.
As generative AI continues to evolve, techniques like this will be essential. We don’t just want models that can generate any image; we want models that can generate the exact image we envision, respecting the causal logic of the real world.