If you have dabbled in computer vision or multimodal AI recently, you have undoubtedly encountered CLIP (Contrastive Language-Image Pre-training). Since its release by OpenAI, CLIP has become the backbone of modern AI image systems. It powers zero-shot classification, image retrieval, and serves as the “eyes” for many generative pipelines.

But CLIP has a secret weakness.

While it is excellent at recognizing objects (knowing that an image contains a “dog” and a “couch”), it is surprisingly bad at understanding the relationship between them. If you show CLIP an image of a dog sitting on a couch and ask it to distinguish between “a dog on a couch” and “a couch on a dog,” it often guesses randomly. In effect, CLIP behaves like a “bag-of-words” model: it checks for the presence of words but ignores the syntax and spatial structure that give a sentence its specific meaning.

In this post, we are diving deep into a fascinating paper: “Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP.”

The researchers propose a clever solution: SDS-CLIP. They take the rich, structural understanding locked inside generative models like Stable Diffusion and “distill” it into the lightweight, fast architecture of CLIP. The result? A model that retains the speed of CLIP but gains significant reasoning capabilities.

Let’s break down how this works, the math behind it, and why it matters.


The Context: The Blind Spot of Contrastive Models

To understand the solution, we first need to understand the problem. CLIP is trained using a contrastive objective. It pulls the vector embedding of an image close to the embedding of its corresponding text caption, while pushing away incorrect captions.

Mathematically, the standard CLIP loss looks like this:

\[
L_{CLIP} = L_{image-text} + L_{text-image}
\]

Here, \(L_{image-text}\) and \(L_{text-image}\) are calculated using softmax functions over the similarity scores of the embeddings:

\[
L_{image-text} = -\frac{1}{N}\sum_{j=1}^{N} \log \frac{\exp\left(\langle f_{\phi}(x_j), g_{\gamma}(c_j)\rangle / \tau\right)}{\sum_{k=1}^{N} \exp\left(\langle f_{\phi}(x_j), g_{\gamma}(c_k)\rangle / \tau\right)}
\]

\[
L_{text-image} = -\frac{1}{N}\sum_{j=1}^{N} \log \frac{\exp\left(\langle f_{\phi}(x_j), g_{\gamma}(c_j)\rangle / \tau\right)}{\sum_{k=1}^{N} \exp\left(\langle f_{\phi}(x_k), g_{\gamma}(c_j)\rangle / \tau\right)}
\]

Here \(f_{\phi}\) is the image encoder, \(g_{\gamma}\) is the text encoder, \(\tau\) is a temperature, and the sums run over the \(N\) image-text pairs \((x_j, c_j)\) in a batch: each matching pair is contrasted against all other pairings in the batch.
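To make this loss concrete, here is a minimal PyTorch sketch of the symmetric contrastive objective above. It only shows the loss over a batch of pre-computed embeddings; the encoders are assumed to exist elsewhere, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N matching image-text pairs.

    image_emb, text_emb: (N, D) outputs of the image and text encoders
    (f_phi and g_gamma in the notation above); row i of each is a matching pair.
    """
    # Cosine similarities between every image and every caption in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N)

    # The diagonal entries are the positive (matching) pairs.
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_image_text = F.cross_entropy(logits, targets)      # image -> text softmax
    loss_text_image = F.cross_entropy(logits.t(), targets)  # text -> image softmax
    # (The reference CLIP implementation averages the two terms; that only
    # rescales the sum used here.)
    return loss_image_text + loss_text_image
```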

This training method is fantastic for learning general semantics. However, it encourages the model to find “shortcuts.” To minimize this loss, CLIP only needs to know that the words “dog” and “couch” appear in the text and the corresponding objects appear in the image. It doesn’t necessarily need to learn where they are relative to each other.

The Generative Alternative

At the other end of the spectrum, we have text-to-image generative models like Stable Diffusion. Because these models must generate pixels from scratch based on a prompt, they have to understand composition. You cannot generate a convincing image of “an astronaut riding a horse” without understanding who is on top of whom.

Recent research has shown that you can actually use Stable Diffusion as a classifier. By measuring the “denoising score” (how well the model thinks an image matches a text prompt), Stable Diffusion achieves state-of-the-art performance on difficult reasoning benchmarks like Winoground.

Winoground is a dataset designed specifically to trick models like CLIP. It contains pairs of images and captions that use the exact same words but in different orders (e.g., “water pouring into a cup” vs. “a cup pouring into water”).

The Trade-off

So, why don’t we just use Stable Diffusion for everything? Speed.

Calculating a classification score using a diffusion model requires running a massive U-Net iteratively, often dozens of times. As shown in the chart below, while the “Diffusion Score” (the orange star) achieves high accuracy on Winoground, it is excruciatingly slow—taking nearly 18x longer than CLIP.

A scatter plot comparing Winoground Scores vs. Time. CLIP variants are clustered at the bottom left (fast but lower scores), while the Diffusion Score is at the top right (high score but very slow).

The goal of this paper is to get to the top-left of this graph: High reasoning accuracy (like Diffusion) with high speed (like CLIP).


The Core Method: SDS-CLIP

The researchers introduce SDS-CLIP (Score Distillation Sampling for CLIP). The intuition is simple: treat Stable Diffusion as a “teacher” and CLIP as a “student.” We want to fine-tune CLIP so that its embeddings align with what Stable Diffusion “knows” about the image, without actually having to run the heavy diffusion model at inference time.

To do this, they adapt a technique called Score Distillation Sampling (SDS), which was originally famous for generating 3D objects from 2D diffusion models (like in the DreamFusion paper).

Step 1: The Diffusion Score

First, let’s look at how a diffusion model measures the fit between an image \(x\) and a caption \(c\). This is called the denoising diffusion score, denoted as \(d(x, c)\). It measures the error between the predicted noise and the actual noise added to an image:

\[
d(x, c) = \mathbb{E}_{t, \epsilon}\left[\, \left\| \epsilon_{\theta}\big(v_{\alpha}(x), t, c\big) - \epsilon \right\|_{2}^{2} \,\right]
\]

In this equation:

  • \(\epsilon_{\theta}\) is the U-Net (the noise predictor).
  • \(v_{\alpha}\) is the encoder that maps the image \(x\) into the latent space (like a VQ-VAE).
  • \(t\) is the timestep (how much noise is added).
  • \(\epsilon\) is the actual Gaussian noise.

If we were using the diffusion model directly for classification, we would pick the caption \(c^*\) that minimizes this error:

\[
c^{*} = \underset{c}{\arg\min}\; d(x, c)
\]
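To see what this costs in practice, here is a hedged PyTorch sketch of diffusion-score classification. The callables `vae_encode`, `unet`, `encode_text`, and `add_noise` are hypothetical stand-ins for Stable Diffusion’s latent encoder \(v_{\alpha}\), U-Net \(\epsilon_{\theta}\), text encoder, and forward-noising step (in practice the latent is noised to timestep \(t\) before the U-Net predicts the noise), not a specific library API.

```python
import torch

@torch.no_grad()
def diffusion_score(x, caption, vae_encode, unet, encode_text, add_noise,
                    num_samples: int = 32) -> torch.Tensor:
    """Monte Carlo estimate of d(x, c): the expected squared error between
    the U-Net's predicted noise and the true noise for image x and caption c.
    All callables are hypothetical stand-ins for frozen Stable Diffusion parts."""
    latents = vae_encode(x)              # v_alpha(x), e.g. shape (1, 4, 64, 64)
    cond = encode_text(caption)          # text conditioning for the U-Net
    errors = []
    for _ in range(num_samples):
        t = torch.randint(0, 1000, (latents.size(0),))   # random timestep
        noise = torch.randn_like(latents)                # epsilon ~ N(0, I)
        noisy = add_noise(latents, noise, t)             # forward diffusion to step t
        pred = unet(noisy, t, cond)                      # epsilon_theta(., t, c)
        errors.append(((pred - noise) ** 2).mean())
    return torch.stack(errors).mean()

def classify(x, captions, **sd_parts) -> int:
    """Return the index of the caption c* that minimizes the diffusion score."""
    scores = [diffusion_score(x, c, **sd_parts) for c in captions]
    return int(torch.stack(scores).argmin())
```

Every candidate caption needs `num_samples` U-Net passes, which is exactly why this route is so much slower than a single CLIP forward pass.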

Step 2: Connecting CLIP to the Teacher

The researchers want CLIP to learn this score. But CLIP’s image encoder (\(f_{\phi}\)) outputs a 1D vector (the classification token), while Stable Diffusion’s U-Net expects a spatial feature map (usually \(4 \times 64 \times 64\)).

To bridge this gap, they introduce a simple linear map, \(h_w\), which projects the CLIP embedding into the input space of the Stable Diffusion U-Net.
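A plausible shape for this map, sketched in PyTorch with illustrative dimensions (a 512-dimensional CLIP embedding, which varies by backbone, projected to the \(4 \times 64 \times 64\) latent grid), is a single linear layer followed by a reshape:

```python
from math import prod

import torch
import torch.nn as nn

class ClipToLatentMap(nn.Module):
    """h_w: projects a CLIP image embedding into the U-Net's spatial input space.
    Dimensions here are illustrative, not taken from the paper."""

    def __init__(self, clip_dim: int = 512, latent_shape=(4, 64, 64)):
        super().__init__()
        self.latent_shape = latent_shape
        self.proj = nn.Linear(clip_dim, prod(latent_shape))

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, 4, 64, 64)
        return self.proj(clip_emb).view(-1, *self.latent_shape)
```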

Step 3: The Distillation Loss (\(L_{SDS}\))

Here is the crucial innovation: instead of just training CLIP on image-text pairs with the standard contrastive loss, they add a new regularization term. This term penalizes CLIP if its embeddings (when projected into the U-Net) result in a high error in the diffusion model.

In other words, CLIP is updated to minimize the noise prediction error of Stable Diffusion. This forces CLIP to pay attention to the features that Stable Diffusion cares about—namely, structure and composition.

The SDS loss is defined as:

\[
L_{SDS} = \mathbb{E}_{t, \epsilon}\left[\, \left\| \epsilon_{\theta}\big(h_{w}(f_{\phi}(x)), t, c\big) - \epsilon \right\|_{2}^{2} \,\right]
\]

Notice that the input to the U-Net \(\epsilon_{\theta}\) is now \(h_w(f_{\phi}(x))\)—the projected CLIP embedding—rather than the standard image latents.
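Here is a sketch of that loss, reusing the hypothetical components from the earlier snippets. The key differences from the diffusion score are that the mapped CLIP embedding \(h_w(f_{\phi}(x))\) replaces the image latents and that no `no_grad` guard is used, so gradients can reach CLIP and the map while the U-Net itself stays frozen.

```python
import torch

def sds_loss(x, captions, clip_image_encoder, h_w, unet, encode_text, add_noise,
             num_samples: int = 1) -> torch.Tensor:
    """L_SDS: noise-prediction error when the frozen U-Net is fed h_w(f_phi(x)).

    clip_image_encoder (f_phi) and h_w are trainable; unet (epsilon_theta),
    encode_text, and add_noise are frozen stand-ins, as in the earlier sketches.
    """
    mapped = h_w(clip_image_encoder(x))      # h_w(f_phi(x)), shape (N, 4, 64, 64)
    cond = encode_text(captions)             # conditioning from the paired captions
    loss = x.new_zeros(())
    for _ in range(num_samples):
        t = torch.randint(0, 1000, (mapped.size(0),))
        noise = torch.randn_like(mapped)
        noisy = add_noise(mapped, noise, t)  # forward-diffuse the mapped features
        pred = unet(noisy, t, cond)          # frozen U-Net's noise prediction
        loss = loss + ((pred - noise) ** 2).mean()
    return loss / num_samples
```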

Step 4: The Combined Objective

The final training objective combines the original CLIP contrastive loss (to keep the general semantic knowledge) with the new SDS loss (to inject compositional reasoning):

\[
L_{total} = L_{CLIP} + \lambda\, L_{SDS}
\]

Here, \(\lambda\) is a hyperparameter that balances the two objectives.

The optimization process finds the best parameters for CLIP’s image encoder (\(\phi\)), text encoder (\(\gamma\)), and the linear map (\(w\)). Crucially, the U-Net parameters (\(\theta\)) are frozen. We are only using the U-Net to calculate gradients to update CLIP.

\[
\phi^{*}, \gamma^{*}, w^{*} = \underset{\phi,\, \gamma,\, w}{\arg\min}\; \big(L_{CLIP} + \lambda\, L_{SDS}\big), \qquad \theta \text{ frozen}
\]
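Putting it together, one fine-tuning step under the combined objective might look like the sketch below, reusing the functions defined in the earlier snippets. `clip_model` (with `encode_image`/`encode_text` methods), the bundled `sd_parts`, and the value of \(\lambda\) are illustrative assumptions rather than details from the paper.

```python
def training_step(images, captions, clip_model, h_w, sd_parts, optimizer,
                  lam: float = 1.0) -> float:
    """One step of SDS-CLIP fine-tuning: L_total = L_CLIP + lambda * L_SDS."""
    # Standard contrastive term over the batch (keeps general semantics).
    image_emb = clip_model.encode_image(images)
    text_emb = clip_model.encode_text(captions)
    loss_clip = clip_contrastive_loss(image_emb, text_emb)

    # Distillation term: the frozen U-Net scores the projected CLIP embeddings.
    loss_sds = sds_loss(images, captions, clip_model.encode_image, h_w, **sd_parts)

    loss = loss_clip + lam * loss_sds
    optimizer.zero_grad()
    loss.backward()      # gradients update phi, gamma, and w; theta stays frozen
    optimizer.step()
    return float(loss.detach())
```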

Efficiency Note

Training a whole CLIP model is expensive. To make this efficient, the authors do not fine-tune all of CLIP. Instead, they only update the LayerNorm parameters and the new linear map. This involves a tiny fraction of the total parameters (only about 8 million), making the process computationally cheap and less prone to overfitting on the small dataset used for fine-tuning (MS-COCO).
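In PyTorch terms, restricting training to the LayerNorm parameters plus the linear map might look like the following sketch; the module structure and the optimizer settings in the comment are illustrative.

```python
import torch.nn as nn

def layernorm_and_map_params(clip_model: nn.Module, h_w: nn.Module):
    """Freeze all of CLIP except its LayerNorm parameters; always train h_w."""
    for p in clip_model.parameters():
        p.requires_grad = False
    for m in clip_model.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True
    return [p for p in clip_model.parameters() if p.requires_grad] + list(h_w.parameters())

# Example (illustrative hyperparameters):
# optimizer = torch.optim.AdamW(layernorm_and_map_params(clip_model, h_w), lr=1e-5)
```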


Experiments and Results

Does borrowing brains from a generative model actually work? The results suggest a resounding yes.

1. Performance on Winoground

The primary benchmark for this study is Winoground, the “boss fight” for visio-linguistic reasoning.

Table 1: SDS-CLIP improves CLIP performance on Winoground by 1.5% to 7% across various CLIP backbones.

Looking at Table 1 (above), we can see:

  • ViT-B/16 (CLIP): The baseline score is 0.24.
  • FT with \(L_{CLIP}\): If you just fine-tune CLIP on the MS-COCO dataset using the standard loss, the score actually drops to 0.23. This shows that simply seeing more data isn’t enough; the way the model learns matters.
  • FT with \(L_{CLIP} + L_{SDS}\) (Ours): When using the SDS distillation, the score jumps to 0.31.

This is a substantial improvement (up to a 7-point absolute gain across backbones). The model improves significantly in the “Object” and “Relation” categories, meaning it is finally starting to understand what is interacting with what.

2. Performance on ARO

They also tested on the ARO dataset, which checks for attribute binding (e.g., “red cube, blue sphere” vs. “blue cube, red sphere”). SDS-CLIP showed improvements of 1–3% here as well. Interestingly, the method didn’t help much with simple text ordering tasks, suggesting that while the visual reasoning improved, the grammatical understanding of the text encoder remains a limiting factor.

3. Zero-Shot Capabilities

One major risk of fine-tuning a foundation model like CLIP is “catastrophic forgetting.” You might teach it to understand complex sentences better, but it might lose its ability to recognize a generic “goldfish” or “airplane.”

To check this, the authors evaluated SDS-CLIP on standard classification benchmarks (ImageNet, CIFAR, etc.).

Figure 2: Radar charts showing that SDS-CLIP (red) matches or exceeds the zero-shot performance of standard CLIP (blue) on downstream datasets.

As shown in Figure 2, the red lines (Ours) almost perfectly overlap or extend beyond the blue lines (Baseline CLIP).

  • For ViT-B/16 (Chart a), they actually observed a 1-8% improvement on datasets like MNIST and ImageNet.
  • For larger models, the performance remained stable.

This confirms that SDS-CLIP is a strict upgrade: better reasoning without sacrificing general knowledge.


Conclusion and Key Takeaways

The paper “Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP” highlights a growing trend in AI: the convergence of generative and discriminative models.

Here are the big takeaways for students and practitioners:

  1. Generative Models Know More: Models trained to create images (like Stable Diffusion) learn structural and spatial relationships that contrastive models (like CLIP) often ignore.
  2. Distillation is Powerful: We don’t always need to deploy the massive generative model. We can use it as a teacher to improve smaller, faster models.
  3. The Objective Function Matters: Simply fine-tuning on better data (MS-COCO) didn’t fix CLIP’s reasoning. Changing the loss function to include the SDS term is what unlocked the improvement.
  4. Parameter Efficiency: You can achieve significant gains by training only a tiny fraction of the network (LayerNorms), preserving the pre-trained “muscle memory” of the original model.

SDS-CLIP represents a promising step toward multimodal models that don’t just recognize keywords, but actually “see” and “understand” the scene, all while maintaining the speed required for real-world applications.