Introduction

In the era of foundation models, CLIP (Contrastive Language-Image Pre-training) stands out as a watershed moment. Unlike earlier image classifiers, which were trained on a fixed set of categories (such as the ImageNet classes) and crumbled when shown something slightly different, CLIP exhibits a remarkable ability to understand concepts it has never explicitly seen during training. It can look at a photo, a sketch, or a painting and often understand that they all depict the same object.

This capability is known as Out-of-Distribution (OOD) generalization. The prevailing wisdom suggests that CLIP’s superpower comes from the sheer scale and diversity of its training data—400 million pairs of images and text scraped from the internet. But “diversity” is a vague term. What kind of diversity matters? And mechanically, what happens inside the neural network that allows it to bridge the gap between a photograph of a dog and a quick pencil doodle of one?

A recent research paper, “When and How Does CLIP Enable Domain and Compositional Generalization?”, dives deep into these questions. The researchers didn’t just accept that “more data is better.” Instead, they constructed controlled experiments to isolate specific factors of training data. They asked two fundamental questions:

  1. Domain Generalization: Can CLIP handle an entirely unseen art style (like a sketch) if it has seen many other styles?
  2. Compositional Generalization: Can CLIP recognize a “sketch of a dog” if it has seen “photos of dogs” and “sketches of cats,” but never a sketch of a dog?

This post will walk you through their methodology, their surprising findings about when generalization fails, and their mechanistic analysis of the internal circuits that drive CLIP’s behavior.

The Generalization Puzzle

Before diving into the experiments, we need to define the two types of hurdles the researchers set up for the model.

Domain Generalization (DG)

This is the ability to extrapolate. Imagine a model trained on photos, paintings, and cartoons. Can it successfully classify objects in a sketch, even though it has never seen a single sketch during training? If it can, it has achieved domain generalization. It has learned to ignore the “style” and focus on the “content.”

Compositional Generalization (CG)

This is the ability to recombine known concepts. Suppose the model has seen:

  • Object: “Dog” (in photos)
  • Domain: “Sketch” (containing cats, cars, and trees)

It knows what a dog looks like, and it knows what the “sketch” style looks like. Can it put them together to recognize a sketch of a dog? This is the “compositional” challenge—a longstanding goal in AI research.

The Experimental Setup: Controlling the Chaos

To study these phenomena, you cannot simply use a massive dataset like LAION-400M because it’s too messy; you never know exactly what the model has or hasn’t seen. The researchers built a controlled environment using ImageNet (mostly natural images) and DomainNet (images across 6 styles: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch).

As shown in the figure below, they created four distinct training “diets” for the models to consume:

Training data setups and generalization performance. Panel A shows the four training setups: Natural-only, Leave-out-domain, CG low-diversity, and CG high-diversity. Panel B illustrates the conceptual performance gains.

  1. Natural-only: The baseline. Trained mostly on photos. This mimics standard ImageNet training.
  2. Leave-out-domain (DG): The model sees a diverse mix of domains (Paintings, Clipart, etc.) but is never shown the test domain (e.g., Sketches).
  3. CG Low-Diversity: The model sees photos + a specific subset of the test domain (e.g., Sketches of cats/cars, but not dogs).
  4. CG High-Diversity: The model sees a rich mix of many domains + a specific subset of the test domain.

The test set is always the same: unseen classes in the test domain (e.g., Sketches of dogs).
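
To make the setups concrete, here is a minimal sketch of how such splits could be built, assuming a DomainNet-style index of (image path, domain, class) records. The function names, the approximation of the “specific subset” as “all classes except the held-out ones,” and the default arguments are illustrative assumptions, not the authors’ actual data-loading code.

```python
# The six DomainNet styles, for reference.
DOMAINS = {"clipart", "infograph", "painting", "quickdraw", "real", "sketch"}

def build_split(records, setup, test_domain="sketch", test_classes=frozenset({"dog"})):
    """Filter an index of (path, domain, cls) triples into one of the four training setups."""
    def keep(domain, cls):
        if setup == "natural_only":
            return domain == "real"
        if setup == "leave_out_domain":      # DG: every domain except the held-out test domain
            return domain != test_domain
        if setup == "cg_low_diversity":      # photos + the test domain minus the held-out classes
            return domain == "real" or (domain == test_domain and cls not in test_classes)
        if setup == "cg_high_diversity":     # all domains; test domain minus the held-out classes
            return domain != test_domain or cls not in test_classes
        raise ValueError(f"unknown setup: {setup}")

    return [(p, d, c) for (p, d, c) in records if keep(d, c)]

def build_test(records, test_domain="sketch", test_classes=frozenset({"dog"})):
    """The evaluation set is always the held-out combination, e.g. sketches of dogs."""
    return [(p, d, c) for (p, d, c) in records if d == test_domain and c in test_classes]
```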

A Look at the Data

To appreciate the difficulty of this task, look at the visual variation in DomainNet. A “Quickdraw” (doodle) looks nothing like a “Real” photo, and an “Infograph” is structurally very different from a “Painting.”

Random examples across the six domains of DomainNet showing aircraft carriers, axes, bananas, and lions in different styles.

Experiment Findings: When Does Generalization Happen?

The researchers trained ResNet-50 CLIP models on these datasets and measured their “effective robustness”: a metric that captures how much better a model performs on the difficult test domain than its accuracy on standard natural images would predict.
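
For intuition, here is a rough sketch of effective robustness in the spirit of Taori et al.: fit a baseline trend from in-distribution (ID) accuracy to out-of-distribution (OOD) accuracy over a set of reference models, then measure how far a given model sits above that trend. The linear fit on raw accuracies and the example numbers are simplifications, not the paper’s exact procedure.

```python
import numpy as np

def effective_robustness(id_acc, ood_acc, baseline_id, baseline_ood):
    """OOD accuracy above what the baseline ID->OOD trend would predict."""
    slope, intercept = np.polyfit(baseline_id, baseline_ood, deg=1)  # baseline trend line
    predicted_ood = slope * id_acc + intercept                       # expected OOD accuracy
    return ood_acc - predicted_ood                                   # gap above the trend

# Example: a model at 70% ID / 45% OOD, compared against three baseline models.
print(effective_robustness(0.70, 0.45,
                           baseline_id=[0.60, 0.65, 0.75],
                           baseline_ood=[0.30, 0.33, 0.40]))
```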

Finding 1: Diversity is King

The first major finding confirms the intuition that diversity drives robustness. Models trained with high domain diversity (the red and purple lines in the chart below) significantly outperformed those trained on low diversity (blue and green lines).

Effective robustness plots comparing the four training setups. High diversity settings (red/purple) show consistently higher generalization on unseen classes compared to low diversity (blue/green).

In the chart above, look at the “Clipart” and “Sketch” panels. The red crosses (Leave-out-domain) are much higher than the blue circles (Natural-only). This shows that even if the model has never seen a sketch, seeing paintings and clipart helps it understand sketches better than seeing only photos does.

Finding 2: The “Add-on” Effect

Not all domains are created equal. The researchers experimented with adding domains one by one to see which ones boosted performance the most.

Bar charts showing the impact of adding specific domains to the training set. Adding domains generally improves performance on unseen classes.

As shown in Figure 3, adding domains generally pushes performance up (the bars get higher). However, there are diminishing returns, and the relationship between domains matters. For example, if you want to generalize to “Clipart,” seeing “Sketches” helps. But if you’ve already seen “Quickdraw,” adding “Sketch” might not help as much because they share similar visual features (black and white contours).
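
One way to read “adding domains one by one” is the loop sketched below: train on natural images plus one extra domain at a time, then evaluate on the held-out test domain. Here `train_clip` and `evaluate` are hypothetical stand-ins for the actual training and evaluation pipeline.

```python
def domain_addon_study(records, candidate_domains, test_domain, train_clip, evaluate):
    """Measure how much each added domain helps generalization to the test domain."""
    results = {}
    for extra in candidate_domains:
        if extra == test_domain:
            continue                                              # never train on the test domain
        train_set = [(p, d, c) for (p, d, c) in records if d in ("real", extra)]
        model = train_clip(train_set)                             # hypothetical trainer
        results[extra] = evaluate(model, test_domain)             # accuracy on unseen classes
    return results
```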

Finding 3: The Compositional Paradox

Here is where things get counter-intuitive.

Logically, Compositional Generalization (CG) should be easier than Domain Generalization (DG). In CG, the model has actually seen the test domain (just not the specific test classes). In DG, it hasn’t seen the domain at all.

However, the results showed something strange: CG performance was often worse than DG performance.

Table 1 highlights a “generalization gap.” Even in the best-case scenarios (CG high-diversity), the model performs worse than a model that simply saw the specific classes in the test domain (the “upper bound”).

Table showing the performance gap. Even high-diversity models lag behind models that have seen the specific class-domain samples (upper bound).

Why does this happen? The culprit is a “seen class bias.” The researchers discovered that when CLIP sees only a specific subset of classes in a domain (e.g., sketches of cats), it overfits to them. It might learn a shortcut like “anything that looks like a pencil drawing is a cat.”

When you then ask it to identify a “sketch of a dog,” it fails because its internal definition of “sketch” is tightly coupled with “cat.” This is why sometimes not seeing the domain at all (DG) yielded better results than seeing a biased subset of it (CG).

To prove this, they ran an experiment where they partitioned the classes into three sets:

  1. \(C_1\): Classes seen during training (biased).
  2. \(C_2\): The target unseen classes.
  3. \(C_3\): Distractor classes not queried in the test.

Diagram explaining the class partitioning for investigating seen class bias.

They found that if they trained on sketches of \(C_3\) (classes that appear neither among the target classes nor in the test-time query set), compositional generalization improved significantly. This suggests that overlap between the classes seen in the test domain and the labels queried at test time is a dangerous trap for compositional learning.
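
As a rough sketch, the three-way partition could be built as below; the split sizes, the random seeding, and the comparison described in the final comment are illustrative rather than the paper’s exact protocol.

```python
import random

def partition_classes(all_classes, n_seen, n_target, seed=0):
    """Split classes into C1 (seen in test domain), C2 (unseen targets), C3 (distractors)."""
    rng = random.Random(seed)
    shuffled = rng.sample(sorted(all_classes), k=len(all_classes))
    c1 = set(shuffled[:n_seen])                       # seen in the test domain during training
    c2 = set(shuffled[n_seen:n_seen + n_target])      # unseen target classes queried at test time
    c3 = set(shuffled[n_seen + n_target:])            # distractors, never queried at test time
    return c1, c2, c3

# Training on (test_domain, C3) instead of (test_domain, C1) keeps the domain visible during
# training while removing the overlap between seen-domain classes and the test-time label set.
```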

Mechanistic Analysis: How Does CLIP Think?

The second half of the paper moves from “what happened” to “why it happened.” The researchers used techniques from mechanistic interpretability to look inside the “black box” of the neural network.

They hypothesized that successful generalization requires the model to learn shared representations and shared circuits. In other words, the neurons that fire for “dog” in a photo should be the same neurons that fire for “dog” in a sketch.

1. Visual Embeddings and Sparse Autoencoders

First, they looked at the output embeddings. They used Sparse Autoencoders (SAEs), a technique to decompose the model’s dense embedding vectors into interpretable “features.”

The SAE maps an embedding \(x\) to sparse features \(z = \mathrm{ReLU}(W_{\text{enc}} x + b_{\text{enc}})\) and reconstructs it as \(\hat{x} = W_{\text{dec}} z + b_{\text{dec}}\); training encourages \(z\) to be sparse.

Using SAEs, they measured feature sharing. They found a strong correlation: models that generalized better shared more features across domains. If the model treats “sketch features” and “photo features” as totally different languages, it fails to generalize.
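
A minimal sketch of this kind of analysis is below: a small SAE over CLIP embeddings plus one way to score feature sharing between two domains (a Jaccard overlap of active features). The dimensions, the threshold, and the sharing score are assumptions for illustration, not the paper’s exact implementation, and the SAE training loop (reconstruction loss plus a sparsity penalty) is omitted.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=1024, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(z)           # reconstruction of the CLIP embedding
        return x_hat, z

def active_features(sae, embeddings, threshold=0.0):
    """Indices of SAE features that fire on at least one embedding from a domain."""
    with torch.no_grad():
        _, z = sae(embeddings)
    return set(torch.nonzero(z.max(dim=0).values > threshold).flatten().tolist())

def feature_sharing(sae, emb_domain_a, emb_domain_b):
    """Jaccard overlap of active features between two domains (e.g. sketch vs. photo)."""
    a, b = active_features(sae, emb_domain_a), active_features(sae, emb_domain_b)
    return len(a & b) / max(len(a | b), 1)
```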

2. The Quickdraw Mystery

Throughout the experiments, one domain stood out as a failure case: Quickdraw (stick-figure doodles). Even with high diversity training, CLIP struggled to generalize to unseen Quickdraw classes.

Was it the captions? The researchers tried training with domain-invariant captions (e.g., just “a dog” instead of “a quickdraw of a dog”) to force the image embeddings from different domains to align.
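
The caption change itself is easy to sketch; the exact templates used in the paper may differ.

```python
# Illustrative caption templates for the ablation: domain-specific captions name the style,
# domain-invariant captions drop it.
def make_caption(class_name, domain, invariant=False):
    if invariant:
        return f"a {class_name}"                 # e.g. "a dog"
    return f"a {domain} of a {class_name}"       # e.g. "a quickdraw of a dog"

print(make_caption("dog", "quickdraw"))                   # a quickdraw of a dog
print(make_caption("dog", "quickdraw", invariant=True))   # a dog
```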

UMAP visualizations of embeddings. Panel A shows distinct clusters for Quickdraw vs Others. Panel B shows better alignment when using invariant captions.

As Figure 5 shows, changing the captions forced the embeddings to align (Panel B is far more mixed than Panel A). However, generalization did not improve. This shows that aligned output embeddings alone are not enough.

If the outputs are aligned but the answer is wrong, the problem must be in the computation process—the circuits.

The researchers introduced a method to measure Mechanistic Similarity. They represented the model’s internal processing for a specific class as a graph (a circuit of neurons and their connections) and compared these graphs across domains.

They used two metrics; a rough code sketch of both follows the list:

  • Representational Similarity (CKA): Do the layers produce similar activation patterns?
  • Neuron Sharing (Jaccard Index): Are the same specific neurons being used?
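
Below is a hedged sketch of both measures: the standard linear CKA formula (Kornblith et al., 2019) and a Jaccard index over the top-k most active neurons per domain. How activations are matched across domains, and how the paper actually attributes neurons to a class’s circuit, are details not reproduced here; selecting neurons by mean activation is a simplification.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices with matched rows (n_samples x n_neurons)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def top_k_neurons(acts, k=50):
    """Indices of the k neurons with the largest mean activation over a domain's images."""
    return set(np.argsort(acts.mean(axis=0))[-k:].tolist())

def neuron_sharing(acts_domain_a, acts_domain_b, k=50):
    """Jaccard index between the top-k neuron sets of two domains at a given layer."""
    a, b = top_k_neurons(acts_domain_a, k), top_k_neurons(acts_domain_b, k)
    return len(a & b) / len(a | b)
```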

The results were striking:

Charts showing representational similarity (CKA) and shared neurons. Quickdraw consistently shows the lowest similarity and sharing compared to other domains.

Look at the blue lines in Figure 6. For the Quickdraw domain, both representational similarity (CKA) and neuron sharing are drastically lower than for other domains.

This is the “smoking gun.” Although the final embedding might look okay, the internal mechanism used to process a Quickdraw image is fundamentally different from the mechanism used for a photo or painting. Because the circuits aren’t shared, the knowledge doesn’t transfer. The model effectively has a “Quickdraw brain” and a “Real brain” that don’t talk to each other.

To visualize this further, look at the neuron sharing across layers for all domains:

Detailed charts of shared neurons across layers. Quickdraw (blue) shows significantly less sharing, especially for unseen classes.

While domains like Sketch and Painting share a healthy number of neurons with the “Others” (orange lines), Quickdraw (blue lines) lags behind, particularly in the deeper layers where high-level semantic processing happens.

Conclusion: The Recipe for Generalization

This research peels back the layers of CLIP’s impressive performance. It confirms that the “magic” of foundation models isn’t magic at all—it’s a product of specific data properties and internal mechanisms.

Key Takeaways:

  1. Diversity is non-negotiable: You cannot train a robust model on a homogeneous dataset. Domain diversity is the fuel for both domain and compositional generalization.
  2. Beware the “Seen Class” Trap: Seeing a domain partially can be dangerous. If a model learns to associate a style (like sketches) with only a few classes, it creates a bias that hurts its ability to recognize new classes in that style.
  3. It’s What’s Inside That Counts: Alignment of the final output embeddings isn’t enough. True generalization requires mechanistic similarity—the model must use the same internal circuits and neurons to process a concept, regardless of the artistic style or domain it appears in.

For students and practitioners, this implies that improving foundation models isn’t just about scraping more data. It’s about curating data that encourages the model to build universal, shared circuits rather than fragmented, domain-specific heuristics. As we push toward more general AI, understanding how these models connect concepts internally is just as important as measuring their accuracy on a test set.