Decoding Vision: How to Find Compositional Structures Inside Image Embeddings

Humans are natural composers. When you see a “red car,” you don’t just see a unique, atomic entity; you instinctively understand it as a combination of an object (“car”) and an attribute (“red”). This ability to break down complex concepts into simpler, reusable parts is called compositionality. It allows us to understand things we’ve never seen before. If you know what “blue” looks like and what a “banana” looks like, you can imagine a “blue banana” without ever having encountered one.

In the world of Artificial Intelligence, Vision-Language Models (VLMs) like CLIP have revolutionized how computers understand images by mapping text and visual inputs into a shared “embedding space.” We know that the text side of these models is compositional—you can use vector arithmetic on words (e.g., King - Man + Woman = Queen). But what about the visual side? Do VLMs organize images in a similarly structured, compositional way?

A recent research paper, “Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models,” dives deep into this question. The authors propose a mathematically rigorous framework called Geodesically Decomposable Embeddings (GDE) to prove that, yes, visual embeddings are compositional—but unlocking that structure requires navigating the complex, curved geometry of the latent space.

In this post, we will break down how GDE works, why simple linear math fails for images, and how this method can be used to debias models and even generate strange new hybrid animals.

Figure 1: Overview of Compositional Structures in Visual Embedding Space. The left side shows how visual inputs (like shoes) are encoded into a space where attributes (red/blue) and objects (heels/sandals) form regular geometric shapes. The right side shows applications: classification, robustness, and generation.

The Problem: Why Images are Harder than Text

To understand why this research is necessary, we first need to look at the difference between text and images in latent space.

Previous research demonstrated that text embeddings in CLIP can be approximated by linear combinations. If you want a vector for “red car,” you can roughly sum the vector for “red” and the vector for “car.” This is possible because language is symbolic and discrete.

Images, however, present two massive challenges:

  1. Noise and Ambiguity: You cannot simply “write” a pure image of a concept. An image of a “red car” inevitably contains extra information: a road, a driver, a background, lighting conditions, etc. This “noise” muddies the embedding.
  2. Sparsity: In language, we can easily type “blue apple.” In image datasets, however, certain combinations of attributes and objects simply do not exist.

Furthermore, CLIP embeddings are normalized. They live on the surface of a hypersphere (a high-dimensional sphere), not on a flat plane. While linear arithmetic (\(A + B = C\)) works well enough for text in small regions, it fails to respect the intrinsic curvature of the sphere when dealing with complex visual data.

The authors of this paper argue that to find compositionality in images, we must respect the Riemannian geometry of the embedding space.

The Solution: Geodesically Decomposable Embeddings (GDE)

The core contribution of this paper is a framework that decomposes visual concepts while respecting the spherical geometry of the data. The method is called Geodesically Decomposable Embeddings (GDE).

1. The Geometry: From Linear to Geodesic

Imagine you are trying to find the halfway point between Tokyo and New York. If you drew a straight line connecting them, it would tunnel through the Earth's crust. To get a valid path, you must follow a curve along the surface. This curve is called a geodesic.
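A toy numpy sketch (ours, not the paper's) makes the difference concrete: the straight-line average of two points on a unit sphere falls below the surface, while the geodesic midpoint stays on it.

```python
import numpy as np

def geodesic_midpoint(a, b):
    """Great-circle (geodesic) midpoint of two unit vectors."""
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))   # angle between a and b
    return np.sin(omega / 2) * (a + b) / np.sin(omega)

a = np.array([1.0, 0.0, 0.0])          # "Tokyo"
b = np.array([0.0, 1.0, 0.0])          # "New York"

linear_mid = (a + b) / 2               # straight-line average: tunnels below the surface
geo_mid = geodesic_midpoint(a, b)      # stays on the sphere

print(np.linalg.norm(linear_mid))      # ~0.707 (inside the sphere)
print(np.linalg.norm(geo_mid))         # 1.0   (on the surface)
```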

The same logic applies to embeddings on a hypersphere. To decompose embeddings properly, we cannot just add vectors. We need to:

  1. Project points from the curved manifold (\(\mathcal{M}\)) onto a flat tangent space (\(T_\mu\mathcal{M}\)) at a specific reference point (\(\mu\)).
  2. Perform arithmetic (addition/subtraction) in this flat tangent space.
  3. Project back onto the manifold.

The authors define a set of embeddings as geodesically decomposable if composite concepts (like “red car”) can be formed by moving along geodesics determined by primitive parts (“red” + “car”) starting from a central context.

Figure 2: The Decomposition Method. (Top-Left) Concepts exist on a manifold sphere. (Bottom) They are mapped to a flat tangent space where vector means are computed. (Top-Right) The resulting vectors are mapped back to the manifold.

2. The Math: Logarithms and Exponentials

How do we move between the curved sphere and the flat tangent space? We use two geometric operations:

  • Logarithmic Map (\(\text{Log}_\mu\)): Takes a point on the sphere and maps it to a vector in the tangent space. This “flattens” the local geometry.
  • Exponential Map (\(\text{Exp}_\mu\)): Takes a vector from the tangent space and wraps it back onto the sphere.
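On the unit hypersphere, both maps have simple closed forms. Here is a minimal numpy sketch (the function and variable names are ours, not the paper's):

```python
import numpy as np

def log_map(mu, x):
    """Logarithmic map: point x on the unit sphere -> tangent vector at mu."""
    theta = np.arccos(np.clip(mu @ x, -1.0, 1.0))   # geodesic distance from mu to x
    if np.isclose(theta, 0.0):
        return np.zeros_like(x)
    # Component of x orthogonal to mu, rescaled to length theta
    return (x - np.cos(theta) * mu) * theta / np.sin(theta)

def exp_map(mu, v):
    """Exponential map: tangent vector v at mu -> point on the unit sphere."""
    norm_v = np.linalg.norm(v)
    if np.isclose(norm_v, 0.0):
        return mu.copy()
    return np.cos(norm_v) * mu + np.sin(norm_v) * v / norm_v
```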

Mathematically, if we have a set of primitive concepts (like attributes \(z_1\) and objects \(z_2\)), we want to approximate a composite embedding \(\mathbf{u}_z\) using vectors \(\mathbf{v}\) in the tangent space:

\[
\mathbf{u}_z \;\approx\; \text{Exp}_\mu\!\left(\mathbf{v}_{z_1} + \mathbf{v}_{z_2}\right)
\]

Here, \(\mu\) is the intrinsic mean—the “center of mass” on the sphere for the dataset. It acts as the anchor point for our tangent space.
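In code, \(\mu\) can be estimated with the usual iterative (Karcher) averaging scheme, and a composite embedding is then rebuilt by adding primitive vectors in the tangent space. A rough sketch, reusing `log_map` and `exp_map` from above:

```python
import numpy as np

def intrinsic_mean(points, n_iters=20):
    """Karcher (Frechet) mean of unit vectors.
    points: (n, d) array of unit-norm embeddings.
    Repeatedly average in the tangent space and map the average back onto the sphere."""
    mu = points.mean(axis=0)
    mu /= np.linalg.norm(mu)                                   # initial guess
    for _ in range(n_iters):
        avg_tangent = np.mean([log_map(mu, p) for p in points], axis=0)
        mu = exp_map(mu, avg_tangent)                          # move mu toward the mean
    return mu

# Reconstructing a composite concept from its primitives, as in the equation above:
# u_red_car ≈ exp_map(mu, v_red + v_car)
```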

3. Handling Visual Noise

This is where the framework gets clever about image data. Because a single image of a “red car” is noisy (it has a background), we cannot rely on just one sample.

The authors treat each concept (e.g., “red car”) not as a single point, but as a distribution of many images. By averaging the tangent vectors of many different images representing the same concept, the unique “noise” of each image cancels out, leaving behind the true semantic direction.

The optimal direction for a primitive concept (like “red”) is calculated by averaging the tangent vectors of all pairs containing that concept:

\[
\mathbf{v}_{z_i} \;=\; \frac{1}{|\mathcal{Z}_{z_i}|} \sum_{z \in \mathcal{Z}_{z_i}} \text{Log}_\mu(\mathbf{u}_z) \;-\; \frac{1}{|\mathcal{Z}|} \sum_{z \in \mathcal{Z}} \text{Log}_\mu(\mathbf{u}_z)
\]

where \(\mathcal{Z}_{z_i}\) is the set of composite concepts containing the primitive \(z_i\), and \(\mathcal{Z}\) is the set of all composites.

This implies that the vector for “red” is the average of (“red car”, “red apple”, “red dress”, etc.) minus the global mean. The background noise of cars, apples, and dresses averages out, isolating the concept of “red.”
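This averaging is only a few lines of code. The sketch below assumes a hypothetical `embeddings_by_pair` dictionary mapping each (attribute, object) pair to a list of unit-norm image embeddings, and reuses `log_map` from the earlier sketch:

```python
import numpy as np

def primitive_vector(embeddings_by_pair, primitive, mu):
    """Tangent-space direction for one primitive (e.g. "red"): the mean tangent
    vector of all pairs containing it, minus the global mean tangent vector."""
    all_tangents = [log_map(mu, u)
                    for images in embeddings_by_pair.values() for u in images]
    global_mean = np.mean(all_tangents, axis=0)

    matching = [log_map(mu, u)
                for pair, images in embeddings_by_pair.items()
                for u in images if primitive in pair]
    return np.mean(matching, axis=0) - global_mean

# v_red = primitive_vector(embeddings_by_pair, "red", mu)
```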

Visualization: Seeing the Geometry

Does this actually work? If visual embeddings are compositional, they should form regular geometric shapes in the embedding space. For example, the relationship between “Red” and “Blue” should be consistent whether applied to “Cars” or “Birds.”

The authors visualized the tangent vectors of these embeddings. In a perfect scenario, the four composites formed by 2 attributes and 2 objects should form a parallelogram, and the six composites formed by 2 attributes and 3 objects should form a triangular prism.

Figure 3: Geometric Arrangement of Embeddings. Top row shows 2D projections of 2x2 concepts (parallelograms). Bottom row shows 3D projections of 2x3 concepts (prisms). Increasing ‘k’ (number of images averaged) makes the shapes clearer.

As shown in Figure 3 above, when we increase \(k\) (the number of images used to compute the mean), the noise disappears, and these beautiful, regular geometric structures emerge. This confirms that the VLM has indeed learned a structured, compositional representation of the visual world.
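To reproduce this kind of plot, one can project the averaged tangent vectors to 2D with PCA and inspect the resulting quadrilateral. A hypothetical sketch (the `tangent_of` lookup is assumed, not from the paper):

```python
import numpy as np
from sklearn.decomposition import PCA

def project_pairs_2d(tangent_of, pair_names):
    """Project the k-averaged tangent vector of each pair to 2D via PCA."""
    vectors = np.stack([tangent_of[name] for name in pair_names])
    return PCA(n_components=2).fit_transform(vectors)

# coords = project_pairs_2d(tangent_of,
#                           ["red heels", "red sandals", "blue heels", "blue sandals"])
# If the representation is compositional, the four 2D points should trace
# an (approximate) parallelogram.
```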

Experiment 1: Compositional Classification

The first major test of GDE is Compositional Zero-Shot Learning.

The Challenge: The model is trained on a set of seen pairs (e.g., “Old Shoe”, “New Shirt”). At test time, it must recognize an unseen combination, like “Old Shirt.”

The authors compared their GDE (curved geometry) approach against a baseline LDE (Linear Decomposable Embeddings - flat geometry). They tested on datasets like UT-Zappos (shoes) and MIT-States (objects with states).
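Concretely, zero-shot prediction for an unseen pair can be done by composing primitive vectors on the sphere and picking the most similar composition. This is our sketch, not the paper's code, reusing `exp_map` from earlier:

```python
import numpy as np

def classify(image_embedding, attributes, objects, v, mu):
    """Predict the (attribute, object) pair whose composed embedding is most
    similar to the image. v maps each primitive name to its GDE tangent vector."""
    best_pair, best_score = None, -np.inf
    for attr in attributes:
        for obj in objects:
            composed = exp_map(mu, v[attr] + v[obj])   # compose primitives on the sphere
            score = image_embedding @ composed          # cosine similarity (unit vectors)
            if score > best_score:
                best_pair, best_score = (attr, obj), score
    return best_pair

# classify(img_emb, ["old", "new"], ["shoe", "shirt"], v, mu) can return ("old", "shirt")
# even if that pair never appeared in the support data.
```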

The Results: GDE significantly outperformed the linear baseline, especially on image data.

  • On UT-Zappos, GDE achieved a 317% improvement over LDE in relative performance compared to the standard CLIP baseline.
  • This proves that modeling the curvature of the space is essential. Treating the sphere as a flat plane (Linear) results in significant errors when combining visual concepts.

Experiment 2: Group Robustness (Debiasing)

This is perhaps the most impactful application of the framework.

The Problem: VLMs often learn “spurious correlations.” For example, in the “Waterbirds” dataset, the model might learn that “Water background = Waterbird.” If you show it a waterbird on land, it often fails.

The GDE Solution: We can use GDE to disentangle the object (“bird”) from the context (“background”). By computing the primitive vector for “Waterbird” using GDE, we effectively “average out” the background information (water/land) from the training data. This leaves a pure “Waterbird” vector that is robust to background shifts.
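A sketch of how this could look in practice, with a hypothetical `support_sets` dictionary keyed by (class, background) and the helper maps from earlier:

```python
import numpy as np

def robust_class_embedding(support_sets, target_class, mu):
    """Average tangent vectors of a class across *all* backgrounds, so the
    spurious background signal cancels out, then map back to the sphere."""
    tangents = [log_map(mu, u)
                for (cls, _background), images in support_sets.items()
                for u in images if cls == target_class]
    return exp_map(mu, np.mean(tangents, axis=0))

# scores = {c: img_emb @ robust_class_embedding(support_sets, c, mu)
#           for c in ("waterbird", "landbird")}
# prediction = max(scores, key=scores.get)
```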

Figure 4: Group Robustness Results. The charts show that GDE maintains high accuracy even with very few support samples, significantly outperforming baselines in ‘Worst Group’ (WG) accuracy.

The results (Table 3 in the paper) were striking. GDE achieved state-of-the-art results in Worst Group Accuracy without requiring any model fine-tuning. It outperformed complex supervised methods like Deep Feature Reweighting, simply by understanding the geometry of the embedding space.

Experiment 3: Generative Alchemy

Finally, if GDE truly captures the essence of "attribute" and "object" vectors, we should be able to invert them to create images. The authors used Stable Diffusion (specifically the unCLIP variant) to generate images from their decomposed vectors.

They tested this by mathematically mixing concepts. For example, what happens if you add the vector for “Tiger” and “Horse”?
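For instance (a hypothetical snippet reusing the `v` dictionary, `mu`, and `exp_map` from the earlier sketches), a "Tiger + Horse" embedding would be built by adding the two object vectors in the tangent space and mapping back, then handing the result to an unCLIP-style decoder:

```python
# Add the two object directions in the tangent space, then return to the sphere.
hybrid_embedding = exp_map(mu, v["tiger"] + v["horse"])
# hybrid_embedding lives in CLIP space and can be decoded by an unCLIP-style
# image generator to render the blended concept.
```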

Figure 5: Image Generation results. The third row shows ‘Object-Object’ blends, creating hybrid animals like a Tiger-Zebra and a Bear-Raccoon.

The results are distinct photorealistic hybrids. This isn’t just style transfer; the model is generating a creature that semantically sits between the two concepts.

They also experimented with attribute scaling. In the tangent space, vectors have magnitude. By multiplying the attribute vector by a scalar \(\alpha\) before mapping it back to the manifold, they could control the intensity of the attribute.
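Sketching this (again with the hypothetical `v`, `mu`, and `exp_map` from earlier), the attribute direction is simply scaled by \(\alpha\) before composing with the object and mapping back:

```python
scaled_embeddings = [
    exp_map(mu, alpha * v["burnt"] + v["pizza"])   # stronger "burnt" as alpha grows
    for alpha in (0.5, 1.0, 1.5, 2.0)
]
# Decoding these with the same unCLIP-style generator should show the pizza
# getting progressively more burnt as alpha increases.
```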

Figure 8: Scaling Attributes. Top row: Increasing the ‘Burnt’ vector on ‘Pizza’. Bottom row: Increasing the ‘Faux Fur’ vector on ‘Boots’.

As \(\alpha\) increases (moving from left to right in Figure 8), the pizza gets progressively more burnt, and the boots get progressively furrier. This demonstrates that the “directions” found by GDE are semantically consistent and manipulatable.

Conclusion: The Shape of Meaning

The paper “Not Only Text” provides compelling evidence that vision-language models like CLIP don’t just memorize images; they organize them into sophisticated, human-like compositional structures.

However, accessing this structure isn’t straightforward. We cannot treat these complex embedding spaces as simple flat surfaces. By applying Riemannian geometry via the GDE framework, the authors successfully:

  1. Denoised visual representations.
  2. Decomposed images into reusable primitive parts.
  3. Improved performance in classification and fairness tasks.

This work bridges the gap between the symbolic compositionality of language and the continuous, noisy nature of vision. It suggests that as we build more powerful AI, understanding the geometry of how they think is just as important as the data we feed them.

For students and researchers, this is a valuable lesson: sometimes the key to better performance isn’t a bigger model, but better math.