Introduction
Imagine showing a picture of a horse riding on a person (a strange image, granted) to a state-of-the-art AI model. Then, you ask the model to pick the correct caption between two options: “a person riding a horse” and “a horse riding a person.”
Ideally, this should be easy. The nouns are the same, but the relationship is flipped. However, most modern Vision-Language Models (VLMs), including the famous CLIP, struggle significantly with this. They act like “Bag-of-Words” models—they see “horse,” they see “person,” and they declare a match, completely ignoring the syntax or the relationship described by the verb “riding.”
This limitation is known as a deficiency in compositional reasoning. While VLMs have revolutionized how computers understand images and text, their ability to understand attributes (is the shirt red?) and relations (is the cup on the table?) remains notably inadequate.
Today, we are diving deep into a paper titled “Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding”. The authors identify exactly why this happens and propose a novel framework called CAE (Composition Attribution Enhancement) to fix it.
Take a look at the figure below.

On the left, we see the problem in action. CLIP fails to distinguish who is underneath whom. On the right, we see the cause. The researchers analyzed what the model was paying attention to (attribution scores). The result? The model assigns significantly higher scores to objects (nouns) than to relations or attributes. Essentially, the model is lazy; it learns shortcuts by just spotting objects and ignoring the fine print.
In this post, we will explore how the CAE framework forces the model to stop being lazy and start paying attention to the words that actually matter.
Background: The Shortcut Problem in Contrastive Learning
To understand the solution, we first need to understand the environment in which these models live. Models like CLIP are trained using Contrastive Learning.
In a nutshell, CLIP takes an image and a text caption. It encodes the image into a mathematical vector and the text into another vector. The goal of training is to maximize the similarity (usually cosine similarity) between the correct image-text pair while pushing away incorrect pairs.
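As a rough illustration (not the authors' code), here is a minimal PyTorch sketch of that similarity computation; the feature dimension, batch size, and temperature are placeholder values.

```python
import torch
import torch.nn.functional as F

# Toy features standing in for the encoder outputs f_i(I) and f_t(T).
# In real CLIP these come from a vision transformer and a text transformer.
image_features = torch.randn(8, 512)   # batch of 8 images, 512-dim embeddings
text_features = torch.randn(8, 512)    # the 8 matching captions
temperature = 0.07                     # typical CLIP-style temperature (placeholder)

# L2-normalize so the dot product equals cosine similarity.
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)

# Pairwise similarity matrix: entry (i, j) compares image i with caption j.
logits = image_features @ text_features.t() / temperature
print(logits.shape)  # torch.Size([8, 8]) -- the diagonal holds the correct pairs
```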
The “Cheat Code”
The problem lies in the training data. In massive datasets scraped from the internet, valid shortcuts exist. If you see a picture of a dog in a park, and the caption is “A dog running on the grass,” the model quickly realizes it only needs to recognize the visual features of a “dog” and “grass” to match the text. It rarely encounters a “hard negative” scenario where it has to distinguish “A dog on the grass” from “Grass on a dog.”
Consequently, the model learns to assign high importance (attribution) to object nouns because they are the most discriminative features for the general pre-training task. Words that define relationships (like “on”, “under”, “riding”) or attributes (like “red”, “large”) get sidelined.
This is the Composition Attribution Deficiency.
Existing Solutions vs. CAE
Previous attempts to fix this have largely focused on Data Augmentation, specifically “Hard Negative Mining.” This involves artificially creating tricky captions (e.g., swapping words to create “Grass running on a dog”) and forcing the model to learn the difference.
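To make the trick concrete, here is a toy sketch of how such a swapped caption could be generated. It is a simplified illustration of the idea, not the actual procedure used by NegCLIP or the paper; the noun list is supplied by hand.

```python
def swap_first_two_nouns(caption, nouns):
    """Toy hard-negative generator: swap the first two nouns in a caption."""
    tokens = caption.split()
    noun_positions = [i for i, tok in enumerate(tokens)
                      if tok.lower().strip(".,") in nouns]
    if len(noun_positions) >= 2:
        i, j = noun_positions[0], noun_positions[1]
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

# Hypothetical noun list; a real pipeline would use a POS tagger or parser.
print(swap_first_two_nouns("A dog running on the grass", {"dog", "grass"}))
# -> "A grass running on the dog"
```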
While effective, this is a data-centric approach. The authors of this paper propose a Model/Loss-centric approach. Instead of just throwing harder data at the model, why not change how the model learns by explicitly telling it to pay more attention to relation and attribute words?
Core Method: Composition Attribution Enhancement (CAE)
The core idea of CAE is to “trace” the attribution—identifying which words the model is focusing on—and then penalize the model if it focuses too much on objects at the expense of relations and attributes.
The Architecture Overview
Let’s look at the high-level workflow of the proposed method.

As shown in Figure 2, the pipeline has two main branches:
- The Standard Path: The image and text go through their respective encoders to calculate the standard Contrastive Loss (\(\mathcal{L}_{ITC}\)).
- The Attribution Path: The framework analyzes the text encoder to calculate Attribution Scores for every word. It then applies a regularization loss (\(\mathcal{L}_{Attr}\)) that forces the score of composition words (relations/attributes) to align with object words.
Step 1: Identifying the Words
Before we can balance the attention, we need to know which words are which. The authors use an off-the-shelf text scene graph parser.
- Object Tokens (\(A_{obj}\)): Nouns like “person”, “horse”, “car”.
- Composition Tokens (\(A_{comp}\)): Relations and attributes like “rides”, “underneath”, “red”.
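The paper relies on an off-the-shelf scene graph parser for this split. As a rough stand-in, a simple part-of-speech split with spaCy conveys the idea (nouns become object tokens; adjectives, verbs, and prepositions become composition tokens):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model (assumed installed)

def split_tokens(caption):
    """Rough stand-in for the scene graph parser used in the paper:
    nouns -> object tokens; adjectives, verbs, adpositions -> composition tokens."""
    doc = nlp(caption)
    obj_tokens = [t.text for t in doc if t.pos_ in {"NOUN", "PROPN"}]
    comp_tokens = [t.text for t in doc if t.pos_ in {"ADJ", "VERB", "ADP"}]
    return obj_tokens, comp_tokens

print(split_tokens("a horse riding a person"))
# expected: (['horse', 'person'], ['riding'])
```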
Step 2: The Mathematics of the Loss
First, let’s recall the standard similarity score used in CLIP. It calculates the dot product between the image feature \(f_i(I)\) and text feature \(f_t(T)\), normalized and scaled by a temperature parameter \(\tau\).

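Spelled out, a standard form of this score consistent with the description above (the paper's notation may differ slightly) is:

\[
S(I, T) \;=\; \frac{f_i(I)^\top f_t(T)}{\tau \,\lVert f_i(I)\rVert\, \lVert f_t(T)\rVert}
\]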
The standard training objective involves the Image-Text Contrastive (ITC) loss, which is an average of image-to-text and text-to-image losses.

These component losses are standard Softmax-based cross-entropy losses, essentially asking the model to classify the correct pair out of a batch.


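In code, this symmetric cross-entropy objective looks roughly like the following sketch of a generic CLIP-style ITC loss (not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def itc_loss(image_features, text_features, temperature=0.07):
    """Generic CLIP-style image-text contrastive loss (sketch)."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature  # [B, B]

    # For each image, the matching caption sits at the same batch index (and vice versa).
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image-to-text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random features:
print(itc_loss(torch.randn(8, 512), torch.randn(8, 512)))
```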
The New Contribution: Attribution Loss
Here is where the magic happens. The authors introduce a new loss term, \(\mathcal{L}_{Attr}\).
Let \(A_{obj}\) be the average attribution score of all object tokens in a batch, and \(A_{comp}\) be the average score of all composition tokens (relations/attributes). The goal is to minimize the gap between them.

In this equation, \(\epsilon\) is a margin parameter (usually set to 0). This loss function is active only when \(A_{obj} > A_{comp}\). It penalizes the model whenever it pays more attention to objects than to compositional details.
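A hinge-style formulation consistent with this description (the paper's exact notation may differ) is:

\[
\mathcal{L}_{Attr} \;=\; \max\bigl(A_{obj} - A_{comp} - \epsilon,\; 0\bigr)
\]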
The final objective function combines the standard contrastive loss with this new attribution loss, balanced by a hyperparameter \(\lambda\).

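Putting the pieces together, a training step could combine the two terms roughly as in the sketch below; `lambda_attr` and `margin` are placeholder names and values, and the attribution scores are assumed to be computed already.

```python
import torch
import torch.nn.functional as F

def total_loss(itc, attr_scores_obj, attr_scores_comp, lambda_attr=0.1, margin=0.0):
    """Sketch of the combined objective: standard ITC loss plus the
    attribution regularizer that penalizes A_obj exceeding A_comp."""
    a_obj = attr_scores_obj.mean()    # average attribution over object tokens
    a_comp = attr_scores_comp.mean()  # average attribution over composition tokens
    attr = F.relu(a_obj - a_comp - margin)  # hinge: active only when A_obj > A_comp + margin
    return itc + lambda_attr * attr

# Toy usage with made-up numbers:
itc = torch.tensor(2.3)
print(total_loss(itc, torch.tensor([0.6, 0.5]), torch.tensor([0.2, 0.3])))
```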
Step 3: Four Ways to Calculate Attribution
The framework is generic, meaning it doesn’t care how you calculate importance (attribution), as long as you can do it effectively. The authors propose four different “instances” or methods to derive these scores.
1. Attention-Based Attribution
This is the simplest method. Transformer models (like the text encoder in CLIP) have built-in self-attention mechanisms.
- We look at the attention map of the [CLS] token (the token that represents the whole sentence).
- We see how much attention the [CLS] token pays to every other word.
- We average these weights across layers and heads to get the score for each word.
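If the text encoder's attention maps are collected into a single tensor, this reduces to averaging the row of the sentence-level token. A minimal sketch, assuming attentions of shape `[layers, heads, seq_len, seq_len]`:

```python
import torch

def attention_attribution(attn_maps, cls_index):
    """Attention-based attribution sketch.
    attn_maps: [num_layers, num_heads, seq_len, seq_len] attention weights
    cls_index: position of the sentence-level ([CLS]-style) token
    Returns one score per token, shape [seq_len]."""
    # Attention paid by the sentence-level token to every token,
    # averaged over layers and heads.
    cls_rows = attn_maps[:, :, cls_index, :]   # [layers, heads, seq_len]
    return cls_rows.mean(dim=(0, 1))           # [seq_len]

# Toy example: 12 layers, 8 heads, 16 tokens.
maps = torch.rand(12, 8, 16, 16).softmax(dim=-1)
print(attention_attribution(maps, cls_index=15).shape)  # torch.Size([16])
```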
2. GradCAM-Based Attribution
Standard attention can sometimes be noisy. GradCAM is a more sophisticated interpretability technique that combines attention maps with gradients. It asks: “How much does the attention to this word specifically contribute to the final similarity score?”
It computes an “explainability map” \(\bar{E_i}\) for layer \(i\) by combining gradients (\(\nabla A\)) and attention maps (\(A\)):

These maps are then aggregated across layers using a propagation rule:

This method usually provides the most semantically accurate map of what the model is actually using to make a decision.
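The paper's exact propagation rule is given above; the per-layer grad-times-attention map can be sketched as follows (a simplified, GradCAM-style illustration rather than the authors' implementation):

```python
import torch

def gradcam_layer_map(attn, attn_grad):
    """Simplified GradCAM-style relevance for one layer.
    attn:      [num_heads, seq_len, seq_len] attention weights
    attn_grad: gradient of the similarity score S(I, T) w.r.t. those weights
    Returns a [seq_len, seq_len] relevance map."""
    # Element-wise product of gradients and attention, keep positive
    # contributions only, then average over heads.
    relevance = (attn_grad * attn).clamp(min=0)
    return relevance.mean(dim=0)

# Toy example: 8 heads, 16 tokens (in practice attn_grad comes from autograd).
attn = torch.rand(8, 16, 16, requires_grad=True)
fake_score = attn.sum()            # stand-in for the similarity score
fake_score.backward()
print(gradcam_layer_map(attn.detach(), attn.grad).shape)  # torch.Size([16, 16])
```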
3. Perturbation-Based Attribution
This method is intuitive: If I delete this word, does the model get confused? The authors mask out specific tokens (replacing them with other concepts) and measure the drop in the similarity score \(S(I, T)\). If removing a word causes a massive drop in similarity, that word has a high attribution score.

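A minimal sketch of the idea, assuming a `similarity_fn(image, token_ids)` helper that returns \(S(I, T)\) and a mask token id to substitute in (the paper's exact perturbation scheme may differ):

```python
import torch

def perturbation_attribution(similarity_fn, image, token_ids, mask_id):
    """Perturbation-based attribution sketch: the attribution of token j is the
    drop in similarity when that token is replaced by a mask token.
    similarity_fn(image, token_ids) -> scalar S(I, T) (assumed helper)."""
    base = similarity_fn(image, token_ids)
    scores = []
    for j in range(len(token_ids)):
        perturbed = token_ids.clone()
        perturbed[j] = mask_id                        # knock out one token
        scores.append(base - similarity_fn(image, perturbed))
    return torch.stack(scores)                        # larger drop = more important

# Toy usage with a fake similarity function:
fake_sim = lambda img, ids: ids.float().mean()
print(perturbation_attribution(fake_sim, None, torch.tensor([5, 9, 2, 7]), mask_id=0))
```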
4. Gradient-Based Attribution
This method looks at the raw sensitivity of the output with respect to the input word embeddings. It calculates the gradient of the similarity score with respect to the input token \(x_{ij}\). A high gradient means a small change in the input word would drastically change the output, implying high importance.

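A short sketch, assuming the token embeddings are the differentiable inputs and using the gradient's L2 norm as the per-token score (the paper may aggregate gradients differently):

```python
import torch

# Toy setup: 16 token embeddings of dimension 512 standing in for the text
# encoder's input, and a stand-in scalar for the similarity score S(I, T).
token_embeddings = torch.randn(16, 512, requires_grad=True)
score = token_embeddings.sum()  # in practice: the image-text similarity

# Gradient of the similarity w.r.t. each input token embedding x_ij.
(grads,) = torch.autograd.grad(score, token_embeddings)

# One score per token: the gradient magnitude (L2 norm over the embedding dim).
attribution = grads.norm(dim=-1)
print(attribution.shape)  # torch.Size([16])
```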
Does it actually change attribution?
Before looking at accuracy, let’s look at whether the method works as intended.

In Figure 7, we see the distribution of scores. The goal is to have the “Relation” and “Attribute” bars be comparable to the “Object” bars. The plots show that across different calculation methods, the model learns to distribute its attention more evenly after applying CAE.
Experiments & Results
The researchers fine-tuned a standard ViT-B/32 CLIP model on the MSCOCO dataset using their new loss function. They tested it on seven different benchmarks designed to test compositional understanding, including ARO, Winoground, and VL-Checklist.
Main Performance
The results were impressive.

Table 1 compares the proposed method (CLIP-CAE) against a standard pre-trained CLIP and a version of CLIP fine-tuned without the attribution loss (CLIP-FT).
- Consistent Gains: CLIP-CAE outperforms the baseline (CLIP-FT) across almost all benchmarks.
- Winoground: On the notoriously difficult Winoground benchmark (where random chance is 16.7% for the group score), the Attention-Based CAE improved the group score from 5.5% (Baseline) to 8.0%.
- Robustness: The improvements are seen regardless of which attribution method (Attention, GradCAM, etc.) is used, proving the framework is robust.
Synergy with Hard Negative Mining
A major question is: Does this replace Hard Negative Mining, or can they work together? The authors combined CAE with “NegCLIP” (a method that uses hard negatives).

As Table 2 shows, the two methods are orthogonal. When you combine them (bold numbers), you get the best of both worlds. For example, on ARO-Relation, the accuracy jumps to over 80%, significantly higher than the original CLIP’s 58.7%.
Does this hurt standard retrieval?
Often in AI, optimizing for a specific capability (like compositionality) degrades general performance (like finding the right image for a generic caption).

Table 3 shows that, generally, no. In fact, Text-to-Image (T2I) retrieval improved on MSCOCO. There was a slight dip in Image-to-Text retrieval on Flickr30K, likely because the regularization is applied only to the text encoder, but the overall general capabilities remain intact or improved.
Qualitative Analysis: Seeing is Believing
Numbers are great, but visualizations tell the story better. The authors used GradCAM to visualize the “heatmaps” on the images—showing where the model is looking when it processes the text.

In Figure 6, the prompt is “the cat is touching the elephant”.
- Original CLIP (Middle): The heatmap (red area) focuses entirely on the elephant’s head and body. It sees “elephant” and stops there. It ignores the interaction.
- CLIP-CAE (Right): The heatmap shifts significantly. It highlights the paw of the cat and the specific part of the elephant being touched. It is actually “seeing” the verb “touching.”
Furthermore, looking at the text attribution scores at the bottom of the figure, CLIP-CAE assigns a score of 0.31 to the word “touching,” whereas the original CLIP only gave it 0.16.
Better Text Embeddings
Finally, the authors checked if the text embeddings themselves were becoming semantically richer.

Table 4 shows that the text encoder of CLIP-CAE is better at Semantic Textual Similarity (STS) tasks. This means the model isn’t just better at matching images; it’s becoming a better language model, capable of understanding the nuances between similar sentences.
This is further supported by the similarity distribution analysis:

Figure 3 shows that the embeddings generated by CLIP-CAE (green box) have a higher cosine similarity to the embeddings of the specific relations/attributes contained within them compared to the baselines. This proves the text embeddings are encoding more of the compositional structure.
Conclusion
The “Bag-of-Words” behavior of Vision-Language Models has been a persistent headache for researchers. While the models are great at object recognition, they often fail to understand the scene—the complex web of interactions and attributes that make up the visual world.
The paper “Interpretable Composition Attribution Enhancement” offers a compelling solution that goes beyond simply cleaning up data. By identifying the root cause—Composition Attribution Deficiency—and directly optimizing the model’s loss function to fix it, the authors have provided a generic, plug-and-play framework.
Key takeaways:
- Interpretability as a Tool: Interpretability methods (like GradCAM) aren’t just for explaining what a model did after the fact. They can be integrated into the training loop to guide the model during learning.
- Focus Matters: Simply forcing the model to pay attention to “non-object” words significantly improves reasoning capabilities.
- Orthogonality: This method works well alongside existing data augmentation techniques, pushing the state-of-the-art even further.
As VLMs continue to evolve into Multimodal Large Language Models (MLLMs), ensuring they understand the syntax of reality—not just the nouns—will be crucial. CAE represents a significant step in that direction.