Introduction

Imagine you are teaching a child what a “red apple” is. You show them a picture of a red apple. Now, you want them to understand a “green chair.” You show them a green chair. Finally, you present them with a “green apple”—an object they haven’t explicitly studied before, but which is composed of concepts they already know (“green” and “apple”). If the child recognizes it, they have demonstrated Compositional Generalization.

This ability—to understand unseen combinations by recombining known primitive concepts—is fundamental to human intelligence. For Artificial Intelligence, particularly Large Vision-Language Models (LVLMs) like OpenFlamingo or GPT-4V, this remains a significant hurdle. While these models are impressive, they often struggle to connect the dots when known visual and linguistic concepts appear in novel configurations.

In this deep dive, we will explore a fascinating research paper titled “In-Context Compositional Generalization for Large Vision-Language Models.” The researchers propose a novel way to improve this capability in LVLMs without retraining the model: they work within In-Context Learning (ICL) and focus on which examples (demonstrations) to include in the model’s prompt.

We will uncover why standard methods of picking examples fail due to “visual redundancy,” and we will break down the authors’ sophisticated solution that mathematically balances content coverage, structural complexity, and diversity.

The Challenge: The Problem with Redundancy

To understand the solution, we must first understand the specific problem with current multimodal models.

In-Context Learning (ICL) Refresher

In-Context Learning is a paradigm where, instead of updating a model’s weights (fine-tuning), you simply provide a few examples in the input prompt. For a Visual Question Answering (VQA) task, a prompt might look like this:

Image 1: [Picture of a cat] | Q: What animal is this? | A: Cat
Image 2: [Picture of a dog] | Q: What animal is this? | A: Dog
Test Image: [Picture of a bird] | Q: What animal is this? | A: [Model predicts here]

The standard way to pick Image 1 and Image 2 is by finding examples that are “similar” to the Test Image. Usually, this is done using a metric like CLIP similarity, which compares the overall feature vectors of images and text.
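For reference, here is a minimal sketch of this similarity baseline, assuming the CLIP-style embeddings have already been computed and stored as NumPy arrays. The function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def cosine_similarity(query, candidates):
    # Cosine similarity between one query vector and each row of `candidates`.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

def select_by_similarity(test_embedding, candidate_embeddings, k=2):
    """Similarity baseline: keep the k candidates whose embeddings are
    closest to the test sample's embedding."""
    scores = cosine_similarity(test_embedding, candidate_embeddings)
    return np.argsort(-scores)[:k]  # indices of the top-k demonstrations

# Toy usage with random vectors standing in for CLIP features.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(100, 512))   # 100 candidate demonstrations
test = rng.normal(size=512)                # the test image/question
print(select_by_similarity(test, candidates, k=2))
```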

The Asymmetry Trap

However, this paper argues that simple similarity is dangerous in multimodal settings. There is an inherent asymmetry between visual and linguistic modalities.

An image carries a massive amount of information—background scenery, lighting, multiple objects, colors, and textures. A text question, conversely, is usually specific (“Is the dog white?”). When we match examples based on general image similarity, we often retrieve images that match the irrelevant background noise (redundant information) rather than the core concept required to answer the question.

Figure 1: Illustration of the problems caused by redundant information in ICCG for LVLMs. (a) Multimodal similarity is dominated by redundant information. (b) More redundant information in the in-context demonstrations makes the test sample harder to answer.

Figure 1 above illustrates this phenomenon perfectly:

  • Case (a): The model needs to answer “Is the image indoors?” for an outdoor street scene. The top demonstration has a high similarity score (0.70) because both images contain a dog. However, the dog is redundant to the question of location. The demonstration shows an indoor scene, confusing the model into answering “Yes.” The bottom example has a lower similarity score (0.57) but is less confusing, leading to the correct answer.
  • Case (b): This shows how redundant info acts as noise. If the demonstration images contain too many extra details (like the background environment) that clash with the test case, the model fails to map the correct attribute (color) to the object.

The core thesis of the paper is this: To achieve In-Context Compositional Generalization (ICCG), we must stop selecting demonstrations based on raw similarity. Instead, we need a selection strategy that maximizes the coverage of relevant concepts while minimizing redundant, misleading visual noise.

The Method: A Dual-Factor Approach

The researchers propose a new demonstration selection method that considers two distinct dimensions of the data: Content and Structure.

Figure 2: Overview of the proposed demonstration selection method. A matching score between demonstrations and the test case, based on both content and structure, is used to greedily select demonstrations.

As shown in Figure 2, the pipeline involves extracting “Primitives” (basic concepts like objects and attributes) and “Structures” (how those primitives relate, like “grazing in” or “on top of”).

For every potential demonstration example (\(X'\)) and the specific test case (\(X\)), the method calculates a comprehensive Matching Score (\(M\)). The goal is to iteratively select demonstrations that maximize this score.
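In code, this is a plain greedy loop: at each step, score every remaining candidate against the test case and the demonstrations chosen so far, and keep the best one. A minimal sketch follows; `matching_score` stands in for the paper’s \(M\) and is a placeholder signature, not the authors’ implementation:

```python
def greedy_select(test_case, candidates, matching_score, n_shots=4):
    """Greedily pick n_shots demonstrations that maximize the matching
    score M; the score may depend on what has already been selected."""
    selected = []
    remaining = list(candidates)
    for _ in range(n_shots):
        if not remaining:
            break
        # Re-score every remaining candidate given the current selection.
        best = max(remaining,
                   key=lambda demo: matching_score(demo, test_case, selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```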

The master equation is defined as:

Equation for the total matching score

Here, \(C\) represents the Content Matching score, and \(S\) represents the Structural Matching score. The weight \(w_c\) balances the two, though the authors note that content coverage is generally the priority.
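Putting this description into symbols, the total score plausibly takes the shape of a weighted sum. The exact placement and normalization of \(w_c\) below are our assumption rather than a quote from the paper, and both terms also depend on the demonstrations already selected (through their diversity components):

\[
M(X', X) \;\approx\; w_c \cdot C(X', X) \;+\; (1 - w_c) \cdot S(X', X)
\]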

Let’s break down these two components mathematically and conceptually.

1. Content Matching: Coverage, Diversity, and Redundancy

Content matching isn’t just about finding the same objects; it’s about finding the right objects while avoiding clutter. The formula for Content Matching is:

Equation for Content Matching Score

This looks complex, but it is composed of three intuitive parts (a small code sketch follows the list):

  1. Coverage (\(P(X) \cap P(X')\)): Does the demonstration contain the primitives (words/objects) present in the test case?
  2. Diversity (\(- P(E)\)): We subtract the primitives present in the examples we have already selected (\(E\)). This ensures we aren’t just picking the same concept over and over. We want to cover new aspects of the test case with each selected demonstration.
  3. Redundancy Penalty (\(- w_r \cdot R(X')\)): We penalize demonstrations that contain too much extra information.
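To make the three terms concrete, here is a minimal set-based sketch of how they could be computed, assuming primitives have already been extracted as Python sets. How the paper actually combines and normalizes these terms is not reproduced here; the default weight of 0.05 simply echoes the \(w_r\) value discussed in the ablations below.

```python
def content_matching(demo_prims, test_prims, selected_prims, redundancy, w_r=0.05):
    """Content score for one candidate demonstration X'.

    demo_prims     -- set of primitives found in the candidate (image + text)
    test_prims     -- set of primitives found in the test case X
    selected_prims -- union of primitives across already-selected demos E
    redundancy     -- R(X'): mismatch between the demo's visual and text primitives
    """
    coverage = len(test_prims & demo_prims)                     # shared concepts
    diversity = -len(test_prims & demo_prims & selected_prims)  # already covered
    penalty = -w_r * redundancy                                 # discourage clutter
    return coverage + diversity + penalty
```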

Evaluating Primitives Across Modalities

The equation for content coverage handles both linguistic (\(l\)) and visual (\(v\)) modalities:

Equation for Content Coverage

To make this work, the system extracts text primitives using a parser and visual primitives using a captioning model (like BLIP-2). It then checks for intersections. For example, if the test case is about a “zebra,” the selection algorithm gets points for finding a demonstration that also contains “zebra,” regardless of whether the concept was found in the image caption or the question text.
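As a rough sketch of the text-side extraction (applied both to the question and to a caption generated for the image), one could lemmatize and keep content words. spaCy is used here purely as an illustrative stand-in for whatever parser the authors actually use:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline, illustrative choice

def extract_primitives(text):
    """Return content-word lemmas as a rough stand-in for primitives."""
    doc = nlp(text.lower())
    return {tok.lemma_ for tok in doc
            if tok.pos_ in {"NOUN", "ADJ", "VERB"} and not tok.is_stop}

question_prims = extract_primitives("Is the zebra eating grass?")
caption_prims = extract_primitives("A zebra grazing in a grassy field.")
# A primitive found in either modality counts toward coverage.
print(question_prims | caption_prims)
```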

The Diversity Mechanism

Similarly, diversity ensures we are filling in the gaps. If we have already selected a demonstration that explains the concept of “grass,” the next demonstration should ideally focus on “zebra” or “eating,” not “grass” again.

Equation for Content Diversity

The Redundancy Penalty

This is the most distinctive part of their contribution. How do you measure “redundancy” mathematically? The authors define it as the symmetric difference between the visual and linguistic content of the demonstration.

Equation for Content Redundancy

What this means: Ideally, the image and the text should be perfectly aligned. The image shows a cat, the text says “cat.”

  • If the image shows “cat, tree, sun, car” but the text only mentions “cat,” the intersection is small relative to the union, so the symmetric difference is large. That difference represents visual noise: elements in the image that are irrelevant to what the text is actually about.
  • By penalizing this score, the algorithm prefers “clean” demonstrations where the image succinctly depicts exactly what is being discussed in the text, minimizing the chance of the model learning spurious correlations (a small worked example follows).
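For the cat example above, a set symmetric difference captures this penalty directly. This is a hedged operationalization of \(R(X')\), not the paper’s exact formula:

```python
visual_prims = {"cat", "tree", "sun", "car"}  # concepts the image (caption) shows
text_prims = {"cat"}                          # concepts the question/answer mentions

# Symmetric difference: concepts present in one modality but not the other.
redundancy = len(visual_prims ^ text_prims)   # 3 -> {"tree", "sun", "car"}
print(redundancy)
```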

2. Structural Matching: Syntax and Complexity

While content handles what is in the scene, structure handles how things are arranged. This is crucial for compositional generalization (e.g., distinguishing “man on horse” from “horse on man”).

The structural matching score uses a similar logic of Coverage, Diversity, and a penalty for Complexity:

Equation for Structural Matching Score
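Mirroring the content score, a plausible (assumed, not quoted) form is the following, where \(T(\cdot)\) denotes the set of sub-trees extracted for a sample, \(E\) the demonstrations selected so far, \(\mathrm{Comp}(X')\) the structural complexity, and \(w_{comp}\) an assumed weight on the complexity penalty:

\[
S(X', X) \;\approx\; \big|\,T(X) \cap T(X')\,\big| \;-\; \big|\,T(X) \cap T(X') \cap T(E)\,\big| \;-\; w_{comp} \cdot \mathrm{Comp}(X')
\]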

Extracting Structure

The system builds constituent trees (parse trees) for the text and the image captions. It looks for sub-structures (sub-trees) with a depth of 3 or less. These represent grammatical relationships like “noun phrase” or “prepositional phrase.”
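To make “sub-trees with a depth of 3 or less” concrete, here is a small illustration using NLTK’s Tree on a hand-written parse. The paper’s parser and its exact depth convention may differ; NLTK’s height() counts a pre-terminal like (NN cat) as height 2:

```python
from nltk.tree import Tree

# Hand-written constituent tree for "The cat is on the mat".
tree = Tree.fromstring(
    "(S (NP (DT The) (NN cat)) (VP (VBZ is) (PP (IN on) (NP (DT the) (NN mat)))))"
)

# Shallow sub-trees (height <= 3 stands in here for "depth of 3 or less").
for subtree in tree.subtrees(lambda t: t.height() <= 3):
    print(subtree)

# The full tree's height serves as a simple structural-complexity measure.
print("complexity (tree height):", tree.height())
```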

Structural Coverage checks if the demonstration shares the same grammatical structures as the test case:

Equation for Structural Coverage

Structural Diversity ensures we expose the model to different sentence structures or visual compositions, rather than repeating the same syntax:

Equation for Structural Diversity

The Complexity Penalty

Finally, the method penalizes Structural Complexity. The hypothesis is that simpler demonstrations are better for teaching. A complex sentence with multiple nested clauses or a visually chaotic image requires complex reasoning that might distract the model.

Complexity is approximated by the depth of the constituent tree:

Equation for Structural Complexity

By minimizing the tree depth, the algorithm favors simple, declarative examples (e.g., “The cat is on the mat”) over complex ones (e.g., “The cat, which is black, is sleeping on the mat that lies on the floor”).

Experimental Setup

To prove that this selection method works, the authors couldn’t just use standard datasets. Standard datasets often have training and testing splits that share similar compositions, which allows models to “cheat” by memorizing patterns rather than generalizing.

The GQA-ICCG Dataset

The authors constructed a new benchmark called GQA-ICCG, derived from the GQA visual reasoning dataset.

  1. Filtering: They removed test samples that contained compositions of primitives already seen in the training set. This guarantees that the test cases represent novel compositions.
  2. Candidate Set: For every test case, they curated a pool of potential demonstrations that contained the relevant primitives.
  3. Scale: The dataset includes 10,000 test cases and over 48,000 candidate demonstrations.

They also tested on the standard VQA v2 dataset to ensure their method didn’t hurt performance on standard (non-compositional) tasks.

Baselines

They compared their method against several strategies:

  • Random: Picking random images.
  • Similarity (CLIP): The standard “nearest neighbor” approach.
  • SDC / Cover-LS: Advanced text-based selection methods adapted for this task.

They tested these on four different LVLMs: OpenFlamingo (3B, 4B, 9B), Otter, FROMAGe, and IDEFICS.

Results and Analysis

The results were compelling. The proposed method consistently outperformed the baselines across different models and “shot” counts (the number of examples provided).

1. Performance on GQA-ICCG

Table 1: Accuracy of state-of-the-art methods on GQA-ICCG

Looking at the main results (Table 1):

  • Ours (the proposed method) achieves the highest accuracy across almost all configurations.
  • For OpenFlamingo-4B (8-shot), the accuracy jumps from roughly 46% (SDC baseline) to 51.65%.
  • This confirms that carefully selecting demonstrations based on content purity and structure is far superior to standard similarity searches.

2. Generalization to Standard VQA

Does this specialized “compositional” selection hurt performance on normal tasks?

Table 2: Accuracy on VQA v2

According to Table 2, the answer is no. On the VQA v2 dataset, the method (“Ours”) still outperforms Random and Similarity-based baselines. This suggests that reducing redundancy and managing structural complexity is a universally good strategy for In-Context Learning, not just for compositional tasks.

3. Why does it work? (Ablation Studies)

The authors performed rigorous ablation studies—removing parts of their equation to see what matters most.

Impact of Content Factors:

Table 4: Accuracy under different content coverage settings

Table 4 reveals that Coverage is the most critical factor. If the demonstration doesn’t contain the relevant objects (primitives), performance drops significantly. However, Diversity (ensuring varied examples) adds a noticeable boost on top of coverage.

Impact of Redundancy:

Figure 3: Accuracy of OF-4B under different content redundancy settings

Figure 3 validates the redundancy hypothesis.

  • The blue bar (\(w_r = -0.05\)) represents a setting where redundancy is encouraged (high redundancy). Accuracy drops to 44.77%.
  • The purple bar (\(w_r = 0.05\)) represents the proposed method where redundancy is penalized. Accuracy peaks at 45.78%.
  • This empirical evidence supports the claim that cleaner, less cluttered images make for better teaching examples.

Impact of Structural Factors:

Figure 4: Accuracy under different structural complexity settings

Similarly, Figure 4 shows that penalizing structural complexity (preferring simple trees) leads to higher accuracy compared to ignoring it or encouraging complexity.

4. Qualitative Comparison

Ideally, we want to see that the model is actually “thinking” better. Let’s look at some visual examples.

Figure 5: Qualitative comparisons between SDC and our method on GQA-ICCG

In Figure 5, look at the first row (the stone statue).

  • The Test: “Is the stone statue at the top or bottom of the clock?”
  • SDC (Baseline): Selects demonstrations that are textually relevant but visually irrelevant or confusing. The model predicts “bottom” (Incorrect).
  • Ours: Selects demonstrations that show spatial relationships clearly with less visual clutter. The model predicts “top” (Correct).

Another example set using the Otter model:

Figure 6: Qualitative comparisons between SDC and our method on GQA-ICCG (Otter model)

In Figure 6, row 2 (The bench):

  • Question: “What is the bench made of?”
  • SDC: The demonstrations picked by the baseline fail to guide the model, leading to a prediction of “wood” (incorrect).
  • Ours: The method selects examples that clearly delineate materials in similar contexts. The model correctly identifies “metal.”

Conclusion and Implications

The paper “In-Context Compositional Generalization for Large Vision-Language Models” provides a significant step forward in our understanding of how multimodal models learn from examples.

The key takeaway is that more data is not always better; cleaner data is. By mathematically defining “clean” data as examples that have high content coverage but low visual redundancy and structural complexity, the researchers were able to unlock better reasoning capabilities in existing models without any fine-tuning.

Key Implications:

  1. The Redundancy Gap: This work highlights a fundamental issue in multimodal AI—the information imbalance between rich images and sparse text. Solving this is key to better vision-language understanding.
  2. Training-Free Improvement: The method is “plug-and-play.” It can be applied to any existing LVLM (like GPT-4V or Gemini) simply by changing how the prompt is constructed.
  3. Compositionality: This brings us closer to AI that learns like humans—understanding the world as a set of building blocks that can be infinitely recombined, rather than just memorizing static scenes.

As LVLMs continue to grow in size and capability, prompt engineering strategies like this—grounded in rigorous analysis of content and structure—will likely become standard practice for deploying robust AI systems.