Introduction
Imagine you have taught a child what a “red ball” looks like and what a “blue cube” looks like. If you then show them a “red cube” or a “blue ball,” they will likely identify it immediately. This ability to understand new combinations of familiar concepts is called systematicity, or compositional generalization. It is a cornerstone of human intelligence. We don’t need to see every possible combination of color and shape in the universe to understand how they fit together.
However, artificial neural networks often struggle with this. If you train a Visual Question Answering (VQA) model on red balls and blue cubes, but never show it a red cube, it might fail completely when asked to identify one. This failure represents a systematicity gap—the difference in performance between reasoning on seen combinations versus unseen combinations.
For years, a common assumption in Deep Learning has been “scale is all you need.” If the model isn’t generalizing, just feed it more data. But a fascinating research paper, Attribute Diversity Determines the Systematicity Gap in VQA, challenges this notion. The researchers introduce a new diagnostic dataset, CLEVR-HOPE, and demonstrate that simply increasing the quantity of training data does not fix the systematicity gap. Instead, the key lies in the diversity of that data.
In this post, we will break down this paper, explore the CLEVR-HOPE dataset, and understand why “attribute diversity” is the secret ingredient for teaching machines to think more like humans.
Background: VQA and The Compositionality Problem
Visual Question Answering (VQA) is a multi-modal task where an AI system is given an image and a natural language question about that image (e.g., “What color is the shiny cylinder?”). The model must process the visual information and the text to produce an answer.
While modern Transformer-based models like LXMERT achieve high accuracy on standard benchmarks, they often rely on statistical shortcuts rather than true reasoning. For instance, if the training data contains mostly “yellow” bananas, the model might learn to associate the word “banana” with the color output “yellow” without actually looking at the color in the image.
This reliance on correlations breaks down when the model faces Out-Of-Distribution (OOD) data—specifically, novel combinations of attributes (like a “purple banana” or a “rubber cylinder” if it has only seen metal ones).
The goal of this research was to isolate this specific phenomenon. The authors wanted to answer two questions:
- Does training on more data help the model learn to combine concepts systematically?
- If not, what property of the data actually drives this ability?
The Core Method: Introducing CLEVR-HOPE
To test systematicity rigorously, you cannot use standard “in the wild” datasets like COCO or VQA v2, because it is nearly impossible to track exactly which combinations of attributes (like “fuzzy” + “dog”) the model has seen.
The researchers built upon CLEVR, a synthetic dataset of 3D-rendered geometric shapes (cubes, spheres, cylinders) with various attributes (material, color, size, shape). They created a new diagnostic dataset called CLEVR-HOPE (CLEVR Held-Out Pair Evaluation).
Held-Out Pairs (HOPs)
The core mechanic of CLEVR-HOPE is the Held-Out Pair (HOP). A HOP is a specific combination of two attribute values—for example, Material: Rubber and Shape: Cylinder—that is strictly removed from the training set.
The model is trained on thousands of images containing rubber objects and cylinder objects, but never a single object that is both rubber and a cylinder.
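To make the setup concrete, here is a minimal sketch of how a held-out pair might be filtered out of a training set. The scene and question structures and the field names are illustrative assumptions, not the actual CLEVR-HOPE generation code.

```python
# Minimal sketch of held-out-pair (HOP) filtering. The scene/question
# structures and field names are illustrative assumptions, not the actual
# CLEVR-HOPE generation code.

HOP = {"material": "rubber", "shape": "cylinder"}

def scene_contains_hop(scene, hop):
    """True if any object in the scene has every attribute value of the HOP."""
    return any(
        all(obj.get(attr) == value for attr, value in hop.items())
        for obj in scene["objects"]
    )

def question_mentions_hop(question_text, hop):
    """Crude text check: does the question mention both HOP values?"""
    text = question_text.lower()
    return all(value in text for value in hop.values())

def build_train_split(samples, hop=HOP):
    """Keep only samples where the HOP appears neither in the image nor in the question."""
    return [
        s for s in samples
        if not scene_contains_hop(s["scene"], hop)
        and not question_mentions_hop(s["question"], hop)
    ]
```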
The Dataset Splits
To diagnose exactly where the model fails, CLEVR-HOPE is divided into very specific splits. This split structure is crucial for interpreting the results.

As shown in Figure 1 above, the dataset is structured around the HOP (in this example, a “rubber cylinder”):
- Train: Standard CLEVR-style images and questions. Crucially, the “rubber cylinder” never appears here, neither in the image nor in the text.
- Complex-IID (In-Distribution): Test data that follows the training distribution. It requires complex reasoning (counting, comparing) but contains no rubber cylinders.
- Complex-OOD (Out-Of-Distribution): Test data that requires complex reasoning and does contain rubber cylinders. This tests if the model can use the unseen combination in a complex task.
- Minimal-IID: Simple “existence” questions (e.g., “Is there a metal sphere?”) containing only seen combinations. The image contains only a single object.
- Minimal-OOD: Simple “existence” questions regarding the HOP (e.g., “Are there any rubber cylinders?”). The image contains only a single object.
The Minimal splits are a brilliant addition. They strip away the complexity of counting or spatial reasoning (like “on the left of”). If a model fails the Complex-OOD test, we might wonder if it failed because the question was too hard or because it didn’t recognize the object. The Minimal-OOD test isolates the recognition capability. If the model fails here, it simply doesn’t know what a “rubber cylinder” is.
Ensuring Fair Testing
To ensure the model isn’t just guessing, the Minimal-OOD tests are carefully balanced with “distractor” images.

In Figure 7, you can see how the minimal test works. For the question “Are there any matte cylinders?” (“matte” is the question-template synonym CLEVR uses for rubber, so the matte cylinder is the HOP), the test set includes:
- Positive example: A matte cylinder (Answer: Yes).
- Distractor 1: A matte cube (Matches material, wrong shape).
- Distractor 2: A metallic cylinder (Wrong material, matches shape).
- Distractor 3: A metallic cube (Wrong material, wrong shape).
This forces the model to actually understand both attributes simultaneously, rather than latching onto just “matte” or just “cylinder.”
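The 2×2 design of the minimal tests is easy to express in code. The sketch below enumerates the four single-object test cases for a Material + Shape HOP; the attribute values are CLEVR’s, but the data format is an illustrative assumption rather than the dataset’s generation script.

```python
from itertools import product

# Sketch of the 2x2 minimal-test design for a Material + Shape HOP.
# Attribute values are CLEVR's; the data format is illustrative.
hop_material, other_material = "rubber", "metal"
hop_shape, other_shape = "cylinder", "cube"

question = f"Are there any {hop_material} {hop_shape}s?"

minimal_cases = []
for material, shape in product((hop_material, other_material),
                               (hop_shape, other_shape)):
    is_hop = (material == hop_material) and (shape == hop_shape)
    minimal_cases.append({
        "scene": [{"material": material, "shape": shape}],  # single object
        "question": question,
        "answer": "yes" if is_hop else "no",
    })

# Exactly one of the four cases is a "yes": answering correctly requires
# getting both attributes right, not just one of them.
```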
The Experiments
The researchers tested two main model architectures:
- LXMERT: A Transformer-based model that processes vision and language together (similar to BERT but for VQA). They tested both a version pretrained on other data and a version trained from scratch.
- Tensor-NMN: A Neuro-Symbolic model which breaks questions down into a program (e.g., `count(filter_shape(cylinder))`); a toy illustration of this program structure follows this list.
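To see what “breaking a question into a program” means, here is a toy, purely symbolic interpreter for a program like `count(filter_shape(cylinder))`. Tensor-NMN’s actual modules are learned neural networks operating on image features; this sketch only shows the program structure.

```python
# Toy, purely symbolic interpreter for a program such as
# count(filter_shape(cylinder)). Tensor-NMN's actual modules are learned
# neural networks over image features; this only shows the structure.

def filter_shape(objects, shape):
    return [o for o in objects if o["shape"] == shape]

def count(objects):
    return len(objects)

scene = [
    {"shape": "cylinder", "color": "red"},
    {"shape": "cube", "color": "blue"},
    {"shape": "cylinder", "color": "green"},
]

# "How many cylinders are there?" -> count(filter_shape(scene, cylinder))
print(count(filter_shape(scene, "cylinder")))  # 2
```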
They trained these models on 29 different Held-Out Pairs, varying the size of the training set (25k, 200k, and 560k images) to check the “scale” hypothesis.
Results: The Quantity Myth vs. The Diversity Reality
The results provided a nuanced view of how these models learn.
1. The Systematicity Gap Exists and Persists
First, the good news: models can generalize. With enough training data, accuracy on the Out-Of-Distribution (OOD) sets was generally high. However, it was consistently lower than the In-Distribution (IID) accuracy.
This difference is the Systematicity Gap.
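Written out, using the sign convention implied by the figures discussed below (a negative gap means the model does worse on unseen combinations):

\[
\text{Systematicity gap} = \text{Accuracy}_{\text{OOD}} - \text{Accuracy}_{\text{IID}}
\]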
If the “scale is all you need” hypothesis were true for systematicity, we would expect this gap to disappear as we added more training data. The model would become so robust that unseen combinations would be just as easy as seen ones.

However, Figure 5 shows the opposite. While the gap narrows slightly at first, it quickly plateaus. Even as the training set grows from 25k to 560k samples, a persistent gap of about 5-6% remains for LXMERT (and worse for other models). Adding more data improves overall accuracy, but it improves IID and OOD accuracy at roughly the same rate, leaving the gap between them unchanged.
2. Attribute Diversity is the Key
If quantity isn’t the fix, what is? The researchers looked at the Attribute Diversity of the training data.
Attribute Diversity is defined as the number of possible value combinations for the held-out pair’s two attribute types, i.e., the product of how many values each type can take. The sketch after the two examples below makes the arithmetic explicit.
- Low Diversity: Consider holding out a “Large Rubber” object. There are only 2 sizes and 2 materials in CLEVR. Total combinations = \(2 \times 2 = 4\).
- High Diversity: Consider holding out a “Red Cylinder.” There are 8 colors and 3 shapes. Total combinations = \(8 \times 3 = 24\).
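A tiny sketch, assuming CLEVR’s standard attribute vocabulary, reproduces these numbers:

```python
# Attribute diversity of a held-out pair: the product of the number of
# possible values for each of its two attribute types (CLEVR's standard
# attribute vocabulary).
ATTRIBUTE_VALUES = {
    "size": ["small", "large"],                        # 2 values
    "material": ["rubber", "metal"],                   # 2 values
    "shape": ["cube", "sphere", "cylinder"],           # 3 values
    "color": ["gray", "red", "blue", "green",
              "brown", "purple", "cyan", "yellow"],    # 8 values
}

def attribute_diversity(type_a: str, type_b: str) -> int:
    return len(ATTRIBUTE_VALUES[type_a]) * len(ATTRIBUTE_VALUES[type_b])

print(attribute_diversity("size", "material"))  # 4,  e.g. the "large rubber" HOP
print(attribute_diversity("color", "shape"))    # 24, e.g. the "red cylinder" HOP
```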

Table 1 lists these diversities. The researchers found a striking correlation: High diversity leads to a smaller systematicity gap.
When the model is forced to learn 24 different combinations of Color+Shape, it seems to internalize the rule of combining color and shape. When it only sees 4 combinations of Size+Material, it likely just memorizes them as individual atomic concepts.

Figure 2 visualizes this discovery perfectly. Look at the lines for Diversity 24 (blue) and Diversity 16 (orange). The systematicity gap is near zero (or even positive, meaning OOD performed better!).
Now look at Diversity 4 (red). The gap is massive (around -13%). The model fails significantly harder on unseen combinations when the attribute diversity is low.
3. Visualizing the Diversity Effect
This relationship between attribute types and generalization performance can be visualized as a heatmap.

In Figure 23, the axes represent different attribute types. The “lighter” colors represent a large (negative) systematicity gap—meaning poor generalization. The “darker” colors represent a small gap.
Notice the pattern? The top-left corner corresponds to Material and Size, which have very few possible values (Low Diversity). This area is bright yellow/green, indicating the models struggle to generalize here.
The bottom-right corner corresponds to Color and Shape, which have many values (High Diversity). This area is dark, indicating the models generalize almost perfectly.
4. Controlling for Confounders
A skeptic might ask: “Maybe color is just easier to learn than material? Maybe it’s not about the number of combinations (diversity), but the specific attributes themselves?”
To address this, the researchers ran a control experiment. They took a high-diversity pair (Color + Shape) but artificially restricted the training data to show only a few combinations (simulating low diversity).

Figure 6 confirms their hypothesis. Even for Color + Shape, when the diversity is artificially lowered to 4 (red line/area), the systematicity gap widens significantly (-7.5%). As they allowed more combinations (moving to 16 and 24), the gap vanished. This strongly suggests that it is diversity, not the specific attribute type, that drives the ability to generalize.
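A sketch of what such a restriction could look like; the names and sampling scheme are illustrative assumptions, not the paper’s experiment code:

```python
import random

# Sketch of the diversity-control idea: keep the attribute types fixed
# (Color + Shape) but allow only a limited number of their combinations in
# training. The held-out pair itself is always excluded.
COLORS = ["gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow"]
SHAPES = ["cube", "sphere", "cylinder"]
HOP = ("red", "cylinder")

def sample_allowed_pairs(diversity, seed=0):
    """Choose which color+shape combinations may appear in training."""
    candidates = [(c, s) for c in COLORS for s in SHAPES if (c, s) != HOP]
    rng = random.Random(seed)
    return set(rng.sample(candidates, min(diversity, len(candidates))))

def keep_scene(scene, allowed_pairs):
    """Keep a training scene only if every object uses an allowed combination."""
    return all((o["color"], o["shape"]) in allowed_pairs for o in scene["objects"])

allowed = sample_allowed_pairs(diversity=4)  # simulate the low-diversity condition
```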
Conclusion & Implications
This research highlights a critical limitation in how we approach dataset construction. We often focus on the sheer volume of data—scraping billions of images or tokens. While volume helps a model memorize and recognize patterns it has seen before, this paper suggests that volume alone does not teach the model the “rules” of the world.
To build AI that can truly reason systematically—that can imagine a “rubber sphere” having only seen rubber cubes and metal spheres—we need to optimize for diversity.
Key Takeaways:
- The Gap is Real: Models perform worse on unseen attribute combinations, even if they know the individual attributes.
- Scale Plateaus: Simply adding more training data of the same distribution does not close the systematicity gap.
- Diversity is King: The number of distinct concept combinations seen during training is the primary predictor of systematic generalization.
For students and practitioners, this offers a practical lesson: when curating data for a new task, ensure your model sees concepts in as many different contexts as possible. Don’t just show it a “red apple” 10,000 times. Show it red cars, red shirts, and red skies. That contextual diversity is what allows the neural network to detach the concept of “red” from the object “apple” and apply it to something it has never seen before.