Humans are masters of “compositional generalization.” If you know what “spinning” means, and you know what “pulling a red lever” means, you can instantly understand the command “pull the red lever while spinning,” even if you have never physically performed that specific combination of actions before. You don’t need to see a tutorial for every possible combination of words and actions; you understand the components and the rules for combining them.
In the world of Artificial Intelligence, however, this remains a stubborn hurdle. Deep learning models often struggle to generalize to new combinations of known concepts. This problem becomes even harder in Grounded Language Learning, where an agent (like a robot or a virtual avatar) must interpret instructions relative to a visual world state.
A popular technique called In-Context Learning (ICL)—where the model is given a few “support” examples before being asked to solve a query—has shown promise. But there is a catch: standard ICL relies on retrieving good examples from the training data. What happens when the training data doesn’t contain the specific scenario you are facing?
In the paper “Generating Demonstrations for In-Context Compositional Generalization in Grounded Language Learning,” researchers from Aalto University propose a novel solution called DemoGen. Instead of searching for imperfect examples in a database, their agent generates its own support examples tailored to the current situation.
This post explores how DemoGen works, why retrieval fails in grounded settings, and how generating synthetic demonstrations can unlock state-of-the-art performance on complex generalization benchmarks.
The Problem: Grounded Compositional Generalization
To understand the innovation of DemoGen, we first need to define the specific type of difficulty the researchers are tackling.
Inductive vs. Productive Generalization
Compositional generalization generally falls into two buckets:
- Inductive: The model sees a new combination of inputs but produces a known output symbol.
- Productive: The model must produce a novel combination of output symbols.
Productive generalization is significantly harder. Imagine a robot that knows how to WALK and how to PUSH. If you ask it to “walk and push,” it must produce an action sequence that combines WALK and PUSH, even though it never generated that combination during training.
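To make the distinction concrete, here is a toy sketch (the commands and action tokens below are hypothetical, not taken from any benchmark): every individual output token was seen during training, but the test command demands an output sequence that was never produced.

```python
# Toy illustration of productive generalization (hypothetical commands/actions).
train_pairs = [
    ("walk to the square", ["WALK", "WALK"]),
    ("push the square",    ["PUSH"]),
]
test_pair = ("walk and push", ["WALK", "WALK", "PUSH"])  # novel output combination

# Every individual action token is known, but this exact output sequence is not.
seen_outputs = {tuple(actions) for _, actions in train_pairs}
print(tuple(test_pair[1]) in seen_outputs)  # False -> productive generalization needed
```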
The “Grounded” Challenge
In grounded learning, the correct action depends on the state of the world. The command “walk to the yellow square” results in a completely different sequence of motor actions depending on where the agent is standing and where the square is located.
This makes the standard approach to In-Context Learning difficult. Usually, if a model is confused, we retrieve similar examples from the training set. But in a grounded environment, finding a “similar” example is a nightmare. You might find an example of “walking to a yellow square” from the training data, but if the layout of the room in that example is different from the current room, the action sequence will be completely different. The example becomes noise rather than a helpful hint.
Why Retrieval is Not Enough
The researchers performed a deep analysis of nearest-neighbor similarity in the gSCAN dataset (a standard benchmark for grounded language learning). They looked for training examples that were similar to the test cases.

As shown in Figure 3, the similarity drops off rapidly. Even the closest neighbors in the training set often have significant differences in the environment layout (Hamming similarity often below 0.8).
This confirms a critical hypothesis: You cannot simply retrieve your way out of the productive generalization problem. If the specific state-instruction pair doesn’t exist in the training data, retrieving the “nearest” one often yields an example where the agent is in the wrong place or the obstacles are different, leading to irrelevant action sequences.
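To make the similarity analysis concrete, here is a minimal sketch of a Hamming-style state similarity over grid layouts. It assumes each state is a small integer grid encoding what occupies each cell, a simplification of gSCAN’s actual state representation; the function name is ours, not the paper’s.

```python
import numpy as np

def hamming_similarity(state_a: np.ndarray, state_b: np.ndarray) -> float:
    """Fraction of grid cells whose contents match between two layouts."""
    assert state_a.shape == state_b.shape
    return float(np.mean(state_a == state_b))

# Two 6x6 layouts that differ in a handful of cells: even a "close" retrieved
# state can place the agent or the obstacles somewhere else entirely.
rng = np.random.default_rng(0)
query_state = rng.integers(0, 4, size=(6, 6))
retrieved_state = query_state.copy()
rows, cols = rng.integers(0, 6, 8), rng.integers(0, 6, 8)
retrieved_state[rows, cols] = rng.integers(0, 4, 8)

print(hamming_similarity(query_state, retrieved_state))  # below 1.0: layouts differ
```

Even at a similarity of around 0.8, the cells that differ may be exactly the ones that matter (the agent’s square, the target object), which is why a retrieved action sequence can be useless.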
The Solution: DemoGen
If the perfect support examples don’t exist in the training data, we must create them. This is the core philosophy of DemoGen.
The method operates in a three-stage pipeline, designed to generate a “curriculum” of examples for the current specific query.
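At a high level, the pipeline can be summarized with the following sketch. The function and method names are illustrative placeholders, not the authors’ code; the three stages mirror the description below.

```python
def demogen_inference(query_state, query_instruction,
                      instruction_generator, scorer,
                      bootstrap_transformer, icl_transformer,
                      num_supports: int = 8):
    """Illustrative sketch of the three-stage DemoGen pipeline."""
    # Stage 1: propose instructions relevant to the current scene by masking
    # and re-filling the query, then keep only in-distribution candidates.
    candidates = instruction_generator.sample(query_instruction, query_state)
    candidates = [c for c in candidates if scorer.is_in_distribution(c, query_state)]

    # Stage 2: solve each candidate instruction in the *current* state with a
    # bootstrap model trained on the ordinary training data.
    supports = [
        (query_state, instr, bootstrap_transformer.predict(query_state, instr))
        for instr in candidates[:num_supports]
    ]

    # Stage 3: condition the ICL Transformer on the generated supports plus the query.
    return icl_transformer.predict(supports, query_state, query_instruction)
```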

1. The Instruction Generator
The process starts with the Instruction Generator (seen in the middle of Figure 1). This is a masked language model (specifically, a BART-like architecture). It takes the current query instruction (e.g., “spin and pull a small yellow cylinder”) and randomly masks parts of it. It then reconstructs new instructions that are relevant to the current visual scene.
The goal is to generate instructions that are similar to the query but simpler or slightly different. For the query “spin and pull a small yellow cylinder,” it might generate:
- “Pull a small yellow cylinder”
- “Walk to a yellow cylinder”
- “Pull a red cylinder”
Crucially, the authors use a scoring model to filter these generated instructions, keeping only those that are “in-distribution”—meaning instructions that look like valid tasks the agent should know how to do.
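A minimal sketch of the masking step is shown below. The pretrained handles (`infill_model`, `scorer`) are placeholders for the paper’s BART-like generator and scoring model, so only the masking itself is runnable here.

```python
import random

MASK = "<mask>"

def mask_instruction(instruction: str, mask_prob: float = 0.3, seed=None) -> str:
    """Randomly replace tokens of the query instruction with a mask token."""
    rng = random.Random(seed)
    return " ".join(MASK if rng.random() < mask_prob else tok
                    for tok in instruction.split())

query = "spin and pull a small yellow cylinder"
masked_variants = [mask_instruction(query, seed=s) for s in range(4)]
print(masked_variants)  # e.g. "spin and <mask> a small <mask> cylinder"

# The BART-like generator then rewrites each masked variant into a complete,
# scene-relevant instruction, and the scoring model keeps in-distribution ones:
# candidates = [infill_model.infill(m, scene) for m in masked_variants]
# supports = [c for c in candidates if scorer.score(c, scene) > threshold]
```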
2. The Bootstrap Transformer
Once we have a list of valid, relevant instructions (\(I_1, I_3, ...\)), we need to know the correct actions to perform them.
This is where the Bootstrap Transformer comes in: a standard Transformer trained on the original training data. While it might fail on the complex test query (which requires productive generalization), it is generally capable of solving the simpler instructions generated in step 1.
The Bootstrap Transformer acts as a simulator, generating the action sequences (\(A_1, A_3, ...\)) for the generated instructions within the current state.
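In code, this labeling step amounts to greedy (or beam) decoding of each generated instruction against the current state. The sketch below assumes a hypothetical encoder-decoder interface (`encode`/`decode`) and illustrative token IDs; it is not the authors’ implementation.

```python
import torch

@torch.no_grad()
def label_supports(bootstrap_model, state, instructions,
                   max_len: int = 64, bos_id: int = 1, eos_id: int = 2):
    """Greedy-decode an action sequence for each generated instruction."""
    supports = []
    for instr in instructions:
        memory = bootstrap_model.encode(state, instr)        # fuse scene + text
        actions = [bos_id]
        for _ in range(max_len):
            logits = bootstrap_model.decode(memory, torch.tensor([actions]))
            next_id = int(logits[0, -1].argmax())
            if next_id == eos_id:
                break
            actions.append(next_id)
        supports.append((state, instr, actions[1:]))          # drop BOS token
    return supports
```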
3. In-Context Learning (ICL)
Now, the agent has a set of perfectly tailored examples: instructions relevant to the query, paired with correct action sequences, all taking place in the current world state.
These generated pairs are fed into the final ICL Transformer along with the original difficult query.

As illustrated in Figure 2, the architecture concatenates the support examples (State, Instruction, Action) with the Query (State, Instruction). The Encoder processes this massive context, and the Decoder generates the final target actions.
By seeing examples of “how to pull” and “how to spin” in the current room layout, the model can infer how to “pull while spinning,” even if it has never done so before.
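A sketch of the context construction is shown below: each support contributes its state, instruction, and action tokens, while the query contributes only its state and instruction, leaving the decoder to produce the query’s actions. The separator tokens and toy token strings are illustrative, not the paper’s exact vocabulary.

```python
SEP, STATE, INSTR, ACT = "<sep>", "<state>", "<instr>", "<act>"

def build_icl_context(supports, query_state_tokens, query_instr_tokens):
    """Flatten generated supports and the query into one encoder input sequence."""
    context = []
    for state_toks, instr_toks, action_toks in supports:
        context += [STATE, *state_toks, INSTR, *instr_toks, ACT, *action_toks, SEP]
    context += [STATE, *query_state_tokens, INSTR, *query_instr_tokens]
    return context

# Toy example: two generated supports in the same state as the query.
supports = [
    (["s00", "s01"], ["pull", "a", "yellow", "cylinder"], ["A1", "A2"]),
    (["s00", "s01"], ["walk", "while", "spinning"], ["A3", "A4"]),
]
print(build_icl_context(supports, ["s00", "s01"], ["pull", "while", "spinning"]))
```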
Experimental Results
The researchers evaluated DemoGen on gSCAN, a benchmark specifically designed to break models that rely on simple pattern matching. The most notorious challenge in gSCAN is Split H.
In Split H, the model must perform a task like “pull a [object] while spinning.” In the training data, the model has seen “pulling” and it has seen “spinning” (with other verbs like pushing), but it has never seen “pulling while spinning.” This requires true productive generalization.
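A simplified sketch of the Split H filter is given below: any instruction that combines the held-out verb and adverb is excluded from training and appears only at test time. The predicate is our own simplification, shown just to make the setup concrete.

```python
def is_split_h_test(instruction: str) -> bool:
    """True for the held-out 'pull ... while spinning' combination (simplified)."""
    return "pull" in instruction and "while spinning" in instruction

examples = [
    "push a big square while spinning",   # train: spinning paired with another verb
    "pull a small yellow cylinder",       # train: pulling without spinning
    "pull a red circle while spinning",   # test only: the held-out combination
]
train = [e for e in examples if not is_split_h_test(e)]
test = [e for e in examples if is_split_h_test(e)]
print(train)
print(test)
```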
Quantitative Performance
The results were striking.

Looking at Table 2, pay close attention to row H:
- No ICL (TF): The baseline Transformer largely fails (0.22 success rate).
- Retrieval (CovR / GandR): Methods that try to find existing examples struggle (0.56 and 0.17). They cannot find examples of “pulling while spinning,” and retrieving examples from different states confuses the agent.
- DemoGen: Achieves a score of 0.80, drastically outperforming retrieval baselines.
This result validates the core hypothesis: when the task requires composing concepts in a novel way, showing the model generated examples of the components (e.g., pulling here, spinning here) allows it to synthesize the solution.
Why Does It Work? Support Analysis
To understand why DemoGen succeeds where retrieval fails, the authors analyzed the content of the support sets.

Table 3 reveals the quality of the demonstrations.
- Row (2) Agent Pos: DemoGen (DG) always provides examples with the correct agent position (1.00), because it generates them in the current state. Retrieval methods (CovR/GandR) often retrieve examples from different states, leading to position mismatches.
- Row (6) & (7): DemoGen consistently generates supports that contain the correct Verb and Adverb (separately or together). Retrieval methods often miss the adverb entirely because “pull while spinning” doesn’t exist in the database.
Scaling to Natural Language
One criticism of grid-world datasets like gSCAN is that the language is robotic (“walk to small red square”). To prove DemoGen isn’t just exploiting synthetic patterns, the authors used GPT-3.5 to paraphrase the dataset into more natural, varied English (NL-gSCAN).

Figure 6 shows the shift in linguistic complexity. The original gSCAN (orange) has a tiny vocabulary. The new NL-gSCAN (blue) follows a Zipfian distribution typical of natural language.
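The Zipfian claim is easy to probe with a simple frequency-rank profile; the snippet below is our own toy check on made-up sentences, not the paper’s analysis of NL-gSCAN.

```python
from collections import Counter

def zipf_profile(sentences):
    """(rank, frequency) pairs for the corpus vocabulary, sorted by frequency.

    For natural text, frequency falls off roughly as 1/rank (Zipf's law);
    templated gSCAN commands reuse a tiny vocabulary almost uniformly.
    """
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return list(enumerate(sorted(counts.values(), reverse=True), start=1))

gscan_like = ["walk to the small red square", "push the big blue cylinder"]
nl_like = ["could you head over to the little crimson box for me",
           "please go shove that large navy tube out of the way"]
print(zipf_profile(gscan_like))
print(zipf_profile(nl_like))
```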
Even on this harder, more linguistically diverse dataset, DemoGen maintained its lead.

As shown in Table 4, on the natural language version of Split H, the baseline drops to 0.19 and retrieval (GandR) drops to 0.17. DemoGen maintains a strong 0.59. While lower than the synthetic score, it shows the method is robust to linguistic variation.
Ablations: What Matters?
The paper includes several ablation studies to pinpoint exactly which parts of the generated demonstrations are valuable.

Table 6 offers a fascinating insight into the “logic” of the model. The researchers tried removing specific types of supports from the set provided to the model:
- Removing supports that show the target object: a small performance drop (0.13). The model does not strictly need to see the exact object to understand the task.
- Removing supports that show the verb (e.g., pull): a massive drop (0.59). If the model never sees an example of “pulling” in the current context, it fails.
- Removing supports that show the adverb (e.g., while spinning): a massive drop (0.50).
This confirms that DemoGen works by effectively decomposing the problem. It provides the model with “unit tests” for the verb and the adverb separately within the current environment, allowing the model to stitch them together for the final query.
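The ablation itself is conceptually simple: filter out supports that mention a particular component of the query before running the ICL model. The sketch below uses simplified string matching and field names of our own choosing.

```python
def ablate_supports(supports, query, drop: str):
    """Remove supports mentioning one component of the query (simplified matching)."""
    key = {"verb": query["verb"], "adverb": query["adverb"], "target": query["target"]}[drop]
    return [(instr, acts) for instr, acts in supports if key not in instr]

query = {"verb": "pull", "adverb": "while spinning", "target": "small yellow cylinder"}
supports = [
    ("pull a red circle", ["..."]),
    ("walk to a small yellow cylinder while spinning", ["..."]),
    ("push a big square", ["..."]),
]
print(ablate_supports(supports, query, drop="verb"))    # drops the 'pull' support
print(ablate_supports(supports, query, drop="adverb"))  # drops the 'while spinning' support
```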
Conclusion
The “DemoGen” paper highlights a significant shift in how we think about few-shot learning. For a long time, the paradigm has been Retrieve \(\rightarrow\) Solve. This works well for text-only tasks where the internet contains a sentence similar to almost anything you might say.
However, in grounded environments—robotics, virtual agents, and multimodal systems—the state space is too vast to rely on retrieval. You will never find a training example that perfectly matches your current messy living room and the specific complex command you just gave your robot.
Spilsbury et al. demonstrate that the future of grounded learning likely lies in Generate \(\rightarrow\) Solve. By giving agents the ability to imagine relevant sub-tasks and solve them mentally (via the Bootstrap Transformer) before attempting the main task, we unlock a level of compositional generalization that static datasets simply cannot provide.
Key Takeaways
- Retrieval fails in grounded settings: Finding “similar” examples is ineffective when the environment state changes.
- Synthesis > Retrieval: Generating custom support examples allows the model to see relevant concepts (verbs, adverbs) executed in the current context.
- Compositional Generalization is unlocked: By seeing the components of a complex command demonstrated separately, the model can productively generalize to novel combinations.
This approach paves the way for more autonomous agents that don’t just memorize instructions, but actively reason about how to apply their skills to new, unseen challenges.