Humans are masters of “compositional generalization.” If you know what “spinning” means, and you know what “pulling a red lever” means, you can instantly understand the command “pull the red lever while spinning,” even if you have never physically performed that specific combination of actions before. You don’t need to see a tutorial for every possible combination of words and actions; you understand the components and the rules for combining them.

In the world of Artificial Intelligence, however, this remains a stubborn hurdle. Deep learning models often struggle to generalize to new combinations of known concepts. This problem becomes even harder in Grounded Language Learning, where an agent (like a robot or a virtual avatar) must interpret instructions relative to a visual world state.

A popular technique called In-Context Learning (ICL)—where the model is given a few “support” examples before being asked to solve a query—has shown promise. But there is a catch: standard ICL relies on retrieving good examples from the training data. What happens when the training data doesn’t contain the specific scenario you are facing?

In the paper “Generating Demonstrations for In-Context Compositional Generalization in Grounded Language Learning,” researchers from Aalto University propose a novel solution called DemoGen. Instead of searching for imperfect examples in a database, their agent generates its own support examples tailored to the current situation.

This post explores how DemoGen works, why retrieval fails in grounded settings, and how generating synthetic demonstrations can unlock state-of-the-art performance on complex generalization benchmarks.

The Problem: Grounded Compositional Generalization

To understand the innovation of DemoGen, we first need to define the specific type of difficulty the researchers are tackling.

Inductive vs. Productive Generalization

Compositional generalization generally falls into two buckets:

  1. Inductive: The model sees a new combination of inputs but produces a known output symbol.
  2. Productive: The model must produce a novel combination of output symbols.

Productive generalization is significantly harder. Imagine a robot that knows how to WALK and how to PUSH. If you ask it to “walk and push,” it must generate a sequence WALK PUSH that it might never have produced during training.

The “Grounded” Challenge

In grounded learning, the correct action depends on the state of the world. The command “walk to the yellow square” results in a completely different sequence of motor actions depending on where the agent is standing and where the square is located.
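To make this concrete, here is a minimal toy sketch (not from the paper; it uses a simplified UP/DOWN/LEFT/RIGHT action set rather than gSCAN's actual turn-and-walk primitives) of how one instruction grounds to different action sequences in different states:

```python
# Toy illustration: "walk to the yellow square" grounds to different
# action sequences depending on where the agent and the square are.

def walk_to(agent, target):
    """Greedy plan: close the row gap, then the column gap, one cell at a time."""
    actions = []
    (ar, ac), (tr, tc) = agent, target
    actions += ["DOWN" if tr > ar else "UP"] * abs(tr - ar)
    actions += ["RIGHT" if tc > ac else "LEFT"] * abs(tc - ac)
    return actions

# Same instruction, two different room layouts:
print(walk_to(agent=(0, 0), target=(2, 3)))  # ['DOWN', 'DOWN', 'RIGHT', 'RIGHT', 'RIGHT']
print(walk_to(agent=(3, 3), target=(2, 0)))  # ['UP', 'LEFT', 'LEFT', 'LEFT']
```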

This makes the standard approach to In-Context Learning difficult. Usually, if a model is confused, we retrieve similar examples from the training set. But in a grounded environment, finding a “similar” example is a nightmare. You might find an example of “walking to a yellow square” from the training data, but if the layout of the room in that example is different from the current room, the action sequence will be completely different. The example becomes noise rather than a helpful hint.

Why Retrieval is Not Enough

The researchers performed a deep analysis of nearest-neighbor similarity in the gSCAN dataset (a standard benchmark for grounded language learning). They looked for training examples that were similar to the test cases.

Figure 3: Average state nearest-neighbour similarity (between the shown split and the training split) for each split. X-axis is log-scale. The tables show the average Hamming similarity between points in a given split and their Nth nearest neighbour in the training split. TR refers to the training split.

As shown in Figure 3, similarity drops off rapidly. Even the closest neighbors in the training set frequently differ substantially in environment layout (Hamming similarity often below 0.8).

This confirms a critical hypothesis: You cannot simply retrieve your way out of the productive generalization problem. If the specific state-instruction pair doesn’t exist in the training data, retrieving the “nearest” one often yields an example where the agent is in the wrong place or the obstacles are different, leading to irrelevant action sequences.
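For intuition, a nearest-neighbour analysis of this kind can be sketched as follows (assuming states are encoded as flat arrays of cell features; this is illustrative, not the authors' code):

```python
import numpy as np

def hamming_similarity(a, b):
    """Fraction of positions at which two flattened state encodings agree."""
    a, b = np.asarray(a).ravel(), np.asarray(b).ravel()
    return float((a == b).mean())

def kth_nearest_similarity(test_state, train_states, k=1):
    """Similarity between a test state and its k-th most similar training state."""
    sims = sorted((hamming_similarity(test_state, s) for s in train_states),
                  reverse=True)
    return sims[k - 1]
```

With a large, diverse state space, even the first nearest neighbour of a test state frequently scores below 0.8, which is the behaviour summarized in Figure 3.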

The Solution: DemoGen

If the perfect support examples don’t exist in the training data, we must create them. This is the core philosophy of DemoGen.

The method operates in a three-stage pipeline, designed to generate a “curriculum” of examples for the current specific query.

Figure 1: Generating demonstrations on gSCAN with DemoGen for an ICL Transformer (Figure 2). The Instruction Generator takes as input the current state and I_q and produces similar instructions I_1, …, I_n likely to occur in the same state, sorted by likelihood (shown in parentheses). A Bootstrap Transformer trained on the training data generates the corresponding actions A_1, …, A_n in that state.

1. The Instruction Generator

The process starts with the Instruction Generator (seen in the middle of Figure 1). This is a masked language model (specifically, a BART-like architecture). It takes the current query instruction (e.g., “spin and pull a small yellow cylinder”) and randomly masks parts of it. It then reconstructs new instructions that are relevant to the current visual scene.

The goal is to generate instructions that are similar to the query but simpler or slightly different. For the query “spin and pull a small yellow cylinder,” it might generate:

  • “Pull a small yellow cylinder”
  • “Walk to a yellow cylinder”
  • “Pull a red cylinder”

Crucially, the authors use a scoring model to filter these generated instructions, keeping only those that are “in-distribution”—meaning instructions that look like valid tasks the agent should know how to do.
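A minimal sketch of the masking step (a toy stand-in for the paper's BART-style infilling model; the names here are illustrative):

```python
import random

def mask_instruction(instruction, mask_prob=0.3, mask_token="<mask>"):
    """Randomly replace tokens of the query instruction with a mask token."""
    return " ".join(mask_token if random.random() < mask_prob else token
                    for token in instruction.split())

query = "pull a small yellow cylinder while spinning"
masked_variants = [mask_instruction(query) for _ in range(5)]
# e.g. "pull a <mask> yellow cylinder while <mask>"
# A BART-style infilling model conditioned on the current state fills the masks,
# proposing candidates like "pull a red cylinder" or "walk to a yellow cylinder";
# a scoring model then keeps only the in-distribution candidates.
```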

2. The Bootstrap Transformer

Once we have a list of valid, relevant instructions (\(I_1, \dots, I_n\)), we need to know the correct actions to perform them.

This is where the Bootstrap Transformer comes in: a standard Transformer trained on the original training data. While it might fail on the complex test query (which requires productive generalization), it is generally capable of solving the simpler instructions generated in step 1.

The Bootstrap Transformer acts as a simulator, generating the action sequences (\(A_1, \dots, A_n\)) for the generated instructions within the current state.
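In code, this bootstrap step amounts to labelling each generated instruction with actions predicted in the current state. The sketch below is hypothetical; `solve` stands in for the trained Bootstrap Transformer's decoding routine:

```python
def label_supports(solve, state, instructions):
    """Pair each generated instruction with the actions `solve` predicts for it
    in the current state. `solve(state, instruction) -> list[str]` stands in
    for the Bootstrap Transformer."""
    return [(state, instruction, solve(state, instruction))
            for instruction in instructions]

# Toy usage with a stand-in solver:
def toy_solver(state, instruction):
    return ["WALK", "WALK"] + (["PULL"] if "pull" in instruction else [])

supports = label_supports(
    toy_solver,
    state={"agent": (0, 0), "yellow cylinder": (0, 2)},
    instructions=["pull a yellow cylinder", "walk to a yellow cylinder"],
)
```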

3. In-Context Learning (ICL)

Now, the agent has a set of perfectly tailored examples: instructions relevant to the query, paired with correct action sequences, all taking place in the current world state.

These generated pairs are fed into the final ICL Transformer along with the original difficult query.

Figure 2: The model architecture for sequence-to-sequence ICL. Each support state S_1, …, S_n, support instruction I_1, …, I_n and corresponding support targets A_1, …, A_n, as well as the query state S_q and query instruction I_q, are used as inputs to a Transformer Encoder (along with positional encoding).

As illustrated in Figure 2, the architecture concatenates the support examples (State, Instruction, Action) with the Query (State, Instruction). The Encoder processes this massive context, and the Decoder generates the final target actions.
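A rough sketch of how such an input sequence could be assembled (an assumed layout, simplified from Figure 2; `encode_state` is a toy stand-in for the paper's state encoding):

```python
def encode_state(state):
    """Toy state encoding: one token per object and its grid position."""
    return [f"{name}@{r},{c}" for name, (r, c) in sorted(state.items())]

def build_icl_input(supports, query_state, query_instruction,
                    sep="<sep>", state_tok="<state>"):
    """Flatten supports and the query into one encoder token sequence."""
    tokens = []
    for state, instruction, actions in supports:
        tokens += [state_tok] + encode_state(state)       # support state S_i
        tokens += instruction.split() + actions + [sep]   # support I_i and A_i
    tokens += [state_tok] + encode_state(query_state)     # query state S_q
    tokens += query_instruction.split()                   # query instruction I_q
    return tokens  # the decoder is trained to emit the query actions A_q
```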

By seeing examples of “how to pull” and “how to spin” in the current room layout, the model can infer how to “pull while spinning,” even if it has never done so before.

Experimental Results

The researchers evaluated DemoGen on gSCAN, a benchmark specifically designed to break models that rely on simple pattern matching. The most notorious challenge in gSCAN is Split H.

In Split H, the model must perform a task like “pull a [object] while spinning.” In the training data, the model has seen “pulling” and it has seen “spinning” (with other verbs like pushing), but it has never seen “pulling while spinning.” This requires true productive generalization.

Quantitative Performance

The results were striking.

Table 2: Success rates on reference datasets for different splits. Numbers are +/- standard deviation over 10 seeds, measured after 300,000 steps. Variances are shown only for retrieval and generation experiments and are negligible in the other experiments. Algorithmic, Retrieval and Generation all use the ICL Transformer as the architecture, with supports generated by each method. TF is a Transformer baseline and FT is the same Transformer fine-tuned on generated demonstrations from DemoGen. Best non-oracle results are bolded.

Looking at Table 2, pay close attention to row H:

  • No ICL (TF): The baseline Transformer fails completely (0.22 success rate).
  • Retrieval (CovR / GandR): Methods that try to find existing examples struggle (0.56 and 0.17). They cannot find examples of “pulling while spinning,” and retrieving examples from different states confuses the agent.
  • DemoGen: Achieves a score of 0.80, drastically outperforming retrieval baselines.

This result validates the core hypothesis: when the task requires composing concepts in a novel way, showing the model generated examples of the components (e.g., pulling here, spinning here) allows it to synthesize the solution.

Why Does It Work? Support Analysis

To understand why DemoGen succeeds where retrieval fails, the authors analyzed the content of the support sets.

Table 3: Fraction of supports matching each criterion, for each generation method on Split H. Omitted is Heuristic, which is 1.0 in every category. Rows (6)-(8) are calculated based on whether any support in a query's support set matches that criterion. Other splits are shown in Appendix F.

Table 3 reveals the quality of the demonstrations.

  • Row (2) Agent Pos: DemoGen (DG) always provides examples with the correct agent position (1.00), because it generates them in the current state. Retrieval methods (CovR/GandR) often retrieve examples from different states, leading to position mismatches.
  • Row (6) & (7): DemoGen consistently generates supports that contain the correct Verb and Adverb (separately or together). Retrieval methods often miss the adverb entirely because “pull while spinning” doesn’t exist in the database.
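A sketch of the kind of per-query bookkeeping behind these rows (with simplifying assumptions: the verb is taken to be the first word of the instruction and the adverb its last word):

```python
def support_coverage(query_state, query_instruction, supports):
    """Check whether any support shares the agent position, the verb,
    or the adverb of the query (cf. rows (2), (6) and (7) of Table 3)."""
    words = query_instruction.lower().split()
    verb, adverb = words[0], words[-1]   # e.g. "pull", "spinning"
    return {
        "agent_pos": any(s["agent"] == query_state["agent"] for s, _, _ in supports),
        "verb":      any(verb in i.lower().split() for _, i, _ in supports),
        "adverb":    any(adverb in i.lower().split() for _, i, _ in supports),
    }
```

Averaging these booleans over all queries in a split yields fractions like those reported in Table 3.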

Scaling to Natural Language

One criticism of grid-world datasets like gSCAN is that the language is robotic (“walk to small red square”). To prove DemoGen isn’t just exploiting synthetic patterns, the authors used GPT-3.5 to paraphrase the dataset into more natural, varied English (NL-gSCAN).

Figure 6: Word frequency distributions of NL-gSCAN and gSCAN, each compared to the best-fitting Zipf distribution probability density function. gSCAN words are in orange and NL-gSCAN words are in blue (the larger vocabulary).

Figure 6 shows the shift in linguistic complexity. The original gSCAN (orange) has a tiny vocabulary. The new NL-gSCAN (blue) follows a Zipfian distribution typical of natural language.
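For reference, the Zipf exponent of a corpus can be estimated with a standard log-log least-squares fit on rank-frequency data (a generic recipe, not necessarily the authors' exact procedure):

```python
from collections import Counter
import numpy as np

def zipf_exponent(tokens):
    """Fit f(r) ~ C / r^s to the corpus rank-frequency curve and return s."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope  # natural-language corpora typically land near s close to 1
```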

Even on this harder, more linguistically diverse dataset, DemoGen maintained its lead.

Table 4: Success rates for a non-ICL Transformer (TF), retrieval baselines, and DemoGen on NL-gSCAN. Best results are bolded.

As shown in Table 4, on the natural language version of Split H, the baseline drops to 0.19 and retrieval (GandR) drops to 0.17. DemoGen maintains a strong 0.59. While lower than the synthetic score, it shows the method is robust to linguistic variation.

Ablations: What Matters?

The paper includes several ablation studies to pinpoint exactly which parts of the generated demonstrations are valuable.

Table 6: DemoGen Split H success rate with 16 supports, but excluding specified supports.

Table 6 offers a fascinating insight into the “logic” of the model. The researchers tried removing specific types of supports from the set provided to the model:

  • Removing Target Object: Small performance drop (0.13). The model doesn’t strictly need to see the exact object to understand the task.
  • Removing Verb (e.g., Pull): Massive drop (0.59). If the model doesn’t see an example of “pulling” in the current context, it fails.
  • Removing Adverb (e.g., While Spinning): Massive drop (0.50).

This confirms that DemoGen works by effectively decomposing the problem. It provides the model with “unit tests” for the verb and the adverb separately within the current environment, allowing the model to stitch them together for the final query.
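Conceptually, the ablation just filters the support set before it reaches the ICL model. A hypothetical sketch (not the paper's code):

```python
def exclude_supports(supports, banned_word):
    """Drop every support whose instruction mentions `banned_word`
    (e.g. the query's verb or adverb) before running the ICL Transformer."""
    return [(state, instruction, actions)
            for state, instruction, actions in supports
            if banned_word not in instruction.lower().split()]

# exclude_supports(supports, "pull")     -> no demonstration of the query verb
# exclude_supports(supports, "spinning") -> no demonstration of the adverb
```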

Conclusion

The “DemoGen” paper highlights a significant shift in how we think about few-shot learning. For a long time, the paradigm has been Retrieve \(\rightarrow\) Solve. This works well for text-only tasks where the internet contains a sentence similar to almost anything you might say.

However, in grounded environments—robotics, virtual agents, and multimodal systems—the state space is too vast to rely on retrieval. You will never find a training example that perfectly matches your current messy living room and the specific complex command you just gave your robot.

Spilsbury et al. demonstrate that the future of grounded learning likely lies in Generate \(\rightarrow\) Solve. By giving agents the ability to imagine relevant sub-tasks and solve them mentally (via the Bootstrap Transformer) before attempting the main task, we unlock a level of compositional generalization that static datasets simply cannot provide.

Key Takeaways

  1. Retrieval fails in grounded settings: Finding “similar” examples is ineffective when the environment state changes.
  2. Synthesis > Retrieval: Generating custom support examples allows the model to see relevant concepts (verbs, adverbs) executed in the current context.
  3. Compositional Generalization is unlocked: By seeing the components of a complex command demonstrated separately, the model can productively generalize to novel combinations.

This approach paves the way for more autonomous agents that don’t just memorize instructions, but actively reason about how to apply their skills to new, unseen challenges.