Introduction
Imagine you are trying to teach a robot how to make a cup of coffee. You show it a few examples—maybe five or ten demonstrations of you grinding beans and pouring water. For a modern machine learning model, this small handful of examples is nowhere near enough to learn a robust policy. The robot might learn to move its arm, but it won’t understand how to handle slight variations in the mug’s position or changes in lighting.
To solve this, roboticists often turn to Imitation Learning (IL) augmented by large-scale datasets. The logic is simple: if we have a massive library of robot data (like the DROID dataset, which contains tens of thousands of manipulation trajectories collected across many labs), surely we can find old experiences that are “similar enough” to making coffee and use them to help the robot learn faster.
But here lies the trap: What does “similar” actually mean?
Does a task look similar because the robot is holding a mug? Or does it look similar because the robot is making a circular stirring motion? If you search your database based on visual similarity, you might retrieve a video of a robot washing a mug—visually similar, but the wrong action. If you search based on motion, you might retrieve a video of a robot stirring a pot of soup—correct motion, but the wrong object physics.
This is the core problem addressed by the research paper “COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning”. The researchers argue that relying on a single definition of similarity (like just vision or just motion) is brittle and leads to “negative transfer”—where bad data actually makes the robot perform worse.
In this post, we will deep dive into COLLAGE, a new framework that doesn’t just guess which data is relevant. Instead, it uses multiple “experts” (vision, motion, shape, language) to retrieve data and then mathematically calculates which expert is right for the specific task at hand.

As shown in Figure 1, different tasks require different retrieval strategies. COLLAGE solves this by adaptively fusing these strategies together. Let’s explore how it works.
Background: The Few-Shot Imitation Challenge
Before dissecting the solution, we need to understand the setup. The goal is Few-Shot Imitation Learning. We have:
- Target Dataset (\(D_{target}\)): A tiny set of expert demonstrations for the new task we want to solve (e.g., 5 demos).
- Prior Dataset (\(D_{prior}\)): A massive, uncurated offline dataset containing thousands of unrelated or loosely related behaviors.
The standard approach to this problem is Retrieval-Augmented Policy Learning. You take your target demos, embed them into a feature space (using a neural network), and search the prior dataset for the “nearest neighbors.” You then combine your target data with this retrieved data to train a policy.
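To make this concrete, here is a minimal sketch of single-encoder retrieval with nearest neighbors, assuming a generic `encoder` that maps a demo (or prior segment) to a feature vector. The function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def retrieve_nearest(target_demos, prior_dataset, encoder, k=100):
    """Vanilla retrieval-augmented IL: embed target demos with a single
    encoder and pull the k nearest prior segments in that feature space."""
    # Embed every prior segment once (N x d matrix).
    prior_feats = np.stack([encoder(seg) for seg in prior_dataset])

    retrieved = []
    for demo in target_demos:
        q = encoder(demo)                              # query feature (d,)
        # Cosine similarity against all prior segments.
        sims = prior_feats @ q / (
            np.linalg.norm(prior_feats, axis=1) * np.linalg.norm(q) + 1e-8)
        top_k = np.argsort(-sims)[:k]                  # indices of best matches
        retrieved.extend(prior_dataset[i] for i in top_k)
    return retrieved

# The policy is then trained on target_demos + retrieved (often with a fixed
# mixing ratio) -- this is the "static heuristic" setup COLLAGE improves on.
```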
The Failure of Single-Modality Heuristics
Previous methods have tried using:
- Visual Retrieval: Using models like ResNet or DINO to find images that look the same.
- Motion Retrieval: Using optical flow to find trajectories that move the same way.
- Language Retrieval: Using text embeddings to find tasks with similar instructions.
The problem is that these are static heuristics. If you decide to use Visual Retrieval, you are betting that appearance is the most important factor for every task you will ever encounter. As we saw in the introduction, that bet often loses. If you are trying to “open a book,” visual similarity works great. If you are trying to “stir a bowl,” visual similarity might give you “place apple in bowl”—which destroys your policy’s performance.
COLLAGE: The Core Method
COLLAGE stands for COLLective data AGgrEgation. The philosophy behind COLLAGE is that we shouldn’t force the robot to choose one type of similarity. Instead, we should let the data decide which similarity metric is useful for the current task.
The method follows a three-step pipeline:
- Multi-Modal Retrieval: Retrieve candidate data using every available metric.
- Relevance Estimation: Calculate a “trust score” for each metric based on how well it predicts the target demos.
- Adaptive Sampling: Train the final policy by sampling heavily from the trusted metrics and ignoring the bad ones.
Let’s break down the architecture.

Step 1: Retrieval Across Multiple Modalities
First, COLLAGE acts as a dragnet. It assumes we have access to \(k\) different feature encoders, where each encoder represents a different “view” of the world. In this paper, the authors use four distinct modalities:
- Visual (DINOv2): Captures high-level semantic appearance (e.g., “there is a mug here”).
- Motion (Optical Flow): Captures the dynamics of the scene (e.g., “pixels are moving in a circular pattern”).
- Shape (PointNet++): Captures the 3D geometry of the scene via point clouds, ignoring texture and color.
- Language (OpenAI Embeddings): Captures the semantic intent of the instruction (e.g., “open the door”).
For every demonstration in the small target dataset, the system queries the massive prior dataset using all four of these methods independently.
To ensure the retrieval is precise, they use Subsequence Dynamic Time Warping (S-DTW). Because a relevant behavior (like grasping a handle) might be buried inside a long, unrelated trajectory in the prior dataset, S-DTW allows the system to match the target demo against small sub-segments of the prior data.
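To ground the idea, here is a minimal NumPy sketch of subsequence DTW: the query (a target demo, represented as a sequence of feature vectors) must be fully matched, but it may align to any contiguous window of the longer prior trajectory. This is the textbook S-DTW recurrence, not the authors' exact implementation:

```python
import numpy as np

def subsequence_dtw_cost(query, trajectory):
    """Minimum DTW cost of matching `query` (m x d) against any contiguous
    sub-segment of `trajectory` (n x d). Lower cost = better match."""
    m, n = len(query), len(trajectory)
    # Pairwise Euclidean distances between query frames and trajectory frames.
    dist = np.linalg.norm(query[:, None, :] - trajectory[None, :, :], axis=-1)

    # Accumulated cost matrix with a "free start": the first query frame may
    # align with any trajectory frame at its base cost.
    D = np.full((m, n), np.inf)
    D[0, :] = dist[0, :]
    for i in range(1, m):
        for j in range(n):
            best_prev = D[i - 1, j]                        # repeat trajectory frame
            if j > 0:
                best_prev = min(best_prev,
                                D[i, j - 1],               # skip trajectory frame
                                D[i - 1, j - 1])           # advance both
            D[i, j] = dist[i, j] + best_prev

    # "Free end": the match may stop at any trajectory frame.
    return D[-1, :].min()
```

In COLLAGE, this kind of matching would run independently in each modality's feature space, and the lowest-cost sub-segments become that modality's retrieved candidates.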
The output of Step 1 is four separate buckets of retrieved data: \(D_{visual}\), \(D_{motion}\), \(D_{shape}\), and \(D_{language}\).
Step 2: Estimating Relevance Weights
This is the most innovative part of the paper. We have four buckets of data. Some buckets might contain gold; others might contain garbage. How do we know which is which without manually checking?
COLLAGE uses a “rollout-free” mechanism to estimate quality. It asks: If I trained a policy solely on this bucket of data, how surprised would it be by the expert target demonstrations?
For each retrieved subset (e.g., the Motion subset), the system trains a lightweight “Reference Policy” (\(\pi_{ref}\)). This is a simple Behavior Cloning (BC) model trained only on the data inside that bucket.

Once this reference policy is trained, we test it against our “ground truth”—the few target demonstrations we have. We feed the states from the target demos into this reference policy and check the probability it assigns to the actions the expert actually took.
This is calculated as the log-likelihood of the target actions under the reference policy:

\[
S_f = \sum_{(s, a) \in D_{target}} \log \pi_{ref}^{f}(a \mid s)
\]
If the Motion bucket contains high-quality, relevant data, the policy trained on it will likely predict actions that are very close to the target expert’s actions, resulting in a high log-likelihood (a high score \(S_f\)). If the Visual bucket contains unrelated tasks, the policy trained on it will predict the wrong actions, resulting in a low log-likelihood.
Finally, these scores are normalized using a softmax function to create a set of weights (\(w_f\)). These weights represent the probability that a specific modality is useful for the current task.
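Here is a minimal, self-contained sketch of this scoring step. To keep it short, the reference policy is a linear-Gaussian behavior-cloning model fit by least squares; the paper's reference policies are lightweight BC networks, so treat this purely as an illustration of the score-then-softmax logic:

```python
import numpy as np

def fit_bc_linear(states, actions):
    """Tiny stand-in reference policy: linear-Gaussian BC fit by least squares."""
    X = np.hstack([states, np.ones((len(states), 1))])        # add bias column
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)
    sigma = (actions - X @ W).std(axis=0) + 1e-3              # per-dim noise
    return W, sigma

def log_likelihood(W, sigma, states, actions):
    """Mean Gaussian log-likelihood of expert actions under the policy."""
    X = np.hstack([states, np.ones((len(states), 1))])
    mu = X @ W
    ll = -0.5 * (((actions - mu) / sigma) ** 2
                 + np.log(2 * np.pi * sigma ** 2)).sum(axis=1)
    return ll.mean()

def relevance_weights(buckets, target_states, target_actions, temp=1.0):
    """One reference policy per retrieved bucket; softmax of the scores."""
    scores = []
    for states, actions in buckets:                            # one bucket per modality
        W, sigma = fit_bc_linear(states, actions)
        scores.append(log_likelihood(W, sigma, target_states, target_actions))
    scores = np.array(scores) / temp
    w = np.exp(scores - scores.max())                          # numerically stable softmax
    return w / w.sum()
```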
Step 3: Retrieval Augmented Policy Learning
Now that we have the data and the weights, we can train the final robot policy.
The authors use a Transformer-based policy (similar to standard robotic transformers). However, instead of dumping all the retrieved data into the training buffer uniformly, they use Importance Sampling.
During training, when the system creates a batch of data, it samples trajectories according to the weights calculated in Step 2. If the “Shape” modality got a weight of 0.6 and “Language” got 0.1, the training batches will contain 6x more data from the Shape bucket than the Language bucket.
This effectively filters out the noise. The policy pays attention to the data that has been mathematically proven to align with the target task, while largely ignoring the data that would cause negative transfer.
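Below is a sketch of how such weighted batches might be assembled; the 50/50 split between target and retrieved data is an illustrative assumption, not the paper's exact mixing ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(target_data, buckets, weights, batch_size=64, target_frac=0.5):
    """Assemble one training batch: a fixed fraction of target demos, the rest
    drawn from the retrieved buckets in proportion to their relevance weights.
    `buckets` and `weights` are dicts keyed by modality name."""
    n_target = int(batch_size * target_frac)
    batch = [target_data[i] for i in rng.integers(len(target_data), size=n_target)]

    names = list(buckets)
    probs = np.array([weights[n] for n in names])
    probs = probs / probs.sum()                        # guard against rounding drift
    picks = rng.choice(len(names), size=batch_size - n_target, p=probs)
    for k in picks:
        bucket = buckets[names[k]]
        batch.append(bucket[rng.integers(len(bucket))])
    return batch

# e.g. weights = {"visual": 0.25, "motion": 0.05, "shape": 0.6, "language": 0.1}
# makes Shape samples roughly 6x more common than Language samples, as above.
```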
The final objective function is a standard imitation learning loss, but applied over this weighted distribution of data; schematically,

\[
\mathcal{L}(\theta) = -\sum_{f} w_f \; \mathbb{E}_{(s, a) \sim D_f} \big[ \log \pi_\theta(a \mid s) \big],
\]

with the target demonstrations \(D_{target}\) always mixed into training as well.
Experimental Results
The authors evaluated COLLAGE in two distinct environments: the LIBERO simulation benchmark and a Real-World setup using a Franka Emika Panda robot.
Simulation (LIBERO)
In the simulation, the system was tested on 10 diverse tasks. The results were compared against single-modality baselines (using just Visual or just Motion retrieval) and non-retrieval baselines (standard BC and Multi-Task learning).
Key Finding: COLLAGE outperformed the best single-modality baseline by 11.2%.
More interestingly, the experiments validated the hypothesis that no single feature is “best.”
- For the task “Stove-Moka”, the Visual modality was dominant.
- For “Soup-Cheese”, the Motion modality was essential.
- For “Cheese-Butter”, the Shape modality carried the most weight.
COLLAGE was able to automatically identify these preferences. We can visualize this adaptive weighting in the chart below:

Looking at the weights in Figure 4, we see that for some tasks (like the first pie chart), the weights are relatively balanced. For others, one modality dominates. This flexibility is what allows COLLAGE to be robust.
Real-World Evaluation (DROID Dataset)
The real-world experiments were particularly challenging. The authors used DROID, a massive, diverse dataset collected across many different labs and environments.
The challenge here is the Domain Gap. The DROID data looks very different from the authors’ lab setup. The lighting is different, the tables are different, and the background clutter is different.

Despite these visual differences, COLLAGE was able to retrieve useful behaviors. For example, in the “Stir the Bowl” task, the visual gap was huge, but the Motion and Language modalities successfully retrieved stirring motions from DROID.
Quantitative Success: In the real world, standard Behavior Cloning (BC) failed almost completely (6.7% success rate) because 5 demonstrations are simply not enough for complex manipulation.
- Visual Retrieval improved this to 28.9%.
- COLLAGE achieved a 45.5% success rate.
This is a massive relative improvement, demonstrating that fusing diverse data sources is critical for real-world robotics where conditions are messy.
We can see the breakdown of the learned weights for specific tasks in Table 3:

Notice the “Lego” task in the DROID section (Real World). The Visual weight is very high (0.6), while Motion is near zero (0.02). This makes sense: stacking Lego blocks is a precise, static alignment task where the visual configuration of the studs matters more than the velocity of the arm. COLLAGE successfully realized that motion data was noisy for this specific task and suppressed it.
Why Does It Work?
To understand the “why” at a deeper level, the authors visualized exactly what was being retrieved.

In Figure 5, we look at the breakdown for the “Book” task (top row).
- Visual Retrieval (Top-Left Pie): Retrieves mostly “book-front” scenes. It focuses on the object.
- Language Retrieval (Top-Right Pie): Retrieves a mix of tasks that share semantic instructions.
By combining these, COLLAGE ensures that the policy sees examples that match the object (Visual) AND examples that match the intent (Language), filling in the gaps that any single modality would miss.
Conclusion & Implications
The COLLAGE paper presents a convincing argument that in the era of large-scale robotic learning, how we curate data is just as important as the quantity of data.
The key takeaways are:
- No “Master” Feature: There is no single similarity metric that works for all robotic tasks. Sometimes geometry matters (Shape), sometimes dynamics matter (Motion), and sometimes semantics matter (Vision/Language).
- Adaptive Fusion: We don’t need to manually tune these preferences. By using lightweight reference policies, we can mathematically estimate which data is valuable.
- Data Efficiency: This method allows robots to learn from “wild,” unstructured datasets like DROID, even when the environment looks completely different from the training data.
This approach has broad implications for the future of generalist robots. Rather than trying to build one massive model that “knows everything,” it suggests a future where robots dynamically query their memory banks, pulling together a collage of relevant experiences to solve the problem in front of them.
Paper Reference: Kumar, S., Dass, S., Pavlakos, G., & Martín-Martín, R. (2025). COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning. The University of Texas at Austin.