Introduction
Imagine you are teaching a robot to clean a table. You spend hours showing it how to pick up a pen and place it in a cup. You train it until it executes the motion perfectly. Then, you hand the robot a pencil. To you, the task is identical: the pencil is long, thin, and rigid, just like the pen. You intuitively understand that the pencil is “functionally analogous” to the pen.
To the robot, however, this is a nightmare. The pencil is yellow, not blue. It has a different texture. The pixel values are different. In the world of imitation learning, this is an Out-of-Distribution (OOD) scenario. Despite the task being conceptually identical, the robot’s policy often fails catastrophically because the visual input doesn’t match its training data.
This is one of the most persistent hurdles in robot learning. While we have seen massive success in large-scale computer vision and language models, robotics datasets remain relatively small and expensive to collect. We cannot realistically demonstrate every possible object a robot might encounter in the wild.
In this post, we will dive deep into a paper titled “Adapting by Analogy: OOD Generalization of Visuomotor Policies via Functional Correspondence.” The researchers propose a novel method that allows robots to handle new objects and environments without retraining. Instead of collecting new data, the robot learns to “imagine” that the new OOD object is actually a familiar In-Distribution (ID) object, guided by human hints.

As shown in Figure 1, the core idea is simple yet powerful: if a robot knows how to handle a pen, and we tell it that a pencil corresponds to a pen, it should be able to transfer that behavior directly to the new object.
The Challenge: Why Generalization is Hard
Standard end-to-end visuomotor policies (like Diffusion Policies) map pixels directly to actions. They are excellent at mimicking behaviors they have seen before. However, their reliability drops when the environment changes.
Typically, there are two ways to handle OOD scenarios:
- Scale up data: Collect millions of demonstrations covering every possible object. (Too expensive).
- Interactive Imitation Learning: When the robot fails, a human steps in, provides a new demonstration, and the robot is re-trained. (Time-consuming and computationally heavy).
The researchers behind this paper argue that we don’t always need new demonstrations. Often, the robot already knows the physical motion required to succeed. It just doesn’t realize that the current situation calls for that specific known motion. The failure isn’t in the action space; it’s in the recognition of the task’s functional nature.
The Solution: Adapting by Analogy (ABA)
The proposed method, Adapting by Analogy (ABA), is a test-time intervention strategy. It doesn’t change the weights of the neural network. Instead, it changes the input the network sees.
When the robot encounters something strange (OOD), it asks a human expert for a high-level “analogy” (e.g., “The pencil functions like the pen”). The system then searches its memory (the training dataset) for images that are functionally similar to what it is currently seeing. It then feeds the embeddings of those familiar training images into the policy instead of the confusing OOD image.
Effectively, the robot tricks itself into thinking it is seeing a pen, allowing it to execute the perfect pickup motion for the pencil.
The ABA Pipeline
The method operates in a four-stage loop during deployment:
- Detect: Check if the current observation is OOD.
- Retrieve: If it is OOD, use human feedback to find functionally corresponding images from the training set.
- Refine: Check if the retrieved images suggest a consistent behavior. If the robot is confused (high uncertainty), ask the human for clarification.
- Intervene: Replace the OOD observation with the retrieved ID observations and execute the action.
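The loop above can be sketched in miniature. The sketch below is illustrative, not the paper’s implementation: retrieval is stubbed as nearest-neighbor lookup in embedding space, and the human-feedback and refinement steps are omitted:

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between a vector and each row of a matrix
    return (b @ a) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a) + 1e-9)

def aba_step(z_hat, train_embeddings, sim_threshold=0.9):
    """One deployment decision of the ABA loop, reduced to its control flow.

    Returns ("act", z) for in-distribution inputs and ("intervene", z_id)
    for OOD inputs, where z_id is the mean of the top-3 retrieved training
    embeddings (a nearest-neighbor stand-in for the paper's expert-guided
    functional retrieval)."""
    sims = cosine_sim(z_hat, train_embeddings)
    if sims.max() >= sim_threshold:
        return "act", z_hat                      # in-distribution: act normally
    top = np.argsort(sims)[-3:]                  # stubbed retrieval
    z_id = train_embeddings[top].mean(axis=0)    # averaged ID embedding
    return "intervene", z_id
```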

Let’s break down the technical details of these steps.
1. Detecting Anomalies
Before adapting, the robot needs to know it’s in trouble. The system uses a fast OOD detector based on cosine similarity.
The robot encodes the current observation \(\hat{o}\) into a latent embedding \(\hat{z}\) using its policy encoder. It compares this embedding against the embeddings of all observations in the training set (In-Distribution, or ID). If the similarity score drops below a certain threshold \(\lambda\), the system flags the observation as OOD and pauses to trigger the adaptation process.
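A minimal version of this detector is straightforward, assuming we already have the encoder embeddings of the training set stacked into a matrix (the threshold value here is illustrative):

```python
import numpy as np

def is_ood(z_hat, train_embeddings, lam=0.85):
    """Flag the current embedding as OOD when its best cosine similarity
    to any training embedding falls below the threshold lambda."""
    z = z_hat / (np.linalg.norm(z_hat) + 1e-9)
    Z = train_embeddings / (np.linalg.norm(train_embeddings, axis=1, keepdims=True) + 1e-9)
    return float((Z @ z).max()) < lam
```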
2. Establishing Functional Correspondences
Once an OOD scenario is detected, the system needs to bridge the gap between the “alien” new world and its familiar training data. This is where the human expert comes in.
The expert provides a natural language description, \(l\), such as “Match the pencil tip to the pen tip.” The system then computes a Functional Correspondence Map. This map identifies specific segments in the current image and pairs them with segments in the training images based on the expert’s description.
This isn’t just global image matching; it is segment-level semantic matching. The paper’s implementation details mention tools like Grounded Segment Anything for masking out the specific objects being compared.
Mathematically, the functional correspondence map \(\Phi\) is defined as a set of paired image segments \((\omega, \hat{\omega})\):

\[ \Phi = \{ (\omega_j, \hat{\omega}_j) \}_{j=1}^{K} \]

Here, \(\omega_j\) represents a segment from a training image, and \(\hat{\omega}_j\) represents a segment from the current OOD image. \(K\) is the number of corresponding segments.
To determine which training image is the best match, the system calculates a score based on the Intersection over Union (IoU) of these aligned segments, e.g. as the mean IoU over the \(K\) pairs:

\[ f = \frac{1}{K} \sum_{j=1}^{K} \mathrm{IoU}(\omega_j, \hat{\omega}_j) \]

This score \(f\) quantifies how well the geometry of the OOD object aligns with the ID object after the correspondence mapping is applied.
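Assuming segments are represented as boolean masks, the scoring could look like the sketch below (the exact aggregation used in the paper may differ):

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over Union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def correspondence_score(pairs):
    """Mean IoU over corresponding segment pairs (omega_j, omega_hat_j),
    one plausible form of the alignment score f."""
    return float(np.mean([iou(w, w_hat) for w, w_hat in pairs]))
```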
3. Filtering by Proprioception
Visual matching isn’t enough. A robot arm’s action depends heavily on its current position (proprioception). Even if a training image looks like the current scene, it’s useless if the robot arm in that training image is on the other side of the table.
Therefore, before doing the visual matching described above, the system filters the dataset. It only looks at training frames where the robot’s proprioceptive state \(q\) (joint angles, gripper position) is close to the current state \(\hat{q}\):

\[ \mathcal{O}_q = \{\, o_i : \| q_i - \hat{q} \| \le \epsilon \,\} \]
This equation creates a subset of relevant observations \(\mathcal{O}_q\). The visual functional correspondence (from step 2) is only computed against this subset.
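A sketch of this filter, assuming proprioceptive states are vectors and using a Euclidean distance threshold (the value of `eps` is illustrative):

```python
import numpy as np

def filter_by_proprioception(q_hat, states, eps=0.1):
    """Indices of training frames whose proprioceptive state q_i lies
    within eps of the current state q_hat (Euclidean distance)."""
    d = np.linalg.norm(states - q_hat, axis=1)
    return np.flatnonzero(d <= eps)
```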
4. Refining Until Confident
This is a crucial step that separates ABA from simple retrieval methods. Just because you found a visual match doesn’t mean you found the right behavior.
Imagine the robot is holding a piece of trash. In the training set, it might have two behaviors for holding objects: “put in recycling” (for paper) and “put in compost” (for food). If the functional correspondence is vague, the system might retrieve a mix of “recycling” and “compost” examples.
The ABA method checks the entropy of the predicted actions from the retrieved images.
- High Entropy: The retrieved actions are all over the place. The robot is confused. It asks the expert to refine the correspondence (e.g., “No, match this specific color to the recycling bin”).
- Low Entropy: The retrieved actions are consistent (a “mode”). The robot proceeds.
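One way to implement this check, assuming the retrieved examples have been labeled with discrete behavior modes (e.g. by clustering their predicted actions; the entropy threshold is illustrative):

```python
import numpy as np

def mode_entropy(mode_labels):
    """Shannon entropy (in nats) of the distribution over behavior modes
    among the retrieved examples. High entropy means the retrieved
    behaviors conflict, so the expert should refine the correspondence."""
    _, counts = np.unique(mode_labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def needs_refinement(mode_labels, threshold=0.5):
    return mode_entropy(mode_labels) > threshold
```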
5. Intervention
Finally, the system performs the intervention. It takes the top \(M\) functionally aligned training observations \((o_1, \ldots, o_M)\), computes their embeddings, averages them, and feeds this “dreamed” embedding into the policy network.
The policy \(\pi\) outputs an action based on this ID embedding, effectively executing a known, safe behavior in a novel environment.
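The intervention itself is a small operation. A sketch, where `policy_head` is a hypothetical stand-in for the trained policy acting on an embedding:

```python
import numpy as np

def intervene(policy_head, retrieved_embeddings):
    """Replace the OOD embedding with the mean of the top-M retrieved
    in-distribution embeddings, then act from that averaged embedding.
    `policy_head` is any callable mapping an embedding to an action."""
    z_id = np.mean(retrieved_embeddings, axis=0)
    return policy_head(z_id)
```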
Experimental Setup
The researchers validated ABA on real hardware using a Franka Research 3 robot. They compared it against three baselines:
- Vanilla: The base policy (Diffusion Policy) with no intervention.
- PolicyEmbed: An intervention method that retrieves nearest neighbors based on the policy’s own learned embedding space (without functional correspondence).
- DINOEmbed: An intervention method that retrieves neighbors using DINOv2 features (a powerful vision foundation model).

The Tasks
They designed two distinct tasks to test generalization:
- Sweep Trash: The robot must sweep items into different zones based on their type (Organic vs. Recycling).
- Object in Cup: A precision task where the robot must pick up an object and place it into a mug. This is challenging because different objects require different grasp strategies (e.g., markers are dropped from the bottom, pens from the top).

The OOD Conditions
To test robustness, the researchers threw a variety of curveballs at the robot:
- New Backgrounds: Changing the table surface to a black cloth.
- New Objects: Replacing training objects (M&Ms, Paper, Markers) with completely new ones (Doritos, Napkins, Pencils, Batteries, Jenga blocks).

Results and Analysis
The results provided compelling evidence that functional correspondence is superior to standard visual similarity for robotic manipulation.
Task Success Rates
Figure 3 (below) summarizes the main results.
- In-Distribution (ID): ABA improves performance even here, likely by filtering out noisy behaviors.
- OOD Backgrounds: The Vanilla policy crashes in the “Object in Cup” task when the background changes (dropping to near zero success). ABA maintains high performance because it retrieves the original training images (with the original backgrounds) to drive the policy.
- OOD Objects: This is the most dramatic result. When faced with completely new objects (like the pencil), Vanilla, PolicyEmbed, and DINOEmbed all struggle significantly. ABA, however, achieves success rates comparable to the training scenarios.

Specifically, in the “OOD Objects” category for Object-in-Cup, ABA achieved over 90% success, while the Vanilla policy and the other embedding baselines failed almost completely. This shows that visual similarity (DINO or policy embeddings) isn’t enough; the robot needs to understand functional equivalence to generalize to new geometries.
How Often Does It Need Help?
A key concern with human-in-the-loop systems is the burden on the human. If the robot asks for help every second, it’s not autonomous.
The researchers tracked the number of feedback requests. Figure 4 shows that ABA is quite efficient. For a task taking 70-120 timesteps, the robot typically asked for feedback only 2 to 5 times per rollout.

The number of requests increases slightly for harder OOD objects (like the battery), which makes sense: the functional link to a training object is harder to establish for a battery than for a pencil that clearly mimics a pen.
Why Does It Work?
The researchers analyzed what the system was retrieving. Figure 5 plots the precision of retrieval against success. It shows a strong correlation: when the system retrieves observations that functionally align with the ground truth (as ABA does), the task succeeds.

This confirms that the baselines (PolicyEmbed and DINOEmbed) were failing because they were retrieving the wrong training examples. For instance, DINO might match a pencil to a marker based on visual features, but if the grasp strategy for a marker is different than for a pen (which the pencil actually mimics), the robot will fail. ABA’s expert-guided functional correspondence ensures the correct “behavioral” match is found.
Conclusion and Implications
The “Adapting by Analogy” paper presents a refreshing take on generalization. Rather than trying to force a neural network to learn a universal representation of all physics and objects (which requires massive data), it acknowledges that the necessary behaviors often already exist in the training set. The challenge is simply accessing them.
By using expert feedback to define functional analogies, we can unlock “deployment-time generalization.” This approach allows a robot trained on pens to handle pencils, chopsticks, or paintbrushes without a single gradient update or retraining session.
Key Takeaways:
- Don’t Retrain, Intervene: We can steer policies by replacing OOD inputs with “dreamed” ID inputs.
- Function over Form: Visual similarity (pixel match) is often less important than functional similarity (affordance match) for robotics.
- Human-in-the-Loop: A small amount of high-level human guidance can fix robust failures that would otherwise require massive data collection.
This method opens the door for more adaptable robots in unstructured environments, suggesting that the path to general-purpose robots might not just be “more data,” but “better analogies.”