If you have been following the explosion of Large Language Models (LLMs), you are likely familiar with the “scaling laws” hypothesis: more data generally leads to better performance. However, as models grow, a nuanced corollary has emerged: data quality matters at least as much as data quantity. In the world of robotics, this lesson is proving to be even more critical, and significantly harder to implement.

In robot Imitation Learning (IL), we train policies to copy human demonstrations. But not all demonstrations are created equal. Some are messy, some rely on strategies that don’t generalize, and some contain “spurious correlations” (like a robot learning to grab an object only when the table is white).

Traditionally, roboticists have relied on intuition or heuristics to clean their data. They might ask, “Is this trajectory smooth?” or “Did the human operator make a mistake here?” But a new paper titled “CUPID: Curating Data your Robot Loves with Influence Functions” argues that our human intuition about data quality is often wrong. Instead, the authors propose a mathematical framework to causally link specific training examples to the robot’s actual success or failure in the real world.

Figure 1: We present CUPID, a robot data curation method that leverages influence functions to predictively answer counterfactual questions about the effect of each demonstration on downstream policy performance.

The Problem: The Disconnect Between Training and Reality

To understand why data curation in robotics is so difficult, we first need to look at how robots are typically trained. The standard approach is Behavior Cloning (BC).

In BC, we collect a dataset of expert trajectories (sequences of states and actions). We then train a neural network (the policy) to minimize a loss function—essentially, we punish the network if its predicted action differs from the expert’s action.
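To make this concrete, here is a minimal sketch of a behavior-cloning training step in PyTorch. The network, dimensions, and MSE loss here are illustrative assumptions (the paper works with more sophisticated policy classes, such as diffusion policies), but the supervised structure is the same.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
STATE_DIM, ACTION_DIM = 32, 7

# A toy policy network: observed state -> predicted action.
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_training_step(states: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One behavior-cloning step: penalize deviation from the expert's action."""
    predicted_actions = policy(states)
    loss = nn.functional.mse_loss(predicted_actions, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with random stand-in data (a real pipeline would load demonstrations).
states = torch.randn(64, STATE_DIM)
expert_actions = torch.randn(64, ACTION_DIM)
print(bc_training_step(states, expert_actions))
```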

Here lies the trap. In standard computer vision or NLP, if your validation loss goes down, your model is usually getting better. In robotics, low training loss does not guarantee task success. A robot can perfectly memorize a human’s motions but fail catastrophically because of:

  1. Compounding Errors: A small deviation early in a task puts the robot in a state it hasn’t seen, leading to further errors.
  2. Brittle Strategies: The robot might learn a strategy that works in the training setup but fails if the lighting changes or the object is moved slightly.
  3. Spurious Correlations: The robot might attend to background features rather than the object itself.

The authors of CUPID posit that we shouldn’t curate data based on what looks “good” to a human. We should curate data based on what maximizes the policy’s Expected Return (i.e., the probability of solving the task).

The Solution: Causal Attribution via Influence Functions

The core contribution of this paper is a method called CUPID (CUrating Performance-Influencing Demonstrations). The goal is to answer a counterfactual question: “How would my robot’s success rate change if I removed this specific demonstration from the training set?”

If removing a demonstration makes the robot better (or doesn’t change performance), that data was likely harmful or redundant. If removing it makes the robot worse, that data was crucial.

What are Influence Functions?

To answer this without retraining the model thousands of times (which would be prohibitively expensive), the authors utilize Influence Functions.

Originating from robust statistics, influence functions allow us to approximate how the model’s parameters would change if a training point were upweighted or removed. In standard deep learning, we look at how a training point influences the test loss.
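For intuition, the classic deep-learning formulation (Koh and Liang, 2017) approximates the effect of upweighting a training point \(z\) on the loss at a test point \(z_{\text{test}}\) as

\[
\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta \mathcal{L}(z_{\text{test}}, \hat{\theta})^\top \, H_{\hat{\theta}}^{-1} \, \nabla_\theta \mathcal{L}(z, \hat{\theta}),
\]

where \(\hat{\theta}\) are the trained parameters and \(H_{\hat{\theta}}\) is the Hessian of the training loss. In practice the inverse-Hessian-vector product is approximated rather than computed exactly, and the specific estimator CUPID builds on may differ from this textbook form.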

However, CUPID needs to look at how a training point influences the Expected Return (\(J(\pi_\theta)\)), not just the test loss.
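Concretely, the expected return is the standard closed-loop objective

\[
J(\pi_\theta) = \mathbb{E}_{\tau \sim p_{\pi_\theta}}\big[ R(\tau) \big],
\]

where \(\tau\) is a rollout generated by executing the policy in the environment and \(R(\tau)\) is its return (e.g., +1 for task success, -1 for failure, matching the convention used later in this post).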

Figure 2: Data curation with CUPID. Upon training a policy on a set of demonstrations using behavior cloning, we evaluate it online to collect closed-loop rollout trajectories and estimate the policy’s expected return. CUPID ranks demonstrations based on their measured influence on this performance estimate and selects the top-k.

As shown in the workflow above, the process works like this (a code sketch follows the list):

  1. Train: Train a baseline policy using Behavior Cloning on all available data.
  2. Evaluate: Run the robot in the environment (or simulator) to collect “rollouts.” Some will succeed, some will fail.
  3. Attribute: Use CUPID to calculate how much each original training demonstration contributed to those successes or failures.
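Below is a minimal sketch of that loop. Everything here is a hypothetical stand-in: the helpers (`train_bc`, `collect_rollout`, `action_influence`) are not the paper’s actual API, and a real influence estimator would not score every (demonstration, step) pair by brute force.

```python
from typing import Callable, List, Sequence

import numpy as np

def curate_with_influence(
    demos: Sequence,                  # training demonstrations
    train_bc: Callable,               # demos -> trained policy
    collect_rollout: Callable,        # policy -> rollout with .return_ and .steps
    action_influence: Callable,       # (policy, demo, state, action) -> float
    k: int,
    num_rollouts: int = 50,
) -> List:
    """Sketch of a CUPID-style curation loop: train, evaluate, attribute, select.
    All callables are user-supplied stand-ins, not the paper's actual API."""
    # 1. Train a baseline policy on all demonstrations (behavior cloning).
    policy = train_bc(demos)

    # 2. Evaluate: collect closed-loop rollouts and their returns.
    rollouts = [collect_rollout(policy) for _ in range(num_rollouts)]

    # 3. Attribute: score each demo by how strongly it pushed the policy
    #    toward the actions taken in successful vs. failed rollouts.
    scores = np.zeros(len(demos))
    for rollout in rollouts:
        ret = rollout.return_         # e.g., +1 for success, -1 for failure
        for state, action in rollout.steps:
            for i, demo in enumerate(demos):
                scores[i] += ret * action_influence(policy, demo, state, action)
    scores /= num_rollouts

    # 4. Keep the top-k demonstrations with the highest estimated influence.
    top_k = np.argsort(scores)[::-1][:k]
    return [demos[i] for i in top_k]
```

The paper’s contribution is precisely making step 3 cheap via influence functions, rather than via retraining or brute-force counterfactual experiments.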

The Mathematical Engine

The challenge is that the environment’s dynamics are unknown and non-differentiable—you can’t simply take the gradient of the “Success” boolean with respect to the neural network weights.

To solve this, the authors combine influence functions with the “log-derivative trick” (commonly used in the REINFORCE algorithm in Reinforcement Learning).
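For reference, the log-derivative trick yields the familiar policy-gradient identity

\[
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim p_{\pi_\theta}}\Big[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big],
\]

which turns the gradient of a non-differentiable expectation into an expectation of differentiable log-probability gradients, weighted by the return. CUPID applies the same trick, but differentiates the expected return with respect to the weight placed on each training demonstration rather than only the policy parameters.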

They define Performance Influence (\(\Psi_{\pi\text{-inf}}\)) as the derivative of the expected return with respect to the weight of a training demonstration. The authors derive a powerful decomposition for this:

Equation showing the decomposition of performance influence into expected return and action influences.

Let’s break down this equation (Eq. 3 in the paper):

  • \(\Psi_{\pi\text{-inf}}(\xi)\): The “score” of a specific training demonstration \(\xi\).
  • \(R(\tau)\): The return of a rollout (e.g., +1 for success, -1 for failure).
  • \(\Psi_{a\text{-inf}}\): The Action Influence. This measures how much the training demonstration \(\xi\) pushed the policy to take the specific action \(a'\) in the state \(s'\) encountered during the rollout.

In plain English: CUPID looks at a rollout. If the rollout was a success (\(R=1\)), it looks at every action the robot took. If a training demonstration heavily influenced the robot to take those “winning” actions, that demonstration gets a positive score. Conversely, if the rollout was a failure (\(R=-1\)), and a training demonstration encouraged the actions that led to that failure, that demonstration gets a negative score.
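In symbols, the mechanism described above has roughly this shape (a reconstruction from the definitions in this section, so the paper’s exact Eq. 3 may differ in notation or normalization):

\[
\Psi_{\pi\text{-inf}}(\xi) \approx \mathbb{E}_{\tau \sim p_{\pi_\theta}}\Big[ R(\tau) \sum_{(s', a') \in \tau} \Psi_{a\text{-inf}}\big(\xi, (s', a')\big) \Big].
\]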

This is a profound shift from heuristic filtering. It doesn’t matter if a human thinks a demonstration looks “messy.” If that messy demonstration encourages actions that lead to success, CUPID identifies it as valuable.

Experimental Results: When Heuristics Fail

The researchers validated CUPID across simulated benchmarks (RoboMimic) and real-world tasks using a Franka Emika robot. The results highlight exactly why automated, performance-based curation is necessary.

1. The “Quality” Trap (Mixed-Quality Data)

In the RoboMimic simulation, the researchers used datasets mixed with “low quality” (suboptimal) human demonstrations. They compared CUPID against “DemInf” (a method that filters based on mutual information/predictability) and a “Quality Oracle” (using ground truth labels).

Figure 3: RoboMimic mixed-quality curation results. Top: Data Quality. Bottom: Policy Performance. Diffusion policies trained on data curated by CUPID achieve higher success rates than baselines.

The results in Figure 3 reveal a striking paradox. Look at the top row: baselines like DemInf are very good at finding “high quality” data (the curves go up). However, look at the bottom row (Policy Success Rate). Higher “quality” data does not always mean higher success.

CUPID (orange lines) consistently selects data that results in high success rates, often outperforming methods that strictly optimize for human-perceived quality. This confirms that current state-of-the-art models (like Diffusion Policies) might actually benefit from some “suboptimal” data that provides better coverage or recovery behaviors.

2. Identifying Robust Strategies (The “TuckBox” Task)

The real-world experiments provided the most compelling narrative for CUPID. Consider the “TuckBox” task: the robot must slide a box under a shelf.

  • Strategy A (Sliding): The operator slides the box. This is smooth, easy, and looks “high quality.”
  • Strategy B (Pick-and-Place): The operator picks up the box and places it. This is jerkier and looks “lower quality.”

The catch? In the test environment, the friction or mass of the box might change, making Sliding unreliable (brittle), while Pick-and-Place remains robust.

Heuristic methods (and even human annotators) tend to favor the smooth Sliding demos. But when CUPID analyzes the rollouts (where sliding often fails), it correctly identifies that the Pick-and-Place demonstrations are the drivers of success.

Figure 5: Curated dataset distributions for the Franka diffusion policy (filtering setting). (b) TuckBox: distribution of curated demonstrations after filtering out 66% of the data; the pick-and-place demos are the better strategy.

As shown in the chart above (center), CUPID (and the Oracle) correctly identifies the “Pick-and-Place” strategy (blue) as the robust one, retaining it despite its apparently lower visual quality. Baselines like DemInf (which looks for predictability) almost exclusively keep the brittle “Sliding” strategy (red), leading to a 0% success rate in the real world (as seen in Figure 4 of the paper).

3. Fighting Spurious Correlations (The “Bookshelf” Task)

In another experiment, the robot had to pull a book from a shelf.

  • Scenario A: Target book is alone. Background is white. (Horizontal pull works).
  • Scenario B: Target book has a weight on top. Background is dark. (Horizontal pull fails; Vertical pull required).

The dataset was imbalanced: most “Horizontal pull” demos happened on a white background. The robot learned a spurious correlation: White background \(\rightarrow\) Pull Horizontally.

When deployed in a test setting with a white background but a weighted book, the standard policy failed. CUPID analyzed the failures and realized that the “Horizontal Pull + White Background” demos were negatively influencing the policy in those edge cases. By filtering them out, it forced the robot to learn the true causal mechanism (the weight on the book), not the background color.

Broader Implications

One of the most exciting results in the paper is that data curated by CUPID isn’t just useful for the specific model used to curate it. The authors showed that datasets curated for a standard Diffusion Policy could be used to fine-tune a massive, generalist Vision-Language-Action (VLA) model (\(\pi_0\)).

Figure 7: Data curated for single-task diffusion policies improves \(\pi_0\) post-training performance.

This suggests a scalable workflow for the future of robotic learning:

  1. Train a smaller, cheaper “scout” policy.
  2. Use CUPID to curate the dataset using this scout.
  3. Use the cleaned, high-performance dataset to train a massive foundation model.

Conclusion

The “CUPID” paper challenges the robotics community to stop looking at data filtering as a preprocessing step based on static heuristics. By treating data curation as an optimization problem targeting closed-loop performance, we can identify which demonstrations actually teach the robot to succeed.

The key takeaways for students and practitioners are:

  1. Trust Outcomes, Not Aesthetics: A “clean” demonstration that teaches a brittle strategy is worse than a “messy” demonstration that teaches robustness.
  2. Causality is Key: We need methods that link training data to test-time return, not just test-time loss.
  3. Less Can Be More: Training on a subset of influential data often yields better policies than training on the full dataset, especially when the full dataset contains conflicting or spurious behaviors.

As we move toward general-purpose robots, tools like CUPID will likely become standard components of the “DataOps” pipeline, ensuring that our robots are fed not just more data, but the right data.