Introduction

We are living in the golden age of imitation learning. From robots that can cook shrimp to those that can repair themselves, we’ve seen incredible breakthroughs driven by large-scale demonstration data. However, there is a massive bottleneck hiding behind these viral videos: the cost of data.

Projects like ALOHA Unleashed or DROID required months of labor, dozens of operators, and tens of thousands of collected trajectories. If we want robots to generalize to every cup, hammer, or door handle in the world, we cannot possibly teleoperate them through every variation. We need a way to multiply our data automatically.

This is where Constraint-Preserving Data Generation (CP-Gen) comes in.

In a recent paper from Stanford and UT Austin, researchers proposed a method to take a single expert demonstration and algorithmically generate thousands of new, diverse demonstrations. Unlike previous methods that simply rotate or translate the scene, CP-Gen can handle entirely new object geometries. It creates data that teaches robots how to handle tall wine glasses versus short ones, or wide boxes versus narrow ones, enabling zero-shot transfer to the real world.

In this post, we’ll dive deep into how CP-Gen works, the math behind its constraint satisfaction, and why it outperforms existing state-of-the-art data generation methods.

The Problem: Why Rotation Isn’t Enough

To understand why CP-Gen is necessary, we first need to look at the current standard: data augmentation via \(SE(3)\) equivariance.

Methods like MimicGen take a demonstration (e.g., picking up a mug) and generate new data by changing the object’s pose (position and rotation). If the mug moves 10cm to the right, the robot’s gripper action should also move 10cm to the right. This works beautifully for identical objects.
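
To make that concrete, here is a minimal numpy sketch of pose-relative replay (the 4×4 homogeneous transforms and the function name are illustrative, not MimicGen's actual API):

```python
import numpy as np

def replay_relative_pose(T_world_obj_src, T_world_grip_src, T_world_obj_new):
    """Pose-based augmentation in a nutshell: store the gripper pose relative
    to the object, then re-attach that same relative pose to the object's new pose.

    All arguments are 4x4 homogeneous transforms (rotation + translation).
    """
    # Gripper pose expressed in the source object's frame
    T_obj_grip = np.linalg.inv(T_world_obj_src) @ T_world_grip_src
    # New gripper target: identical relative pose, attached to the new object pose
    return T_world_obj_new @ T_obj_grip
```

Note that the relative pose T_obj_grip is reused verbatim; nothing in this computation knows anything about the object's shape.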

The Catch: This approach breaks down when the object’s shape changes. Imagine you have a demonstration of hanging a short, wide wine glass on a rack. If you try to apply the same relative motion to a tall, thin wine glass, the robot might crash into the glass or miss the rack entirely. A rigid transformation (\(SE(3)\)) cannot account for geometric variations (scaling, stretching, or aspect ratio changes).

CP-Gen solves this by moving away from simple motion replay and towards keypoint-trajectory constraints.

The Core Method: CP-Gen

CP-Gen is built on a powerful insight: a robot skill shouldn’t be defined by where the gripper is, but by how the gripper relates to the object it’s manipulating.

Figure 1: CP-Gen uses one expert demonstration and keypoint-trajectory constraints to generate diverse demonstrations in simulation involving novel object geometries and poses.

As shown in Figure 1, the method takes a single source demonstration and produces a fleet of generated demos that handle novel geometries (like the spiral wine glass rack). It achieves this by ensuring that specific points on the robot (keypoints) track specific paths relative to the object, regardless of how that object is stretched or scaled.

The Workflow

The CP-Gen pipeline consists of two main stages: Source Data Processing and New Data Generation.

Figure 2: CP-Gen Method. Top: Source Data Processing creates constraints. Bottom: New Data Generation adapts those constraints to new scenes.

Stage 1: Source Data Processing

  1. Decomposition: The expert trajectory is split into “Free-space motions” (moving between objects) and “Robot skills” (interacting with objects).
  2. Annotation: The system identifies Actor Keypoints (points on the gripper or the object being held) and tracks their relationship to the target object.
  3. Constraint Extraction: Instead of saving absolute coordinates, CP-Gen saves the trajectory of these keypoints relative to the target object’s coordinate frame.

Stage 2: New Data Generation

  1. Scene Sampling: The system spawns a new scene where objects have new poses and new geometries (e.g., a mug is scaled to be taller).
  2. Adaptation: The recorded keypoint constraints are transformed to match the new geometry.
  3. Optimization: The system solves for a robot joint configuration that satisfies these new constraints.
  4. Motion Planning: A collision-free path is planned to connect the segments.
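
Before getting into the math, here is a rough Python skeleton of how the two stages fit together. None of this is the authors' code: every stage (segmentation, scene sampling, the solver, the planner) is injected as a callable, and the execute-and-filter step at the end is an assumption borrowed from similar data-generation pipelines.

```python
def generate_demos(source_demo, num_demos, *, segment_skills, extract_constraint,
                   sample_scene, adapt, solve_joint_traj, plan_free_space, run_and_check):
    """CP-Gen-style two-stage pipeline, with each component passed in as a callable."""
    # Stage 1: Source Data Processing (done once on the single expert demo)
    skills = segment_skills(source_demo)                     # object-interaction segments only
    constraints = [extract_constraint(s) for s in skills]    # keypoint paths in the object frame

    demos = []
    while len(demos) < num_demos:
        # Stage 2: New Data Generation
        scene = sample_scene()                               # new object poses AND geometries
        trajectory = []
        for constraint in constraints:
            targets = adapt(constraint, scene)               # apply the geometric transform X
            skill_traj = solve_joint_traj(targets)           # per-timestep keypoint optimization
            approach = plan_free_space(trajectory, skill_traj[0])  # collision-free connection
            trajectory += approach + skill_traj
        demo = run_and_check(scene, trajectory)              # execute in sim; keep successes only
        if demo is not None:
            demos.append(demo)
    return demos
```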

Let’s break down the mathematical machinery that makes this adaptation possible.

Step 1: Extracting Keypoint Constraints

First, we need to capture the “essence” of the skill. Let’s say the robot is grasping a handle. We define keypoints \({^A}k_i\) on the “Actor” (the gripper). We want to record where these points are relative to the “Object” (the handle).

The target trajectory in the object’s frame, \({^O}k_i(t)\), is calculated as:

\[
{^O}k_i(t) \;=\; {^O}T_W(t)\,{^W}T_A(t)\,{^A}k_i
\]

Equation 1: Transforming actor keypoints into the object frame.

Here, \({^W}T_A(t)\) is the pose of the actor in the world, and \({^O}T_W(t)\) is the inverse of the object's world pose, which maps world coordinates into the object's frame. Essentially, this equation “locks” the motion to the object’s local coordinate system.
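
In code, Equation 1 is just a change of coordinate frames applied at every timestep. A minimal numpy sketch (the array shapes and function name are my own, not the paper's):

```python
import numpy as np

def keypoints_in_object_frame(T_world_actor, T_world_object, actor_keypoints):
    """Express actor keypoints in the target object's frame (Equation 1).

    T_world_actor, T_world_object: (T, 4, 4) homogeneous pose trajectories.
    actor_keypoints: (N, 3) keypoints fixed in the actor's frame (^A k_i).
    Returns: (T, N, 3) keypoint trajectory in the object's frame (^O k_i(t)).
    """
    # Homogeneous coordinates for the keypoints: (N, 4)
    k_h = np.concatenate([actor_keypoints, np.ones((len(actor_keypoints), 1))], axis=1)
    out = []
    for T_WA, T_WO in zip(T_world_actor, T_world_object):
        T_OW = np.linalg.inv(T_WO)          # ^O T_W(t)
        k_obj = (T_OW @ T_WA @ k_h.T).T     # ^O k_i(t) = ^O T_W(t) ^W T_A(t) ^A k_i
        out.append(k_obj[:, :3])
    return np.stack(out)
```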

Step 2: Adapting to New Geometries

Now, suppose we generate a new scene where the target object is stretched. Mathematically, this change in geometry is represented by a transformation matrix \(\mathbf{X}\) (e.g., a non-uniform scaling).

If we just replayed the original motion, the gripper would go to the “old” location of the handle. Instead, CP-Gen applies the transformation \(\mathbf{X}\) to the target trajectory:

\[
{^O}k_i'(t) \;=\; \mathbf{X}\,{^O}k_i(t)
\]

Equation: Applying the geometric transformation \(\mathbf{X}\) to the target keypoints.

This simple equation is critical. It updates the target path so that if the handle gets 5cm taller, the target grasp point also moves 5cm up.
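
Concretely, if \(\mathbf{X}\) is a per-axis scaling, adapting the recorded targets is a single matrix multiply. A small sketch, assuming the scale factors of the new object are known from scene sampling:

```python
import numpy as np

def adapt_targets(object_frame_targets, scale_xyz):
    """Apply a non-uniform scaling X to the recorded keypoint targets.

    object_frame_targets: (T, N, 3) keypoint trajectory from the source demo,
        expressed in the target object's frame (Equation 1).
    scale_xyz: per-axis scale of the new object relative to the source object,
        e.g. (1.0, 1.0, 1.2) for an object that is 20% taller.
    """
    X = np.diag(scale_xyz)                 # the geometric transformation X
    return object_frame_targets @ X.T      # scale every keypoint at every timestep
```

If the rack's hook sits 20% higher on the new object, every grasp and insertion waypoint moves up by the same factor.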

Step 3: Solving the Optimization Problem

We now have a new set of 3D target points in space that the robot needs to hit. But robots are controlled by joint angles (\(q\)), not by points floating in space. We need to find the specific joint configuration \(q_t^*\) that aligns the robot’s keypoints with these new target points.

CP-Gen solves this using an optimization problem at every time step:

\[
q_t^* \;=\; \arg\min_{q}\; \sum_i \bigl\| f_{FK}(q)_i \;-\; {^W}T_O\,\mathbf{X}\,{^O}k_i(t) \bigr\|^2 \;+\; \lambda\,\bigl\| q - q_{t-1}^* \bigr\|
\]

Equation 2: The optimization objective for finding robot joint configurations. Here \(f_{FK}(q)_i\) is the world position of the \(i\)-th actor keypoint under joints \(q\), and \({^W}T_O\) is the new object’s world pose.

This equation has two competing terms:

  1. Match Keypoints (Left Term): It minimizes the distance between the robot’s actual keypoints (calculated via Forward Kinematics \(f_{FK}\)) and the transformed target keypoints.
  2. Temporal Smoothness (Right Term): It penalizes the robot for moving its joints too much between time steps (\(\lambda ||q - q_{t-1}^*||\)). This ensures the generated motion is smooth and physically plausible, rather than jerky.
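
A minimal sketch of that per-timestep solve using scipy.optimize.minimize; the forward-kinematics callable fk_keypoints and the squared smoothness penalty are my simplifications, and the paper's exact solver and weighting may differ.

```python
import numpy as np
from scipy.optimize import minimize

def solve_joint_config(fk_keypoints, target_keypoints, q_prev, lam=0.1):
    """Solve the Equation 2 objective for one timestep.

    fk_keypoints: callable mapping joint angles q -> (N, 3) world positions of the
        actor keypoints (forward kinematics f_FK restricted to the keypoints).
    target_keypoints: (N, 3) adapted targets for this timestep, in the world frame.
    q_prev: the solution q*_{t-1} from the previous timestep.
    lam: weight of the temporal-smoothness penalty.
    """
    def objective(q):
        match = np.sum((fk_keypoints(q) - target_keypoints) ** 2)  # hit the keypoints
        smooth = lam * np.sum((q - q_prev) ** 2)                   # stay near q*_{t-1}
        return match + smooth

    # Warm-start from the previous solution so consecutive timesteps stay consistent
    return minimize(objective, q_prev, method="L-BFGS-B").x
```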

Step 4: Adding Noise

To make the trained policy robust, we don’t just want perfect demonstrations. We want the robot to learn how to recover from small errors. The authors inject “system noise” during data generation.

\[
a' \;=\; a + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)
\]

Equation: Injecting Gaussian noise into the action.

Interestingly, they record the clean action \(a\) for training but execute the noisy action \(a'\) in the simulation. This decouples exploration during execution from the training labels: the dataset contains states reached by perturbed actions (so the policy sees how to recover from small errors), without ever teaching the policy to imitate the noise itself, which keeps the learned behavior smooth.
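
A short sketch of this record-clean, execute-noisy loop, assuming a gym-style simulation interface (the env and action arrays here are placeholders):

```python
import numpy as np

def rollout_with_noise(env, planned_actions, noise_std=0.02, seed=None):
    """Execute noisy actions in simulation, but store the clean actions as labels."""
    rng = np.random.default_rng(seed)
    dataset = []
    obs = env.reset()
    for a in planned_actions:
        a_noisy = a + rng.normal(0.0, noise_std, size=a.shape)  # a' = a + eps
        dataset.append((obs, a))       # the policy is trained on the clean action a
        obs, *_ = env.step(a_noisy)    # ...but the simulator executes the noisy a'
    return dataset
```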

Experiments and Results

The researchers evaluated CP-Gen on both simulated benchmarks and physical hardware. The primary question was: Can a policy trained on data generated from ONE demo generalize to unseen geometries?

Simulation Benchmark

They tested on tasks from the MimicGen benchmark, plus a new “Geometry Generalization” variant where objects were scaled non-uniformly.

Figure 3: Simulation tasks including Stack Three, Square, and Coffee, showing geometry variations.

Using only one expert demo, they generated 1,000 synthetic demos per task. They compared CP-Gen against MimicGen (which relies on pose transforms) and DemoGen.

The Results

Table 1: CP-Gen outperforms MimicGen significantly on geometry generalization tasks.

The results in Table 1 are striking:

  • Pose Only: On standard tasks (just moving objects around), CP-Gen performs comparably to state-of-the-art methods (~85-88% success).
  • Geometry Generalization: This is the game-changer. When objects change shape, MimicGen’s success rate drops to ~33%, while CP-Gen maintains ~73%.

By understanding the geometry of the interaction via constraints, CP-Gen creates data that actually teaches the robot how to adapt.

Real-World Zero-Shot Transfer

Perhaps the most impressive result is the “Sim-to-Real” transfer. The team created “Digital Twins” of real-world setups (like hanging a mug or inserting a wine glass). They generated data only in simulation and trained a policy. Then, they ran that policy on a real Franka Emika Panda robot.

Figure 4: Real World Tasks including Mug Cleanup and Wine Glass Spiral Hanging.

Table 2: CP-Gen achieves strong zero-shot sim-to-real transfer rates.

As shown in Table 2, CP-Gen policies achieved an average success rate of 83% in the real world, compared to just 40% for MimicGen. The failure cases for MimicGen were telling—it often failed to insert objects because it didn’t account for how the object’s shape had changed, leading to collisions or bad grasps.

Why does geometry diversity matter?

The authors also ran an ablation study to see if training on diverse geometries actually helps.

Table 3: Ablation results showing that generating varied geometries boosts policy generalization.

Table 3(b) confirms that training with geometry diversity (made possible by CP-Gen) nearly doubles the success rate on novel objects (73% vs 45%). This suggests that the policy isn’t just memorizing coordinates; it’s learning visual cues associated with the object’s shape.

Conclusion

CP-Gen represents a significant step forward in data-efficient robot learning. By shifting the paradigm from trajectory replay to constraint satisfaction, it allows us to squeeze significantly more value out of a single human demonstration.

Key Takeaways:

  1. One Demo: You can generate thousands of robust training examples from a single expert input.
  2. Geometry Awareness: Unlike previous methods, CP-Gen handles stretching, scaling, and shape variations effectively.
  3. Sim-to-Real: The generated data is high-quality enough to train policies that work in the real world without seeing a single real image.

While the method currently relies on manual keypoint annotation (a limitation the authors acknowledge), it opens the door for future work where Vision-Language Models (VLMs) might automatically identify these constraints. For students and researchers in robotics, CP-Gen highlights the importance of embedding structure—like geometric constraints—into our data generation pipelines to achieve true generalization.