If you’ve ever tried to carry a heavy moving box or fold a large bedsheet with just one hand, you know the struggle. We humans rely heavily on bimanual manipulation—using two hands in coordination—to interact with the world. For robots to be truly useful in homes and warehouses, they need to master this same skill.

However, training robots to coordinate two arms is exponentially harder than training one. You have to manage higher degrees of freedom, ensure the arms don’t crash into each other, and maintain precise coordination to keep objects from falling. Traditional Imitation Learning (IL), where robots learn by mimicking human demonstrations, works well but is data-hungry. Collecting thousands of coordinated, two-arm demonstrations is costly and labor-intensive.

This brings us to a fascinating new paper: D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation. The researchers propose a way to artificially expand training datasets without needing to run the robot for hours on end. By leveraging diffusion models (the same tech behind DALL-E and Midjourney), they can synthesize realistic “fake” data that teaches the robot how to handle situations it has never actually seen before.

In this post, we’ll break down how D-CODA works, why simpler augmentation methods fail for dual-arm tasks, and how this method achieves impressive results in both simulation and the real world.

The Problem: Why Bimanual Data is Hard to Get

In visual imitation learning, robots often use “eye-in-hand” cameras—cameras mounted directly on their wrists. This is great for getting up close to the action. However, training a neural network on these images requires a massive amount of data to generalize well.

If a robot sees an object from a slightly different angle than it did during training, it might fail. Standard solution? Data Augmentation.

In computer vision, we usually just flip or crop images. But in robotics, if you change the image (the “state”), you must also know the correct robot movement (the “action”) that corresponds to that new image.

This gets tricky with two arms. If you artificially move the camera view of the left arm, you have to ensure:

  1. The view is consistent with what the right arm is seeing.
  2. The resulting action keeps the two arms coordinated.

If you simply jitter the cameras randomly during a task like lifting a tray, the augmented data might suggest moving the left hand away while the right hand stays put. If the robot learns this, it will drop the tray.
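To see the failure mode concretely, here is a tiny NumPy sketch (purely illustrative, not from the paper) comparing what happens to the distance between two grippers holding a tray when each wrist is jittered independently versus when both receive the same offset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative gripper positions while holding a tray (meters).
left = np.array([0.4, 0.3, 0.2])
right = np.array([0.4, -0.3, 0.2])
grip_distance = np.linalg.norm(left - right)  # must stay fixed to keep holding the tray

# Naive augmentation: jitter each wrist independently.
naive_l = left + rng.normal(0.0, 0.02, size=3)
naive_r = right + rng.normal(0.0, 0.02, size=3)
naive_distance = np.linalg.norm(naive_l - naive_r)

# Coordinated augmentation: apply the same offset to both wrists.
shared = rng.normal(0.0, 0.02, size=3)
coord_distance = np.linalg.norm((left + shared) - (right + shared))

print(f"original distance:           {grip_distance:.3f} m")
print(f"independent jitter:          {naive_distance:.3f} m  (grip geometry violated)")
print(f"shared (coordinated) jitter: {coord_distance:.3f} m  (grip geometry preserved)")
```

Independent jitter silently changes the relative geometry the grasp depends on; a coordinated offset leaves it intact. This is the kind of consistency D-CODA enforces, for both the images and the actions.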

Enter D-CODA

D-CODA stands for Diffusion for COordinated Dual-arm Data Augmentation. It is an offline framework that takes a small set of real demonstrations and multiplies them into a larger, richer dataset.

Figure 1: Overview of D-CODA for a coordinated bimanual lifting task with two UR5 arms.

As shown in Figure 1, the system takes original wrist camera images and generates “augmented” images that look like the robot is slightly offset from its original trajectory. Crucially, it calculates the correct joint-space actions to go with these new images, ensuring the robot learns to recover from errors.

The framework consists of three main stages, visualized in the architecture diagram below:

  1. Diffusion Model: Synthesizing the visual data.
  2. Perturbation Sampling: Deciding “where” the cameras should move.
  3. Policy Learning: Training the actual robot controller.

Figure 2: Overview of D-CODA. (i) The diffusion model generates novel views. (ii) Sampling strategy distinguishes between contactless and contact-rich states. (iii) Training the policy on the combined dataset.
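Before unpacking the components, it helps to see the overall offline loop. The sketch below is our own high-level rendering of the pipeline (function names such as `sample_perturbation`, `synthesize_view`, and `relabel_action` are placeholders for the stages described in this post, not the authors' actual code):

```python
def augment_dataset(demos, sample_perturbation, synthesize_view, relabel_action,
                    augmentations_per_state=4):
    """Illustrative D-CODA-style offline augmentation loop (sketch only).

    demos: list of (left_image, right_image, robot_state, action) tuples.
    sample_perturbation: returns a coordinated camera-pose perturbation (stage 2).
    synthesize_view: diffusion model rendering wrist views at the new poses (stage 1).
    relabel_action: computes the joint-space action consistent with the new poses.
    """
    augmented = []
    for left_img, right_img, state, action in demos:
        for _ in range(augmentations_per_state):
            delta_p = sample_perturbation(state)
            new_left, new_right = synthesize_view(left_img, right_img, delta_p)
            new_action = relabel_action(state, action, delta_p)
            augmented.append((new_left, new_right, state, new_action))
    # Stage 3: the policy is trained on the original and augmented data together.
    return demos + augmented
```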

Let’s dive deep into the two core technical innovations: the diffusion synthesis and the coordinated sampling.

1. The Diffusion Model for Novel-View Synthesis

The first challenge is visual: if we imagine the robot's wrist is 2 cm to the left of where it actually was, what would its camera see?

The authors employ a conditional diffusion model. A diffusion model works by progressively adding noise to an image until only noise remains, then learning to reverse that process to reconstruct a clean image. Here, the model is “conditional,” meaning it doesn’t just hallucinate arbitrary images; it generates a specific view based on two inputs:

  1. The original source images (\(I^l_a, I^r_a\)).
  2. The pose transformation (\(\Delta p\))—how much the camera moved.

The model uses a VQ-GAN autoencoder to compress images into a latent space (making the process computationally feasible) and a U-Net architecture to perform the denoising.

The training objective is effectively to minimize the difference between the noise added and the noise predicted by the model:

Loss function equation for the diffusion model.
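For reference, the placeholder above corresponds to the standard noise-prediction objective used in latent diffusion models; written in generic notation (the symbols here are ours and may differ from the paper’s exact formulation):

\[
\mathcal{L} \;=\; \mathbb{E}_{z,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\, \big\lVert \epsilon - \epsilon_\theta(z_t, t, c) \big\rVert_2^2 \,\right]
\]

where \(z_t\) is the noised image latent at diffusion timestep \(t\), \(\epsilon_\theta\) is the U-Net’s noise prediction, and \(c\) is the conditioning information (the source wrist images \(I^l_a, I^r_a\) and the pose change \(\Delta p\)).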

By injecting the relative camera pose information into the model’s cross-attention layers, D-CODA ensures that the generated left and right wrist images are viewpoint-consistent. This means if the left camera moves, the lighting, object perspective, and relative positions in the generated image match the physics of the scene.
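To make the conditioning mechanism concrete, here is a toy PyTorch sketch of pose-conditioned cross-attention. It only illustrates the general pattern; the module names, dimensions, and the 6-dimensional pose encoding are assumptions, not the paper’s actual architecture:

```python
import torch
import torch.nn as nn

class PoseConditionedCrossAttention(nn.Module):
    """Toy sketch: inject a relative-pose embedding into U-Net features via cross-attention."""

    def __init__(self, feat_dim=256, pose_dim=6, num_heads=4):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, feat_dim)  # embed the relative pose Δp
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, latent_tokens, delta_pose):
        # latent_tokens: (B, N, feat_dim) flattened U-Net feature map
        # delta_pose:    (B, pose_dim)   relative camera transform
        context = self.pose_proj(delta_pose).unsqueeze(1)          # (B, 1, feat_dim)
        attended, _ = self.attn(latent_tokens, context, context)   # image tokens attend to the pose
        return latent_tokens + attended                            # residual update

# Example: two images' worth of 8x8 latent tokens, conditioned on a 6-DoF pose change.
tokens = torch.randn(2, 64, 256)
delta_p = torch.randn(2, 6)
out = PoseConditionedCrossAttention()(tokens, delta_p)  # shape (2, 64, 256)
```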

2. The Logic: Constrained Sampling

This is the “brain” of D-CODA. You can generate perfect images, but if the pose perturbations (\(\Delta p\)) imply physically impossible or uncoordinated actions, the data is garbage.

The researchers use SAM2 (Segment Anything Model 2) to track the grippers and detect if the robot is touching an object. This splits the task into two modes:

Mode A: Contactless (Free Space)

If the robot is moving through empty air, coordination is less critical. The system samples camera perturbations uniformly at random (green/yellow dots in Figure 2). This teaches the robot to reach the object from a variety of angles.

Mode B: Contact-Rich (Holding an Object)

This is where previous methods fail. If the robot is holding a box, the left and right hands are physically coupled. You cannot move one without the other, or without moving the object.

To solve this, D-CODA treats the perturbation sampling as a constrained optimization problem. It looks for a pose change \(c_{trans}\) that satisfies specific rules suitable for bimanual tasks.

Optimization constraints for camera perturbation sampling.

The constraints (shown above) ensure:

  1. Safety: The new pose is not inside the table (\(d_{table}\)) or colliding with the other arm (\(d_{eff}\)).
  2. Feasibility: The Inverse Kinematics (IK) solver confirms the robot arms can actually reach this new position.
  3. Coordination: Crucially, when the arms are in contact with an object, the same perturbation is applied to both arms (or the perturbations are constrained) so that the relative distance between the grippers is preserved, as sketched in the code below.
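A simple way to realize these rules is rejection sampling: propose a perturbation, run the checks, and keep only proposals that pass. The sketch below is our own simplification; the helper callables (`distance_to_table`, `distance_between_effectors`, `ik_reachable`) are hypothetical stand-ins for the paper’s safety and feasibility checks, and orientations are omitted for brevity:

```python
import numpy as np

def sample_coordinated_perturbation(left_pos, right_pos, in_contact,
                                    distance_to_table, distance_between_effectors, ik_reachable,
                                    max_trans=0.03, min_table_clearance=0.01,
                                    min_arm_clearance=0.05, max_tries=100):
    """Rejection-sampling sketch of constrained, coordinated perturbation sampling."""
    rng = np.random.default_rng()
    for _ in range(max_tries):
        if in_contact:
            # Contact-rich: apply the SAME offset to both arms so the relative
            # geometry between the grippers (and the held object) is preserved.
            offset = rng.uniform(-max_trans, max_trans, size=3)
            new_left, new_right = left_pos + offset, right_pos + offset
        else:
            # Contactless: each arm may be perturbed independently.
            new_left = left_pos + rng.uniform(-max_trans, max_trans, size=3)
            new_right = right_pos + rng.uniform(-max_trans, max_trans, size=3)

        # Safety: stay clear of the table and of the other arm.
        if min(distance_to_table(new_left), distance_to_table(new_right)) < min_table_clearance:
            continue
        if distance_between_effectors(new_left, new_right) < min_arm_clearance:
            continue
        # Feasibility: both arms must be able to reach the perturbed poses.
        if ik_reachable(new_left) and ik_reachable(new_right):
            return new_left, new_right
    return left_pos, right_pos  # fall back to the original poses if no valid sample is found
```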

To visualize why this matters, look at the comparison below.

Figure 9: Visualization comparing constraint-enforced actions and random actions.

In Figure 9, notice the “Random Actions” (blue arrows). If the augmentation simply jitters the position randomly, the end-effectors move apart, increasing the distance between them. In reality, this would cause the robot to drop the ball. The “Constraint-Enforced Actions” (red/maroon arrows) ensure that the relative geometry is preserved, keeping the ball stable.

3. Action Labeling

Once the new camera pose is determined (via optimization) and the new image is synthesized (via diffusion), the system calculates the new action label. It uses the known kinematic chain of the robot to calculate the joint positions required to achieve the new pose.

The result is a new training pair: (Synthesized Image, Corrective Action). When added to the dataset, this teaches the robot: “If you find yourself slightly off-center while holding the box, here is how you adjust to stay on track.”
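The relabeling itself can be done with a standard differential inverse-kinematics step. The snippet below is a minimal sketch using a damped least-squares update, assuming the Jacobian is available from the robot’s known kinematic model; it illustrates the idea rather than the solver the authors actually use:

```python
import numpy as np

def relabel_joint_action(q_current, jacobian, delta_pose, damping=1e-4):
    """Damped least-squares IK step: joint target that realizes a small end-effector offset.

    q_current:  current joint angles of one arm, shape (n_joints,)
    jacobian:   end-effector Jacobian at q_current, shape (6, n_joints)
    delta_pose: desired end-effector displacement [dx, dy, dz, drx, dry, drz], shape (6,)
    """
    J = jacobian
    identity = np.eye(J.shape[0])
    dq = J.T @ np.linalg.solve(J @ J.T + damping * identity, delta_pose)
    return q_current + dq  # new joint-space action label for the augmented image
```

In this sketch, the update would be applied to each arm with its own Jacobian, giving a matching pair of joint-space actions for the two synthesized wrist views.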

Experimental Setup

The team evaluated D-CODA across 5 simulation tasks and 3 real-world tasks. These tasks were chosen specifically to test coordination.

Simulation Tasks (using RLBench/PerAct2):

  • Coordinated Lift Ball: Balancing a large sphere.
  • Coordinated Lift Tray: Lifting a tray while keeping it level.
  • Push Box: Moving a heavy object.
  • Dual Push Buttons: Synchronized pressing.
  • Bimanual Straighten Rope: Deformable object manipulation.

Figure 5: Simulation environments for our bimanual manipulation tasks.

Real-World Setup: They used two UR5 robotic arms equipped with RealSense wrist cameras.

Figure 7: Real-world bimanual UR5 setup.

Results and Analysis

The results demonstrate that D-CODA significantly outperforms baselines, including VISTA (a state-of-the-art single-view augmentation method) and Bimanual DMD (a dual-arm extension of Diffusion Meets DAgger that lacks D-CODA’s coordination constraints).

Simulation Results

In the simulation benchmarks, D-CODA achieved the highest success rates in 4 out of 5 tasks.

Table 1: Simulation results comparison.

Take the Coordinated Lift Tray task. Standard ACT (Action Chunking with Transformers) without augmentation succeeded only 37.3% of the time. With D-CODA, that jumped to 44.0%, while the uncoordinated “Bimanual DMD” plummeted to 13.3%. This shows that naive, uncoordinated augmentation acts as noise that actively confuses the policy on coordination-heavy tasks.

Real-World Performance

The real-world results were even more telling. The team tested on Lift Ball, Lift Drawer, and Push Block tasks.

Table 2: Real-world experiment results comparing D-CODA with baselines.

As seen in Table 2, D-CODA achieved a 17/20 success rate on the “Lift Ball” task, compared to 15/20 for standard ACT and only 12/20 for VISTA.

Why the difference? In the real world, lighting conditions and exact object positions vary. D-CODA’s augmented data covers a wider region of the state space, making the robot robust to these variations.

Figure 3: Isometric view of original and augmented camera positions.

Figure 3 illustrates this coverage. The blue dots are the original human demonstrations. The maroon and yellow clouds are the D-CODA augmented states. The method effectively explores the space around the ideal trajectory, giving the robot a “buffer zone” of knowledge.

Visual Quality

It’s also worth noting the quality of the generated images. The diffusion model preserves the visual fidelity of the grippers and objects, even when synthesizing views that don’t exist in the original data.

Figure 11: Examples of the original and synthesized wrist-camera images from both arms using D-CODA.

In Figure 11 (Real-World), look at the red-bordered images (augmented). They maintain consistent lighting and sharp details on the grippers, which is essential for the policy to recognize the state correctly.

Conclusion and Implications

D-CODA represents a significant step forward for bimanual robot learning. It tackles the specific bottleneck of coordination in data augmentation.

Key Takeaways:

  1. Augmentation needs Physics: You can’t just hallucinate new images for robotics; the geometry must make sense.
  2. Contact Matters: Distinguishing between free-space motion and contact-rich manipulation is vital for generating valid training data.
  3. Efficiency: D-CODA operates entirely offline. It generates training data without requiring a simulator or hours of extra robot teleoperation.

By combining the generative power of diffusion models with the strict constraints of kinematic optimization, D-CODA allows robots to learn more robust bimanual skills from fewer human demonstrations. As we move toward general-purpose humanoid robots, techniques like this will be essential for teaching them how to handle the complex, two-handed world we live in.