Imagine a flat, thin computer keyboard lying on a desk. You want to pick it up. If you just try to grab it from the top, your fingers will likely hit the desk before you can get a solid grip. The desk “occludes” (blocks) the grasp. So, what do you do naturally? You probably use your non-dominant hand to tilt the keyboard up or brace it, while your dominant hand secures the grip.

This bimanual (two-handed) coordination is second nature to humans, but it is an immense challenge for robots. Standard robotic grasping usually involves a single arm planning a path to an object. When the environment itself—like the table—blocks the gripper, the robot fails.

In this deep dive, we are exploring COMBO-Grasp, a novel research paper from the University of Oxford. The researchers propose a system that mimics human bimanual intuition: one arm acts as a “constraint” to stabilize or tilt the object, while the other performs the grasp. By combining Self-Supervised Learning, Reinforcement Learning (RL), and a clever use of Diffusion Models, COMBO-Grasp enables robots to pick up “ungraspable” objects.

COMBO-Grasp Overview: (1) Right arm moves to support pose, (2) Left arm pushes object against constraint and grasps, (3) Right arm retreats, (4) Left arm lifts.

As shown in the figure above, the system orchestrates a complex dance: the right arm sets a “pick” (like in basketball), and the left arm drives the object into it to create a graspable gap.

The Challenge: Why is “Occluded Grasping” So Hard?

In robotics, “occluded grasping” refers to scenarios where a valid grasp pose exists, but it is kinematically infeasible because of collisions with the environment.

Traditional solutions fall into two buckets, both with significant downsides:

  1. Open-Loop Planning: The robot calculates a path and follows it blindly. This fails here because the robot needs to interact with the object (pushing, tilting) to create the grasp pose. You can’t plan a grasp for a pose that doesn’t exist yet.
  2. Reinforcement Learning (RL): You could train an RL agent to figure it out. However, bimanual manipulation doubles the action space (two arms moving at once). The “search space” for the solution becomes exponentially larger, making standard RL extremely sample-inefficient and slow to converge.

COMBO-Grasp (Constraint-based Manipulation for Bimanual Occluded Grasping) tackles this by breaking the problem down. Instead of asking one giant brain to control both arms from scratch, it decouples the problem into two coordinated policies (see the sketch after this list):

  1. A Constraint Policy: Controls the non-dominant arm to create a stabilizer.
  2. A Grasping Policy: Controls the dominant arm to manipulate the object against that stabilizer.
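
To make the division of labor concrete, here is a minimal execution sketch. It is illustrative only: the class and environment method names (constraint_policy, env.step_left_arm, and so on) are hypothetical stand-ins for the interfaces described in the paper, not the authors' actual code.

```python
MAX_STEPS = 200  # arbitrary control horizon for illustration


def bimanual_occluded_grasp(constraint_policy, grasping_policy, env):
    """Sequence the two policies: constrain with the right arm, grasp with the left."""
    obs = env.reset()

    # 1. The constraint policy proposes a single stabilizing pose for the
    #    non-dominant (right) arm, conditioned on the scene observation.
    constraint_pose = constraint_policy.sample_pose(obs)
    env.move_right_arm_to(constraint_pose)

    # 2. The grasping policy runs closed-loop: it pushes the object against
    #    the constraint arm and closes the dominant (left) gripper.
    for _ in range(MAX_STEPS):
        action = grasping_policy.act(obs)      # left-arm end-effector command
        obs, grasp_secured = env.step_left_arm(action)
        if grasp_secured:
            break

    # 3. Retreat the constraint arm, then lift the object with the grasping arm.
    env.retreat_right_arm()
    env.lift_left_arm()
```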

The COMBO-Grasp Architecture

The core philosophy of this paper is that learning becomes easier when you have a good partner. Here, the “partner” is the constraint policy. The method unfolds in three distinct stages, moving from simulation to the real world.

Method Overview: (1) Self-supervised constraint training, (2) RL Grasping training with refinement, (3) Teacher-Student distillation.

Phase 1: The Constraint Policy (Self-Supervised Learning)

Before the robot tries to grasp anything, it first needs to learn how to be a good support system. The authors needed a way to train the right arm to find good “stabilizing poses” without requiring thousands of hours of human demonstration.

Their solution? Physics-based Self-Supervision.

In a simulation, they take an object and a target grasp pose. They then randomly position the right arm (the constraint) near the object and apply a force to the object—simulating the left arm pushing it. If the object doesn’t move significantly, the right arm has successfully “constrained” it. This concept is known as Force Closure.
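
A rough sketch of that data-collection loop could look like the following. All simulator calls (sim.apply_push_force, sim.sample_random_pose_near, and so on) and the displacement threshold are hypothetical placeholders, not the paper's actual API.

```python
import numpy as np

DISPLACEMENT_THRESHOLD = 0.01  # metres; an assumed tolerance for "didn't move"


def collect_constraint_poses(sim, objects, samples_per_object=3000):
    """Physics-based self-supervision: keep constraint poses that hold the object still."""
    dataset = []
    for obj in objects:
        for _ in range(samples_per_object):
            sim.reset(obj)
            grasp_pose = sim.sample_target_grasp_pose(obj)

            # Randomly place the right (constraint) arm near the object.
            constraint_pose = sim.sample_random_pose_near(obj)
            sim.set_right_arm_pose(constraint_pose)

            # Apply a force that imitates the left arm pushing the object.
            start = sim.object_position(obj)
            sim.apply_push_force(obj, direction=sim.push_direction(grasp_pose))
            sim.step(n_steps=50)
            displacement = np.linalg.norm(sim.object_position(obj) - start)

            # If the object barely moved, this pose successfully constrained it.
            if displacement < DISPLACEMENT_THRESHOLD:
                dataset.append((sim.observe_point_cloud(obj), grasp_pose, constraint_pose))
    return dataset
```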

Using this technique, the authors generated a synthetic dataset of 144,000 successful constraint poses across 48 different objects. They then trained a Diffusion Model—specifically a Denoising Diffusion Probabilistic Model (DDPM)—to predict these poses.

The diffusion process iteratively refines a noisy input into a valid constraint pose. The training objective is to minimize the error between the noise added and the noise predicted by the network:

\[
\mathcal{L} = \mathbb{E}_{k,\, \mathbf{x}_0,\, \epsilon}\!\left[\left\lVert \epsilon - \epsilon_\theta(\mathbf{x}_k,\, k) \right\rVert^2\right]
\]

where \(\mathbf{x}_k = \sqrt{\bar{\alpha}_k}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon\) is the clean constraint pose \(\mathbf{x}_0\) corrupted with Gaussian noise \(\epsilon\) at diffusion step \(k\) (with \(\bar{\alpha}_k\) the noise schedule), and \(\epsilon_\theta\) is the neural network predicting that noise. The actual generation of the pose happens through an iterative denoising process:

\[
\mathbf{x}_{k-1} = \frac{1}{\sqrt{\alpha_k}}\left(\mathbf{x}_k - \frac{1-\alpha_k}{\sqrt{1-\bar{\alpha}_k}}\, \epsilon_\theta(\mathbf{x}_k,\, k)\right) + \sigma_k \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)
\]

Here, the model starts from pure Gaussian noise \(\mathbf{x}_K\) and applies this update \(K\) times, stepping backward from \(k = K\) to \(k = 1\) to reveal a clean, stable pose for the right arm.
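
For readers who prefer code, the training objective and the reverse (denoising) loop can be sketched as a generic DDPM in PyTorch. The schedule tensors and the eps_net interface are assumptions for illustration, not the paper's exact implementation.

```python
import torch


def ddpm_loss(eps_net, x0, alphas_bar, K):
    """Noise-prediction loss: corrupt clean constraint poses x0 and ask the
    network to recover the injected noise (generic DDPM objective)."""
    k = torch.randint(0, K, (x0.shape[0],))          # random diffusion step per sample
    eps = torch.randn_like(x0)                       # Gaussian noise to inject
    a_bar = alphas_bar[k].unsqueeze(-1)              # cumulative schedule, shape (B, 1)
    x_k = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # forward (noising) process
    return torch.nn.functional.mse_loss(eps_net(x_k, k), eps)


@torch.no_grad()
def sample_constraint_pose(eps_net, alphas, alphas_bar, sigmas, K, dim):
    """Start from Gaussian noise and denoise K times to obtain a pose.
    alphas, alphas_bar, sigmas are 1-D tensors defining the noise schedule."""
    x = torch.randn(1, dim)
    for k in reversed(range(K)):
        eps_hat = eps_net(x, torch.tensor([k]))
        x = (x - (1 - alphas[k]) / (1 - alphas_bar[k]).sqrt() * eps_hat) / alphas[k].sqrt()
        if k > 0:
            x = x + sigmas[k] * torch.randn_like(x)  # re-inject noise except at the last step
    return x
```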

Phase 2: The Grasping Policy (Reinforcement Learning)

Once the right arm knows how to stand still and be helpful, the left arm needs to learn how to use it. The authors train a “Teacher” grasping policy using Reinforcement Learning (specifically PPO).

This policy has access to “privileged information”—exact object positions, velocities, and physics parameters that a real robot wouldn’t know. This makes training faster.

The reward function is critical here. It’s not enough to just say “good job” if the object is lifted. The reward is a weighted sum of several factors:

\[
r = \lambda_{pos}\, r_{dist\_pos} + \lambda_{ori}\, r_{dist\_ori} + \lambda_{col}\, r_{collision} + \lambda_{act}\, r_{action} + \lambda_{lift}\, r_{lift} + \lambda_{succ}\, r_{success}
\]

where each \(\lambda\) is a weighting coefficient.

Let’s break down these terms (a code sketch of the combined reward follows the list):

  • \(r_{dist\_pos}\) & \(r_{dist\_ori}\): Reward the hand for getting close to the target grasp pose in both position and orientation.
  • \(r_{collision}\): A heavy penalty for hitting the table or the other arm (self-collision).
  • \(r_{action}\): Penalizes wild, high-magnitude movements to keep motion smooth.
  • \(r_{lift}\): Explicitly rewards moving the object up vertically.
  • \(r_{success}\): A big bonus for successfully completing the grasp.
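
Putting those pieces together, the shaped reward could be computed roughly as below. The weights, the exponential distance shaping, and the state attributes are illustrative assumptions; the paper's exact coefficients are not reproduced here.

```python
import numpy as np

# Hypothetical weights for illustration only; the paper's values differ.
WEIGHTS = dict(dist_pos=1.0, dist_ori=0.5, collision=-5.0,
               action=-0.01, lift=2.0, success=10.0)


def grasp_reward(state):
    """Weighted sum of the shaping terms described in the list above."""
    terms = dict(
        # Closeness of the left gripper to the target grasp pose.
        dist_pos=np.exp(-np.linalg.norm(state.gripper_pos - state.target_pos)),
        dist_ori=np.exp(-state.orientation_error),
        # Penalties and bonuses (their weights carry the sign).
        collision=float(state.in_collision),              # table or arm-arm contact
        action=float(np.square(state.action).sum()),      # discourage jerky motion
        lift=max(state.object_height - state.initial_height, 0.0),
        success=float(state.grasp_succeeded),
    )
    return sum(WEIGHTS[name] * value for name, value in terms.items())
```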

The Innovation: Value Function-Guided Coordination

Here is the “aha!” moment of the paper.

If you just train the Constraint Policy (Phase 1) and the Grasping Policy (Phase 2) separately, they might not work well together. The Constraint Policy was trained on random pushes, not on the specific strategies the Grasping Policy learns.

To fix this, the authors introduce Value Function-Guided Policy Coordination.

During the RL training of the grasping arm, the system learns a Value Function (\(V\))—a critic that estimates how “good” a current state is (i.e., how likely it is to lead to a reward). The authors use the gradients from this Value Function to update the output of the Constraint Policy.

Think of it like this: The Grasping Policy (Left Arm) is trying to work, and the Value Function (The Critic) shouts to the Constraint Policy (Right Arm), “Hey, if you move two inches to the left, my estimated chance of success goes up!”

Mathematically, they modify the diffusion sampling step. They add a term that shifts the generation in the direction of the value function gradient (\(\nabla V\)):

\[
\mathbf{x}_{k-1} = \frac{1}{\sqrt{\alpha_k}}\left(\mathbf{x}_k - \frac{1-\alpha_k}{\sqrt{1-\bar{\alpha}_k}}\, \epsilon_\theta(\mathbf{x}_k,\, k)\right) + w\, \nabla V(\mathbf{x}_k) + \sigma_k \mathbf{z}
\]

This \(w\,\nabla V(\mathbf{x}_k)\) term steers the constraint pose toward configurations that the grasping policy finds most useful. It mimics the “Classifier Guidance” technique often used in image generation models, but applied here to robotic control.
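
Reusing the generic sampler sketched earlier, the guidance amounts to one extra gradient step per denoising iteration. Again, value_net and the guidance weight w are stand-ins for illustration, not the authors' exact implementation.

```python
import torch


@torch.no_grad()
def sample_with_value_guidance(eps_net, value_net, alphas, alphas_bar, sigmas,
                               K, dim, w=0.5):
    """Denoise as before, but nudge each intermediate pose in the direction
    that increases the grasping critic's value estimate (classifier-guidance style)."""
    x = torch.randn(1, dim)
    for k in reversed(range(K)):
        eps_hat = eps_net(x, torch.tensor([k]))
        x = (x - (1 - alphas[k]) / (1 - alphas_bar[k]).sqrt() * eps_hat) / alphas[k].sqrt()

        # Gradient of the value function with respect to the constraint pose.
        with torch.enable_grad():
            x_req = x.detach().requires_grad_(True)
            value = value_net(x_req).sum()
            grad_v = torch.autograd.grad(value, x_req)[0]
        x = x + w * grad_v                      # steer toward higher estimated value

        if k > 0:
            x = x + sigmas[k] * torch.randn_like(x)
    return x
```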

Phase 3: Teacher-Student Distillation

We now have a smart Teacher policy, but it relies on “privileged information” (perfect knowledge of physics and state) that doesn’t exist in the real world. To bridge the “Sim-to-Real” gap, the authors use Policy Distillation.

They train “Student” policies that only get to see what the robot actually sees: 3D Point Clouds from a depth camera.

Student Policy Architecture: DP3 Encoder processing point clouds feeding into Diffusion Policy

As shown above, the student architecture processes the noisy, partial point cloud data using a DP3 encoder and tries to mimic the actions of the expert Teacher. This allows the system to be deployed on physical hardware where objects are only visible through cameras.
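
The distillation loop itself is conceptually simple: roll out the privileged teacher in simulation and train the student to reproduce its actions from point clouds alone. The sketch below uses a plain regression loss for clarity, whereas the actual student is a diffusion policy with a DP3 encoder; all names and interfaces here are placeholders.

```python
import torch


def distill_student(teacher, student, encoder, env, optimizer, episodes=1000):
    """Behaviour-cloning distillation: the student only sees point clouds,
    while the teacher acts from privileged simulator state."""
    for _ in range(episodes):
        priv_state, point_cloud = env.reset()
        done = False
        while not done:
            with torch.no_grad():
                teacher_action = teacher(priv_state)       # uses privileged info

            # Student predicts an action from the point-cloud embedding only.
            embedding = encoder(point_cloud)                # e.g. a DP3-style encoder
            student_action = student(embedding)

            loss = torch.nn.functional.mse_loss(student_action, teacher_action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Step the environment with the teacher's action (rollouts under the
            # student's own actions, DAgger-style, are also possible; this is a
            # simplification for illustration).
            priv_state, point_cloud, done = env.step(teacher_action)
    return student
```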

Experiments and Results

The authors evaluated COMBO-Grasp in both simulation (Isaac Sim) and the real world, comparing it against standard PPO baselines and variations of their own method.

1. Does the “Constraint” approach actually help?

In simulation, they compared COMBO-Grasp against a standard RL policy controlling both arms (PPO) and an RL policy with a naive reward for using the second arm.

Training Curves: COMBO-Grasp (Red) reaches higher success faster than baselines.

The results are stark. Standard PPO (Grey) struggles to learn the coordination required. The COMBO-Grasp method (Red) learns significantly faster (better sample efficiency) and reaches a much higher success rate (over 80%). The “Constraint” approach acts as a strong inductive bias, simplifying the problem enough for the RL agent to solve it.

2. Can it generalize to new objects?

A robot that can only pick up the objects it was trained on isn’t very useful. The authors tested the Student policies on “Unseen” objects—shapes the robot had never encountered during training.

Bar Chart: Success rates on Seen vs Unseen objects. COMBO-Grasp maintains high performance.

While performance drops slightly for unseen objects (as expected), COMBO-Grasp still significantly outperforms the baselines. The PPO baseline almost collapses on unseen objects, suggesting it “overfit” to the physics quirks of the training set rather than learning a generalizable skill.

3. Does the Value Function Guidance matter?

Is the complex math involving the value function gradients necessary? The authors ran an ablation study, scaling the guidance weight (written as \(w\) in the equation above and labeled \(\gamma\) in the plot) from 0 (no guidance) to 1.0.

Ablation Study: Performance improves with guidance scaling.

The blue line (Gamma = 0) represents the system without coordination refinement. It learns slower and peaks lower. Adding the guidance (Orange/Green/Brown lines) consistently improves performance, proving that refining the constraint pose based on the grasper’s needs is crucial.

4. Real-World Performance

Finally, the ultimate test: the real world. The setup involved two Kinova Gen3 arms and a RealSense camera.

Real World Setup: Two arms, camera, and grippers.

They tested the robot on a variety of difficult items, including heavy boxes, thin keyboards, and round containers.

Real World Objects: Variety of shapes and sizes.

The real-world results were impressive. The system achieved an average success rate of 68.3% across all objects.

Real World Results Table

Notable observations from the real-world data:

  • Keyboards are hard: Without the “Grasp Pose” input (knowing exactly where to grab), the success rate on the keyboard was only 40%. With the target pose, it jumped to 80%. This is likely because if the first push fails on a thin object, it’s hard to recover without knowing exactly where the target is.
  • Round objects are tricky: The system struggled with the “Round-Large-Light” object (30% success). Spherical or cylindrical objects tend to roll away from the constraint arm, making stabilization difficult.

Conclusion

COMBO-Grasp represents a significant step forward in bimanual robotic manipulation. By treating the second arm not just as another actuator, but as a dynamic “wall” or constraint, the authors turned a complex coordination problem into a manageable one.

The three key takeaways for students and researchers are:

  1. Inductive Biases Help: Structuring the learning problem (one arm constrains, one arm grasps) is often more effective than throwing a massive neural network at raw data.
  2. Simulation is Powerful: Self-supervised data collection in simulation (checking force closure) provided a massive dataset that would be impossible to collect in reality.
  3. Cross-Policy Communication: The use of Value Function Guidance allows two separate policies to “talk” to each other via gradients, aligning their goals without needing to be trained as a single monolithic block.

While challenges remain—particularly with round objects or recovering from failed pushes—COMBO-Grasp demonstrates that when it comes to robotic manipulation, two hands truly are better than one.