Reinforcement Learning (RL) has achieved remarkable things, from beating grandmasters at Go to teaching robots how to run. But if you ask a robot to perform a seemingly simple task—like picking up a credit card lying flat on a table—it often flails.
This specific type of problem is known as a “long-horizon, contact-rich” task. To succeed, the robot cannot just close its gripper; it must push the card to the edge of the table, reorient its hand, and then grasp it. This requires a sequence of precise interactions (pushing, sliding, pivoting) where the reward (holding the object) only comes at the very end. Standard RL struggles here because the search space is massive, and stumbling upon this exact sequence through random exploration is astronomically unlikely.
In a recent paper titled “Learning Long-Horizon Robot Manipulation Skills via Privileged Action,” researchers from the University of Edinburgh propose a fascinating solution: let the robot cheat.
By allowing the robot to break the laws of physics in simulation—penetrating tables and using magical “virtual forces”—they guide the policy toward the right behavior. Then, using a clever curriculum, they gradually remove these superpowers, leaving the robot with a robust, physically realistic skill set that transfers to the real world.
The Problem: The Exploration Trap
In robotic manipulation, the environment is typically modeled as a Markov Decision Process (MDP). The robot observes a state, takes an action, and receives a reward. The goal is to maximize the cumulative reward over time.
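Concretely, the policy \(\pi\) is trained to maximize the expected discounted return (written here in standard RL notation, which may differ slightly from the paper's symbols):

\[
J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\Big]
\]

where \(\gamma \in [0, 1)\) trades off immediate reward against future reward.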

However, in contact-rich manipulation, the “physics” gets in the way of learning.
- Geometric Constraints: A table surface blocks the gripper from getting under an object. The robot tries to go down, hits the table, and stops. It never learns that if it just pushed sideways, the object would move.
- Sparse Rewards: The robot only gets a “success” signal when the object is lifted. If the object is ungraspable in its current pose, the robot wanders aimlessly, never receiving that first positive signal to learn from.
Traditional solutions involve reward shaping (manually designing rewards for every inch of progress) or imitation learning (showing the robot human demonstrations). Both are labor-intensive, and imitation in particular biases the robot toward mimicking human strategies.
The authors of this paper ask: Can we change the simulation itself to make exploration easier?
The Solution: Privileged Actions
This framework introduces the concept of Privileged Actions. In simulation, we have “God mode.” We usually use this to give the robot privileged information (like exact object weight). But here, the researchers give the robot privileged capabilities.
The method follows a structured, three-stage curriculum shown below:

Stage 1: Constraint Relaxation (The Ghost Hand)
Imagine trying to grasp a flat box. The table prevents your fingers from wrapping around the bottom. In Stage 1, the researchers relax the collision constraints between the robot and the table.
They introduce a “virtual table” surface that is lower than the real table. The robot is allowed to penetrate the real table surface to a certain depth (\(\Delta_R\)) without physical consequence.
\[
\phi_R(x_t) + \Delta_R \geq 0
\]
In this equation, \(\phi_R\) represents the distance to the table. By adding \(\Delta_R\), the system allows the robot to “ghost” through the table surface. This simplifies the geometric problem. The robot can now easily encompass the object with its gripper. It learns that “hand around object = good,” even if the physics aren’t quite right yet.
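To make the idea concrete, here is a minimal sketch of how such a relaxed table constraint could be checked in a simulation wrapper. The function name, the signed-distance computation, and the variable `delta_r` are illustrative assumptions, not the paper's actual code or simulator API:

```python
import numpy as np

def table_constraint_satisfied(gripper_pos: np.ndarray,
                               table_height: float,
                               delta_r: float) -> bool:
    """Stage 1 sketch: relaxed robot-table collision constraint.

    phi_r is the signed distance from the gripper to the real table
    surface (negative means penetration). Requiring phi_r + delta_r >= 0
    lets the gripper sink up to delta_r below the table without a
    collision response. The curriculum later anneals delta_r to 0,
    restoring the real, rigid table.
    """
    phi_r = gripper_pos[2] - table_height  # signed distance along the z-axis
    return phi_r + delta_r >= 0.0
```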
Stage 2: Virtual Forces (The Magnet Hand)
Once the robot knows where to put its hand, it needs to learn how to interact with the object. Even with the table relaxed, manipulating an object requires precise friction and contact forces.
To help with this, the researchers introduce Virtual Forces. Think of this as a temporary magnetic field. They modify the system dynamics so the policy can apply a direct, artificial force to the object to help pull it toward the gripper (or push it).

The control input \(\mathbf{u}\) is split: the robot commands its own joints through \(\mathbf{u}_R\), while it also gets to exert a “privileged” force \(\mathbf{u}_O\) directly on the object. However, we don’t want the robot to become a Jedi using the Force from across the room, so this influence is gated by a state-dependent matrix \(\mathbf{B}(x_t)\).

This matrix ensures the virtual force only activates when the robot’s end-effector is close to the object in both position (\(\delta_p\)) and velocity (\(\delta_v\)). This encourages the robot to reach out and match the object’s movement, providing “training wheels” for physical manipulation.
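The gate can be pictured as a simple proximity switch. Below is a hedged sketch of one possible implementation; the hard indicator gate, the variable names, and the scaling by a curriculum factor `alpha` are assumptions for illustration rather than the authors' exact formulation:

```python
import numpy as np

def gated_virtual_force(u_obj: np.ndarray,
                        ee_pos: np.ndarray, obj_pos: np.ndarray,
                        ee_vel: np.ndarray, obj_vel: np.ndarray,
                        delta_p: float, delta_v: float,
                        alpha: float = 1.0) -> np.ndarray:
    """Stage 2 sketch: apply the privileged object force u_obj only when the
    end-effector tracks the object in both position and velocity.

    alpha is the curriculum factor (1.0 = full virtual force, 0.0 = none).
    """
    close_in_position = np.linalg.norm(ee_pos - obj_pos) <= delta_p
    close_in_velocity = np.linalg.norm(ee_vel - obj_vel) <= delta_v
    gate = 1.0 if (close_in_position and close_in_velocity) else 0.0
    return alpha * gate * u_obj
```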
Stage 3: The Curriculum (Weaning Off the Cheats)
If we stopped at Stage 2, the robot would fail in the real world because real tables are solid and real hands aren’t magnetic. This is where the Auto-Curriculum comes in.
The framework monitors the robot’s success rate. As the robot starts succeeding with the help of privileged actions, the system tightens the constraints.
- Raising the Table: The virtual table height is gradually raised until it matches the real table.
- Weakening the Force: The allowed magnitude and distance for the virtual force are decayed using a curriculum factor \(\alpha\) (a sketch of this schedule follows below).
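A success-rate-driven schedule of this kind might look like the following sketch. The threshold, the update sizes, and the tracked quantities are illustrative assumptions; the paper's actual auto-curriculum may differ in its details:

```python
def update_curriculum(success_rate: float,
                      table_offset: float,   # current virtual-table depth (m)
                      alpha: float,          # current virtual-force scale
                      threshold: float = 0.8,
                      offset_step: float = 0.01,
                      alpha_decay: float = 0.9):
    """Tighten the privileged actions once the policy succeeds reliably.

    - Raise the virtual table toward the real one (shrink the allowed
      penetration depth toward 0).
    - Decay the virtual-force scale alpha toward 0.
    """
    if success_rate >= threshold:
        table_offset = max(0.0, table_offset - offset_step)
        alpha = alpha * alpha_decay
    return table_offset, alpha
```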

By the end of training, the robot is operating under normal physical laws (Stage 3). However, because it was guided toward the high-reward regions of the state space, it has already learned the necessary motor skills to solve the task legitimately.
Experimental Setup
The researchers tested this framework on standard IsaacGym environments using a Franka Emika Panda arm and a dexterous Allegro hand. They used a general reward function without tuning it for specific strategies like “pushing” or “pivoting.”

The reward (\(r_{total}\)) simply encourages getting close to the object (\(r_f\)), lifting it (\(r_l\)), and reaching a goal (\(r_k\)). It does not explicitly tell the robot to “push the object to the edge.”
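Spelled out as code, a generic reward of this shape might look like the sketch below. The specific distance metrics, the lift test, and the weights are illustrative assumptions, not the paper's exact reward terms:

```python
import numpy as np

def total_reward(ee_pos, obj_pos, obj_height, goal_pos,
                 lift_height=0.10, w_f=1.0, w_l=2.0, w_k=2.0):
    """Generic manipulation reward sketch: reach + lift + goal terms."""
    r_f = -np.linalg.norm(ee_pos - obj_pos)      # encourage reaching the object
    r_l = float(obj_height > lift_height)        # bonus once the object is lifted
    r_k = -np.linalg.norm(obj_pos - goal_pos)    # encourage moving it to the goal
    return w_f * r_f + w_l * r_l + w_k * r_k
```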
Key Results
1. Emergent Long-Horizon Behaviors
The most striking result is that complex behaviors emerged naturally. In the “Push and Grasp” task, the robot realized it couldn’t grasp the flat object directly. It learned to push the box to the edge of the table, creating space for the gripper fingers to wrap around it.

In a more constrained environment where walls blocked the table edge (see Fig. 3 of the paper), the robot invented a Pivot Grasp. It used the table surface and its own base to wedge the object up, pivoting it into a vertical pose to grasp it.

This confirms that the privileged actions allowed the robot to explore the physics and discover creative solutions that a standard policy would never find due to collision barriers.
2. Handling Complex Objects
The method was also tested on “YCB objects” like scissors, staplers, and wrenches using a multi-fingered hand. These are notoriously difficult because of their odd shapes and thin profiles.

For example, the robot learned to slide the scissors to the edge of the table to pick them up. This behavior was not hard-coded; it was discovered.
3. Outperforming State-of-the-Art
The researchers compared their method against leading baselines like DexPBT (Population-Based Training) and SAPG (Split and Aggregate Policy Gradients).

The training curves are telling. For challenging objects like the stapler, the baselines flatline or achieve very low rewards; they get stuck in local optima, likely hovering over the object without ever figuring out how to lift it. The proposed method consistently converges to high success rates.
4. Why Do We Need Both Stages?
An ablation study revealed that both privileged stages are crucial.

- No Stage 1 (no relaxation): The robot still learns, but much more slowly. It struggles to find the initial grasp.
- No Stage 2 (no virtual force): The robot fails completely. Without the “magnetic” help to establish initial contact and movement, the exploration problem is simply too hard.
Conclusion
This paper demonstrates a powerful concept: sometimes the best way to learn reality is to start with fantasy. By strategically breaking the laws of physics—allowing teleportation through tables and magnetic hands—we can guide robots through the “valley of death” in reinforcement learning exploration.
The beauty of this framework lies in its generality. The researchers didn’t need to engineer a “pushing reward” or a “sliding reward.” They used a generic lifting reward, and the robot figured out the pushing and sliding strategies on its own because the privileged actions gave it the freedom to explore.
As we move toward more general-purpose robots, methods that automate the discovery of complex skills—rather than requiring humans to hand-code every movement—will be essential. This work suggests that “privileged actions” might become a standard tool in the robot learning toolbox of the future.