Introduction
We have entered an era where legged robots are doing far more than just walking. From the viral videos of Boston Dynamics’ Atlas doing parkour to quadrupedal robots performing agile jumps, the field is pushing toward high-dynamic, acrobatic behaviors. But there is a massive gap between watching a robot execute a backflip in a perfectly controlled simulation and achieving that same feat on physical hardware without the robot destroying itself.
Rotational maneuvers, like front flips, are fundamentally different from running. They require the generation of massive angular momentum, precise modulation of body inertia (like a figure skater tucking their arms), and the ability to survive high-impact landings. For Reinforcement Learning (RL) engineers, this presents a paradox: you need the robot to be aggressive enough to complete the rotation, but conservative enough to respect the physical limits of its motors and gears.
In this deep dive, we explore a fascinating paper that tackles this problem head-on. Using a minimalist one-leg hopper—essentially a robotic pogo stick—the researchers propose a new framework for learning impact-rich rotations. They introduce a physics-inspired reward system based on Centroidal Angular Velocity and implement rigorous “Sim-to-Real” techniques that model the specific limitations of electric motors and gearboxes.
The result? The first successful hardware realization of a front flip on a one-leg hopper. Let’s break down how they did it.
The Challenge: Why Flips Are Hard
Before diving into the solution, we need to understand the platform. The researchers used a custom-designed 3-DOF (Degrees of Freedom) one-leg hopper.

As shown in Figure 2, the robot is minimal. It has a thigh, a calf, and a foot, with a closed-loop ankle mechanism. It weighs about 12 kg and has roughly the proportions of a human leg. Unlike a quadruped (which has four legs to stabilize) or a humanoid (which has arms to help generate momentum), this robot has very limited control authority. To flip, it must use its single leg to launch, spin, and catch itself.
This creates two major hurdles for Reinforcement Learning:
- The Physics Hurdle: How do you tell an RL agent to “spin the whole body”? Standard rewards often fail to capture the complex relationship between momentum and inertia.
- The Hardware Hurdle: Simulation physics engines are often too optimistic. They assume motors can output constant torque regardless of speed, and they don’t simulate the stress on internal gears. In reality, attempting a flip often leads to “voltage saturation” (where back-EMF consumes the available supply voltage, so the motor cannot deliver the commanded torque at high speed) or catastrophic gear fractures upon landing.
Part 1: The Physics of Rotation
The core contribution of this paper regarding behavior learning is a new way to incentivize rotation. To understand it, we have to look at how we usually train robots.
The Failure of Standard Rewards
In RL, we shape behavior using reward functions. If you want a robot to flip forward (pitch rotation), the intuitive approach is to reward the Base Angular Velocity (BAV). Essentially, you tell the robot: “Maximize the pitch speed of your main body (the thigh).”
However, the researchers found that this fails. Why? Because of the conservation of angular momentum. If the robot aggressively swings its leg forward, the body will rotate backward to compensate. The internal joints are moving fast, satisfying the “high velocity” reward, but the system as a whole isn’t spinning. The robot just flails its leg in the air without achieving a net rotation.
An alternative is to reward Centroidal Angular Momentum (CAM). This metric looks at the momentum of the entire system around its center of mass. While this successfully gets the robot to generate a strong takeoff impulse, it creates a new problem: the robot jumps with a straight leg and stays straight. It generates momentum, but because it has a large moment of inertia (it’s stretched out), it rotates slowly and crashes before completing the flip.
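To make the distinction concrete, here is a minimal sketch of what these two baseline reward terms might look like in a Python training loop. The function names, the pitch-axis convention, and the per-link bookkeeping are my own illustrative assumptions, not the paper’s implementation:

```python
import numpy as np

PITCH_AXIS = np.array([0.0, 1.0, 0.0])  # assumed body-pitch axis

def bav_reward(base_angular_velocity):
    """Baseline 1: reward the pitch rate of the base link alone.
    Swinging the leg can spin the base without any net whole-body rotation,
    so this term is easy to game."""
    return float(PITCH_AXIS @ base_angular_velocity)

def cam_reward(masses, com_positions, com_velocities,
               world_inertias, ang_velocities, robot_com):
    """Baseline 2: reward centroidal angular momentum, i.e. the angular
    momentum of every link about the whole-body center of mass
    (all quantities expressed in the world frame)."""
    L = np.zeros(3)
    for m, p, v, I, w in zip(masses, com_positions, com_velocities,
                             world_inertias, ang_velocities):
        r = p - robot_com
        L += np.cross(r, m * v) + I @ w  # orbital term + spin term
    return float(PITCH_AXIS @ L)
```

The BAV term only sees one link, which is exactly why leg-flailing can exploit it; the CAM term sums over every link, so it cannot be exploited that way, but it says nothing about keeping the inertia small.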
The Solution: Centroidal Angular Velocity (CAV)
To fix this, the authors introduce a reward based on Centroidal Angular Velocity (CAV).
In physics, angular momentum (\(L\)) is the product of the moment of inertia (\(I\)) and angular velocity (\(\omega\)):
\[ L = I \omega \]
To flip successfully, you want high angular velocity (\(\omega\)). Rearranging the equation gives \(\omega = L / I\). This means that to maximize rotation speed, the robot must do two things simultaneously:
- Maximize Momentum (\(L\)): Push hard off the ground during takeoff.
- Minimize Inertia (\(I\)): Tuck the leg in mid-air to reduce the radius of gyration.
The CAV reward inherently encourages both behaviors. It drives the policy to generate a massive impulse at launch and then immediately fold the knee during flight to accelerate the spin—exactly how a human diver performs a tuck.
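In code, the step from CAM to CAV is small but meaningful: instead of rewarding the momentum \(L\) directly, you divide out the composite inertia first. The sketch below is a simplification under my own assumptions (a known composite inertia matrix about the center of mass and a fixed pitch axis); the paper’s exact reward shaping and scaling will differ:

```python
import numpy as np

PITCH_AXIS = np.array([0.0, 1.0, 0.0])  # assumed body-pitch axis

def cav_reward(centroidal_momentum, composite_inertia):
    """Reward the centroidal angular velocity omega = I^{-1} L.
    A high value requires BOTH a large momentum L (strong push-off)
    AND a small composite inertia I (mid-air tuck)."""
    omega_cav = np.linalg.solve(composite_inertia, centroidal_momentum)
    return float(PITCH_AXIS @ omega_cav)

# Illustrative numbers only: the same momentum spins the body far faster
# once the knee is folded and the pitch inertia shrinks.
L = np.array([0.0, 6.0, 0.0])            # kg·m²/s about the COM
I_extended = np.diag([0.9, 0.8, 0.1])    # leg stretched out
I_tucked   = np.diag([0.4, 0.3, 0.1])    # knee folded
print(cav_reward(L, I_extended))         # -> 7.5
print(cav_reward(L, I_tucked))           # -> 20.0
```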

The difference is stark in the data. Look at Figure 4:
- The Blue Line (Proposed CAV): Notice graph (f). The composite inertia drops significantly in mid-air because the robot is tucking. Consequently, in graph (e), the velocity spikes, allowing the robot to complete the full \(2\pi\) rotation shown in graph (a).
- The Orange Line (CAM): The robot generates momentum but fails to reduce inertia (graph f stays high). The rotation is sluggish, and the flip fails.
Part 2: Bridging the Sim-to-Real Gap
Even with the perfect physics reward, a policy trained in a standard simulation will likely fail on a real robot. This is because high-dynamic motions push hardware to its absolute limits.
The Motor Operating Region (MOR)
In most simulations, we set a simple “clip” on motor torque (e.g., “Max Torque = 30 Nm”). This creates a rectangular box of allowable torque-speed combinations.
Real electric motors don’t work like that. As a motor spins faster, it generates “back-EMF” (electromotive force) that opposes the driving voltage. This means the faster the motor moves, the less torque it can produce.
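For the standard DC motor model (an assumption on my part; the paper characterizes its own actuators), the torque available in the direction of motion falls off linearly with speed once the voltage limit binds:
\[ \tau_{\max}(\omega) = \min\!\left(\tau_{\text{peak}},\ \frac{k_t \,(V_{\text{bus}} - k_e\,\omega)}{R}\right) \]
where \(k_t\) is the torque constant, \(k_e\) the back-EMF constant, \(R\) the winding resistance, and \(V_{\text{bus}}\) the supply voltage.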
The researchers modeled this specific Motor Operating Region (MOR) for the hopper’s actuators.

In Figure 3, you can see the “voltage limit slope” (the red line).
- During take-off (green arrow moving right), the robot needs high torque and high speed.
- During the knee fold (green arrow moving left), it needs to tuck rapidly.
If the RL agent ignores this slope (using the standard “box” limits), it will try to command torque that physically doesn’t exist at those high speeds.
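A minimal way to bake this into training is to replace the rectangular torque clip with a speed-dependent one inside the simulated actuator. The sketch below uses lumped, joint-level placeholder numbers (not the hopper’s real parameters) and applies the same bound in both directions, which is a conservative simplification of the true operating region:

```python
import numpy as np

# Placeholder joint-level actuator parameters, not the paper's values.
TAU_PEAK   = 30.0   # N·m  : current-limited peak torque (top of the "box")
QD_NO_LOAD = 40.0   # rad/s: speed where the voltage limit drives torque to zero
V_SLOPE    = 1.5    # N·m per rad/s: slope of the voltage-limit line

def available_torque(qd):
    """Speed-dependent torque bound: the flat current limit intersected
    with the back-EMF (voltage) line."""
    voltage_bound = V_SLOPE * max(QD_NO_LOAD - abs(qd), 0.0)
    return min(TAU_PEAK, voltage_bound)

def clip_to_mor(tau_cmd, qd):
    """Clip the policy's commanded torque to what the actuator can actually
    deliver at the current joint speed, instead of a fixed rectangular box."""
    bound = available_torque(qd)
    return float(np.clip(tau_cmd, -bound, bound))
```

Because the clip lives inside the simulated actuator, the policy never earns reward with torque that the real motor cannot deliver at speed.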

Figure 5 illustrates the consequence of ignoring the MOR. The purple line represents a policy trained in simulation without these constraints. It looks great in sim, but when you look at (d), you see the policy is commanding torque (solid blue) that sits well above the physical limit (black line).
When this “optimistic” policy is transferred to the real robot (or a realistic sim), the motor hits the voltage ceiling and the torque drops off. The robot under-rotates and crashes. The proposed method (orange/blue lines), trained with the MOR constraint, learns a strategy that stays safely inside the motor’s capabilities.
Protecting the Gearbox: Transmission Load Regularization
The final piece of the puzzle is structural integrity. The one-leg hopper has a high gear ratio to generate force, but this makes the gears fragile against “shock” loads.
When the robot lands, the impact force travels from the foot, through the linkages, and into the gearbox (specifically the sun gear). Early in the experiments, the team encountered catastrophic failures where the sun gear would fracture upon landing.
To prevent this, they implemented Transmission Load Regularization. They couldn’t measure the gear load directly on the robot during training, so they estimated it using the contact Jacobian and the impact forces in simulation. They then added a penalty to the reward function for high transmission loads.
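A rough sketch of what such a penalty could look like is below. Estimating the load as the contact force mapped through the foot Jacobian and reflected through the gear ratio, and the penalty weight itself, are my own illustrative choices rather than the paper’s exact formulation:

```python
import numpy as np

def transmission_load_penalty(contact_jacobian, contact_force,
                              actuated_idx, gear_ratio=9.0, weight=1e-3):
    """Penalize the load that a landing impact reflects into the gearbox.

    contact_jacobian : (3, n_dof) Jacobian of the foot contact point
    contact_force    : (3,) contact force reported by the simulator
    actuated_idx     : indices of the actuated joints (hip, knee, ankle)
    """
    # Joint torques induced by the contact force: tau = J^T f
    impact_torque = contact_jacobian.T @ contact_force
    # Reflect to the gear input side as a crude proxy for gear-tooth load.
    gear_load = impact_torque[actuated_idx] / gear_ratio
    return -weight * float(np.sum(gear_load ** 2))
```

Added to the task reward, this term nudges the policy toward landings that spread the impact out rather than slamming it straight into the sun gear.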

The impact of this regularization is visceral. Figure 8 shows the survival rate. The red line (Baseline) shows the robot failing on the second trial due to a shattered gear (inset photo). The blue line (Proposed) shows consistent success over eight trials.

Figure 7 explains why. The policy trained with regularization (blue) learns a landing strategy that significantly reduces the peak torque spike on the gear (compare the massive red spikes in graphs (c) and (d) to the controlled blue lines). The robot learns to absorb the impact more softly, likely by engaging the ankle mechanism differently or timing the contact to distribute forces over a longer window.
The Results
By combining the Centroidal Angular Velocity reward (for better physics) with MOR modeling and Load Regularization (for hardware reality), the team achieved a robust front flip.

As seen in Figure 1, the motion is dynamic and controlled. The robot extends fully for takeoff, tucks tight to spin, and extends again to catch the landing without shattering its gearbox.
The researchers also demonstrated that this framework generalizes. They used the same reward structure to teach the hopper to do barrel rolls and yaw spins, and even transferred the method to a quadruped (Unitree Go1) to learn backflips.


Conclusion
This research highlights a crucial lesson in modern robotics: learning algorithms cannot ignore the physical reality of the machine.
- Physics Matters: Simple rewards like “spin fast” often lead to local optima. Understanding the centroidal dynamics (momentum vs. inertia) allows us to design rewards that guide the robot toward physically superior techniques, like the mid-air tuck.
- Hardware is the Limit: A simulation is only as good as its fidelity. By modeling the voltage limits of motors and the breaking points of gears, we can train policies that are not just high-performing, but durable.
For students and researchers entering the field of legged robotics, this paper serves as a blueprint for closing the sim-to-real gap. It’s not enough to get a high score in the simulator; you have to respect the voltage limits and save the gears. Only then can you truly stick the landing.