Introduction

When we imagine a humanoid robot, we typically picture it doing one of two things: walking on two legs or standing still while using its hands to manipulate an object. This mirrors how traditional robotics control has evolved—treating the robot as a bipedal platform for mobile manipulation.

But think about how humans actually move. We don’t just walk. We sit in chairs, we crawl under low obstacles, we trip and roll to break a fall, and we push ourselves up from the ground. We use our elbows, knees, backs, and shoulders to interact with the world. In robotics terms, humans embrace “full-body ground contacts.”

For a robot, however, touching the ground with anything other than the soles of its feet is usually considered a failure state. This limitation exists for a reason: predicting contact forces is hard. Predicting contact forces when you don’t know if the robot’s elbow or its backpack will hit the ground next is exponentially harder.

In this post, we are diving into a fascinating paper titled “Embrace Contacts: Humanoid Shadowing with Full Body Ground Contacts.” The researchers present a new framework that allows humanoid robots to perform “extreme,” contact-rich motions—like breakdancing, crawling, and recovering from a prone position—using a standard reinforcement learning pipeline.

A humanoid robot performing various complex poses including sitting and crawling.

As shown in Figure 1 above, this work moves beyond simple walking. It enables a robot to execute complex, dynamic motions in the real world, bridging the gap between simulation and reality without requiring expensive motion capture setups during deployment.

The Problem: Why Can’t Robots Just “Roll With It”?

To understand the contribution of this paper, we first need to understand why whole-body contact is such a nightmare for roboticists.

1. The Interface Struggle

Standard locomotion controllers use velocity commands. You tell a robot: “Move forward at 0.5 m/s” or “Turn right.” This works great for walking. But how do you describe a somersault using linear velocity? Or a transition from crawling to standing? When a robot is rolling, its “forward” direction is constantly spinning. Defining motion commands for these gymnastics in the robot’s base frame is ambiguous and impractical.
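
To make the interface problem concrete, here is a small sketch (not from the paper) contrasting the two command styles. The field names, the joint count, and the keyframe values are illustrative placeholders.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VelocityCommand:
    """Classic locomotion interface: targets expressed in the robot's base frame."""
    forward_speed: float  # m/s along the current heading
    lateral_speed: float  # m/s sideways
    yaw_rate: float       # rad/s turning rate

@dataclass
class MotionKeyframe:
    """Keyframe-style interface: a full target pose at a future time."""
    time_offset: float            # seconds until this pose should be reached
    joint_angles: np.ndarray      # target angle for each actuated joint
    base_orientation: np.ndarray  # target base rotation, e.g. a quaternion

# A somersault or a get-up is awkward to express as velocities ("forward"
# keeps spinning), but it is just a short list of keyframes:
get_up = [
    MotionKeyframe(0.5, np.zeros(23), np.array([1.0, 0.0, 0.0, 0.0])),  # joint count is a placeholder
    MotionKeyframe(1.0, np.zeros(23), np.array([1.0, 0.0, 0.0, 0.0])),
]
```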

2. The Simulation Gap

Modern robots are trained in simulation (Sim-to-Real). We train a brain (policy) in a physics engine and then copy-paste it into the real robot. However, simulators use rigid-body physics. They simplify collisions. In simulation, a “hard” robot hitting a “hard” floor is a clean mathematical event. In reality, a robot has rubber pads, wires, and slight flex, and the ground interaction is messy.

When a robot only walks, we only have to worry about the feet. When a robot is breakdancing, we have to model collisions for the knees, elbows, shoulders, and hands. If the simulation doesn’t perfectly match reality, the robot might flail or break itself when deployed.

3. Data Scarcity

We often teach robots by having them mimic human motion data (MoCap). The most popular dataset, AMASS, is huge but boring. It mostly contains people standing and walking. It lacks the “extreme” data—like rolling on the floor—needed to train a robust recovery policy.

The Solution: A Unified Humanoid Motion Framework

The researchers propose a pipeline that shadows (mimics) human motion. The goal is to take a raw human motion file (like a person crawling) and have the robot replicate it in the real world, handling all the balance and contact physics in real-time.

Diagram showing the pipeline from human data to simulation training to real-world deployment.

Figure 2 illustrates the complete workflow. It is divided into two phases: Simulation Training and Real-World Deployment.

  1. Data Curation: They combine standard datasets with a custom “Extreme-Action” dataset scraped from internet videos.
  2. Retargeting: Human skeletons and robot skeletons are different. The system retargets human motion to joint angles the robot can physically achieve.
  3. The Policy (Brain): A neural network takes the current state and a sequence of future commands to decide how to move the motors.
  4. Sim-to-Real: The trained policy runs on the robot’s onboard computer (Nvidia Jetson), taking high-level commands and outputting low-level torques (a sketch of this loop follows below).
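
To make steps 3 and 4 concrete, here is a minimal sketch of what the onboard loop could look like at deployment time. Everything here is illustrative: the policy, robot, and command_stream objects, the 50 Hz rate, the history length, and the assumption that the policy outputs joint targets which a low-level PD controller turns into torques are mine, not details confirmed by the paper.

```python
import time

CONTROL_HZ = 50  # assumed policy rate; the real controller may run faster or slower

def deployment_loop(policy, robot, command_stream):
    """Run the trained policy onboard: proprioception + future motion commands in,
    joint targets out. No cameras or motion capture are involved."""
    proprio_history = []
    while True:
        # 1. Read only internal sensors (joint encoders and IMU).
        proprio_history.append(robot.read_proprioception())

        # 2. Fetch the window of upcoming target poses streamed from the laptop.
        future_targets = command_stream.next_window()

        # 3. Policy maps (recent history, command sequence) -> joint position targets.
        action = policy(proprio_history[-25:], future_targets)

        # 4. A low-level PD loop converts joint targets into motor torques.
        robot.send_joint_targets(action)
        time.sleep(1.0 / CONTROL_HZ)
```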

Let’s break down the technical innovations that make this work.

Innovation 1: The Extreme-Action Dataset

You cannot learn what you haven’t seen. If you train a robot only on walking data, it will never learn how to use its elbows to stabilize itself.

The authors found that standard datasets were insufficient. To fix this, they used computer vision tools (specifically 4D-Human) to extract 3D motion from internet videos of people breakdancing, doing Jiu-Jitsu, and scrambling on the ground.

Comparison of policy behavior when trained on standard datasets versus extreme datasets.

The difference is stark. In Figure 6 (above), look at the “Policy Behavior” on the left.

  • Top Row (Standard Dataset): The robot tries to crawl but refuses to put its hands on the ground because it wasn’t trained to trust that contact. It awkwardly squats.
  • Bottom Row (Extreme Dataset): The robot confidently plants its hands and knees, effectively crawling.

Innovation 2: The Transformer-Based Command Encoder

How do we tell the robot what to do? The researchers use a Motion Command Sequence. Instead of giving a single velocity target, they feed the robot a “filmstrip” of future keyframes (target poses) for the next few seconds.

This creates a new problem: variable input lengths. Sometimes you might provide a long horizon of future motions, sometimes short. To handle this, they utilize a Transformer Encoder.

Architecture of the Transformer-based command encoder.

As shown in Figure 4, the architecture works as follows:

  1. Input: The network receives a sequence of future target motions (Target 1 to Target N).
  2. Processing: A Multi-headed Attention Encoder processes these targets. This allows the robot to “attend” to specific important keyframes in the future (e.g., “I need to prepare now for that jump happening in 2 seconds”).
  3. Selection: From these embeddings, the system picks the one for the nearest upcoming target, i.e., the keyframe whose remaining time \(t_{left}\) is the smallest positive value, and uses it to guide the current action.
  4. Actor MLP: This encoded command is concatenated with the robot’s proprioception history (its record of its own recent joint angles and velocities) and fed into a Multi-Layer Perceptron (MLP) to generate motor actions. A minimal sketch of this architecture follows below.
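
For readers who want to see the shape of this component, here is a minimal PyTorch sketch of a transformer command encoder feeding an actor MLP. The layer sizes, head count, input dimensions, and the exact mechanism for picking the nearest-future embedding are assumptions for illustration; the paper’s hyperparameters and implementation details may differ.

```python
import torch
import torch.nn as nn

class CommandEncoderPolicy(nn.Module):
    """Sketch: transformer encoder over future targets + actor MLP."""

    def __init__(self, target_dim=64, proprio_dim=256, embed_dim=128, action_dim=23):
        super().__init__()
        self.target_proj = nn.Linear(target_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.actor = nn.Sequential(
            nn.Linear(embed_dim + proprio_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, targets, t_left, proprio):
        # targets: (B, N, target_dim) future keyframes; N can vary between calls
        # t_left:  (B, N) time remaining until each keyframe
        # proprio: (B, proprio_dim) encoded proprioception history
        tokens = self.encoder(self.target_proj(targets))            # (B, N, E)

        # Attend over all future targets, then select the embedding of the
        # nearest *upcoming* keyframe: the smallest strictly positive t_left.
        masked = torch.where(t_left > 0, t_left, torch.full_like(t_left, float("inf")))
        idx = masked.argmin(dim=1)                                   # (B,)
        chosen = tokens[torch.arange(tokens.size(0)), idx]           # (B, E)

        return self.actor(torch.cat([chosen, proprio], dim=-1))      # motor actions
```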

Why a sequence?

You might ask, “Why not just give the robot the next frame?”

Comparison of single-frame vs. multi-frame command tracking.

Figure 7 demonstrates why looking ahead is vital.

  • Single Frame Command (Middle Row): The robot acts reactively. It sees the next pose and tries to jerk its body to match it instantly. This leads to instability and falling.
  • Multi Frame Command (Bottom Row): The robot sees the entire movement arc. It can prepare its momentum to swing its legs or shift its weight smoothly, resulting in successful motion tracking.

Innovation 3: Handling Physically Infeasible Commands

Here is a subtle but difficult problem: Human motion capture data is “floaty.” A human in a video might jump, and the MoCap data simply says the body moves up. But a robot has to generate force against the ground to move up.

Furthermore, because the robot’s body shape is different from a human’s, the retargeted motion might be physically impossible (e.g., the target pose might require the robot’s hand to be inside the floor).

A robot performing a breakdance move where the reference pose is floating.

In Figure 3, the robot is performing a breakdance flare. The reference motion (the “ghost” or target) might be slightly floating or penetrating the ground due to retargeting errors. The robot cannot simply “teleport” to that pose.

The policy must learn to be compliant. It needs to try to follow the command but prioritize physics and balance. If the command says “put hand through floor,” the policy must learn to “put hand on floor and push.” This requires a very specific reward strategy during training.

Innovation 4: Advantage Mixing for Multi-Critic RL

This is the heaviest technical contribution of the paper. In Reinforcement Learning (RL), we design a “reward function” to tell the robot when it’s doing a good job.

For this task, the rewards are conflicting:

  1. Task Reward (Sparse): Did you reach the target pose? (This only happens occasionally).
  2. Regularization Reward (Dense): Don’t use too much energy. Don’t jerk the motors. Keep joint velocities low. (This is evaluated at every control step; a toy sketch of both groups follows below.)
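
To make the conflict concrete, here is a toy sketch of the two groups. The specific terms, scales, and the tracking sigma are illustrative placeholders, not the paper’s values (those are listed in the reward table discussed later in this post).

```python
import numpy as np

def task_reward(state, target):
    """Sparse-ish tracking term: only meaningful when a target keyframe is due."""
    if target is None:  # no keyframe to hit at this control step
        return 0.0
    pose_error = np.linalg.norm(state["joint_pos"] - target["joint_pos"])
    return float(np.exp(-pose_error ** 2 / 0.25))  # illustrative sigma

def regularization_reward(state, prev_action, action):
    """Dense terms, computed every step: penalize energy use and jerky actuation."""
    torque_penalty = -1e-4 * float(np.sum(state["torques"] ** 2))
    smoothness_penalty = -1e-2 * float(np.sum((action - prev_action) ** 2))
    return torque_penalty + smoothness_penalty

# Naive sum: the dense penalties arrive every step, while the tracking bonus is
# occasional, so "stand still and save energy" can look like the best strategy.
```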

If you just add these numbers together, the dense regularization rewards often overpower the sparse task rewards. The robot decides it’s safer to just stand still and save energy than to risk moving to hit a target.

The authors use a technique called Advantage Mixing. Instead of one “Critic” (the network that estimates how good a state is) summing all rewards, they use three separate Critics:

  1. Task Critic
  2. Regularization Critic
  3. Safety Critic

Each critic learns independently.

Equation for multi-critic loss function.

Equation (1) shows that each critic \(V_{\Psi^{(i)}}\) minimizes its own prediction error for its specific reward group \(r^{(i)}\).

Then, when updating the Actor (the policy), they mix the “advantages” (how much better an action was than expected) using weighted averages:

Equation for advantage mixing.

In Equation (2), \(A_i\) is the advantage computed from critic \(i\)’s value estimates. By normalizing each group’s advantages and combining them with weights \(w_i\), the researchers ensure that the task, regularization, and safety signals all shape the policy without any one group drowning out the others.
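
In code, the idea boils down to a few lines. The sketch below assumes a PPO-style update with one advantage estimate per critic (e.g. computed via GAE against that critic’s own value head); the per-group normalization and the weights are illustrative rather than the paper’s exact formulation.

```python
import numpy as np

def mixed_advantage(advantages, weights):
    """Combine per-critic advantages (task, regularization, safety) into one
    signal for the policy update."""
    mixed = 0.0
    for name, adv in advantages.items():
        # Normalizing each group separately keeps a dense, small-magnitude
        # group from drowning out the sparse task group (and vice versa).
        adv_norm = (adv - adv.mean()) / (adv.std() + 1e-8)
        mixed = mixed + weights[name] * adv_norm
    return mixed

# Example: three critics, one advantage value per timestep.
advantages = {
    "task":           np.array([0.0, 0.0, 2.5, 0.0]),      # sparse, occasionally large
    "regularization": np.array([-0.3, -0.2, -0.4, -0.1]),  # dense, small
    "safety":         np.array([0.0, -1.0, 0.0, 0.0]),
}
weights = {"task": 1.0, "regularization": 0.3, "safety": 0.5}  # illustrative
A_mixed = mixed_advantage(advantages, weights)  # plugged into the PPO policy loss
```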

Graph showing success rate of Multi-Critic vs Single-Critic.

The impact of this math is visible in Figure 8. The Multi-Critic approach (blue line) learns faster and reaches a higher success rate compared to a standard Single-Critic approach (orange line).

Real-World Experiments

The team deployed this system on a Unitree G1 humanoid. The setup was “blind”—the robot did not use external cameras or motion capture to know where it was in the room. It relied entirely on its internal sensors (motor encoders and IMU) and the command sequence.

Hardware setup with Unitree G1 and high-level command laptop.

As seen in Figure 5, the policy runs on the internal Nvidia Jetson NX. An external laptop acts as the “coach,” sending the high-level motion plan (the breakdancing moves or crawl sequence).

The Results

The robot successfully performed motions that are rarely seen in humanoid robotics:

  • Getting up from the ground: Recovering from a fall.
  • Rolling: Continuous ground contact with the torso.
  • Standing Dance: Complex upper-body movement while maintaining balance.

Table 2 (below) provides a glimpse into the simulation parameters used to achieve this. Note the massive number of parallel robots (4096) used to gather enough experience.

Table showing simulation parameters like 4096 robots and domain randomization ranges.

They also utilized extensive Domain Randomization. They varied the robot’s mass, motor friction, and sensor delays during training. This ensures that even though the simulation is imperfect, the policy becomes robust enough to handle the chaotic real world.
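
The pattern is straightforward to sketch in code: resample a handful of physical parameters at the start of every episode. The ranges and the hypothetical sim interface below are placeholders, not the values from Table 2.

```python
import numpy as np

def randomize_episode(sim, rng):
    """Resample physics parameters each episode so the policy never overfits
    to one particular 'version' of the simulator."""
    sim.set_base_mass_offset(rng.uniform(-1.0, 1.0))     # kg, placeholder range
    sim.set_ground_friction(rng.uniform(0.4, 1.2))       # placeholder range
    sim.set_motor_strength_scale(rng.uniform(0.9, 1.1))  # per-motor gain scale
    sim.set_observation_delay(rng.integers(0, 3))        # control steps of sensor lag
    sim.apply_random_push(force=rng.uniform(0.0, 50.0))  # occasional shove, in newtons

rng = np.random.default_rng(0)
# randomize_episode(my_sim, rng)  # called once per episode in each of the 4096 envs
```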

The Reward Structure

For those interested in the specifics of the RL tuning, the authors detailed their reward terms.

Table showing reward terms for tracking, regularization, and safety.

Equation for the Gaussian kernel used in rewards.

They use a Gaussian kernel (Equation 3) for tracking rewards. This means the reward is 1.0 if the error is zero, and it decays exponentially as the error grows. This encourages precision without causing numerical instability when errors are huge.
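
In code, a tracking term of this shape is essentially a one-liner. The form below, \(\exp(-\lVert e \rVert^2 / \sigma^2)\), is one common variant; the exact scaling used in the paper may differ slightly.

```python
import numpy as np

def gaussian_tracking_reward(error, sigma):
    """1.0 at zero error, smooth exponential decay, and never worse than 0
    no matter how large the error gets."""
    return float(np.exp(-np.sum(np.square(error)) / sigma ** 2))

print(gaussian_tracking_reward(np.zeros(3), sigma=0.5))          # 1.0
print(gaussian_tracking_reward(np.array([0.5, 0.0, 0.0]), 0.5))  # ~0.37
print(gaussian_tracking_reward(np.array([5.0, 0.0, 0.0]), 0.5))  # ~0.0, no blow-up
```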

Conclusion: The Future of Humanoid Motion

The paper “Embrace Contacts” pushes the boundary of what we consider “locomotion.” By creating a pipeline that handles variable future commands, imperfect datasets, and complex physical interactions, the authors have demonstrated that humanoids don’t need to be restricted to walking.

The key takeaways are:

  1. Data Matters: To do cool tricks, you need “extreme” training data.
  2. Look Ahead: Transformer-based prediction of future motions stabilizes complex maneuvers.
  3. Mix Your Critics: Separating the rewards for task, regularization, and safety lets robots learn difficult motions without getting “lazy” or “fearful” during training.

This research opens the door for humanoids that can operate in truly unstructured environments—crawling through search-and-rescue tunnels, recovering from slips on ice, or perhaps just showing off on the dance floor.

The robot isn’t just walking anymore; it’s moving.