The dream of general-purpose humanoid robots is inching closer to reality. We see the hardware improving rapidly—robots that can walk, carry boxes, and withstand shoves. But the “brain” of the robot, the policy that tells it how to dexterously manipulate objects like a cup or a screwdriver, remains a bottleneck.
The standard approach to teaching robots involves Imitation Learning (IL). You teleoperate a robot (control it remotely), record the data, and train a neural network to mimic those movements. It works, but it is painfully slow, expensive, and difficult to scale. You need a physical robot, a skilled operator, and endless hours of tedious repetition.
But consider this: Humans are essentially “biological humanoids.” We have two arms, a head, and binocular vision. We manipulate objects effortlessly. What if we could just record humans doing tasks and feed that data directly into a humanoid robot?
This is the premise of the research paper “Humanoid Policy \(\sim\) Human Policy”. The authors propose a unified framework where human egocentric data isn’t just a reference—it is treated as training data for a different robotic “embodiment.” By collecting a massive dataset of humans performing tasks and aligning it with robot data, they achieved a policy that is not only cheaper to train but significantly more robust to environmental changes.

In this post, we will break down how they bridged the gap between human hands and robot grippers, the clever engineering behind their data collection, and why treating a human as a robot might be the key to generalizable AI.
The Data Bottleneck and the Human Solution
To understand the magnitude of this contribution, we first need to look at the problem of Cross-Embodiment Learning.
In robotics, “embodiment” refers to the physical form of the agent—its size, joint limits, and sensor locations. Usually, if you want to train a specific robot (let’s call it Robot A), you need data collected on Robot A. If you try to use data from Robot B, it often fails because the joint angles and camera perspectives don’t match.
Humans represent an extreme case of a “different embodiment.” Our hands are soft and have five dexterous fingers; robot grippers are often rigid and may only have two or three contact points. Our heads move freely and constantly; robot heads are usually fixed or servo-controlled.
Traditionally, researchers try to bridge this gap using intermediate representations. They might extract “affordances” (where to grasp an object) or object keypoints (tracking the mug rather than the hand). The problem is that these modular steps introduce errors. If your object tracker fails, your policy fails.
This paper takes a bolder approach: End-to-End Learning. They hypothesize that with enough data and the right alignment, a neural network can learn to map human visual inputs and hand movements directly to robot actions, treating the human as just another type of robot.
Introducing PH²D: Physical Humanoid-Human Data
The first hurdle was the data itself. There are plenty of datasets of humans doing things (like cooking or dancing), but they aren’t “task-oriented” in a way robots understand. A video of someone chopping onions is too high-level. A robot needs to know exactly how the wrist rotates and where the fingers go.
The researchers introduced PH²D, a large-scale, egocentric, task-oriented dataset.
How do you collect high-quality 3D hand poses and egocentric video without a multi-million dollar motion capture studio? The authors realized that consumer-grade Virtual Reality (VR) headsets, like the Apple Vision Pro and Meta Quest 3, have solved this problem for us. These devices have incredible inside-out tracking cameras and sophisticated hand-tracking algorithms built-in.

As shown in the image above, the setup is surprisingly accessible. By strapping a VR headset on a human operator, the researchers could record:
- Visual Data: High-resolution video from the operator’s perspective (mimicking the robot’s head camera).
- Proprioceptive Data: Accurate 3D position and rotation of the head and wrists, plus finger joint angles.
This method completely removes the robot from the data collection loop. A human can record hundreds of “pick and place” demonstrations in a kitchen, a lab, or an office in a fraction of the time it would take to teleoperate a robot to do the same.
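Concretely, each recorded timestep can be thought of as a small bundle of synchronized streams. The field names and shapes below are illustrative assumptions about what a PH²D-style sample might contain, based on the description above; they are not the dataset’s actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EgoDemoFrame:
    """One timestep of an egocentric human demonstration (illustrative schema)."""
    rgb: np.ndarray             # (H, W, 3) image from the headset's forward-facing camera
    head_pose: np.ndarray       # (7,) head position (xyz) + orientation quaternion
    left_wrist: np.ndarray      # (7,) left wrist pose in the same reference frame
    right_wrist: np.ndarray     # (7,) right wrist pose
    hand_keypoints: np.ndarray  # (2, K, 3) 3D keypoints per hand; K depends on the headset's tracker
    timestamp: float            # seconds since the start of the demonstration
```

A full demonstration is then just a time-ordered list of these frames, which is all the downstream policy needs.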
The Core Method: Human Action Transformer (HAT)
Data is useless without a model to process it. The authors introduced the Human Action Transformer (HAT). This is a policy architecture designed to ingest data from multiple embodiments (humans and robots) and output actions that a humanoid can execute.
1. The Unified State-Action Space
The most critical insight of HAT is the Unified State-Action Space.
To train a single model on both human and robot data, the inputs and outputs must look the same mathematically.
- Vision: The robot has head cameras; the human has a VR headset. The images look different (lighting, room, arm appearance). To solve this, the authors use a “frozen” visual encoder, DINOv2, which is known for ignoring surface-level noise (like texture or lighting) and focusing on the semantic structure of a scene. It essentially tells the network, “There is a cup here,” regardless of whether the image comes from a headset’s camera or a robot’s head camera.
- Proprioception (Body Sense): The robot has joint encoders; the human has tracked 3D poses. The authors unified this by mapping everything to End-Effector (EEF) poses. Instead of worrying about the elbow angle, they track where the wrist is and where the fingertips are in 3D space (see the sketch after this list).
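To make the unified state concrete, here is a minimal sketch of how both embodiments could be mapped into one shared observation vector. The DINOv2 backbone comes from the public `facebookresearch/dinov2` torch.hub release; the function name, feature dimension, and proprioception layout are illustrative assumptions, not the paper’s exact implementation.

```python
import torch

# Frozen DINOv2 backbone (public torch.hub release; the paper's exact variant
# and preprocessing may differ).
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

def unified_state(image, head_pose, left_eef, right_eef, hand_state):
    """Map either a human or a robot observation into one shared state vector.

    image:      (1, 3, 224, 224) normalized egocentric frame (headset or robot head camera)
    head_pose:  (7,)  head position + orientation
    left_eef:   (7,)  left wrist / end-effector pose
    right_eef:  (7,)  right wrist / end-effector pose
    hand_state: (K,)  finger joint angles (robot) or retargeted hand keypoints (human)
    """
    with torch.no_grad():
        visual = encoder(image).squeeze(0)   # (384,) class-token feature for ViT-S/14
    proprio = torch.cat([head_pose, left_eef, right_eef, hand_state])
    return torch.cat([visual, proprio])      # identical layout for both embodiments
```

The important point is that after this step, the network never needs to know whether a sample came from a human or a robot.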

As illustrated in Figure 3, the architecture handles two streams:
- Robot data: ground-truth robot trajectories collected by a human teleoperating the robot.
- Human demonstrations: the egocentric VR data from PH²D.
Both are fed into the HAT model. The output predicts the future trajectory of the hands (wrist position + rotation) and the gripper/finger state.
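Here is a rough sketch of what such a policy might look like. It is not the authors’ exact architecture; the layer counts, dimensions, and action-chunk decoding below are my own assumptions, but it shows the key idea: a transformer that consumes the unified visual + proprioceptive state and emits a short chunk of future actions in the shared action space.

```python
import torch
import torch.nn as nn

class HATPolicySketch(nn.Module):
    """Minimal transformer policy in the spirit of HAT (dimensions are illustrative)."""

    def __init__(self, visual_dim=384, proprio_dim=40, action_dim=40,
                 d_model=256, horizon=16):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.proprio_proj = nn.Linear(proprio_dim, d_model)
        # One learned query per future timestep in the predicted action chunk.
        self.action_queries = nn.Parameter(torch.randn(horizon, d_model))
        self.backbone = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, visual_feat, proprio):
        # visual_feat: (B, visual_dim) frozen DINOv2 features
        # proprio:     (B, proprio_dim) unified head/wrist/finger state
        ctx = torch.stack([self.visual_proj(visual_feat),
                           self.proprio_proj(proprio)], dim=1)   # (B, 2, d_model)
        queries = self.action_queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        out = self.backbone(tgt=queries, memory=ctx)             # (B, horizon, d_model)
        return self.action_head(out)                             # (B, horizon, action_dim)
```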
2. Retargeting and “Slowing Down”
There is a physical “dynamic gap” between humans and robots.
- Speed: Humans are fast. We snatch objects quickly. Humanoid robots are generally slower and more careful to avoid damaging their motors or the environment.
- Body Movement: Humans move their whole torso when reaching; robots might just move their arms.
If you train a robot directly on raw human speed, the robot will try to jerk its arms violently, triggering safety stops or causing damage.
To fix this, the researchers applied Interpolation. They calculated that humans are roughly 4x faster than the safe operating speed of the humanoid robots used. During training, they interpolate the human data to “slow it down,” effectively stretching the human timeline to match the robot’s capabilities.
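As a concrete illustration, the “slowing down” step can be as simple as resampling each human trajectory on a denser time grid. This is a minimal sketch under my own assumptions (per-dimension linear interpolation, a flat 4x factor); a real implementation would want proper quaternion interpolation for the rotation components.

```python
import numpy as np

def slow_down(traj: np.ndarray, factor: float = 4.0) -> np.ndarray:
    """Stretch a human trajectory in time so it plays back `factor` times slower.

    traj: (T, D) array of actions (e.g. wrist poses + finger state) sampled at a
          fixed rate. Returns a ((T - 1) * factor + 1, D) array resampled on a
          denser time grid, so the same motion spans ~4x as many robot timesteps.
    """
    T, D = traj.shape
    t_src = np.arange(T)
    t_dst = np.linspace(0, T - 1, int((T - 1) * factor) + 1)
    return np.stack([np.interp(t_dst, t_src, traj[:, d]) for d in range(D)], axis=1)
```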
3. The Objective Function
The training process minimizes the difference between the predicted action and the ground truth action. The loss function is defined as:
\[
\mathcal{L} = \ell_1\big(\pi(\boldsymbol{s}_i), \boldsymbol{a}_i\big) + \lambda \cdot \ell_1\big(\pi(\boldsymbol{s}_i)_{\mathrm{EEF}}, \boldsymbol{a}_{i,\mathrm{EEF}}\big)
\]

Let's break this equation down:
- \(\mathcal{L}\): The total loss (error) the network tries to minimize.
- \(\ell_1\): A standard error metric (Mean Absolute Error) measuring the distance between prediction and reality.
- The first term, \(\ell _ { 1 } ( \pi ( \boldsymbol { s } _ { i } ) , \boldsymbol { a } _ { i } )\), measures the error across the entire action vector (fingers, wrist, head).
- The second term focuses specifically on the End Effector (EEF)—the hand position.
- \(\lambda\) (lambda): A weighting factor (set to 2 in this paper).
Why is this important? By adding the second term and multiplying it by \(\lambda\), the researchers are telling the AI: “Getting the finger joints right is good, but getting the hand position right is twice as important.” In manipulation, if your hand is in the wrong place, it doesn’t matter what your fingers are doing—you will miss the object.
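In code, the loss is straightforward. The sketch below assumes the end-effector (wrist pose) dimensions can be sliced out of the action vector by index, with \(\lambda = 2\) as cited above.

```python
import torch
import torch.nn.functional as F

def hat_loss(pred, target, eef_idx, lam=2.0):
    """L1 loss over the full action vector, plus an extra-weighted L1 term on the
    end-effector (wrist pose) dimensions, following the equation above.

    pred, target: (B, horizon, action_dim) predicted vs. ground-truth action chunks
    eef_idx:      indices of the wrist-pose dimensions inside the action vector
    lam:          weighting factor lambda (the post cites 2)
    """
    full = F.l1_loss(pred, target)
    eef = F.l1_loss(pred[..., eef_idx], target[..., eef_idx])
    return full + lam * eef
```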
Experiments: Does it Work on Real Hardware?
The researchers tested HAT on two real humanoid robots, labeled Humanoid A (Unitree H1) and Humanoid B (Unitree H1-2). These robots have different arm configurations, which makes them perfect for testing cross-embodiment generalization.

They evaluated the system on four core tasks: Cup Passing, Horizontal Grasping, Vertical Grasping, and Pouring.

Result 1: Human Data Improves Robustness (O.O.D. Generalization)
The most striking result came from “Out-Of-Distribution” (O.O.D.) tests.
- In-Distribution (I.D.): Testing the robot in the exact same lab setup, lighting, and table arrangement as the robot training data.
- Out-Of-Distribution (O.O.D.): Changing the tablecloth, moving the objects to new spots, or using objects with different colors.
When trained only on robot data, the policy (ACT baseline) struggled with O.O.D. scenarios. It memorized the specific look of the lab. However, when co-trained with the massive, diverse human dataset (PH²D), the policy became significantly more robust to these changes.

In Table 2 above, look at the O.O.D. column.
- ACT (Robot only): 59/170 success.
- HAT (Robot + Human): 101/170 success.
This is a massive leap. Because the human data contained various rooms, lighting conditions, and table textures (from different VR sessions), the visual encoder learned to ignore the background and focus on the task (the hand and the object).
Result 2: Generalizing Object Placement
A common failure mode in robot learning is “spatial overfitting.” If you only train a robot to pick up a cup from the center of the table, it often fails if the cup is moved 10cm to the left.
The researchers visualized this using a grid heatmap for the “Vertical Grasping” task.

In the image above, the red dashed boxes indicate where robot data was collected.
- Left (Robot Only): The robot is decent inside the dashed box but terrible everywhere else (lots of 0s and 1s).
- Right (Mixed Data): The robot can grasp objects across almost the entire table (lots of 7s, 8s, and 9s).
Because human operators naturally stand in different places and reach for objects differently, the human dataset filled in the spatial gaps that the expensive robot teleoperation missed.
Result 3: Few-Shot Transfer to New Robots
What if you buy a new robot (Humanoid B) that has slightly different arms than your old one (Humanoid A)? Do you have to start from scratch?
The researchers showed that by pre-training on Humanoid A + Human Data, they could adapt to Humanoid B with very few demonstrations.

As Figure 5 shows, with just 10 demonstrations on the new robot:
- Training from scratch yielded ~40% success.
- Co-training with Human priors yielded ~80% success.
The model learned the “concept” of the task from humans and Humanoid A, and only needed a tiny bit of data to adjust to the specific motor kinematics of Humanoid B.
Why This Matters: The Efficiency Equation
Perhaps the most practical takeaway for future research is the efficiency comparison. We all know training robots is hard, but how much harder is it than recording humans?

Table 5 reveals the stark reality:
- Human Demo (with VR): ~4 to 5 seconds per task.
- Robot Teleoperation: ~20 to 37 seconds per task.
Not only is teleoperation 5x to 7x slower per attempt, but it also requires the physical robot to be present, powered on, and maintained. VR data collection can be distributed to hundreds of people in their own homes.
Conclusion
The paper “Humanoid Policy \(\sim\) Human Policy” challenges the notion that robot data is the only data that matters for robots. By treating humans as a “different type of robot” and aligning the data via a unified state-action space (HAT) and visual encoders, we can unlock massive scale.
The key takeaways for students and researchers are:
- Don’t ignore human data: It is the cheapest, most diverse source of manipulation data available.
- Visual representations matter: Using strong, pre-trained encoders (like DINOv2) is essential for bridging the visual gap between a VR headset’s view and a robot’s camera.
- End-to-End is viable: You don’t necessarily need complex object detectors or affordance models. With enough data, a transformer can map pixels to low-level actions directly, even across embodiments.
As we look toward a future of general-purpose household robots, methods like HAT suggest that the best way to teach robots how to live in our world might be to let them watch us live in it first.