Imagine you are in a kitchen and you spot a coffee mug on the counter. You don’t just “drive” your body to the counter like a tank and then extend your arm like a crane. You walk towards it, likely adjusting your stride, leaning your torso forward, and extending your hand—all in one fluid, coordinated motion. Your eyes lock onto the target, and your body follows.
For humanoid robots, however, this seamless integration of navigation (walking to a place) and manipulation (reaching for things) is incredibly difficult. Historically, roboticists have treated these as separate problems: mobile bases handle the 2D navigation, and once the robot stops, a manipulator arm handles the reaching.
But humanoids are not mobile bases; they are articulated systems that can crouch, lean, and step over obstacles. To truly be useful, a humanoid needs to coordinate its eyes, hands, and feet simultaneously.
In this post, we are diving into a research paper titled “Hand-Eye Autonomous Delivery (HEAD)”. The researchers propose a new framework that teaches humanoid robots to navigate, locomote, and reach by learning directly from human motion and vision data. The result is a system that allows a Unitree G1 robot to spot an object in a room, walk over to it, and touch it—navigating complex 3D environments just as we would.

The Disconnect in Humanoid Control
Before we look at the solution, we need to understand the problem. Traditional robot navigation treats the robot as a cylinder or a point on a 2D map. This works great for a Roomba, but it limits a humanoid. A humanoid can step over clutter or squeeze through narrow gaps.
Conversely, “whole-body control” (WBC) usually focuses on balance and tracking specific joint angles, often without a high-level understanding of where the robot needs to go in a large room.
The authors of HEAD argue that to bridge this gap, we shouldn’t try to train one massive neural network to do everything from pixels to torques. Instead, they propose a modular approach that decouples “seeing” from “moving,” connected by a very specific interface: the 3-point track.
The HEAD Architecture
The core philosophy of HEAD is “Hand-Eye Delivery.” The robot’s job is to “deliver” its hands and eyes to a specific target location.
As shown in the system overview below, the framework is split into two main levels:
- High-Level Policy (The Planner): This part sees the world through RGB cameras. It decides where the robot’s head (eyes) and hands need to be in the future to reach the goal.
- Low-Level Policy (The Controller): This is a whole-body controller that receives those target positions (head and hands) and figures out how to move the robot’s 27 joints to get them there while maintaining balance.

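To make that interface concrete, here is a minimal sketch of the three-point hand-off between the two levels. The class and function names are my own, not the paper's code, and orientations are omitted for brevity:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ThreePointTarget:
    """The narrow interface between the two levels: where the head and both
    hands should be. (Illustrative only; target orientations are omitted.)"""
    head: np.ndarray        # (3,) desired head/camera position
    left_hand: np.ndarray   # (3,) desired left-hand position
    right_hand: np.ndarray  # (3,) desired right-hand position

def control_tick(high_level, low_level, camera_image, goal_pixel, proprioception):
    """One conceptual control tick: the planner turns pixels and a clicked goal
    into 3-point targets; the whole-body controller turns those targets plus
    the robot's own joint state into joint commands."""
    target: ThreePointTarget = high_level(camera_image, goal_pixel)
    return low_level(target, proprioception)
```

Everything above and below this interface can be trained independently, which is exactly what makes the modular recipe work.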
This separation is brilliant because it allows the researchers to train the vision system using video data (which is plentiful) and the movement system using motion capture data, which is collected separately, rather than needing a massive, impossible-to-collect dataset of robots doing everything perfectly.
Let’s break down these components in detail.
1. The Low-Level: Whole-Body Motion Controller
The foundation of this system is the robot’s ability to move naturally. The researchers used Reinforcement Learning (RL) to train a policy that controls the robot’s joints.
The input to this policy is sparse: it only knows the target positions and orientations for three points: the head (eyes), the left hand, and the right hand. The policy’s job is to figure out the complex legwork and body balancing required to move those three points to the targets.
Learning from Human Motion (The GAN Approach)
To ensure the robot moves like a human (and not like a glitching video game character), the team used a dataset of human motion capture (MoCap). However, simply “cloning” human motion is rigid. The robot needs to be able to mix and match skills—for example, walking while holding a hand steady.
They employed a GAN-like (Generative Adversarial Network) framework. In this setup, the control policy tries to generate movements, while “discriminators” (critics) try to tell if the movement looks like real human motion.

Here is the key innovation: Decoupled Discriminators.
If you use a single discriminator for the whole body, the robot learns correlations that might not always be useful. For example, humans usually swing their arms when walking. If the robot learns that “walking = arm swinging,” it will struggle to carry a cup of coffee without spilling it.
To fix this, the researchers split the discriminators:
- Upper-body discriminator: Judges if the torso and arms look natural.
- Lower-body discriminator: Judges if the legs and gait look natural.
This allows the robot to combine a “walking” lower body with a “reaching” upper body, effectively mixing skills to suit the task.
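As a rough sketch of what "decoupled discriminators" means in practice (layer sizes and feature dimensions below are illustrative, not the paper's), each half of the body gets its own small critic that only ever sees that half's state transitions:

```python
import torch
import torch.nn as nn

def critic_mlp(in_dim, hidden=(512, 256)):
    """Small MLP that outputs a single 'real human motion vs. generated' score."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)

# Hypothetical feature sizes: each discriminator sees a (s_t, s_{t+1}) transition
# for its half of the body only.
UPPER_DIM = 2 * 30   # upper-body joint features, stacked over two timesteps
LOWER_DIM = 2 * 24   # lower-body joint features, stacked over two timesteps

upper_disc = critic_mlp(UPPER_DIM)
lower_disc = critic_mlp(LOWER_DIM)

upper_score = upper_disc(torch.randn(1, UPPER_DIM))
lower_score = lower_disc(torch.randn(1, LOWER_DIM))
```

Because neither critic can see the other half of the body, a human-like gait can never veto an unusual arm pose, and vice versa.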
The training objective is mathematically represented as a multi-objective optimization:

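In generic form (the paper's exact weights and notation may differ), such an objective folds the two reward streams into the standard discounted RL return:

$$
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\,\big(w_{\text{imit}}\, r^{\text{imit}}_{t} + w_{\text{task}}\, r^{\text{task}}_{t}\big)\right]
$$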
Here, the system balances the imitation rewards (looking human) with the task rewards (reaching the target). Speaking of imitation, the reward function for the discriminators looks like this:

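A typical adversarial imitation reward of this kind, shown here in its common AMP-style form (the paper's details may differ), rewards the policy for fooling the discriminator:

$$
r^{\text{imit}}_{t} = -\log\!\big(1 - D_{\phi}(s_{t},\, s_{t+1})\big)
$$

where $D_{\phi}$ scores whether a state transition looks like it came from the MoCap data. With decoupled discriminators, one such term comes from the upper-body critic and one from the lower-body critic.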
And to ensure the robot actually does what it’s told, a goal-directed reward penalizes the robot if its head and hands drift from the target:

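A common way to write such a tracking reward (again, the exact formulation in the paper may differ) is an exponential penalty on the distance between each tracked point and its target:

$$
r^{\text{task}}_{t} = \exp\!\Big(-\alpha \sum_{i \in \{\text{head},\, \text{left hand},\, \text{right hand}\}} \big\lVert p^{i}_{t} - \hat{p}^{i}_{t} \big\rVert^{2}\Big)
$$

where $p^{i}_{t}$ is the current position of point $i$ and $\hat{p}^{i}_{t}$ is its commanded target.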
By combining these, the low-level controller becomes a robust “muscle memory” for the robot, capable of walking, crouching, and reaching just by tracking three points in space.
2. The High-Level: Navigation & Reaching
Now that the robot has a capable body, it needs a brain to guide it. The high-level policy is divided into two modules: Navigation (getting close) and Reaching (touching the object).
The Navigation Module
The navigation module operates on visual data. It takes an image from the robot’s head camera and a user-selected 2D goal point (e.g., clicking on a toy in the image).
It uses a Transformer-based architecture. The image is processed to extract features, which are combined with the goal point. The Transformer then predicts a future trajectory for the camera (the “eyes”).
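A minimal sketch of this kind of goal-conditioned Transformer planner might look like the following; the dimensions, the goal-token readout, and the 8-step horizon are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class NavPlanner(nn.Module):
    """Illustrative goal-conditioned planner: fuse image features with the
    clicked 2D goal and predict a short future camera trajectory."""
    def __init__(self, feat_dim=256, horizon=8):
        super().__init__()
        self.goal_embed = nn.Linear(2, feat_dim)      # embed the clicked (u, v) goal
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(feat_dim, horizon * 3)  # future (x, y, z) waypoints
        self.horizon = horizon

    def forward(self, img_tokens, goal_uv):
        # img_tokens: (B, N, feat_dim) patch features from a vision backbone
        # goal_uv:    (B, 2) normalized pixel coordinates of the selected goal
        tokens = torch.cat([self.goal_embed(goal_uv).unsqueeze(1), img_tokens], dim=1)
        fused = self.encoder(tokens)
        traj = self.head(fused[:, 0])                 # read out from the goal token
        return traj.view(-1, self.horizon, 3)

# Smoke test with random tensors
planner = NavPlanner()
waypoints = planner(torch.randn(1, 64, 256), torch.rand(1, 2))
print(waypoints.shape)  # torch.Size([1, 8, 3])
```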

The Data Challenge: Training a visual navigation system requires massive amounts of data. Collecting this on a physical robot is slow and expensive.
The Solution: The researchers utilized Aria Glasses—wearable smart glasses that capture ego-centric video. This allowed them to use hours of human videos (people walking around kitchens, cleaning, etc.) as training data.
However, a human doesn’t see exactly what a robot sees. To bridge this domain gap, they applied geometric transformations (undistortion and homography) to the human video to make it look like it was filmed by the robot’s fisheye camera. This clever data augmentation allowed the robot to “hallucinate” that it had hours of navigation experience before it even took its first step.
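Here is a toy version of that re-projection step, assuming simple pinhole models and a pure rotation between viewpoints; the paper's actual pipeline targets a fisheye camera model, so the real transformation is more involved:

```python
import cv2
import numpy as np

def human_to_robot_view(frame, K_human, dist_human, K_robot, R):
    """Illustrative re-projection of a human egocentric frame into the robot
    camera's geometry: undistort with the source intrinsics, then warp with a
    homography built from the rotation between the two viewpoints.
    (All matrices here are placeholders, not values from the paper.)"""
    h, w = frame.shape[:2]
    undistorted = cv2.undistort(frame, K_human, dist_human)
    # Pure-rotation homography between two pinhole cameras: H = K_robot · R · K_human⁻¹
    H = K_robot @ R @ np.linalg.inv(K_human)
    return cv2.warpPerspective(undistorted, H, (w, h))
```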
Transitioning to Reaching
Once the robot navigates close enough to the target, the system switches to the Reaching Module.

At this stage, the robot uses a second, downward-facing camera (RGB-D). The system detects the object in this new view and calculates the precise 3D position of the hand required to touch it. It uses Model-Based Inverse Kinematics (IK) to generate the final smooth path for the hand, ensuring the transition from “walking mode” to “fine manipulation mode” is seamless.
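The geometric core of that step is standard back-projection: take the detected pixel, read its depth, and lift it into a 3D camera-frame point that becomes the hand target for the IK solver. A quick sketch with made-up intrinsics:

```python
import numpy as np

def pixel_to_3d(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a detected pixel (u, v) with its depth reading into a 3D
    point in the camera frame using the pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example with placeholder intrinsics for a downward-facing RGB-D camera
hand_target_cam = pixel_to_3d(u=412, v=305, depth_m=0.62,
                              fx=615.0, fy=615.0, cx=320.0, cy=240.0)
```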
Experiments and Results
The researchers deployed HEAD on a Unitree G1 humanoid robot. They tested it in two real-world environments: a lab and a kitchen. Crucially, the kitchen was a “deploy room”—the robot had never seen it during training.

Key Findings
- It Works in the Real World: The system achieved a 71% success rate in reaching novel objects in unseen layouts. This is impressive for a humanoid navigating unstructured obstacles.
- Robustness to Clutter: Unlike wheeled robots, which might get stuck on a rug or be unable to reach over a low stool, the G1 used its full body to navigate.
- The Importance of Mixed Data:
  - Training with only human data failed (success rate ~14%) because the robot is not a human (different height, different speed).
  - Training with only robot data worked in the lab but failed in the new kitchen (overfitting).
  - The Winning Recipe: combining a small amount of robot data with the large-scale human dataset (Aria Digital Twin), as sketched below. This mixture allowed the robot to learn its own body dynamics while gaining general navigation intelligence from humans.
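As a toy illustration of that recipe (the 80/20 split below is a placeholder, not the paper's ratio), a mixed sampler simply draws most training clips from the human dataset and a minority from the robot dataset:

```python
import random

def sample_clip(robot_clips, human_clips, p_robot=0.2):
    """Draw a training clip, mostly from the large human dataset and
    occasionally from the small robot dataset. Ratio is illustrative."""
    source = robot_clips if random.random() < p_robot else human_clips
    return random.choice(source)
```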
Network Architecture
For the machine learning enthusiasts, here is a look at the network structures used for the policy and value networks in the RL phase. Note the use of GRUs (Gated Recurrent Units) to handle the temporal nature of motion.
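As a rough PyTorch sketch of what such a GRU-based policy looks like (sizes here are illustrative; only the 27-dimensional action output matches the joint count mentioned earlier):

```python
import torch
import torch.nn as nn

class GRUPolicy(nn.Module):
    """Recurrent policy sketch: a GRU keeps a memory of recent observations,
    and an MLP head maps its hidden state to joint-level actions."""
    def __init__(self, obs_dim=93, act_dim=27, hidden=256):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 256), nn.ELU(),
                                  nn.Linear(256, act_dim))

    def forward(self, obs_seq, h0=None):
        out, h = self.gru(obs_seq, h0)   # obs_seq: (B, T, obs_dim)
        return self.head(out), h         # actions for every timestep, new hidden state

policy = GRUPolicy()
actions, hidden = policy(torch.randn(1, 10, 93))
print(actions.shape)  # torch.Size([1, 10, 27])
```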

Conclusion & The Future
The HEAD framework demonstrates a significant step forward for humanoid robotics. By moving away from the rigid separation of “platform” and “manipulator,” and instead treating the whole body as a delivery system for the hands and eyes, the researchers achieved more natural and capable behaviors.
Three big takeaways from this work:
- Decoupling is powerful: Separating vision (High-Level) from muscle control (Low-Level) makes the problem solvable.
- Three points are enough: You don’t need to control every joint explicitly from the visual planner. Guiding the head and hands is sufficient to drive complex whole-body behavior.
- Human data is fuel: We can teach robots to navigate by watching humans, provided we mathematically adjust the video to match the robot’s perspective.
While the system has limitations—it currently relies mostly on upper-body tracking and doesn’t explicitly reason about foot placement for complex terrain like stairs—it paves the way for humanoids that can truly live and work alongside us in our cluttered, vertical 3D world.