The dream of telepresence—operating a humanoid robot as if it were your own body—is a staple of science fiction. We imagine wearing a VR headset and having a robot miles away perfectly mimic our movements, walking through a room to pick up a package or perform a repair.
However, the reality of robotics is often clumsier than the dream. While we have made massive strides in robotic control, two persistent “villains” plague humanoid teleoperation: unnatural movement and positional drift.
Most current systems play it safe. They separate the upper body (arms) from the lower body (legs) to prevent the robot from falling over. This results in stiff, robotic movements. Even worse, these systems often operate “open-loop.” They blindly assume the robot moved exactly where the operator told it to. In reality, tiny slips and mechanical imperfections accumulate. After walking ten meters, the operator might think the robot is in front of a table, while the robot is actually two feet to the left, grasping at thin air.
In a new paper titled CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks, researchers introduce a system that tackles these issues head-on. By combining a “Mixture-of-Experts” neural network with real-time LiDAR feedback, they have created a system that allows for fluid, whole-body coordination and stays precisely on track over long distances.

The Core Problems: Coordination and Drift
To understand why CLONE is significant, we first need to understand the limitations it overcomes.
1. The Coordination Gap
Humans move with “whole-body synergy.” When you reach down to pick up a box, you don’t just move your arm; you bend your knees, shift your hips, and lean your torso simultaneously. Traditional teleoperation often decouples these. It stabilizes the legs separately while you control the arms. This makes complex actions—like reaching for a door handle while walking sideways—extremely difficult.
2. The Drift Problem
This is the silent killer of long-horizon tasks. In an open-loop system, the controller sends commands based on where the human is, but it doesn’t get feedback on where the robot actually is in the world.
Imagine crossing a football field with your eyes closed while trying to walk a perfectly straight line. You might believe you are on course, but after 50 meters you will likely have drifted well to one side. Robots suffer from the same issue due to foot slippage and sensor noise. Without a correction mechanism, the robot eventually “hallucinates” its position relative to the operator.
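To make the contrast concrete, here is a minimal simulation (not from the paper) in which per-step foot slip is modeled as Gaussian noise. The step count, slip magnitude, and correction gain are all illustrative assumptions.

```python
# A minimal sketch contrasting open-loop drift with closed-loop correction.
# Per-step lateral slip is modeled as Gaussian noise; all numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
n_steps = 100          # e.g., 100 footsteps across the "football field"
slip_std = 0.02        # hypothetical 2 cm of lateral slip per step

# Open loop: lateral errors simply accumulate (a random walk).
open_loop_error = np.cumsum(rng.normal(0.0, slip_std, n_steps))

# Closed loop: each step, the measured error is partially corrected.
gain = 0.5             # hypothetical feedback gain
closed_loop_error = np.zeros(n_steps)
err = 0.0
for t in range(n_steps):
    err += rng.normal(0.0, slip_std)   # new slip this step
    err -= gain * err                  # feedback pulls error toward zero
    closed_loop_error[t] = err

print(f"open-loop final drift:   {abs(open_loop_error[-1]) * 100:.1f} cm")
print(f"closed-loop final drift: {abs(closed_loop_error[-1]) * 100:.1f} cm")
```

The open-loop error is a random walk that grows without bound, while even a crude feedback gain keeps the closed-loop error bounded near zero.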
The CLONE Solution
The researchers propose CLONE, which stands for Closed-Loop Whole-Body Humanoid Teleoperation. The system uses a minimalist setup: the operator wears a commercial Mixed Reality (MR) headset (such as the Apple Vision Pro), and the system tracks only the operator’s head and hands.
From just these three tracked poses, CLONE generates full-body humanoid motion, from walking to squatting, while ensuring the robot stays exactly where it is supposed to be.
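As a rough sketch, the entire input to a CLONE-style policy can be thought of as three 6-DoF poses. The structure below is illustrative; the field names and shapes are assumptions, not the paper’s interface.

```python
# A minimal sketch of the sparse input a CLONE-style system consumes.
# Field names and conventions are illustrative, not the paper's API.
from dataclasses import dataclass
import numpy as np

@dataclass
class TeleopInput:
    head_pos: np.ndarray        # (3,) headset position in the world frame
    head_rot: np.ndarray        # (4,) headset orientation quaternion
    left_hand_pos: np.ndarray   # (3,)
    left_hand_rot: np.ndarray   # (4,)
    right_hand_pos: np.ndarray  # (3,)
    right_hand_rot: np.ndarray  # (4,)

    def as_observation(self) -> np.ndarray:
        """Flatten the three tracked poses into one policy input vector."""
        return np.concatenate([
            self.head_pos, self.head_rot,
            self.left_hand_pos, self.left_hand_rot,
            self.right_hand_pos, self.right_hand_rot,
        ])
```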

As illustrated above, the system creates a feedback loop. It doesn’t just send commands; it constantly measures the error between the teleoperator and the robot and corrects it in real-time.
The Architecture: How It Works
The CLONE framework is built on three pillars: a specialized dataset, a Teacher-Student training pipeline, and a closed-loop deployment strategy.

1. The Brain: Mixture-of-Experts (MoE)
Controlling a humanoid is a multi-objective problem. The physics of walking are very different from the physics of crouching or standing still. A standard neural network (typically a single Multi-Layer Perceptron, or MLP) tries to learn one strategy for all of these movements, which often leads to mediocre performance across the board. The robot might walk well but fall over when squatting.
CLONE uses a Mixture-of-Experts (MoE) architecture. Imagine a team of specialists rather than one generalist.
- The Experts: The network contains multiple sub-networks (“experts”), each capable of specializing in different types of motion dynamics.
- The Router: A gating network analyzes the current state of the robot and the operator’s command. It then assigns “weights” to the experts, deciding which ones should handle the current moment.
For example, if the operator starts to squat, the router might activate “Expert 3” and “Expert 4,” who specialize in stable, low-center-of-gravity poses. If the operator starts running, “Expert 1” might take over.
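A toy version of this routing logic is sketched below. The experts are reduced to linear maps, and the dimensions and expert count are made up; the point is the pattern of a gating network producing weights that blend the experts’ outputs.

```python
# A toy Mixture-of-Experts controller (illustrative; dimensions and the
# number of experts are assumptions, not the paper's configuration).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoEPolicy:
    def __init__(self, obs_dim, act_dim, n_experts=4, seed=0):
        rng = np.random.default_rng(seed)
        # Each "expert" is a linear map here; in practice, a full MLP.
        self.experts = [rng.normal(0, 0.1, (act_dim, obs_dim))
                        for _ in range(n_experts)]
        # The router scores each expert given the current observation.
        self.router = rng.normal(0, 0.1, (n_experts, obs_dim))

    def forward(self, obs):
        weights = softmax(self.router @ obs)            # per-expert weights
        actions = np.stack([W @ obs for W in self.experts])
        return weights @ actions, weights               # weighted blend

policy = MoEPolicy(obs_dim=21, act_dim=12)
action, w = policy.forward(np.random.default_rng(1).normal(size=21))
print("expert weights:", np.round(w, 2))
```

In the real system each expert is a full network, but the pattern is the same: the router’s weights decide, moment to moment, which specialists shape the action.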
To ensure the network utilizes all its experts efficiently (rather than relying on just one for everything), the researchers implemented a balancing loss function:

This equation ensures that the router distributes the workload evenly across experts during training, preventing a routing collapse in which only a few experts are ever used.
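The paper’s exact formulation appears in the equation above; a standard load-balancing loss of this kind (popularized by the Switch Transformer) takes the form

$$
\mathcal{L}_{\text{balance}} = N \sum_{i=1}^{N} f_i \cdot P_i,
$$

where \(N\) is the number of experts, \(f_i\) is the fraction of inputs routed to expert \(i\), and \(P_i\) is the average routing probability the gate assigns to expert \(i\). The loss is smallest when the load is spread uniformly, so its gradient nudges the router toward an even split.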
2. The Eyes: Closed-Loop Error Correction
This is the feature that solves the drift problem. In the real world, the robot uses LiDAR odometry (specifically an algorithm called FAST-LIO2) to understand its exact global position.
Simultaneously, the operator’s headset tracks their global position in the room. The CLONE policy continuously calculates the difference between the Operator’s Position and the Robot’s Position.
Instead of just mimicking the pose (body shape), the robot is trained to minimize this positional difference. If the robot slips slightly to the left, the LiDAR detects the discrepancy relative to the operator, and the policy generates footstep adjustments to bring the robot back in sync.
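A minimal sketch of this idea is below, with placeholder frames and a hand-written odometry reading; FAST-LIO2’s real interface is not shown here.

```python
# A minimal sketch of the closed-loop observation update. The odometry
# values and frame conventions are placeholders, not FAST-LIO2's real API.
import numpy as np

def closed_loop_observation(operator_pos, robot_pos, pose_obs):
    """Append the operator-robot position error to the policy input.

    operator_pos: (3,) operator's global position from the MR headset
    robot_pos:    (3,) robot's global position from LiDAR odometry
    pose_obs:     (D,) the usual pose-mimicking observation
    """
    position_error = operator_pos - robot_pos   # the drift to be closed
    return np.concatenate([pose_obs, position_error])

# If the robot has slipped 0.6 m to the left, the error term is nonzero,
# and a policy trained to minimize it will adjust its footsteps.
obs = closed_loop_observation(np.array([4.0, 0.0, 1.6]),
                              np.array([4.0, -0.6, 1.6]),
                              np.zeros(8))
print(obs[-3:])   # -> [0.  0.6 0. ]
```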
3. The Knowledge: The CLONED Dataset
You cannot learn what you haven’t seen. Existing motion capture datasets (like AMASS) are great for computer graphics but often lack the specific details needed for robotics, such as precise hand orientations or the transitions between different types of movements.
The researchers curated the CLONED dataset. They filtered existing data for feasibility and augmented it with custom motion capture sessions focused on robotic tasks—specifically emphasizing hand orientation and continuous transitions (like walking into a crouch).
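The paper’s curation criteria aren’t reproduced here, but a feasibility filter of this kind might look like the following sketch, where the joint limits, sampling rate, and velocity threshold are all invented for illustration.

```python
# A hypothetical feasibility filter in the spirit of the CLONED curation:
# drop motion-capture clips that violate the robot's joint limits or
# demand implausible joint velocities. All thresholds are made up.
import numpy as np

JOINT_LIMITS = (-2.6, 2.6)     # illustrative symmetric limit, radians
FPS = 50.0                     # assumed clip frame rate

def is_feasible(clip: np.ndarray, max_joint_vel: float = 12.0) -> bool:
    """clip: (T, J) array of joint angles sampled at FPS."""
    lo, hi = JOINT_LIMITS
    within_limits = np.all((clip >= lo) & (clip <= hi))
    vel = np.abs(np.diff(clip, axis=0)) * FPS          # rad/s
    within_velocity = np.all(vel <= max_joint_vel)
    return bool(within_limits and within_velocity)
```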
Experimental Results
The researchers validated CLONE on a Unitree G1 humanoid robot. The results showed a drastic improvement over traditional methods.
Eliminating Drift
In a “straight-path” test, the operator walked 8.9 meters. In an open-loop system, the error would typically grow with every meter traveled. With CLONE, the mean tracking error was held to just 5.1 cm.

The graph above demonstrates the consistency. Whether at 3 meters or nearly 9 meters, the error distribution remains tight and low. This reliability allows for “long-horizon” tasks—you can confidently walk the robot from the kitchen to the living room without it drifting into a wall.
Whole-Body Versatility
The MoE architecture proved its worth in handling diverse motions. The robot successfully performed:
- Waving: Requires upper body dexterity without destabilizing the legs.
- Squatting: A difficult task for bipeds due to the shifting center of mass.
- Jumping: A highly dynamic motion requiring explosive force and precise landing control.

Why MoE Matters: The Ablation Study
To prove that the Mixture-of-Experts architecture was actually responsible for the performance gains, the researchers compared CLONE against a version of itself that used a standard MLP (labeled as CLONE†).

As shown in Table 1, the full CLONE system had significantly lower errors across the board compared to the MLP version (CLONE†) and the version trained on older datasets (CLONE*).
Furthermore, in tests of difficult stances such as deep squats (low target heights), the MoE model preserved its tracking accuracy where the other models struggled to hold the correct velocity or hand orientation.

Figure 7 highlights an interesting trade-off: CLONE (blue line) prioritizes local motion fidelity (getting the body shape and velocity right) even in difficult crouching positions, whereas the baseline methods struggle significantly more as the height decreases (moving to the left on the x-axis).
Peeking Inside the “Brain”
The researchers also visualized when specific experts were activated.

The heatmap above confirms the “specialist” theory. Notice how the activation patterns change depending on the row (the action). “Squat” and “Crouchwalk” activate specific experts intensely, while “Stand” activates a different set. This specialization allows the robot to switch dynamic strategies instantly based on the operator’s intent.
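Reproducing such a heatmap is straightforward once the router’s weights are logged. The sketch below uses a random stand-in gate and synthetic observations; only the logging-and-averaging pattern mirrors the paper’s analysis.

```python
# A sketch of producing a per-action expert-activation heatmap: log the
# gating weights while replaying each motion, then average per action.
# The router here is a random stand-in, not the trained network.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n_experts, obs_dim = 4, 21
router = rng.normal(0, 0.5, (n_experts, obs_dim))   # stand-in gate

def gate_weights(obs):
    logits = router @ obs
    e = np.exp(logits - logits.max())
    return e / e.sum()

actions = ["stand", "walk", "squat", "crouchwalk"]
heat = np.zeros((len(actions), n_experts))
for row, _name in enumerate(actions):
    obs_batch = rng.normal(size=(200, obs_dim)) + row  # per-action offset
    heat[row] = np.mean([gate_weights(o) for o in obs_batch], axis=0)

plt.imshow(heat, cmap="viridis", aspect="auto")
plt.yticks(range(len(actions)), actions)
plt.xlabel("expert index")
plt.colorbar(label="mean routing weight")
plt.title("Per-action expert activation (illustrative)")
plt.show()
```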
Conclusion and Future Outlook
CLONE represents a significant step forward in making humanoid teleoperation practical. By moving from open-loop control to a closed-loop system grounded in LiDAR data, it solves the critical issue of positional drift. Simultaneously, the Mixture-of-Experts architecture allows the robot to handle the complex, conflicting dynamics of different human movements without falling over.
While there are still limitations—highly dynamic movements like jumping are still less stable than simple walking, and the input is limited to just head and hands—this research bridges the gap between the operator and the avatar. It moves us closer to a future where operating a robot in a hazardous environment feels as natural and reliable as walking through the room yourself.