Imagine wearing a VR headset or AR glasses. You reach out to grab a virtual cup of coffee. For the experience to feel real, the system needs to know exactly where your hand is—not just in the camera’s view, but in the actual 3D room.

This sounds simple, but it is a surprisingly difficult computer vision problem. In “egocentric” (first-person) video, two things are moving at once: your hands and your head (the camera). Traditional methods struggle to separate these motions. If you turn your head left, it looks like your hand moved right. Furthermore, your hands frequently drop out of the camera’s view, causing tracking systems to “forget” where they are.

In this post, we are diving deep into HaWoR (Hand World Reconstruction), a new research paper that proposes a robust method for reconstructing 3D hand motion in world-space coordinates from a single egocentric video.

Figure 1. HaWoR separates camera motion from hand motion to place hands accurately in the 3D world, even when they leave the camera frame.

The Core Problem: Camera Space vs. World Space

To understand the innovation of HaWoR, we first need to understand the limitation of previous work. Most existing 3D hand pose estimation methods operate in camera space.

  • Camera Space: The coordinate system is relative to the lens. If the camera moves forward 1 meter, the hand’s coordinates change by 1 meter, even if the hand remained perfectly still in the room.
  • World Space: The coordinate system is fixed to the environment (e.g., the floor of the room). This is what we actually need for AR/VR and robotics.

Recovering world-space motion from a single video is hard because of scale ambiguity (how big is that movement in meters?) and occlusion (the moving hands hide the static background that camera tracking relies on, while objects and the frame boundary frequently hide the hands themselves).
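To make the distinction concrete, here is a minimal sketch (not from the paper) of how camera-space hand joints become world-space joints once the camera pose is known. Note how any error in the camera pose, including its scale, propagates directly into the world-space result.

```python
import numpy as np

def camera_to_world(joints_cam: np.ndarray,
                    R_wc: np.ndarray,
                    t_wc: np.ndarray) -> np.ndarray:
    """Map hand joints from camera coordinates to world coordinates.

    joints_cam : (J, 3) hand joints expressed in the camera frame.
    R_wc       : (3, 3) rotation of the camera in the world frame.
    t_wc       : (3,)   position of the camera in the world frame (meters).
    """
    # Rotate each joint into the world frame, then offset by the camera position.
    return joints_cam @ R_wc.T + t_wc
```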

The HaWoR Solution: Divide and Conquer

The researchers propose a “divide and conquer” strategy. Instead of trying to predict world coordinates directly from pixels (which is too complex), they decompose the problem into three manageable tasks:

  1. Hand Motion Estimation: Reconstruct the hand’s shape and pose relative to the camera.
  2. Camera Trajectory Estimation: Figure out exactly how the camera moved through the room and fix the scale.
  3. Motion Infilling: Guess where the hands are when they leave the camera’s field of view.

Let’s look at the complete pipeline below.

Figure 2. Overview of the HaWoR pipeline. It splits the video into a hand branch (top) and a camera/SLAM branch (bottom), merging them to create a unified world-space output.

As shown in Figure 2, the system takes an input video and processes it through two main branches. The top branch handles the hand estimation using a Transformer-based network. The bottom branch handles the camera motion using a modified SLAM (Simultaneous Localization and Mapping) approach. Finally, an “Infiller” network patches up the missing data.
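As a purely structural sketch, the decomposition can be read as three composable stages. The stage functions below are hypothetical placeholders standing in for the hand branch, the SLAM branch, and the Infiller; only the composition logic is meant literally.

```python
from typing import Callable, Sequence
import numpy as np

def hawor_style_pipeline(
    frames: Sequence[np.ndarray],
    hand_branch: Callable,    # frames -> (camera-frame hands, hand masks, visibility flags)
    camera_branch: Callable,  # (frames, hand masks) -> metric-scale camera poses (R, t)
    infiller: Callable,       # (world-space hands, visibility flags) -> completed hands
):
    """Schematic composition of the three sub-problems described above."""
    hands_cam, hand_masks, visibility = hand_branch(frames)
    camera_poses = camera_branch(frames, hand_masks)   # hands are masked out of the SLAM input
    # Compose: express each frame's camera-space hand joints in world coordinates.
    hands_world = [
        joints @ R.T + t                               # rigid transform, row-vector convention
        for joints, (R, t) in zip(hands_cam, camera_poses)
    ]
    return infiller(hands_world, visibility)
```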

Let’s break down each component.

1. Hand Motion Estimation in Camera Frame

The first step is getting a good 3D mesh of the hand relative to the camera. The authors build upon the MANO model, a standard parametric hand model that maps a small set of pose and shape parameters to a full hand mesh and its joints.

They use a Vision Transformer (ViT) to extract features from the video frames. However, egocentric videos are messy: hands are often blurred or partially cut off at the edge of the frame. To handle this, HaWoR introduces two specific attention modules (a generic sketch of their shared attention pattern follows the list):

  1. Image Attention Module (IAM): This looks at the visual features of adjacent frames. If the hand is blurry in frame \(t\), the sharp details from frame \(t-1\) or \(t+1\) help refine the current view.
  2. Pose Attention Module (PAM): This applies attention directly to the predicted pose parameters, ensuring the hand moves naturally over time rather than jittering.
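Both modules follow the same underlying recipe: self-attention across a short temporal window of per-frame tokens, with image feature tokens for the IAM and predicted pose-parameter tokens for the PAM. The minimal PyTorch sketch below shows that shared pattern; the dimensions, residual connection, and normalization are illustrative assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Generic temporal self-attention over a short window of per-frame tokens."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T, dim), one token per frame in the window.
        # Each frame attends to its neighbours, so a blurry frame t can
        # borrow sharper evidence from frames t-1 and t+1.
        refined, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + refined)

# Example: refine a window of 5 consecutive frame tokens.
window = torch.randn(1, 5, 256)
refined = TemporalAttention()(window)
```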

The network is trained using a composite loss function that enforces accuracy in both 3D alignment and 2D projection:

Equation 1. The loss function combines 3D joint error, 2D projection error, and MANO parameter accuracy.

Here, \(\mathcal{L}_{3D}\) ensures the 3D skeleton is correct, \(\mathcal{L}_{2D}\) ensures the skeleton aligns with the pixels in the image, and \(\mathcal{L}_{MANO}\) ensures the shape parameters are realistic.
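Spelled out, a composite objective of this kind typically takes a weighted-sum form; the weights \(\lambda\) below are generic placeholders rather than values from the paper:

\[
\mathcal{L} \;=\; \lambda_{3D}\,\mathcal{L}_{3D} \;+\; \lambda_{2D}\,\mathcal{L}_{2D} \;+\; \lambda_{MANO}\,\mathcal{L}_{MANO}
\]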

2. Adaptive Egocentric SLAM

This is where the paper gets particularly clever. To place the hands in the world, we need to know where the camera is. The standard tool for this is SLAM (Simultaneous Localization and Mapping). Specifically, the authors use an algorithm called DROID-SLAM.

However, standard SLAM fails in egocentric hand videos for two reasons:

  1. Dynamic Objects: SLAM algorithms assume the world is static. Moving hands cover a large part of the view in first-person video. The SLAM system sees the hands moving and often mistakenly thinks the camera is moving in the opposite direction.
  2. Scale Ambiguity: Monocular SLAM (using one camera) creates a map with arbitrary units. It doesn’t know if you moved 1 meter or 1 centimeter.

Solving the Dynamic Object Problem

HaWoR solves the first issue by masking. The system takes the detected hands and masks them out of the visual data fed into the SLAM algorithm.

Equation 2. Masking operation to remove hand regions from the SLAM input.

In Equation 2, \(M_t\) is the hand mask. By multiplying the image \(I_t\) and the confidence map \(w_t\) by \((1 - M_t)\), the researchers force the SLAM system to ignore the hands and track the camera based only on the static background.
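In symbols, the masking amounts to an element-wise product (this is reconstructed from the description above, so the notation may differ slightly from the paper):

\[
\tilde{I}_t = I_t \odot (1 - M_t), \qquad \tilde{w}_t = w_t \odot (1 - M_t)
\]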

Solving the Scale Problem with Metric3D

To fix the scale issue (converting “SLAM units” to real-world meters), the authors use a foundation model called Metric3D. This AI model is trained to look at a single image and guess the absolute depth of pixels in meters.

But Metric3D isn’t perfect: it is less accurate for objects that are very close or very far away. To compensate, HaWoR uses an Adaptive Sampling Module (AdaSM), which samples depth points only from a “sweet spot” distance range (\(D_{min}\) to \(D_{max}\)) and excludes the hand regions.

Equation 3. Selecting reliable depth points S_t within a specific range, excluding hand regions.

Once they have these reliable points, they calculate a scaling factor \(\alpha\) that aligns the SLAM trajectory with the real-world metrics predicted by Metric3D:

Equation 4. Optimization function to find the best scale factor alpha.

This results in a camera trajectory that is not only accurate in shape but also correct in physical size.
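To make the scale-recovery step concrete, here is a minimal sketch of the idea: sample the SLAM depth and the Metric3D depth at reliable pixels, then fit a single scale factor by least squares. The thresholds and the closed-form solution are assumptions based on the description above, not the authors’ implementation.

```python
import numpy as np

def fit_metric_scale(slam_depth: np.ndarray,
                     metric_depth: np.ndarray,
                     hand_mask: np.ndarray,
                     d_min: float = 0.5,
                     d_max: float = 3.0) -> float:
    """Estimate a factor alpha that aligns SLAM depth (arbitrary units) to meters.

    slam_depth   : (H, W) up-to-scale depth from monocular SLAM.
    metric_depth : (H, W) absolute depth in meters predicted by Metric3D.
    hand_mask    : (H, W) boolean mask, True where a hand is visible.
    d_min, d_max : the "sweet spot" range (meters) where Metric3D is trusted.
    """
    # Keep only pixels off the hands and inside the reliable depth range.
    valid = (~hand_mask) & (metric_depth > d_min) & (metric_depth < d_max)
    s, m = slam_depth[valid], metric_depth[valid]

    # Closed-form least squares for  min_alpha  sum (alpha * s - m)^2.
    return float(np.dot(s, m) / np.dot(s, s))
```

Multiplying the SLAM camera translations by this \(\alpha\) then expresses the trajectory in meters.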

3. The Hand Motion Infiller

In egocentric video, you often look away from your hands, or you might reach for something outside your field of view. Most systems simply stop tracking, resulting in broken trajectories.

HaWoR introduces a Motion Infiller Network. This is a Transformer that takes the visible sequence of hand motions and predicts the missing frames.

The Canonical Space Transformation

Predicting motion in camera space is hard because the camera is moving wildly. Predicting in world space is hard because the hand could be anywhere in the room.

The solution is to transform the motion into Canonical Space. This aligns the start of the motion sequence to a standard zero-position and zero-rotation.

Figure 8. Transforming motion into canonical space allows the network to learn pure hand motion independent of camera location.

The transformation logic (Equation 7) effectively decouples the hand’s local movement from the camera’s global movement:

Equations 5 and 7. Mathematical formulation for converting world-space coordinates into canonical space.
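As a rough illustration of the idea (a generic first-frame normalization, not the paper’s exact equations), the sketch below re-expresses a world-space hand trajectory relative to its first frame, so that every sequence starts at the origin with identity orientation:

```python
import numpy as np

def to_canonical(rotations: np.ndarray, translations: np.ndarray):
    """Re-express a world-space hand trajectory relative to its first frame.

    rotations    : (T, 3, 3) global hand orientations in world space.
    translations : (T, 3)    hand root positions in world space.
    """
    R0_inv = rotations[0].T                               # inverse of the first orientation
    canon_R = R0_inv @ rotations                          # (T, 3, 3): R0^{-1} R_t
    canon_t = (translations - translations[0]) @ R0_inv.T # rows: R0^{-1} (t_t - t_0)
    return canon_R, canon_t
```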

Once in this standardized space, the Infiller network (a Transformer Encoder) predicts the missing pose, rotation, and translation tokens. The network is trained using loss functions that penalize errors in global orientation (\(\Phi\)), root translation (\(\Gamma\)), pose (\(\Theta\)), and shape (\(\beta\)):

Equation 6. The loss function for the Infiller network.
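Written out, a loss with these terms plausibly takes the familiar weighted-sum shape (the weights are again placeholders):

\[
\mathcal{L}_{\text{infill}} \;=\; \lambda_{\Phi}\,\mathcal{L}_{\Phi} + \lambda_{\Gamma}\,\mathcal{L}_{\Gamma} + \lambda_{\Theta}\,\mathcal{L}_{\Theta} + \lambda_{\beta}\,\mathcal{L}_{\beta}
\]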

Experiments and Results

The researchers validated HaWoR on challenging datasets like DexYCB (hand-object interaction) and HOT3D (egocentric videos with ground truth).

Robustness to Occlusion

First, they checked how well the system estimates hand pose in the camera frame, particularly when the view is blocked.

Table 1. Comparison of camera-frame accuracy. Note the performance under 75-100% occlusion.

As Table 1 shows, HaWoR outperforms state-of-the-art methods like WiLoR and Deformer. The gap is most noticeable in the “75%-100%” occlusion column, where HaWoR maintains a much lower error (MPJPE of 5.07 vs 5.68 for WiLoR). This demonstrates the value of the temporal attention modules (IAM/PAM).

Camera Trajectory Accuracy

Next, they evaluated the camera trajectory. This is critical—if the camera path is wrong, the hand path in the world will be wrong too.

Figure 4. Visual comparison of camera trajectories. DROID-SLAM (orange) drifts significantly, while HaWoR (blue) hugs the Ground Truth (purple).

In Figure 4, you can see standard DROID-SLAM drifting away (the orange line). HaWoR’s adaptive SLAM (blue line) stays incredibly close to the ground truth (purple).

Table 2 quantifies this success. The “ATE-S” metric (Absolute Trajectory Error with scale) drops significantly when using the proposed Adaptive Sampling (AdaSM).

Table 2. Trajectory error metrics. The proposed method achieves the lowest error (15.86mm) compared to baselines.

World-Space Hand Reconstruction

Finally, the main event: reconstructing hands in the world. The authors compared HaWoR against optimization-based methods (HMP-SLAM) and regression methods (WiLoR-SLAM).

Table 3. World-space quantitative evaluation. HaWoR drastically reduces the World MPJPE.

The results in Table 3 are striking. The W-MPJPE (World Mean Per Joint Position Error) drops from ~119mm (HMP-SLAM) to just 33.20mm with HaWoR. This is a massive leap in accuracy.
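For readers unfamiliar with the metric: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and the “W-” prefix means it is computed in world coordinates, so errors in the camera trajectory count against the method. A minimal sketch (evaluation protocols may additionally align trajectory segments before measuring):

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error, in the same units as the inputs (e.g. mm).

    pred, gt : (T, J, 3) joint positions. For W-MPJPE both are expressed in
    world coordinates, so camera-pose errors propagate into the score.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```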

Qualitatively, the difference is obvious in complex scenarios, such as picking up a kettle or using a mouse.

Figure 3. Visualization of hand trajectories. The blue skeletons (HaWoR) match the green ground truth much better than the yellow (HMP-SLAM) or purple (WiLoR) baselines.

Figure 3 illustrates that HaWoR follows the complex curves of human motion faithfully, whereas other methods often result in jagged or offset paths.

The “Infiller” also proves its worth. Figure 5 shows long-range hand movements where the hand likely exited the frame. HaWoR (blue) maintains a smooth curve that aligns with reality.

Figure 5. Trajectory comparison for right-hand motion. Note the smoothness of the proposed method (blue).

In-the-Wild Generalization

The team also applied HaWoR to the EPIC-KITCHENS dataset. These are real-world cooking videos without ground truth data. Despite not being trained on this dataset, HaWoR produces plausible reconstructions where other methods fail (see Figure 7 below).

Figure 7. Qualitative comparison on in-the-wild videos. HaWoR handles truncation at the frame boundary significantly better than HaMeR or WiLoR.

Conclusion and Limitations

HaWoR represents a significant step forward for egocentric computer vision. By smartly decoupling hand motion from camera motion and treating them as separate but related problems, the authors achieve state-of-the-art results.

Key Takeaways:

  1. Masking matters: You cannot reliably run SLAM on egocentric video without masking out the moving hands.
  2. Context is King: Using temporal attention (IAM/PAM) allows the system to “see” through occlusions.
  3. Don’t forget the invisible: The Infiller network shows that hand positions can be inferred with useful accuracy even after the hands leave the frame, by learning motion patterns in canonical space.

Limitations: The system isn’t perfect. It relies on an off-the-shelf detector to find the hands initially; if that detector fails (e.g., swapping left and right hands), HaWoR fails too. Additionally, because the two hands are modeled independently, they can sometimes visually “clip” through each other, since hand-to-hand collisions are not yet physically constrained.

Figure 9. Failure cases caused by incorrect left/right hand detection labels.

Despite these limitations, HaWoR offers a blueprint for the future of AR/VR tracking, moving us closer to systems that understand human motion not just as pixels on a screen, but as physical actions in the real world.