Imagine wearing a VR headset or AR glasses. You reach out to grab a virtual cup of coffee. For the experience to feel real, the system needs to know exactly where your hand is—not just in the camera’s view, but in the actual 3D room.
This sounds simple, but it is a surprisingly difficult computer vision problem. In “egocentric” (first-person) video, two things are moving at once: your hands and your head (the camera). Traditional methods struggle to separate these motions. If you turn your head left, it looks like your hand moved right. Furthermore, your hands frequently drop out of the camera’s view, causing tracking systems to “forget” where they are.
In this post, we are diving deep into HaWoR (Hand World Reconstruction), a new research paper that proposes a robust method for reconstructing 3D hand motion in world-space coordinates from a single egocentric video.

The Core Problem: Camera Space vs. World Space
To understand the innovation of HaWoR, we first need to understand the limitation of previous work. Most existing 3D hand pose estimation methods operate in camera space.
- Camera Space: The coordinate system is relative to the lens. If the camera moves forward 1 meter, the hand’s coordinates change by 1 meter, even if the hand remained perfectly still in the room.
- World Space: The coordinate system is fixed to the environment (e.g., the floor of the room). This is what we actually need for AR/VR and robotics.
Recovering world-space motion from a single video is hard because of scale ambiguity (how big is that movement in meters?) and occlusion (hands block the background, and the background blocks hands).
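To make the distinction concrete: if the camera’s pose in the world at time \(t\) is known (a rotation \(R_t\) and a translation \(\mathbf{t}_t\)), a camera-space point maps to world space through a rigid transform. The notation here is generic, not the paper’s:

\[
\mathbf{p}^{world} = R_t\,\mathbf{p}^{cam} + \mathbf{t}_t
\]

Everything in HaWoR’s camera branch exists to recover \(R_t\) and \(\mathbf{t}_t\) accurately and at the correct metric scale; without them, even a perfect camera-space hand mesh cannot be placed in the room.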
The HaWoR Solution: Divide and Conquer
The researchers propose a “divide and conquer” strategy. Instead of trying to predict world coordinates directly from pixels (which is too complex), they decompose the problem into three manageable tasks:
- Hand Motion Estimation: Reconstruct the hand’s shape and pose relative to the camera.
- Camera Trajectory Estimation: Figure out exactly how the camera moved through the room and fix the scale.
- Motion Infilling: Guess where the hands are when they leave the camera’s field of view.
Let’s look at the complete pipeline below.

As shown in Figure 2, the system takes an input video and processes it through two main branches. The top branch handles the hand estimation using a Transformer-based network. The bottom branch handles the camera motion using a modified SLAM (Simultaneous Localization and Mapping) approach. Finally, an “Infiller” network patches up the missing data.
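Concretely, the two branches meet in a simple rigid-body composition: per-frame camera-space hand joints are pushed through the per-frame world-from-camera pose. The shapes and function name below are my own illustration of that step, not the paper’s code:

```python
import numpy as np

def hands_to_world(joints_cam, R_wc, t_wc):
    """Compose camera-space hand joints with world-from-camera poses.

    joints_cam : (T, J, 3) hand joints in camera coordinates, per frame
    R_wc       : (T, 3, 3) camera-to-world rotations (from the SLAM branch)
    t_wc       : (T, 3)    camera positions in the world, in meters
    """
    # p_world = R_wc @ p_cam + t_wc, applied per frame and per joint.
    return np.einsum('tij,tkj->tki', R_wc, joints_cam) + t_wc[:, None, :]

# Toy usage: 5 frames, 21 joints, identity camera poses.
joints = np.random.randn(5, 21, 3)
R = np.repeat(np.eye(3)[None], 5, axis=0)
t = np.zeros((5, 3))
print(hands_to_world(joints, R, t).shape)  # (5, 21, 3)
```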
Let’s break down each component.
1. Hand Motion Estimation in Camera Frame
The first step is getting a good 3D mesh of the hand relative to the camera. The authors build upon the MANO model (a standard mathematical model for representing hand geometry).
They use a Vision Transformer (ViT) to extract features from the video frames. However, egocentric videos are messy—hands are often blurred or partially cut off at the edge of the frame. To handle this, HaWoR introduces two specific attention modules:
- Image Attention Module (IAM): This looks at the visual features of adjacent frames. If the hand is blurry in frame \(t\), the sharp details from frame \(t-1\) or \(t+1\) help refine the current view.
- Pose Attention Module (PAM): This applies attention directly to the predicted pose parameters, ensuring the hand moves naturally over time rather than jittering.
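The paper defines IAM and PAM with their own architectures; the block below is only a generic temporal self-attention layer in PyTorch, meant to make the idea concrete. The dimensions and the residual/normalization scheme are my assumptions, not the authors’ implementation:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis of per-frame features.

    A generic stand-in for the role IAM/PAM play: each frame's feature
    (or pose token) attends to neighbouring frames, so blurry or truncated
    frames can borrow evidence from cleaner ones.
    """

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -- one feature or pose token per frame.
        refined, _ = self.attn(x, x, x)
        return self.norm(x + refined)  # residual keeps per-frame information

# Toy usage: 16 frames of 256-d features for one clip.
feats = torch.randn(1, 16, 256)
print(TemporalAttention()(feats).shape)  # torch.Size([1, 16, 256])
```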
The network is trained using a composite loss function that enforces accuracy in both 3D alignment and 2D projection:
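The paper’s exact formulation and weights aren’t reproduced here, but from the description it is a weighted sum of three terms (the \(\lambda\) weights are generic placeholders):

\[
\mathcal{L} = \lambda_{3D}\,\mathcal{L}_{3D} + \lambda_{2D}\,\mathcal{L}_{2D} + \lambda_{MANO}\,\mathcal{L}_{MANO}
\]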

Here, \(\mathcal{L}_{3D}\) ensures the 3D skeleton is correct, \(\mathcal{L}_{2D}\) ensures the skeleton aligns with the pixels in the image, and \(\mathcal{L}_{MANO}\) ensures the shape parameters are realistic.
2. Adaptive Egocentric SLAM
This is where the paper gets particularly clever. To place the hands in the world, we need to know where the camera is. The standard tool for this is SLAM (Simultaneous Localization and Mapping). Specifically, the authors use an algorithm called DROID-SLAM.
However, standard SLAM fails in egocentric hand videos for two reasons:
- Dynamic Objects: SLAM algorithms assume the world is static. Moving hands cover a large part of the view in first-person video. The SLAM system sees the hands moving and often mistakenly thinks the camera is moving in the opposite direction.
- Scale Ambiguity: Monocular SLAM (using one camera) creates a map with arbitrary units. It doesn’t know if you moved 1 meter or 1 centimeter.
Solving the Dynamic Object Problem
HaWoR solves the first issue by masking. The system takes the detected hands and masks them out of the visual data fed into the SLAM algorithm.
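The paper’s Equation 2 isn’t reproduced verbatim here, but based on the explanation below it amounts to element-wise masking of both the image and the SLAM confidence map:

\[
\bar{I}_t = I_t \odot (1 - M_t), \qquad \bar{w}_t = w_t \odot (1 - M_t)
\]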

In Equation 2, \(M_t\) is the hand mask. By multiplying the image \(I_t\) and the confidence map \(w_t\) by \((1 - M_t)\), the researchers force the SLAM system to ignore the hands and track the camera based only on the static background.
Solving the Scale Problem with Metric3D
To fix the scale issue (converting “SLAM units” to real-world meters), the authors use a foundation model called Metric3D, which is trained to estimate, from a single image, the absolute depth of each pixel in meters.
But Metric3D isn’t perfect: it is less accurate for objects that are very close or very far away. To compensate, HaWoR uses an Adaptive Sampling Module (AdaSM) that samples depth points only from a “sweet spot” distance range (\(D_{min}\) to \(D_{max}\)), excluding the hand regions.
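A minimal numpy sketch of that sampling idea: keep only background pixels whose predicted metric depth falls inside the trusted range. The array names and the specific range values are illustrative assumptions, not the paper’s:

```python
import numpy as np

def sample_reliable_depth(metric_depth, hand_mask, d_min=0.5, d_max=3.0, n=2000):
    """Pick background pixels whose predicted depth lies in a trusted range.

    metric_depth : (H, W) predicted depth in meters (e.g. from Metric3D)
    hand_mask    : (H, W) boolean, True where a hand is detected
    Returns flat pixel indices of up to `n` sampled points.
    """
    valid = (~hand_mask) & (metric_depth > d_min) & (metric_depth < d_max)
    idx = np.flatnonzero(valid)
    if idx.size > n:
        idx = np.random.default_rng(0).choice(idx, size=n, replace=False)
    return idx

# Toy example: random depth map with a fake "hand" region masked out.
depth = np.random.uniform(0.1, 6.0, size=(480, 640))
mask = np.zeros((480, 640), dtype=bool)
mask[300:480, 200:400] = True  # pretend a hand covers this area
print(sample_reliable_depth(depth, mask).shape)
```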

Once they have these reliable points, they calculate a scaling factor \(\alpha\) that aligns the SLAM trajectory with the real-world metrics predicted by Metric3D:
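The paper’s alignment equation isn’t reproduced here; one common, robust choice consistent with the description is a median ratio between the metric depths and the (arbitrary-scale) SLAM depths at the sampled pixels \(i\):

\[
\alpha = \operatorname{median}_{i}\; \frac{d_i^{\,Metric3D}}{d_i^{\,SLAM}}
\]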

This results in a camera trajectory that is not only accurate in shape but also correct in physical size.
3. The Hand Motion Infiller
In egocentric video, you often look away from your hands, or you might reach for something outside your field of view. Most systems simply stop tracking, resulting in broken trajectories.
HaWoR introduces a Motion Infiller Network. This is a Transformer that takes the visible sequence of hand motions and predicts the missing frames.
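As a rough mental model (not the authors’ architecture), the infiller behaves like a masked-sequence model: frames where the hand was invisible are replaced by a learned token, and a Transformer encoder fills them in from the visible context. A toy sketch:

```python
import torch
import torch.nn as nn

class MotionInfiller(nn.Module):
    """Toy masked-sequence infiller: a stand-in for the idea, not HaWoR's network."""

    def __init__(self, dim: int = 128, heads: int = 4, layers: int = 2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, motion: torch.Tensor, visible: torch.Tensor) -> torch.Tensor:
        # motion:  (batch, time, dim) motion tokens; contents ignored where invisible
        # visible: (batch, time) boolean, True where the hand was observed
        x = torch.where(visible.unsqueeze(-1), motion, self.mask_token)
        return self.encoder(x)  # predictions for every frame, including the gaps

# Toy usage: a 30-frame clip where frames 10-19 are unobserved.
motion = torch.randn(1, 30, 128)
visible = torch.ones(1, 30, dtype=torch.bool)
visible[:, 10:20] = False
print(MotionInfiller()(motion, visible).shape)  # torch.Size([1, 30, 128])
```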
The Canonical Space Transformation
Predicting motion in camera space is hard because the camera is moving wildly. Predicting in world space is hard because the hand could be anywhere in the room.
The solution is to transform the motion into Canonical Space. This aligns the start of the motion sequence to a standard zero-position and zero-rotation.

The transformation logic (Equation 7) effectively decouples the hand’s local movement from the camera’s global movement:
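Equation 7 itself isn’t reproduced here, but a standard canonicalization consistent with the description works as follows: treating the global orientation \(\Phi_t\) as a rotation matrix and \(\Gamma_t\) as the world-space root translation, every frame is expressed relative to the first frame of the sequence:

\[
\tilde{\Phi}_t = \Phi_0^{-1}\,\Phi_t, \qquad \tilde{\Gamma}_t = \Phi_0^{-1}\left(\Gamma_t - \Gamma_0\right)
\]

After this transform every sequence starts at the origin with identity rotation, which drastically shrinks the space of motions the Infiller has to learn.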

Once in this standardized space, the Infiller network (a Transformer Encoder) predicts the missing pose, rotation, and translation tokens. The network is trained using loss functions that penalize errors in global orientation (\(\Phi\)), root translation (\(\Gamma\)), pose (\(\Theta\)), and shape (\(\beta\)):
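The exact weights aren’t given here, but the objective plausibly takes the usual weighted-sum form over those four quantities:

\[
\mathcal{L}_{infill} = \lambda_{\Phi}\,\mathcal{L}_{\Phi} + \lambda_{\Gamma}\,\mathcal{L}_{\Gamma} + \lambda_{\Theta}\,\mathcal{L}_{\Theta} + \lambda_{\beta}\,\mathcal{L}_{\beta}
\]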

Experiments and Results
The researchers validated HaWoR on challenging datasets like DexYCB (hand-object interaction) and HOT3D (egocentric videos with ground truth).
Robustness to Occlusion
First, they checked how well the system estimates hand pose in the camera frame, particularly when the view is blocked.

As Table 1 shows, HaWoR outperforms state-of-the-art methods like WiLoR and Deformer. The gap is most noticeable in the “75%-100%” occlusion column, where HaWoR maintains a lower error (MPJPE of 5.07 vs. 5.68 for WiLoR), underscoring the value of the temporal attention modules (IAM/PAM).
Camera Trajectory Accuracy
Next, they evaluated the camera trajectory. This is critical—if the camera path is wrong, the hand path in the world will be wrong too.

In Figure 4, you can see standard DROID-SLAM drifting away (the orange line). HaWoR’s adaptive SLAM (blue line) stays incredibly close to the ground truth (purple).
Table 2 quantifies this success. The “ATE-S” metric (Absolute Trajectory Error with scale) drops significantly when using the proposed Adaptive Sampling (AdaSM).

World-Space Hand Reconstruction
Finally, the main event: reconstructing hands in the world. The authors compared HaWoR against optimization-based methods (HMP-SLAM) and regression methods (WiLoR-SLAM).

The results in Table 3 are striking. The W-MPJPE (World Mean Per Joint Position Error) drops from ~119mm (HMP-SLAM) to just 33.20mm with HaWoR. This is a massive leap in accuracy.
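For readers unfamiliar with the metric: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and the “W-” prefix means both are expressed in world coordinates rather than in the camera frame. A minimal numpy version (my own illustration) looks like this:

```python
import numpy as np

def w_mpjpe(pred, gt):
    """World-space MPJPE, in the same units as the inputs (e.g. millimeters).

    pred, gt : (T, J, 3) world-coordinate joint positions over T frames, J joints.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy check: predictions offset by 33 mm along x give a 33 mm error.
gt = np.zeros((10, 21, 3))
pred = gt + np.array([33.0, 0.0, 0.0])
print(w_mpjpe(pred, gt))  # 33.0
```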
Qualitatively, the difference is obvious in complex scenarios, such as picking up a kettle or using a mouse.

Figure 3 illustrates that HaWoR follows the complex curves of human motion faithfully, whereas other methods often result in jagged or offset paths.
The “Infiller” also proves its worth. Figure 5 shows long-range hand movements where the hand likely exited the frame. HaWoR (blue) maintains a smooth curve that aligns with reality.

In-the-Wild Generalization
The team also applied HaWoR to the EPIC-KITCHENS dataset. These are real-world cooking videos without ground truth data. Despite not being trained on this dataset, HaWoR produces plausible reconstructions where other methods fail (see Figure 7 below).

Conclusion and Limitations
HaWoR represents a significant step forward for egocentric computer vision. By smartly decoupling hand motion from camera motion and treating them as separate but related problems, the authors achieve state-of-the-art results.
Key Takeaways:
- Masking matters: You cannot run SLAM on egocentric video without hiding the moving hands.
- Context is King: Using temporal attention (IAM/PAM) allows the system to “see” through occlusions.
- Don’t forget the invisible: The Infiller network proves that we can accurately hallucinate hand positions when they leave the frame by understanding canonical motion patterns.
Limitations: The system isn’t perfect. It relies on an off-the-shelf detector to find the hands in the first place; if that detector fails (e.g., by swapping left and right hands), HaWoR inherits the error. Additionally, because the two hands are modeled independently, they can sometimes visually “clip” through each other, since collisions between the hands are not explicitly constrained.

Despite these limitations, HaWoR offers a blueprint for the future of AR/VR tracking, moving us closer to systems that understand human motion not just as pixels on a screen, but as physical actions in the real world.