Introduction
Imagine wearing a pair of smart glasses. You are walking through your living room, reaching for a coffee mug, or typing on a keyboard. The glasses have cameras, but they are facing outward to map the world. They can see the mug, the table, and maybe your hands entering the frame. But they can’t see you—or at least, not your torso, legs, or feet.
This “invisibility” presents a massive challenge for Augmented Reality (AR) and robotics. If a computer system wants to understand your actions, it needs to know your full body pose. Is the user sitting or standing? Are they leaning forward? Where are their feet planted?
Traditionally, solving this required external motion capture systems (like the ones used in movies) or third-person cameras. But for a consumer device, we only have the sensors on the user’s head. We need to hallucinate the rest of the body based solely on how the head moves and what the hands are doing.
This brings us to EgoAllo, a research paper that proposes a robust system for estimating human body pose, height, and hand parameters using only egocentric (first-person) SLAM poses and images.

As shown in Figure 1, the system takes the feeds from wearable cameras and reconstructs a digital avatar that mirrors the user’s real-world actions, complete with accurate foot placement and hand interactions.
In this deep dive, we will explore the architecture of EgoAllo, unpacking the clever mathematical insights regarding “invariant conditioning” that make this reconstruction possible.
Background: The “Ego” and the “Allo”
To understand the contribution of this paper, we first need to define the coordinate frames involved.
- Egocentric (Ego): This is the reference frame relative to the device or the user’s head. In an egocentric view, the camera is the center of the universe.
- Allocentric (Allo): This is the reference frame of the world or the scene. If you walk from the kitchen to the living room, your allocentric position changes, even if the headset remains on your face.
The goal of this research is to take egocentric inputs (what the glasses see and feel) and produce an allocentric output (where the human is in the room).
The Input: SLAM
The system relies on SLAM (Simultaneous Localization and Mapping). Modern smart glasses, like the Project Aria devices used in this study, run SLAM algorithms to figure out where the device is in space. This provides a high-frequency stream of 6-DoF (six degrees of freedom) poses—the position and orientation of the head at every timestep.
The Challenge: Ambiguity
Why is this hard? Because the mapping from “head motion” to “body motion” is one-to-many. If your head drops 10 centimeters, are you squatting? Are you bending at the waist? Did you step down a stair?
To solve this, the researchers employ a Diffusion Model. Diffusion models are generative AI models (the same tech behind DALL-E or Midjourney) that learn to generate data by reversing a noise process. In this context, the model learns the probability distribution of natural human motions. It acts as a “motion prior,” helping the system guess the most likely body movement that explains the head’s motion.
Core Method: The EgoAllo Architecture
The EgoAllo system is a pipeline that transforms raw sensor data into a clean, physics-compliant 3D human reconstruction.
![Figure 2. Overview of components in EgoAllo. We restrict the diffusion model to local body parameters (Section 3.1.1). An invariant parameterization g(·) (Section 3.1.2) of SLAM poses is used to condition a diffusion model. These can be placed into the global coordinate frame via global alignment (Section 3.2.1) to input poses. When available, egocentric video is used for hand detection via HaMeR [66], which can be incorporated into samples via guidance (Section 3.2.2).](/en/paper/2410.03665/images/002.jpg#center)
As visualized in Figure 2, the process involves three main stages:
- Conditioning: Processing the SLAM poses into a format the neural network can understand.
- Diffusion: Generating local body poses and hand parameters.
- Guidance & Alignment: Using video data to refine the hands and placing the body into the world frame.
Let’s break these down, focusing on the paper’s most critical contribution: Invariant Conditioning.
1. The Insight: Invariant Conditioning
The researchers found that you can’t just feed raw head coordinates into a neural network and expect good results. If you train a model on someone walking in a circle at x=0, y=0, it might struggle to understand someone walking in a circle at x=50, y=50.
The paper argues that a robust representation must satisfy two criteria:
- Spatial Invariance: The motion logic should be the same regardless of where in the room you are.
- Temporal Invariance: The motion logic should be the same regardless of when in a sequence an action occurs.
Why Prior Methods Failed
Previous approaches often violated one of these rules.
- Absolute Poses: Violate spatial invariance. The model overfits to specific world coordinates.
- Sequence Canonicalization: Some methods define a coordinate frame based on the first frame of a video clip. This fixes spatial issues but breaks temporal invariance. If you chop the same video into different windows, the “canonical” frame changes, confusing the model.
To visualize the problem with absolute poses, look at Figure A.1 below. If the world frame is defined arbitrarily, the same relative motion looks completely different to the computer.

The Solution: Locally Canonicalized Frames
EgoAllo proposes a new parameterization function, \(g(\cdot)\), that achieves both spatial and temporal invariance.
The raw input is the pose of the “Central Pupil Frame” (CPF)—a virtual point between the user’s eyes.
\[ \mathbf{T}_{\mathrm{world,cpf}}^{t} = (\mathbf{R}_{\mathrm{world,cpf}}^{t}, \mathbf{p}_{\mathrm{world,cpf}}^{t}) \in \mathrm{SE}(3), \qquad \{\vec{c}^{\,1}, \ldots, \vec{c}^{\,T}\} = g(\{\mathbf{T}_{\mathrm{world,cpf}}^{1}, \ldots, \mathbf{T}_{\mathrm{world,cpf}}^{T}\}). \]
To make this invariant, they first look at the relative motion between time steps. How much did the head move from time \(t-1\) to \(t\), relative to the head’s own orientation?
\[ \Delta \mathbf{T}_{\mathrm{cpf}}^{t-1,t} = (\mathbf{T}_{\mathrm{world,cpf}}^{t-1})^{-1} \mathbf{T}_{\mathrm{world,cpf}}^{t}. \]
However, relative motion isn’t enough. It doesn’t tell you if the head is upside down or how high off the floor it is. To solve this, the authors introduce Per-Timestep Canonical Frames.
Instead of defining one reference frame for a whole video clip, they compute a new reference frame for every single timestep. This frame is projected onto the floor directly beneath the current head position.
![Figure 3. Locally canonicalized coordinate frames. We compute our invariant conditioning parameterization (Equation 4) using transformations computed from three coordinate frames. Following [85], the CPF has the z-axis forward. Following HuMoR [74], the world and canonical z-axes point up. Canonical frames are computed by projecting the CPF frame origin to the ground plane, then aligning the canonical y-axis to the CPF forward direction.](/en/paper/2410.03665/images/005.jpg#center)
This projection aligns the coordinate system with gravity (the floor) and the user’s current facing direction. The full conditioning vector \(\vec{c}^t\) combines the relative motion with the head’s pose relative to this immediate floor frame:
\[ \vec{c}^{\,t} = \bigl\{ \Delta \mathbf{T}_{\mathrm{cpf}}^{t-1,t}, \; (\mathbf{T}_{\mathrm{world,canonical}}^{t})^{-1} \mathbf{T}_{\mathrm{world,cpf}}^{t} \bigr\}. \]
By doing this, the model learns “how a body moves when the head is \(X\) meters above the floor and rotating by \(Y\) degrees,” regardless of where in the room the person actually is.
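To make the construction concrete, here is a minimal NumPy sketch of how a per-timestep canonical frame and the conditioning tuple above could be computed from 4×4 homogeneous SLAM poses. The axis conventions follow Figure 3 (CPF z-axis forward, world and canonical z-axes up), but the floor height, function names, and output format are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

UP = np.array([0.0, 0.0, 1.0])  # world/canonical z-axis points up (HuMoR convention)

def canonical_frame(T_world_cpf, floor_z=0.0):
    """Build T_world_canonical for one timestep: project the CPF origin to the
    floor plane (assumed here to be z = floor_z) and align the canonical y-axis
    with the head's forward direction. Simplified sketch, not the paper's code."""
    R, p = T_world_cpf[:3, :3], T_world_cpf[:3, 3]
    forward = R[:, 2]                              # CPF convention: z-axis is forward
    fwd_flat = forward - UP * forward.dot(UP)      # drop the vertical component
    y_axis = fwd_flat / np.linalg.norm(fwd_flat)   # (degenerate if looking straight up/down)
    x_axis = np.cross(y_axis, UP)                  # right-handed: x = y × z
    T = np.eye(4)
    T[:3, :3] = np.stack([x_axis, y_axis, UP], axis=1)
    T[:3, 3] = np.array([p[0], p[1], floor_z])     # CPF origin projected onto the floor
    return T

def invariant_conditioning(T_world_cpf_seq):
    """g(.): map a sequence of absolute CPF poses to per-timestep conditioning
    tuples (relative head motion, head pose w.r.t. the local floor frame)."""
    cond = []
    for t in range(1, len(T_world_cpf_seq)):
        T_prev, T_curr = T_world_cpf_seq[t - 1], T_world_cpf_seq[t]
        delta = np.linalg.inv(T_prev) @ T_curr                        # relative head motion
        T_canon_cpf = np.linalg.inv(canonical_frame(T_curr)) @ T_curr # head w.r.t. floor frame
        cond.append((delta, T_canon_cpf))
    return cond
```

Note that nothing in this computation depends on where the trajectory sits in the world or on which frame starts the window, which is exactly the spatial and temporal invariance the paper argues for.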
To verify this visually, look at Figure A.4 below. This diagram shows how the transformations are chained together to create the invariant conditioning.

2. The Diffusion Model
With the inputs properly conditioned, EgoAllo uses a standard Transformer-based diffusion model. The goal is to predict the clean body parameters \(\vec{x}_0\) (joint angles, body shape, contact info) from a noisy input \(\vec{x}_n\).
\[ \min_{\theta} \; \mathbb{E}_{\vec{x}_0} \mathbb{E}_{n \sim \mathcal{U}} \bigl[ w_n \bigl\| \mu_{\theta}(\vec{x}_n, n, \vec{c}) - \vec{x}_0 \bigr\|^2 \bigr]. \]
The model predicts local parameters—meaning the angles of the elbows and knees, but not where the person is in the room. This decoupling is crucial. The model learns how bodies articulate, and the SLAM system handles where the body is located.
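For readers who want to see how the objective above might look in code, here is a hedged PyTorch-style sketch of one training step for an \(\vec{x}_0\)-predicting conditional diffusion model. The DDPM-style noise schedule, the uniform weighting \(w_n\), and the model interface are placeholder assumptions, not the paper's training code.

```python
import torch

def diffusion_training_step(model, x0, cond, alphas_cumprod):
    """One training step: noise the clean body parameters x0, then regress
    them back with the conditional denoiser mu_theta(x_n, n, c)."""
    B = x0.shape[0]
    n = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)  # n ~ Uniform
    a_bar = alphas_cumprod[n].view(B, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_n = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise             # forward noising
    x0_pred = model(x_n, n, cond)                                      # Transformer denoiser
    w_n = 1.0                                                          # placeholder weighting
    return (w_n * (x0_pred - x0) ** 2).mean()
```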
3. Global Alignment and Guidance
Once the model generates a sequence of body poses, we need to place them back into the room. Because we know the head position from SLAM (\(T_{world,cpf}\)), and the model predicted the body relative to the head, we can chain the transforms:
\[ \mathbf{T}_{\mathrm{world,root}}^{t} = \mathbf{T}_{\mathrm{world,cpf}}^{t} \, \mathbf{T}_{\mathrm{cpf,root}}^{(\Theta^{t}, \beta^{t})}. \]
This equation places the virtual character’s pelvis (root) in the exact spot required to make its head line up with the smart glasses.
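In code, this alignment is just a per-timestep composition of rigid transforms; a minimal sketch, assuming 4×4 homogeneous matrices and illustrative names:

```python
import numpy as np

def align_to_world(T_world_cpf_seq, T_cpf_root_seq):
    """Global alignment: chain each SLAM head pose with the head-to-pelvis
    transform implied by the predicted local parameters (Theta^t, beta^t)."""
    return [T_world_cpf @ T_cpf_root
            for T_world_cpf, T_cpf_root in zip(T_world_cpf_seq, T_cpf_root_seq)]
```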
Visual Guidance (The Hand Estimator)
The diffusion model provides a strong “prior” (a guess based on training data). But we also have video! If the video shows hands, we should use that information.
EgoAllo uses an optimization step during inference. It detects hands using an off-the-shelf detector called HaMeR and then nudges the diffusion generation to align with these visual cues.
The guidance loss function is a sum of three parts:
\[ \mathcal{E}_{\mathrm{guidance}}^{(\Theta)} = \mathcal{E}_{\mathrm{hands}}^{(\Theta)} + \mathcal{E}_{\mathrm{skate}}^{(\Theta)} + \mathcal{E}_{\mathrm{prior}}^{(\Theta)}. \]
- Hands: Make the generated hands match the video detection.
- Skate: Minimize “foot skating” (when feet slide unnaturally on the floor).
- Prior: Don’t deviate too far from a natural human pose.
Specifically for the hands, the system uses a reprojection loss. It projects the 3D hand joints back onto the 2D camera image and checks the error against the detected 2D keypoints.
\[ \mathcal{E}_{\mathrm{reproj}}^{(\Theta)} = \sum_{t, j \in \mathcal{H}} \bigl\| \Pi_K\bigl(\mathbf{p}_{\mathrm{camera},j}^{(\Theta^t)}\bigr) - \Pi_K\bigl(\hat{\mathbf{p}}_{\mathrm{camera},j}^{t}\bigr) \bigr\|_2^2, \qquad \mathbf{p}_{\mathrm{camera},j}^{(\Theta^t)} = \mathbf{T}_{\mathrm{camera,cpf}} \bigl(\mathbf{T}_{\mathrm{world,cpf}}^{t}\bigr)^{-1} \mathbf{p}_{\mathrm{world},j}^{(\Theta^t)}. \]
This optimization is performed using the Levenberg-Marquardt algorithm, which the authors found converges much faster than standard optimizers like Adam for this specific task.
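A hedged sketch of the hand reprojection term is shown below, assuming a pinhole intrinsics matrix K, 4×4 transforms, and the detector's 3D joint estimates expressed in the camera frame; the real system minimizes this together with the skating and prior terms via Levenberg-Marquardt rather than evaluating it in isolation.

```python
import numpy as np

def project(K, p_cam):
    """Pinhole projection Pi_K: 3D camera-frame point -> 2D pixel coordinates."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

def hand_reprojection_error(K, T_camera_cpf, T_world_cpf,
                            p_world_joints, p_hat_camera_joints):
    """Squared pixel error between the body model's hand joints (world frame)
    and the detector's joint estimates (camera frame), for one timestep."""
    T_camera_world = T_camera_cpf @ np.linalg.inv(T_world_cpf)
    err = 0.0
    for p_w, p_hat in zip(p_world_joints, p_hat_camera_joints):
        p_cam = (T_camera_world @ np.append(p_w, 1.0))[:3]   # move joint into camera frame
        err += np.sum((project(K, p_cam) - project(K, p_hat)) ** 2)
    return err
```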

Experiments and Results
Does all this complex conditioning and guidance actually work? The authors tested EgoAllo on several datasets, including AMASS (motion capture data) and real-world footage from Project Aria.
Does Invariant Conditioning Matter?
The first test was to see if the “spatial and temporal invariance” theory held up. They trained identical models with different input parameterizations.

Table 1 shows the results. EgoAllo (Eq. 4) significantly outperforms other methods.
- MPJPE (Mean Per-Joint Position Error): Lower is better. EgoAllo achieves 129.8mm error on short sequences.
- Comparison: The naive “Absolute” parameterization has a massive error of 159.9mm. The “Sequence Canonicalization” method (used in prior work like EgoEgo) sits at 153.1mm.
- Takeaway: The invariant conditioning leads to an accuracy improvement of nearly 18% over the closest competitor.
Qualitative Comparison
Numbers are good, but visuals are better. In Figure 4 below, we see a comparison of a running sequence.

- Ground Truth (a): The actual motion.
- EgoAllo (b): Very close to ground truth. The posture is natural, and the stride is captured well.
- EgoEgo (c): Notice the distortions. The character looks hunched, and the limbs are not positioned correctly relative to the head motion.
Hand Estimation Improvements
One of the most interesting findings is that estimating the whole body actually helps estimate the hands better.
If you just run a hand detector (like HaMeR) on a video frame, it has no context. It might think a hand is 1 meter away when it’s actually 0.5 meters away. It might jitter wildly between frames.
By attaching the hands to a full body, EgoAllo enforces kinematic constraints. The hand can’t be 3 meters away because the arm isn’t that long.

Figure 5 illustrates this. In the top row, we see interaction with a touchscreen. The purple EgoAllo estimates are physically consistent—the fingers touch the screen. In the bottom row, we see how EgoAllo resolves depth ambiguities that plague monocular estimators.
The quantitative improvement is drastic:

As shown in Table 3, the error drops from 237.90mm (HaMeR alone) to 131.45mm (EgoAllo-Mono). If stereo cameras are available (EgoAllo-Wrist3D), the error drops even further to 60.08mm.
Conclusion
EgoAllo represents a significant step forward in egocentric vision. By carefully designing how sensor data is fed into the model—specifically through spatially and temporally invariant conditioning—the researchers turned a noisy, ambiguous problem into a tractable one.
The system demonstrates that you don’t always need external cameras to track a human. With just a pair of smart glasses, we can infer the position of the feet, the posture of the spine, and the precise location of hands in the world.
Key Takeaways:
- Representation is King: How you format your input data (Canonicalization) can be just as important as the model architecture itself.
- Whole-Body Context: Solving for the whole body (torso + legs) provides constraints that improve the accuracy of smaller parts (hands).
- Hybrid Approach: Combining generative AI (Diffusion) with classical optimization (Guidance) allows us to leverage the best of learned priors and real-time sensor data.
As AR glasses become more common, techniques like EgoAllo will likely be the engine under the hood, ensuring that our digital avatars move through the world just as seamlessly as we do.