Imagine you have a handful of photos taken by different people at a party. You want to recreate that exact moment in 3D: the room layout, where everyone was standing, and where each photographer stood.

In computer vision, this is a classic divide. We have excellent algorithms for reconstructing the static scene (walls, furniture), known as Structure from Motion (SfM). We also have great models for reconstructing humans (poses, body shapes). But historically, these two fields have been like oil and water. SfM algorithms treat moving humans as “noise” to be filtered out, while human reconstruction methods often generate “floating” avatars with no sense of the floor or the room scale.

This blog post explores a research paper titled “Reconstructing People, Places, and Cameras”, which introduces a method called Humans and Structure from Motion (HSfM). This approach bridges the gap, proving that by solving for people, scenes, and cameras simultaneously, we get a better result for all three.

Figure 1. Humans and Structure from Motion (HSfM). The left shows input images of people in a room. The right shows the joint 3D reconstruction: colored human meshes, a point cloud of the room, and the camera positions (green screens).

As shown in Figure 1, the goal is to take a sparse set of uncalibrated images (left) and produce a fully consistent 3D world (right) where the people are anchored to the floor, the furniture is mapped out, and the cameras are placed correctly in space.

The Problem: Two Solitudes

To appreciate HSfM, we need to understand the limitations of the current landscape.

  1. Structure from Motion (SfM): Traditional methods (like COLMAP) or newer deep learning methods (like DUSt3R) are great at finding matching points in static scenes to build a 3D point cloud. However, they struggle with scale. Without a reference object of known size, an SfM reconstruction doesn’t know if a room is 5 meters wide or 50 meters wide. It creates an “up-to-scale” world. Furthermore, these methods usually ignore people entirely.
  2. Human Mesh Recovery (HMR): Methods like HMR2 can look at a photo and predict a 3D mesh of a person’s body. While the pose (where the arms and legs are relative to the body) is usually accurate, the global placement is often a guess. You frequently end up with meshes that look like they are floating in mid-air or penetrating the floor because the algorithm lacks context about the environment.

The Solution? Synergy. The authors of HSfM propose that these two problems are actually each other’s solutions. Humans have a statistical range of heights and sizes. By including humans in the reconstruction, we can use them as “metric rulers” to solve the scale ambiguity of the scene. Conversely, a solid reconstruction of the scene and camera positions can constrain the humans, forcing them to stand on the ground rather than float in the void.
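As a toy example with made-up numbers (not from the paper): suppose a reconstructed person measures 0.5 units from head to toe in the up-to-scale SfM world, while the body model says they are about 1.70 m tall. Then the whole scene should be rescaled by

\[ \alpha \approx \frac{1.70\ \text{m}}{0.5\ \text{units}} = 3.4, \]

so a wall spanning 2 units in the reconstruction is really about 6.8 m long.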

The HSfM Pipeline

The HSfM framework is an optimization pipeline. It doesn’t train a neural network from scratch; instead, it cleverly combines the outputs of existing state-of-the-art models and refines them through a joint optimization process.

Let’s look at the high-level workflow:

Figure 2. Pipeline of Humans and Structure from Motion. Left: Preprocessing extracts 2D keypoints, human meshes, and scene pointmaps. Center: Joint optimization refines everything. Right: The final metric 3D output.

As illustrated in Figure 2, the process moves from raw images to a polished 3D world in three main stages: Preprocessing, Initialization, and Joint Optimization.

1. Preprocessing and Definitions

The system takes \(C\) uncalibrated images as input. It assumes we know which person is which across the different photos (known person re-identification).

Before optimizing, the system gathers initial guesses using “foundational models”:

  • For Humans: It uses ViTPose to detect 2D keypoints (joints) in the images and HMR2 to regress an initial 3D mesh (SMPL-X model).
  • For Scene & Cameras: It uses DUSt3R, a recent powerful tool that outputs “pointmaps” (dense 3D geometry) and initial camera parameters.
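To make these inputs concrete, here is a minimal sketch of what the preprocessing stage hands to the optimizer. The container and field names are mine, not the paper's, and the shapes are illustrative:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ViewInputs:
    """Everything an HSfM-style optimization needs from one image (a sketch)."""
    keypoints_2d: np.ndarray   # (num_people, num_joints, 3): x, y, detection confidence (from ViTPose)
    init_humans: dict          # person id -> initial SMPL-X orientation/pose/shape/translation (from HMR2)
    pointmap: np.ndarray       # (H, W, 3): per-pixel 3D points in this camera's frame (from DUSt3R)
    pointmap_conf: np.ndarray  # (H, W): per-pixel confidence of the pointmap
    intrinsics: np.ndarray     # (3, 3): initial camera matrix K
```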

Let’s define the variables we are trying to solve for.

The Human Model: Each human \(h\) is represented by the SMPL-X parameters (Equation 1):

\[ \mathcal{H}_h = \{\phi_h, \theta_h, \beta_h, \gamma_h\} \]

Here, \(\phi\) is the global orientation, \(\theta\) is the body pose, \(\beta\) is the body shape (fat/thin/tall), and \(\gamma\) is the translation (position) in the world.

The Camera Model: Standard SfM results are scale-ambiguous. To fix this, the authors introduce a scaling parameter \(\alpha\) into the projection equation (Equation 2):

\[ x_{2D} = \pi\!\left( K \left( R\, x_{3D} + \alpha\, t \right) \right) \]

This projects a 3D point \(x_{3D}\) onto the image using the intrinsics \(K\), rotation \(R\), and translation \(t\), with \(\pi\) denoting perspective division. Note the \(\alpha t\) term. This allows the system to adjust the scale of the scene to match the metric size of the humans.
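In code, the scaled projection of Equation 2 is a small addition on top of a standard pinhole model. A minimal NumPy sketch (no lens distortion; names are mine):

```python
import numpy as np

def project(x_3d, K, R, t, alpha):
    """Project world points (N, 3) to pixel coordinates (N, 2), as in Equation 2."""
    x_cam = x_3d @ R.T + alpha * t        # world -> camera, with the rescaled translation
    x_img = x_cam @ K.T                   # apply the intrinsics K
    return x_img[:, :2] / x_img[:, 2:3]   # perspective division
```

Because \(\alpha\) multiplies only the translation, adjusting it stretches or shrinks the distances between cameras and scene content without touching rotations or intrinsics.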

The Scene (Pointmaps): The scene is represented as “pointmaps” \(S\). Think of a pointmap as an image where every pixel contains the \((x, y, z)\) coordinate of that point in the 3D world. A pixel \((u, v)\) with depth \(D_{u,v}\) is unprojected into world coordinates as (Equation 3):

\[ S_{u,v} = R^{\top}\!\left( D_{u,v}\, K^{-1} \left[ u, v, 1 \right]^{\top} - \alpha\, t \right) \]
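Here is the inverse direction of the projection above, a sketch of Equation 3 under the same camera model: lifting a per-view depth map into a world-coordinate pointmap.

```python
import numpy as np

def unproject_to_world(depth, K, R, t, alpha):
    """Turn an (H, W) depth map into an (H, W, 3) pointmap in world coordinates (Equation 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                  # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    x_cam = depth.reshape(-1, 1) * (pix @ np.linalg.inv(K).T)       # back-project rays, scale by depth
    x_world = (x_cam - alpha * t) @ R                               # undo the world -> camera transform
    return x_world.reshape(H, W, 3)
```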

2. Initialization of the World

This is a critical step. If you start an optimization with random values, it will likely fail. The authors need to align the “data-driven SfM world” (from DUSt3R) with the “human world” (from HMR2).

Since standard SfM doesn’t know if a unit of distance is a millimeter or a kilometer, the initial scene might be tiny or huge. If the scene is initialized too small, the cameras might end up placed inside the human bodies, causing the math to break.

To solve this, the method calculates a scale factor \(\alpha\) analytically before optimization begins.

Step A: Rotate Cameras. They first align the cameras using the humans. HMR2 estimates the same person’s global orientation independently in each camera’s coordinate frame, so comparing these estimates across views yields each camera’s rotation relative to a reference camera (Equation 4). Intuitively, everyone’s “up” direction must agree once all views are expressed in a common world frame.

Step B: Position Cameras. Next, they estimate where the cameras are by using the humans as anchors. HMR2 also provides each person’s 3D position in every camera’s frame; combining this with the rotations from Step A and the person’s position in the world recovers each camera’s position \(\hat{T}^c\) in world coordinates (Equation 5).

Step C: Solve for Scale. Finally, they compare the camera positions derived from the humans (\(\hat{T}\)) against the camera positions predicted by the SfM tool (\(\tilde{T}\)). Since the human-derived positions are metric, the ratio of inter-camera distances between the two sets gives the initial scale \(\alpha\). This effectively “sizes” the room to match the people.
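Here is a minimal NumPy sketch of Step C under my own simplifications: it takes the ratio of summed pairwise inter-camera distances, whereas the paper derives \(\alpha\) analytically. The function name and inputs are hypothetical.

```python
import numpy as np

def init_scale(T_from_humans, T_from_sfm):
    """Initial scene scale alpha from two sets of camera centers (a sketch of Step C).

    T_from_humans: (C, 3) metric camera centers derived from the people (Steps A and B).
    T_from_sfm:    (C, 3) up-to-scale camera centers from the DUSt3R/SfM reconstruction.
    """
    i, j = np.triu_indices(len(T_from_humans), k=1)              # all camera pairs
    d_humans = np.linalg.norm(T_from_humans[i] - T_from_humans[j], axis=1)
    d_sfm = np.linalg.norm(T_from_sfm[i] - T_from_sfm[j], axis=1)
    # Ratio of inter-camera distances: how much the SfM world must grow or shrink
    # so its camera spacing matches the metric, human-derived spacing.
    return float(d_humans.sum() / (d_sfm.sum() + 1e-8))
```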

3. Joint Optimization

Now that the world is roughly aligned, the system runs a heavy optimization to refine everything. This is the core contribution of the paper.

The objective function minimizes error across all components simultaneously (Equation 6):

\[ E = L_{\mathrm{Humans}} + L_{\mathrm{Places}} \]

which is minimized over the human parameters, the cameras (including the scale \(\alpha\)), and the scene pointmaps.

This formula balances two goals: making the humans look right (\(L_{Humans}\)) and making the scene look right (\(L_{Places}\)). Let’s break them down.
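Conceptually, the joint optimization is a single gradient-descent loop over every parameter at once. A minimal PyTorch sketch, assuming `loss_humans` and `loss_places` are callables like the ones sketched in the next two subsections (the paper’s actual optimizer and schedule may differ):

```python
import torch

def joint_optimize(params, loss_humans, loss_places, steps=500, lr=1e-2):
    """Refine humans, cameras, scale, and scene together (Equation 6, a sketch).

    params: dict of tensors with requires_grad=True, e.g. human poses and translations,
            camera rotations and translations, the scene scale alpha, per-view depths.
    """
    opt = torch.optim.Adam([p for p in params.values() if p.requires_grad], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_humans(params) + loss_places(params)   # the two terms of the objective
        loss.backward()
        opt.step()
    return params
```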

The Human Loss (\(L_{Humans}\))

This is based on Bundle Adjustment, a classic technique in 3D vision. The human loss splits into a joint reprojection term and a shape regularization term (Equation 7):

\[ L_{\mathrm{Humans}} = L_J + L_\beta \]

The primary component is the reprojection error (\(L_J\)). The system projects the 3D human joints back onto the 2D images and checks how far they are from the detected 2D keypoints (from ViTPose). Summing over cameras \(c\), humans \(h\), and joints \(j\) (Equation 8):

\[ L_J = \sum_{c}\sum_{h}\sum_{j} \left\| \hat{x}^{\,c}_{h,j} - \pi^{c}\!\left( X_{h,j} \right) \right\| \]

where \(\hat{x}^{\,c}_{h,j}\) is a detected 2D keypoint and \(\pi^{c}(X_{h,j})\) is the corresponding 3D joint projected with camera \(c\) (Equation 2). By minimizing this error, the system adjusts the camera parameters and each human’s 3D position and pose until the projections line up with the detections.

They also add a small regularization term (\(L_\beta\)) to prevent the human shapes from distorting into unrealistic monsters (Equation 9):

\[ L_\beta = \sum_{h} \left\| \beta_h \right\|^2 \]

which keeps each body shape close to the mean shape (\(\beta = 0\)).
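A PyTorch sketch of the human term for a single image; the real system sums this over all cameras, and the exact confidence weighting and loss weights are my assumptions:

```python
import torch

def human_loss(joints_3d, joints_2d_det, conf, K, R, t, alpha, betas, w_beta=1e-3):
    """Reprojection term L_J plus shape regularizer L_beta (Equations 7-9, a sketch).

    joints_3d:     (H, J, 3) current 3D joints of each person in world coordinates
    joints_2d_det: (H, J, 2) 2D keypoints detected by ViTPose in this image
    conf:          (H, J)    keypoint detection confidences
    betas:         (H, 10)   SMPL-X shape coefficients of each person
    """
    x_cam = torch.einsum('ij,hkj->hki', R, joints_3d) + alpha * t   # world -> camera (Equation 2)
    x_img = torch.einsum('ij,hkj->hki', K, x_cam)
    proj = x_img[..., :2] / x_img[..., 2:3]                         # perspective division
    l_joints = (conf * (proj - joints_2d_det).norm(dim=-1)).mean()  # confidence-weighted L_J
    l_shape = (betas ** 2).mean()                                   # L_beta: stay near the mean shape
    return l_joints + w_beta * l_shape
```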

The Places Loss (\(L_{Places}\))

This term ensures the static scene is consistent across different views. It uses a “Global Alignment Loss” adapted from DUSt3R (Equation 10), which pulls the pointmaps estimated from different views into a single, shared world coordinate system. In plain English: if Camera A sees a chair, and Camera B sees the same chair, the 3D points calculated from both cameras should land in the exact same spot in the world. This term forces the per-view depth maps to merge into a single, crisp 3D environment.
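To convey the idea in code, here is a deliberately simplified consistency penalty between two views’ pointmaps on matched pixels. It is not DUSt3R’s exact global alignment objective (which works on pairwise predictions with per-pair poses, scales, and confidences), just the core “same point, same place” constraint:

```python
import torch

def places_loss(pts_world_a, pts_world_b, conf_a, conf_b):
    """Penalize disagreement between two views' world-space points (a simplified sketch).

    pts_world_a/b: (N, 3) corresponding pixels from views A and B, each unprojected into
                   world coordinates with that view's current camera, depth, and scale.
    conf_a/b:      (N,) per-point confidences from the pointmap network.
    """
    w = conf_a * conf_b                               # trust a match only if both views are confident
    dist = (pts_world_a - pts_world_b).norm(dim=-1)   # 3D distance between the two estimates
    return (w * dist).sum() / (w.sum() + 1e-8)
```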

Experiments and Results

The researchers tested HSfM on two challenging datasets: EgoHumans and EgoExo4D. These datasets feature people performing complex activities (fencing, playing basketball, cooking) captured by multiple cameras.

Quantitative Success

The results, presented in Table 2, show massive improvements over existing baselines.

Table 2. Evaluation on EgoHumans and EgoExo4D. HSfM drastically reduces World MPJPE (human position error) and Camera Translation Error (TE) compared to baselines like UnCaliPose and MASt3R.

Key takeaways from the data:

  • Human Placement (W-MPJPE): On EgoHumans, the error in placing humans in the world dropped from 3.51 meters (using UnCaliPose) to just 1.04 meters with HSfM. That is roughly a 3.4x improvement.
  • Metric Scale: The method successfully recovers the true metric scale. Previous SfM methods are “scale-less,” meaning they can’t tell you real-world distances. HSfM can.
  • Camera Accuracy: By using humans as constraints, the camera position estimates actually improved compared to using scene-only methods (DUSt3R/MASt3R). The synergy works both ways!

Qualitative Magic

Numbers are great, but visual results tell the story better.

In Figure 3 below, you can see the difference between the initial state and the optimized result.

  • Before (Initial state): Look at panels (a), (c), and (e). The people often appear to be floating, or the scene geometry is messy.
  • After (Optimization): In panels (b), (d), and (f), the people are grounded. Their feet touch the floor. The scene point cloud is sharper.

Figure 3. Qualitative results comparing initial states versus optimized HSfM results. The optimization grounds floating humans and corrects scene scale.

The method works even in “wild” scenarios with minimal setups, such as images taken from just two smartphones.

Figure S.1. In-the-wild results using just two smartphones. (a) Input images. (b) Scene reconstruction without human meshes (people are just noisy points). (c) Full HSfM result with human meshes integrated.

Notice in the image above (Figure S.1) how the “Human Point Cloud” in (b) is just a sparse, noisy blob. But in (c), HSfM replaces that noise with a clean, pose-accurate mesh that sits perfectly in the environment.

Another example shows the correction of “foot-skate”—where a 3D model looks like it’s sliding on ice rather than standing.

Figure S.2. Close-up of in-the-wild reconstruction. HSfM ensures accurate human-scene contact, such as the person’s right foot resting on the box.

Finally, let’s look at a complex indoor scene from the EgoExo4D dataset.

Figure S.4. Comparisons on EgoExo4D. Left: Input images. Center: Before optimization (fragmented scene, floating humans). Right: After optimization (coherent room structure, grounded humans).

The center column of Figure S.4 shows the “Before” state—a fragmented, exploded-looking scene. The right column (“After”) shows a coherent room. This visually confirms that the Places Loss (aligning the scene) and the Human Loss (aligning the people) pull the entire 3D world together.

Why This Matters

The “Humans and Structure from Motion” paper is significant because it moves us closer to holistic scene understanding.

  1. Metric Scale for Free: By leveraging the biological priors of humans (we know roughly how big people are), we can reconstruct rooms in metric scale without needing depth sensors or calibration targets.
  2. Better Cameras: It turns out that people are not just noise; they are valuable geometric cues. Anchoring the optimization on people makes the camera pose estimates more stable.
  3. No Contact Constraints Needed: Many previous methods forced feet to touch the ground explicitly (which fails if someone jumps). HSfM achieves grounding naturally through geometric consistency, allowing it to reconstruct jumping or dynamic motions accurately.

Conclusion

HSfM successfully unifies the reconstruction of people, places, and cameras into a single optimization framework. By initializing with robust foundational models and refining with a combined loss function, it achieves a level of consistency that neither SfM nor Human Mesh Recovery could achieve alone.

For students and researchers in Computer Vision, this highlights a growing trend: the move away from isolated tasks (just detecting pose, just building maps) toward joint systems where different components inform and improve each other.

If you are interested in the full mathematical details or the code, the authors have made their resources available at the project page linked in their paper.