Reconstructing a 3D object from a single 2D image is one of computer vision’s classic “ill-posed” problems. When you look at a photograph of a galloping horse, your brain instantly understands the 3D shape, the articulation of the limbs, and the parts of the animal that are hidden from view. For a computer, however, inferring this geometry from a grid of pixels is incredibly difficult, especially when the object is deformable—meaning it can bend, stretch, and move (like animals or humans).

Traditionally, researchers have tackled this by fitting complex 3D mesh templates to images or solving rigorous geometric equations. But recently, a new paradigm has emerged: asking neural networks to directly predict the geometry pixel-by-pixel.

In this post, we will dive into DualPM, a research paper that introduces a clever new representation called Dual Point Maps. By predicting not one, but two geometric maps for a single image, this method successfully recovers the 3D shape and pose of articulated animals, even tackling the “invisible” parts of the object through amodal reconstruction.

The Shift to Point Maps

To understand DualPM, we first need to look at the concept of a Point Map.

Historically, deep learning approaches for 3D often predicted depth maps (the distance of each pixel from the camera). While useful, depth maps are limited; they don’t inherently tell you where a point is in 3D space without knowing the camera’s intrinsic parameters (like focal length).

Recent breakthroughs, such as DUSt3R, proposed a different approach: instead of predicting depth, why not train the network to output a Point Map \(P\)? A point map associates every pixel \(u\) in an image with a specific 3D coordinate \((x, y, z)\) in the camera’s coordinate system.

\[ u \rightarrow P(u) \in \mathbb{R}^3 \]

This simplifies many geometric tasks. However, for deformable objects, a standard point map has a major weakness: it gives you the shape, but it doesn’t tell you anything about the pose. If you reconstruct a running horse, you get a 3D mesh of a running horse, but the computer doesn’t know that “this point” corresponds to the “left knee” or that the leg is currently bent at a 45-degree angle relative to a resting state.
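To make the distinction concrete, here is a minimal NumPy sketch (image size and intrinsics are made up for illustration): turning a depth map into 3D points requires the intrinsics \(K\), whereas a point map already stores an \((x, y, z)\) coordinate for every pixel.

```python
import numpy as np

H, W = 4, 6
K = np.array([[500.0, 0.0, W / 2],
              [0.0, 500.0, H / 2],
              [0.0,   0.0,   1.0]])               # hypothetical camera intrinsics

depth = np.random.rand(H, W) + 1.0                # depth map: one scalar per pixel

# Unprojecting a depth map needs K: X = z * K^{-1} [u, v, 1]^T
v, u = np.mgrid[0:H, 0:W]
pix_h = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)   # (H, W, 3)
points_from_depth = depth[..., None] * (pix_h @ np.linalg.inv(K).T)

# A point map P skips that step entirely: it is just an (H, W, 3) array giving
# a camera-space coordinate for every pixel.
P = points_from_depth                             # identical by construction here
print(P.shape)                                    # (4, 6, 3)
```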

The Core Concept: Dual Point Maps

This is where DualPM innovates. The authors argue that to fully understand a deformable object, we need to solve two problems simultaneously:

  1. Where is this point in the current scene? (Posed)
  2. What is this point’s identity on the object? (Canonical)

To achieve this, the network predicts Dual Point Maps from a single input image.

1. The Posed Map (\(P\))

This is the standard reconstruction. It maps pixel \(u\) to its 3D location in the camera frame. It represents the object exactly as it appears in the image, with its current deformation and orientation.

2. The Canonical Map (\(Q\))

This is the game-changer. This map links pixel \(u\) to a corresponding 3D point on a neutral, rest-pose version of the object.

Imagine a “template” horse standing perfectly still. The Canonical Map tells us that a specific pixel in the photograph corresponds to a specific coordinate on that template horse. This map is pose-invariant. Whether the horse in the photo is jumping, sleeping, or running, the “nose” pixel always maps to the “nose” coordinate in canonical space.

Figure 1. Left: We map an image of an object to its Dual Point Maps (DualPMs), a pair of point maps P, defined in a camera space, and Q, defined in a canonical space. Right: Visualization of results.

As shown in Figure 1, the relationship between these two maps is powerful. The difference between the Posed Map (\(P\)) and the Canonical Map (\(Q\)) represents the deformation flow (\(P - Q\)). It mathematically describes how the object has been twisted and moved from its neutral state to the observed state.

With this dual representation, we get correspondence for free. If we want to find the same point on a horse across two different images (\(I_1\) and \(I_2\)), we simply look for pixels that map to the same Canonical coordinate:

\[ u_2 = \arg\min_{u} \, \lVert Q_1(u_1) - Q_2(u) \rVert \]

In other words, a pixel \(u_1\) in \(I_1\) is matched to the pixel \(u_2\) in \(I_2\) whose canonical coordinate is closest to its own.
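As a toy illustration (random arrays stand in for real predictions), nearest-neighbour matching in canonical space looks like this:

```python
import numpy as np

H, W = 32, 32
Q1 = np.random.rand(H, W, 3)                # canonical map predicted for image I1
Q2 = np.random.rand(H, W, 3)                # canonical map predicted for image I2
mask2 = np.ones((H, W), dtype=bool)         # foreground mask of I2 (assumed given)

def match(u1, v1):
    """Find the pixel in I2 whose canonical coordinate is closest to that of (u1, v1) in I1."""
    q = Q1[v1, u1]                          # canonical coordinate of the query pixel
    dist = np.linalg.norm(Q2 - q, axis=-1)  # distance to every canonical coordinate in I2
    dist[~mask2] = np.inf                   # never match background pixels
    v2, u2 = np.unravel_index(np.argmin(dist), dist.shape)
    return u2, v2

print(match(10, 20))
```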

Architecture: How to Predict the Maps

The researchers designed a neural network architecture specifically to leverage the relationship between \(P\) and \(Q\). It doesn’t just predict them side-by-side; it uses one to help find the other.

Figure 3. Method overview showing the architecture. Input image I is processed into features F. A canonical predictor generates Q. Conditioned on Q, the posed predictor generates P.

The process, illustrated in Figure 3, follows these steps:

  1. Feature Extraction: The input image \(I\) is passed through a feature extractor (a pre-trained backbone such as DINOv2) to obtain rich image features \(F\).
  2. Predicting Canonical (\(Q\)): The network first predicts the Canonical Map. This makes sense because predicting \(Q\) is essentially a classification task—assigning a “body part identity” to each pixel. This is easier for networks to learn because it is invariant to the camera angle or the animal’s pose.
  3. Predicting Posed (\(P\)): Here is the clever architectural choice. The prediction of the Posed Map \(P\) is conditioned on the predicted Canonical Map \(Q\).

\[ P = f_P(Q) \]

Here \(f_P\) denotes the posed predictor, which receives the canonical map \(Q\) as input.

By feeding the “identity” of the points (\(Q\)) into the predictor for the geometry (\(P\)), the network knows what it is reconstructing before it decides where to place it. This significantly improves generalization.
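A schematic PyTorch sketch of this two-stage design is shown below. The real model uses a DINOv2 backbone and more elaborate decoder heads; the layers and dimensions here are placeholders chosen only to show the conditioning.

```python
import torch
import torch.nn as nn

class DualPointMapNet(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # stand-in for the pre-trained feature extractor (DINOv2 in the paper)
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        # canonical head: predicts Q from the image features F
        self.canonical_head = nn.Conv2d(feat_dim, 3, kernel_size=1)
        # posed head: conditioned on Q, the key architectural choice
        self.posed_head = nn.Conv2d(3, 3, kernel_size=1)

    def forward(self, image):
        F = self.backbone(image)            # image features
        Q = self.canonical_head(F)          # canonical point map
        P = self.posed_head(Q)              # posed point map, conditioned on Q
        return P, Q

net = DualPointMapNet()
P, Q = net(torch.randn(1, 3, 64, 64))
print(P.shape, Q.shape)                     # both torch.Size([1, 3, 64, 64])
```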

Seeing the Invisible: Amodal Reconstruction

A major limitation of standard point maps is that they only represent what the camera can see. If a horse is viewed from the side, the other side of the body is lost. This is called self-occlusion.

To reconstruct the complete 3D shape, DualPM introduces Amodal Point Maps.

Instead of mapping a pixel to a single 3D point, the network maps a pixel to a sequence of points. Imagine shooting a ray from the camera through a specific pixel. That ray might hit the visible side of the horse (entry point), pass through the body, and exit the other side (exit point).

Figure 2. An amodal point map diagram showing a ray passing through a torus object, intersecting at multiple points p1, p2, p3, p4.

As visualized in Figure 2, the network predicts layers of points \((p_1, p_2, \dots)\).

  • Layer 1: The visible surface.
  • Layer 2: The back of the surface (hidden).
  • Subsequent Layers: Any other folds or limbs the ray might pass through.

By predicting these ordered layers, the method reconstructs the entire volume of the object, not just the visible shell.
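One simple way to store such a prediction (a sketch with made-up shapes, not the paper's exact data layout) is a stack of \(L\) point-map layers plus a per-layer validity mask; flattening the valid entries yields the full volumetric point cloud:

```python
import numpy as np

L, H, W = 4, 32, 32                         # number of layers and image size
amodal_P = np.random.rand(L, H, W, 3)       # layer k, pixel (v, u) -> 3D point
valid = np.zeros((L, H, W), dtype=bool)
valid[0] = True                             # layer 0: visible surface hit by every ray
valid[1, 8:24, 8:24] = True                 # layer 1: hidden back surface (toy region)

# Gather every valid intersection across all layers into one (N, 3) point cloud,
# recovering the whole object rather than only the visible shell.
full_cloud = amodal_P[valid]
print(full_cloud.shape)
```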

Training with Synthetic Data

Training deep learning models for 3D usually requires massive amounts of data. Getting ground-truth 3D data for real animals is nearly impossible (you can’t easily put a motion capture suit on a wild zebra).

The authors solve this by training entirely on synthetic data. They use the “Animodel” dataset, which consists of rigged 3D models of animals.

Figure 4. Synthetic training data showing rendered horses in various poses and environments.

The training pipeline (Figure 4) involves:

  1. Taking a 3D model (e.g., a horse).
  2. Randomly articulating it (posing).
  3. Rendering it from random viewpoints with random lighting.
  4. Generating the ground truth \(P\) and \(Q\) maps from the renderer.
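Step 4 is worth unpacking. If the renderer reports which mesh vertex each foreground pixel sees (a hypothetical `pix_to_vert` index map below), the ground-truth maps follow by indexing the posed and canonical vertex positions; a minimal sketch:

```python
import numpy as np

H, W, V = 32, 32, 500
verts_canonical = np.random.rand(V, 3)                        # rest-pose vertices
verts_posed = verts_canonical + 0.1 * np.random.randn(V, 3)   # articulated vertices
                                                              # (camera transform omitted)

pix_to_vert = np.random.randint(0, V, size=(H, W))            # which vertex each pixel sees
mask = np.random.rand(H, W) > 0.5                             # foreground mask

P_gt = np.where(mask[..., None], verts_posed[pix_to_vert], 0.0)      # ground-truth posed map
Q_gt = np.where(mask[..., None], verts_canonical[pix_to_vert], 0.0)  # ground-truth canonical map
print(P_gt.shape, Q_gt.shape)                                 # (32, 32, 3) (32, 32, 3)
```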

Remarkably, the model is trained on only one or two 3D assets per category. One might expect the model to overfit to that specific 3D mesh. However, because the network learns to predict the canonical coordinates (the structural identity), it generalizes surprisingly well to real photos of different animals within the same category.

To supervise training, they use a loss that compares the predicted points to the ground truth, weighted by a per-pixel confidence:

\[ \mathcal{L} = \sum_{u} \sigma(u)\,\lVert \hat{P}(u) - P(u) \rVert - \alpha \log \sigma(u) \]

where \(\hat{P}(u)\) is the predicted point at pixel \(u\), \(P(u)\) the ground truth, \(\sigma(u)\) a confidence score also predicted by the network, and \(\alpha\) a weight that keeps the confidence from collapsing to zero.
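In code, such a confidence-weighted objective could look like the following sketch (PyTorch; the exact weighting in the paper may differ):

```python
import torch

def confidence_weighted_loss(pred, gt, sigma, mask, alpha=0.2):
    """pred, gt: (B, H, W, 3) point maps; sigma: (B, H, W) confidences; mask: (B, H, W) bool."""
    err = torch.norm(pred - gt, dim=-1)             # per-pixel Euclidean error
    loss = sigma * err - alpha * torch.log(sigma)   # confident pixels must also be accurate
    return loss[mask].mean()

B, H, W = 2, 32, 32
pred, gt = torch.rand(B, H, W, 3), torch.rand(B, H, W, 3)
sigma = torch.rand(B, H, W).clamp(min=1e-3)         # keep log(sigma) finite
mask = torch.ones(B, H, W, dtype=torch.bool)
print(confidence_weighted_loss(pred, gt, sigma, mask))
```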

Experimental Results

The DualPM method shows impressive results when applied to real-world images, significantly outperforming prior state-of-the-art methods like 3D-Fauna.

Qualitative Comparison

In Figure 5 below, you can see the difference in reconstruction quality.

  • Input View: The reconstruction overlaid on the original image.
  • Novel Views: The reconstruction rotated to show the 3D structure.

DualPM recovers intricate details and correct articulation, whereas 3D-Fauna often produces “mushy” or distorted shapes when the animal is in a difficult pose (like the jumping horse in the top row).

Figure 5. Comparison with state-of-the-art. DualPM shows sharper, more accurate 3D reconstructions of horses, cows, and sheep compared to 3D-Fauna.

Why Conditioning Matters

The authors performed an ablation study to prove that conditioning the posed map \(P\) on the canonical map \(Q\) was crucial.

In Figure 7, we see that when \(P\) is conditioned on \(Q\) alone (bottom rows), the generalization is robust. If the network relies too much on raw image features \(F\) (top rows), it might get confused by textures or lighting that it didn’t see in the synthetic training set. \(Q\) acts as a stable, semantic bridge.

Figure 7. Ablation study showing that conditioning solely on Canonical maps Q leads to better generalization than conditioning on image features F.

Generalization to New Species

Perhaps the most surprising result is the model’s zero-shot capabilities. A model trained only on a horse was tested on images of cows and sheep. Because these quadrupeds share a similar topological structure (four legs, head, tail), the “canonical” understanding of the horse transferred reasonably well to the other animals.

Figure A8. Results on unseen categories. A model trained only on horses generalizes to reconstruct cows and sheep.

Applications: Skeleton Fitting and Animation

Because DualPM provides the Canonical Map (\(Q\)), we know exactly where every part of the mesh should be in a neutral pose. This makes it trivial to fit a 3D skeleton to the reconstruction.

Once the skeleton is fitted, the static image can be brought to life. The researchers demonstrated that they could take a single photo of a horse, reconstruct it, fit a skeleton, and then apply motion capture data to animate the horse running.
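As a rough illustration of why the canonical map helps here (the paper's actual fitting procedure may differ): points can be assigned to bones in canonical space, and each bone's rigid transform recovered by aligning its canonical points to its posed points with the Kabsch algorithm. The bone assignment below is a hypothetical stand-in.

```python
import numpy as np

def kabsch(src, dst):
    """Rigid transform (R, t) minimizing ||R @ src_i + t - dst_i||."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, c_dst - R @ c_src

# Toy data: canonical points (from Q) and posed points (from P) related by a rotation.
theta = 0.5
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
Q_pts = np.random.rand(200, 3)
P_pts = Q_pts @ Rz.T + np.array([0.1, 0.0, 0.2])

bone_id = (Q_pts[:, 0] > 0.5).astype(int)   # hypothetical two-bone assignment in canonical space
for b in range(2):
    sel = bone_id == b
    R, t = kabsch(Q_pts[sel], P_pts[sel])   # per-bone rigid pose from canonical to camera frame
    print(b, np.allclose(R, Rz), t)
```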

Figure 6. Animation. Showing how the extracted skeleton allows for motion retargeting and animation of the reconstructed horse.

Conclusion

Dual Point Maps represent a significant step forward in 3D computer vision. By decomposing the problem into “where is it” (\(P\)) and “what is it” (\(Q\)), the researchers turned a complex geometric inference problem into a manageable learning task.

The ability to train on minimal synthetic data and generalize to real-world images suggests that this representation captures something fundamental about the geometry of deformable objects. Furthermore, the introduction of amodal layers solves the persistent problem of self-occlusion, allowing computers to “imagine” the unseen side of an object much like humans do.

For students and practitioners in the field, DualPM highlights the importance of choosing the right representation. Sometimes, the key to solving a hard problem isn’t a bigger network, but a smarter way of defining the output.