Reconstructing a 3D object from a single 2D image is one of computer vision’s classic “ill-posed” problems. When you look at a photograph of a galloping horse, your brain instantly understands the 3D shape, the articulation of the limbs, and the parts of the animal that are hidden from view. For a computer, however, inferring this geometry from a grid of pixels is incredibly difficult, especially when the object is deformable—meaning it can bend, stretch, and move (like animals or humans).
Traditionally, researchers have tackled this by fitting complex 3D mesh templates to images or solving rigorous geometric equations. But recently, a new paradigm has emerged: asking neural networks to directly predict the geometry pixel-by-pixel.
In this post, we will dive into DualPM, a research paper that introduces a clever new representation called Dual Point Maps. By predicting not one, but two geometric maps for a single image, this method successfully recovers the 3D shape and pose of articulated animals, even tackling the “invisible” parts of the object through amodal reconstruction.
The Shift to Point Maps
To understand DualPM, we first need to look at the concept of a Point Map.
Historically, deep learning approaches for 3D often predicted depth maps (the distance of each pixel from the camera). Depth maps are useful but limited: a depth value alone doesn’t tell you where a point is in 3D space unless you also know the camera’s intrinsic parameters (such as the focal length).
Recent breakthroughs, such as DUSt3R, proposed a different approach: instead of predicting depth, why not train the network to output a Point Map \(P\)? A point map associates every pixel \(u\) in an image with a specific 3D coordinate \((x, y, z)\) in the camera’s coordinate system.
\[ u \rightarrow P(u) \in \mathbb{R}^3 \]
This simplifies many geometric tasks. However, for deformable objects a standard point map has a major weakness: it gives you the shape as it appears, but nothing about the object’s underlying structure or pose. If you reconstruct a running horse, you get the 3D shape of a running horse, but the computer doesn’t know that “this point” corresponds to the “left knee” or that the leg is currently bent at a 45-degree angle relative to a resting state.
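Concretely, a point map is just a per-pixel grid of 3D coordinates. A minimal NumPy sketch, with illustrative shapes and names rather than any particular implementation:

```python
import numpy as np

H, W = 256, 256                      # image resolution
P = np.zeros((H, W, 3), np.float32)  # point map: one (x, y, z) per pixel
mask = np.zeros((H, W), bool)        # which pixels belong to the object

# Reading off the 3D location of a pixel is a plain array lookup;
# no camera intrinsics are needed to interpret the result.
u = (120, 87)                        # (row, col) of some pixel on the object
x, y, z = P[u]                       # its 3D point in the camera's coordinate system
```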
The Core Concept: Dual Point Maps
This is where DualPM innovates. The authors argue that to fully understand a deformable object, we need to solve two problems simultaneously:
- Where is this point in the current scene? (Posed)
- What is this point’s identity on the object? (Canonical)
To achieve this, the network predicts Dual Point Maps from a single input image.
1. The Posed Map (\(P\))
This is the standard reconstruction. It maps pixel \(u\) to its 3D location in the camera frame. It represents the object exactly as it appears in the image, with its current deformation and orientation.
2. The Canonical Map (\(Q\))
This is the game-changer. This map links pixel \(u\) to a corresponding 3D point on a neutral, rest-pose version of the object.
Imagine a “template” horse standing perfectly still. The Canonical Map tells us that a specific pixel in the photograph corresponds to a specific coordinate on that template horse. This map is pose-invariant. Whether the horse in the photo is jumping, sleeping, or running, the “nose” pixel always maps to the “nose” coordinate in canonical space.

As shown in Figure 1, the relationship between these two maps is powerful. The difference between the Posed Map (\(P\)) and the Canonical Map (\(Q\)) represents the deformation flow (\(P - Q\)). It mathematically describes how the object has been twisted and moved from its neutral state to the observed state.
With this dual representation, we get correspondence for free. If we want to find the same point on a horse across two different images (\(I_1\) and \(I_2\)), we simply look for pixels that map to the same Canonical coordinate:
\[ u_1 \leftrightarrow u_2 \quad \text{if} \quad Q_1(u_1) \approx Q_2(u_2) \]
where \(Q_1\) and \(Q_2\) are the canonical maps predicted for \(I_1\) and \(I_2\).
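Both operations reduce to simple array manipulations once \(P\) and \(Q\) are available as per-pixel maps. A hedged sketch in NumPy; the tensor shapes and the brute-force nearest-neighbour matching are my own illustration, not the paper’s implementation:

```python
import numpy as np

def deformation_flow(P, Q, mask):
    """Per-pixel displacement from rest pose to observed pose (P - Q)."""
    flow = np.zeros_like(P)
    flow[mask] = P[mask] - Q[mask]
    return flow

def match_pixels(Q1, mask1, Q2, mask2):
    """Match foreground pixels of image 1 to image 2 via canonical coordinates."""
    idx1 = np.argwhere(mask1)              # (N1, 2) pixel coords in image 1
    idx2 = np.argwhere(mask2)              # (N2, 2) pixel coords in image 2
    q1, q2 = Q1[mask1], Q2[mask2]          # (N1, 3) and (N2, 3) canonical points
    # For each pixel of image 1, pick the pixel of image 2 whose canonical
    # coordinate is closest (brute-force nearest neighbour).
    d = np.linalg.norm(q1[:, None] - q2[None], axis=-1)   # (N1, N2) distances
    return idx1, idx2[d.argmin(axis=1)]
```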
Architecture: How to Predict the Maps
The researchers designed a neural network architecture specifically to leverage the relationship between \(P\) and \(Q\). It doesn’t just predict them side-by-side; it uses one to help find the other.

The process, illustrated in Figure 3, follows these steps:
- Feature Extraction: The input image \(I\) is passed through a feature extractor (using pre-trained weights like DINOv2) to get rich image features \(F\).
- Predicting Canonical (\(Q\)): The network first predicts the Canonical Map. This ordering makes sense because predicting \(Q\) is closer to a labeling problem: assigning a “body part identity” to each pixel. It is easier for the network to learn because it is invariant to the camera angle and the animal’s pose.
- Predicting Posed (\(P\)): Here is the clever architectural choice. The prediction of the Posed Map \(P\) is conditioned on the predicted Canonical Map \(Q\).

By feeding the “identity” of the points (\(Q\)) into the predictor for the geometry (\(P\)), the network knows what it is reconstructing before it decides where to place it. This significantly improves generalization.
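A minimal PyTorch-style sketch of this two-stage design. The module names, feature dimensions, and one-layer heads are illustrative stand-ins, not the authors’ code; the only point being made is that the \(P\) head receives \(Q\) as an extra input:

```python
import torch
import torch.nn as nn

class DualPointMapNet(nn.Module):
    def __init__(self, feat_dim=384):
        super().__init__()
        # Stand-in for a (frozen) DINOv2-style feature extractor that turns
        # the input image into a feat_dim-channel feature map F.
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=1)
        # Head 1: canonical map Q (3 channels: canonical x, y, z per pixel).
        self.q_head = nn.Conv2d(feat_dim, 3, kernel_size=1)
        # Head 2: posed map P, conditioned on both the features F and Q.
        self.p_head = nn.Conv2d(feat_dim + 3, 3, kernel_size=1)

    def forward(self, image):
        F = self.backbone(image)               # (B, feat_dim, H, W)
        Q = self.q_head(F)                     # (B, 3, H, W) canonical map
        P = self.p_head(torch.cat([F, Q], 1))  # (B, 3, H, W) posed map
        return P, Q
```

In the real architecture the backbone and heads are far richer, and the outputs also include per-pixel confidences and the amodal layers discussed next, but the conditioning pattern is the same.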
Seeing the Invisible: Amodal Reconstruction
A major limitation of standard point maps is that they only represent what the camera can see. If a horse is viewed from the side, the other side of the body is lost. This is called self-occlusion.
To reconstruct the complete 3D shape, DualPM introduces Amodal Point Maps.
Instead of mapping a pixel to a single 3D point, the network maps a pixel to a sequence of points. Imagine shooting a ray from the camera through a specific pixel. That ray might hit the visible side of the horse (entry point), pass through the body, and exit the other side (exit point).

As visualized in Figure 2, the network predicts layers of points \((p_1, p_2, \dots)\).
- Layer 1: The visible surface.
- Layer 2: The back surface, where the ray exits the object (hidden from view).
- Subsequent Layers: Any other folds or limbs the ray might pass through.
By predicting these ordered layers, the method reconstructs the entire volume of the object, not just the visible shell.
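One convenient way to picture the amodal output is as a stack of \(K\) point maps, one per intersection layer, together with a validity mask recording how many intersections each ray actually has. A hedged sketch of turning such a stack into a full point cloud; the tensor layout is my assumption, not the paper’s exact format:

```python
import numpy as np

def amodal_to_point_cloud(layers, valid):
    """
    layers: (K, H, W, 3) array; layers[k, v, u] is the k-th intersection of the
            ray through pixel (v, u) with the object surface (layer 0 = visible).
    valid:  (K, H, W) boolean array, True where that intersection exists.
    Returns an (N, 3) point cloud covering the whole object, hidden parts included.
    """
    return layers[valid]
```

Keeping only `layers[0]` recovers an ordinary, visible-surface point map, which is exactly the special case that amodal prediction generalizes.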
Training with Synthetic Data
Training deep learning models for 3D usually requires massive amounts of data. Getting ground-truth 3D data for real animals is nearly impossible (you can’t easily put a motion capture suit on a wild zebra).
The authors solve this by training entirely on synthetic data. They use the “Animodel” dataset, which consists of rigged 3D models of animals.

The training pipeline (Figure 4) involves the following steps (a code sketch follows the list):
- Taking a 3D model (e.g., a horse).
- Randomly articulating it (posing).
- Rendering it from random viewpoints with random lighting.
- Generating the ground truth \(P\) and \(Q\) maps from the renderer.
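A hedged sketch of what one iteration of such a data generator might look like. Every helper here (`sample_pose`, `render`, `point_map`, and so on) is a hypothetical placeholder for the rigging and rendering machinery, not a real API:

```python
def make_training_sample(rigged_model, renderer):
    # 1. Randomly articulate the rigged 3D model (pose its skeleton).
    pose = rigged_model.sample_pose()               # hypothetical helper
    posed_mesh = rigged_model.apply_pose(pose)      # hypothetical helper

    # 2. Render it from a random viewpoint with random lighting.
    camera, lights = renderer.sample_camera_and_lights()
    image = renderer.render(posed_mesh, camera, lights)

    # 3. Read the ground truth off the renderer:
    #    P: each foreground pixel's 3D point on the posed mesh (camera frame).
    #    Q: the coordinates of that same surface point on the rest-pose mesh.
    P = renderer.point_map(posed_mesh, camera)
    Q = renderer.canonical_point_map(posed_mesh, rigged_model.rest_mesh, camera)
    return image, P, Q
```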
Remarkably, the model is trained on only one or two 3D assets per category. One might expect the model to overfit to that specific 3D mesh. However, because the network learns to predict the canonical coordinates (the structural identity), it generalizes surprisingly well to real photos of different animals within the same category.
To enforce learning, they use a loss function that compares the predicted points to the ground truth, weighted by confidence:

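The paper’s exact notation aside, confidence-weighted point-regression losses of this kind (popularized by DUSt3R) typically take a form like the one below, where \(\hat{P}(u)\) is the predicted point, \(P(u)\) the ground truth, and \(c(u)\) a per-pixel confidence that the network also outputs:

\[ \mathcal{L} = \sum_{u} c(u)\, \big\lVert \hat{P}(u) - P(u) \big\rVert \;-\; \alpha \log c(u) \]

The \(-\alpha \log c(u)\) term keeps the network from driving every confidence to zero and ignoring the data: uncertain pixels are down-weighted but still penalized. An analogous term can be applied to the canonical map \(Q\) and the amodal layers.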
Experimental Results
The DualPM method shows impressive results when applied to real-world images, significantly outperforming prior state-of-the-art methods like 3D-Fauna.
Qualitative Comparison
In Figure 5 below, you can see the difference in reconstruction quality.
- Input View: The reconstruction overlaid on the original image.
- Novel Views: The reconstruction rotated to show the 3D structure.
DualPM (labeled “Ours” in the figure) recovers intricate details and correct articulation, whereas 3D-Fauna often produces “mushy” or distorted shapes when the animal is in a difficult pose (like the jumping horse in the top row).

Why Conditioning Matters
The authors performed an ablation study to prove that conditioning the posed map \(P\) on the canonical map \(Q\) was crucial.
In Figure 7, we see that when \(P\) is conditioned on \(Q\) alone (bottom rows), the generalization is robust. If the network relies too much on raw image features \(F\) (top rows), it might get confused by textures or lighting that it didn’t see in the synthetic training set. \(Q\) acts as a stable, semantic bridge.

Generalization to New Species
Perhaps the most surprising result is the model’s zero-shot capabilities. A model trained only on a horse was tested on images of cows and sheep. Because these quadrupeds share a similar topological structure (four legs, head, tail), the “canonical” understanding of the horse transferred reasonably well to the other animals.

Applications: Skeleton Fitting and Animation
Because DualPM provides the Canonical Map (\(Q\)), we know exactly where every part of the mesh should be in a neutral pose. This makes it trivial to fit a 3D skeleton to the reconstruction.
Once the skeleton is fitted, the static image can be brought to life. The researchers demonstrated that they could take a single photo of a horse, reconstruct it, fit a skeleton, and then apply motion capture data to animate the horse running.
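As a rough illustration of why \(Q\) makes skeleton-driven animation easy: if the canonical template comes with a skeleton (a set of joint positions in canonical space), every reconstructed point can be attached to its nearest joint in canonical space and then moved rigidly with that joint. This crude one-joint skinning is a stand-in for whatever the authors actually use; it only demonstrates the role of the canonical coordinates:

```python
import numpy as np

def repose(points_canonical, joints_canonical, joint_transforms):
    """
    points_canonical:  (N, 3) reconstructed points in canonical space (from Q).
    joints_canonical:  (J, 3) skeleton joint positions in canonical space.
    joint_transforms:  (J, 4, 4) rigid transform moving each joint from its
                       canonical placement to the target (e.g. mocap) pose.
    Returns the (N, 3) re-posed points.
    """
    # Attach each point to its nearest joint in canonical space; this is the
    # step that the canonical map Q makes possible from a single photograph.
    d = np.linalg.norm(points_canonical[:, None] - joints_canonical[None], axis=-1)
    joint_id = d.argmin(axis=1)                                   # (N,)

    # Move each point rigidly with its joint (homogeneous coordinates).
    homo = np.concatenate([points_canonical,
                           np.ones((len(points_canonical), 1))], axis=1)
    moved = np.einsum('nij,nj->ni', joint_transforms[joint_id], homo)
    return moved[:, :3]
```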

Conclusion
Dual Point Maps represent a significant step forward in 3D computer vision. By decomposing the problem into “where is it” (\(P\)) and “what is it” (\(Q\)), the researchers turned a complex geometric inference problem into a manageable learning task.
The ability to train on minimal synthetic data and generalize to real-world images suggests that this representation captures something fundamental about the geometry of deformable objects. Furthermore, the introduction of amodal layers solves the persistent problem of self-occlusion, allowing computers to “imagine” the unseen side of an object much like humans do.
For students and practitioners in the field, DualPM highlights the importance of choosing the right representation. Sometimes, the key to solving a hard problem isn’t a bigger network, but a smarter way of defining the output.