Introduction: The Human Ability to “Hallucinate” Geometry
Imagine you are standing in a classroom. You take a photo of the blackboard at the front. Then, you turn around and walk to the back of the room, taking a photo of a student’s desk. These two photos have zero overlap—there are no common visual features between them.
If you feed these two images into a traditional computer vision algorithm and ask, “Where is the second camera located relative to the first?”, the algorithm will fail. It looks for matching pixels, keypoints, or textures. Finding none, it cannot mathematically compute the geometry.
However, as a human, you can make a reasonable guess. You know what classrooms look like. You know that rows of desks usually face the blackboard. You possess a “world prior”—a mental model of how the world is structured.
This brings us to a fascinating question posed by researchers from Google and Cornell University: Can we grant this same intuition to computers by using Generative Video Models?
In this post, we will take a deep dive into the paper “Can Generative Video Models Help Pose Estimation?”. We will explore a novel method called InterPose, which uses video generation AI (like Sora, Runway, or DynamiCrafter) to “hallucinate” the missing visual information between two non-overlapping images, allowing us to solve geometric problems that were previously impossible.

As illustrated in Figure 1 above, the core idea is simple yet profound: if two images don’t overlap, use AI to generate a video that connects them. Then, use that video to figure out where the cameras must have been.
Background: The Challenge of Relative Pose Estimation
Before dissecting the solution, we need to rigorously define the problem.
The Math of Pose
In computer vision, Relative Pose Estimation is the task of determining the 3D transformation between two camera views. If we have Image A and Image B, we want to find the rotation matrix (\(R\)) and translation vector (\(t\)) that moves the camera from position A to position B.
Mathematically, if \(T_A\) and \(T_B\) are the world-to-camera transformations for our two images, they are represented as:
\[
T_A = \begin{bmatrix} R_A & t_A \\ \mathbf{0}^\top & 1 \end{bmatrix}, \qquad
T_B = \begin{bmatrix} R_B & t_B \\ \mathbf{0}^\top & 1 \end{bmatrix}
\]
We aim to recover the relative pose \(T_{rel} = T_B T_A^{-1}\).
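To make the notation concrete, here is a minimal numpy sketch (not from the paper) that assembles two world-to-camera matrices from \(R\) and \(t\), recovers \(T_{rel} = T_B T_A^{-1}\), and reads off the relative rotation angle; the example pose values are purely illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def world_to_camera(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Assemble a 4x4 world-to-camera transform from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


# Two arbitrary camera poses (illustrative values only).
T_A = world_to_camera(Rotation.from_euler("y", 0, degrees=True).as_matrix(), np.array([0.0, 0.0, 0.0]))
T_B = world_to_camera(Rotation.from_euler("y", 70, degrees=True).as_matrix(), np.array([1.0, 0.0, 2.0]))

# Relative pose that maps camera-A coordinates into camera-B coordinates.
T_rel = T_B @ np.linalg.inv(T_A)

# Relative rotation angle -- the quantity behind rotation-error metrics such as MRE.
R_rel = T_rel[:3, :3]
angle_deg = np.degrees(np.arccos(np.clip((np.trace(R_rel) - 1) / 2, -1.0, 1.0)))
print(f"relative rotation: {angle_deg:.1f} deg")
```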
The “Wide Baseline” Problem
Traditional methods (like SIFT + RANSAC) and even modern deep learning methods (like LoFTR) rely on correspondences. They identify specific points in Image A (e.g., the corner of a table) and try to find the exact same point in Image B. By analyzing how these points move, they calculate the camera’s movement.
This approach hits a hard wall in “wide baseline” scenarios—situations where the cameras are so far apart that the images share little to no visual overlap. If the corner of the table isn’t visible in both images, the math breaks down.
Recently, a state-of-the-art model called DUSt3R has shown impressive results by learning to align point clouds globally without explicit feature matching. However, even DUSt3R struggles when the viewpoint change is extreme. It essentially lacks the “imagination” to understand how two completely different views connect.
The Core Method: InterPose
The researchers propose InterPose, a framework that uses generative video models as a “world prior.”
Video models are trained on internet-scale video data. They have “seen” millions of camera pans, zooms, and rotations. They implicitly understand 3D geometry, object permanence, and scene layouts. The hypothesis is that we can extract this implicit knowledge to help with explicit pose estimation.
Step 1: Hallucinating the Bridge
The process begins with two static input images, \(I_A\) and \(I_B\). The goal is to create a dense visual transition between them. We treat these as the first and last frames of a video and ask a generative model to fill in the middle.
We can denote the video generation function as \(f_{vid}\), which takes the two images and a text prompt \(p\) (describing the scene) to produce a sequence of frames:
\[
\{I_1, I_2, \dots, I_K\} = f_{vid}(I_A, I_B, p)
\]
The researchers tested three models: DynamiCrafter, Runway Gen-3 Alpha Turbo, and Luma Dream Machine. By prompting these models with descriptions of the scene (generated automatically by GPT-4), they produce interpolated videos that smoothly transition from view A to view B.

As you can see in the figure above, the video models successfully “imagine” the intermediate steps. In the top row (DynamiCrafter), the model infers the spatial layout of a street scene. In the bottom row, it understands the 3D structure of a toy on a pedestal.
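The paper treats the video generator as a black box. The sketch below only shows the shape of that interface: `VideoInterpolator` and `hallucinate_bridge` are hypothetical names standing in for whichever backend (DynamiCrafter, Runway Gen-3, Luma Dream Machine) is plugged in; none of these calls correspond to a real API.

```python
from typing import List, Protocol

import numpy as np

Frame = np.ndarray  # H x W x 3 image


class VideoInterpolator(Protocol):
    """Hypothetical interface for a frame-conditioned video generator.
    DynamiCrafter, Runway Gen-3, and Luma Dream Machine all fit this shape."""

    def __call__(self, first: Frame, last: Frame, prompt: str, num_frames: int) -> List[Frame]: ...


def hallucinate_bridge(f_vid: VideoInterpolator, I_A: Frame, I_B: Frame,
                       prompt: str, num_frames: int = 16) -> List[Frame]:
    """Ask the generator for a video whose first and last frames are the two input images.
    The intermediate frames are the 'hallucinated bridge' between the views."""
    frames = f_vid(I_A, I_B, prompt, num_frames)
    # Keep the real images at both ends so downstream pose estimation stays anchored to them.
    return [I_A, *frames[1:-1], I_B]
```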
Step 2: The Problem of Inconsistency
If generative video models were perfect physics simulators, our job would be done. We could simply run Structure-from-Motion (SfM) on the generated video.
However, generative models are probabilistic and prone to hallucinations. They prioritize visual plausibility over geometric consistency. A generated video might look real to a human at a glance but contain physically impossible warping.

The figure above highlights these failure modes. In the top row, a microwave magically appears above a sink. In other rows, the geometry “morphs” rather than rotating rigidly. If we fed these “bad” frames into a pose estimator, the results would be garbage.
Step 3: The Self-Consistency Score
This is the most critical contribution of the paper. Since we cannot blindly trust a single generated video, the authors propose a Self-Consistency mechanism.
The intuition is: “Truth is consistent; hallucination is random.”
If we generate a valid video of a static scene, the camera pose estimated from any subset of frames should be roughly the same. If the video is morphing or glitching, different subsets of frames will yield wildly different pose estimates.
The Algorithm
- Generate Multiple Videos: For a single pair of images (\(I_A, I_B\)), generate \(n\) different videos (using different prompts or by swapping the order \(I_B \to I_A\)).
- Sample Subsets: For each video, randomly select \(m\) subsets of frames. Each subset includes the original start/end images plus a few generated intermediate frames.
- Estimate Poses: Feed each subset into a pose estimator (DUSt3R).
- Score Consistency: Calculate how much the pose estimates vary within that video.
Let \(f_{pose}\) be our pose estimator (DUSt3R) that takes a set of frames and outputs the relative pose \(\hat{T}\):
\[
\hat{T}_i = f_{pose}\big(\{I_A\} \cup S_i \cup \{I_B\}\big), \qquad S_i \subset \{I_1, \dots, I_K\}
\]
To measure inconsistency, the authors use the Medoid Distance. For a specific generated video, they look at all the poses predicted from its different frame subsets. The medoid is the “center” of these predictions. The score is the average distance of all predictions to this center.
\[
\hat{T}_{med} = \arg\min_{\hat{T}_j} \sum_{i=1}^{m} d\big(\hat{T}_i, \hat{T}_j\big), \qquad
D_{med} = \frac{1}{m} \sum_{i=1}^{m} d\big(\hat{T}_i, \hat{T}_{med}\big)
\]
A low \(D_{med}\) means the video is geometrically consistent—no matter which frames you look at, the geometry tells the same story.
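Here is a minimal sketch of the self-consistency score, assuming each frame subset has already been run through a pose estimator such as DUSt3R and reduced to a relative rotation matrix (translation can be handled analogously). The `geodesic_deg` distance and the medoid computation follow the description above; the paper's exact distance and weighting may differ.

```python
import numpy as np


def geodesic_deg(R1: np.ndarray, R2: np.ndarray) -> float:
    """Angle (degrees) of the rotation taking R1 to R2 -- a standard distance between poses."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def medoid_distance(rotations: list) -> tuple:
    """Return (D_med, medoid_rotation) for the pose estimates of one generated video.

    The medoid is the estimate with the smallest total distance to all others;
    D_med is the average distance of every estimate to that medoid."""
    D = np.array([[geodesic_deg(Ri, Rj) for Rj in rotations] for Ri in rotations])
    medoid_idx = int(np.argmin(D.sum(axis=1)))
    D_med = float(D[medoid_idx].mean())
    return D_med, rotations[medoid_idx]
```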
Visualizing Consistency

The visualization above perfectly captures this. Look at the sphere plots on the right.
- Video 0 (Purple): The pose estimates are scattered all over the place. This video likely has morphing artifacts.
- Video 1 (Red): The pose estimates are tightly clustered. This video represents a consistent 3D movement.
The Final Metric
In practice, a video could be “consistently wrong” (e.g., highly consistent but estimating a 180-degree flip as 0 degrees). To prevent this, the authors add a bias term to anchor the prediction to the estimate derived from the original pair alone.
The total score \(D_{total}\) combines the self-consistency (medoid distance) and the distance from the baseline prediction:
\[
D_{total} = D_{med} + d\big(\hat{T}_{med}, \hat{T}_{pair}\big)
\]

where \(\hat{T}_{pair}\) is the pose estimated from the original image pair alone.
The algorithm simply picks the video with the lowest \(D_{total}\) and uses its medoid pose as the final answer.
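Putting the pieces together, here is a sketch of the selection rule. It assumes we already have, for each candidate video, its medoid rotation and \(D_{med}\) (e.g. from the `medoid_distance` sketch above), plus the baseline rotation estimated from the original pair alone; the unit-weight sum of the two terms is my assumption, and the paper may weight the bias term differently.

```python
import numpy as np


def geodesic_deg(R1: np.ndarray, R2: np.ndarray) -> float:
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def select_best_pose(candidates: list, R_pair: np.ndarray) -> np.ndarray:
    """candidates: list of (D_med, medoid_rotation) tuples, one per generated video.
    R_pair: rotation estimated from the original image pair alone (the anchor).
    Returns the medoid rotation of the video with the lowest total score."""
    best_R, best_score = None, np.inf
    for D_med, R_med in candidates:
        # Self-consistency plus a bias term pulling toward the pair-only estimate.
        D_total = D_med + geodesic_deg(R_med, R_pair)
        if D_total < best_score:
            best_score, best_R = D_total, R_med
    return best_R
```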
Experiments and Results
The researchers evaluated InterPose on four diverse datasets:
- Cambridge Landmarks (Outdoor, urban).
- ScanNet (Indoor).
- DL3DV-10K (Outdoor, scene-scale).
- NAVI (Object-centric).
They specifically targeted difficult pairs with significant yaw changes (50°-90°) and little overlap.
Quantitative Analysis
The results show that adding generated frames consistently helps pose estimation, outperforming the state-of-the-art DUSt3R model applied to the image pairs alone.
Outward-Facing Scenes (The Hardest Case)
Outward-facing cameras (like walking down a street or looking around a room) often result in non-overlapping views. This is where InterPose shines.

In Table 1, look at the MRE (Mean Rotation Error) and MTE (Mean Translation Error). Lower is better.
- DUSt3R (Pair): On Cambridge Landmarks, the rotation error is 13.28°.
- Ours (Runway): Reduces this error to 10.78°.
- Ours (Dream Machine): On ScanNet, reduces rotation error from 21.31° down to 17.65°.
Notice the “Ours (Avg.)” vs “Ours (Medoid)” rows. Simply averaging all predictions (“Avg.”) often performs worse than the baseline. This proves that the Self-Consistency Score (Medoid) is crucial for filtering out the “hallucinated” junk.
Visual Proof
Does this actually look better in 3D? Yes.

In Figure 5, the third column shows DUSt3R trying to estimate poses from just the pair. The reconstructions are sparse or incorrect (look at the red ground truth cameras vs. the predicted blue/gold ones). The final column shows the result when utilizing generated video frames. The camera poses align much better with the ground truth, and the point clouds are denser and more coherent.
Center-Facing Scenes (Easier Case)
For object-centric datasets where the camera looks at an object (NAVI, DL3DV), there is usually some overlap.

As shown in Table 2, the baseline DUSt3R is already quite good here because overlap exists. However, InterPose still squeezes out performance gains, reducing rotation error by roughly 1-4 degrees. This demonstrates that the method doesn’t “break” easier cases; it’s a safe add-on.
What about MASt3R?
The authors also tested against MASt3R, a newer follow-up to DUSt3R that uses feature matching. MASt3R is incredible when images overlap but fails catastrophically when they don’t.

Figure 8 illustrates this. On non-overlapping pairs, MASt3R’s reliance on matching causes it to produce broken geometry. The table included in the image (Table 3) quantitatively confirms that on datasets like Cambridge, MASt3R has a massive error (36.55°), while InterPose keeps it low (around 12°). This confirms that “world priors” from video are superior to feature matching when visual overlap is gone.
The “Left-to-Right” Bias
An interesting quirk discovered during the research is that video models are biased. Because many training videos pan from left to right, the models prefer generating that specific motion.

To mitigate this, the authors generate videos in both directions (\(I_A \to I_B\) and \(I_B \to I_A\)), giving the self-consistency algorithm a wider variety of motions to evaluate.
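As a small sketch of that mitigation (under my own bookkeeping assumptions, not necessarily the paper's exact recipe): generate candidates in both orders, and invert the relative pose recovered from a reversed video so that every candidate is expressed in the same A→B frame before scoring. The `generate` and `estimate_pose` callables below are hypothetical wrappers around the video model and the pose estimator.

```python
import numpy as np


def pose_candidates_both_directions(generate, estimate_pose, I_A, I_B, prompt):
    """generate(first, last, prompt) -> list of frames      (hypothetical video-model wrapper)
    estimate_pose(frames) -> 4x4 pose of the last frame w.r.t. the first (e.g. via DUSt3R)."""
    candidates = []
    # Forward order: the estimate is already the A -> B pose.
    candidates.append(estimate_pose(generate(I_A, I_B, prompt)))
    # Reversed order: the estimate is the B -> A pose, so invert it to compare in the same frame.
    candidates.append(np.linalg.inv(estimate_pose(generate(I_B, I_A, prompt))))
    return candidates
```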
Conclusion & Implications
This paper bridges a significant gap in computer vision. For decades, “overlap” was the hard requirement for 3D reconstruction. If you didn’t see it in both images, you couldn’t map it.
InterPose proves that we can substitute visual overlap with semantic understanding. By leveraging the massive “world knowledge” inside generative video models, we can interpolate the missing visual data.
Here are the key takeaways:
- Video Models are Geometric Priors: They implicitly understand 3D structure, even if they aren’t perfect physics engines.
- Trust, but Verify: You cannot simply trust a generated video. A selection mechanism like the Self-Consistency Score is mandatory to filter out hallucinations.
- Filling the Gap: This approach enables pose estimation in “impossible” wide-baseline scenarios where standard feature matching yields zero results.
The “Oracle” results in the experiments (where the best generated video was manually selected) showed massive potential improvements (e.g., reducing error to ~3°). This suggests that as video generation models improve in fidelity and consistency, this technique will only become more powerful. We are moving toward a future where AI doesn’t just analyze the geometry it sees, but imagines the geometry it implies.