Introduction: The Human Ability to “Hallucinate” Geometry

Imagine you are standing in a classroom. You take a photo of the blackboard at the front. Then, you turn around and walk to the back of the room, taking a photo of a student’s desk. These two photos have zero overlap—there are no common visual features between them.

If you feed these two images into a traditional computer vision algorithm and ask, “Where is the second camera located relative to the first?”, the algorithm will fail. It looks for matching pixels, keypoints, or textures. Finding none, it cannot mathematically compute the geometry.

However, as a human, you can make a reasonable guess. You know what classrooms look like. You know that rows of desks usually face the blackboard. You possess a “world prior”—a mental model of how the world is structured.

This brings us to a fascinating question posed by researchers from Google and Cornell University: Can we grant this same intuition to computers by using Generative Video Models?

In this post, we will take a deep dive into the paper “Can Generative Video Models Help Pose Estimation?”. We will explore a novel method called InterPose, which uses video generation AI (like Sora, Runway, or DynamiCrafter) to “hallucinate” the missing visual information between two non-overlapping images, allowing us to solve geometric problems that were previously intractable.

Comparison of pose estimation methods. On the left, a standard Pose Model fails on non-overlapping images (red Xs). On the right, the proposed method generates a video to bridge the gap, resulting in accurate pose estimation (green checks).

As illustrated in Figure 1 above, the core idea is simple yet profound: if two images don’t overlap, use AI to generate a video that connects them. Then, use that video to figure out where the cameras must have been.

Background: The Challenge of Relative Pose Estimation

Before dissecting the solution, we need to rigorously define the problem.

The Math of Pose

In computer vision, Relative Pose Estimation is the task of determining the 3D transformation between two camera views. If we have Image A and Image B, we want to find the rotation matrix (\(R\)) and translation vector (\(t\)) that moves the camera from position A to position B.

Mathematically, if \(T_A\) and \(T_B\) are the world-to-camera transformations for our two images, they are represented as:

\[
T_A = \begin{bmatrix} R_A & t_A \\ \mathbf{0} & 1 \end{bmatrix}, \qquad
T_B = \begin{bmatrix} R_B & t_B \\ \mathbf{0} & 1 \end{bmatrix}
\]

where each \(R\) is a \(3 \times 3\) rotation matrix and each \(t\) a \(3 \times 1\) translation vector.

We aim to recover the relative pose \(T_{rel} = T_B T_A^{-1}\).
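
To make the notation concrete, here is a minimal NumPy sketch (not from the paper; the example poses are made up) that composes two 4×4 world-to-camera matrices into the relative pose and measures the rotation between them:

```python
import numpy as np

def relative_pose(T_A, T_B):
    """Relative pose T_rel = T_B @ inv(T_A) for 4x4 world-to-camera matrices."""
    return T_B @ np.linalg.inv(T_A)

def rotation_angle_deg(R):
    """Geodesic rotation angle of a 3x3 rotation matrix, in degrees."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

# Toy example: camera B is camera A rotated 30 degrees about the vertical axis
# and shifted 1 unit to the side.
theta = np.radians(30)
T_A = np.eye(4)
T_B = np.eye(4)
T_B[:3, :3] = np.array([[np.cos(theta), 0, np.sin(theta)],
                        [0, 1, 0],
                        [-np.sin(theta), 0, np.cos(theta)]])
T_B[:3, 3] = [1.0, 0.0, 0.0]

T_rel = relative_pose(T_A, T_B)
print(rotation_angle_deg(T_rel[:3, :3]))  # ~30.0
```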

The “Wide Baseline” Problem

Traditional methods (like SIFT + RANSAC) and even modern deep learning methods (like LoFTR) rely on correspondences. They identify specific points in Image A (e.g., the corner of a table) and try to find the exact same point in Image B. By analyzing how these points move, they calculate the camera’s movement.

This approach hits a hard wall in “wide baseline” scenarios—situations where the cameras are so far apart that the images share little to no visual overlap. If the corner of the table isn’t visible in both images, the math breaks down.
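
To see why, here is a rough OpenCV sketch of the classical correspondence pipeline (image paths and the intrinsic matrix are placeholders); on a wide-baseline, non-overlapping pair the ratio test leaves too few matches for the essential-matrix step to succeed:

```python
import cv2
import numpy as np

img_a = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder paths
img_b = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_a, des_a = sift.detectAndCompute(img_a, None)
kp_b, des_b = sift.detectAndCompute(img_b, None)

# Lowe's ratio test keeps only distinctive matches.
matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
good = [p[0] for p in matches if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

K = np.eye(3)  # stand-in for a real intrinsic matrix
if len(good) < 8:
    print("Too few correspondences -- the classical pipeline fails here.")
else:
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    E, _ = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC,
                                prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K)
    print("Estimated rotation:\n", R)
```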

Recently, a state-of-the-art model called DUSt3R has shown impressive results by learning to align point clouds globally without explicit feature matching. However, even DUSt3R struggles when the viewpoint change is extreme. It essentially lacks the “imagination” to understand how two completely different views connect.

The Core Method: InterPose

The researchers propose InterPose, a framework that uses generative video models as a “world prior.”

Video models are trained on internet-scale video data. They have “seen” millions of camera pans, zooms, and rotations. They implicitly understand 3D geometry, object permanence, and scene layouts. The hypothesis is that we can extract this implicit knowledge to help with explicit pose estimation.

Step 1: Hallucinating the Bridge

The process begins with two static input images, \(I_A\) and \(I_B\). The goal is to create a dense visual transition between them. We treat these as the first and last frames of a video and ask a generative model to fill in the middle.

We can denote the video generation function as \(f_{vid}\), which takes the two images and a text prompt \(p\) (describing the scene) to produce a sequence of frames:

\[
\{ \hat{I}_1, \hat{I}_2, \ldots, \hat{I}_k \} = f_{vid}(I_A, I_B, p)
\]

The researchers tested three models: DynamiCrafter, Runway Gen-3 Alpha Turbo, and Luma Dream Machine. Prompted with descriptions of the scene (generated automatically by GPT-4), these models produce interpolated videos that smoothly transition from view A to view B.
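
The exact APIs differ per backend, so the sketch below only illustrates the shape of this step; `sample_video` is a hypothetical callable standing in for whichever generator (DynamiCrafter, Runway, Luma) you wire in, and its keyword arguments are assumptions:

```python
from typing import Callable, List, Sequence

def generate_candidate_videos(
    image_a,
    image_b,
    prompt: str,
    sample_video: Callable[..., Sequence],  # hypothetical backend hook
    n: int = 4,
) -> List[list]:
    """Role of f_vid: first frame + last frame + prompt in, full frame list out.

    We generate n candidates (e.g., with different seeds) so the
    self-consistency step later has several videos to choose from.
    """
    videos = []
    for seed in range(n):
        frames = sample_video(first_frame=image_a, last_frame=image_b,
                              prompt=prompt, seed=seed)   # assumed signature
        videos.append([image_a, *list(frames), image_b])
    return videos
```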

Qualitative examples of generated videos. The left column is the starting image, the right is the target, and the middle shows the AI-generated transition frames.

As you can see in the figure above, the video models successfully “imagine” the intermediate steps. In the top row (DynamiCrafter), the model infers the spatial layout of a street scene. In the bottom row, it understands the 3D structure of a toy on a pedestal.

Step 2: The Problem of Inconsistency

If generative video models were perfect physics simulators, our job would be done. We could simply run Structure-from-Motion (SfM) on the generated video.

However, generative models are probabilistic and prone to hallucinations. They prioritize visual plausibility over geometric consistency. A generated video might look real to a human at a glance but contain physically impossible warping.

Examples of failure modes in video generation. A microwave appears out of nowhere; scenes morph unnaturally; objects change appearance.

The figure above highlights these failure modes. In the top row, a microwave magically appears above a sink. In other rows, the geometry “morphs” rather than rotating rigidly. If we fed these “bad” frames into a pose estimator, the results would be garbage.

Step 3: The Self-Consistency Score

This is the most critical contribution of the paper. Since we cannot blindly trust a single generated video, the authors propose a Self-Consistency mechanism.

The intuition is: “Truth is consistent; hallucination is random.”

If we generate a valid video of a static scene, the camera pose estimated from any subset of frames should be roughly the same. If the video is morphing or glitching, different subsets of frames will yield wildly different pose estimates.

The Algorithm

  1. Generate Multiple Videos: For a single pair of images (\(I_A, I_B\)), generate \(n\) different videos (using different prompts or by swapping the order \(I_B \to I_A\)).
  2. Sample Subsets: For each video, randomly select \(m\) subsets of frames. Each subset includes the original start/end images plus a few generated intermediate frames.
  3. Estimate Poses: Feed each subset into a pose estimator (DUSt3R).
  4. Score Consistency: Calculate how much the pose estimates vary within that video.

Let \(f_{pose}\) be our pose estimator (DUSt3R) that takes a set of frames and outputs the relative pose \(\hat{T}\):

\[
\hat{T}_i = f_{pose}\big(\{ I_A, I_B \} \cup S_i\big)
\]

where \(S_i\) is the \(i\)-th randomly sampled subset of generated frames.

To measure inconsistency, the authors use the Medoid Distance. For a specific generated video, they look at all the poses predicted from its different frame subsets. The medoid is the “center” of these predictions. The score is the average distance of all predictions to this center.

\[
D_{med} = \min_{i} \; \frac{1}{m} \sum_{j=1}^{m} d\big(\hat{T}_i, \hat{T}_j\big)
\]

where \(d(\cdot, \cdot)\) measures the distance between two relative-pose estimates, and the \(\hat{T}_i\) achieving the minimum is the medoid.

A low \(D_{med}\) means the video is geometrically consistent—no matter which frames you look at, the geometry tells the same story.
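
Here is a minimal sketch of that scoring loop, assuming an `estimate_pose` callable that stands in for DUSt3R and returns a 4×4 relative pose; for brevity the pose distance compares rotations only:

```python
import random
import numpy as np

def rotation_distance(T1, T2):
    """Geodesic distance (radians) between the rotation parts of two 4x4 poses."""
    R_rel = T1[:3, :3].T @ T2[:3, :3]
    return np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))

def medoid_distance(poses, dist=rotation_distance):
    """D_med: the smallest average distance from any pose estimate to all others."""
    avg = [np.mean([dist(p, q) for q in poses]) for p in poses]
    idx = int(np.argmin(avg))
    return avg[idx], poses[idx]          # score and the medoid pose itself

def score_video(frames, estimate_pose, m=8, subset_size=3, seed=0):
    """Sample m frame subsets (always keeping the first/last frame), estimate a
    relative pose for each, and measure how much the estimates agree."""
    rng = random.Random(seed)
    middle = frames[1:-1]
    poses = []
    for _ in range(m):
        chosen = rng.sample(middle, k=min(subset_size, len(middle)))
        subset = [frames[0], *chosen, frames[-1]]
        poses.append(estimate_pose(subset))   # stand-in for DUSt3R on the subset
    return medoid_distance(poses)
```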

Visualizing Consistency

Visualization of self-consistency. On the left (a), two videos are generated. On the right (b, c), the poses estimated from Video 1 (red) are tightly clustered, indicating high consistency. Video 0 (purple) produces scattered, unreliable poses.

The visualization above perfectly captures this. Look at the sphere plots on the right.

  • Video 0 (Purple): The pose estimates are scattered all over the place. This video likely has morphing artifacts.
  • Video 1 (Red): The pose estimates are tightly clustered. This video represents a consistent 3D movement.

The Final Metric

In practice, a video could be “consistently wrong” (e.g., highly consistent but estimating a 180-degree flip as 0 degrees). To prevent this, the authors add a bias term to anchor the prediction to the estimate derived from the original pair alone.

The total score \(D_{total}\) combines the self-consistency (medoid distance) and the distance from the baseline prediction:

\[
D_{total} = D_{med} + d\big(\hat{T}_{med}, \hat{T}_{pair}\big)
\]

where \(\hat{T}_{pair}\) is the pose estimated from the original image pair alone.

The algorithm simply picks the video with the lowest \(D_{total}\) and uses its medoid pose as the final answer.
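
Continuing the sketch, the final selection could look roughly like this; the additive combination and the `bias_weight` are my assumptions about how the bias term is folded in, not the paper's exact formula:

```python
def select_best_video(videos, estimate_pose, pair_frames, bias_weight=1.0):
    """Pick the generated video whose pose estimates are most self-consistent
    and not too far from the pair-only baseline estimate."""
    T_pair = estimate_pose(pair_frames)          # baseline: original two images only
    best = None
    for frames in videos:
        d_med, T_med = score_video(frames, estimate_pose)
        d_total = d_med + bias_weight * rotation_distance(T_med, T_pair)
        if best is None or d_total < best[0]:
            best = (d_total, T_med)
    return best[1]                               # medoid pose of the winning video
```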

Experiments and Results

The researchers evaluated InterPose on four diverse datasets:

  1. Cambridge Landmarks (Outdoor, urban).
  2. ScanNet (Indoor).
  3. DL3DV-10K (Outdoor, scene-scale).
  4. NAVI (Object-centric).

They specifically targeted difficult pairs with significant yaw changes (50°-90°) and little overlap.

Quantitative Analysis

The results show that adding generated frames consistently helps: InterPose outperforms the state-of-the-art DUSt3R model run on the image pairs alone.

Outward-Facing Scenes (The Hardest Case)

Outward-facing cameras (like walking down a street or looking around a room) often result in non-overlapping views. This is where InterPose shines.

Table 1 showing results on Cambridge Landmarks and ScanNet. InterPose (Ours) consistently achieves lower errors than baselines.

In Table 1, look at the MRE (Mean Rotation Error) and MTE (Mean Translation Error). Lower is better.

  • DUSt3R (Pair): On Cambridge Landmarks, the rotation error is 13.28°.
  • Ours (Runway): Reduces this error to 10.78°.
  • Ours (Dream Machine): On ScanNet, reduces rotation error from 21.31° down to 17.65°.

Notice the “Ours (Avg.)” vs “Ours (Medoid)” rows. Simply averaging all predictions (“Avg.”) often performs worse than the baseline. This proves that the Self-Consistency Score (Medoid) is crucial for filtering out the “hallucinated” junk.

Visual Proof

Does this actually look better in 3D? Yes.

Qualitative comparison of 3D reconstructions. The third column shows DUSt3R failing (incomplete geometry). The final column shows InterPose successfully reconstructing the scene using generated frames.

In Figure 5, the third column shows DUSt3R trying to estimate poses from just the pair. The reconstructions are sparse or incorrect (look at the red ground truth cameras vs. the predicted blue/gold ones). The final column shows the result when utilizing generated video frames. The camera poses align much better with the ground truth, and the point clouds are denser and more coherent.

Center-Facing Scenes (Easier Case)

For object-centric datasets where the camera looks at an object (NAVI, DL3DV), there is usually some overlap.

Table 2 showing results on DL3DV-10K and NAVI. The improvements are smaller but still present.

As shown in Table 2, the baseline DUSt3R is already quite good here because overlap exists. However, InterPose still squeezes out gains, reducing rotation error by roughly 1-4 degrees. This demonstrates that the method doesn’t “break” easier cases; it’s a safe add-on.

What about MASt3R?

The authors also tested against MASt3R, a newer follow-up to DUSt3R that uses feature matching. MASt3R is incredible when images overlap but fails catastrophically when they don’t.

Figure 8 showing failure cases of MASt3R. The right column shows fragmented, broken meshes compared to DUSt3R on the left.

Figure 8 illustrates this. On non-overlapping pairs, MASt3R’s reliance on matching causes it to produce broken geometry. The table included in the image (Table 3) quantitatively confirms that on datasets like Cambridge, MASt3R has a massive error (36.55°), while InterPose keeps it low (around 12°). This confirms that “world priors” from video are superior to feature matching when visual overlap is gone.

The “Left-to-Right” Bias

An interesting quirk discovered during the research is that video models are biased. Because many training videos pan from left to right, the models prefer generating that specific motion.

Illustration of left-to-right bias in video generation models and the solution of swapping input order.

To mitigate this, the authors generate videos in both directions (\(I_A \to I_B\) and \(I_B \to I_A\)), giving the self-consistency algorithm a wider variety of motions to evaluate.
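
In code, the mitigation amounts to adding reversed-order generations to the candidate pool before scoring (reusing the hypothetical `generate_candidate_videos` wrapper from earlier):

```python
def generate_both_directions(image_a, image_b, prompt, sample_video, n=4):
    """Generate candidates A->B and B->A. One simple convention (an assumption
    here): flip the reversed videos back so every candidate still runs from
    view A to view B before the self-consistency scoring."""
    forward = generate_candidate_videos(image_a, image_b, prompt, sample_video, n)
    backward = generate_candidate_videos(image_b, image_a, prompt, sample_video, n)
    return forward + [list(reversed(v)) for v in backward]
```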

Conclusion & Implications

This paper bridges a significant gap in computer vision. For decades, “overlap” was the hard requirement for 3D reconstruction. If you didn’t see it in both images, you couldn’t map it.

InterPose proves that we can substitute visual overlap with semantic understanding. By leveraging the massive “world knowledge” inside generative video models, we can interpolate the missing visual data.

Here are the key takeaways:

  1. Video Models are Geometric Priors: They implicitly understand 3D structure, even if they aren’t perfect physics engines.
  2. Trust, but Verify: You cannot simply trust a generated video. A selection mechanism like the Self-Consistency Score is mandatory to filter out hallucinations.
  3. Filling the Gap: This approach enables pose estimation in “impossible” wide-baseline scenarios where standard feature matching yields zero results.

The “Oracle” results in the experiments (where the best generated video was manually selected) showed massive potential improvements (e.g., reducing error to ~3°). This suggests that as video generation models improve in fidelity and consistency, this technique will only become more powerful. We are moving toward a future where AI doesn’t just analyze the geometry it sees, but imagines the geometry it implies.