Introduction
In the world of computer vision, we are currently witnessing a golden age of Monocular Depth Estimation (MDE). Thanks to deep learning and massive datasets, modern neural networks can look at a single, flat 2D image and predict a surprisingly accurate dense depth map. Models like MiDaS, Marigold, and Depth Anything have transformed 2D images into pseudo-3D representations.
However, there is a gap between generating a pretty 3D visualization and actually using that data for rigorous geometric tasks. One of the most fundamental tasks in computer vision is Relative Pose Estimation: determining the rotation and translation between two cameras that took pictures of the same scene. Traditionally, this is done by matching feature points between images and using epipolar geometry.
Ideally, if we have dense depth maps from MDE models, we should be able to lock the two images together in 3D space perfectly. But in practice, simply trusting these depth maps leads to significant errors. Why? Because neural networks introduce invisible distortions. They often get the structure right but the scale and “shift” wrong.
In this post, we dive into the research paper “Relative Pose Estimation through Affine Corrections of Monocular Depth Priors.” This work proposes a robust mathematical framework to fix these distortions. By explicitly modeling affine corrections—specifically solving for both scale and shift—the researchers developed a method that significantly outperforms traditional geometric solvers.

As shown above, failing to account for the shift results in a distorted “funhouse mirror” effect (Left), whereas modeling it recovers the true geometry (Right).
Background: The Ambiguity of Depth
To understand why this new method is necessary, we must first understand the limitations of Monocular Depth Estimation.
When a deep learning model predicts depth from a single image, it faces an inherent ambiguity. A large object far away looks exactly like a small object close up. Because of this, most MDE models are trained to be scale-invariant or affine-invariant. They predict a value that is related to the true depth, but not the true depth itself.
If \(D_{pred}\) is the predicted depth and \(D_{gt}\) is the ground truth (metric) depth, the relationship is often linear:
\[ D_{gt} = \alpha \cdot D_{pred} + \beta \]
Here, \(\alpha\) represents the scale and \(\beta\) represents the shift.
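To make the relationship concrete, here is a minimal sketch that recovers \(\alpha\) and \(\beta\) by least squares, assuming (purely for illustration) that ground-truth depth is available at a few pixels. In the real problem no ground truth exists, which is exactly why the paper estimates these parameters jointly with the pose:

```python
import numpy as np

# Toy data: the "ground truth" was generated with alpha = 2.0, beta = 0.5.
d_pred = np.array([0.8, 1.5, 2.3, 4.0, 6.1])   # network output (arbitrary units)
d_gt   = np.array([2.1, 3.5, 5.1, 8.5, 12.7])  # metric depth in meters

# Solve D_gt ≈ alpha * D_pred + beta in the least-squares sense.
A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)
(alpha, beta), *_ = np.linalg.lstsq(A, d_gt, rcond=None)

print(f"alpha ≈ {alpha:.3f}, beta ≈ {beta:.3f}")  # recovers 2.000 and 0.500
```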
The Problem with Existing Solvers
Traditional approaches to using depth priors in pose estimation usually assume that \(\beta = 0\). They treat the predicted depth as correct “up to scale.” This is a reasonable assumption for some sensors, but for deep learning models it is often false. The model might place a background wall at depth 100 and a foreground chair at depth 50, while in reality they sit at 10 m and 5 m (explainable by scale alone) or at 15 m and 10 m (which additionally requires a shift, since the ratio between the two depths has changed).
If you force a solver to align these mismatched depth maps using only scale, you introduce geometric inconsistencies that ruin the estimate of the camera’s rotation and translation.
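A small worked example (numbers chosen purely for illustration) makes the failure mode concrete. Suppose the true correction is \(\alpha = 0.1\), \(\beta = 5\): predicted depths of 50 and 100 correspond to true depths of 10 m and 15 m. The best scale-only fit is
\[ \alpha_{\text{scale}} = \frac{50 \cdot 10 + 100 \cdot 15}{50^2 + 100^2} = \frac{2000}{12500} = 0.16, \]
which maps the two points to 8 m and 16 m: the depth gap between them inflates from 5 m to 8 m, and every 3D distance computed from these depths is distorted accordingly.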
The Core Method: Affine Corrections
The researchers propose a new set of geometric solvers that treat the depth maps as priors that need to be “corrected” via an affine transformation.
1. Problem Formulation
The goal is to estimate the relative pose \((\mathbf{R}, \mathbf{t})\) between two images, \(I_1\) and \(I_2\). We have a set of 2D pixel correspondences \((p_{1}, p_{2})\) and predicted depth maps \(D_1, D_2\).
The authors model the relationship between the predicted depth and the metric depth used for reconstruction as follows:
\[ \hat{D}_i = a_i \cdot D_i + b_i, \qquad i \in \{1, 2\} \]
Here, \(a_i\) and \(b_i\) are the unknown scale and shift for each image. To align two images, we don’t need the absolute metric scale of the world (which is impossible to get from monocular vision alone); we only need the relative scale. Therefore, the researchers simplify the unknowns. They define a relative scale \(\alpha\) and two relative shifts \(\beta_1, \beta_2\):
\[ \hat{D}_1 = D_1 + \beta_1, \qquad \hat{D}_2 = \alpha \cdot D_2 + \beta_2, \]
where \(\alpha = a_2 / a_1\), \(\beta_1 = b_1 / a_1\), and \(\beta_2 = b_2 / a_1\) (dividing everything by \(a_1\) only rescales the whole reconstruction, which is unobservable anyway).
This is the governing equation of the paper. We need to find the rotation \(\mathbf{R}\), translation \(\mathbf{t}\), scale \(\alpha\), and shifts \(\beta_1, \beta_2\) that make the two views geometrically consistent.
2. The Geometric Constraint
How do we solve for so many variables? We exploit a basic property of rigid-body transformations: if we lift the 2D pixels into 3D points using the corrected depths, the distance between any two scene points must be the same regardless of which camera frame it is measured in.
First, we lift a 2D pixel \(p_{ij}\) (point \(j\) in image \(i\)) into 3D space \(P_{ij}\):
\[ P_{ij} = \hat{D}_{ij} \, K_i^{-1} \, \tilde{p}_{ij}, \]
where \(\tilde{p}_{ij}\) is the pixel in homogeneous coordinates, \(K_i\) is the intrinsic matrix, and \(\hat{D}_{ij}\) is the affine-corrected depth at that pixel.
The standard relationship between the two camera views is:
\[ P_{2j} = \mathbf{R} \, P_{1j} + \mathbf{t} \]
Because rotation and translation preserve distances, the squared distance between any two points \(j\) and \(k\) must be identical in both camera frames. Let \(\delta\) be the vector between two points:
\[ \delta_{i,jk} = P_{ij} - P_{ik} \]
The core constraint is:
\[ \left\| \delta_{1,jk} \right\|^2 = \left\| \delta_{2,jk} \right\|^2 \]
By expanding this equation using the affine depth variables (\(\alpha, \beta\)), the rotation and translation matrices are eliminated, leaving us with a system of polynomial equations that depend only on the depth parameters.

This derivation allows the authors to create solvers that find the depth correction parameters first, and then retrieve the pose.
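The invariance is easy to check numerically. The sketch below uses made-up intrinsics, a small rotation, and exact (i.e., already corrected) depths to show that the inter-point distance survives the change of camera frame, which is precisely the property the solvers exploit once the depths have been affine-corrected:

```python
import numpy as np

# Made-up intrinsics and ground-truth pose, purely for illustration.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(10.0)                      # small rotation about the y-axis
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, -0.05, 0.2])

def lift(pixel, depth, K):
    """Back-project a pixel to 3D: P = depth * K^{-1} [u, v, 1]^T."""
    return depth * (np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0]))

# Two scene points expressed in camera 1, and the same points in camera 2.
P1 = np.array([[0.3, -0.2, 4.0], [-0.5, 0.4, 6.0]])
P2 = (R @ P1.T).T + t

# Project into each image and keep the depths seen by each camera.
pix1 = [(K @ P)[:2] / P[2] for P in P1]
pix2 = [(K @ P)[:2] / P[2] for P in P2]
d1, d2 = [P[2] for P in P1], [P[2] for P in P2]

# Lift back with those depths: the inter-point distance is identical in both
# frames, so R and t have dropped out of the constraint.
Q1 = [lift(p, d, K) for p, d in zip(pix1, d1)]
Q2 = [lift(p, d, K) for p, d in zip(pix2, d2)]
print(np.linalg.norm(Q1[0] - Q1[1]), np.linalg.norm(Q2[0] - Q2[1]))
```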
3. The Three Solvers
Depending on what we know about the cameras, the number of unknowns changes. The authors developed three distinct solvers using Gröbner basis methods (a technique in algebraic geometry to solve systems of polynomial equations).
A. The Calibrated Solver (3-Point)
- Scenario: We know the focal lengths and intrinsics of both cameras (\(K_1, K_2\) are known).
- Unknowns: Relative pose, relative scale \(\alpha\), shifts \(\beta_1, \beta_2\).
- Requirements: This is a minimal problem requiring 3 point correspondences.
- Process: The distance constraints generate 3 equations. The solver creates a system that results in at most 4 possible solutions for the parameters.
B. The Shared-Focal Solver (4-Point)
- Scenario: We don’t know the focal length, but we know it’s the same for both images (e.g., a video taken by one camera).
- Unknowns: Pose, \(\alpha, \beta_1, \beta_2\), and the shared focal length \(f\).
- Requirements: Requires 4 point correspondences.
- Process: This results in an overdetermined system (more equations than strictly necessary from 4 points), but the authors select specific constraints to solve for the focal length alongside the affine parameters.
C. The Two-Focal Solver (4-Point)
- Scenario: Two completely different uncalibrated cameras (e.g., internet photo collections).
- Unknowns: Pose, \(\alpha, \beta_1, \beta_2\), and two different focal lengths \(f_1, f_2\).
- Requirements: Requires 4 point correspondences.
- Process: Surprisingly, this general case also admits an efficient solver, yielding at most 4 real solutions.
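Once the affine parameters (and, where needed, the focal lengths) have been recovered, the relative pose itself follows from rigidly aligning the two corrected 3D point sets. The sketch below uses the classic Kabsch/Procrustes alignment for that final step; it illustrates the idea rather than the paper's exact recovery procedure:

```python
import numpy as np

def recover_pose(P1, P2):
    """Rigid alignment (Kabsch): find R, t such that P2 ≈ R @ P1 + t.

    P1, P2: (N, 3) arrays of corresponding points, lifted with the
    affine-corrected depths (N >= 3, non-degenerate configuration)."""
    c1, c2 = P1.mean(axis=0), P2.mean(axis=0)
    H = (P1 - c1).T @ (P2 - c2)                # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c2 - R @ c1
    return R, t
```

Given three correspondences and corrected depths, this alignment returns the relative pose directly; the hard part, of course, is finding \(\alpha, \beta_1, \beta_2\) in the first place, which is what the Gröbner-basis solvers do.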
4. The Hybrid Estimation Pipeline
A solver alone isn’t enough. In the real world, feature matches are noisy, and depth maps have artifacts. To handle this, the authors integrate their solvers into a Hybrid LO-MSAC pipeline (a locally optimized RANSAC variant with MSAC-style scoring).

The “Hybrid” aspect is crucial. The pipeline doesn’t only use the new affine-depth solvers. It dynamically switches between:
- Classic Point-Based Solvers: (e.g., the 5-point algorithm). These ignore depth maps and rely purely on epipolar geometry.
- Depth-Aware Solvers: The new affine solvers described above.
This robustness ensures that if the depth map is terrible (e.g., a blank wall), the system falls back to standard feature matching. If the depth map is good, the system leverages it to resolve ambiguities that point-based methods can’t handle.
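A heavily simplified sketch of such a hybrid loop is shown below; the solver and scoring callables are placeholders rather than the paper's actual interfaces, and the local-optimization step of LO-MSAC is omitted for brevity:

```python
import random

def hybrid_ransac(matches, point_solver, depth_solver, score_fn,
                  iterations=1000, p_depth=0.5):
    """Toy hybrid robust estimation loop.

    point_solver(sample) -> pose hypotheses from 2D matches only (e.g. 5-point)
    depth_solver(sample) -> hypotheses of (pose, alpha, beta1, beta2)
    score_fn(hypothesis) -> robust score over all matches (lower is better)
    All three callables are hypothetical stand-ins for the real solvers."""
    best, best_score = None, float("inf")
    for _ in range(iterations):
        if random.random() < p_depth:
            hypotheses = depth_solver(random.sample(matches, 3))  # depth-aware, minimal at 3
        else:
            hypotheses = point_solver(random.sample(matches, 5))  # classic epipolar solver
        for h in hypotheses:
            s = score_fn(h)
            if s < best_score:
                best, best_score = h, s
    return best
```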
Scoring the Solutions
To determine which solution is “best” during the RANSAC loop, the system uses a combined score.
It calculates the Depth-Induced Reprojection Error:
\[ E_d = \left\| p_{2j} - \pi\!\left( K_2 \left( \mathbf{R} \, P_{1j} + \mathbf{t} \right) \right) \right\|, \]
where \(\pi\) denotes perspective projection (division by the third coordinate): the distance between the observed pixel and the reprojection of the depth-lifted point from the other view.
And combines it with the standard Sampson Error (\(E_s\)) used in epipolar geometry:

This comprehensive scoring function ensures that the selected pose is consistent with both the 2D feature matches and the 3D depth priors.
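A toy version of MSAC-style scoring that combines the two error terms might look like the following; the thresholds and the relative weight are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def msac_score(sampson_errors, depth_reproj_errors,
               tau_s=1.0, tau_d=2.0, weight_d=1.0):
    """Toy MSAC-style combined score: truncate each per-correspondence residual
    at its threshold and sum. Inputs are residuals in pixels; the thresholds
    and weight are illustrative, not the paper's values."""
    s = np.minimum(np.asarray(sampson_errors) ** 2, tau_s ** 2)
    d = np.minimum(np.asarray(depth_reproj_errors) ** 2, tau_d ** 2)
    return float(np.sum(s) + weight_d * np.sum(d))
```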
Experiments and Results
The researchers validated their method on standard benchmarks like ScanNet (indoor), MegaDepth (outdoor), and ETH3D. They tested both calibrated and uncalibrated scenarios.
Performance on Calibrated Data
The table below shows the results on ScanNet-1500. The metrics are rotation error (\(\epsilon_R\)), translation error (\(\epsilon_t\)), and Pose Error AUC (Area Under Curve—higher is better).

Key Takeaways:
- Consistent Improvement: “Ours-calib” consistently achieves lower errors and higher AUC than PoseLib (standard RANSAC) and scale-only solvers (2PT+D).
- Metric Depth vs. Affine Depth: Interestingly, the method improves performance even when using “Metric” depth models like Depth Anything (DA-met). This proves that even models trained to output absolute depth still contain affine distortions that need correcting.
Handling Unknown Focal Lengths
One of the standout contributions is the ability to handle uncalibrated cameras. In the table below (using generated pairs with random focal lengths), the “Ours-tf” (Two-Focal) solver drastically outperforms the standard 7-point algorithm.

The median focal length error (\(\epsilon_f\)) drops significantly (from ~60% error in PoseLib to ~17-29% in Ours), which in turn allows for much more accurate pose estimation.
Visualizing the Improvement
Numbers are great, but point clouds tell the real story. Below is a comparison on the ETH3D dataset.

- Left (Baseline): Notice how the walls are slanted and the floor is warped. The scale-only assumption failed to align the geometry.
- Middle (Ours): The affine correction recovers the straight lines of the walls and the flat floor, closely matching the Right (Ground Truth).
Another example from ScanNet shows the severe distortion caused by ignoring shift:

The “Scale-only” reconstruction (Left) looks like it has been smashed or stretched incorrectly. The affine method (Middle) preserves the structural integrity of the scene.
Why Shift Matters: The Ablation Study
A skeptic might ask: “Is the shift parameter \(\beta\) really that important? Maybe just solving for scale \(\alpha\) is enough?”
The authors address this directly with an ablation study. They added synthetic shifts to ground truth depth maps and measured how the error rates changed for different methods.

In the graphs above:
- Blue line (Ours): The error remains flat and low, regardless of the shift magnitude. The solver correctly identifies and removes the shift.
- Red/Green lines (Baselines): As the shift increases, the error explodes.
The authors also analyzed the distribution of shift values in real-world predictions from MDE models:

The red line indicates a shift of 10%. As seen in the histograms, real-world shift values are frequently much larger than 10%, especially for non-metric models (like Marigold) on outdoor datasets (MegaDepth). This confirms that \(\beta\) is not a negligible noise term—it is a significant geometric component that must be modeled.
Conclusion
The paper “Relative Pose Estimation through Affine Corrections of Monocular Depth Priors” presents a convincing argument: we cannot trust the raw output of monocular depth models as perfect metric geometry. However, we can trust their internal consistency if we apply the right mathematical corrections.
By introducing an affine transformation model that accounts for both scale and shift, and wrapping it in a hybrid solver that falls back to classic geometry when needed, the researchers have created a powerful new tool for 3D vision.
Key Takeaways:
- Shift is critical: Treating depth as “scale-invariant” is insufficient; “affine-invariant” is the correct approach.
- Solvability: It is possible to solve for pose, scale, shift, and focal length simultaneously using as few as 3 or 4 points.
- Hybrid Robustness: Combining depth-based solvers with classic point-based solvers yields state-of-the-art results.
This work paves the way for more reliable Structure-from-Motion (SfM) and SLAM systems that can leverage the massive progress in monocular depth estimation without being misled by its inherent ambiguities.