Introduction

Imagine trying to build a 3D map of a room using a collection of photos. This process, known as Structure-from-Motion (SfM), is the backbone of modern photogrammetry and 3D reconstruction. When you use standard photos from a smartphone or a DSLR, established tools like COLMAP work wonders. But what happens if you use a fisheye lens, a GoPro with a wide field of view (FOV), or a complex catadioptric 360-degree camera?

Suddenly, the standard pipelines break.

The reason lies in the math used to model the camera. Most SfM tools rely on parametric models—rigid mathematical formulas (like polynomials) that approximate how a lens bends light. If the camera’s distortion doesn’t fit the specific polynomial you chose, or if you don’t know the calibration parameters beforehand, the reconstruction fails.

In a recent CVPR paper, researchers introduced GenSfM, a generic pipeline that throws away the rigid polynomials. Instead, it learns the camera’s behavior on the fly using a flexible, non-parametric model. This approach allows a single algorithm to reconstruct 3D scenes from standard cameras, extreme fisheye lenses, and everything in between, without prior knowledge of the lens type.

A visual comparison of the pipeline. Top: Distorted input images. Middle: Calibration process. Bottom: The resulting sparse point cloud reconstruction.

In this post, we will break down how GenSfM works, why splines are better suited than polynomials for this task, and how this method manages to self-calibrate even the most severely distorted cameras.


Background: The Limits of Parametric Models

To understand why this new method is necessary, we first need to look at how computers traditionally understand cameras.

The Pinhole and Distortion

The simplest camera model is the pinhole model. It assumes light travels in straight lines through a single point. However, real lenses—especially wide-angle ones—bend light, causing straight lines in the world to appear curved in the image.

To fix this, computer vision engineers use distortion models. A standard projection equation looks like this:

Equation 1: General projection function with distortion.
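Written out, this is presumably of the form \(\mathbf{x} = \mathcal{D}(\Pi(\mathbf{X}))\), where \(\mathbf{X}\) is a 3D point and \(\mathbf{x}\) its observed pixel (the paper's exact notation may differ).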

Here, \(\Pi\) is the pinhole projection, and \(\mathcal{D}\) is a non-linear function that models the distortion. The industry standard for decades has been the Brown-Conrady model, which expresses distortion as a polynomial based on the radial distance \(r\) from the image center:

Equation 2: The Brown-Conrady polynomial distortion model.
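For reference, the radial-only form of the Brown-Conrady model (the variant most pipelines implement) is:

\[
\mathbf{x}_d = \mathbf{x}\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right), \qquad r^2 = x^2 + y^2,
\]

where \(\mathbf{x}\) is the undistorted normalized image point, \(\mathbf{x}_d\) its distorted counterpart, and \(k_1, k_2, k_3\) are the coefficients that must be calibrated.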

This works well for standard photography. But as the FOV grows, \(r\) grows rapidly. For fisheye lenses, where the FOV can reach or exceed 180 degrees, standard polynomials become unstable or fail to capture the complex mapping.

The Alternative: Non-Parametric Models

Instead of guessing which polynomial fits a lens (\(k_1, k_2, k_3...\)), we can use a non-parametric model. Think of this as a lookup table or a smooth curve that maps the angle of incoming light to a pixel position. It doesn’t assume a specific shape (like a parabola); it just fits the data.
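To make this concrete, here is a minimal sketch of a non-parametric lens model as a lookup table; the sample values below are invented for illustration and are not from the paper:

```python
import numpy as np

# Invented calibration samples: incidence angle theta (radians) and the
# image radius r (pixels) at which rays with that angle actually land.
theta_samples = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
r_samples = np.array([0.0, 55.0, 108.0, 156.0, 197.0, 230.0])

def radius_for_angle(theta):
    # No polynomial shape is assumed; we simply interpolate the data.
    return np.interp(theta, theta_samples, r_samples)

print(radius_for_angle(0.5))  # radius interpolated between the samples
```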

While non-parametric models are flexible, integrating them into a full Structure-from-Motion pipeline is difficult. You need to estimate the 3D structure, the camera poses, and this flexible lens model simultaneously. Doing so without falling into local minima (getting stuck in the wrong solution) is the core problem this paper solves.


The GenSfM Pipeline

The authors propose a “Generic SfM” pipeline. It follows the standard incremental approach (adding images one by one) but introduces a novel way to handle camera calibration.

Flowchart of the GenSfM pipeline showing initialization, incremental reconstruction, and the dual process of image registration and calibration.

The pipeline consists of three innovative stages:

  1. Initialization using radial constraints (ignoring distortion).
  2. Adaptive Calibration using splines.
  3. Mixed Triangulation that handles both calibrated and uncalibrated data.

1. The Camera Representation (Splines)

Instead of parameters \(k_1, k_2\), the authors model the camera intrinsics as a mapping between the opening angle \(\theta\) (the angle between the incoming light ray and the optical axis) and the image radius \(r\) (distance from the center pixel).

Equation 3: The projection function mapping angle theta to image coordinates.

Here, \(M[\theta]\) is the function that determines how far a light ray lands from the image center. To make this function flexible, they use cubic splines.

A spline is defined by a set of control points. By adjusting the position of these control points, the curve can take almost any shape, allowing it to model a standard lens or a complex catadioptric mirror equally well.

Equation 4: The set of control points for the spline.

This representation is smooth, differentiable (essential for optimization), and generic.
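A minimal sketch of this idea with SciPy follows; the control-point values are invented, and GenSfM optimizes them inside the SfM pipeline rather than fixing them up front:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Invented control points (theta_i, r_i). In a GenSfM-style pipeline,
# the r_i values would be free variables adjusted during calibration.
theta_ctrl = np.linspace(0.0, 1.2, 10)       # opening angles in radians
r_ctrl = 230.0 * np.sin(0.9 * theta_ctrl)    # a fisheye-like curve

M = CubicSpline(theta_ctrl, r_ctrl)          # smooth mapping M[theta] -> r

print(M(0.7))     # predicted image radius for a ray at 0.7 rad
print(M(0.7, 1))  # analytic first derivative, useful for optimization
```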

2. Initialization and Radial Constraints

Before the system knows the lens distortion, how can it figure out where the cameras are?

The authors utilize the Radial Alignment Constraint (RAC). The key insight is simple: no matter how much a lens distorts an image radially (pushing pixels in or out from the center), the angle of the pixel relative to the center remains unchanged.
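Here is a small numpy sketch of that invariance; the warp function stands in for an arbitrary, unknown radial distortion:

```python
import numpy as np

def distort_radially(p, warp):
    """Scale the radius of a centered pixel p by an arbitrary warp,
    leaving its direction from the image center untouched."""
    r = np.linalg.norm(p)
    return p * (warp(r) / r)

p = np.array([120.0, 80.0])  # pixel coordinates relative to the center
q = distort_radially(p, lambda r: r * (1.0 + 0.3 * (r / 100.0) ** 2))

# The radius changed, but the angle about the center is identical:
print(np.arctan2(p[1], p[0]), np.arctan2(q[1], q[0]))
```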

The pipeline starts by selecting four images and estimating a “radial quadrifocal tensor.” This sounds complex, but it essentially allows the system to compute the initial geometry of the scene by only looking at the angles of feature points, completely ignoring the radial distortion. This provides a rough initial structure without needing to know if the camera is a fisheye or a pinhole.

3. Adaptive Calibration

This is the most critical contribution of the paper. In a non-parametric model, you have a lot of freedom. If you try to calibrate the entire image at once with sparse data, the model might “hallucinate” distortions in empty areas of the image.

To prevent this, GenSfM employs Adaptive Calibration.

It doesn’t try to learn the distortion for the whole image immediately. Instead, it identifies a calibrated interval \([\theta_{min}, \theta_{max}]\) where there are enough consistent observations.

  1. Data Collection: As images are registered, the system collects 2D-3D correspondences.
  2. Filtering: It identifies regions with high-density matches.
  3. Spline Fitting: It fits the spline only to this valid region.

As the reconstruction grows and more points are added, this valid region expands, progressively calibrating the camera from the center outwards.
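As a rough sketch of the idea (an illustrative heuristic, not the paper's exact criterion), the calibrated interval can be found by keeping only angle bins with enough observations:

```python
import numpy as np

def calibrated_interval(theta_obs, bin_width=0.05, min_per_bin=20):
    """Return an angle range [theta_min, theta_max] covered by enough
    observations to fit the spline safely, or None if none exists yet.
    (A contiguity check between supported bins is omitted for brevity.)"""
    bins = np.arange(0.0, theta_obs.max() + bin_width, bin_width)
    counts, _ = np.histogram(theta_obs, bins=bins)
    supported = np.flatnonzero(counts >= min_per_bin)
    if supported.size == 0:
        return None
    return bins[supported[0]], bins[supported[-1] + 1]

# As more images register, theta_obs accumulates samples and the
# returned interval expands outward from the image center.
```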

4. Mixed Triangulation and Bundle Adjustment

Because the camera is often only “partially” calibrated (e.g., the center is known, but the edges are not yet), the system needs a way to triangulate 3D points using this mixed state.

The authors propose a Mixed Triangulation method:

Case A: Uncalibrated Region

If a point falls in a region where the distortion is unknown, the system uses the radial constraint. It forces the 3D point to lie on the plane defined by the camera center and the radial line in the image.

Equation 5: The 1D radial constraint for uncalibrated points.

Case B: Calibrated Region

If the point falls inside the calibrated spline region, the system can “undistort” it and use the full 2D reprojection constraint, which is much tighter and more accurate.

Equation 6: The 2D constraint for calibrated points.
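A schematic sketch of the two residual types, assuming points are expressed in the camera frame and the principal point is at the origin (the paper's exact formulation may differ):

```python
import numpy as np

def opening_angle(X_cam):
    """Angle between the ray to 3D point X (camera frame) and the optical axis."""
    return np.arctan2(np.linalg.norm(X_cam[:2]), X_cam[2])

def radial_residual(x_obs, X_cam):
    """Case A (uncalibrated): penalize only deviation from the radial
    line through the image center and the observation x_obs."""
    v = x_obs / np.linalg.norm(x_obs)          # observed radial direction
    p = X_cam[:2]                              # projected ray direction
    return p - np.dot(p, v) * v                # component off the radial line

def reprojection_residual(x_obs, X_cam, M):
    """Case B (calibrated): predict the full 2D pixel via the spline M."""
    v = X_cam[:2] / np.linalg.norm(X_cam[:2])  # radial direction of the ray
    return x_obs - M(opening_angle(X_cam)) * v
```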

Finally, during Bundle Adjustment (the global optimization step), the system optimizes two different error functions simultaneously.

For calibrated points, it minimizes the standard reprojection error using the learned spline \(M[\theta]\):

Equation 7: Full reprojection error for the calibrated region.

For uncalibrated points, it minimizes the distance to the radial line (geometric error), ensuring they still contribute to the geometry even without a distortion model:

Equation 8: Radial error for the uncalibrated region.
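Schematically (notation ours, not the paper's), the combined objective therefore has the shape:

\[
E = \sum_{(i,j) \in \mathcal{C}} \left\| \mathbf{x}_{ij} - \hat{\mathbf{x}}_{ij}(M, P_i, \mathbf{X}_j) \right\|^2 + \sum_{(i,j) \in \mathcal{U}} d\left(\mathbf{x}_{ij},\, \ell_{ij}(P_i, \mathbf{X}_j)\right)^2,
\]

where \(\mathcal{C}\) and \(\mathcal{U}\) are the observations in the calibrated and uncalibrated regions, \(\hat{\mathbf{x}}_{ij}\) is the spline-based reprojection, and \(d(\mathbf{x}, \ell)\) is the distance from the observed point to the projected radial line.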


Experiments and Results

The researchers put GenSfM to the test against the industry standard, COLMAP, and other radial-based methods. They used datasets ranging from standard DSLR photos to severely distorted fisheye and catadioptric images.

Performance on Distorted Images

The most striking results come from datasets where the images have significant distortion. In the figure below, you can see a comparison. The top rows show standard scenes, while the bottom rows show fisheye and upward-facing wide-angle shots.

Comparison of reconstruction results. Left columns show sparse/failed reconstructions from other methods. Right column shows the dense, correct reconstruction from GenSfM.

Key Takeaway: COLMAP (left columns) often fails to register images or creates broken, sparse models when the distortion is high or the wrong parametric model is chosen. GenSfM (right column) consistently produces dense, accurate point clouds regardless of the lens type.

Quantitative Accuracy

The authors validated their approach on the BabelCalib dataset, a benchmark specifically designed for diverse camera models.

Table showing reprojection errors. GenSfM achieves low RMS error and high calibration percentages across diverse datasets.

Table comparing GenSfM against COLMAP on distorted datasets. GenSfM maintains high performance while parametric models struggle.

As shown in the tables, GenSfM achieves reprojection errors (RMSE) comparable to or better than methods that use pre-calibrated parametric models. On the “Distorted ETH3D” dataset, where artificial distortion was added, COLMAP’s performance dropped significantly, whereas GenSfM remained robust.

Visualizing the Calibration

One of the satisfying aspects of this work is visualizing the learned calibration. Below, we see the results on the BabelCalib dataset. The blue line represents the estimated spline, which tracks the ground truth (orange) almost perfectly.

Calibration maps showing the estimated spline (blue) matching the ground truth (orange) alongside the undistorted images.

Furthermore, the system can effectively “undistort” images that look completely alien to the human eye, such as those from catadioptric systems (cameras using mirrors).

Undistortion results on catadioptric and fisheye images, transforming circular raw images into rectilinear views.

How Many Control Points?

A non-parametric model raises a question: how complex should the spline be? The authors conducted an ablation study to find the optimal number of control points.

Graph showing the impact of control point number on reprojection error. The error drops significantly around 8-10 control points.

The results suggest that 10 control points are the “sweet spot.” Fewer than that, and the model cannot capture the complex curvature of the lens. More than that, and the optimization becomes unstable, potentially overfitting to noise.


Conclusion and Implications

The “GenSfM” pipeline represents a significant step forward for 3D reconstruction. By moving away from rigid parametric models, it democratizes SfM for a much wider range of imaging devices.

Key Takeaways:

  • Flexibility: The spline-based model adapts to pinholes, fisheyes, and catadioptric cameras without manual tuning.
  • Robustness: The adaptive calibration scheme ensures that the model doesn’t overfit to sparse data, expanding the calibrated region only when safe.
  • Hybrid Optimization: Using both 1D radial constraints and 2D reprojection constraints allows the system to utilize every available geometric cue.

For students and researchers in computer vision, this highlights an important trend: deep learning isn’t the only way forward; sometimes, better mathematical representations (like splines) combined with robust engineering (adaptive calibration) yield the best results. This approach opens the door for “in-the-wild” reconstruction using mixed crowd-sourced images from unknown cameras, a holy grail for large-scale mapping projects.