Introduction

In the rapidly evolving world of computer vision, the dream has always been the same: take a handful of photos of an object or a scene, and instantly generate a photorealistic, navigable 3D model. For a long time, this was a difficult, computation-heavy task. Then came Neural Radiance Fields (NeRFs), which revolutionized the quality of view synthesis but were painfully slow to render. More recently, 3D Gaussian Splatting (3DGS) emerged, offering real-time rendering speeds with quality that rivals or exceeds NeRFs.

However, there is a catch. Both NeRF and 3DGS thrive on data. They typically require dozens, if not hundreds, of images taken from densely spaced viewpoints to reconstruct a scene accurately. When you starve these models of data—giving them only 3 or 4 images (a “sparse view” or “few-shot” scenario)—they tend to collapse. The output becomes blurry, geometry gets distorted, and “floaters” (ghostly artifacts) appear in empty space.

Today, we are diving deep into a research paper that tackles this specific problem: NexusGS.

Figure 1. NexusGS distinguishes itself from both NeRF-based and 3DGS-based competitors by incorporating epipolar depth priors, significantly improving the accuracy of depth maps and enhancing the fidelity of rendered images. This effectiveness in handling sparse input views is achieved through innovative point cloud densification with depth blending and pruning strategies.

As illustrated in Figure 1, NexusGS manages to reconstruct high-fidelity scenes with accurate depth maps even when input views are scarce. Unlike competitors that often produce noisy or overly smoothed results, NexusGS leverages the fundamental laws of geometry—specifically epipolar geometry—to initialize the 3D model correctly.

In this post, we will unpack how NexusGS works. We will explore how it moves away from “guessing” depth using neural networks and instead calculates it using optical flow and geometric constraints. We will break down its three-stage pipeline: the Epipolar Depth Nexus, Flow-Resilient Depth Blending, and Flow-Filtered Depth Pruning. By the end, you will understand how injecting classical geometric priors into modern neural rendering can solve the sparse view challenge.

Background: The Sparse View Struggle

To understand why NexusGS is necessary, we first need to understand how 3D Gaussian Splatting works and why it fails in sparse settings.

3D Gaussian Splatting 101

3DGS represents a scene not as a solid mesh or a neural network (like NeRF), but as a cloud of 3D Gaussians (ellipsoids). Each Gaussian has parameters: position, rotation, scale, color (spherical harmonics), and opacity. To render an image, these Gaussians are rasterized (splatted) onto the camera plane.
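Concretely, you can picture each Gaussian as a small record of learnable parameters. The sketch below is just one way to lay them out in Python; the field names and shapes are illustrative, not those of the reference implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian:
    """One splatting primitive; field names and shapes are illustrative only."""
    position: np.ndarray   # 3D mean (mu), shape (3,)
    rotation: np.ndarray   # unit quaternion used to build the covariance, shape (4,)
    scale: np.ndarray      # per-axis extent, shape (3,)
    sh_coeffs: np.ndarray  # spherical-harmonic coefficients for view-dependent color
    opacity: float         # base opacity in [0, 1]
```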

The mathematical definition of a Gaussian is:

\[G(x) = \exp\!\left(-\tfrac{1}{2}\,(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)\]

where \(\mu\) is the Gaussian's center and \(\Sigma\) is its covariance, assembled from the rotation and scale parameters.

Rendering involves sorting these Gaussians and blending them front-to-back:

\[C = \sum_{i \in \mathcal{N}} c_i\, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)\]

where \(c_i\) and \(\alpha_i\) are the color and opacity of the \(i\)-th Gaussian along the ray, ordered from nearest to farthest.
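To make the blending formula tangible, here is a minimal NumPy sketch of front-to-back compositing for a single pixel. It assumes the Gaussians covering the pixel are already projected, sorted nearest-first, and reduced to an opacity and an RGB color each (in the real rasterizer, the opacity comes from evaluating the projected 2D Gaussian at the pixel and the color from spherical harmonics).

```python
import numpy as np

def blend_pixel(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Composite sorted Gaussians: C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)."""
    pixel = np.zeros(3)
    transmittance = 1.0  # how much light still passes through to this Gaussian
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * c
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:  # early termination once the pixel is saturated
            break
    return pixel

# Toy example: two Gaussians covering the pixel, nearest first.
colors = np.array([[1.0, 0.0, 0.0],   # red, in front
                   [0.0, 0.0, 1.0]])  # blue, behind
alphas = np.array([0.6, 0.9])
print(blend_pixel(colors, alphas))    # mostly red with a smaller blue contribution
```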

The Initialization Problem

The most critical step in 3DGS is initialization. The algorithm needs a starting point—usually a sparse point cloud generated by Structure-from-Motion (SfM) software like COLMAP. In a dense setting (many photos), COLMAP produces a rich point cloud, and 3DGS optimizes it into a beautiful scene.

In a sparse setting (few photos), SfM struggles to find enough matching features between images because the change in perspective is too drastic. The resulting point cloud is too sparse or erroneous. When 3DGS starts with bad data, it tries to compensate by creating Gaussians in random places, leading to the aforementioned artifacts.

The Problem with Current Solutions

Previous attempts to fix this usually rely on Monocular Depth Estimation. These are neural networks trained to look at a single 2D image and predict a depth map. While impressive, they suffer from scale ambiguity (they don’t know the true size of the world) and inconsistency (the depth predicted for Image A might not geometrically align with the depth predicted for Image B).

NexusGS takes a different approach. Instead of relying on “black box” depth estimators, it uses the explicit geometric relationship between cameras—Epipolar Geometry—to calculate depth precisely.

Core Method: The NexusGS Pipeline

The researchers behind NexusGS argue that we don’t need complex regularizations if we just start with a better point cloud. Their method focuses on generating a dense, accurate point cloud before the standard 3DGS training begins.

The overall pipeline is visualized below:

Figure 2. Given a few input images, our method first computes depth using optical flow and camera poses at the Epipolar Depth Nexus step. We then fuse depth values from different views, minimizing flow errors with flow-resilient depth blending. Before forming the final dense point cloud, outlier depths are removed at the flow-filtered depth pruning step. During training, no depth regularization is needed, thanks to the epipolar depth prior embedded in the point cloud.

The process consists of three main stages, which we will examine in detail:

  1. Epipolar Depth Nexus: Calculating depth using optical flow and geometry.
  2. Flow-Resilient Depth Blending: Merging depth estimates intelligently.
  3. Flow-Filtered Depth Pruning: Removing unreliable points.

Step 1: Epipolar Depth Nexus

The foundation of this method is the Epipolar Line. In stereo vision, if you have a point in Image A, the corresponding point in Image B must lie on a specific line called the epipolar line. This is a hard geometric constraint determined by the relative positions of the two cameras.

Optical Flow + Geometry

The authors utilize Optical Flow estimators (specifically a pre-trained network) to find matching pixels between images. Optical flow predicts where a pixel in the source view moves to in the target view.

\[M_{flow}^{i \rightarrow j} = f(I_i, I_j)\]

\[\hat{p}_j = p_i + M_{flow}^{i \rightarrow j}(p_i)\]

Here, \(\hat{p}_j\) is the predicted position in the target view. However, optical flow is not perfect. The predicted point \(\hat{p}_j\) might not fall exactly on the epipolar line due to estimation errors.
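As a concrete illustration of the bookkeeping (toy values, not a real flow network): a dense flow field is just an \(H \times W \times 2\) array, and the predicted match is the source pixel plus the flow vector stored at that pixel.

```python
import numpy as np

# Toy flow field: every pixel is predicted to shift 1.5 px to the left in view j.
H, W = 4, 6
flow_i_to_j = np.zeros((H, W, 2))
flow_i_to_j[..., 0] = -1.5

p_i = np.array([3.0, 2.0])                             # (x, y) pixel in view i
p_j_hat = p_i + flow_i_to_j[int(p_i[1]), int(p_i[0])]  # predicted match in view j
print(p_j_hat)                                         # [1.5 2. ]
```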

To fix this, NexusGS calculates the “perpendicular foot”—the point on the epipolar line closest to the optical flow prediction.

Equation for perpendicular foot calculation

By forcing the point onto the epipolar line, the system satisfies the geometric requirement for triangulation. Once the point is on the line, the depth can be explicitly calculated using triangulation logic. The paper derives a specific formula for this depth \(D^{i \rightarrow j}\):

Equation for Depth Calculation

This calculation is geometrically illustrated in Figure 8 below. It essentially uses the area of triangles formed by the camera centers and the image points to derive the depth distance.

Figure 8. Illustration of the geometry relationships used in Epipolar Depth Nexus step.

By doing this for every pixel, NexusGS generates a depth map based on actual multi-view geometry rather than single-view guessing.
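Below is a self-contained NumPy sketch of the Step 1 geometry under stated assumptions: both views share intrinsics K, the relative pose maps camera-i coordinates to camera-j coordinates (\(X_j = R X_i + t\)), and a single flow match is given. It snaps the flow prediction onto the epipolar line and triangulates depth in closed form; it illustrates the construction, not the authors' exact implementation.

```python
import numpy as np

def skew(v):
    """Cross-product matrix so that skew(v) @ x == np.cross(v, x)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def epipolar_depth(p_i, p_j_flow, K, R, t):
    """Depth of pixel p_i in view i, given its flow-predicted match in view j.

    Assumes X_j = R @ X_i + t and shared intrinsics K for both views."""
    K_inv = np.linalg.inv(K)

    # Epipolar line of p_i in view j: l = F @ p_i with F = K^-T [t]x R K^-1.
    F = K_inv.T @ skew(t) @ R @ K_inv
    a, b, c = F @ np.array([p_i[0], p_i[1], 1.0])

    # Perpendicular foot: snap the flow prediction onto the line ax + by + c = 0.
    signed = a * p_j_flow[0] + b * p_j_flow[1] + c
    dist_to_line = abs(signed) / np.hypot(a, b)        # reused for pruning in Step 3
    p_j = np.asarray(p_j_flow, float) - (signed / (a**2 + b**2)) * np.array([a, b])

    # Triangulate: the 3D point is z * K^-1 p_i in camera i. Its projection into
    # view j is (z*A + B) in homogeneous pixels, so matching the corrected p_j
    # gives depth z in closed form (least squares over the two pixel equations).
    ray = K_inv @ np.array([p_i[0], p_i[1], 1.0])
    A, B = K @ R @ ray, K @ t
    num = np.array([p_j[0] * B[2] - B[0], p_j[1] * B[2] - B[1]])
    den = np.array([A[0] - p_j[0] * A[2], A[1] - p_j[1] * A[2]])
    z = float(num @ den) / float(den @ den)
    return z, p_j, dist_to_line

# Toy check: pinhole camera, view j one unit to the right of view i, a point
# 5 units in front of view i, and a flow prediction that is a few pixels off.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([-1.0, 0.0, 0.0])
z, p_j, err = epipolar_depth([320.0, 240.0], [221.0, 243.0], K, R, t)
print(z, err)   # ~5.05 (true depth 5; 1 px of flow error remains along the line), 3 px off the epipolar line
```

Note that the distance to the epipolar line computed along the way is exactly the error measure reused for pruning in Step 3.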

Step 2: Flow-Resilient Depth Blending

Here is the challenge: If you have 3 images, you can calculate the depth for a pixel in Image 1 by comparing it to Image 2, AND by comparing it to Image 3. This gives you multiple depth values for the same pixel. Which one is correct?

A naive approach would be to average them. However, if one optical flow estimate is significantly wrong, it will skew the average, resulting in an incorrect 3D point.

NexusGS introduces Flow-Resilient Depth Blending (FRDB). The core insight here is strictly mathematical: we want to choose the depth estimate that is least sensitive to errors in optical flow.

The Sensitivity Analysis

Consider Figure 3 below. It shows potential error scenarios.

Figure 3. We summarize potential depth blending error scenarios and compare the strategies of selecting the minimum depth versus the average depth.

The authors define two distances:

  • \(dis_{ref}\): The distance from the source camera to the 3D point (what we want).
  • \(dis_{pro}\): The projection distance in the target view (related to pixel position).

If the optical flow is slightly wrong (a small shift in pixel position), it changes the calculated depth (\(dis_{ref}\)). The relationship isn’t linear. In some geometric configurations, a tiny pixel error results in a huge depth error. In others, the depth is stable even if the pixel is slightly off.
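A textbook special case makes this concrete (a rectified stereo pair, which is a simplification rather than the paper's general two-view setup): with focal length \(f\), baseline \(B\), and disparity \(d\),

\[Z = \frac{fB}{d}, \qquad \frac{\partial Z}{\partial d} = -\frac{fB}{d^2} = -\frac{Z^2}{fB},\]

so a one-pixel disparity (flow) error perturbs the depth by roughly \(Z^2/(fB)\): distant points and short baselines are volatile, while nearby points seen under a wide baseline are stable. NexusGS applies the same idea to the general multi-view configuration.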

The authors calculate the gradient (rate of change) of the reference distance with respect to the projection distance:

Equation for Gradient of Reference Distance

A smaller gradient means the depth calculation is stable—it’s resilient to flow errors. A large gradient means the calculation is volatile.

Therefore, instead of averaging depths, NexusGS selects the depth value corresponding to the view that minimizes this gradient:

Equation for selecting the best depth

This smart selection process ensures that the initialized point cloud is composed of the most geometrically reliable points available.
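The paper derives this gradient analytically; as a stand-in, the sketch below estimates the sensitivity by finite differences, nudging each candidate's match by one pixel, recomputing its depth, and keeping the candidate whose depth moved the least instead of averaging.

```python
import numpy as np

# Sketch of the selection idea behind flow-resilient blending (a finite-difference
# stand-in for the analytic gradient): for each partner view, measure how much the
# triangulated depth changes under a 1-pixel nudge of the match, and keep the
# candidate whose depth changes the least.

def select_resilient_depth(depths, depths_perturbed):
    """depths[k] / depths_perturbed[k]: depth of the same pixel from partner view k,
    before and after a 1-pixel nudge of the matched point."""
    depths = np.asarray(depths, dtype=float)
    sensitivity = np.abs(np.asarray(depths_perturbed) - depths)  # ~ |d depth / d pixel|
    return depths[np.argmin(sensitivity)]

# Example: view A's estimate jumps from 5.05 to 5.60 under a 1-pixel nudge, while
# view B's only drifts from 5.20 to 5.23 -> B's estimate is kept, not the average.
print(select_resilient_depth([5.05, 5.20], [5.60, 5.23]))  # 5.2
```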

Step 3: Flow-Filtered Depth Pruning

Even with the best blending strategy, some points are just bad. This happens when the optical flow fails completely—perhaps due to occlusion (the object is visible in one view but hidden behind something in another) or repetitive textures.

To clean the data, the authors introduce Flow-Filtered Depth Pruning (FFDP).

Remember the “perpendicular foot” from Step 1? We moved the optical flow prediction onto the epipolar line. The distance we moved it is a measure of error. If the optical flow prediction was very far from the epipolar line, it means the flow estimator likely failed or the geometric constraint is violated.

The distance \(g^{i \rightarrow j}\) is calculated as:

Equation for Epipolar Distance

The method simply applies a threshold \(\epsilon_d\). If the distance to the epipolar line is too large, the point is discarded.

Equation for Pruning Threshold

This filtering step removes “outliers”—points that would otherwise become floating artifacts or noise in the final 3D reconstruction.
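A minimal sketch of the pruning rule, assuming we already have each point's distance to its epipolar line from Step 1 (the threshold value here is a placeholder, not the paper's setting):

```python
import numpy as np

def prune_by_epipolar_distance(points_3d, epi_dist, eps_d=2.0):
    """Keep only points whose flow match landed within eps_d pixels of its epipolar line."""
    points_3d = np.asarray(points_3d, dtype=float)
    keep = np.asarray(epi_dist) < eps_d
    return points_3d[keep]

# Example: the third point's match was 7.5 px off its epipolar line -> dropped.
pts = [[0.1, 0.2, 5.0], [0.3, -0.1, 4.8], [2.0, 2.0, 1.2]]
print(prune_by_epipolar_distance(pts, epi_dist=[0.4, 1.1, 7.5]))
```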

The Result: A Dense Initialization

After these three steps, we have a dense, accurate set of 3D points. These points are used to initialize the 3D Gaussians. Because the initialization is so strong, the authors find they don’t need complex depth regularization terms in the loss function during training. They simply train with standard color (L1) and structural similarity (D-SSIM) losses:

\[\mathcal{L} = (1 - \lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{D\text{-}SSIM}\]

where \(\lambda\) balances the photometric and structural terms.

Experiments and Results

The researchers evaluated NexusGS on standard benchmarks including the LLFF (real-world forward-facing scenes), DTU (object-centric), and Mip-NeRF360 datasets. They compared it against top-tier methods like SparseNeRF, FSGS, and DNGaussian.

Quantitative Superiority

The results are compelling. Table 1 (below) summarizes the performance on three major datasets.

Table 1. Quantitative evaluations on the LLFF, DTU and Mip-NeRF360 datasets.

NexusGS (labeled “Ours”) consistently achieves the highest PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index), and the lowest LPIPS (perceptual error).

  • On LLFF (3 views), it beats the second-best method by a significant margin of 0.62 dB in PSNR.
  • It performs exceptionally well on Mip-NeRF360, a challenging dataset with unbounded backgrounds, where other methods struggle with over-smoothing.

Qualitative Visuals

The numbers are backed up by visual evidence. Let’s look at the comparisons.

LLFF Dataset (Real World): In Figure 4, compare the “Ours” column to the others.

  • ViP-NeRF fails to reconstruct the geometry of the leaves properly.
  • SimpleNeRF has incomplete edges.
  • CoR-GS produces a blurry, over-smoothed result.
  • NexusGS captures the sharp edges of the leaves and the intricate details of the flower, matching the Ground Truth almost perfectly.

Figure 4. Visual comparisons on LLFF (3 views). Our method produces richer details and more accurate depth than all competitors.

DTU Dataset (Objects): In Figure 6, observe the reconstruction of the vase and the plush toy. Other methods like CoR-GS struggle with the fine patterns on the vase or the fuzzy texture of the toy. NexusGS preserves these high-frequency details.

Figure 6. Visual comparisons on the DTU dataset (3 views). Our method produces a more comprehensive point cloud than others, resulting in higher-quality renderings.

Depth Maps: Perhaps the most telling visualization is the depth map comparison (Figure 14). A good depth map should be smooth within objects but have sharp discontinuities at edges. Look at the orchid scene (bottom row). The competitors’ depth maps are noisy or blurry. NexusGS produces a depth map that clearly separates the petals from the background.

Figure 14. Visual comparisons of depth maps on the LLFF dataset with 3 input views.

Ablation Studies: Do the Components Matter?

The authors performed ablation studies to prove that FRDB (Blending) and FFDP (Pruning) are actually doing the heavy lifting.

In the visualization below (Figure 7), you can see the difference between simple averaging and the proposed method.

  • Avg: The point cloud is noisy, and the rendered image is blurry.
  • FRDB + FFDP: The point cloud is clean and aligns with the object surface, resulting in a sharp render.

Figure 7. Visualization of ablation study results using 3 views.

The quantitative ablation table confirms this: simply averaging depths yields a PSNR of 20.39, while the full NexusGS pipeline reaches 21.07.

Table 3. Ablation study on LLFF with 3 training views.

Conclusion and Implications

NexusGS represents a significant step forward for 3D reconstruction in “few-shot” scenarios. Its success highlights an important lesson for the field of Deep Learning and Computer Vision: Geometric priors still matter.

While the trend has been to throw more neural network layers at a problem (like learning depth from a single image), NexusGS shows that we can achieve better results by anchoring our models in the physical laws of how cameras work. By treating optical flow not just as a feature matcher, but as a geometric constraint on the epipolar line, NexusGS generates an initialization that is physically grounded.

Key Takeaways:

  1. Dense Initialization is Key: Sparse views don’t have to mean sparse point clouds. Using optical flow and geometry can synthesize the density needed for 3DGS.
  2. Geometry > Guesswork: Explicitly calculating depth via triangulation is more reliable than implicit monocular depth estimation for maintaining scale consistency across views.
  3. Smart Blending: Not all data points are equal. Analyzing the mathematical stability (gradients) of a depth calculation allows the model to ignore volatile data.

For students and researchers, NexusGS serves as a reminder that understanding the underlying geometry of vision—extrinsics, intrinsics, and epipolar lines—is just as powerful as understanding the latest neural architecture. As we move toward VR/AR applications where data might be limited (e.g., a user taking 3 quick photos with a phone), techniques like NexusGS will be essential for creating immersive, photorealistic experiences instantly.