Taming the Hallucinations: How Video Diffusion Improves Sparse 3D Gaussian Splatting

Introduction

Imagine you are trying to reconstruct a detailed 3D model of a room, but you only have six photographs, all taken from near the center of the room. This is the challenge of sparse-input 3D reconstruction. While recent technologies like 3D Gaussian Splatting (3DGS) have revolutionized how we render scenes, they typically demand a dense cloud of images to work their magic. When you feed them only a handful of views, the results are often riddled with “black holes,” floating artifacts, and blurred geometry.

The core of the problem lies in two areas: extrapolation (what does the room look like outside the camera’s current field of view?) and occlusion (what is hiding behind that sofa?).

A new research paper, “Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs,” proposes a fascinating solution: Reconstruction by Generation. The researchers suggest that if we don’t have enough images, we should use Generative AI to dream them up.

However, simply asking an AI to “imagine the rest of the room” creates a new problem: inconsistencies. As shown in the image below, standard video generation can create flickering textures and “hallucinated” objects that don’t exist, leading to black shadows in the final 3D model.

Figure 1: Comparison of approaches. (a) and (b) show the extrapolation and occlusion issues. Vanilla generation leads to inconsistencies (yellow arrows) and black shadows. The proposed scene-grounding generation (c) creates consistent sequences.

This post breaks down how the researchers solved this by “taming” a video diffusion model using a novel guidance system, allowing for state-of-the-art 3D reconstruction from very sparse inputs.

Background: The Sparse Input Challenge

To understand the solution, we must first appreciate the problem.

3D Gaussian Splatting (3DGS) represents a scene not as a mesh or a neural network, but as millions of 3D blobs (Gaussians). Each blob has a position, rotation, scale, color, and opacity. When optimized correctly, these blobs blend together to create photorealistic images. However, 3DGS optimization relies on multi-view consistency. It needs to see a point in space from multiple angles to determine exactly where a Gaussian should be placed.
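For readers who think in code, here is a minimal sketch of the per-Gaussian parameters a 3DGS scene stores. The field names are illustrative, not taken from any particular implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """One splat in a 3DGS scene (illustrative field names)."""
    position: np.ndarray  # (3,) center in world space
    rotation: np.ndarray  # (4,) unit quaternion giving the blob's orientation
    scale: np.ndarray     # (3,) per-axis extent of the ellipsoid
    color: np.ndarray     # (3,) RGB (real systems store spherical-harmonic coefficients)
    opacity: float        # blending weight in [0, 1]

# A scene is just a large collection of such blobs, optimized so that their
# blended projections reproduce the input photographs.
scene = [
    Gaussian3D(
        position=np.random.randn(3),
        rotation=np.array([1.0, 0.0, 0.0, 0.0]),
        scale=np.full(3, 0.01),
        color=np.random.rand(3),
        opacity=0.5,
    )
    for _ in range(1_000)
]
```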

When you only have sparse inputs (e.g., 3 to 6 images), the system suffers from shape-radiance ambiguity. It can’t tell if a color change is due to lighting, texture, or geometry. Furthermore, if you move the camera slightly to the left, you might look at a wall that was never captured in the input photos. In standard 3DGS, this results in empty space or “floaters.”

The Generative Opportunity

Diffusion models (like those behind Stable Diffusion or Sora) have learned priors about the visual world from massive datasets. They know what a chair usually looks like from the back, or how a floor pattern typically continues. The researchers leverage a Video Diffusion Model to generate short video clips moving from known camera positions into these unknown areas.

The Core Method: Taming the Diffusion

The researchers propose a pipeline that consists of three main stages:

  1. Trajectory Initialization: Figuring out where to move the camera to see the “holes.”
  2. Scene-Grounding Guidance: Generating video sequences that fill these holes without hallucinating nonsense.
  3. Optimization: Training the final 3DGS model using both real images and the synthetic videos.

Figure 2: Framework overview. The system initializes a baseline 3DGS, calculates trajectories to cover unobserved regions (yellow), generates video sequences using a guided diffusion model, and finally optimizes the 3DGS.

1. Finding the Invisible: Trajectory Initialization

Before generating video, we need to know where to look. We can’t just generate random camera movements; we need to target the areas that are occluded or outside the field of view.

The team starts by training a “rough” baseline 3DGS model on the sparse inputs. It’s not pretty, but it gives a coarse geometry of the scene. They then render views from various candidate poses and check the transmittance map.

The transmittance map essentially tells us how much of each camera ray passes through the scene without hitting any Gaussians. If a region has high transmittance (appearing as a black hole in the render), no Gaussians cover it, which means it is an unobserved region.

Figure 3: Trajectory Initialization. The system identifies candidate poses where the rendering has significant holes (red boxes) and creates a camera trajectory moving from a known input view toward this unobserved region.

By calculating masks based on this transmittance, they select trajectories that transition smoothly from a known input view into these unknown “hole” regions. This creates a specific path:

\[ \Phi = \big\{ \{ \phi_j^{(i,c)} \}_{j=1}^{L} \mid i, c \big\}, \]

where each inner sequence \(\{\phi_j^{(i,c)}\}_{j=1}^{L}\) is one camera trajectory of \(L\) poses, indexed by the input view \(i\) it starts from and the candidate hole region \(c\) it moves toward.
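To make the selection step concrete, here is a small sketch of how candidate poses could be scored by the unobserved area they reveal. The thresholds, scoring rule, and function names are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def hole_mask(transmittance, threshold=0.5):
    """Pixels where transmittance stays high, i.e. the ray hit no Gaussians."""
    return transmittance > threshold

def unobserved_fraction(transmittance, threshold=0.5):
    """Fraction of the candidate view that is an unobserved 'hole'."""
    return hole_mask(transmittance, threshold).mean()

def select_target_poses(candidate_poses, transmittance_maps, k=4, min_fraction=0.05):
    """Keep the k candidate poses that reveal the largest unobserved regions.

    Each selected pose would then become the endpoint of a trajectory that
    starts at a nearby input view and moves toward the hole (interpolation
    of the in-between poses is omitted here).
    """
    scored = [(unobserved_fraction(t), pose)
              for t, pose in zip(transmittance_maps, candidate_poses)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pose for frac, pose in scored[:k] if frac > min_fraction]

# Toy usage with random maps standing in for transmittance renders of the baseline 3DGS.
rng = np.random.default_rng(0)
maps = [rng.random((64, 64)) for _ in range(10)]
poses = [f"candidate_pose_{i}" for i in range(10)]
print(select_target_poses(poses, maps))
```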

2. Scene-Grounding Guidance

This is the most critical technical contribution of the paper.

If you feed the starting image and the calculated trajectory into a standard video diffusion model (like ViewCrafter), it will generate a video. However, diffusion models are stochastic—they make things up. Frame 5 might show a window, and Frame 10 might turn it into a painting. This temporal inconsistency is disastrous for 3D reconstruction.

To fix this, the authors introduce Scene-Grounding Guidance. They use the imperfect renders from the baseline 3DGS to guide the diffusion process.

Why use bad renders? Even though the baseline 3DGS renders have holes and blur, they are 3D consistent by definition (because they come from a static 3D model). They act as a geometric anchor. The diffusion model provides the texture and realism, while the 3DGS render ensures the structure doesn’t morph over time.

The Mathematical Mechanism

Standard diffusion sampling iteratively removes noise from a latent variable \(\mathbf{x}_t\) using a predicted noise \(\epsilon_\theta\):

Equation 2: Standard diffusion sampling update rule.
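For reference, a standard DDPM-style ancestral update takes the following form (the paper may use a different sampler, but the structure of the step is the same):

\[ \mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(\mathbf{x}_t, t) \right) + \sigma_t \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \]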

The researchers modify this process by adding a guidance term based on the baseline 3DGS renders. They define a consistency target \(\mathcal{Q}\) (derived from the rendered sequence). The new sampling update includes the gradient of the probability of this target:

Equation 3: Conditional score function expansion using Bayes’ rule.

Equation 4: The guidance term is proportional to the negative gradient of the loss function.

In simpler terms, at every step of the video generation, the model is “nudged” not just to look realistic (the diffusion prior), but also to match the structure of the baseline 3DGS render.

The loss function \(\mathcal{L}\) used for this nudge compares the model’s current estimate of the clean (denoised) frames \(\mathbf{X}_{0|t}\) with the frames \(\mathbf{S}\) rendered from the baseline 3DGS, restricted to the regions the baseline actually covers via the mask \(\mathbf{M}\):

Equation 6: The guidance loss function combining L1 loss and Perceptual loss.

This combination forces the generated video to respect the known geometry of the scene while using the diffusion model’s creativity to fill in the missing textures and details plausibly.
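Put together, one guided denoising step might look like the following sketch. This is an illustration of the guidance idea in PyTorch-style code under assumed interfaces (`denoiser`, `scheduler_step`), not the authors’ implementation, and only the masked L1 part of the guidance loss is shown:

```python
import torch

def guided_denoise_step(x_t, t, denoiser, scheduler_step,
                        render_S, mask_M, guidance_scale=1.0):
    """One denoising step nudged toward the baseline 3DGS renders (sketch).

    x_t:            current noisy video frames/latents
    denoiser:       predicts the noise eps_theta(x_t, t)
    scheduler_step: maps (x_t, eps, t) -> (x_{t-1}, x0_estimate) for the sampler in use
    render_S:       frames rendered from the baseline 3DGS along the same trajectory
    mask_M:         1 where the baseline render is reliable, 0 in the hole regions
    """
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)
    x_prev, x0_est = scheduler_step(x_t, eps, t)

    # Guidance loss: match the baseline render only where it is trustworthy.
    # (The paper combines an L1 term with a perceptual term; an LPIPS-style
    # loss would simply be added to `loss` here.)
    loss = (mask_M * (x0_est - render_S)).abs().mean()

    # Nudge the sample down the loss gradient before the next denoising step,
    # pulling the generation toward the scene's known structure.
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_prev - guidance_scale * grad).detach()
```

In the hole regions, where the mask is zero, the diffusion prior is left free to invent plausible content, which is exactly the behavior the method wants.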

The Impact of Guidance: Without this guidance, the video diffusion model generates hallucinations. Notice in Figure 4 below how the “Vanilla Generation” results in a 3D model with black shadows (red boxes) because the generated views contradicted each other.

Figure 4: The effect of vanilla vs. guided generation. Vanilla generation leads to black shadows due to inconsistency. Guided generation resolves this.

3. Optimization with Generated Sequences

Once the consistent video sequences are generated, they are treated as pseudo-ground-truth data to train the final 3DGS model.

The training alternates between sampling real input images and synthetic generated views. For real images, they use the standard reconstruction loss:

Equation 8: Loss for real input images.

For the generated views, they found that a standard per-pixel L1 loss isn’t enough, because the generated images may still contain slight imperfections. To encourage the model to fill holes without getting hung up on exact pixel alignment in the generated regions, they rely heavily on a perceptual loss (LPIPS):

Equation 9: Loss for generated views, emphasizing perceptual similarity.

The visual impact of using this perceptual loss on the generated views is profound, as it helps the model focus on structural completeness rather than high-frequency noise.
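A sketch of how the two loss regimes might be selected during training; the weighting and the `lpips_fn` handle are assumptions for illustration rather than the paper’s exact recipe:

```python
import torch
import torch.nn.functional as F

def view_loss(render, target, is_generated, lpips_fn, lambda_perceptual=0.5):
    """Illustrative loss selection for real vs. generated training views.

    render, target: (1, 3, H, W) image tensors in [0, 1]
    is_generated:   True if `target` comes from the guided video diffusion model
    lpips_fn:       a perceptual-distance callable (e.g. a pretrained LPIPS network)
    """
    if not is_generated:
        # Real input views: the usual photometric reconstruction loss
        # (vanilla 3DGS also adds a D-SSIM term alongside L1).
        return F.l1_loss(render, target)

    # Generated views: lean on perceptual similarity so the model is rewarded
    # for structural agreement rather than punished for the small pixel-level
    # imperfections that generation inevitably leaves behind.
    return F.l1_loss(render, target) + lambda_perceptual * lpips_fn(render, target).mean()
```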

Figure 7: Ablation study showing how perceptual loss helps model the hole regions (red box) much better than baseline.

Experiments and Results

The researchers evaluated their method on two challenging indoor datasets: Replica (synthetic) and ScanNet++ (real-world, high fidelity). They compared their approach against leading sparse-input methods like FreeNeRF, DNGaussian, and FSGS.

Quantitative Success

The method achieves state-of-the-art results. On the Replica dataset, it boosted PSNR (a measure of image quality) by over 3.5 dB compared to the baseline. This is a massive leap in signal processing terms, indicating a significantly clearer and more accurate image.

Table 1: The proposed method outperforms baselines such as FreeNeRF and DNGaussian by a clear margin in both PSNR and SSIM.

Qualitative Analysis

Visually, the difference is stark. In the comparisons below, look at the edges of the room and areas behind objects.

  • Row 1 (Replica): Notice the wall behind the chair. Other methods leave it blurry or empty. The proposed method (Ours) reconstructs it cleanly.
  • Row 3 (ScanNet++): The ceiling area is often completely missed by standard methods (extrapolation issue), but this approach fills it in plausibly.

Figure 5 and 6: Qualitative comparisons. The proposed method handles extrapolation (ceilings/walls) and occlusion (behind objects) much better than competitors, predicting plausible geometry.

The method also scales well across different scenes, consistently maintaining structural integrity where other methods dissolve into artifacts.

Figure A3: Further qualitative comparisons across ScanNet++ and Replica datasets showing reduced motion artifacts and better structural definition.

Better than 2D Inpainting?

You might wonder: why generate video? Why not just use 2D inpainting (like Photoshop Generative Fill) on the empty spots?

The researchers compared their method against 2D inpainting approaches (LaMa and Stable Diffusion Inpainting). The results (Figure 8) show that while 2D inpainting fills the holes, it ignores 3D geometry. It paints “flat” textures that look wrong when viewed from an angle. The video diffusion approach, guided by the baseline 3DGS, respects the 3D nature of the scene.

Figure 8: Comparison with inpainting methods. The proposed method produces consistent 3D geometry, whereas inpainting often creates flat or mismatched textures.

Conclusion

This paper tackles one of the most stubborn hurdles in 3D computer vision: creating something from (almost) nothing. By cleverly combining the “creative” power of video diffusion models with the “structural” constraint of 3D Gaussian Splatting, the authors demonstrate a way to fill in the blanks of a sparse scene.

Key takeaways for students and researchers:

  1. Generative Priors are Powerful: We can use pre-trained video models to act as “imagination” for 3D reconstruction.
  2. Constraint is Key: Generative models need to be tamed. Using a coarse proxy (like the baseline 3DGS) to guide the generation is a highly effective technique.
  3. Holistic Modeling: Solving sparse reconstruction requires explicitly hunting for the “holes” (trajectory initialization) rather than just hoping the optimization fills them.

This work paves the way for systems that can scan a room with just a few snapshots and let you walk around it in virtual reality, with the AI filling in the blind spots seamlessly.