Taming the Hallucinations: How Video Diffusion Improves Sparse 3D Gaussian Splatting
Introduction
Imagine you are trying to reconstruct a detailed 3D model of a room, but you only have six photographs taken from the center. This is the challenge of sparse-input 3D reconstruction. While recent technologies like 3D Gaussian Splatting (3DGS) have revolutionized how we render scenes, they typically demand a dense cloud of images to work their magic. When you feed them only a handful of views, the results are often riddled with “black holes,” floating artifacts, and blurred geometry.
The core of the problem lies in two areas: extrapolation (what does the room look like outside the camera’s current field of view?) and occlusion (what is hiding behind that sofa?).
A new research paper, “Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs,” proposes a fascinating solution: Reconstruction by Generation. The researchers suggest that if we don’t have enough images, we should use Generative AI to dream them up.
However, simply asking an AI to “imagine the rest of the room” creates a new problem: inconsistencies. As shown in the image below, standard video generation can create flickering textures and “hallucinated” objects that don’t exist, leading to black shadows in the final 3D model.

This post breaks down how the researchers solved this by “taming” a video diffusion model using a novel guidance system, allowing for state-of-the-art 3D reconstruction from very sparse inputs.
Background: The Sparse Input Challenge
To understand the solution, we must first appreciate the problem.
3D Gaussian Splatting (3DGS) represents a scene not as a mesh or a neural network, but as millions of 3D blobs (Gaussians). Each blob has a position, rotation, scale, color, and opacity. When optimized correctly, these blobs blend together to create photorealistic images. However, 3DGS optimization relies on multi-view consistency. It needs to see a point in space from multiple angles to determine exactly where a Gaussian should be placed.
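For intuition, the sketch below shows the kind of per-Gaussian parameters being optimized. This is illustrative Python only; the names and shapes are assumptions, not the paper's code or the official 3DGS implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """Illustrative per-Gaussian parameters in a 3DGS scene (not the paper's code)."""
    position: np.ndarray   # (3,) center of the blob in world space
    rotation: np.ndarray   # (4,) quaternion orienting the anisotropic blob
    scale: np.ndarray      # (3,) per-axis extent
    color: np.ndarray      # (3,) RGB (full implementations store SH coefficients)
    opacity: float         # blending weight used during alpha compositing

# A scene is simply a large collection of these blobs,
# optimized jointly so their blended projections match the input photographs.
scene = [Gaussian3D(np.random.randn(3), np.array([1.0, 0.0, 0.0, 0.0]),
                    np.ones(3) * 0.01, np.random.rand(3), 0.5)
         for _ in range(1000)]
```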
When you only have sparse inputs (e.g., 3 to 6 images), the system suffers from shape-radiance ambiguity. It can’t tell if a color change is due to lighting, texture, or geometry. Furthermore, if you move the camera slightly to the left, you might look at a wall that was never captured in the input photos. In standard 3DGS, this results in empty space or “floaters.”
The Generative Opportunity
Diffusion models (like those behind Stable Diffusion or Sora) have learned priors about the visual world from massive datasets. They know what a chair usually looks like from the back, or how a floor pattern typically continues. The researchers leverage a Video Diffusion Model to generate short video clips moving from known camera positions into these unknown areas.
The Core Method: Taming the Diffusion
The researchers propose a pipeline that consists of three main stages (a pseudocode sketch follows the list):
- Trajectory Initialization: figuring out where to move the camera to see the “holes.”
- Scene-Grounding Guidance: generating video sequences that fill these holes without hallucinating nonsense.
- Optimization: training the final 3DGS model using both real images and the synthetic videos.
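Here is a rough pseudocode view of those stages. The function names are placeholders for the steps described above, not the authors' API:

```python
def reconstruct_from_sparse_inputs(images, poses):
    """High-level sketch of the three-stage pipeline (all helpers are placeholders)."""
    # Stage 0: fit a coarse baseline 3DGS model on the sparse views.
    baseline = train_baseline_3dgs(images, poses)

    # Stage 1: find camera trajectories that move from known views
    # into occluded or out-of-view ("hole") regions.
    trajectories = initialize_trajectories(baseline, poses)

    # Stage 2: generate video sequences along those trajectories,
    # guided by the baseline renders so frames stay 3D-consistent.
    generated_views = [generate_guided_video(baseline, traj) for traj in trajectories]

    # Stage 3: optimize the final 3DGS model on real images + generated views.
    return train_final_3dgs(images, poses, generated_views)
```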

1. Finding the Invisible: Trajectory Initialization
Before generating video, we need to know where to look. We can’t just generate random camera movements; we need to target the areas that are occluded or outside the field of view.
The team starts by training a “rough” baseline 3DGS model on the sparse inputs. It’s not pretty, but it gives a coarse geometry of the scene. They then render views from various candidate poses and check the transmittance map.
The transmittance map essentially tells us how transparent a view is. If a region has high transmittance (appearing as a black hole in the render), it means there are no Gaussians there—it’s an unobserved region.
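In code, turning a transmittance map into a “hole” mask is essentially a thresholding step. The snippet below is a simplified illustration; the threshold value and function names are assumptions, not taken from the paper:

```python
import numpy as np

def hole_mask_from_transmittance(transmittance: np.ndarray,
                                 threshold: float = 0.5) -> np.ndarray:
    """Mark pixels whose rays pass through almost no Gaussians.

    `transmittance` is an H x W map in [0, 1]: values near 1 mean the ray
    was barely attenuated, i.e. the region is unobserved.
    """
    return transmittance > threshold

def hole_ratio(transmittance: np.ndarray, threshold: float = 0.5) -> float:
    """Fraction of a candidate view that is unobserved; useful for ranking
    candidate poses when selecting trajectories."""
    return float(hole_mask_from_transmittance(transmittance, threshold).mean())
```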

By calculating masks based on this transmittance, they select trajectories that transition smoothly from a known input view into these unknown “hole” regions. Formally, the set of selected camera trajectories is
\[ \Phi = \{ \{ \phi_j^{(i,c)} \}_{j=1}^{L} \mid i, c \}, \]
where each \(\phi_j^{(i,c)}\) is the \(j\)-th camera pose in one of these length-\(L\) sequences.
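One simple way to realize such a path is to interpolate between a known input pose and a target pose that faces a hole region. The sketch below uses linear translation and quaternion slerp purely as an illustration; the paper's exact trajectory construction may differ:

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_trajectory(pose_start, pose_end, num_frames=25):
    """Interpolate camera poses from an input view toward a hole region.

    pose_start / pose_end: (R, t) with R a 3x3 rotation matrix and t a (3,) translation.
    Returns a list of (R, t) camera poses of length num_frames.
    """
    R0, t0 = pose_start
    R1, t1 = pose_end
    slerp = Slerp([0.0, 1.0], Rotation.from_matrix(np.stack([R0, R1])))
    poses = []
    for alpha in np.linspace(0.0, 1.0, num_frames):
        R = slerp(alpha).as_matrix()           # smoothly rotated orientation
        t = (1.0 - alpha) * t0 + alpha * t1    # linearly interpolated position
        poses.append((R, t))
    return poses
```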
2. Scene-Grounding Guidance
This is the most critical technical contribution of the paper.
If you feed the starting image and the calculated trajectory into a standard video diffusion model (like ViewCrafter), it will generate a video. However, diffusion models are stochastic—they make things up. Frame 5 might show a window, and Frame 10 might turn it into a painting. This temporal inconsistency is disastrous for 3D reconstruction.
To fix this, the authors introduce Scene-Grounding Guidance. They use the imperfect renders from the baseline 3DGS to guide the diffusion process.
Why use bad renders? Even though the baseline 3DGS renders have holes and blur, they are 3D consistent by definition (because they come from a static 3D model). They act as a geometric anchor. The diffusion model provides the texture and realism, while the 3DGS render ensures the structure doesn’t morph over time.
The Mathematical Mechanism
Standard diffusion sampling iteratively removes noise from a latent variable \(\mathbf{x}_t\) using a predicted noise \(\epsilon_\theta\):
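Written generically (the paper's exact notation may differ), a DDPM-style sampling step looks like this:
\[ \mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(\mathbf{x}_t, t) \right) + \sigma_t \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \]
where \(\alpha_t\) and \(\bar{\alpha}_t\) come from the noise schedule and \(\sigma_t\) controls the noise re-injected at each step.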

The researchers modify this process by adding a guidance term based on the baseline 3DGS renders. They define a consistency target \(\mathcal{Q}\) (derived from the rendered sequence), and the sampling update gains a term proportional to the gradient of the log-probability of this target:
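In the spirit of classifier guidance, this can be written generically as a correction to the predicted noise (again, the standard form rather than the paper's exact expression):
\[ \hat{\epsilon}_\theta(\mathbf{x}_t, t) = \epsilon_\theta(\mathbf{x}_t, t) - s \sqrt{1 - \bar{\alpha}_t}\; \nabla_{\mathbf{x}_t} \log p\!\left(\mathcal{Q} \mid \mathbf{x}_t\right), \]
where \(s\) is a guidance scale that controls how strongly each denoising step is pulled toward the scene-grounded target \(\mathcal{Q}\).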

In simpler terms, at every step of the video generation, the model is “nudged” not just to look realistic (the diffusion prior), but also to match the structure of the baseline 3DGS render.
The loss function \(\mathcal{L}\) used for this nudge compares the generated frame \(\mathbf{X}_{0|t}\) with the rendered frame \(\mathbf{S}\), specifically focusing on the areas that are visible (using mask \(\mathbf{M}\)):
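A loss with these ingredients can be written as a masked photometric error (a plausible form consistent with the description, not necessarily the paper's exact choice of norm):
\[ \mathcal{L}\left(\mathbf{X}_{0|t}, \mathbf{S}\right) = \left\lVert \mathbf{M} \odot \left(\mathbf{X}_{0|t} - \mathbf{S}\right) \right\rVert_1, \]
where \(\odot\) is element-wise multiplication, so only pixels actually covered by the baseline render contribute. Its gradient with respect to \(\mathbf{x}_t\) supplies the nudge in the guided update above.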

This combination forces the generated video to respect the known geometry of the scene while using the diffusion model’s creativity to fill in the missing textures and details plausibly.
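Putting the pieces together, one guided denoising step might look like the PyTorch-style sketch below. The model interface, variable names, and schedule handling are assumptions for illustration, not the authors' implementation:

```python
import torch

def guided_noise_prediction(model, x_t, t, render_S, mask_M, alpha_bar_t,
                            guidance_scale=1.0):
    """Nudge the predicted noise toward the baseline 3DGS render (illustrative sketch).

    `model(x_t, t)` is assumed to return the predicted noise; `render_S` and
    `mask_M` are the baseline render and its visibility mask, in the same space
    as the predicted clean frame; `alpha_bar_t` is the cumulative noise level.
    """
    x_t = x_t.detach().requires_grad_(True)
    eps = model(x_t, t)

    # Estimate the clean frame X_{0|t} implied by the current noisy sample.
    x0_pred = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5

    # Masked consistency loss against the imperfect but 3D-consistent render.
    loss = (mask_M * (x0_pred - render_S)).abs().mean()
    grad = torch.autograd.grad(loss, x_t)[0]

    # Moving against this loss gradient plays the role of the log-probability term.
    return eps + guidance_scale * (1.0 - alpha_bar_t) ** 0.5 * grad
```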
The Impact of Guidance: Without this guidance, the video diffusion model generates hallucinations. Notice in Figure 4 below how the “Vanilla Generation” results in a 3D model with black shadows (red boxes) because the generated views contradicted each other.

3. Optimization with Generated Sequences
Once the consistent video sequences are generated, they are treated as pseudo-ground-truth data to train the final 3DGS model.
The training alternates between sampling real input images and synthetic generated views. For real images, they use the standard reconstruction loss:
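In the original 3DGS formulation, this is an L1 photometric term blended with a D-SSIM term (generic notation below):
\[ \mathcal{L}_{\text{real}} = (1 - \lambda)\, \mathcal{L}_1\!\left(\hat{\mathbf{I}}, \mathbf{I}\right) + \lambda\, \mathcal{L}_{\text{D-SSIM}}\!\left(\hat{\mathbf{I}}, \mathbf{I}\right), \]
where \(\hat{\mathbf{I}}\) is the rendered image, \(\mathbf{I}\) is the corresponding input photograph, and \(\lambda\) is a small weight (0.2 in the original 3DGS paper).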

For the generated views, they found that standard pixel-perfect loss (L1) isn’t enough because the generated images might still have slight imperfections. To encourage the model to fill holes without getting hung up on pixel-perfect alignment in the generated regions, they rely heavily on Perceptual Loss (LPIPS):
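A loss of this shape for the generated views can be written as an LPIPS-dominated mix (the weighting below is illustrative; the exact balance is the paper's design choice):
\[ \mathcal{L}_{\text{gen}} = \lambda_{\text{lpips}}\, \mathcal{L}_{\text{LPIPS}}\!\left(\hat{\mathbf{I}}, \mathbf{I}_{\text{gen}}\right) + \lambda_{1}\, \mathcal{L}_1\!\left(\hat{\mathbf{I}}, \mathbf{I}_{\text{gen}}\right), \qquad \lambda_{\text{lpips}} \gg \lambda_{1}, \]
where \(\mathbf{I}_{\text{gen}}\) is a frame from the generated sequence. Because LPIPS compares deep features rather than raw pixels, the optimization is rewarded for plausible structure even when the generated frame is not perfectly aligned with the render.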

The visual impact of using this perceptual loss on the generated views is profound, as it helps the model focus on structural completeness rather than high-frequency noise.
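A training loop in this spirit could alternate between the two kinds of supervision as in the sketch below. The renderer, loss functions, and sampling ratio are placeholders, not the paper's settings:

```python
import random

def train_final_3dgs(gaussians, real_views, generated_views,
                     num_iters=30_000, p_real=0.5):
    """Illustrative alternating optimization over real and generated views."""
    for _ in range(num_iters):
        if real_views and random.random() < p_real:
            # Real input image: standard 3DGS reconstruction loss.
            image, pose = random.choice(real_views)
            loss = l1_dssim_loss(render(gaussians, pose), image)
        else:
            # Generated (pseudo ground-truth) frame: lean on perceptual loss
            # so small misalignments are not punished pixel-by-pixel.
            image, pose = random.choice(generated_views)
            rendered = render(gaussians, pose)
            loss = lpips_loss(rendered, image) + 0.1 * l1_loss(rendered, image)
        step_optimizer(gaussians, loss)  # assumed backward pass + parameter update
    return gaussians
```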

Experiments and Results
The researchers evaluated their method on two challenging indoor datasets: Replica (synthetic) and ScanNet++ (real-world, high fidelity). They compared their approach against leading sparse-input methods like FreeNeRF, DNGaussian, and FSGS.
Quantitative Success
The method achieves state-of-the-art results. On the Replica dataset, it boosted PSNR (a measure of image quality) by over 3.5 dB compared to the baseline. This is a massive leap in signal processing terms, indicating a significantly clearer and more accurate image.

Qualitative Analysis
Visually, the difference is stark. In the comparisons below, look at the edges of the room and areas behind objects.
- Row 1 (Replica): Notice the wall behind the chair. Other methods leave it blurry or empty. The proposed method (Ours) reconstructs it cleanly.
- Row 3 (ScanNet++): The ceiling area is often completely missed by standard methods (extrapolation issue), but this approach fills it in plausibly.

The method also scales well across different scenes, consistently maintaining structural integrity where other methods dissolve into artifacts.

Better than 2D Inpainting?
You might wonder: why generate video? Why not just use 2D inpainting (like Photoshop Generative Fill) on the empty spots?
The researchers compared their method against 2D inpainting approaches (LaMa and Stable Diffusion Inpainting). The results (Figure 8) show that while 2D inpainting fills the holes, it ignores 3D geometry. It paints “flat” textures that look wrong when viewed from an angle. The video diffusion approach, guided by the baseline 3DGS, respects the 3D nature of the scene.

Conclusion
This paper tackles one of the most stubborn hurdles in 3D computer vision: creating something from (almost) nothing. By cleverly combining the “creative” power of video diffusion models with the “structural” constraint of 3D Gaussian Splatting, the authors demonstrate a way to fill in the blanks of a sparse scene.
Key takeaways for students and researchers:
- Generative Priors are Powerful: We can use pre-trained video models to act as “imagination” for 3D reconstruction.
- Constraint is Key: Generative models need to be tamed. Using a coarse proxy (like the baseline 3DGS) to guide the generation is a highly effective technique.
- Holistic Modeling: Solving sparse reconstruction requires explicitly hunting for the “holes” (trajectory initialization) rather than just hoping the optimization fills them.
This work paves the way for systems that can scan a room with just a few snapshots and let you walk around it in virtual reality, with the AI filling in the blind spots seamlessly.