Introduction

The race toward fully autonomous driving relies heavily on one critical resource: data. While real-world driving logs are invaluable, they are finite and often fail to capture the “long tail” of rare, dangerous edge cases. This is where simulation steps in. If we can create photorealistic, physics-compliant digital twins of the real world, we can train and test autonomous vehicles (AVs) in infinite variations of complex scenarios.

However, reconstructing a dynamic urban environment from sensor data is notoriously difficult. Modern techniques like Neural Radiance Fields (NeRFs) and the more recent 3D Gaussian Splatting (3DGS) have revolutionized static scene reconstruction. They can render buildings and parked cars with breathtaking fidelity. But put a moving truck in the frame, and things fall apart. The moving object often appears as a ghostly, blurred trail, or artifacts corrupt the static background.

Previous solutions to this problem usually required “cheating”—using expensive, human-annotated 3D bounding boxes to tell the algorithm exactly where the cars are. This limits scalability; you can’t reconstruct the whole world if you have to draw a box around every vehicle first.

Enter SplatFlow.

In this article, we will dive deep into a new research paper that introduces a self-supervised framework for dynamic scene reconstruction. SplatFlow leverages a concept called Neural Motion Flow Fields (NMFF) to separate the moving world from the static one without needing a single manual annotation. By combining the speed of Gaussian Splatting with the temporal intelligence of motion flow, SplatFlow achieves state-of-the-art rendering quality.

Figure 1. Top: Street GS [25]; Middle: PVG [1]; Bottom: Our SplatFlow. SplatFlow eliminates the need for the 3D bounding boxes required by Street GS and enhances rendering quality compared to PVG.

As shown above, where other methods produce voxel-like artifacts or oversimplified geometry, SplatFlow generates dense, coherent splats with consistent depth and orientation, resulting in superior visual fidelity.

Background: The Challenge of Dynamics

To understand why SplatFlow is necessary, we first need to look at the limitations of standard 3D Gaussian Splatting (3DGS).

3DGS represents a scene not as a mesh or a set of neural network weights, but as millions of explicit 3D “splats”: ellipsoids, each with a position, rotation, scale, opacity, and color. To render an image, these splats are projected onto the camera plane and alpha-blended front to back. It is incredibly fast and produces high-quality images for static scenes.
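To make this concrete, here is a minimal sketch of the per-splat parameters and the standard covariance construction used in 3DGS; the class and field names are illustrative, not taken from SplatFlow or any particular implementation:

```python
from dataclasses import dataclass
import numpy as np
from scipy.spatial.transform import Rotation

@dataclass
class Splat:
    """One 3D Gaussian splat; field names are illustrative, not the paper's code."""
    mu: np.ndarray      # (3,) center position in world coordinates
    quat: np.ndarray    # (4,) rotation as a unit quaternion in (x, y, z, w) order
    scale: np.ndarray   # (3,) per-axis extent of the ellipsoid
    opacity: float      # opacity in [0, 1]
    color: np.ndarray   # (3,) RGB (real 3DGS stores spherical harmonics for view dependence)

    def covariance(self) -> np.ndarray:
        # Standard 3DGS parameterization of the covariance: Sigma = R S S^T R^T
        R = Rotation.from_quat(self.quat).as_matrix()
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T
```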

The problem arises when the scene changes over time. If a car drives down a street, a standard 3DGS model gets confused. It receives conflicting information: “At time \(t=1\), this coordinate is an empty road. At time \(t=2\), this coordinate is a red car.” The optimization process attempts to average these contradictions, resulting in blurry, semi-transparent artifacts known as “ghosting.”

Existing Approaches and Their Flaws

Researchers have tried two main ways to fix this:

  1. Object-Level Supervision: Methods like StreetGaussian use tracked 3D bounding boxes. They treat the background and the cars as completely different reconstruction tasks. While effective, this approach relies on external trackers or manual labels, which are error-prone and expensive.
  2. Implicit Deformations: Methods like PVG (Periodic Vibration Gaussian) try to learn temporal properties implicitly without boxes. However, simply optimizing for rendering loss often isn’t enough to understand complex motion, leading to the blurry results seen in the comparison images.

SplatFlow takes a different path: it uses the geometry of the data itself (specifically LiDAR) to learn how things move, creating a flow field that guides the Gaussian splats through time.

The SplatFlow Methodology

The core innovation of SplatFlow is the seamless integration of 4D Gaussian Splatting with a Neural Motion Flow Field (NMFF). The framework operates on a simple yet powerful premise: if we can predict the 3D motion vector (flow) for every point in space at every moment in time, we can simply “slide” our dynamic Gaussians along that flow to their correct positions for any given timestamp.

Let’s break down the pipeline.

Figure 2. The pipeline of SplatFlow.

The pipeline (Figure 2) consists of three main stages:

  1. Decomposition: Using LiDAR data and NMFF to separate the world into “static” and “dynamic” points.
  2. Representation: Modeling the static world with standard 3D Gaussians and the dynamic world with time-dependent 4D Gaussians.
  3. Optimization: Using optical flow distillation and rendering losses to refine the scene.

1. Neural Motion Flow Fields (NMFF)

The NMFF is the engine that drives this system. It is a set of implicit functions (MLPs) that model the temporal motion of the scene. Specifically, for any point in 3D space \((x, y, z)\) at time \(t_1\), the NMFF predicts its displacement \((\Delta x, \Delta y, \Delta z)\) and rotation change \(\Delta R\) to reach time \(t_2\).

Equation 7
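The equation itself is not reproduced here, but from the description above it takes the general form

\[
(\Delta x, \Delta y, \Delta z, \Delta R) = \mathcal{F}_{\theta}(x, y, z, t_1, t_2),
\]

where \(\mathcal{F}_{\theta}\) denotes the NMFF MLPs (the symbol is ours; the paper's notation may differ).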

But how does the network learn this motion without labels? The authors cleverly utilize LiDAR point clouds. LiDAR provides precise depth information over time. By analyzing consecutive LiDAR sweeps, the system can self-supervise the motion learning.

The researchers use a bidirectional Chamfer Distance loss to pre-train the NMFF. Essentially, the network tries to move points from the scan at time \(t\) to match the geometric shape of the scan at time \(t+1\), and vice versa.

Equation 8
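The paper's exact loss is its Equation 8; a standard bidirectional Chamfer formulation consistent with the description would be

\[
\mathcal{L}_{\mathrm{CD}} = \frac{1}{|\hat{P}_{t+1}|}\sum_{x \in \hat{P}_{t+1}} \min_{y \in P_{t+1}} \lVert x - y \rVert_2^2 \;+\; \frac{1}{|P_{t+1}|}\sum_{y \in P_{t+1}} \min_{x \in \hat{P}_{t+1}} \lVert x - y \rVert_2^2,
\]

where \(\hat{P}_{t+1} = \{\, p + \mathcal{F}_{\theta}(p, t, t+1) \mid p \in P_t \,\}\) is the LiDAR scan at time \(t\) warped forward by the NMFF, plus the symmetric term obtained by warping \(P_{t+1}\) back to time \(t\).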

If a point can be explained solely by the motion of the ego-vehicle (the car recording the data), it is classified as static. If a point requires additional motion vectors to match the next frame, it is dynamic.
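One plausible way to turn that criterion into code is to threshold the magnitude of the per-point flow that remains after ego-motion compensation; the function below is an illustrative sketch, and the time step and speed threshold are made-up values, not the paper's:

```python
import numpy as np

def split_static_dynamic(points, residual_flow, dt=0.1, speed_thresh=0.5):
    """Split LiDAR points into (dynamic, static) sets.

    points:        (N, 3) LiDAR points at time t
    residual_flow: (N, 3) NMFF flow left over after ego-motion compensation
    dt, speed_thresh: illustrative values (seconds, m/s), not from the paper
    """
    speeds = np.linalg.norm(residual_flow, axis=1) / dt  # implied speed per point
    dynamic_mask = speeds > speed_thresh
    return points[dynamic_mask], points[~dynamic_mask]
```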

Figure 3. Visualization of 3D LiDAR points within NMFF on Waymo dataset.

The visualization above demonstrates the NMFF in action. The system has successfully identified moving objects (vehicles) and assigned them motion vectors (colored by speed and angle), distinct from the static environment.

2. 4D Gaussian Representation

Once the scene is decomposed, SplatFlow initializes two sets of Gaussians:

  • Static 3D Gaussians: Initialized from the static LiDAR points. These represent the road, buildings, and trees.
  • Dynamic 4D Gaussians: Initialized from the dynamic points. These represent cars, cyclists, and pedestrians.

The dynamic Gaussians are “4D” because they change over time. However, instead of learning a completely new set of parameters for every frame (which would be inefficient), SplatFlow uses the NMFF to propagate a single set of Gaussians across time.

A 4D Gaussian at time \(t_1\) is defined by its center \(\mu\), rotation \(R\), scaling \(S\), opacity \(\alpha\), and color \(c\):

Equation 9
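The paper defines this formally in its Equation 9; in standard 3DGS notation, a Gaussian with these parameters is

\[
G(x) = \exp\!\Big(-\tfrac{1}{2}(x-\mu_{t_1})^{\top}\Sigma_{t_1}^{-1}(x-\mu_{t_1})\Big), \qquad \Sigma_{t_1} = R_{t_1} S S^{\top} R_{t_1}^{\top},
\]

with opacity \(\alpha\) and color \(c\) attached as separate per-Gaussian attributes (the paper's exact notation may differ).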

To find the state of this Gaussian at time \(t_2\), SplatFlow queries the NMFF for the motion flow:

Equation 10

And applies it to the Gaussian:

Equation 11
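Equations 10 and 11 are not reproduced here, but the update they describe has the general form

\[
\mu_{t_2} = \mu_{t_1} + \Delta\mu_{t_1 \to t_2}, \qquad R_{t_2} = \Delta R_{t_1 \to t_2}\, R_{t_1},
\]

where \((\Delta\mu_{t_1 \to t_2}, \Delta R_{t_1 \to t_2}) = \mathcal{F}_{\theta}(\mu_{t_1}, t_1, t_2)\) is the flow queried from the NMFF, and scale, opacity, and color are carried over unchanged (the exact parameterization and composition order are as given in the paper).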

This method ensures temporal consistency. A car isn’t just a flickering collection of shapes that appears differently in every frame; it is a coherent object moving through space along a continuous trajectory defined by the flow field.

3. Optical Flow Distillation

While LiDAR provides excellent 3D geometry, it is sparse. It might hit the roof of a car but miss the bumper. To fill in the gaps and ensure the visual appearance is fluid, SplatFlow looks to the 2D domain.

The authors use “distillation” to transfer knowledge from a 2D foundation model into their 3D representation. They use a pre-trained optical flow network (SEA-RAFT) to generate pseudo-ground-truth flow maps from the 2D camera images.

During training, SplatFlow renders its own predicted optical flow by projecting the 3D motion of the Gaussians onto the 2D image plane.

Equation 13 Equation 14

It then calculates the accumulated flow for a pixel using alpha blending, similar to how it renders color:

Equation 15
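Those equations are not reproduced here, but the construction is straightforward to write down in a hedged form. With \(\pi_t(\cdot)\) denoting projection into the camera at time \(t\), the 2D flow contributed by Gaussian \(i\) is the displacement of its projected center,

\[
f_i = \pi_{t_2}(\mu_{i, t_2}) - \pi_{t_1}(\mu_{i, t_1}),
\]

and the rendered flow at a pixel \(p\) is accumulated with the same front-to-back alpha compositing used for color,

\[
\hat{F}(p) = \sum_{i=1}^{N} f_i\, \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j),
\]

where the Gaussians are sorted by depth and \(\alpha_i\) is the projected opacity of Gaussian \(i\) (our notation; the paper's Equations 13–15 give the exact form).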

The model is then penalized if its rendered optical flow doesn’t match the flow predicted by the foundation model. This effectively teaches the 4D Gaussians how to move correctly in areas where LiDAR data might be sparse or noisy.

4. Optimization and Rendering

The final rendering process involves combining the static and dynamic Gaussians. For any target viewpoint and timestamp, the model:

  1. Warps the dynamic Gaussians to the correct time using NMFF.
  2. Combines them with static Gaussians.
  3. Rasterizes the scene to produce an RGB image and a depth map.

Equation 12
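Equation 12 in the paper specifies this composite rendering; in standard splatting notation it amounts to compositing over the union of the static set \(\mathcal{G}_s\) and the time-warped dynamic set \(\mathcal{G}_d(t)\),

\[
\hat{C}(p) = \sum_{i \in \mathcal{G}_s \cup \mathcal{G}_d(t)} c_i\, \alpha_i \prod_{j<i} (1-\alpha_j),
\]

with the depth map rendered analogously by replacing the color \(c_i\) with each Gaussian's camera-space depth (again, our notation rather than the paper's).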

The total loss function combines image reconstruction quality (L1 and SSIM), depth consistency, flow consistency, and regularization.

Equation 16
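The exact terms and weights are given in the paper's Equation 16; a generic form consistent with the description, with hypothetical weights \(\lambda_{\cdot}\), is

\[
\mathcal{L} = (1-\lambda)\,\mathcal{L}_{1} + \lambda\,\mathcal{L}_{\mathrm{SSIM}} + \lambda_{d}\,\mathcal{L}_{\mathrm{depth}} + \lambda_{f}\,\mathcal{L}_{\mathrm{flow}} + \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{reg}}.
\]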

Experiments and Results

The researchers evaluated SplatFlow on the challenging Waymo Open Dataset and the KITTI Dataset, benchmarks for autonomous driving scene reconstruction.

Visual Quality

The visual improvements are stark. In dynamic scenes, traditional methods struggle with motion blur. SplatFlow, by explicitly modeling the flow of the object, keeps textures sharp.

Figure 4. Visual comparison of novel view synthesis on Waymo dataset. Bounding boxes indicate the zoomed-in dynamic areas.

In Figure 4, compare the “PVG” and “EmerNeRF” columns with “SplatFlow.” The competitors show significant blurring on the moving vehicles, effectively smearing them across the road. SplatFlow maintains the structural integrity of the vehicles, closely matching the Ground Truth (G.T.).

Dynamic Decomposition

One of the most impressive capabilities of SplatFlow is its ability to cleanly separate the moving foreground from the static background without human labels.

Figure 11. Dynamic object decomposition results of SplatFlow on Waymo. Row 1: rendered scene; Row 2: corresponding decomposition.

Figure 11 shows this decomposition. The top row shows the full rendered scene. The bottom row shows only the dynamic objects. Notice that the decomposition is clean; the road and trees are almost entirely absent from the dynamic layer, and the cars are complete, not fragmented.

Quantitative Metrics

Visuals are great, but numbers tell the story of fidelity. The authors measured performance using PSNR (higher is better), SSIM (higher is better), and LPIPS (lower is better).

Table 1. Performance comparison on Waymo dataset.

On the Waymo dataset (Table 1), SplatFlow achieves a PSNR of 33.64 for image reconstruction, significantly outperforming PVG (32.46) and EmerNeRF (28.11).

The results are even more telling when focusing specifically on the dynamic elements of the scene.

Table 2. Novel view synthesis results on Waymo dataset (*denotes dynamic elements only).

Table 2 highlights that when looking strictly at dynamic regions (the moving cars), SplatFlow consistently beats PVG. For example, in Segment 2259, SplatFlow achieves a PSNR of 31.61 compared to PVG’s 22.55. This massive jump indicates that the “ghosting” problem has been largely solved.

Robustness and Efficiency

The researchers also tested how the model performs with limited data. Even when trained on only 50% or 25% of the available frames in the KITTI dataset, SplatFlow maintained high performance, degrading much more gracefully than competitor methods.

Figure 7. Visual comparison of novel view synthesis on KITTI 25% (row1), 50% (row2), and 75% (row3) dataset.

Furthermore, the method is fast. Because it relies on Gaussian Splatting rather than heavy neural network inference during the rendering phase (once the flow has been computed), it achieves real-time rendering speeds of approximately 40 FPS on Waymo at 1920×1280 resolution.

What makes it work? (Ablation Study)

Is every part of the complex pipeline necessary? The authors performed an ablation study (Figure 8) to find out.

Figure 8. Visual comparison of ablation study on Waymo dataset.

  • w/o NMFF Prior: Removing the LiDAR pre-training leads to significant blur. The model doesn’t know where to start optimizing motion.
  • w/o NMFF Optimization: If you don’t refine the motion field during training, details are lost.
  • w/o Optical Flow: Without distilling the 2D flow, the texture of the moving cars becomes less coherent.

The “Full” model (Top Left) is the only one that produces crisp, readable license plates and lights.

Looking Deeper: Flow and Depth

Finally, because SplatFlow models the physics of the scene (depth and motion), it can render more than just RGB images. It can output dense depth maps and optical flow fields that are consistent with the visual data.

Figure 15. Visualization of rendered RGB image, optical flow, and depth by SplatFlow on Waymo dataset.

In Figure 15, we see the rendered RGB (top), Optical Flow (middle), and Depth (bottom). The sharp outlines in the depth and flow maps confirm that the underlying 3D geometry is accurate. This is crucial for using these simulations to train other AV perception systems, which often rely on depth and flow data.

Conclusion and Implications

SplatFlow represents a significant step forward in the simulation of autonomous driving environments. By successfully integrating Neural Motion Flow Fields with Dynamic Gaussian Splatting, the authors have solved the “ghosting” problem that plagues dynamic scene reconstruction.

The key takeaways are:

  1. No Labels Needed: It achieves state-of-the-art results without expensive 3D bounding box annotations.
  2. Physics-Aware: By leveraging LiDAR geometry and motion flow, it ensures temporal consistency.
  3. Real-Time: It retains the speed benefits of Gaussian Splatting, making it suitable for closed-loop simulation.

For students and researchers in computer vision and robotics, SplatFlow illustrates the power of hybrid approaches. It combines explicit geometric representations (Gaussians), implicit neural representations (NMFF), and cross-modal distillation (LiDAR + Camera + 2D Flow) to solve a problem that no single modality could solve alone. As we move toward safer autonomous vehicles, tools like SplatFlow will play a pivotal role in creating the virtual proving grounds of the future.