Imagine a drone flying at high speed through a dimly lit tunnel. A standard camera would likely fail in this scenario; the fast motion causes severe blur, and the low light results in grainy, unusable footage. This is the bottleneck for many robotics applications today. However, there is a different kind of sensor that thrives in exactly these conditions: the Event Camera.
Event cameras are bio-inspired sensors that work differently than the cameras in our phones. Instead of capturing full frames at a fixed rate, they operate asynchronously, detecting changes in brightness at the pixel level. This gives them incredible advantages: microsecond latency, high dynamic range, and no motion blur.
But there is a catch. Because event cameras only output a stream of “changes” (events) rather than complete images, standard computer vision algorithms for 3D reconstruction don’t work on them directly. Specifically, figuring out the 3D structure of a scene and the camera’s position (pose) simultaneously—a process known as Simultaneous Localization and Mapping (SLAM)—is notoriously difficult with event data alone.
In this post, we will dive deep into IncEventGS, a new method proposed by researchers that solves this problem. They combine the efficiency of 3D Gaussian Splatting with the high-speed data of event cameras to reconstruct 3D scenes and track camera motion without needing to know the camera poses beforehand.
The Challenge: Mapping without a Map
The goal of this research is “pose-free” reconstruction. In many controlled experiments, researchers use external motion capture systems to tell the computer exactly where the camera is at every millisecond. But in the real world—on a robot or a drone—we don’t have that luxury. The system must figure out the geometry of the world and its own path through it using only the data from the camera.
Existing methods for event-based reconstruction often rely on Neural Radiance Fields (NeRFs). While powerful, NeRFs are computationally heavy and slow to train. Furthermore, most event-based NeRFs assume the camera poses are already known. IncEventGS tackles this by using 3D Gaussian Splatting (3D-GS), a newer and faster explicit representation, and wrapping it in a SLAM-style tracking-and-mapping framework.
The IncEventGS Pipeline
At a high level, IncEventGS follows a “Tracking and Mapping” paradigm common in robotics. It processes the incoming stream of events in chunks.

As shown in Figure 1 above, the system is split into two main alternating processes:
- Tracking: The system estimates the camera’s motion for the current chunk of events. It freezes the 3D map and asks, “Given this map, how must the camera have moved to generate these events?”
- Mapping: Once the motion is estimated, the system refines the 3D map (the Gaussians) and the trajectory simultaneously to better fit the data.
Crucially, the system includes a novel initialization strategy using a depth estimation model (the “Pretrained Marigold Model” in the figure) to bootstrap the process, which we will detail later.
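Before diving into the individual components, here is a minimal sketch of how such an alternating tracking-and-mapping loop might be organized. The function and parameter names (`track_chunk`, `map_chunks`, `bootstrap_with_depth`, `window_size`, `n_bootstrap`) are hypothetical placeholders for illustration, not the authors' API.

```python
# A minimal sketch of the alternating tracking-and-mapping loop described above.

from dataclasses import dataclass, field


@dataclass
class SlamState:
    gaussians: object = None                         # the 3D Gaussian map (parameters)
    trajectory: list = field(default_factory=list)   # per-chunk (T_start, T_end) poses


def run_incremental_slam(event_chunks, track_chunk, map_chunks, bootstrap_with_depth,
                         window_size=5, n_bootstrap=3):
    """Process a sequence of event chunks, alternating tracking and mapping."""
    state = SlamState()

    for i, chunk in enumerate(event_chunks):
        if i < n_bootstrap:
            # Bootstrap phase: jointly fit Gaussians + poses, then re-initialize the
            # geometry with a pretrained monocular depth model (Marigold in the paper).
            bootstrap_with_depth(state, chunk)
            continue

        # Tracking: freeze the map, estimate this chunk's start/end poses.
        init = state.trajectory[-1] if state.trajectory else None
        pose_pair = track_chunk(state.gaussians, chunk, init=init)
        state.trajectory.append(pose_pair)

        # Mapping: jointly refine Gaussians and poses over a sliding window of chunks.
        recent = event_chunks[max(0, i - window_size + 1): i + 1]
        map_chunks(state, recent)

    return state
```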
Core Method: Bridging Events and Gaussians
To understand how IncEventGS works, we need to understand three components: the 3D representation, the event formation model, and the trajectory modeling.
1. The 3D Scene Representation
Instead of a neural network, the scene is represented by a cloud of 3D Gaussians. You can think of these as 3D blobs, each with a position, size, orientation, color, and opacity.
Mathematically, the shape of a Gaussian is defined by a covariance matrix \(\Sigma\). To ensure this matrix is valid (positive semi-definite), it is constructed from a scaling vector \(S\) and a rotation matrix \(R\):
$$\Sigma = R\,S\,S^{T}R^{T}$$
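As a quick sanity check of this construction, here is a small NumPy/SciPy sketch that assembles \(\Sigma\) from a scale vector and a rotation given as a quaternion. The example values are arbitrary.

```python
# Build a Gaussian's covariance Sigma = R S S^T R^T from its scale and rotation.

import numpy as np
from scipy.spatial.transform import Rotation


def build_covariance(scale: np.ndarray, quat_xyzw: np.ndarray) -> np.ndarray:
    """Return the 3x3 covariance, positive semi-definite by construction."""
    R = Rotation.from_quat(quat_xyzw).as_matrix()   # 3x3 rotation matrix
    S = np.diag(scale)                              # 3x3 diagonal scaling matrix
    return R @ S @ S.T @ R.T


# Example: a blob stretched along x, then rotated 45 degrees about the z axis.
sigma = build_covariance(
    scale=np.array([0.30, 0.05, 0.05]),
    quat_xyzw=Rotation.from_euler("z", 45, degrees=True).as_quat(),
)
print(np.linalg.eigvalsh(sigma))  # eigenvalues are the squared scales: [0.0025, 0.0025, 0.09]
```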
When we want to render an image from a specific camera view, these 3D blobs are projected onto the 2D image plane. The 3D covariance \(\Sigma\) becomes a 2D covariance \(\Sigma'\):
$$\Sigma' = J\,W\,\Sigma\,W^{T}J^{T}$$
Here, \(J\) is the Jacobian of the projective transformation, and \(W\) is the viewing transformation. Once projected, the system calculates the color of a pixel by sorting the Gaussians from front to back and blending them. This is known as alpha blending:
$$I(\mathbf{x}) = \sum_{i \in N} c_i\,\alpha_i \prod_{j=1}^{i-1}\left(1 - \alpha_j\right)$$
In this formula, \(c_i\) is the color of the \(i\)-th Gaussian, and \(\alpha_i\) is its opacity contribution. The opacity \(\alpha\) depends on the Gaussian’s learned opacity \(o\) and how far the pixel is from the center of the Gaussian:
$$\alpha_i = o_i \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu}_i)^{T}\,{\Sigma'}^{-1}\,(\mathbf{x} - \boldsymbol{\mu}_i)\right)$$

where \(\boldsymbol{\mu}_i\) is the projected 2D center of the Gaussian and \(\mathbf{x}\) is the pixel coordinate.
The system also renders depth maps and alpha maps (visibility masks) using similar blending formulations:
$$D(\mathbf{x}) = \sum_{i \in N} d_i\,\alpha_i \prod_{j=1}^{i-1}\left(1 - \alpha_j\right), \qquad A(\mathbf{x}) = \sum_{i \in N} \alpha_i \prod_{j=1}^{i-1}\left(1 - \alpha_j\right)$$

where \(d_i\) is the depth of the \(i\)-th Gaussian in the camera frame.
These equations provide a fully differentiable way to go from a list of 3D parameters to a 2D image. If we change a Gaussian’s position slightly, the pixel colors change slightly, allowing us to use gradient descent for optimization.
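To make the blending formulas concrete, here is a tiny per-pixel NumPy illustration. The real system uses a tile-based GPU rasterizer; this sketch only shows how the color, depth, and alpha values fall out of the same front-to-back accumulation, using made-up example values.

```python
# Simplified per-pixel front-to-back alpha blending of already depth-sorted Gaussians.

import numpy as np


def blend_pixel(colors, depths, alphas):
    """colors: (N, 3), depths: (N,), alphas: (N,) opacity contributions.

    Returns (pixel_color, pixel_depth, accumulated_alpha).
    """
    color = np.zeros(3)
    depth = 0.0
    transmittance = 1.0  # running product of (1 - alpha_j) for Gaussians already blended
    for c, d, a in zip(colors, depths, alphas):
        weight = a * transmittance
        color += weight * c
        depth += weight * d
        transmittance *= (1.0 - a)
    return color, depth, 1.0 - transmittance  # 1 - T is the rendered alpha-map value


# Two Gaussians along the ray: a semi-transparent red one in front of a blue one.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(blend_pixel(colors, depths=np.array([1.0, 3.0]), alphas=np.array([0.6, 0.9])))
```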
2. The Event Formation Model
3D Gaussian Splatting naturally renders standard color images. However, an event camera doesn’t record images; it records changes. Specifically, an event is triggered at pixel \(x\) at time \(t\) when the logarithmic brightness changes by a certain threshold \(C\).
To train the 3D Gaussians using event data, the researchers need a way to compare the “real” events with the “synthesized” scene. They do this by accumulating events over a very small time window \(\Delta t\). The measured brightness change from the event stream is:
$$\Delta L(\mathbf{x}) = \sum_{e_i \in \mathcal{E}(\mathbf{x},\,\Delta t)} p_i\, C$$

where \(\mathcal{E}(\mathbf{x}, \Delta t)\) is the set of events triggered at pixel \(\mathbf{x}\) during the window.
Here, \(p_i\) is the polarity (positive or negative change). Effectively, we are summing up the events to see how much the brightness changed.
The “synthesized” brightness change is calculated by rendering two images from the 3D Gaussians: one at time \(t_k\) and one at time \(t_k + \Delta t\). The difference between the log-brightness of these two rendered images represents the expected events:
$$\Delta \hat{L}(\mathbf{x}) = \log \hat{I}(\mathbf{x},\, t_k + \Delta t) - \log \hat{I}(\mathbf{x},\, t_k)$$
The learning signal comes from minimizing the difference between the measured brightness change and the synthesized one above.
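As a rough sketch of how this supervision could be assembled: the snippet below accumulates real events into a measured brightness-change map and compares it with the difference of two rendered log-brightness images. The `CONTRAST_THRESHOLD` value, the `render_log_brightness` callable, and the L1 penalty are assumptions for illustration, not the authors' exact choices.

```python
# Sketch of the event supervision: measured vs. synthesized brightness change.

import numpy as np

CONTRAST_THRESHOLD = 0.2  # assumed value of C; the real threshold is sensor-dependent


def accumulate_events(events, height, width):
    """events: iterable of (x, y, t, polarity) tuples with polarity in {-1, +1}."""
    delta_L = np.zeros((height, width))
    for x, y, _, p in events:
        delta_L[int(y), int(x)] += p * CONTRAST_THRESHOLD
    return delta_L


def event_loss(events, render_log_brightness, pose_start, pose_end, height, width):
    """Mean absolute difference between measured and synthesized brightness change."""
    measured = accumulate_events(events, height, width)
    synthesized = render_log_brightness(pose_end) - render_log_brightness(pose_start)
    return np.abs(synthesized - measured).mean()
```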
3. Continuous Trajectory Modeling
In standard video, a camera has a single pose per frame. Event data is continuous. If the camera moves quickly during the 50ms window of an event chunk, assuming a single static pose would result in errors.
IncEventGS models the camera trajectory as a continuous path. For any specific time \(t_k\) within a chunk, the camera pose \(T_k\) is interpolated between a start pose \(T_{start}\) and an end pose \(T_{end}\):
$$T_k = T_{start}\,\exp\!\left(\frac{t_k - t_{start}}{t_{end} - t_{start}}\,\log\!\left(T_{start}^{-1}\,T_{end}\right)\right)$$
This interpolation (using Lie algebra operations) ensures the movement is smooth and differentiable. This allows the system to query the exact camera pose for any specific event timestamp.
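Below is a minimal sketch of continuous-time pose interpolation. The paper interpolates in the SE(3) Lie algebra; here a rotation slerp plus linear translation interpolation is used as an easy-to-read stand-in, and `T_start` / `T_end` are assumed to be 4x4 camera-to-world matrices.

```python
# Query a smooth intermediate camera pose for any event timestamp within a chunk.

import numpy as np
from scipy.spatial.transform import Rotation, Slerp


def interpolate_pose(T_start, T_end, t, t_start, t_end):
    tau = (t - t_start) / (t_end - t_start)                    # normalized time in [0, 1]

    key_rots = Rotation.from_matrix(np.stack([T_start[:3, :3], T_end[:3, :3]]))
    R_t = Slerp([0.0, 1.0], key_rots)(tau).as_matrix()         # smooth rotation interpolation
    p_t = (1 - tau) * T_start[:3, 3] + tau * T_end[:3, 3]      # linear translation

    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R_t, p_t
    return T
```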
The Optimization Loop
With the representations in place, the system enters its incremental loop.
Step 1: Tracking
When a new chunk of events arrives, the system assumes the 3D map is fixed. It attempts to find the best \(T_{start}\) and \(T_{end}\) for this new chunk. It does this by minimizing the difference between the real events and the events rendered from the current map:
$$T_{start}^{*},\ T_{end}^{*} \;=\; \arg\min_{T_{start},\,T_{end}} \big\|\, \Delta\hat{L}(\mathbf{x}) - \Delta L(\mathbf{x}) \,\big\|$$
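A sketch of what this step might look like in PyTorch: the Gaussian map is frozen and only the pose parameters of the current chunk are updated by gradient descent on the event loss. The parameterization (a pair of 6-vectors), the `render_delta_logL` callable, and the optimizer settings are assumptions, not the authors' implementation.

```python
# Tracking sketch: optimize only the chunk's start/end pose parameters.

import torch


def track_chunk(render_delta_logL, measured_delta_L, pose_init, iters=100, lr=1e-3):
    """pose_init: (2, 6) tensor of [start, end] pose parameters (e.g. se(3) twists)."""
    pose = pose_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([pose], lr=lr)

    for _ in range(iters):
        optimizer.zero_grad()
        synthesized = render_delta_logL(pose[0], pose[1])       # differentiable w.r.t. pose
        loss = (synthesized - measured_delta_L).abs().mean()    # event loss only
        loss.backward()
        optimizer.step()

    return pose.detach()
```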
Step 2: Mapping (Bundle Adjustment)
Once the tracking provides a good initial guess for the movement, the chunk is added to the “Mapper.” The Mapper uses a sliding window (looking at the most recent \(N\) chunks) and optimizes everything simultaneously: the camera poses and the 3D Gaussian parameters.
The total loss function combines the event error (\(\mathcal{L}_{event}\)) and a structural similarity loss (\(\mathcal{L}_{ssim}\)) to ensure the reconstructed images look structurally plausible:
$$\mathcal{L} \;=\; \mathcal{L}_{event} + \lambda\,\mathcal{L}_{ssim}$$

where \(\lambda\) weights the structural term.
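A small sketch of how such a combined loss could be written, assuming the `pytorch_msssim` package for SSIM and a hand-picked weight `lam_ssim`. Applying the SSIM term to the brightness-change maps (after normalizing them into \([0, 1]\)) is a simplification here; the paper's exact formulation may differ.

```python
# Combined mapping loss sketch: event term plus a structural-similarity term.

import torch
from pytorch_msssim import ssim


def _normalize(x):
    # Shift/scale each map into [0, 1] so SSIM's data_range assumption holds.
    x = x - x.amin(dim=(-2, -1), keepdim=True)
    return x / (x.amax(dim=(-2, -1), keepdim=True) + 1e-8)


def mapping_loss(rendered_delta_L, measured_delta_L, lam_ssim=0.2):
    """Inputs are brightness-change maps shaped (N, 1, H, W)."""
    l_event = (rendered_delta_L - measured_delta_L).abs().mean()
    l_ssim = 1.0 - ssim(_normalize(rendered_delta_L), _normalize(measured_delta_L),
                        data_range=1.0)
    return l_event + lam_ssim * l_ssim
```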
During mapping, the system also handles the growing of the map. As the camera explores new areas, new Gaussians need to be created. The system identifies pixels with low opacity in the rendered alpha map (meaning “empty space” to the current model) and spawns new Gaussians there.
The position of a new Gaussian is determined by un-projecting a pixel \(u\) using its rendered depth \(d_u\):
$$\mathbf{p}_u = T_k \left( d_u\, K^{-1}\, \tilde{\mathbf{u}} \right)$$

where \(K\) is the camera intrinsics matrix, \(\tilde{\mathbf{u}}\) is the pixel in homogeneous coordinates, and \(T_k\) is the camera-to-world pose.
New Gaussians are spawned only where the current visibility \(V\) falls below a threshold \(\lambda_V\):

$$V(\mathbf{u}) < \lambda_V$$

where \(V(\mathbf{u})\) is the value of the rendered alpha map at pixel \(\mathbf{u}\).
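Here is a sketch of that map-growth step: pixels whose rendered alpha falls below the threshold are un-projected with their rendered depth to seed new Gaussian centers. The intrinsics `K`, the camera-to-world pose `T_cam_to_world`, and the default `lambda_v=0.5` are assumed inputs for illustration.

```python
# Densification sketch: spawn Gaussian centers in poorly covered regions.

import numpy as np


def spawn_new_gaussian_centers(alpha_map, depth_map, K, T_cam_to_world, lambda_v=0.5):
    """Return an (M, 3) array of world-space positions for newly spawned Gaussians."""
    v, u = np.nonzero(alpha_map < lambda_v)          # pixels the current map barely covers
    d = depth_map[v, u]

    pixels_h = np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)  # homogeneous pixels
    points_cam = (np.linalg.inv(K) @ pixels_h) * d   # back-project: d * K^-1 * u~
    points_cam_h = np.vstack([points_cam, np.ones((1, points_cam.shape[1]))])
    points_world = (T_cam_to_world @ points_cam_h)[:3]
    return points_world.T
```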
The Initialization Problem
There is a significant hurdle in event-based reconstruction: Data Ambiguity. Because events only measure change, they don’t contain absolute intensity values (the “DC component”). If you start with a random cloud of Gaussians and a random trajectory, the optimization has a very hard time converging to a correct geometry, because infinitely many different scenes can produce the same brightness changes.
IncEventGS solves this with a clever bootstrapping phase.
- Initial Guess: The system starts with random initialization for the first few chunks.
- Rendering: It optimizes the Gaussians to match the events. This produces a scene that generates correct events but often has distorted 3D geometry and depth (e.g., the scene might look flat or stretched).
- Depth Refinement: The system renders a brightness image from this initial noisy reconstruction. It then feeds this image into a pre-trained Monocular Depth Estimation Model (specifically, a model called Marigold).
- Re-Initialization: Having been trained on millions of images, the depth model predicts a plausible depth structure for the scene. The authors use this predicted depth to reposition the 3D Gaussians, correcting the geometry.

As shown in Figure 5, the “Pretrained Marigold Model” takes the initial, potentially flawed rendering and outputs a depth map. This map acts as a guide to snap the 3D Gaussians into a plausible 3D structure (the point cloud on the right). This bootstrapped state serves as the solid foundation for the rest of the incremental mapping.
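A sketch of how this re-initialization could look, re-using the `spawn_new_gaussian_centers` helper from the densification sketch above. The monocular depth model is abstracted as a `predict_depth(image)` callable standing in for Marigold, and the median-based scale alignment (monocular depth is only defined up to an ambiguity) is an assumption for illustration.

```python
# Depth-based re-initialization sketch: use predicted depth to re-place the Gaussians.

import numpy as np


def reinitialize_with_depth(rendered_image, rendered_depth, predict_depth, K, T_cam_to_world):
    mono_depth = predict_depth(rendered_image)                  # relative depth from the model

    # Align the relative prediction to the scale of the current (noisy) rendering.
    scale = np.median(rendered_depth) / (np.median(mono_depth) + 1e-8)
    aligned_depth = mono_depth * scale

    # Re-use the un-projection from the densification step, treating every pixel as "empty"
    # so that a Gaussian is seeded at each predicted depth.
    full_alpha = np.zeros_like(aligned_depth)
    return spawn_new_gaussian_centers(full_alpha, aligned_depth, K, T_cam_to_world,
                                      lambda_v=1.0)
```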
We can see the impact of this strategy in the ablation study below. “w/o” refers to the method without this depth initialization. The ATE (Absolute Trajectory Error), which measures how far the estimated camera path is from reality, drops massively from 1.534 cm to just 0.046 cm when depth initialization is used.

Experimental Results
The researchers tested IncEventGS on both synthetic datasets (Replica) and real-world event data (TUM-VIE). They compared their approach against NeRF-based event methods and “Two-Stage” methods (converting events to video first, then running standard reconstruction).
Visual Quality
The visual results are striking. In the figure below, compare the “Ours” row to the other methods. NeRF-based methods (E-NeRF, EventNeRF) often struggle to resolve fine details or produce noisy artifacts.

The IncEventGS results are significantly sharper and closer to the Ground Truth (GT). This is largely due to the explicit nature of Gaussian Splatting, which preserves high-frequency details better than the implicit neural networks used in NeRFs.
We see similar success in real-world datasets. In Figure 3 (top section below), look at the clarity of the objects on the desk in the “Ours” row compared to the blurry, ghosting artifacts in “E-NeRF” or “E2VID+COLMAP”.

Trajectory Estimation
It’s not enough to just render pretty pictures; a robotic system needs to know where it is. The bottom half of the image above (Figure 4) visualizes the trajectory error. The color represents the error magnitude.
- Left (DEVO): A state-of-the-art event visual odometry method.
- Middle (E2VID+COLMAP): Converting events to video, then using standard structure-from-motion. This fails catastrophically (note the massive scale of error).
- Right (Ours): IncEventGS maintains a tight, blue (low error) path that closely follows the ground truth.
Fast Motion Performance
The raison d’être of event cameras is handling speed. The researchers tested the system in scenarios with rapid camera movement.

In Figure 6, notice how the “Blur” row (simulating what a standard camera sees) is unusable. The “Ours” row maintains sharp geometry and texture, proving that the continuous trajectory modeling and the high temporal resolution of the event data are working as intended.
Efficiency
Finally, one of the main selling points of Gaussian Splatting is speed.

IncEventGS requires significantly less training time (0.5 hours vs 12+ hours for NeRF methods) and consumes less storage. This efficiency is critical for potential deployment on mobile robots or drones with limited onboard compute.
Conclusion
IncEventGS represents a significant step forward for event-based vision. By marrying the speed and precision of 3D Gaussian Splatting with the robustness of Event Cameras, the authors have created a system that can reconstruct detailed 3D scenes without needing external tracking systems.
The clever integration of a “Tracking and Mapping” pipeline, combined with a depth-based initialization strategy to overcome the ambiguity of event data, allows this method to outperform previous NeRF-based approaches. For the future of autonomous navigation in high-speed or low-light environments, this combination of sensors and algorithms looks incredibly promising.