Imagine a drone flying at high speed through a dimly lit tunnel. A standard camera would likely fail in this scenario; the fast motion causes severe blur, and the low light results in grainy, unusable footage. This is the bottleneck for many robotics applications today. However, there is a different kind of sensor that thrives in exactly these conditions: the Event Camera.
Event cameras are bio-inspired sensors that work differently than the cameras in our phones. Instead of capturing full frames at a fixed rate, they operate asynchronously, detecting changes in brightness at the pixel level. This gives them incredible advantages: microsecond latency, high dynamic range, and no motion blur.
But there is a catch. Because event cameras only output a stream of “changes” (events) rather than complete images, standard computer vision algorithms for 3D reconstruction don’t work on them directly. Specifically, figuring out the 3D structure of a scene and the camera’s position (pose) simultaneously—a process known as Simultaneous Localization and Mapping (SLAM)—is notoriously difficult with event data alone.
In this post, we will dive deep into IncEventGS, a new method proposed by researchers that solves this problem. They combine the efficiency of 3D Gaussian Splatting with the high-speed data of event cameras to reconstruct 3D scenes and track camera motion without needing to know the camera poses beforehand.
The Challenge: Mapping without a Map
The goal of this research is “pose-free” reconstruction. In many controlled experiments, researchers use external motion capture systems to tell the computer exactly where the camera is at every millisecond. But in the real world—on a robot or a drone—we don’t have that luxury. The system must figure out the geometry of the world and its own path through it using only the data from the camera.
Existing methods for event-based reconstruction often rely on Neural Radiance Fields (NeRFs). While powerful, NeRFs are computationally heavy and slow to train. Furthermore, most event-based NeRFs assume the camera poses are already known. IncEventGS tackles this by using 3D Gaussian Splatting (3D-GS), a newer and faster explicit representation, and wrapping it in a SLAM-style tracking-and-mapping framework.
The IncEventGS Pipeline
At a high level, IncEventGS follows a “Tracking and Mapping” paradigm common in robotics. It processes the incoming stream of events in chunks.

As shown in Figure 1 above, the system is split into two main alternating processes:
- Tracking: The system estimates the camera’s motion for the current chunk of events. It freezes the 3D map and asks, “Given this map, how must the camera have moved to generate these events?”
- Mapping: Once the motion is estimated, the system refines the 3D map (the Gaussians) and the trajectory simultaneously to better fit the data.
Crucially, the system includes a novel initialization strategy using a depth estimation model (the “Pretrained Marigold Model” in the figure) to bootstrap the process, which we will detail later.
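Before diving into the individual components, here is a minimal sketch of how such an alternating tracking-and-mapping loop might be organized. The function and parameter names (`track_chunk`, `map_chunks`, `bootstrap_with_depth`, `window_size`, `n_bootstrap`) are hypothetical placeholders for illustration, not the authors' API.

```python
# A minimal sketch of the alternating tracking-and-mapping loop described above.

from dataclasses import dataclass, field


@dataclass
class SlamState:
    gaussians: object = None                         # the 3D Gaussian map (parameters)
    trajectory: list = field(default_factory=list)   # per-chunk (T_start, T_end) poses


def run_incremental_slam(event_chunks, track_chunk, map_chunks, bootstrap_with_depth,
                         window_size=5, n_bootstrap=3):
    """Process a sequence of event chunks, alternating tracking and mapping."""
    state = SlamState()

    for i, chunk in enumerate(event_chunks):
        if i < n_bootstrap:
            # Bootstrap phase: jointly fit Gaussians + poses, then re-initialize the
            # geometry with a pretrained monocular depth model (Marigold in the paper).
            bootstrap_with_depth(state, chunk)
            continue

        # Tracking: freeze the map, estimate this chunk's start/end poses.
        init = state.trajectory[-1] if state.trajectory else None
        pose_pair = track_chunk(state.gaussians, chunk, init=init)
        state.trajectory.append(pose_pair)

        # Mapping: jointly refine Gaussians and poses over a sliding window of chunks.
        recent = event_chunks[max(0, i - window_size + 1): i + 1]
        map_chunks(state, recent)

    return state
```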
Core Method: Bridging Events and Gaussians
To understand how IncEventGS works, we need to understand three components: the 3D representation, the event formation model, and the trajectory modeling.
1. The 3D Scene Representation
Instead of a neural network, the scene is represented by a cloud of 3D Gaussians. You can think of these as 3D blobs, each with a position, size, orientation, color, and opacity.
Mathematically, the shape of a Gaussian is defined by a covariance matrix \(\Sigma\). To ensure this matrix is valid (positive semi-definite), it is constructed from a scaling vector \(S\) and a rotation matrix \(R\):
$$\Sigma = R\,S\,S^{T}R^{T}$$
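As a quick sanity check of this construction, here is a small NumPy/SciPy sketch that assembles \(\Sigma\) from a scale vector and a rotation given as a quaternion. The example values are arbitrary.

```python
# Build a Gaussian's covariance Sigma = R S S^T R^T from its scale and rotation.

import numpy as np
from scipy.spatial.transform import Rotation


def build_covariance(scale: np.ndarray, quat_xyzw: np.ndarray) -> np.ndarray:
    """Return the 3x3 covariance, positive semi-definite by construction."""
    R = Rotation.from_quat(quat_xyzw).as_matrix()   # 3x3 rotation matrix
    S = np.diag(scale)                              # 3x3 diagonal scaling matrix
    return R @ S @ S.T @ R.T


# Example: a blob stretched along x, then rotated 45 degrees about the z axis.
sigma = build_covariance(
    scale=np.array([0.30, 0.05, 0.05]),
    quat_xyzw=Rotation.from_euler("z", 45, degrees=True).as_quat(),
)
print(np.linalg.eigvalsh(sigma))  # eigenvalues are the squared scales: [0.0025, 0.0025, 0.09]
```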
When we want to render an image from a specific camera view, these 3D blobs are projected onto the 2D image plane. The 3D covariance \(\Sigma\) becomes a 2D covariance \(\Sigma'\):
$$\Sigma' = J\,W\,\Sigma\,W^{T}J^{T}$$
Here, \(J\) is the Jacobian of the projective transformation, and \(W\) is the viewing transformation. Once projected, the system calculates the color of a pixel by sorting the Gaussians from front to back and blending them. This is known as alpha blending:
$$I(\mathbf{x}) = \sum_{i \in N} c_i\,\alpha_i \prod_{j=1}^{i-1}\left(1 - \alpha_j\right)$$
In this formula, \(c_i\) is the color of the \(i\)-th Gaussian, and \(\alpha_i\) is its opacity contribution. The opacity \(\alpha\) depends on the Gaussian’s learned opacity \(o\) and how far the pixel is from the center of the Gaussian:
$$\alpha_i = o_i \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu}_i)^{T}\,{\Sigma'}^{-1}\,(\mathbf{x} - \boldsymbol{\mu}_i)\right)$$

where \(\boldsymbol{\mu}_i\) is the projected 2D center of the Gaussian and \(\mathbf{x}\) is the pixel coordinate.
The system also renders depth maps and alpha maps (visibility masks) using similar blending formulations:
$$D(\mathbf{x}) = \sum_{i \in N} d_i\,\alpha_i \prod_{j=1}^{i-1}\left(1 - \alpha_j\right), \qquad A(\mathbf{x}) = \sum_{i \in N} \alpha_i \prod_{j=1}^{i-1}\left(1 - \alpha_j\right)$$

where \(d_i\) is the depth of the \(i\)-th Gaussian in the camera frame.
These equations provide a fully differentiable way to go from a list of 3D parameters to a 2D image. If we change a Gaussian’s position slightly, the pixel colors change slightly, allowing us to use gradient descent for optimization.
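To make the blending formulas concrete, here is a tiny per-pixel NumPy illustration. The real system uses a tile-based GPU rasterizer; this sketch only shows how the color, depth, and alpha values fall out of the same front-to-back accumulation, using made-up example values.

```python
# Simplified per-pixel front-to-back alpha blending of already depth-sorted Gaussians.

import numpy as np


def blend_pixel(colors, depths, alphas):
    """colors: (N, 3), depths: (N,), alphas: (N,) opacity contributions.

    Returns (pixel_color, pixel_depth, accumulated_alpha).
    """
    color = np.zeros(3)
    depth = 0.0
    transmittance = 1.0  # running product of (1 - alpha_j) for Gaussians already blended
    for c, d, a in zip(colors, depths, alphas):
        weight = a * transmittance
        color += weight * c
        depth += weight * d
        transmittance *= (1.0 - a)
    return color, depth, 1.0 - transmittance  # 1 - T is the rendered alpha-map value


# Two Gaussians along the ray: a semi-transparent red one in front of a blue one.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(blend_pixel(colors, depths=np.array([1.0, 3.0]), alphas=np.array([0.6, 0.9])))
```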
2. The Event Formation Model
3D Gaussian Splatting naturally renders standard color images. However, an event camera doesn’t record images; it records changes. Specifically, an event is triggered at pixel \(x\) at time \(t\) when the logarithmic brightness changes by a certain threshold \(C\).
To train the 3D Gaussians using event data, the researchers need a way to compare the “real” events with the “synthesized” scene. They do this by accumulating events over a very small time window \(\Delta t\). The measured brightness change from the event stream is:
$$\Delta L(\mathbf{x}) = \sum_{e_i \in \mathcal{E}(\mathbf{x},\,\Delta t)} p_i\, C$$

where \(\mathcal{E}(\mathbf{x}, \Delta t)\) is the set of events triggered at pixel \(\mathbf{x}\) during the window.
Here, \(p_i\) is the polarity (positive or negative change). Effectively, we are summing up the events to see how much the brightness changed.
The “synthesized” brightness change is calculated by rendering two images from the 3D Gaussians: one at time \(t_k\) and one at time \(t_k + \Delta t\). The difference between the log-brightness of these two rendered images represents the expected events:
$$\Delta \hat{L}(\mathbf{x}) = \log \hat{I}(\mathbf{x},\, t_k + \Delta t) - \log \hat{I}(\mathbf{x},\, t_k)$$
The learning signal comes from minimizing the difference between the measured brightness change and the synthesized one above.
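As a rough sketch of how this supervision could be assembled: the snippet below accumulates real events into a measured brightness-change map and compares it with the difference of two rendered log-brightness images. The `CONTRAST_THRESHOLD` value, the `render_log_brightness` callable, and the L1 penalty are assumptions for illustration, not the authors' exact choices.

```python
# Sketch of the event supervision: measured vs. synthesized brightness change.

import numpy as np

CONTRAST_THRESHOLD = 0.2  # assumed value of C; the real threshold is sensor-dependent


def accumulate_events(events, height, width):
    """events: iterable of (x, y, t, polarity) tuples with polarity in {-1, +1}."""
    delta_L = np.zeros((height, width))
    for x, y, _, p in events:
        delta_L[int(y), int(x)] += p * CONTRAST_THRESHOLD
    return delta_L


def event_loss(events, render_log_brightness, pose_start, pose_end, height, width):
    """Mean absolute difference between measured and synthesized brightness change."""
    measured = accumulate_events(events, height, width)
    synthesized = render_log_brightness(pose_end) - render_log_brightness(pose_start)
    return np.abs(synthesized - measured).mean()
```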
3. Continuous Trajectory Modeling
In standard video, a camera has a single pose per frame. Event data is continuous. If the camera moves quickly during the 50ms window of an event chunk, assuming a single static pose would result in errors.
IncEventGS models the camera trajectory as a continuous path. For any specific time \(t_k\) within a chunk, the camera pose \(T_k\) is interpolated between a start pose \(T_{start}\) and an end pose \(T_{end}\):
$$T_k = T_{start}\,\exp\!\left(\frac{t_k - t_{start}}{t_{end} - t_{start}}\,\log\!\left(T_{start}^{-1}\,T_{end}\right)\right)$$
This interpolation (using Lie algebra operations) ensures the movement is smooth and differentiable. This allows the system to query the exact camera pose for any specific event timestamp.
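Below is a minimal sketch of continuous-time pose interpolation. The paper interpolates in the SE(3) Lie algebra; here a rotation slerp plus linear translation interpolation is used as an easy-to-read stand-in, and `T_start` / `T_end` are assumed to be 4x4 camera-to-world matrices.

```python
# Query a smooth intermediate camera pose for any event timestamp within a chunk.

import numpy as np
from scipy.spatial.transform import Rotation, Slerp


def interpolate_pose(T_start, T_end, t, t_start, t_end):
    tau = (t - t_start) / (t_end - t_start)                    # normalized time in [0, 1]

    key_rots = Rotation.from_matrix(np.stack([T_start[:3, :3], T_end[:3, :3]]))
    R_t = Slerp([0.0, 1.0], key_rots)(tau).as_matrix()         # smooth rotation interpolation
    p_t = (1 - tau) * T_start[:3, 3] + tau * T_end[:3, 3]      # linear translation

    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R_t, p_t
    return T
```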
The Optimization Loop
With the representations in place, the system enters its incremental loop.
Step 1: Tracking
When a new chunk of events arrives, the system assumes the 3D map is fixed. It attempts to find the best \(T_{start}\) and \(T_{end}\) for this new chunk. It does this by minimizing the difference between the real events and the events rendered from the current map:
$$T_{start}^{*},\ T_{end}^{*} \;=\; \arg\min_{T_{start},\,T_{end}} \big\|\, \Delta\hat{L}(\mathbf{x}) - \Delta L(\mathbf{x}) \,\big\|$$
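A sketch of what this step might look like in PyTorch: the Gaussian map is frozen and only the pose parameters of the current chunk are updated by gradient descent on the event loss. The parameterization (a pair of 6-vectors), the `render_delta_logL` callable, and the optimizer settings are assumptions, not the authors' implementation.

```python
# Tracking sketch: optimize only the chunk's start/end pose parameters.

import torch


def track_chunk(render_delta_logL, measured_delta_L, pose_init, iters=100, lr=1e-3):
    """pose_init: (2, 6) tensor of [start, end] pose parameters (e.g. se(3) twists)."""
    pose = pose_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([pose], lr=lr)

    for _ in range(iters):
        optimizer.zero_grad()
        synthesized = render_delta_logL(pose[0], pose[1])       # differentiable w.r.t. pose
        loss = (synthesized - measured_delta_L).abs().mean()    # event loss only
        loss.backward()
        optimizer.step()

    return pose.detach()
```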
Step 2: Mapping (Bundle Adjustment)
Once the tracking provides a good initial guess for the movement, the chunk is added to the “Mapper.” The Mapper uses a sliding window (looking at the most recent \(N\) chunks) and optimizes everything simultaneously: the camera poses and the 3D Gaussian parameters.
The total loss function combines the event error (\(\mathcal{L}_{event}\)) and a structural similarity loss (\(\mathcal{L}_{ssim}\)) to ensure the reconstructed images look structurally plausible:
$$\mathcal{L} \;=\; \mathcal{L}_{event} + \lambda\,\mathcal{L}_{ssim}$$

where \(\lambda\) weights the structural term.
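A small sketch of how such a combined loss could be written, assuming the `pytorch_msssim` package for SSIM and a hand-picked weight `lam_ssim`. Applying the SSIM term to the brightness-change maps (after normalizing them into \([0, 1]\)) is a simplification here; the paper's exact formulation may differ.

```python
# Combined mapping loss sketch: event term plus a structural-similarity term.

import torch
from pytorch_msssim import ssim


def _normalize(x):
    # Shift/scale each map into [0, 1] so SSIM's data_range assumption holds.
    x = x - x.amin(dim=(-2, -1), keepdim=True)
    return x / (x.amax(dim=(-2, -1), keepdim=True) + 1e-8)


def mapping_loss(rendered_delta_L, measured_delta_L, lam_ssim=0.2):
    """Inputs are brightness-change maps shaped (N, 1, H, W)."""
    l_event = (rendered_delta_L - measured_delta_L).abs().mean()
    l_ssim = 1.0 - ssim(_normalize(rendered_delta_L), _normalize(measured_delta_L),
                        data_range=1.0)
    return l_event + lam_ssim * l_ssim
```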
During mapping, the system also handles the growing of the map. As the camera explores new areas, new Gaussians need to be created. The system identifies pixels with low opacity in the rendered alpha map (meaning “empty space” to the current model) and spawns new Gaussians there.
The position of a new Gaussian is determined by un-projecting a pixel \(u\) using its rendered depth \(d_u\):
$$\mathbf{p}_u = T_k \left( d_u\, K^{-1}\, \tilde{\mathbf{u}} \right)$$

where \(K\) is the camera intrinsics matrix, \(\tilde{\mathbf{u}}\) is the pixel in homogeneous coordinates, and \(T_k\) is the camera-to-world pose.
New Gaussians are spawned only where the current visibility \(V\) falls below a threshold \(\lambda_V\):

$$V(\mathbf{u}) < \lambda_V$$

where \(V(\mathbf{u})\) is the value of the rendered alpha map at pixel \(\mathbf{u}\).
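Here is a sketch of that map-growth step: pixels whose rendered alpha falls below the threshold are un-projected with their rendered depth to seed new Gaussian centers. The intrinsics `K`, the camera-to-world pose `T_cam_to_world`, and the default `lambda_v=0.5` are assumed inputs for illustration.

```python
# Densification sketch: spawn Gaussian centers in poorly covered regions.

import numpy as np


def spawn_new_gaussian_centers(alpha_map, depth_map, K, T_cam_to_world, lambda_v=0.5):
    """Return an (M, 3) array of world-space positions for newly spawned Gaussians."""
    v, u = np.nonzero(alpha_map < lambda_v)          # pixels the current map barely covers
    d = depth_map[v, u]

    pixels_h = np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)  # homogeneous pixels
    points_cam = (np.linalg.inv(K) @ pixels_h) * d   # back-project: d * K^-1 * u~
    points_cam_h = np.vstack([points_cam, np.ones((1, points_cam.shape[1]))])
    points_world = (T_cam_to_world @ points_cam_h)[:3]
    return points_world.T
```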
The Initialization Problem
There is a significant hurdle in event-based reconstruction: Data Ambiguity. Because events only measure change, they don’t contain absolute intensity values (the “DC component”). If you start with a random cloud of Gaussians and a random trajectory, the optimization has a very hard time converging to a correct geometry, because infinitely many different scenes can produce the same brightness changes.
IncEventGS solves this with a clever bootstrapping phase.
- Initial Guess: The system starts with random initialization for the first few chunks.
- Rendering: It optimizes the Gaussians to match the events. This produces a scene that generates correct events but often has distorted 3D geometry and depth (e.g., the scene might look flat or stretched).
- Depth Refinement: The system renders a brightness image from this initial noisy reconstruction. It then feeds this image into a pre-trained Monocular Depth Estimation Model (specifically, a model called Marigold).
- Re-Initialization: Having been trained on millions of images, the depth model predicts a plausible depth structure for the scene. The authors use this predicted depth to reposition the 3D Gaussians, correcting the geometry.

As shown in Figure 5, the “Pretrained Marigold Model” takes the initial, potentially flawed rendering and outputs a depth map. This map acts as a guide to snap the 3D Gaussians into a plausible 3D structure (the point cloud on the right). This bootstrapped state serves as the solid foundation for the rest of the incremental mapping.
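A sketch of how this re-initialization could look, re-using the `spawn_new_gaussian_centers` helper from the densification sketch above. The monocular depth model is abstracted as a `predict_depth(image)` callable standing in for Marigold, and the median-based scale alignment (monocular depth is only defined up to an ambiguity) is an assumption for illustration.

```python
# Depth-based re-initialization sketch: use predicted depth to re-place the Gaussians.

import numpy as np


def reinitialize_with_depth(rendered_image, rendered_depth, predict_depth, K, T_cam_to_world):
    mono_depth = predict_depth(rendered_image)                  # relative depth from the model

    # Align the relative prediction to the scale of the current (noisy) rendering.
    scale = np.median(rendered_depth) / (np.median(mono_depth) + 1e-8)
    aligned_depth = mono_depth * scale

    # Re-use the un-projection from the densification step, treating every pixel as "empty"
    # so that a Gaussian is seeded at each predicted depth.
    full_alpha = np.zeros_like(aligned_depth)
    return spawn_new_gaussian_centers(full_alpha, aligned_depth, K, T_cam_to_world,
                                      lambda_v=1.0)
```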
We can see the impact of this strategy in the ablation study below. “w/o” refers to the method without this depth initialization. The ATE (Absolute Trajectory Error), which measures how far the estimated camera path is from reality, drops massively from 1.534 cm to just 0.046 cm when depth initialization is used.

Experimental Results
The researchers tested IncEventGS on both synthetic datasets (Replica) and real-world event data (TUM-VIE). They compared their approach against NeRF-based event methods and “Two-Stage” methods (converting events to video first, then running standard reconstruction).
Visual Quality
The visual results are striking. In the figure below, compare the “Ours” row to the other methods. NeRF-based methods (E-NeRF, EventNeRF) often struggle to resolve fine details or produce noisy artifacts.

The IncEventGS results are significantly sharper and closer to the Ground Truth (GT). This is largely due to the explicit nature of Gaussian Splatting, which preserves high-frequency details better than the implicit neural networks used in NeRFs.
We see similar success in real-world datasets. In Figure 3 (top section below), look at the clarity of the objects on the desk in the “Ours” row compared to the blurry, ghosting artifacts in “E-NeRF” or “E2VID+COLMAP”.

Trajectory Estimation
It’s not enough to just render pretty pictures; a robotic system needs to know where it is. The bottom half of the image above (Figure 4) visualizes the trajectory error. The color represents the error magnitude.
- Left (DEVO): A state-of-the-art event visual odometry method.
- Middle (E2VID+COLMAP): Converting events to video, then using standard structure-from-motion. This fails catastrophically (note the massive scale of error).
- Right (Ours): IncEventGS maintains a tight, blue (low error) path that closely follows the ground truth.
Fast Motion Performance
The raison d’être of event cameras is handling speed. The researchers tested the system in scenarios with rapid camera movement.

In Figure 6, notice how the “Blur” row (simulating what a standard camera sees) is unusable. The “Ours” row maintains sharp geometry and texture, proving that the continuous trajectory modeling and the high temporal resolution of the event data are working as intended.
Efficiency
Finally, one of the main selling points of Gaussian Splatting is speed.

IncEventGS requires significantly less training time (0.5 hours vs 12+ hours for NeRF methods) and consumes less storage. This efficiency is critical for potential deployment on mobile robots or drones with limited onboard compute.
Conclusion
IncEventGS represents a significant step forward for event-based vision. By marrying the speed and precision of 3D Gaussian Splatting with the robustness of Event Cameras, the authors have created a system that can reconstruct detailed 3D scenes without needing external tracking systems.
The clever integration of a “Tracking and Mapping” pipeline, combined with a depth-based initialization strategy to overcome the ambiguity of event data, allows this method to outperform previous NeRF-based approaches. For the future of autonomous navigation in high-speed or low-light environments, this combination of sensors and algorithms looks incredibly promising.