Imagine you are trying to create a 3D model of a scene using a camera mounted on a high-speed train or a racing drone. Traditional cameras fail here—they suffer from massive motion blur due to fixed exposure times. This is where spike cameras come in. Inspired by the biological retina, these sensors capture light as a continuous stream of binary spikes (0s and 1s) at frequencies up to 40,000 Hz, theoretically eliminating motion blur.
However, turning those binary spikes into a high-fidelity 3D scene is incredibly difficult. The conventional approach involves a fragile assembly line of steps: first reconstruct an image, then figure out where the camera was, and finally build the 3D model. If the first step is slightly off, the whole system collapses.
In this post, we are diving deep into USP-Gaussian, a research paper that proposes a way to break this cycle. The authors introduce a unified framework that performs image reconstruction, camera pose correction, and 3D Gaussian Splatting (3DGS) simultaneously.
The Problem: The Cascading Error Trap
To understand why USP-Gaussian is necessary, we first need to look at how researchers currently handle spike camera data for 3D reconstruction. The standard workflow is a “cascaded” pipeline:
- Spike-to-Image: Use a neural network to turn the noisy binary spike stream into a clean 2D image.
- Pose Estimation: Use those reconstructed images to calculate the camera’s position and orientation (using tools like COLMAP).
- Novel View Synthesis: Feed the images and poses into a 3D engine like NeRF or 3D Gaussian Splatting to render the scene.
The fatal flaw here is error propagation. If the initial image reconstruction is noisy or lacks texture (which is common in high-speed scenarios), the pose estimation will be inaccurate. If the poses are wrong, the 3D reconstruction will be blurry and riddled with artifacts.
USP-Gaussian proposes a “synergistic optimization” framework. Instead of doing these steps one by one, why not let them help each other?

As shown in the figure above (Left), the framework consists of two parallel branches: a Reconstruction Network (Recon-Net) and a 3D Gaussian Splatting (3DGS) module. They are connected by a joint loss function, allowing them to correct each other during training.
Background: Spikes and Splats
Before dissecting the architecture, let’s briefly establish the two core technologies at play.
1. The Spike Camera
Unlike a standard camera that accumulates light over a fixed exposure time, a spike camera monitors the incoming light continuously. Each pixel has an integrator: when the integrated voltage reaches a threshold, it fires a “spike” (a 1) and resets.

As captured in Figure 2 above, the camera outputs a stream of bits. High light intensity results in frequent firing; low light results in sparse firing. Mathematically, in the standard spike-camera model the accumulated voltage \(A(t)\) over time is described as:

\[ A(t) = \int_{0}^{t} \alpha \cdot I(\tau)\, d\tau \;\bmod\; \theta \]

where \(I(\tau)\) is the incoming light intensity, \(\alpha\) is the photoelectric conversion rate, and \(\theta\) is the firing threshold; a spike is emitted each time the running integral crosses a multiple of \(\theta\).
This mechanism allows the camera to capture extremely fast motion, such as a train moving at 350 km/h, without traditional motion blur.
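To make the integrate-and-fire principle concrete, here is a toy simulation of a spike camera pixel array. It is a minimal sketch rather than the real sensor’s signal chain; the function name and the constants `alpha` (conversion rate) and `theta` (firing threshold) are illustrative assumptions.

```python
import numpy as np

def simulate_spike_stream(intensity: np.ndarray, n_steps: int,
                          alpha: float = 0.1, theta: float = 1.0) -> np.ndarray:
    """Toy integrate-and-fire simulation of a spike camera pixel array.

    intensity : (H, W) light hitting each pixel (held constant here for
                simplicity; a real sensor sees a time-varying scene).
    alpha, theta : illustrative conversion rate and firing threshold.
    Returns a (n_steps, H, W) binary spike stream.
    """
    accumulator = np.zeros_like(intensity, dtype=float)
    spikes = np.zeros((n_steps, *intensity.shape), dtype=np.uint8)
    for t in range(n_steps):
        accumulator += alpha * intensity      # integrate incoming light
        fired = accumulator >= theta          # threshold reached?
        spikes[t][fired] = 1                  # emit a spike ...
        accumulator[fired] -= theta           # ... and reset, keeping the residual
    return spikes

# Bright pixels cross the threshold often (dense 1s); dark pixels rarely do.
stream = simulate_spike_stream(np.random.rand(4, 4), n_steps=100)
```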
2. 3D Gaussian Splatting (3DGS)
3DGS is a modern alternative to Neural Radiance Fields (NeRF). Instead of using a neural network to predict color at every point in space (which is slow), 3DGS represents the scene as a cloud of 3D Gaussians (ellipsoids).
Each Gaussian has a position (its mean \(\boldsymbol{\mu}\)), opacity, color, and covariance \(\boldsymbol{\Sigma}\) (shape). To render an image, these 3D Gaussians are projected (“splatted”) onto a 2D plane. The influence of a Gaussian on a point \(\mathbf{v}\) is calculated as:

\[ G(\mathbf{v}) = \exp\!\Big(-\tfrac{1}{2}(\mathbf{v}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{v}-\boldsymbol{\mu})\Big) \]
When projected to 2D for rendering, the covariance matrix is transformed using the viewing transformation \(W\) and the Jacobian \(J\) of the (affine approximation of the) projective mapping:

\[ \boldsymbol{\Sigma}' = J\, W\, \boldsymbol{\Sigma}\, W^{\top} J^{\top} \]
Finally, the pixel color \(\mathbf{C}\) is computed by alpha-blending these depth-sorted 2D Gaussians:

\[ \mathbf{C} = \sum_{i=1}^{N} \mathbf{c}_i\, \alpha_i \prod_{j=1}^{i-1} \big(1-\alpha_j\big) \]

where \(\mathbf{c}_i\) and \(\alpha_i\) are the color and effective opacity of the \(i\)-th Gaussian along the ray.
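As a minimal sketch of how that blending works for a single pixel, assuming the per-Gaussian opacities have already been multiplied by the evaluated 2D Gaussian weights and the list is depth-sorted:

```python
import numpy as np

def alpha_blend(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back compositing of depth-sorted Gaussians for one pixel.

    colors : (N, 3) colors of the Gaussians covering the pixel, near to far.
    alphas : (N,)  effective opacities (opacity x evaluated 2D Gaussian weight).
    Implements C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j).
    """
    pixel = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * c
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:   # early termination, common in 3DGS rasterizers
            break
    return pixel
```

The running transmittance is exactly the \(\prod_{j<i}(1-\alpha_j)\) factor in the formula above.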
The Core Method: USP-Gaussian
The goal of USP-Gaussian is to take a set of spike streams with rough, inaccurate camera poses and output refined poses, a sharp 3D model, and high-quality reconstructed images.
The overall pipeline is visualized below. It might look complex, but we will break it down into three manageable parts: the Reconstruction Branch, the 3DGS Branch, and the Joint Optimization.

The mathematical goal is to jointly optimize the camera poses (\(P\)), the Gaussian primitives (\(G\)), and the reconstruction network parameters (\(\theta\)) over the input spike streams (\(S\)), rather than solving for each quantity in isolation.
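The paper’s exact formulation isn’t reproduced here, but schematically the joint problem amounts to minimizing one total objective over all the unknowns at once, with the spike streams held fixed as measured data:

\[ P^{*},\, G^{*},\, \theta^{*} \;=\; \arg\min_{P,\, G,\, \theta}\; \mathcal{L}_{\text{total}}\big(P, G, \theta;\, S\big) \]

The individual terms of \(\mathcal{L}_{\text{total}}\) are unpacked over the next three parts.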

Part 1: The Recon-Net Branch
The left side of the pipeline (in Figure 3) handles Spike-based Image Reconstruction. The challenge here is that a short window of spikes might not have enough information (few photons), but a long window introduces motion blur.
The Solution: Complementary Long-Short Input
The authors use a clever hybrid input, feeding the network two windows cut from the same spike stream (see the sketch after this list):
- A Long Spike Stream: Provides rich texture and photon statistics, but spans enough time that it is contaminated by motion.
- A Short Spike Stream: Extracted from the center of the long stream to pinpoint the exact moment in time.
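A minimal sketch of this windowing, assuming the spike stream is stored as a `(T, H, W)` binary array; the window lengths here are illustrative, not the paper’s settings:

```python
import numpy as np

def long_short_windows(spikes: np.ndarray, center: int,
                       long_len: int = 128, short_len: int = 16):
    """Cut complementary long and short windows around a target timestamp.

    spikes : (T, H, W) binary spike stream.
    """
    half_long, half_short = long_len // 2, short_len // 2
    long_win = spikes[center - half_long: center + half_long]     # texture-rich, motion-contaminated
    short_win = spikes[center - half_short: center + half_short]  # pins down the exact moment
    return long_win, short_win
```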
Self-Supervised Training via “Reblurring”
Since we don’t have “ground truth” sharp images for real-world high-speed data, how do we train this network? We use physics.
We know that if we sum up all the spikes over a long period \(T\), we get a “long-exposure” image (which will be blurry). Up to the sensor’s firing threshold, this long-exposure image \(\mathbf{E}(T)\) is simply the temporal average of the binary spike frames captured during the window:

\[ \mathbf{E}(T) \propto \frac{1}{N} \sum_{i=1}^{N} \mathbf{S}_i \]

where \(\mathbf{S}_1, \dots, \mathbf{S}_N\) are the \(N\) spike frames that fall inside \(T\).
The network tries to predict a sharp sequence of images. To check if it’s correct, we average (re-blur) the predicted sharp sequence and compare it to the ground-truth long-exposure image \(\mathbf{E}(T)\).
However, a standard reblur loss has a loophole: the network might just learn to output the blurry image every time. To prevent this, the authors introduce a Multi-reblur Loss. They chop the long interval into sub-intervals and enforce the reblur constraint on each sub-section individually.

This forces the network to maintain temporal consistency and prevents it from learning a trivial identity mapping.
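Here is a rough sketch of what such a multi-reblur constraint could look like in PyTorch. The sub-interval count and the L1 distance are illustrative choices, not the paper’s exact loss:

```python
import torch

def multi_reblur_loss(pred_sharp: torch.Tensor, spike_frames: torch.Tensor,
                      num_subintervals: int = 4) -> torch.Tensor:
    """Sketch of a multi-reblur constraint.

    pred_sharp   : (N, H, W) sharp frames predicted by Recon-Net.
    spike_frames : (N, H, W) binary spike frames over the same long window.
    The long window is split into sub-intervals; inside each one the
    time-averaged prediction must match the pseudo long-exposure image
    computed from the spikes, which rules out the trivial solution of
    always outputting the blurry average.
    """
    n = pred_sharp.shape[0]
    chunk = n // num_subintervals
    loss = pred_sharp.new_zeros(())
    for k in range(num_subintervals):
        sl = slice(k * chunk, (k + 1) * chunk)
        reblurred = pred_sharp[sl].mean(dim=0)                 # re-blur the prediction
        pseudo_gt = spike_frames[sl].float().mean(dim=0)       # long-exposure estimate from spikes
        loss = loss + torch.abs(reblurred - pseudo_gt).mean()  # L1 distance (illustrative choice)
    return loss / num_subintervals
```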
Part 2: The 3DGS Branch with Trajectory Modeling
The right side of the pipeline deals with 3D reconstruction. Standard 3DGS assumes the camera is static during a single frame. But with a spike camera, we are looking at a continuous stream where the camera is moving while capturing data.
Modeling Motion
The authors model the camera trajectory within a time interval \(\mathcal{T}\) by defining a Start Pose \(\mathbf{P}_{\text{start}}\) and an End Pose \(\mathbf{P}_{\text{end}}\). The pose at any specific timestamp \(t_m\) is found by interpolating between them linearly in the Lie algebra \(\mathfrak{se}(3)\):

\[ \mathbf{P}(t_m) = \mathbf{P}_{\text{start}} \cdot \exp\!\Big( \tfrac{t_m - t_{\text{start}}}{t_{\text{end}} - t_{\text{start}}} \cdot \log\big(\mathbf{P}_{\text{start}}^{-1}\, \mathbf{P}_{\text{end}}\big) \Big) \]
This allows the system to correct the camera trajectory during training. Just like the Recon-Net, the 3DGS branch is also supervised by a reblur loss—the rendered images from the Gaussians, when averaged, must match the long-exposure spike image.
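A simplified sketch of the trajectory model: rotations are interpolated geodesically on SO(3) and translations linearly, which approximates the full SE(3) log/exp interpolation in the formulation above. The function name and arguments are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def interpolate_pose(R_start, t_start, R_end, t_end, s: float):
    """Camera pose at fraction s in [0, 1] along the start-to-end segment.

    R_start, R_end : 3x3 rotation matrices; t_start, t_end : 3-vectors.
    """
    # Relative rotation in the Lie algebra (rotation vector), scaled by s.
    delta = (R.from_matrix(R_start).inv() * R.from_matrix(R_end)).as_rotvec()
    R_s = (R.from_matrix(R_start) * R.from_rotvec(s * delta)).as_matrix()
    t_s = (1.0 - s) * np.asarray(t_start) + s * np.asarray(t_end)
    return R_s, t_s
```

During training, the 3DGS branch renders frames at several interpolated timestamps inside the window and averages them to form its own re-blurred image for that comparison.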

Part 3: Joint Optimization (The Secret Sauce)
Here is where the magic happens. We now have two branches producing images of the same scene at the same time:
- Recon-Net produces sharp images from spikes.
- 3DGS renders sharp images from Gaussians.
The Joint Loss forces these two outputs to match. This creates a positive feedback loop:
- The 3DGS (which understands 3D geometry and multi-view consistency) prevents the Recon-Net from hallucinating artifacts.
- The Recon-Net (which understands the raw sensor data) helps the 3DGS learn fine textures that might be lost in the Gaussians.

The Reversal Problem
There is one subtle issue: motion ambiguity. Since the “reblur” loss just sums up frames, it doesn’t care whether the sequence plays forward or backward, so the optimized pose sequence can accidentally reverse time.
To fix this, the authors use a “flip-and-minimum” operation. They calculate the loss for both the normal sequence and the reversed sequence, and simply take the minimum.
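A rough PyTorch sketch of this idea, assuming both branches output aligned `(N, H, W)` frame stacks; the L1 distance is an illustrative choice rather than the paper’s exact joint loss:

```python
import torch

def joint_loss_flip_min(recon_frames: torch.Tensor,
                        splat_frames: torch.Tensor) -> torch.Tensor:
    """Sketch of a joint loss with the flip-and-minimum trick.

    recon_frames : (N, H, W) sharp frames from Recon-Net.
    splat_frames : (N, H, W) frames rendered by 3DGS along the trajectory.
    A time-reversed trajectory re-blurs to the same long-exposure image,
    so the loss is evaluated for both the forward and the flipped frame
    order and the smaller value is kept.
    """
    forward = torch.abs(recon_frames - splat_frames).mean()
    backward = torch.abs(recon_frames - torch.flip(splat_frames, dims=[0])).mean()
    return torch.minimum(forward, backward)
```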

The final loss function combines everything: the reconstruction loss, the Gaussian loss, and the joint loss.
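The paper’s exact weighting isn’t reproduced here, but the total objective has the following schematic form, with \(\lambda_1\) and \(\lambda_2\) standing in for the balancing weights:

\[ \mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{recon}} \;+\; \lambda_{1}\, \mathcal{L}_{\text{3DGS}} \;+\; \lambda_{2}\, \mathcal{L}_{\text{joint}} \]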

Experiments and Results
The researchers tested USP-Gaussian on both synthetic datasets (using Blender-rendered scenes) and real-world data (captured by shaking a spike camera vigorously).
1. Performance on Synthetic Data
The results on synthetic data were impressive. The table below compares USP-Gaussian against cascaded methods (like TFP-3DGS) and other spike-specific methods (SpikeNeRF, SpikeGS).

You can see that USP-Gaussian achieves the highest PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity) scores.
Visually, the difference is stark. In Figure 4 below, look at the text on the sign and the details of the railing. Previous methods blur these features or introduce “floaters” (noise artifacts), while USP-Gaussian recovers clean geometry and texture.

2. Real-World Robustness
Real-world data is messy. The initial poses estimated by COLMAP are often terrible because the raw spike images are noisy.
In the comparison below, look at the “Input” column—it’s barely recognizable. The “Ours” column (USP-Gaussian) recovers fine details, such as the keys on the keyboard and the architectural features of the building, which other methods wash out.

3. Pose Correction
One of the biggest claims of the paper is that it can fix bad camera poses. To test this, they added random noise (perturbations) to the initial camera poses—up to 30% error.
As shown in the table below, even with 30% initial error, USP-Gaussian maintains a high PSNR (23.46 dB), whereas the competitor SpikeGS collapses to 16.44 dB.

The visual trajectory plot confirms this. The dashed red line (Initial) is far off the solid black line (Reference). The dotted blue line (Optimized) snaps back onto the correct path.

4. Why Joint Learning Matters (Ablation)
Is the complex joint architecture really necessary? The authors performed an ablation study, turning off different parts of the loss function.

Looking at the graph on the right of Figure 1 (and the table below), we see that independent training (blue stars/green circles) plateaus much lower than joint training (red triangles/black squares). The Recon-Net needs 3DGS for consistency, and 3DGS needs Recon-Net for texture.

They also validated the “Long-Short” input strategy. Without the long spike stream input, the reconstruction (Middle) is noisier compared to the full model (Right).

Conclusion
USP-Gaussian represents a significant step forward for high-speed 3D vision. By abandoning the traditional linear pipeline in favor of a unified, iterative loop, the authors successfully mitigated the “cascading error” problem.
Key Takeaways:
- Don’t trust the pipeline: Sequential steps (Image -> Pose -> 3D) propagate errors. Solving them jointly is far more robust.
- Mutual Benefit: 2D sensing and 3D geometry constraints can supervise each other.
- Physics-aware training: Using reblurring losses allows for self-supervised training on data that is impossible for humans to label manually.
While the method requires more GPU memory and training time than simpler approaches, the payoff in quality and robustness—especially in scenarios with rapid motion and jitter—is undeniable. This research paves the way for robots and drones that can understand their 3D environment with superhuman speed and precision.