Imagine trying to reconstruct a full 3D room from just two photographs. In computer vision, this task, known as “sparse-view reconstruction,” is a holy grail for Virtual Reality (VR) and Augmented Reality (AR). Recently, 3D Gaussian Splatting (3DGS) has revolutionized the field, offering real-time rendering speeds that older methods, such as NeRFs, struggled to achieve.
However, there is a catch. Most of these breakthroughs rely on “perspective” images—standard photos with a limited field of view. But if you want to capture a whole room quickly, you don’t take fifty narrow photos; you take one or two omnidirectional (360°) images.
Here lies the problem: standard AI models hate 360° images. The distortion, particularly at the poles (like looking at Antarctica on a flat map), breaks the geometric assumptions these models rely on.
Enter OmniSplat, a new framework that successfully tames 3D Gaussian Splatting for omnidirectional images. By rethinking how we represent spherical data, OmniSplat allows for the generation of high-fidelity 3D scenes from sparse 360° inputs in a single forward pass—no lengthy optimization required.

As shown above, OmniSplat (red diamonds) achieves a superior balance between reconstruction quality (PSNR) and speed compared to existing methods, making it a potential game-changer for immersive content creation.
The Problem with 360° Images in AI
To understand why OmniSplat is necessary, we must first look at the limitations of Feed-Forward 3DGS.
Traditional 3DGS requires training on a specific scene for minutes or hours (“optimization-based”). Feed-forward networks, however, are trained on massive datasets to “guess” the 3D structure instantly from new images (“generalizable”). Models like PixelSplat or MVSplat are excellent at this for standard photos.
But when you feed an omnidirectional image (usually an Equirectangular Projection, or ERP) into these networks, they fail. In an ERP, the regions near the top and bottom of the image are heavily stretched; a single point at the pole is smeared across the entire image width. A standard Convolutional Neural Network (CNN) slides a fixed-size window across the image. At the equator, this window sees normal-looking objects; at the poles, it sees stretched, distorted messes. The network misinterprets the context, resulting in distorted 3D Gaussians and artifacts in the final render.
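To get a feel for how severe this is, here is a quick back-of-the-envelope calculation (not from the paper): in an ERP, the horizontal stretch of a pixel grows roughly as one over the cosine of its latitude, so pixels a few rows from the pole are already stretched by an order of magnitude.

```python
import numpy as np

# Minimal sketch: horizontal stretch factor of an equirectangular projection (ERP).
# Each image row spans the full 360° of azimuth, but the circle of latitude it
# represents shrinks by cos(latitude), so pixels near the poles are stretched
# by roughly 1 / cos(latitude) compared to pixels at the equator.

height = 512                                        # ERP image height (rows = latitudes)
rows = np.arange(height)
latitude = (0.5 - (rows + 0.5) / height) * np.pi    # +pi/2 (north pole) .. -pi/2 (south pole)

stretch = 1.0 / np.maximum(np.cos(latitude), 1e-6)

print(f"equator row stretch: {stretch[height // 2]:.2f}x")   # ~1x
print(f"row 10 (near pole) stretch: {stretch[10]:.2f}x")     # roughly 15x wider
```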
The OmniSplat Solution: The Yin-Yang Grid
The researchers behind OmniSplat didn’t try to force a square peg into a round hole. Instead, they changed the “hole.” They adopted a spherical coordinate system known as the Yin-Yang grid.
Think of a tennis ball. It is covered by two identical, interlocking felt patches. The Yin-Yang grid works similarly, dividing the sphere into two overlapping grids: the Yin (North) grid and the Yang (East) grid.

The architecture, illustrated above, follows a distinct pipeline:
- Decomposition: The input omnidirectional images are split into Yin and Yang grids.
- Feature Extraction: A pre-trained encoder processes these grids.
- Cross-View Attention: Features are warped and matched to estimate depth.
- Rasterization: The scene is rendered back onto Yin-Yang grids and stitched into a final 360° view.
Why Yin-Yang?
The mathematical definition of the Yin grid covers a specific range of elevation (\(\theta\)) and azimuth (\(\phi\)):

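For reference, in the original Kageyama–Sato formulation of the Yin-Yang grid (which OmniSplat builds on), the Yin grid is the band around the equator, with \(\theta\) measured as elevation from the equator and a small overlap margin \(\delta\); the paper's exact convention and margins may differ:

\[
-\left(\frac{\pi}{4} + \delta\right) \le \theta \le \frac{\pi}{4} + \delta,
\qquad
-\left(\frac{3\pi}{4} + \delta\right) \le \phi \le \frac{3\pi}{4} + \delta
\]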
The Yang grid covers the rest of the sphere and is essentially the Yin grid rotated by 90 degrees. The transformation matrix \(M\) converts coordinates between the two:

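Again for reference, the standard Yin-to-Yang mapping in Cartesian coordinates is a fixed rotation (the paper's exact \(M\) may be written differently); applying it twice returns the original point, so the same matrix converts in both directions:

\[
\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix}
= M \begin{pmatrix} x \\ y \\ z \end{pmatrix},
\qquad
M =
\begin{pmatrix}
-1 & 0 & 0 \\
 0 & 0 & 1 \\
 0 & 1 & 0
\end{pmatrix},
\qquad
M^2 = I
\]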
The genius of this approach is quasi-uniformity. Unlike an Equirectangular Projection, where pixels near the poles represent tiny slivers of space compared to pixels at the equator, the Yin-Yang grid maintains a fairly consistent pixel density across the sphere. This means a CNN sees roughly the same object shape regardless of where it appears on the grid. Consequently, OmniSplat can utilize powerful, pre-trained feature extractors (originally trained on perspective images) without them getting confused by spherical distortion.
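To make the decomposition concrete, here is a minimal sketch (not the authors' code) of how one might resample an ERP image onto the Yang grid: lay out a Yin-style lattice, map its directions through \(M\), and look up the corresponding ERP pixels. Function names, grid sizes, and the nearest-neighbor lookup are all illustrative choices.

```python
import numpy as np

# Hypothetical helper: sample the Yang grid from an ERP image.
M = np.array([[-1.0, 0.0, 0.0],
              [ 0.0, 0.0, 1.0],
              [ 0.0, 1.0, 0.0]])

def yang_from_erp(erp, grid_h=256, grid_w=768, margin=0.0):
    """erp: (H, W, 3) equirectangular image; returns a (grid_h, grid_w, 3) Yang patch."""
    H, W = erp.shape[:2]
    # Yin-style lattice: elevation in [-pi/4, pi/4], azimuth in [-3pi/4, 3pi/4] (+margin).
    elev = np.linspace(-np.pi/4 - margin, np.pi/4 + margin, grid_h)
    azim = np.linspace(-3*np.pi/4 - margin, 3*np.pi/4 + margin, grid_w)
    el, az = np.meshgrid(elev, azim, indexing="ij")

    # Unit directions in the Yang grid's local frame, mapped back to world/ERP
    # coordinates via M (M is its own inverse).
    dirs = np.stack([np.cos(el)*np.cos(az), np.cos(el)*np.sin(az), np.sin(el)], -1)
    dirs = dirs @ M.T

    # Back to ERP pixel coordinates (nearest-neighbor lookup).
    lon = np.arctan2(dirs[..., 1], dirs[..., 0])          # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 2], -1.0, 1.0))     # [-pi/2, pi/2]
    u = ((lon / (2*np.pi) + 0.5) * W).astype(int) % W
    v = ((0.5 - lat / np.pi) * H).astype(int).clip(0, H - 1)
    return erp[v, u]
```

The Yin patch is even simpler: it is just the central band of the ERP image, cropped directly.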
Constructing the Scene: Cross-View Attention
Once the images are decomposed into Yin and Yang grids, the model needs to understand the 3D geometry. It does this by comparing two different reference views (e.g., two 360° photos taken a few steps apart).
OmniSplat uses Cross-View Attention. It takes the features from one view (source) and “warps” them to the perspective of the other view (target) based on hypothetical depth levels. This is often called “plane sweeping” in computer vision.
However, simple warping isn’t enough because a point visible in the Yin grid of Camera A might appear in the Yang grid of Camera B. The system must warp features across both grids.

In the equation above, \(\mathcal{W}\) represents the warping function using camera poses \(\mathbf{P}\). The model computes a mask \(\mathbf{M}\) to handle valid overlaps. The features from the Yin and Yang grids are then combined based on these masks to ensure no information is lost during the transition between views:

Once the features are aligned, the model calculates the correlation (similarity) between the views. High correlation at a specific depth hypothesis indicates that a surface actually exists at that distance.

This “Cost Volume” \(\mathbf{C}\) serves as the foundation for predicting the parameters of the 3D Gaussians: their position, opacity, color (spherical harmonics), and covariance (shape).
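Here is a hedged sketch of what a plane-sweep cost volume looks like in code. The `warp_features` helper is a hypothetical stand-in for the pose-based warping \(\mathcal{W}\) and masking described above; the paper's actual attention-based formulation across Yin and Yang grids is more involved.

```python
import torch
import torch.nn.functional as F

def build_cost_volume(feat_tgt, feat_src, pose_tgt, pose_src, depth_hyps, warp_features):
    """Sketch of a plane-sweep cost volume.

    feat_*: (B, C, H, W) feature maps; depth_hyps: list of D candidate depths.
    warp_features is a hypothetical function that warps source features into the
    target view at a given depth and returns (warped, validity_mask).
    Returns a cost volume of shape (B, D, H, W).
    """
    costs = []
    for d in depth_hyps:
        warped, mask = warp_features(feat_src, pose_src, pose_tgt, d)  # (B,C,H,W), (B,1,H,W)
        # Correlation = normalized dot product between target and warped source features.
        corr = (F.normalize(feat_tgt, dim=1) * F.normalize(warped, dim=1)).sum(dim=1, keepdim=True)
        costs.append(corr * mask)          # ignore pixels with no valid correspondence
    return torch.cat(costs, dim=1)         # (B, D, H, W)
```

A high value at depth hypothesis `d` for a given pixel is the "surface probably lives here" signal that feeds the Gaussian parameter heads.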
The Yin-Yang Rasterizer
Predicting the Gaussians is only half the battle. You also need to render them into an image. Standard omnidirectional rasterizers project 3D Gaussians directly onto a sphere. However, because the sampling density of an equirectangular image varies so wildly across the sphere (dense at the poles, sparse at the equator), direct rendering often results in stripe-like artifacts or holes when using feed-forward predictions.
OmniSplat introduces Yin-Yang Rasterization. Instead of rendering the full 360° image at once, it renders two separate perspective-like images: one for the Yin grid and one for the Yang grid.

Here, \(\hat{V}\) is the color map and \(\hat{A}\) is the alpha (opacity) map. By dividing the color by the alpha accumulation, the system normalizes the pixel values, eliminating artifacts caused by uneven Gaussian density. Finally, these two rendered grids are stitched together in pixel space to form the final high-quality omnidirectional image.
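As a rough illustration (assumptions: `render_gaussians` and `stitch` are hypothetical helpers, not the paper's API), the normalization and stitching step might look like this:

```python
import torch

def normalize_render(v_hat, a_hat, eps=1e-6):
    """Divide accumulated color V_hat by accumulated alpha A_hat to remove
    artifacts caused by uneven Gaussian density."""
    return v_hat / a_hat.clamp(min=eps)

def render_omnidirectional(gaussians, yin_cam, yang_cam, render_gaussians, stitch):
    # Render each grid separately, as two perspective-like images.
    v_yin, a_yin = render_gaussians(gaussians, yin_cam)
    v_yang, a_yang = render_gaussians(gaussians, yang_cam)
    img_yin = normalize_render(v_yin, a_yin)
    img_yang = normalize_render(v_yang, a_yang)
    # stitch() is a hypothetical helper that resamples both grids back onto a
    # single equirectangular canvas and blends the overlap region.
    return stitch(img_yin, img_yang)
```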
Experimental Results
Does this complex coordinate transformation actually pay off? The results suggest a resounding yes.
The researchers compared OmniSplat against:
- ODGS: An optimization-based method (slow, per-scene training).
- PixelSplat/MVSplat (Perspective): Existing models run on cube-map projections.
- PixelSplat/MVSplat (Omnidirectional): Existing models modified to run directly on 360° images.
Quantitative Analysis

In the table above, look at the OmniSplat and OmniSplat+opt rows.
- Speed: OmniSplat generates a scene in 0.532 seconds. The optimization-based ODGS takes 1920 seconds (32 minutes).
- Quality: OmniSplat consistently achieves higher Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) scores than the perspective or omnidirectional adaptations of previous models.
The “OmniSplat+opt” variant involves taking the feed-forward prediction and running a tiny bit of optimization (just 100 iterations) on it. This yields state-of-the-art performance while still being drastically faster than full optimization methods.
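Conceptually, that refinement stage is just a short photometric optimization loop over the predicted Gaussians. The sketch below is an illustrative approximation; `render` and the parameter layout (a dict of tensors) are hypothetical placeholders, and the paper's actual loss and schedule may differ.

```python
import torch
import torch.nn.functional as F

def refine(gaussians, views, render, iters=100, lr=1e-3):
    """Hedged sketch of a brief "+opt"-style refinement of feed-forward Gaussians."""
    # Turn the predicted parameters into leaf tensors we can optimize directly.
    params = {name: p.detach().clone().requires_grad_(True) for name, p in gaussians.items()}
    opt = torch.optim.Adam(params.values(), lr=lr)
    for _ in range(iters):
        for image_gt, camera in views:
            pred = render(params, camera)           # hypothetical rasterizer call
            loss = F.l1_loss(pred, image_gt)        # simple photometric loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return params
```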
Visual Comparison
The numbers look good, but visual inspection reveals the true difference.

In Figure 3 (above), look closely at the insets:
- ODGS (b) often produces blurred textures.
- PixelSplat (c) shows artifacts and geometric inconsistencies.
- MVSplat (d) suffers from severe darkening and stripe artifacts due to the sampling issues mentioned earlier.
- OmniSplat (e) produces sharp, clean geometry that closely matches the Ground Truth (a).
The Importance of Yin-Yang
Is the Yin-Yang grid really doing the heavy lifting? The researchers performed an ablation study to find out.

Table 2 clearly shows that using the Yin-Yang grid for both the encoder (attention) and the rasterizer produces the best results (bottom row). Mixing and matching (e.g., using a standard omnidirectional encoder with a Yin-Yang rasterizer) leads to a significant drop in quality, proving that the holistic use of this coordinate system is key.
Beyond Reconstruction: Segmentation and Editing
One of the most exciting implications of OmniSplat is its semantic capability. Because the model understands the correspondence between pixels in different views (via the attention maps), it can propagate segmentation labels across the 3D scene.

If a user selects an object in one view (marked by stars in the image above), OmniSplat can automatically identify that same object in a novel view. This is distinct from standard 2D tracking because it relies on the reconstructed 3D geometry.
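A minimal sketch of this idea, assuming the cross-view attention weights are available as a dense matrix (the names and shapes here are illustrative, not the paper's implementation):

```python
import torch

def propagate_mask(attn, source_mask):
    """Propagate a 2D mask to a novel view via dense correspondences.

    attn: (Nt, Ns) attention weights from each target pixel to each source pixel.
    source_mask: (Ns,) boolean mask selected in the source view.
    Returns a (Nt,) boolean mask: each target pixel inherits the label of its
    most-attended source pixel.
    """
    best_src = attn.argmax(dim=1)
    return source_mask[best_src]
```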

As shown in Table 3, OmniSplat achieves matching accuracy comparable to or better than dedicated video segmentation trackers like DEVA, but it does so as a byproduct of the reconstruction process.
This allows for 3D Editing. You can select a chair in the 3D scene and delete it. Because OmniSplat generates pixel-aligned Gaussians, the removal is clean.

In Figure C, compare the removal quality. Optimization-based Gaussians (a) often leave behind “needle-like” artifacts because the Gaussians are stretched and overlapping in complex ways. OmniSplat’s pixel-aligned Gaussians (b) result in a clean cut, making it much easier to edit the scene or inpaint the empty space.
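Because the Gaussians are pixel-aligned, object removal can be as simple as boolean indexing with the propagated 2D mask. A hypothetical sketch, assuming the Gaussians are stored as a dict of per-pixel tensors:

```python
import torch

def remove_object(gaussians, keep_mask):
    """gaussians: dict of (N, ...) tensors, one Gaussian per input pixel.
    keep_mask: (N,) bool, False for pixels belonging to the object to delete."""
    return {name: tensor[keep_mask] for name, tensor in gaussians.items()}
```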
Conclusion
OmniSplat represents a significant step forward in 3D computer vision. It tackles the inherent difficulties of omnidirectional imagery—distortion and non-uniformity—not by training massive new models from scratch, but by cleverly remapping the data into the Yin-Yang grid.
This approach allows us to leverage the power of existing feed-forward networks, resulting in a system that is:
- Fast: Generating scenes in sub-second times.
- Accurate: Outperforming both optimization-based and adapted perspective methods.
- Editable: Providing robust segmentation for 3D scene manipulation.
For students and researchers in VR/AR, OmniSplat demonstrates that sometimes the solution isn’t a bigger neural network, but a better geometric representation of the data.