Introduction
Imagine you are a mobile robot tasked with navigating a crowded university campus. To move safely, you need to know exactly how far away every object is—not just the ones directly in front of you, but the pedestrians approaching from the side, the pillars behind you, and the walls curving around you. You need 360-degree spatial awareness.
Historically, the gold standard for this kind of omnidirectional sensing has been LiDAR (Light Detection and Ranging). LiDAR is precise and naturally covers 360 degrees. However, it is expensive, bulky, and the resulting point clouds become sparse at a distance. As a result, researchers have turned to stereo vision: using synchronized cameras to estimate depth, much like human eyes do.
But here lies the problem: standard stereo vision datasets are limited. They usually cover only a narrow field of view (like the view from a car windshield) or rely on synthetic, computer-generated environments that don't capture the chaotic lighting and noise of the real world.
This article explores a significant step forward in robotic perception presented in the paper “HELVIPAD: A Real-World Dataset for Omnidirectional Stereo Depth Estimation.” The researchers introduce a comprehensive real-world dataset and a novel deep learning architecture designed to tackle the unique geometry of 360° images.

As shown in Figure 1 above, the dataset captures complex environments—from sunlit plazas to dimly lit corridors—providing the rich data needed to train next-generation robots.
Background: The Challenge of 360° Stereo
Why Omnidirectional?
Conventional cameras have a limited Field of View (FoV). To see everything around a robot, you would need multiple cameras and complex stitching algorithms. An omnidirectional (or 360°) camera, typically using fisheye lenses, captures the entire sphere of the environment in a single shot. When projected onto a 2D plane, this is often represented as an equirectangular image—think of a world map where the north and south poles are stretched out across the top and bottom.
The Problem with Existing Data
Deep learning models thrive on data. For stereo depth estimation, models need pairs of images and a corresponding “ground truth” depth map to learn from. While datasets like KITTI have revolutionized autonomous driving, they only look forward. Synthetic datasets exist for 360° vision, but they lack the artifacts, lighting flares, and sensor noise of the real world.

As illustrated in Table 1, HELVIPAD fills a critical gap. It is the first real-world stereo dataset for omnidirectional images that includes pixel-wise depth labels across diverse indoor and outdoor settings, including challenging night scenes.
The HELVIPAD Dataset
The core contribution of this research is the dataset itself. Creating a high-quality dataset for 360° stereo is an engineering feat involving hardware design, sensor synchronization, and complex geometric mapping.
1. The Hardware Setup
To capture the world in 3D, the researchers built a custom rig. The setup uses a top-bottom camera configuration rather than the traditional left-right setup.

As seen in Figure 8, the rig consists of:
- Two Ricoh Theta V cameras: Mounted vertically with a baseline (separation) of 19.1 cm. Vertical mounting is preferred for 360° robots because it prevents one camera from occluding the other’s view of the surroundings.
- Ouster OS1-64 LiDAR: Mounted below the cameras to capture accurate depth measurements.
- Nvidia Jetson Xavier: The brain handling synchronization and data capture.
2. Mapping LiDAR to 360° Images
The raw data consists of 360° images and 3D LiDAR point clouds. The challenge is “coloring” the pixels of the image with the depth values from the LiDAR.
Because the images are equirectangular (spherical projections), standard pinhole camera geometry doesn’t apply. The researchers had to map 3D points onto a sphere.
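Conceptually, this mapping converts each LiDAR point into spherical coordinates (range, polar angle, azimuth) and then turns those angles into pixel coordinates. The snippet below is a minimal sketch of that idea, assuming the points have already been transformed into the camera frame and that the image follows the common equirectangular convention (azimuth along the width, polar angle along the height); the paper's exact calibration and conventions may differ.

```python
import numpy as np

def project_to_equirectangular(points_xyz, height, width):
    """Project 3D points (already in the camera frame) onto an equirectangular image.

    points_xyz: (N, 3) array with z pointing up.
    Returns pixel coordinates (u, v) and the range r used as the depth label.
    """
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)        # distance from the camera center
    theta = np.arccos(z / r)               # polar angle in [0, pi], 0 = straight up
    phi = np.arctan2(y, x)                 # azimuth in (-pi, pi]

    # Equirectangular mapping: azimuth -> image column, polar angle -> image row.
    u = (phi + np.pi) / (2 * np.pi) * width
    v = theta / np.pi * height
    return u, v, r
```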

Figure 2 illustrates this geometry. A point \(P\) in the world is seen by both the top and bottom cameras. The “disparity” (the shift in the point’s position between the two views) is what allows us to calculate depth. However, in this spherical setup, we calculate spherical disparity—the angular difference between the rays pointing to the object.
The relationship between this angular disparity (\(d\)) and the depth (\(r_{bottom}\)) is governed by the following equation:

\[
d = \arctan\!\left(\frac{B_{camera}\,\sin\theta_b}{r_{bottom} - B_{camera}\,\cos\theta_b}\right)
\]

Here, \(\theta_b\) is the polar angle (the vertical angle at which the bottom camera sees the point), \(r_{bottom}\) is the point's distance from the bottom camera, and \(B_{camera}\) is the baseline between the two cameras. Note how the disparity depends on the cosine of the viewing angle: unlike standard rectified stereo, where disparity is simply inversely proportional to depth, the 360° geometry introduces an angle-dependent distortion.
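To make this relation concrete, the short sketch below converts between depth and angular disparity under this top-bottom spherical geometry, using the rig's 19.1 cm baseline. It is a minimal illustration of the formula above (and its inverse), not code released with the paper.

```python
import numpy as np

def depth_to_disparity(r_bottom, theta_b, baseline=0.191):
    """Angular disparity (radians) for a point at range r_bottom (meters),
    seen by the bottom camera at polar angle theta_b, with a vertical
    baseline in meters (19.1 cm for the HELVIPAD rig)."""
    return np.arctan2(baseline * np.sin(theta_b),
                      r_bottom - baseline * np.cos(theta_b))

def disparity_to_depth(d, theta_b, baseline=0.191):
    """Invert the relation: recover the range seen by the bottom camera
    from the angular disparity d (undefined for d == 0, i.e. infinite depth)."""
    return baseline * np.sin(theta_b + d) / np.sin(d)

# Example: a point 5 m away, seen at the horizon of the bottom camera.
d = depth_to_disparity(5.0, theta_b=np.pi / 2)
print(np.degrees(d), disparity_to_depth(d, theta_b=np.pi / 2))  # ~2.2 degrees, ~5.0 m
```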
3. Validating the Projection
Ensuring the LiDAR points land exactly on the correct pixels is crucial. If the alignment is off, the neural network learns incorrect associations. The researchers validated their calibration by manually selecting points in the image (like corners of buildings) and comparing them to the projected LiDAR points.

Figure 12 visualizes this validation. The red dots (projected LiDAR) are compared to green dots (actual image features), showing an average error of only about 8 pixels on high-resolution images, confirming the rig’s precision.
4. Depth Completion: Densifying the Data
LiDAR is accurate, but sparse. It produces a “cloud” of points, meaning most pixels in the camera image don’t have a corresponding depth value. If we only train on these sparse points, the model struggles to learn object boundaries.
To solve this, the researchers developed a Depth Completion Pipeline.
- Temporal Aggregation: Because the LiDAR spins at 10 Hz, they combine point clouds from neighboring frames (previous and future) to fill in gaps, assuming the robot moves slowly between scans.
- Interpolation: They estimate the depth of missing pixels from their nearest valid neighbors on the spherical grid (see the sketch after this list).
- Filtering: To avoid “bleeding” depth across edges (e.g., blurring a tree into the sky behind it), they discard points with high uncertainty or large local variance.
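The snippet below is a simplified, hypothetical sketch of the interpolation and filtering steps: it fills empty pixels of a sparse equirectangular depth map from their nearest valid neighbors, then rejects interpolated values where the local depth variation is large, which is exactly where depth would otherwise bleed across object boundaries. The actual pipeline also aggregates several LiDAR scans and works on the spherical grid; the 5×5 window and the variance threshold here are illustrative choices.

```python
import numpy as np
from scipy import ndimage

def densify_depth(sparse_depth, max_local_std=0.5):
    """Fill missing pixels of a sparse depth map (0 = no LiDAR measurement)."""
    valid = sparse_depth > 0

    # Nearest-neighbor interpolation: index of the closest valid pixel everywhere.
    _, (iy, ix) = ndimage.distance_transform_edt(~valid, return_indices=True)
    dense = sparse_depth[iy, ix]

    # Local depth variability; large values indicate an object boundary.
    local_mean = ndimage.uniform_filter(dense, size=5)
    local_sq_mean = ndimage.uniform_filter(dense ** 2, size=5)
    local_std = np.sqrt(np.maximum(local_sq_mean - local_mean ** 2, 0.0))

    # Keep real measurements; drop interpolated values near discontinuities.
    dense[(~valid) & (local_std > max_local_std)] = 0.0
    return dense
```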

The result, shown in Figure 17, is dramatic. The middle panel shows the raw LiDAR data—mostly empty space. The left panel shows the completed depth map, providing a dense, rich signal for training, while respecting object boundaries visible in the RGB image (right panel).
Adapting Stereo Matching for 360° Imaging
Standard stereo matching algorithms, like the popular IGEV-Stereo, are designed for rectilinear (flat) images. If you feed them equirectangular images, they fail because they don’t account for the severe distortion near the poles (top and bottom of the image) or the fact that the left edge of the image wraps around to the right edge.
The researchers propose 360-IGEV-Stereo, an adaptation of the state-of-the-art IGEV model specifically for this dataset.

Key Adaptation 1: Polar Angle Input
In a standard image, a pixel is just a pixel. In an equirectangular image, a pixel at the top represents a much smaller physical area than a pixel at the equator. The distortion is a function of the polar angle (vertical position).
To help the network understand this geometry, the researchers feed a Polar Angle Map as an extra input channel (seen in blue/green in Figure 4). This explicit geometric cue allows the network to adjust its matching filters based on how “stretched” the image is at that vertical level.
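A minimal sketch of this idea in PyTorch is shown below, assuming an equirectangular input whose rows span polar angles from 0 (top of the sphere) to π (bottom); the exact way the angle map is normalized and injected into the network may differ from the authors' implementation.

```python
import math
import torch

def add_polar_angle_channel(image: torch.Tensor) -> torch.Tensor:
    """Append a polar-angle map as an extra channel to a (B, C, H, W) image tensor."""
    b, _, h, w = image.shape
    # One polar angle per image row, broadcast across all columns and batch items.
    theta = torch.linspace(0, math.pi, h, device=image.device)
    theta_map = theta.view(1, 1, h, 1).expand(b, 1, h, w)
    return torch.cat([image, theta_map], dim=1)
```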
Key Adaptation 2: Circular Padding
Convolutional Neural Networks (CNNs) process images in patches. When the filter hits the right edge of a standard image, it typically pads the area with zeros (black pixels).
However, in a 360° image, the “right edge” is actually connected to the “left edge.” The researchers implemented Circular Padding. When the network processes the right edge, it “sees” pixels from the left edge, and vice versa. This ensures continuous depth estimation across the entire 360° view, eliminating seams.
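In PyTorch, this behavior can be sketched with F.pad's circular mode, which wraps the horizontal borders while the top and bottom are still zero-padded (the poles do not wrap). This is an illustration of the idea rather than the model's actual code.

```python
import torch
import torch.nn.functional as F

def circular_pad_width(x: torch.Tensor, pad: int) -> torch.Tensor:
    """Pad a (B, C, H, W) feature map so convolutions wrap around horizontally."""
    x = F.pad(x, (pad, pad, 0, 0), mode="circular")              # left/right wrap around
    x = F.pad(x, (0, 0, pad, pad), mode="constant", value=0.0)   # top/bottom zero padding
    return x

# A 3x3 convolution applied after this padding (with padding=0) produces
# features that are seamless across the 0°/360° boundary.
conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=0)
features = torch.randn(1, 64, 128, 256)
out = conv(circular_pad_width(features, pad=1))  # same spatial size as the input
```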
Experiments and Results
The researchers benchmarked their new model against standard baselines (like PSMNet and the original IGEV) and existing omnidirectional models (360SD-Net).
Quantitative Performance
The results confirm that standard models struggle with 360° data and that the proposed adaptations are highly effective.

In Table 2, 360-IGEV-Stereo achieves the lowest error rates across almost all metrics.
- MAE (Mean Absolute Error): Measures the average error in disparity. The proposed model drops this from 0.225 (IGEV-Stereo) to 0.188.
- LRCE (Left-Right Consistency Error): This metric is crucial for 360° images. It measures how much the prediction at the image's left edge disagrees with the prediction at its right edge, which correspond to adjacent viewing directions. Thanks to circular padding, 360-IGEV-Stereo reduces this error drastically (from 1.203 to 0.388), showing that the model respects the continuous nature of the scene (a minimal sketch of this edge check follows the list).
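As a rough illustration of what such a consistency check looks like, the snippet below compares the first and last columns of a predicted disparity map; the metric as defined in the paper may include additional normalization or comparisons against ground truth, so treat this as a sketch of the idea only.

```python
import torch

def edge_consistency_error(pred_disp: torch.Tensor) -> torch.Tensor:
    """Mean absolute difference between the left-most and right-most columns
    of a (B, H, W) equirectangular disparity prediction. A seamless 360°
    prediction should make these adjacent columns nearly agree."""
    return (pred_disp[..., 0] - pred_disp[..., -1]).abs().mean()
```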
Qualitative Analysis
Numbers are one thing, but visual disparity maps tell the real story.

Figure 7 compares the outputs. Look closely at the rightmost column (a night scene). The standard IGEV-Stereo (second row from bottom) misses the pedestrians entirely or blurs them out. The 360-IGEV-Stereo (bottom row) captures the pedestrians with much sharper definition. Similarly, in the indoor scene (left column), the 360 model handles the pillars and walls with smoother gradients and fewer artifacts.
Cross-Scene Generalization
One of the hardest challenges in robotics is generalization—training a robot in a hallway and expecting it to work in a parking lot at night.

Figure 6 shows a breakdown of performance by scene type.
- Blue bars: Trained only indoors.
- Orange bars: Trained only outdoors.
- Green bars: Trained on everything.
Unsurprisingly, models trained on “All” data perform best. However, an interesting finding is that omnidirectional models (like 360SD-Net and 360-IGEV-Stereo) generalize better than standard ones. The inherent global context of 360° vision seems to help the model learn more robust features that transfer between indoor and outdoor environments.
Conclusion
The HELVIPAD dataset represents a foundational resource for the robotics and computer vision communities. By providing high-quality, dense, real-world data, it enables the training of models that can actually function in dynamic human environments.
Furthermore, the 360-IGEV-Stereo architecture demonstrates that we don’t need to reinvent the wheel for 360° vision. By making targeted geometric adaptations—specifically polar angle inputs and circular padding—we can leverage powerful modern stereo matching networks to handle the unique distortions of omnidirectional imagery.
As mobile robots become more common in our daily lives, from delivery bots to automated wheelchairs, technologies like these will be the eyes that ensure they navigate safely and intelligently.