For an autonomous vehicle to navigate our chaotic world, it needs more than just GPS and rules—it must see and understand its surroundings in rich, 3D detail. Beyond detecting cars and pedestrians, it should recognize the space they occupy, the terrain’s contours, the location of sidewalks, and the canopy of trees overhead. This is the essence of 3D Semantic Occupancy Prediction: building a complete, labeled 3D map of the environment.
Traditionally, LiDAR has been the go-to technology for this task. LiDAR sensors emit laser beams to directly capture 3D point clouds of the surroundings. However, LiDAR can be expensive, and its data might be sparse, especially at greater distances or in occluded areas. Cameras, on the other hand, are cheap, ubiquitous, and capture rich texture and color information that LiDAR misses. The challenge? A 2D image is a flattened slice of 3D reality—recovering the missing dimension reliably is a hard problem.
The recent paper “Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction” proposes a novel way to overcome this challenge. The authors introduce a new 3D representation called the Tri-Perspective View (TPV) and an efficient transformer-based model, TPVFormer, to construct it. The results are striking: using only camera images, TPVFormer can generate dense and accurate 3D semantic maps—sometimes more comprehensive than the sparse LiDAR data used in training.
Figure 1: Vision-based occupancy prediction pipeline. TPVFormer takes RGB inputs and predicts semantic occupancy for all voxels using sparse LiDAR supervision.
In this post, we’ll explore the motivation behind TPV, break down its design, unpack the TPVFormer architecture, and examine the experimental results suggesting vision may be powerful enough to rival LiDAR for full-scene understanding.
Representing 3D Space: The Existing Approaches
To appreciate the novelty of TPV, let’s first look at two common scene representation methods used in autonomous driving: Voxel Grids and Bird’s-Eye-View (BEV) maps.
Voxel Grids: The 3D Pixel Approach
The most straightforward way to represent a 3D scene is to split it into uniform cubes—voxels. Each voxel stores a feature vector describing its contents: empty space, part of a car, a tree, road, etc.
Voxel grids are expressive and preserve full 3D detail. The feature for a point \((x, y, z)\) is the feature of the voxel containing that point:
\[ \mathbf{f}_{x,y,z} = \mathbf{v}_{h,w,d} = \mathcal{S}(\mathbf{V}, \mathcal{P}_{vox}(x, y, z)) \]
However, voxel grids are computationally expensive. A 100m × 100m × 8m scene at 20cm resolution spans \(500 \times 500 \times 40 = 10{,}000{,}000\) voxels. The cubic complexity, \(O(HWD)\), makes high-resolution voxel grids impractical for real-time autonomous driving.
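To make that concrete, here is a quick back-of-the-envelope calculation for the grid above; the 64-channel feature width is an illustrative assumption, not a number from the paper.

```python
# Back-of-the-envelope memory cost of a dense voxel feature grid.
# Scene: 100 m x 100 m x 8 m at 0.2 m resolution; C is an assumed feature width.
H, W, D = 500, 500, 40        # grid cells along x, y, z
C = 64                         # feature channels per voxel (illustrative)
bytes_per_float = 4            # float32

n_voxels = H * W * D
memory_gb = n_voxels * C * bytes_per_float / 1e9
print(f"{n_voxels:,} voxels, ~{memory_gb:.1f} GB of float32 features")
# -> 10,000,000 voxels, ~2.6 GB of float32 features
```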
Bird’s-Eye-View (BEV): Flattening the World
Given that most driving-relevant objects are near ground level, many systems collapse the height dimension and work with BEV: a top-down grid of the scene.
In BEV, each 2D cell represents a vertical “pillar” covering all heights. The point \((x,y,z)\) maps to BEV feature \(\mathbf{b}_{h,w}\):
\[ \mathbf{f}_{x,y,z} = \mathbf{b}_{h,w} = \mathcal{S}(\mathbf{B}, \mathcal{P}_{bev}(x, y)) \]
BEV reduces complexity to \(O(HW)\) and works well for tasks like detecting ground-level bounding boxes. But it discards height information—a tree branch and the ground beneath it end up in the same cell—limiting fine-grained understanding.
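A minimal sketch of the BEV lookup \(\mathcal{P}_{bev}\), assuming a 100m × 100m range at 20cm cells; note that the z coordinate never enters the indexing, which is exactly where the information loss comes from.

```python
import numpy as np

def bev_feature(B, x, y, x_min=-50.0, y_min=-50.0, cell=0.2):
    """Nearest-cell BEV lookup: map metric (x, y) to grid indices (h, w).

    B: (H, W, C) BEV feature map. The range, cell size, and the convention
    that rows index x and columns index y are assumptions of this sketch;
    points are assumed to lie inside the range.
    """
    h = int((x - x_min) / cell)   # row index along x
    w = int((y - y_min) / cell)   # column index along y
    return B[h, w]                # every z along this pillar gets the same feature

B = np.random.randn(500, 500, 64).astype(np.float32)
f = bev_feature(B, x=3.7, y=-12.4)   # shape (64,)
```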
The question then arises: can we retain voxel-level expressiveness while keeping BEV’s efficiency?
Figure 3: Voxel grids capture full detail but are expensive; BEV is efficient but loses height context. TPV introduces three orthogonal planes for a balanced representation.
Tri-Perspective View (TPV): Three Windows into the Scene
The authors’ solution is to represent the 3D scene using three orthogonal feature planes:
- Top-down (HW-plane) — like a BEV map.
- Side-view (DH-plane) — looking from the driver’s side.
- Front-view (WD-plane) — looking from the car’s front.
Formally:
\[ \mathbf{T} = [\mathbf{T}^{HW}, \mathbf{T}^{DH}, \mathbf{T}^{WD}] \]
where:
- \(\mathbf{T}^{HW} \in \mathbb{R}^{H \times W \times C}\)
- \(\mathbf{T}^{DH} \in \mathbb{R}^{D \times H \times C}\)
- \(\mathbf{T}^{WD} \in \mathbb{R}^{W \times D \times C}\)
The storage complexity becomes \(O(HW + DH + WD)\): quadratic in the grid resolution rather than cubic, the same order as BEV.
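For the same 500 × 500 × 40 grid, the three representations store very different numbers of feature vectors:

```python
# Number of stored feature vectors for a 500 x 500 x 40 scene grid.
H, W, D = 500, 500, 40
voxel = H * W * D                # 10,000,000 cells
bev = H * W                      #    250,000 cells
tpv = H * W + D * H + W * D      #    290,000 cells
print(voxel, bev, tpv)           # TPV costs ~16% more than BEV, ~34x less than voxels
```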
To get the feature for a point \((x,y,z)\):
- Project: Map the point onto each plane, giving \((h,w)\), \((d,h)\), and \((w,d)\) coordinates.
- Sample: Use bilinear interpolation to sample features \(\mathbf{t}_{h,w}\), \(\mathbf{t}_{d,h}\), \(\mathbf{t}_{w,d}\).
- Aggregate: Sum them: \[ \mathbf{f}_{x,y,z} = \mathcal{A}(\mathbf{t}_{h,w}, \mathbf{t}_{d,h}, \mathbf{t}_{w,d}) \]
In this way, TPV assigns a distinct feature to each 3D point, recovering voxel-like expressiveness at a fraction of the cost.
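Below is a minimal PyTorch sketch of this project-sample-aggregate step, using grid_sample for bilinear interpolation. The bilinear sampling and sum aggregation follow the description above; the function name, the axis conventions (which metric axis maps to H, W, and D), and the point-cloud range are my assumptions.

```python
import torch
import torch.nn.functional as F

def sample_tpv(tpv_hw, tpv_dh, tpv_wd, pts, pc_range):
    """Bilinearly sample TPV features for 3D points and sum them.

    tpv_hw: (1, C, H, W), tpv_dh: (1, C, D, H), tpv_wd: (1, C, W, D)
    pts:    (N, 3) metric (x, y, z) points
    pc_range: (x_min, y_min, z_min, x_max, y_max, z_max)
    """
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    # Normalize each metric coordinate to [-1, 1] for grid_sample.
    xn = 2 * (pts[:, 0] - x_min) / (x_max - x_min) - 1
    yn = 2 * (pts[:, 1] - y_min) / (y_max - y_min) - 1
    zn = 2 * (pts[:, 2] - z_min) / (z_max - z_min) - 1

    def sample(plane, u, v):
        # grid_sample convention: grid[..., 0] indexes width, grid[..., 1] height.
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)
        out = F.grid_sample(plane, grid, mode="bilinear", align_corners=False)
        return out.reshape(plane.shape[1], -1).t()       # (N, C)

    f_hw = sample(tpv_hw, yn, xn)   # top-down plane: H indexes x, W indexes y (assumed)
    f_dh = sample(tpv_dh, xn, zn)   # side plane:     D indexes z, H indexes x (assumed)
    f_wd = sample(tpv_wd, zn, yn)   # front plane:    W indexes y, D indexes z (assumed)
    return f_hw + f_dh + f_wd       # sum aggregation: one feature per point

# Illustrative usage with random planes (C=64, H=W=100, D=8):
C, H, W, D = 64, 100, 100, 8
planes = (torch.randn(1, C, H, W), torch.randn(1, C, D, H), torch.randn(1, C, W, D))
pts = torch.tensor([[3.7, -12.4, 1.0]])
feat = sample_tpv(*planes, pts, pc_range=(-50.0, -50.0, -4.0, 50.0, 50.0, 4.0))  # (1, 64)
```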
TPVFormer: Lifting Vision into TPV Space
TPV is powerful, but how do we populate those planes with features from 2D camera images? Enter TPVFormer, a transformer-based encoder designed for exactly this task.
Figure 4: TPVFormer architecture: multi-camera backbone, Image Cross-Attention to lift 2D features, Cross-View Hybrid Attention to fuse planes, and a prediction head.
TPV Queries
TPVFormer assigns a learnable query vector to every grid cell on each of the three planes. Each query acts as a placeholder for the features of the region of space its cell covers.
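In code, this amounts to nothing more than three tables of learnable embeddings; a minimal sketch (plane sizes and feature width are illustrative):

```python
import torch
import torch.nn as nn

class TPVQueries(nn.Module):
    """One learnable query vector per cell of each TPV plane (sizes are illustrative)."""
    def __init__(self, H=100, W=100, D=8, C=256):
        super().__init__()
        self.t_hw = nn.Parameter(torch.zeros(H * W, C))  # top-down queries
        self.t_dh = nn.Parameter(torch.zeros(D * H, C))  # side-view queries
        self.t_wd = nn.Parameter(torch.zeros(W * D, C))  # front-view queries
        for p in (self.t_hw, self.t_dh, self.t_wd):
            nn.init.normal_(p, std=0.02)
```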
Image Cross-Attention (ICA): Lifting 2D to 3D
For each TPV query:
- Identify its real-world pillar region.
- Sample reference points along this pillar (in 3D).
- Project those points into each surround camera’s image using calibration.
- Use deformable attention to efficiently sample nearby image features and aggregate them into the query.
Mathematically:
\[ \operatorname{ICA}(\mathbf{t}_{h,w}, \mathbf{I}) = \frac{1}{|N_{h,w}^{val}|} \sum_{j \in N_{h,w}^{val}} \operatorname{DA}(\mathbf{t}_{h,w}, \mathbf{Ref}_{h,w}^{pix,j}, \mathbf{I}_j) \]
Here \(\operatorname{DA}\) is deformable attention, and the result is averaged over the valid cameras \(N_{h,w}^{val}\), i.e., the cameras whose field of view the reference points actually fall into.
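A sketch of the projection step that decides which cameras contribute to a query: sample points along the pillar, project them with each camera's matrix, and keep the cameras they land in. The lidar2img matrices and image size are assumed inputs, and the deformable-attention step itself is omitted.

```python
import torch

def project_to_cameras(ref_pts, lidar2img, img_h, img_w):
    """Project 3D reference points into each camera and mark valid hits.

    ref_pts:   (P, 3) points sampled along one pillar, in the ego/LiDAR frame
    lidar2img: (Ncam, 4, 4) projection matrices (intrinsics @ extrinsics)
    Returns pixel coords (Ncam, P, 2) and a validity mask (Ncam, P).
    """
    P = ref_pts.shape[0]
    homo = torch.cat([ref_pts, ref_pts.new_ones(P, 1)], dim=-1)   # (P, 4) homogeneous
    cam = torch.einsum("nij,pj->npi", lidar2img, homo)            # (Ncam, P, 4)
    depth = cam[..., 2].clamp(min=1e-5)
    pix = cam[..., :2] / depth.unsqueeze(-1)                       # (Ncam, P, 2) pixels
    valid = (
        (cam[..., 2] > 0)                                          # in front of camera
        & (pix[..., 0] >= 0) & (pix[..., 0] < img_w)               # inside image width
        & (pix[..., 1] >= 0) & (pix[..., 1] < img_h)               # inside image height
    )
    return pix, valid   # a camera with any valid point joins N_val for this query
```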
Cross-View Hybrid-Attention (CVHA): Fusing Perspectives
After ICA, each plane has independent information. CVHA lets queries attend to corresponding regions in other planes—top-view can gather height cues from side-view, etc.—yielding coherent 3D context.
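As a rough illustration of the idea, here is cross-view attention written with ordinary dense multi-head attention between flattened planes. This is a simplification: the paper's CVHA uses deformable attention at reference points projected onto the other planes, not full attention.

```python
import torch
import torch.nn as nn

class NaiveCrossViewAttention(nn.Module):
    """Top-down queries attend to all three planes (dense-attention stand-in for CVHA)."""
    def __init__(self, C=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(C, heads, batch_first=True)

    def forward(self, t_hw, t_dh, t_wd):
        # t_hw: (B, H*W, C), t_dh: (B, D*H, C), t_wd: (B, W*D, C)
        context = torch.cat([t_hw, t_dh, t_wd], dim=1)    # features from all three views
        fused, _ = self.attn(query=t_hw, key=context, value=context)
        return t_hw + fused                                # residual update of the top view
```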
TPVFormer stacks multiple blocks:
- Early: Hybrid-Cross-Attention Blocks (ICA + CVHA).
- Later: Hybrid-Attention Blocks (CVHA only) for refinement.
The result: three semantically rich, consistent planes.
Experiments: Seeing and Measuring TPVFormer
The authors tested TPVFormer on two datasets—nuScenes and SemanticKITTI—across three tasks:
1. Vision-Based 3D Semantic Occupancy Prediction
TPVFormer is trained using sparse per-point LiDAR labels, with only camera images as input. At test time, it predicts labels for all voxels.
Figure 2: Training with sparse LiDAR labels, testing with dense semantic occupancy prediction.
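A sketch of how sparse point supervision can train a dense predictor: sample the model's features at the labeled LiDAR point locations and apply ordinary cross-entropy. The linear head and the sample_tpv helper from the earlier sketch are stand-ins for the model's actual prediction head.

```python
import torch
import torch.nn as nn

num_classes = 17                    # size of the label set (illustrative)
head = nn.Linear(64, num_classes)   # per-point classifier on sampled TPV features

def sparse_point_loss(tpv_planes, lidar_xyz, lidar_labels, pc_range):
    """Cross-entropy on LiDAR points only; dense voxel labels are never needed."""
    feats = sample_tpv(*tpv_planes, lidar_xyz, pc_range)   # (N, C), see earlier sketch
    logits = head(feats)                                    # (N, num_classes)
    return nn.functional.cross_entropy(logits, lidar_labels)
```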
Qualitative Results
The predictions are dense, coherent, and geometrically accurate—often capturing objects missed in sparse LiDAR.
Figure 5: Dense TPVFormer predictions showing fine detail beyond sparse LiDAR ground truth.
An appealing property: TPV can be queried at arbitrary resolutions without retraining. Higher resolutions yield finer detail:
Figure 6: Predicting at higher resolutions captures finer geometry on objects like cars and trucks.
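Because point features come from continuous interpolation on the three planes, nothing ties prediction to the training resolution; we can simply query a denser grid of voxel centers. A sketch reusing the sample_tpv and head stand-ins from above (the resolution argument is an arbitrary choice, and in practice the grid would be processed in chunks):

```python
import torch

def predict_occupancy(tpv_planes, pc_range, resolution=0.2):
    """Build a voxel grid of query points at any resolution and classify each one."""
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    xs = torch.arange(x_min, x_max, resolution) + resolution / 2
    ys = torch.arange(y_min, y_max, resolution) + resolution / 2
    zs = torch.arange(z_min, z_max, resolution) + resolution / 2
    grid = torch.cartesian_prod(xs, ys, zs)                 # (Nvox, 3) voxel centers
    feats = sample_tpv(*tpv_planes, grid, pc_range)         # same planes, finer queries
    return head(feats).argmax(dim=-1)                       # one semantic label per voxel
```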
2. LiDAR Segmentation (nuScenes)
Here, the task is to predict a semantic label for each LiDAR point. TPVFormer still takes only camera images as input; the LiDAR point coordinates are used purely as query locations at which its TPV features are sampled.
Table 1: LiDAR segmentation on the nuScenes test set. Using cameras only, TPVFormer is competitive with LiDAR-based models.
TPVFormer-Base’s 69.4 mIoU matches PolarNet’s LiDAR-based performance—a first for a vision-only method. An ablation shows TPV outperforms BEVFormer by a wide margin (68.9 vs 56.2 on validation), underscoring TPV’s advantage.
Table 4: Resolution/feature dimension ablation—TPVFormer consistently surpasses BEVFormer.
3. Semantic Scene Completion (SemanticKITTI)
Dense voxelized predictions of occupancy and semantic labels from RGB inputs.
Table 2: Semantic scene completion results—TPVFormer sets a new camera-based benchmark.
TPVFormer reaches 11.26 mIoU, beating MonoScene’s 11.08 while using fewer parameters (6.0M vs 15.7M) and less compute (128G vs 500G FLOPs).
Conclusion: A Leap Toward Vision-First 3D Perception
Figure 7: Prediction visualization: TPVFormer models every voxel around the ego vehicle from camera inputs.
Key takeaways:
- TPV Hits the Sweet Spot: Matches voxel expressiveness while approaching BEV efficiency, enabling fine detail without cubic cost.
- Vision Can Rival LiDAR: TPVFormer’s camera-only performance rivals strong LiDAR baselines, opening the door to cheaper, more versatile perception.
- Sparse-to-Dense Generalization: Trained on sparse labels, TPVFormer predicts dense, high-quality occupancy—vital for real-world scalability.
TPVFormer shows that rethinking 3D representation unlocks powerful vision-based perception. By combining three complementary perspectives, TPV builds a more complete picture—pointing toward a future where autonomous cars see as richly in 3D using cameras as we do with our own eyes.