For an autonomous vehicle to navigate our chaotic world, it needs more than just GPS and rules—it must see and understand its surroundings in rich, 3D detail. Beyond detecting cars and pedestrians, it should recognize the space they occupy, the terrain’s contours, the location of sidewalks, and the canopy of trees overhead. This is the essence of 3D Semantic Occupancy Prediction: building a complete, labeled 3D map of the environment.
Traditionally, LiDAR has been the go-to technology for this task. LiDAR sensors emit laser beams to directly capture 3D point clouds of the surroundings. However, LiDAR can be expensive, and its data might be sparse, especially at greater distances or in occluded areas. Cameras, on the other hand, are cheap, ubiquitous, and capture rich texture and color information that LiDAR misses. The challenge? A 2D image is a flattened slice of 3D reality—recovering the missing dimension reliably is a hard problem.
The recent paper “Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction” proposes a novel way to overcome this challenge. The authors introduce a new 3D representation called the Tri-Perspective View (TPV) and an efficient transformer-based model, TPVFormer, to construct it. The results are striking: using only camera images, TPVFormer can generate dense and accurate 3D semantic maps—sometimes more comprehensive than the sparse LiDAR data used in training.
Figure 1: Vision-based occupancy prediction pipeline. TPVFormer takes RGB inputs and predicts semantic occupancy for all voxels using sparse LiDAR supervision.
In this post, we’ll explore the motivation behind TPV, break down its design, unpack the TPVFormer architecture, and examine the experimental results suggesting vision may be powerful enough to rival LiDAR for full-scene understanding.
Representing 3D Space: The Existing Approaches
To appreciate the novelty of TPV, let’s first look at two common scene representation methods used in autonomous driving: Voxel Grids and Bird’s-Eye-View (BEV) maps.
Voxel Grids: The 3D Pixel Approach
The most straightforward way to represent a 3D scene is to split it into uniform cubes—voxels. Each voxel stores a feature vector describing its contents: empty space, part of a car, a tree, road, etc.
Voxel grids are expressive and preserve full 3D detail. The feature for a point \((x, y, z)\) is the feature of the voxel containing that point:
\[ \mathbf{f}_{x,y,z} = \mathbf{v}_{h,w,d} = \mathcal{S}(\mathbf{V}, \mathcal{P}_{vox}(x, y, z)) \]
However, voxel grids are computationally expensive. A 100m × 100m × 8m scene at 20cm resolution spans \(500 \times 500 \times 40 = 10{,}000{,}000\) voxels. The cubic complexity, \(O(HWD)\), makes high-resolution voxel grids impractical for real-time autonomous driving.
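To make that concrete, here is a quick back-of-the-envelope calculation for the grid above; the 64-channel feature width is an illustrative assumption, not a number from the paper.

```python
# Back-of-the-envelope memory cost of a dense voxel feature grid.
# Scene: 100 m x 100 m x 8 m at 0.2 m resolution; C is an assumed feature width.
H, W, D = 500, 500, 40        # grid cells along x, y, z
C = 64                         # feature channels per voxel (illustrative)
bytes_per_float = 4            # float32

n_voxels = H * W * D
memory_gb = n_voxels * C * bytes_per_float / 1e9
print(f"{n_voxels:,} voxels, ~{memory_gb:.1f} GB of float32 features")
# -> 10,000,000 voxels, ~2.6 GB of float32 features
```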
Bird’s-Eye-View (BEV): Flattening the World
Given that most driving-relevant objects are near ground level, many systems collapse the height dimension and work with BEV: a top-down grid of the scene.
In BEV, each 2D cell represents a vertical “pillar” covering all heights. The point \((x,y,z)\) maps to BEV feature \(\mathbf{b}_{h,w}\):
\[ \mathbf{f}_{x,y,z} = \mathbf{b}_{h,w} = \mathcal{S}(\mathbf{B}, \mathcal{P}_{bev}(x, y)) \]
BEV reduces complexity to \(O(HW)\) and works well for tasks like detecting ground-level bounding boxes. But it discards height information—a tree branch and the ground beneath it end up in the same cell—limiting fine-grained understanding.
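A minimal sketch of the BEV lookup \(\mathcal{P}_{bev}\), assuming a 100m × 100m range at 20cm cells; note that the z coordinate never enters the indexing, which is exactly where the information loss comes from.

```python
import numpy as np

def bev_feature(B, x, y, x_min=-50.0, y_min=-50.0, cell=0.2):
    """Nearest-cell BEV lookup: map metric (x, y) to grid indices (h, w).

    B: (H, W, C) BEV feature map. The range, cell size, and the convention
    that rows index x and columns index y are assumptions of this sketch;
    points are assumed to lie inside the range.
    """
    h = int((x - x_min) / cell)   # row index along x
    w = int((y - y_min) / cell)   # column index along y
    return B[h, w]                # every z along this pillar gets the same feature

B = np.random.randn(500, 500, 64).astype(np.float32)
f = bev_feature(B, x=3.7, y=-12.4)   # shape (64,)
```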
The question then arises: can we retain voxel-level expressiveness while keeping BEV’s efficiency?
Figure 3: Voxel grids capture full detail but are expensive; BEV is efficient but loses height context. TPV introduces three orthogonal planes for a balanced representation.
Tri-Perspective View (TPV): Three Windows into the Scene
The authors’ solution is to represent the 3D scene using three orthogonal feature planes:
- Top-down (HW-plane) — like a BEV map.
- Side-view (DH-plane) — looking from the driver’s side.
- Front-view (WD-plane) — looking from the car’s front.
Formally:
\[ \mathbf{T} = [\mathbf{T}^{HW}, \mathbf{T}^{DH}, \mathbf{T}^{WD}] \]
where:
- \(\mathbf{T}^{HW} \in \mathbb{R}^{H \times W \times C}\)
- \(\mathbf{T}^{DH} \in \mathbb{R}^{D \times H \times C}\)
- \(\mathbf{T}^{WD} \in \mathbb{R}^{W \times D \times C}\)
The storage complexity becomes \(O(HW + DH + WD)\): quadratic in the grid resolution rather than cubic, the same order as BEV.
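For the same 500 × 500 × 40 grid, the three representations store very different numbers of feature vectors:

```python
# Number of stored feature vectors for a 500 x 500 x 40 scene grid.
H, W, D = 500, 500, 40
voxel = H * W * D                # 10,000,000 cells
bev = H * W                      #    250,000 cells
tpv = H * W + D * H + W * D      #    290,000 cells
print(voxel, bev, tpv)           # TPV costs ~16% more than BEV, ~34x less than voxels
```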
To get the feature for a point \((x,y,z)\):
- Project: Map the point onto each plane, giving \((h,w)\), \((d,h)\), and \((w,d)\) coordinates.
- Sample: Use bilinear interpolation to sample features \(\mathbf{t}_{h,w}\), \(\mathbf{t}_{d,h}\), \(\mathbf{t}_{w,d}\).
- Aggregate: Sum them: \[ \mathbf{f}_{x,y,z} = \mathcal{A}(\mathbf{t}_{h,w}, \mathbf{t}_{d,h}, \mathbf{t}_{w,d}) \]
In this way, TPV assigns a distinct feature to each 3D point, recovering voxel-like expressiveness at a fraction of the cost.
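Below is a minimal PyTorch sketch of this project-sample-aggregate step, using grid_sample for bilinear interpolation. The bilinear sampling and sum aggregation follow the description above; the function name, the axis conventions (which metric axis maps to H, W, and D), and the point-cloud range are my assumptions.

```python
import torch
import torch.nn.functional as F

def sample_tpv(tpv_hw, tpv_dh, tpv_wd, pts, pc_range):
    """Bilinearly sample TPV features for 3D points and sum them.

    tpv_hw: (1, C, H, W), tpv_dh: (1, C, D, H), tpv_wd: (1, C, W, D)
    pts:    (N, 3) metric (x, y, z) points
    pc_range: (x_min, y_min, z_min, x_max, y_max, z_max)
    """
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    # Normalize each metric coordinate to [-1, 1] for grid_sample.
    xn = 2 * (pts[:, 0] - x_min) / (x_max - x_min) - 1
    yn = 2 * (pts[:, 1] - y_min) / (y_max - y_min) - 1
    zn = 2 * (pts[:, 2] - z_min) / (z_max - z_min) - 1

    def sample(plane, u, v):
        # grid_sample convention: grid[..., 0] indexes width, grid[..., 1] height.
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)
        out = F.grid_sample(plane, grid, mode="bilinear", align_corners=False)
        return out.reshape(plane.shape[1], -1).t()       # (N, C)

    f_hw = sample(tpv_hw, yn, xn)   # top-down plane: H indexes x, W indexes y (assumed)
    f_dh = sample(tpv_dh, xn, zn)   # side plane:     D indexes z, H indexes x (assumed)
    f_wd = sample(tpv_wd, zn, yn)   # front plane:    W indexes y, D indexes z (assumed)
    return f_hw + f_dh + f_wd       # sum aggregation: one feature per point

# Illustrative usage with random planes (C=64, H=W=100, D=8):
C, H, W, D = 64, 100, 100, 8
planes = (torch.randn(1, C, H, W), torch.randn(1, C, D, H), torch.randn(1, C, W, D))
pts = torch.tensor([[3.7, -12.4, 1.0]])
feat = sample_tpv(*planes, pts, pc_range=(-50.0, -50.0, -4.0, 50.0, 50.0, 4.0))  # (1, 64)
```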
TPVFormer: Lifting Vision into TPV Space
TPV is powerful, but how do we populate those planes with features from 2D camera images? Enter TPVFormer, a transformer-based encoder designed for exactly this task.
Figure 4: TPVFormer architecture: multi-camera backbone, Image Cross-Attention to lift 2D features, Cross-View Hybrid Attention to fuse planes, and a prediction head.
TPV Queries
TPVFormer assigns a learnable query vector to every grid cell on each of the three planes. Each query acts as a placeholder for the features of the region of space its cell covers.
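In code, this amounts to nothing more than three tables of learnable embeddings; a minimal sketch (plane sizes and feature width are illustrative):

```python
import torch
import torch.nn as nn

class TPVQueries(nn.Module):
    """One learnable query vector per cell of each TPV plane (sizes are illustrative)."""
    def __init__(self, H=100, W=100, D=8, C=256):
        super().__init__()
        self.t_hw = nn.Parameter(torch.zeros(H * W, C))  # top-down queries
        self.t_dh = nn.Parameter(torch.zeros(D * H, C))  # side-view queries
        self.t_wd = nn.Parameter(torch.zeros(W * D, C))  # front-view queries
        for p in (self.t_hw, self.t_dh, self.t_wd):
            nn.init.normal_(p, std=0.02)
```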
Image Cross-Attention (ICA): Lifting 2D to 3D
For each TPV query:
- Identify its real-world pillar region.
- Sample reference points along this pillar (in 3D).
- Project those points into each surround camera’s image using calibration.
- Use deformable attention to efficiently sample nearby image features and aggregate them into the query.
Mathematically:
\[ \operatorname{ICA}(\mathbf{t}_{h,w}, \mathbf{I}) = \frac{1}{|N_{h,w}^{val}|} \sum_{j \in N_{h,w}^{val}} \operatorname{DA}(\mathbf{t}_{h,w}, \mathbf{Ref}_{h,w}^{pix,j}, \mathbf{I}_j) \]
Here \(\operatorname{DA}\) is deformable attention, and the result is averaged over the valid cameras \(N_{h,w}^{val}\), i.e., the cameras whose field of view the reference points actually fall into.
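A sketch of the projection step that decides which cameras contribute to a query: sample points along the pillar, project them with each camera's matrix, and keep the cameras they land in. The lidar2img matrices and image size are assumed inputs, and the deformable-attention step itself is omitted.

```python
import torch

def project_to_cameras(ref_pts, lidar2img, img_h, img_w):
    """Project 3D reference points into each camera and mark valid hits.

    ref_pts:   (P, 3) points sampled along one pillar, in the ego/LiDAR frame
    lidar2img: (Ncam, 4, 4) projection matrices (intrinsics @ extrinsics)
    Returns pixel coords (Ncam, P, 2) and a validity mask (Ncam, P).
    """
    P = ref_pts.shape[0]
    homo = torch.cat([ref_pts, ref_pts.new_ones(P, 1)], dim=-1)   # (P, 4) homogeneous
    cam = torch.einsum("nij,pj->npi", lidar2img, homo)            # (Ncam, P, 4)
    depth = cam[..., 2].clamp(min=1e-5)
    pix = cam[..., :2] / depth.unsqueeze(-1)                       # (Ncam, P, 2) pixels
    valid = (
        (cam[..., 2] > 0)                                          # in front of camera
        & (pix[..., 0] >= 0) & (pix[..., 0] < img_w)               # inside image width
        & (pix[..., 1] >= 0) & (pix[..., 1] < img_h)               # inside image height
    )
    return pix, valid   # a camera with any valid point joins N_val for this query
```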
Cross-View Hybrid-Attention (CVHA): Fusing Perspectives
After ICA, each plane has independent information. CVHA lets queries attend to corresponding regions in other planes—top-view can gather height cues from side-view, etc.—yielding coherent 3D context.
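As a rough illustration of the idea, here is cross-view attention written with ordinary dense multi-head attention between flattened planes. This is a simplification: the paper's CVHA uses deformable attention at reference points projected onto the other planes, not full attention.

```python
import torch
import torch.nn as nn

class NaiveCrossViewAttention(nn.Module):
    """Top-down queries attend to all three planes (dense-attention stand-in for CVHA)."""
    def __init__(self, C=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(C, heads, batch_first=True)

    def forward(self, t_hw, t_dh, t_wd):
        # t_hw: (B, H*W, C), t_dh: (B, D*H, C), t_wd: (B, W*D, C)
        context = torch.cat([t_hw, t_dh, t_wd], dim=1)    # features from all three views
        fused, _ = self.attn(query=t_hw, key=context, value=context)
        return t_hw + fused                                # residual update of the top view
```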
TPVFormer stacks multiple blocks:
- Early: Hybrid-Cross-Attention Blocks (ICA + CVHA).
- Later: Hybrid-Attention Blocks (CVHA only) for refinement.
The result: three semantically rich, consistent planes.
Experiments: Seeing and Measuring TPVFormer
The authors tested TPVFormer on two datasets—nuScenes and SemanticKITTI—across three tasks:
1. Vision-Based 3D Semantic Occupancy Prediction
TPVFormer is trained using sparse per-point LiDAR labels, with only camera images as input. At test time, it predicts labels for all voxels.
Figure 2: Training with sparse LiDAR labels, testing with dense semantic occupancy prediction.
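A sketch of how sparse point supervision can train a dense predictor: sample the model's features at the labeled LiDAR point locations and apply ordinary cross-entropy. The linear head and the sample_tpv helper from the earlier sketch are stand-ins for the model's actual prediction head.

```python
import torch
import torch.nn as nn

num_classes = 17                    # size of the label set (illustrative)
head = nn.Linear(64, num_classes)   # per-point classifier on sampled TPV features

def sparse_point_loss(tpv_planes, lidar_xyz, lidar_labels, pc_range):
    """Cross-entropy on LiDAR points only; dense voxel labels are never needed."""
    feats = sample_tpv(*tpv_planes, lidar_xyz, pc_range)   # (N, C), see earlier sketch
    logits = head(feats)                                    # (N, num_classes)
    return nn.functional.cross_entropy(logits, lidar_labels)
```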
Qualitative Results
The predictions are dense, coherent, and geometrically accurate—often capturing objects missed in sparse LiDAR.
Figure 5: Dense TPVFormer predictions showing fine detail beyond sparse LiDAR ground truth.
An appealing property: TPV can be queried at arbitrary resolutions without retraining. Higher resolutions yield finer detail:
Figure 6: Predicting at higher resolutions captures finer geometry on objects like cars and trucks.
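Because point features come from continuous interpolation on the three planes, nothing ties prediction to the training resolution; we can simply query a denser grid of voxel centers. A sketch reusing the sample_tpv and head stand-ins from above (the resolution argument is an arbitrary choice, and in practice the grid would be processed in chunks):

```python
import torch

def predict_occupancy(tpv_planes, pc_range, resolution=0.2):
    """Build a voxel grid of query points at any resolution and classify each one."""
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    xs = torch.arange(x_min, x_max, resolution) + resolution / 2
    ys = torch.arange(y_min, y_max, resolution) + resolution / 2
    zs = torch.arange(z_min, z_max, resolution) + resolution / 2
    grid = torch.cartesian_prod(xs, ys, zs)                 # (Nvox, 3) voxel centers
    feats = sample_tpv(*tpv_planes, grid, pc_range)         # same planes, finer queries
    return head(feats).argmax(dim=-1)                       # one semantic label per voxel
```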
2. LiDAR Segmentation (nuScenes)
Here, the task is to predict a semantic label for each LiDAR point. TPVFormer still takes only camera images as input; the LiDAR point coordinates are used purely as query locations at which its TPV features are sampled.
Table 1: LiDAR segmentation on the nuScenes test set. Using cameras only, TPVFormer is competitive with LiDAR-based models.
TPVFormer-Base’s 69.4 mIoU matches PolarNet’s LiDAR-based performance—a first for a vision-only method. An ablation shows TPV outperforms BEVFormer by a wide margin (68.9 vs 56.2 on validation), underscoring TPV’s advantage.
Table 4: Resolution/feature dimension ablation—TPVFormer consistently surpasses BEVFormer.
3. Semantic Scene Completion (SemanticKITTI)
Dense voxelized predictions of occupancy and semantic labels from RGB inputs.
Table 2: Semantic scene completion results—TPVFormer sets a new camera-based benchmark.
TPVFormer reaches 11.26 mIoU, beating MonoScene’s 11.08 while using fewer parameters (6.0M vs 15.7M) and less compute (128G vs 500G FLOPs).
Conclusion: A Leap Toward Vision-First 3D Perception
Figure 7: Prediction visualization: TPVFormer models every voxel around the ego vehicle from camera inputs.
Key takeaways:
- TPV Hits the Sweet Spot: Matches voxel expressiveness while approaching BEV efficiency, enabling fine detail without cubic cost.
- Vision Can Rival LiDAR: TPVFormer’s camera-only performance rivals strong LiDAR baselines, opening the door to cheaper, more versatile perception.
- Sparse-to-Dense Generalization: Trained on sparse labels, TPVFormer predicts dense, high-quality occupancy—vital for real-world scalability.
TPVFormer shows that rethinking 3D representation unlocks powerful vision-based perception. By combining three complementary perspectives, TPV builds a more complete picture—pointing toward a future where autonomous cars see as richly in 3D using cameras as we do with our own eyes.