In the rapidly evolving world of autonomous driving and robotics, perception is everything. Vehicles need to know not just what is around them, but exactly how far away it is. While LiDAR sensors provide excellent depth data, they are expensive. A more cost-effective alternative is fusing data from cameras (rich visual detail) and mmWave Radar (reliable depth and velocity).
However, Radar-Camera fusion has a major bottleneck: efficiency. Existing methods are often slow and computationally heavy, relying on complex, multi-stage processes that act like stumbling blocks for real-time applications.
Enter TacoDepth, a novel framework proposed by researchers from Nanyang Technological University, Huazhong University of Science and Technology, and SenseTime Research. By treating Radar points as a graph and applying a Radar-centered “Flash Attention” mechanism, TacoDepth achieves state-of-the-art accuracy while running significantly faster than previous methods.
In this deep dive, we will unpack how TacoDepth works, why it abandons the traditional multi-stage approach, and how it achieves a balance of speed and precision that could change how robots perceive the world.
The Challenge: The “Sparsity” Trap
To understand why TacoDepth is necessary, we first need to understand the problem with Radar. Unlike the dense point clouds produced by LiDAR (which can look like a 3D scan of the world), Radar data is incredibly sparse. A standard automotive Radar might return only a handful of points for an entire scene.
Traditionally, researchers have tried to solve this by “filling in the blanks” before doing anything else. This is known as a multi-stage framework.
- Stage 1: The model takes the sparse Radar points and tries to expand them into an intermediate “quasi-dense” depth map.
- Stage 2: This intermediate map is then fused with the camera image to predict the final dense depth.
The problem? This intermediate step is a trap. The “quasi-dense” maps are often full of noise and errors. If the model guesses wrong in Stage 1, those errors propagate to Stage 2, ruining the final result. Furthermore, generating these intermediate maps is computationally expensive.
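To make the contrast concrete, here is a minimal sketch of a generic two-stage pipeline. The function names (`densify_radar`, `fuse_and_predict`) are hypothetical placeholders standing in for whatever networks a given method uses, not a real API:

```python
# Hypothetical sketch of a generic multi-stage Radar-Camera pipeline.
# `densify_radar` and `fuse_and_predict` are illustrative placeholders.

def multi_stage_depth(image, radar_points):
    # Stage 1: expand the sparse Radar returns into a "quasi-dense" depth map.
    # This step is slow, and any mistakes are baked into the intermediate result.
    quasi_dense = densify_radar(radar_points, image)

    # Stage 2: fuse the (possibly noisy) intermediate map with the image.
    # Errors from Stage 1 propagate directly into the final prediction.
    return fuse_and_predict(image, quasi_dense)
```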

As shown in the figure above, TacoDepth (represented by the small blue and green circles) operates in a different league compared to prior art. It achieves lower error rates while running at much faster speeds.
Visualizing the Problem
Why is the multi-stage approach so brittle? Look at the quality of the intermediate depth maps produced by traditional methods:

In the image above, the “Quasi-dense Depth Maps” are mostly empty (black). In challenging conditions like nighttime or glare, valid pixels are scarce. Relying on such sparse, noisy intermediate data is the primary reason previous models struggle with robustness and speed.
The Solution: TacoDepth
The researchers propose a One-stage Fusion approach. Instead of trying to create a fake, intermediate depth map, TacoDepth directly extracts geometric structures from the Radar data and fuses them with the image features in a single, streamlined pass.
The architecture consists of two main novelties:
- Graph-based Radar Structure Extractor: Moving beyond simple point coordinates.
- Pyramid-based Radar Fusion Module: Efficiently mixing modalities using Flash Attention.
Let’s look at the high-level architecture:

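Conceptually, the one-stage forward pass composes those two modules in a single call. The sketch below mirrors that block diagram with hypothetical names (`image_encoder`, `graph_extractor`, `pyramid_fusion`, `depth_head`); it illustrates the data flow, not the authors’ implementation:

```python
def taco_depth_forward(image, radar_points, initial_depth=None):
    # Illustrative one-stage data flow; not the authors' actual code.
    image_feats = image_encoder(image)                         # pyramid of image features
    radar_nodes, radar_edges = graph_extractor(radar_points)   # graph-based Radar features

    # A single fusion pass: no intermediate quasi-dense depth map is produced.
    fused = pyramid_fusion(image_feats, radar_nodes, radar_edges)

    # `initial_depth` is only used in the Plug-in mode described later; it defaults to None.
    return depth_head(fused, initial_depth)
```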
1. Radar as a Graph, Not Just Points
Most previous methods treat Radar data as a simple list of points with \((x, y, z)\) coordinates. TacoDepth takes a more sophisticated approach by viewing the Radar point cloud as a Graph.
In this graph-based extractor:
- Nodes are the Radar points.
- Edges represent the relationships (distances and topology) between neighboring points.
Using Graph Neural Networks (GNNs), the model captures the geometric structure of the scene. It doesn’t just know where each point is; it understands the local topology (how points relate to their neighbors). This structural awareness makes the model far more robust to outliers (noise) than methods that treat points in isolation.
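A minimal sketch of the idea, assuming a simple k-nearest-neighbor graph and one round of mean-aggregation message passing (an illustrative stand-in, not the paper’s exact GNN):

```python
import torch

def knn_graph(points, k=4):
    """Connect each Radar point to its k nearest neighbors in 3D."""
    # points: (N, 3) tensor of (x, y, z) coordinates
    dists = torch.cdist(points, points)              # (N, N) pairwise distances
    knn = dists.topk(k + 1, largest=False).indices   # +1 because each point is its own nearest neighbor
    return knn[:, 1:]                                 # drop the self-loop -> (N, k) neighbor indices

def message_passing(node_feats, neighbors):
    """One round of neighborhood aggregation: each node mixes in its neighbors' features."""
    # node_feats: (N, C) per-point features; neighbors: (N, k) indices into node_feats
    neighbor_feats = node_feats[neighbors]            # (N, k, C)
    aggregated = neighbor_feats.mean(dim=1)           # simple mean aggregation over the neighborhood
    return node_feats + aggregated                    # residual update keeps each point's own signal
```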
2. Pyramid-based Radar Fusion
Once the Radar features (nodes and edges) are extracted, they need to be combined with the camera image. TacoDepth does this hierarchically using a “Pyramid” structure.
- Shallow Layers: Fuse image details with Radar coordinates.
- Deep Layers: Fuse scene semantics (e.g., “this is a car”) with geometric structures.
This ensures that the model utilizes both the fine-grained texture of the image and the high-level understanding of the scene geometry.
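A rough sketch of that hierarchical loop; `fuse_level` is a hypothetical placeholder for the attention-based fusion applied at each pyramid level:

```python
def pyramid_fuse(image_pyramid, radar_feats):
    """Fuse Radar features into every level of an image feature pyramid.

    image_pyramid: list of feature maps, ordered shallow (high-res) -> deep (low-res).
    radar_feats:   graph-based Radar features (nodes plus edge information).
    Illustrative sketch only; `fuse_level` stands in for the paper's fusion module.
    """
    fused_pyramid = []
    for level, image_feats in enumerate(image_pyramid):
        # Shallow levels mix fine image detail with Radar coordinates;
        # deep levels mix scene semantics with the extracted geometric structure.
        fused_pyramid.append(fuse_level(image_feats, radar_feats, level))
    return fused_pyramid
```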
3. The Secret Weapon: Radar-Centered Flash Attention
The most critical innovation for efficiency in TacoDepth is how it handles Attention.
In modern deep learning (specifically Transformers), “Attention” is a mechanism that helps the model decide which parts of the input are related. Usually, calculating attention is expensive because every pixel is compared to every other pixel or point.
The authors realized two things:
- Sparsity: Radar points are sparse, so comparing them to every pixel in the image is a waste of computation.
- Coordinate Reliability: Radar is very accurate horizontally (azimuth) but often inaccurate vertically (elevation), because many automotive units have little or no antenna resolution in the vertical dimension.
To solve this, they introduced Radar-centered Flash Attention.
How it Works
For a specific Radar point, the model only calculates attention with image pixels that are physically close to it in the horizontal direction. It creates a “window” around the Radar point.
Mathematically, if we have an image feature map \(F_{2l}\), we only keep pixels \(m\) whose horizontal coordinate \(x_m\) lies within a distance \(a_l\) of the Radar point’s horizontal coordinate \(x_p\):

\[ |x_m - x_p| \le a_l \]
Similarly, when looking from the perspective of an image pixel, the model only considers Radar points that fall within that same horizontal window:

\[ |x_p - x_m| \le a_l \]
By restricting the search area, the computational cost drops, and the model avoids being confused by irrelevant pixels far away from the Radar return.
The attention itself follows the standard scaled dot-product formulation, computed with the highly optimized “Flash Attention” kernel, which speeds up memory operations on the GPU:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \]
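A small sketch of this windowed attention using PyTorch’s `scaled_dot_product_attention`, which dispatches to FlashAttention-style kernels on supported GPUs. The tensor shapes and mask construction below are illustrative assumptions, not the authors’ implementation:

```python
import torch
import torch.nn.functional as F

def radar_centered_attention(radar_q, pixel_kv, pixel_x, radar_x, window):
    """Each Radar query attends only to pixels inside its horizontal window.

    radar_q : (N, C) query features, one per Radar point
    pixel_kv: (M, C) key/value features, one per (flattened) image pixel
    pixel_x : (M,)   horizontal coordinate of each pixel
    radar_x : (N,)   horizontal coordinate of each Radar point
    window  : float  half-width a_l of the horizontal window
    """
    # Boolean mask: True where |x_m - x_p| <= a_l, i.e. the pixel lies inside the window.
    # (Assumes every Radar point has at least one pixel inside its window.)
    mask = (pixel_x[None, :] - radar_x[:, None]).abs() <= window   # (N, M)

    # scaled_dot_product_attention expects (..., seq, dim); add batch and head dims.
    q = radar_q[None, None]        # (1, 1, N, C)
    k = v = pixel_kv[None, None]   # (1, 1, M, C)
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask[None, None])
    return out[0, 0]               # (N, C) Radar features enriched with image context
```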
Seeing Attention in Action
Does this restricted attention actually work? The visualization below confirms it does. Even with the horizontal constraint, the attention maps (the heatmaps in the bottom row) clearly focus on the relevant objects, such as cars and poles, distinguishing them from the background.

This selective focus is what allows TacoDepth to be both fast (ignoring irrelevant data) and accurate (focusing on the right data).
Flexible Inference: One Model, Two Modes
One of the most user-friendly features of TacoDepth is its flexibility. The researchers designed it to operate in two distinct modes, depending on the user’s needs for speed versus accuracy.
The inference process takes the image \(I\), the Radar points \(P\), and an optional “initial depth” map \(D^*\) as inputs, and produces the final dense depth:
- Independent Mode (Speed): The model runs using only the raw Image (\(I\)) and Radar (\(P\)). This is the fastest mode, achieving over 37 frames per second (FPS), making it ideal for real-time robotic navigation.
- Plug-in Mode (Accuracy): The model takes an initial depth estimation from a separate, pre-trained network (like MiDaS or Depth-Anything) and uses the Radar data to “correct” the scale. This provides the highest possible accuracy but is slower due to the extra processing steps.
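A minimal sketch of how the two modes could be exposed to a caller, reusing the hypothetical `taco_depth_forward` from earlier; `mono_depth_model` stands in for any pre-trained depth network such as MiDaS or Depth-Anything:

```python
def estimate_depth(image, radar_points, mono_depth_model=None):
    """Run the fast Independent mode, or the more accurate Plug-in mode.

    Hypothetical wrapper for illustration, not the authors' API.
    """
    if mono_depth_model is None:
        # Independent mode: image + Radar only, fast enough for real-time use.
        return taco_depth_forward(image, radar_points)

    # Plug-in mode: a pre-trained network supplies an initial depth map,
    # and the Radar data is used to correct it to metric scale.
    initial_depth = mono_depth_model(image)
    return taco_depth_forward(image, radar_points, initial_depth=initial_depth)
```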
Experimental Results
The researchers tested TacoDepth on the standard nuScenes dataset and the newer ZJU-4DRadarCam dataset. The results were compelling.
Speed and Efficiency
We saw the bubble chart in the introduction, but let’s look at the raw numbers regarding efficiency.

TacoDepth is a lightweight powerhouse.
- Parameters: It uses only 13.47M parameters in Independent mode, compared to 22.81M for the previous state-of-the-art (Singh et al.).
- FLOPs (Floating Point Operations): It requires drastically less computation—139 GFLOPs versus 502 GFLOPs for Singh et al.
- Speed: It processes a frame in 26.7ms, whereas Singh et al. takes 94.2ms. That is roughly 3.5x faster.
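A quick check of the arithmetic behind those figures:

\[
\frac{1000\ \text{ms}}{26.7\ \text{ms/frame}} \approx 37.5\ \text{FPS}, \qquad \frac{94.2\ \text{ms}}{26.7\ \text{ms}} \approx 3.5\times
\]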
This is confirmed by the FPS analysis across varying densities of Radar points:

Visual Quality and Robustness
Speed is useless without accuracy. TacoDepth shines here as well, particularly in difficult conditions.
Daytime Scenes: In standard daylight, TacoDepth produces sharp depth maps with clear object boundaries. Notice the crisp definition of the cars in the bottom row (“Ours”) compared to the blurry artifacts in the middle rows.

Nighttime Scenes: Nighttime is where cameras struggle and Radar becomes essential. Previous multi-stage methods often fail here because their intermediate depth maps are empty (darkness yields no visual features for quasi-dense estimation). TacoDepth, however, leverages the structural graph of the Radar data to maintain integrity.

In the image above, the competing methods (middle rows) essentially collapse, producing muddy, indistinct blobs. TacoDepth (bottom row) clearly resolves the street structure and vehicles.
Robustness to “Height Ambiguity”
A common issue with 3D Radar is height ambiguity—the sensor might know an object is 10 meters away, but it’s unsure if it’s at ground level or 2 meters up.
TacoDepth’s attention mechanism is surprisingly resilient to this. The researchers simulated this error by artificially perturbing the vertical coordinates of radar points.

As seen in Figure 8(a), even when the radar point is shifted vertically (the square), the attention map (heatmap) still locks onto the correct object (the car or tree). In Figure 8(b), the model correctly ignores a “ghost” radar point in the sky (an outlier), assigning it very low attention.
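In spirit, this robustness test is straightforward to reproduce: jitter only the vertical coordinate of each Radar point and check whether the attention still locks onto the right object. A hedged sketch, where the axis convention and shift magnitude are assumptions rather than the paper’s exact protocol:

```python
import torch

def perturb_height(radar_points, max_shift=1.0):
    """Simulate height ambiguity by jittering only the vertical coordinate.

    radar_points: (N, 3) tensor of (x, y, z); y is assumed to be the vertical axis.
    max_shift:    maximum vertical offset in meters (illustrative value).
    """
    shift = (torch.rand(radar_points.shape[0]) * 2 - 1) * max_shift   # uniform in [-max_shift, max_shift)
    perturbed = radar_points.clone()
    perturbed[:, 1] += shift    # the horizontal (x) and depth (z) coordinates stay untouched
    return perturbed
```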
Comparison with State-of-the-Art
The quantitative results seal the deal. On the ZJU-4DRadarCam dataset, TacoDepth outperforms existing “Plug-in” models regardless of which backbone depth predictor is used.

Whether using DPT, MiDaS, or the cutting-edge Depth-Anything-v2, TacoDepth consistently provides a better mechanism for fusing that initial depth with Radar data to achieve metric accuracy.
Conclusion
TacoDepth represents a significant shift in how we approach sensor fusion for depth estimation. By rejecting the complex, error-prone multi-stage pipelines of the past and embracing a streamlined one-stage architecture, the authors have created a system that is:
- Efficient: Utilizing graph structures and horizontally-constrained Flash Attention to run at real-time speeds.
- Accurate: Outperforming state-of-the-art methods in standard metrics.
- Robust: Handling nighttime scenes and sensor noise with resilience that previous models lacked.
- Flexible: Offering modes for both high-speed robotics and high-fidelity mapping.
The name “TacoDepth” is fitting—just as a taco wraps various ingredients into a single, cohesive package, this framework wraps the disparate modalities of Radar and Camera into a unified, efficient perception model. For autonomous vehicles operating in the messy, unpredictable real world, this kind of efficient fusion is exactly what is needed to ensure safety and reliability.