Beyond the One True Path: How BranchOut Revolutionizes Multimodal Autonomous Driving

Imagine you are driving down a busy street and you see a delivery truck stopped on the shoulder. As a human driver, what do you do? Do you nudge slightly into the next lane? Do you slow down and wait? Do you perform a full lane change to overtake?

The answer is likely “it depends,” and crucially, all of those options might be valid and safe.

This variety of valid choices is called multimodality. However, for a long time, planning models for autonomous vehicles (AVs) have been trained and evaluated as if there is only one correct answer: the exact path the human driver took in the training data. This leads to rigid, robotic driving that fails to capture the nuance of real-world interactions.

In this post, we are diving deep into BranchOut, a new research paper from Boston University. The researchers introduce a novel end-to-end planner that combines Diffusion models with Gaussian Mixture Models (GMMs) to capture diverse, human-like behaviors. Furthermore, they expose a flaw in how we currently grade AVs and propose a new “Human-in-the-Loop” benchmark to fix it.

The Problem: The “Single Future” Fallacy

Standard motion planning works by looking at the current scene—cameras, LiDAR, maps—and predicting where the car should go. The model is usually trained on a dataset like nuScenes, where every scenario has exactly one ground-truth trajectory (the path the data-collection car actually took).

The problem arises when we evaluate models based solely on how closely they match that single trajectory. If an AV plans a safe lane change, but the original human driver stayed in their lane, the AV is penalized for a “high error.”

This encourages models to “play it safe” by averaging over all possibilities, which often leads to mode collapse: instead of committing to one distinct maneuver (go left or go right), the model may predict a path straight down the middle, which could head directly into an obstacle.

Figure 1: Capturing Multimodality in Complex Real-World Driving Scenarios. We study modeling and evaluation of intricate multimodal driving settings, including subtle interactions, e.g., vehicle-vehicle and human-vehicle. As an example, we visualize collected multimodal trajectories in two scenarios: navigating around a vehicle parked on the shoulder (left) and interacting with a dynamic agent at an intersection (right).

As shown in Figure 1 above, real-world scenarios like navigating around a parked vehicle or an intersection allow for many distinct, safe paths. The researchers argue that to build truly human-like AVs, we must model this distribution of possibilities, not just a single path.

The Solution: BranchOut

To solve this, the authors propose BranchOut. It is an end-to-end planner, meaning it takes raw sensor data and outputs driving trajectories directly.

What makes BranchOut unique is how it handles uncertainty and diversity. It leverages the generative power of Diffusion Models (the same tech behind image generators like Midjourney) but structures the output using Gaussian Mixture Models (GMMs) to ensure the predictions are distinct and plausible.

The Architecture

Let’s break down the system architecture shown in Figure 2.

Figure 2: Our End-to-End, GMM-Based Diffusion Planner. BranchOut consists of a scene encoder F and a scene-aware transformer-based denoiser D. The encoder F processes multi-view camera images and an HD map to extract scene features that condition the denoiser using multi-head cross-attention (MHCA), with scene features as keys and values to condition the ego query. A GMM head G, selected using high-level driving command c, takes the transformed features and predicts K trajectory means and corresponding weights, enabling the model to select the most likely future trajectory.

The pipeline consists of three main stages:

  1. Scene Encoder (\(\mathcal{F}\)): The model takes multi-view camera images and a high-definition (HD) map as input. It processes these to create a rich feature representation of the world.
  2. Scene-Aware Denoiser (\(\mathcal{D}\)): This is a Transformer-based module. It takes noisy trajectory estimates and “denoises” them based on the scene context.
  3. Branched GMM Head (\(\mathcal{G}\)): Instead of outputting a single path, the model outputs parameters for a Gaussian Mixture Model, providing a distribution of possible trajectories.

Let’s look at the math and mechanics behind these components.
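
Before going through the math of each component, here is a minimal, PyTorch-flavored sketch of how the three stages might connect end to end. Every module name, tensor shape, and hyperparameter below is an illustrative assumption for exposition, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class BranchOutSketch(nn.Module):
    """Hypothetical sketch of the three-stage pipeline: encoder F, denoiser D, branched GMM head G."""

    def __init__(self, d_model=256, horizon=6, num_modes=6, num_commands=3):
        super().__init__()
        # Stand-in for F: the real encoder consumes multi-view images and an HD map;
        # here we assume scene features have already been tokenized to (B, N, d_model).
        self.scene_proj = nn.Linear(d_model, d_model)
        # Stand-in for D: the cross-attention inside this decoder layer plays the role of
        # MHCA, with scene tokens as keys/values and the ego query as the target.
        self.denoiser = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.traj_embed = nn.Linear(2 * horizon, d_model)   # embed a noisy (x, y) trajectory
        self.time_embed = nn.Embedding(1000, d_model)       # diffusion timestep embedding
        # G: one GMM branch per high-level command (e.g., left / straight / right).
        self.gmm_heads = nn.ModuleList(
            [nn.Linear(d_model, num_modes * (2 * horizon + 1)) for _ in range(num_commands)]
        )
        self.horizon, self.num_modes = horizon, num_modes

    def forward(self, scene_tokens, noisy_traj, t, command: int):
        # scene_tokens: (B, N, d_model); noisy_traj: (B, horizon, 2); t: (B,) long
        scene = self.scene_proj(scene_tokens)
        query = self.traj_embed(noisy_traj.flatten(1)) + self.time_embed(t)      # (B, d_model)
        ego = self.denoiser(query.unsqueeze(1), scene).squeeze(1)                # attend to the scene
        out = self.gmm_heads[command](ego).view(-1, self.num_modes, 2 * self.horizon + 1)
        means = out[..., :-1].reshape(-1, self.num_modes, self.horizon, 2)       # K mean trajectories
        weights = torch.softmax(out[..., -1], dim=-1)                            # K mode weights
        return means, weights
```

Keeping one head per command mirrors the “branched” design in Figure 2: the command narrows the decision to a coarse intent, and the GMM then spreads probability over \(K\) concrete ways to execute it.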

1. The Diffusion Process

The core generative engine is a diffusion model. During training, the system takes a ground-truth trajectory (\(\mathbf{Y}_{\text{ego}}\)) and corrupts it with noise. The amount of noise is determined by a timestep \(t\).

Equation 1: Noisy Trajectory Formulation
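
In its standard diffusion forward-process form, this corruption can be written as follows; \(\sigma(t)\) is added notation for the matching noise scale (in DDPM-style schedules it is tied to \(\alpha(t)\), e.g. \(\sigma(t) = \sqrt{1 - \alpha(t)^2}\)), and the paper’s exact parameterization may differ:

\[
\mathbf{X}_{\text{ego}}^{(t)} \;=\; \alpha(t)\,\mathbf{Y}_{\text{ego}} \;+\; \sigma(t)\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
\]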

Here, \(\mathbf{X}_{\text{ego}}^{(t)}\) is the noisy trajectory at timestep \(t\). \(\alpha(t)\) controls the signal-to-noise ratio. The model’s job is to reverse this process: starting from pure noise, it attempts to reconstruct the clean trajectory.

2. Scene-Aware Denoising

A diffusion model needs context. It can’t just hallucinate a path; it needs to see the road and other cars. BranchOut uses a Transformer that integrates scene features using Multi-Head Cross-Attention (MHCA).

Equation 2: Scene-Aware Representation
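
Given the cross-attention setup described in the Figure 2 caption (scene features as keys and values, the ego/trajectory embedding as the query), a plausible reconstruction is the following; writing the output as \(\mathbf{P}_{\text{scene}}\) and concatenating the agent and map features are notational assumptions:

\[
\mathbf{P}_{\text{scene}} \;=\; \mathrm{MHCA}\big(Q = \mathbf{P},\; K = V = [\mathbf{P}_{\text{agent}},\, \mathbf{P}_{\text{map}}]\big).
\]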

In this equation:

  • \(\mathbf{P}\) is the trajectory embedding.
  • \(\mathbf{P}_{\text{agent}}\) and \(\mathbf{P}_{\text{map}}\) represent features from other vehicles and the road map.
  • MHCA allows the model to “attend” to relevant parts of the scene (e.g., looking at the car in front of you) while deciding how to refine the trajectory.

3. The Branched GMM Head

This is the critical innovation. Standard diffusion models can be slow to sample from and might still suffer from mode collapse if not carefully guided. BranchOut appends a GMM Head (\(\mathcal{G}\)) to the end of the denoising process.

The model is conditioned on a high-level command \(c\) (e.g., Turn Left, Go Straight, Turn Right). For the selected command, the network predicts \(K\) different Gaussian components.

Equation 3: GMM Output
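
Written out as a density over future ego trajectories for the selected command \(m\), a plausible form is the following; the covariance \(\Sigma_k^m\) is included only to make the mixture well-defined and may be fixed, isotropic, or also predicted in the actual model:

\[
p\big(\mathbf{Y}_{\text{ego}} \mid c = m\big) \;=\; \sum_{k=1}^{K} \pi_k^{m}\, \mathcal{N}\big(\mathbf{Y}_{\text{ego}};\, \mu_k^{m},\, \Sigma_k^{m}\big), \qquad \sum_{k=1}^{K} \pi_k^{m} = 1.
\]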

For every command, the model outputs:

  • \(\mu_k^m\): The mean trajectory for the \(k\)-th mode.
  • \(\pi_k^m\): The probability weight of that mode.

This hybrid approach allows BranchOut to capture the “multimodal” nature of driving explicitly. It doesn’t just say “go here”; it says “here are \(K\) likely paths, and here is how confident I am in each.”
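
To make the “\(K\) likely paths plus confidences” idea concrete, here is a small sketch of how a downstream consumer of the GMM head’s output might pick the most likely mode, or keep the top few so a safety checker can veto one and fall back to the next. The function and argument names are illustrative assumptions, not from the paper.

```python
import torch

def select_trajectory(means: torch.Tensor, weights: torch.Tensor, top_k: int = 3):
    """means: (B, K, T, 2) mode trajectories; weights: (B, K) mode probabilities."""
    # Most likely single plan: the mode with the highest predicted weight.
    best = weights.argmax(dim=-1)                                   # (B,)
    best_traj = means[torch.arange(means.size(0)), best]            # (B, T, 2)
    # Keep the top-k modes as ranked alternatives for a downstream safety check.
    topk_w, topk_idx = weights.topk(top_k, dim=-1)                  # (B, top_k)
    idx = topk_idx[..., None, None].expand(-1, -1, means.size(2), means.size(3))
    topk_traj = torch.gather(means, 1, idx)                         # (B, top_k, T, 2)
    return best_traj, topk_traj, topk_w
```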

Training Objective

The model is trained using a composite loss function:

Equation 4: Loss Function
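
Combining the three terms described below, a plausible form is a weighted sum; the weights \(\lambda_1\) and \(\lambda_2\) are an assumption, and the paper may balance the terms differently:

\[
\mathcal{L} \;=\; \mathcal{L}_{\text{plan}} \;+\; \lambda_1\, \mathcal{L}_{\text{NLL}} \;+\; \lambda_2\, \mathcal{L}_{\text{constraints}}.
\]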

  • \(\mathcal{L}_{\text{plan}}\): Standard reconstruction loss (make the path look like the ground truth).
  • \(\mathcal{L}_{\text{NLL}}\): Negative Log-Likelihood. This maximizes the probability of the ground truth path under the predicted distribution.
  • \(\mathcal{L}_{\text{constraints}}\): Safety constraints (like not hitting curbs).

A New Challenge: Validating Multimodality

Building a multimodal model is only half the battle. How do you prove it works?

If you evaluate BranchOut on standard benchmarks (like nuScenes), you run into the “Single Future” problem mentioned earlier. If BranchOut predicts a valid overtaking maneuver but the dataset shows the car braking, standard metrics will say the model failed.

To fix this, the researchers created a Human-in-the-Loop (HITL) Simulation Benchmark.

Simulating Reality

They used HUGSIM, a photorealistic simulator created from real-world driving logs. They then invited 40 human participants to “re-drive” scenes from the nuScenes dataset using a driving simulator setup.

Crucially, the participants drove the same scenes multiple times. This resulted in a dataset where a single scenario might have 15 different, valid ground-truth trajectories.

Figure 3: Our Multimodal Benchmark Statistics with Higher Coverage and Diversity. Existing unimodal real-world trajectories lack diversity and coverage of modes (left). The collected trajectories, validated as both diverse and realistic (Sec. 3.2, Table 1), enable multimodal evaluation (right).

Figure 3 illustrates the difference. The left plot shows the sparse, single paths from the original dataset. The right plot shows the dense, diverse “fan” of trajectories collected from human drivers in the simulation. This density allows for a much fairer evaluation of multimodal planners.

Is the Simulation Realistic?

You might wonder: “Is driving in a simulator really the same as driving a real car?” The authors validated this by comparing their simulated trajectories against the real-world logs.

Table 1: Realism of Collected Trajectories in Simulation. Our simulated trajectories are multimodal and diverse, yet consistently include at least one mode that closely matches the real-world reference from nuScenes, achieving low L2 error at 3s (0.79m). Low Fréchet scores further demonstrate their realism across both photorealistic and digital twin environments.

As shown in Table 1, the simulated trajectories achieved a very low L2 error (0.79m) when compared to the real logs. This means that among the diverse paths humans took in the simulator, at least one of them was usually very close to what the real driver did. This validates the simulation as a reliable proxy for reality.

Experimental Results

With the model built and the new benchmark ready, the researchers compared BranchOut against state-of-the-art planners like UniAD, VAD, and DiffusionDrive.

1. Open-Loop Evaluation (Trajectory Accuracy)

First, they looked at standard accuracy metrics.

  • L2 Error: The standard distance between predicted and actual path.
  • Fréchet Distance: A metric better suited for comparing distributions of paths (multimodality).
  • NLL: Negative Log-Likelihood (how well the model explains the data).

Table 2: Planning Performance Comparison Leveraging Multimodal Ground-Truth. We compare models using the single annotation from nuScenes and our multimodal annotations. Results show notable re-ordering in multimodal metrics (e.g., VAD vs. UniAD), where unimodal L2 penalizes plausible predictions, underscoring the need for multimodal evaluation. BranchOut significantly enhances multimodal reasoning by capturing plausible driving behaviors while preventing mode collapse.

Table 2 reveals a fascinating insight. When using standard unimodal metrics (comparing against just one ground truth), older models like UniAD look superior. However, when switching to the multimodal evaluation (comparing against the full set of human trajectories), the leaderboard flips.

BranchOut dominates in Fréchet distance and NLL. This proves that while other models might be overfitting to the specific path in the dataset, BranchOut is successfully capturing the broader range of safe, human-like behaviors.
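
To make the distinction concrete, here is a small, self-contained sketch of a unimodal L2 error (distance to the single logged trajectory), a multimodal min-over-modes L2 (distance to the closest of many human trajectories), and a discrete Fréchet distance. The exact metric definitions used in the paper may differ; this is only meant to illustrate why the two evaluations can rank models differently.

```python
import numpy as np

def l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean pointwise L2 distance between two (T, 2) trajectories."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def multimodal_l2(pred: np.ndarray, human_trajs: list) -> float:
    """Distance to the *closest* of several valid human trajectories (min over modes)."""
    return min(l2_error(pred, gt) for gt in human_trajs)

def discrete_frechet(p: np.ndarray, q: np.ndarray) -> float:
    """Discrete Frechet distance between two (T, 2) polylines via dynamic programming."""
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise point distances
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                ca[i - 1, j] if i > 0 else np.inf,
                ca[i, j - 1] if j > 0 else np.inf,
                ca[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            ca[i, j] = max(prev, d[i, j])
    return float(ca[-1, -1])

# A plan matching one valid human behavior but not the logged trajectory scores
# poorly under unimodal L2, yet well under the multimodal (min-over-modes) metric.
```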

2. Closed-Loop Evaluation (Driving Performance)

Open-loop metrics are useful, but the real test is letting the model drive in the simulator (Closed-Loop). Does it crash? Does it reach the destination?

The researchers used the HUGSIM Driving Score (HD-Score), which combines safety, comfort, and progress.

Table 3: Closed-Loop Evaluation in HUGSIM. Results are averaged across all difficulty levels. BranchOut demonstrates robust route completion, resulting in the best overall HD-Score.

Table 3 shows that BranchOut achieves the highest Route Completion (\(R_c\)) and HD-Score. It significantly outperforms UniAD and DiffusionDrive. The authors attribute this to the model’s ability to plan multiple potential futures, allowing it to adapt better to dynamic agents and complex road layouts compared to deterministic planners.

Conclusion and Key Takeaways

The BranchOut paper makes a compelling case that the future of autonomous driving lies in acknowledging uncertainty. By forcing models to predict a single path, we have been artificially limiting their intelligence.

Here are the key takeaways for students and practitioners:

  1. Architecture Matters: Combining the generative power of Diffusion with the structured output of GMMs creates a planner that is both expressive and precise.
  2. Evaluation is Hard: Standard metrics (like L2 error against a single ground truth) can be misleading. They penalize diversity even when the alternative behavior is perfectly safe.
  3. Data Augmentation via Simulation: We can’t easily collect 15 real-world cars driving the same street at the same time. However, Human-in-the-Loop simulation offers a scalable way to build the diverse datasets needed to train and test the next generation of AVs.

BranchOut moves us a step closer to autonomous vehicles that drive not just like robots following a rail, but like humans navigating a complex, ever-changing world.