Imagine you are driving down a narrow street. You see a delivery truck parked on the right and a ball rolling into the street from behind it. You don’t just “detect a ball”; you immediately simulate a future where a child might be chasing that ball, and you instinctively prepare to brake. This ability to reason about the scene and anticipate the future is second nature to humans.

However, for autonomous vehicles (AVs), this is incredibly difficult. Most modern AV systems rely on End-to-End (E2E) Imitation Learning. They look at millions of hours of human driving and try to copy the steering and pedal inputs given similar visual inputs. While this works for driving straight down a highway, it often fails in “closed-loop” scenarios—complex, interactive environments where the car’s actions change the state of the world, and where one small mistake can compound into a crash.

In this post, we will take a deep dive into ReasonPlan, a new framework proposed by researchers from the Chinese Academy of Sciences and EACON. ReasonPlan attempts to solve the fragility of imitation learning by integrating Multimodal Large Language Models (MLLMs). Instead of blindly copying actions, ReasonPlan forces the vehicle to predict the future visual scene and engage in a structured chain-of-thought before making a decision.

The Problem: The Limits of Imitation

Before we dissect the solution, we need to understand the bottleneck. Traditional E2E autonomous driving treats the driving task as a massive regression problem: map pixels to trajectory points.

While efficient, these systems suffer from:

  1. Causal Confusion: They might learn that “brake lights ahead” means “stop,” but they don’t understand why the car ahead stopped.
  2. Poor Generalization: If they encounter a scenario not present in their training data (an “out-of-distribution” case), they often freeze or behave erratically.
  3. Lack of Interpretability: When a neural network swerves, we rarely know why.

Recently, Multimodal Large Language Models (MLLMs) like GPT-4V or LLaVA have shown an incredible ability to reason about images. However, simply plugging an LLM into a car hasn’t worked well. Most attempts operate in an “open-loop” setting (answering questions about a static image) or lack the spatial precision required to actually drive a car safely.

Enter ReasonPlan: A Dual-Mechanism Approach

The authors of ReasonPlan propose a unified framework that fine-tunes an MLLM for closed-loop driving. Their core hypothesis is that a truly robust driver needs two specific capabilities:

  1. Visual Foresight: The ability to imagine what the world will look like in the future based on current actions.
  2. Explicit Reasoning: The ability to verbally articulate the state of the world, risks, and decisions.

Figure 2 below provides a high-level overview of the architecture.

Figure 2: The pipeline of ReasonPlan, a holistic reasoning framework for closed-loop driving. It consists of two main modules: (a) next scene prediction, conditioned on current context information, to enhance scene representation and understanding; and (b) the supervised decision CoT process that produces the final planning trajectory. Panel (c) shows the two training stages.

The system takes multi-view camera images and vehicle telemetry (speed, acceleration) as input. These pass through encoders and projectors to translate visual data into “tokens” the LLM can understand. The magic happens in the two parallel branches: the Next Scene Prediction (NSP) and the Decision Chain-of-Thought (DeCoT).

Let’s break these down step-by-step.

1. Self-Supervised Next Scene Prediction (NSP)

One of the biggest challenges in using LLMs for robotics is grounding—ensuring the model’s internal representation of the world matches physical reality. To enforce this, ReasonPlan introduces a self-supervised task called Next Scene Prediction.

The idea is simple but powerful: If you understand the scene, you should be able to predict what it will look like 3 seconds from now.

As shown in Figure 3, the model takes the current visual tokens and the ego-vehicle’s state (velocity, acceleration, command) and tries to predict the latent visual features of the future frame.

Figure 3: The process of the NSP task.

The Visual Encoder

First, the system processes the raw images. The authors use a SigLIP vision encoder. To handle high-resolution details, they use an “AnyRes” strategy, cropping the images into grids (patches).

\[ \mathbf{Z}_{v_t} = \mathtt{SigLIP}\big(\mathtt{AnyRes}(\mathbf{X}_{v_t})\big), \qquad \mathbf{H}_{v_t} = \mathtt{MLP}\big(\mathbf{Z}_{v_t}\big), \]

In simple terms, this equation transforms the raw pixels (\(\mathbf{X}_{v_t}\)) into a compressed feature vector (\(\mathbf{H}_{v_t}\)) that aligns with the language model’s embedding space.
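To make this concrete, here is a minimal PyTorch sketch of the encode-and-project step. `ToyViT` is a stand-in for the real SigLIP encoder, `anyres_crop` is a simplified stand-in for the AnyRes grid-cropping strategy, and all dimensions are illustrative rather than the authors' actual configuration.

```python
import torch
import torch.nn as nn


def anyres_crop(image: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """Simplified AnyRes stand-in: split one image (C, H, W) into grid*grid crops."""
    c, h, w = image.shape
    patches = image.unfold(1, h // grid, h // grid).unfold(2, w // grid, w // grid)
    return patches.permute(1, 2, 0, 3, 4).reshape(grid * grid, c, h // grid, w // grid)


class ToyViT(nn.Module):
    """Placeholder for the SigLIP vision encoder: one token per 16x16 patch."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, x):                      # x: (N, 3, H, W)
        z = self.patchify(x)                   # (N, dim, H/16, W/16)
        return z.flatten(2).transpose(1, 2)    # (N, num_tokens, dim)


encoder = ToyViT(dim=256)
projector = nn.Sequential(nn.Linear(256, 512), nn.GELU(), nn.Linear(512, 512))

image = torch.randn(3, 224, 224)   # one camera view X_{v_t}
crops = anyres_crop(image)         # AnyRes-style grid crops
Z_v = encoder(crops)               # visual features Z_{v_t}
H_v = projector(Z_v)               # LLM-aligned visual tokens H_{v_t}
print(H_v.shape)                   # torch.Size([4, 49, 512])
```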

Context Encoding

Simultaneously, the car’s physical state is encoded. This is crucial because what the scene looks like in 3 seconds depends heavily on how fast you are moving and whether you are turning.

\[ \mathbf{H}_{c_t} = \mathtt{MLP}(\boldsymbol{v}, \boldsymbol{a}, \mathrm{cmd}), \]

The Prediction Loss

The model then attempts to predict the future visual features \(\hat{\mathbf{H}}_{v_{t+3}}\). The training objective is to minimize the difference between this predicted future and the actual future image features (obtained during training from the recorded log).

\[ \hat{\mathbf{H}}_{v_{t+3}} = \mathrm{LLM}\big(\mathrm{Concat}(\mathbf{H}_{c_t}, \mathbf{H}_{v_t})\big), \qquad \mathcal{L}_{\mathrm{image}} = \big\| \hat{\mathbf{H}}_{v_{t+3}}[\mathrm{front}] - \mathbf{H}_{v_{t+3}}[\mathrm{front}] \big\|^2, \]

By using a Mean Squared Error (MSE) loss on the latent features, the model learns the dynamics of the environment (e.g., “if I move forward at 10m/s, that car will get larger”) without needing manual labels.
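The sketch below pulls the NSP pieces together: encoding the ego state, concatenating it with the current visual tokens, and regressing the future features with an MSE loss. The LLM backbone is abstracted as any module mapping token sequences to hidden states; the state dimensions, token counts, and 3-second horizon bookkeeping are placeholders, not the paper's exact values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512

# Context encoder: velocity (1) + acceleration (1) + one-hot command (assumed 6 classes) = 8 dims.
context_mlp = nn.Sequential(nn.Linear(8, dim), nn.GELU(), nn.Linear(dim, dim))

# Stand-in for the LLM backbone: any module mapping (B, T, dim) -> (B, T, dim).
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2
)

B, n_vis = 2, 49
H_v_t = torch.randn(B, n_vis, dim)        # current visual tokens H_{v_t}
H_v_future = torch.randn(B, n_vis, dim)   # target front-view features H_{v_{t+3}} from the log

state = torch.randn(B, 8)                 # [v, a, cmd one-hot]
H_c_t = context_mlp(state).unsqueeze(1)   # context token H_{c_t}: (B, 1, dim)

# Predict the future visual features conditioned on context + current scene.
hidden = llm(torch.cat([H_c_t, H_v_t], dim=1))
H_v_pred = hidden[:, 1:, :]               # drop the context position

loss_image = F.mse_loss(H_v_pred, H_v_future)   # L_image: MSE in latent feature space
print(loss_image.item())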

2. Decision Chain-of-Thought (DeCoT)

While NSP gives the model visual intuition, it doesn’t tell the car how to drive legally or safely. This is where the Decision Chain-of-Thought comes in.

Standard models map inputs directly to trajectory coordinates. ReasonPlan, however, forces the LLM to “talk through” the problem first. This leverages the LLM’s vast semantic knowledge to handle complex scenarios.

The process is structured into four distinct reasoning stages:

  1. Scene Understanding: Identify lanes, drivable areas, and weather.
  2. Traffic Sign Recognition: Spot stop signs, traffic lights, and warnings.
  3. Critical Object Identification: Identify which pedestrian or car is actually a risk (not just listing all objects).
  4. Meta Action: Decide on a high-level maneuver (e.g., “Change Lane Left”, “Decelerate”).

Only after these four steps does the model output the final Trajectory Waypoints.
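To illustrate what a supervised reasoning target might look like, here is a hypothetical four-stage DeCoT answer assembled in Python. The field names, wording, and waypoint values are illustrative only and are not copied from the PDR annotations.

```python
# Hypothetical example of a structured DeCoT target (illustrative wording only).
decot_answer = {
    "scene_understanding": "Two-lane urban road, clear weather, ego vehicle in the right lane.",
    "traffic_signs": "Traffic light ahead is green; no stop signs or warnings.",
    "critical_objects": "A pedestrian near the right curb is moving toward the crosswalk.",
    "meta_action": "Decelerate and keep lane.",
    "trajectory": [(0.0, 0.0), (1.8, 0.0), (3.4, 0.1), (4.8, 0.1)],  # waypoints (x, y) in meters
}

# Serialize the stages into the text sequence the LLM is supervised to generate.
target_text = "\n".join(f"{key}: {value}" for key, value in decot_answer.items())
print(target_text)
```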

The authors constructed a specific dataset to train this capability, which we will discuss shortly. The loss function here is a standard Cross-Entropy loss on the text tokens, supervising the model to generate the correct reasoning steps.

\[ p(\mathbf{X}_a \mid \mathbf{X}_v, \mathbf{X}_p) = \prod_{i=1}^{L} p(\boldsymbol{x}_i \mid \mathbf{X}_v, \mathbf{X}_{p,<i}, \mathbf{X}_{a,<i}), \qquad \mathcal{L}_{\mathrm{text}} = -\log p(\mathbf{X}_a \mid \mathbf{X}_v, \mathbf{X}_p), \]
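The key detail in this objective is that the loss is computed only over the answer tokens \(\mathbf{X}_a\), while the visual and prompt tokens merely condition the prediction. A minimal sketch of that masking, assuming toy token IDs and logits from an arbitrary language-model head:

```python
import torch
import torch.nn.functional as F

vocab = 1000
B, L_prompt, L_answer = 2, 10, 6

# Toy logits for the full sequence (prompt + answer), as a language-model head would emit.
logits = torch.randn(B, L_prompt + L_answer, vocab)
answer_ids = torch.randint(0, vocab, (B, L_answer))

# Labels: -100 masks prompt positions so only answer tokens contribute to the loss.
labels = torch.full((B, L_prompt + L_answer), -100, dtype=torch.long)
labels[:, L_prompt:] = answer_ids

# Standard next-token shift: position i predicts token i+1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = labels[:, 1:].reshape(-1)

loss_text = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
print(loss_text.item())
```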

Unified Training

The final piece of the puzzle is combining these two objectives. The model is trained to minimize both the image prediction error and the reasoning text error simultaneously:

\[ \mathcal{L}_{\mathrm{total}} = \lambda_1 \cdot \mathcal{L}_{\mathrm{image}} + \lambda_2 \cdot \mathcal{L}_{\mathrm{text}} \]

This dual objective ensures that the visual embeddings used for reasoning are also rich in temporal, dynamic information.
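Putting the two branches together, here is a hedged sketch of the joint training step. The placeholder model, placeholder losses, and equal \(\lambda\) weights are assumptions for illustration; the paper's actual weighting may differ.

```python
import torch
import torch.nn as nn

# Stand-in for the shared MLLM parameters updated by both objectives.
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Assumed loss weights; the paper's actual lambda values may differ.
lambda_1, lambda_2 = 1.0, 1.0

# In practice these come from the NSP branch (latent MSE) and the DeCoT branch
# (token cross-entropy), computed on the same forward pass.
features = model(torch.randn(4, 512))
loss_image = features.pow(2).mean()           # placeholder for L_image
loss_text = (features - 1.0).abs().mean()     # placeholder for L_text

loss_total = lambda_1 * loss_image + lambda_2 * loss_text
loss_total.backward()    # a single backward pass trains both branches jointly
optimizer.step()
optimizer.zero_grad()
```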

The Fuel: PDR Dataset

A model is only as good as its data. Existing autonomous driving datasets are often just video logs with trajectory labels. They lack the “reasoning” component—the why.

The researchers created the Planning-oriented Decision Reasoning (PDR) dataset. They used an automated pipeline on the Bench2Drive simulator to generate 210,000 samples.

Figure 4: An annotated sample of the PDR dataset.

As seen in Figure 4 above, the dataset annotates the scene in rich detail. Notice how it explicitly states: “The nearest obstacle to the ego vehicle is car… Its motion state is dynamic… Reasoning advises caution.” This structure bridges the gap between raw pixels and the final numerical trajectory (the coordinates at the bottom).

Data statistics show the diversity of this dataset, focusing on key driving concepts like “ego,” “lane,” and “obstacle.”

Figure A2: Data statistics of PDR. (a) Distribution of reasoning sentence length. (b) Key vocabulary of PDR.

Experiments & Results

So, does adding imagination (NSP) and reasoning (DeCoT) actually make the car drive better? The authors evaluated ReasonPlan on Bench2Drive, a challenging closed-loop benchmark that requires completing complex routes in a simulator.

Quantitative Performance

The results are summarized in the radar chart below. ReasonPlan (purple) encloses the other methods’ curves, indicating superior performance across almost all metrics.

Figure 1: The proposed ReasonPlan achieves leading performance on most metrics compared with E2E methods.

Looking at the detailed numbers in Table 1, ReasonPlan achieves a Driving Score (DS) of 64.01, which is significantly higher than leading imitation learning methods like VAD and UniAD. Notably, it improves the Success Rate (SR) to 34.55%, doubling the performance of some baselines.

Table 1: Planning and Multi-Ability Performance in Bench2Drive. * denotes expert feature distillation.

The open-loop L2 error is also worth noting: ReasonPlan has the lowest (0.61 m), meaning its predicted trajectories align very closely with the expert demonstrations.

Generalization to Unseen Scenarios

The true test of intelligence is handling the unknown. The researchers tested the model on DOS (DriveOcclusionSim), a benchmark filled with “corner cases” like pedestrians suddenly popping out from behind parked cars or obstructed left turns. Crucially, the model was never trained on this data (Zero-Shot).

Table 2: Planning Performance in DOS.

As Table 2 shows, ReasonPlan achieved an average driving score of 78.02, drastically outperforming baselines that hovered around 57-71. This proves that the reasoning capability allows the model to adapt to new, dangerous situations without specific training.

Qualitative Analysis

Seeing is believing. In Figure 5, we see a comparison between ReasonPlan and baseline methods (VAD, UniAD).

Figure 5: Qualitative comparison of ReasonPlan with baselines.

  • Left Case (Junction): Baseline methods stalled at a green light, likely confused by the complex visual signals. ReasonPlan correctly reasoned about the signal change and proceeded.
  • Right Case (Pedestrian): This is the classic “ball chasing” scenario. A pedestrian emerges suddenly. VAD and UniAD failed to react in time, resulting in a collision or near-miss. ReasonPlan anticipated the risk, decelerated early, and stopped safely.

Why does it work? (Ablation Studies)

The authors performed “ablation studies” to verify if both the NSP and DeCoT modules were necessary.

Table 3: Ablations on (a) each module and (b) reasoning steps (Dev10).

  • Row 2 vs 1: Adding Next Scene Prediction (NSP) alone improved the Driving Score from 41.84 to 52.61. This confirms that understanding visual dynamics helps planning.
  • Row 3 vs 1: Adding Reasoning (DeCoT) alone improved it to 53.97.
  • Row 4: Combining both resulted in the highest score (57.83).

This synergistic effect suggests that visual foresight and semantic reasoning are complementary skills for autonomous driving.

Conclusion and Future Outlook

ReasonPlan represents a significant step forward in End-to-End autonomous driving. By moving away from pure imitation and towards a framework that integrates visual prediction and structured reasoning, the system achieves state-of-the-art results in closed-loop environments.

Key takeaways for students and researchers:

  1. Structure Matters: Feeding raw images to an LLM isn’t enough. Structuring the output into logical steps (Scene -> Signs -> Action) drastically improves reliability.
  2. Prediction is Understanding: The self-supervised Next Scene Prediction task is a powerful way to force the model to learn physics and dynamics without expensive labeling.
  3. Language is a Control Interface: Using natural language as an intermediate representation allows for better interpretability and generalization to zero-shot scenarios.

While ReasonPlan shows great promise, it relies on a 0.5B parameter model. As hardware improves, one can imagine what a 7B or 70B parameter model—with even deeper reasoning capabilities—could achieve in the driver’s seat. For now, ReasonPlan demonstrates that teaching cars to “think” before they drive is a winning strategy.