Imagine you are driving down a narrow street. You see a delivery truck parked on the right and a ball rolling into the street from behind it. You don’t just “detect a ball”; you immediately simulate a future where a child might be chasing that ball, and you instinctively prepare to brake. This ability to reason about the scene and anticipate the future is second nature to humans.
However, for autonomous vehicles (AVs), this is incredibly difficult. Most modern AV systems rely on End-to-End (E2E) Imitation Learning. They look at millions of hours of human driving and try to copy the steering and pedal inputs given similar visual inputs. While this works for driving straight down a highway, it often fails in “closed-loop” scenarios—complex, interactive environments where the car’s actions change the state of the world, and where one small mistake can compound into a crash.
In this post, we will take a deep dive into ReasonPlan, a new framework proposed by researchers from the Chinese Academy of Sciences and EACON. ReasonPlan attempts to solve the fragility of imitation learning by integrating Multimodal Large Language Models (MLLMs). Instead of blindly copying actions, ReasonPlan forces the vehicle to predict the future visual scene and engage in a structured chain-of-thought before making a decision.
The Problem: The Limits of Imitation
Before we dissect the solution, we need to understand the bottleneck. Traditional E2E autonomous driving treats the driving task as a massive regression problem: map pixels to trajectory points.
While efficient, these systems suffer from:
- Causal Confusion: They might learn that “brake lights ahead” means “stop,” but they don’t understand why the car ahead stopped.
- Poor Generalization: If they encounter a scenario not present in their training data (an “out-of-distribution” case), they often freeze or behave erratically.
- Lack of Interpretability: When a neural network swerves, we rarely know why.
Recently, Multimodal Large Language Models (MLLMs) like GPT-4V or LLaVA have shown an incredible ability to reason about images. However, simply plugging an LLM into a car hasn’t worked well. Most attempts operate in an “open-loop” setting (answering questions about a static image) or lack the spatial precision required to actually drive a car safely.
Enter ReasonPlan: A Dual-Mechanism Approach
The authors of ReasonPlan propose a unified framework that fine-tunes an MLLM for closed-loop driving. Their core hypothesis is that a truly robust driver needs two specific capabilities:
- Visual Foresight: The ability to imagine what the world will look like in the future based on current actions.
- Explicit Reasoning: The ability to verbally articulate the state of the world, risks, and decisions.
Figure 2 below provides a high-level overview of the architecture.

The system takes multi-view camera images and vehicle telemetry (speed, acceleration) as input. These pass through encoders and projectors to translate visual data into “tokens” the LLM can understand. The magic happens in the two parallel branches: the Next Scene Prediction (NSP) and the Decision Chain-of-Thought (DeCoT).
Let’s break these down step-by-step.
1. Self-Supervised Next Scene Prediction (NSP)
One of the biggest challenges in using LLMs for robotics is grounding—ensuring the model’s internal representation of the world matches physical reality. To enforce this, ReasonPlan introduces a self-supervised task called Next Scene Prediction.
The idea is simple but powerful: If you understand the scene, you should be able to predict what it will look like 3 seconds from now.
As shown in Figure 3, the model takes the current visual tokens and the ego-vehicle’s state (velocity, acceleration, command) and tries to predict the latent visual features of the future frame.

The Visual Encoder
First, the system processes the raw images. The authors use a SigLIP vision encoder. To preserve high-resolution detail, they use an “AnyRes” strategy that splits each image into a grid of fixed-size patches.
\[
\mathbf{Z}_{v_t} = \mathtt{SigLIP}\big(\mathtt{AnyRes}(\mathbf{X}_{v_t})\big), \qquad \mathbf{H}_{v_t} = \mathtt{MLP}\big(\mathbf{Z}_{v_t}\big),
\]
In simple terms, this equation transforms the raw pixels (\(\mathbf{X}_{v_t}\)) into a compressed feature vector (\(\mathbf{H}_{v_t}\)) that aligns with the language model’s embedding space.
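To make this concrete, here is a minimal sketch of the visual branch in PyTorch. It is an illustration of the idea, not the authors’ implementation: the module names, the tile size, and the assumption that the ViT returns patch features of shape (tiles, patches, dim) are all placeholders.

```python
import torch
import torch.nn as nn

class VisualTokenizer(nn.Module):
    """Sketch: AnyRes tiling -> SigLIP-style encoder -> MLP projector into the
    LLM embedding space. Shapes, dims, and names are illustrative assumptions."""

    def __init__(self, vision_encoder: nn.Module, vit_dim: int = 1152, llm_dim: int = 896):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a pretrained SigLIP ViT (assumed interface)
        self.projector = nn.Sequential(             # the MLP in H_v = MLP(Z_v)
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def any_res(self, image: torch.Tensor, tile: int = 384) -> torch.Tensor:
        """Crop a high-resolution image (C, H, W) into fixed-size tiles.
        Assumes H and W are multiples of the tile size."""
        c, h, w = image.shape
        tiles = [image[:, i:i + tile, j:j + tile]
                 for i in range(0, h - tile + 1, tile)
                 for j in range(0, w - tile + 1, tile)]
        return torch.stack(tiles)                   # (num_tiles, C, tile, tile)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        tiles = self.any_res(image)                 # AnyRes(X_v)
        z_v = self.vision_encoder(tiles)            # Z_v: (num_tiles, num_patches, vit_dim)
        h_v = self.projector(z_v)                   # H_v: aligned with the LLM embedding space
        return h_v.flatten(0, 1)                    # one long sequence of visual tokens
```

The same tokenizer would be applied to every camera view, and the resulting token sequences concatenated before being fed to the language model.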
Context Encoding
Simultaneously, the car’s physical state is encoded. This is crucial because what the scene looks like in 3 seconds depends heavily on how fast you are moving and whether you are turning.
\[
\mathbf{H}_{c_t} = \mathtt{MLP}(\boldsymbol{v}, \boldsymbol{a}, \mathrm{cmd}),
\]
The Prediction Loss
The model then attempts to predict the future visual features \(\hat{\mathbf{H}}_{v_{t+3}}\). The training objective is to minimize the difference between this predicted future and the actual future image features (obtained during training from the recorded log).
\[
\hat{\mathbf{H}}_{v_{t+3}} = \mathrm{LLM}\big(\mathrm{Concat}(\mathbf{H}_{c_t}, \mathbf{H}_{v_t})\big), \qquad \mathcal{L}_{\mathrm{image}} = \big\| \hat{\mathbf{H}}_{v_{t+3}}[\mathrm{front}] - \mathbf{H}_{v_{t+3}}[\mathrm{front}] \big\|^{2},
\]
By using a Mean Squared Error (MSE) loss on the latent features, the model learns the dynamics of the environment (e.g., “if I move forward at 10 m/s, that car will get larger”) without needing manual labels.
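Here is a rough sketch of how the NSP branch could be wired up. The assumptions are mine: an 8-dimensional ego-state vector (velocity, acceleration, one-hot command), a Hugging Face-style decoder that accepts `inputs_embeds` and returns hidden states, and the convention that the last positions of the sequence predict the future front view.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextScenePrediction(nn.Module):
    """Sketch of the NSP branch under stated assumptions; not the released code."""

    def __init__(self, llm_backbone: nn.Module, llm_dim: int = 896):
        super().__init__()
        self.llm = llm_backbone
        # H_c = MLP(v, a, cmd): assumed 1 + 1 + 6 = 8 input features
        self.context_mlp = nn.Sequential(
            nn.Linear(8, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, h_v_t, ego_state, h_v_future_front):
        # h_v_t:            (B, N_vis, llm_dim)   current visual tokens
        # ego_state:        (B, 8)                 concatenated [v, a, cmd]
        # h_v_future_front: (B, N_front, llm_dim)  front-view tokens at t+3 s (from the log)
        h_c_t = self.context_mlp(ego_state).unsqueeze(1)          # (B, 1, llm_dim)
        tokens = torch.cat([h_c_t, h_v_t], dim=1)                 # Concat(H_c, H_v)
        hidden = self.llm(inputs_embeds=tokens).last_hidden_state # LLM(...)
        # Assumption: the trailing N_front positions are supervised to predict the future front view.
        n_front = h_v_future_front.shape[1]
        h_hat_future = hidden[:, -n_front:, :]
        return F.mse_loss(h_hat_future, h_v_future_front)         # L_image
```

Because the supervision target comes from the next recorded frame, this loss needs no human labels at all.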
2. Decision Chain-of-Thought (DeCoT)
While NSP gives the model visual intuition, it doesn’t tell the car how to drive legally or safely. This is where the Decision Chain-of-Thought comes in.
Standard models map inputs directly to trajectory coordinates. ReasonPlan, however, forces the LLM to “talk through” the problem first. This leverages the LLM’s vast semantic knowledge to handle complex scenarios.
The process is structured into four distinct reasoning stages:
- Scene Understanding: Identify lanes, drivable areas, and weather.
- Traffic Sign Recognition: Spot stop signs, traffic lights, and warnings.
- Critical Object Identification: Identify which pedestrian or car is actually a risk (not just listing all objects).
- Meta Action: Decide on a high-level maneuver (e.g., “Change Lane Left”, “Decelerate”).
Only after these four steps does the model output the final Trajectory Waypoints.
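To make the four-stage structure tangible, here is an illustrative example of the kind of structured output DeCoT is supervised to produce. The field names, wording, and waypoint values are invented for illustration; they are not the paper’s exact prompt or annotation format.

```python
# Illustrative only: one possible DeCoT-style response, stage by stage.
decot_response = {
    "scene_understanding": "Two-lane urban road, clear weather; ego vehicle is in the right lane.",
    "traffic_signs": "Traffic light ahead is green; no stop signs or warnings detected.",
    "critical_object": "A pedestrian at the right curb is moving toward the crosswalk.",
    "meta_action": "Decelerate and prepare to yield.",
    # Final output: (x, y) waypoints in the ego frame over the planning horizon.
    "waypoints": [(0.0, 1.8), (0.1, 3.4), (0.1, 4.7), (0.2, 5.6)],
}
```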
The authors constructed a specific dataset to train this capability, which we will discuss shortly. The loss function here is a standard Cross-Entropy loss on the text tokens, supervising the model to generate the correct reasoning steps.
\[
p(\mathbf{X}_{a} \mid \mathbf{X}_{v}, \mathbf{X}_{p}) = \prod_{i=1}^{L} p(\boldsymbol{x}_{i} \mid \mathbf{X}_{v}, \mathbf{X}_{p,<i}, \mathbf{X}_{a,<i}), \qquad \mathcal{L}_{\mathrm{text}} = -\log p(\mathbf{X}_{a} \mid \mathbf{X}_{v}, \mathbf{X}_{p}),
\]
Unified Training
The final piece of the puzzle is combining these two objectives. The model is trained to minimize both the image prediction error and the reasoning text error simultaneously:
\[
\mathcal{L}_{\mathrm{total}} = \lambda_{1} \cdot \mathcal{L}_{\mathrm{image}} + \lambda_{2} \cdot \mathcal{L}_{\mathrm{text}}
\]
This dual objective ensures that the visual embeddings used for reasoning are also rich in temporal, dynamic information.
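In code, the combined objective is just a weighted sum of the two losses. A minimal sketch is below; the weighting values and the `ignore_index` masking convention are placeholders, not the paper’s hyperparameters.

```python
import torch.nn.functional as F

def reasonplan_loss(text_logits, text_labels, pred_future_feats, true_future_feats,
                    lambda_image: float = 1.0, lambda_text: float = 1.0):
    """Sketch of L_total = lambda_1 * L_image + lambda_2 * L_text (weights are placeholders)."""
    # L_text: next-token cross-entropy over the reasoning and trajectory tokens
    loss_text = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),  # (B*L, vocab)
        text_labels.reshape(-1),                        # (B*L,)
        ignore_index=-100,                              # assumed mask for prompt/visual positions
    )
    # L_image: MSE between predicted and actual future front-view features
    loss_image = F.mse_loss(pred_future_feats, true_future_feats)
    return lambda_image * loss_image + lambda_text * loss_text
```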
The Fuel: PDR Dataset
A model is only as good as its data. Existing autonomous driving datasets are often just video logs with trajectory labels. They lack the “reasoning” component—the why.
The researchers created the Planning-oriented Decision Reasoning (PDR) dataset. They used an automated pipeline on the Bench2Drive simulator to generate 210,000 samples.

As seen in Figure 4 above, the dataset annotates the scene in rich detail. Notice how it explicitly states: “The nearest obstacle to the ego vehicle is car… Its motion state is dynamic… Reasoning advises caution.” This structure bridges the gap between raw pixels and the final numerical trajectory (the coordinates at the bottom).
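If you imagine how such a sample might be serialized, it could look roughly like the record below. The field names and values are hypothetical (the released schema may differ); the point is that every frame pairs sensor inputs and ego state with reasoning text, a meta action, and the expert trajectory from Bench2Drive.

```python
# Hypothetical layout of a single PDR training sample (field names are guesses).
pdr_sample = {
    "images": {"front": "rgb_front/000123.png", "front_left": "...", "front_right": "..."},
    "ego_state": {"velocity": 6.2, "acceleration": -0.4, "command": "LANE_FOLLOW"},
    "reasoning": (
        "The nearest obstacle to the ego vehicle is a car ahead in the same lane. "
        "Its motion state is dynamic. Reasoning advises caution and a reduced speed."
    ),
    "meta_action": "Decelerate",
    "trajectory": [(0.0, 1.2), (0.0, 2.3), (0.1, 3.2), (0.1, 4.0)],
}
```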
Data statistics show the diversity of this dataset, focusing on key driving concepts like “ego,” “lane,” and “obstacle.”

Experiments & Results
So, does adding imagination (NSP) and reasoning (DeCoT) actually make the car drive better? The authors evaluated ReasonPlan on Bench2Drive, a challenging closed-loop benchmark that requires completing complex routes in a simulator.
Quantitative Performance
The results are summarized in the radar chart below. ReasonPlan (purple line) encloses the curves of the other methods, indicating superior performance across almost all metrics.

Looking at the detailed numbers in Table 1, ReasonPlan achieves a Driving Score (DS) of 64.01, which is significantly higher than leading imitation learning methods like VAD and UniAD. Notably, it improves the Success Rate (SR) to 34.55%, doubling the performance of some baselines.

The open-loop L2 error is also worth noting: ReasonPlan achieves the lowest value (0.61 m), meaning its predicted trajectories align very closely with expert human driving.
Generalization to Unseen Scenarios
The true test of intelligence is handling the unknown. The researchers tested the model on DOS (DriveOcclusionSim), a benchmark filled with “corner cases” like pedestrians suddenly popping out from behind parked cars or obstructed left turns. Crucially, the model was never trained on this data (Zero-Shot).

As Table 2 shows, ReasonPlan achieved an average driving score of 78.02, drastically outperforming baselines that hovered around 57-71. This proves that the reasoning capability allows the model to adapt to new, dangerous situations without specific training.
Qualitative Analysis
Seeing is believing. In Figure 5, we see a comparison between ReasonPlan and baseline methods (VAD, UniAD).

- Left Case (Junction): Baseline methods stalled at a green light, likely confused by the complex visual signals. ReasonPlan correctly reasoned about the signal change and proceeded.
- Right Case (Pedestrian): This is the classic “ball chasing” scenario. A pedestrian emerges suddenly. VAD and UniAD failed to react in time, resulting in a collision or near-miss. ReasonPlan anticipated the risk, decelerated early, and stopped safely.
Why does it work? (Ablation Studies)
The authors performed “ablation studies” to verify if both the NSP and DeCoT modules were necessary.

- Row 2 vs 1: Adding Next Scene Prediction (NSP) alone improved the Driving Score from 41.84 to 52.61. This confirms that understanding visual dynamics helps planning.
- Row 3 vs 1: Adding Reasoning (DeCoT) alone improved it to 53.97.
- Row 4: Combining both resulted in the highest score (57.83).
This synergistic effect suggests that visual foresight and semantic reasoning are complementary skills for autonomous driving.
Conclusion and Future Outlook
ReasonPlan represents a significant step forward in End-to-End autonomous driving. By moving away from pure imitation and towards a framework that integrates visual prediction and structured reasoning, the system achieves state-of-the-art results in closed-loop environments.
Key takeaways for students and researchers:
- Structure Matters: Feeding raw images to an LLM isn’t enough. Structuring the output into logical steps (Scene -> Signs -> Action) drastically improves reliability.
- Prediction is Understanding: The self-supervised Next Scene Prediction task is a powerful way to force the model to learn physics and dynamics without expensive labeling.
- Language is a Control Interface: Using natural language as an intermediate representation allows for better interpretability and generalization to zero-shot scenarios.
While ReasonPlan shows great promise, it relies on a 0.5B parameter model. As hardware improves, one can imagine what a 7B or 70B parameter model—with even deeper reasoning capabilities—could achieve in the driver’s seat. For now, ReasonPlan demonstrates that teaching cars to “think” before they drive is a winning strategy.