Introduction

The dream of autonomous driving has been fueled by rapid advancements in Artificial Intelligence. For years, the industry relied on modular pipelines—separate systems for detecting lanes, identifying pedestrians, planning routes, and controlling the steering wheel. However, the field is shifting toward end-to-end learning, where a single neural network takes raw sensor data and outputs driving commands.

Simultaneously, we have witnessed the explosion of Large Language Models (LLMs) like GPT-4 and LLaMA. These models possess impressive reasoning capabilities and vast amounts of pretrained knowledge about the world. This raises an obvious question: can we put an LLM in the driver’s seat?

While previous works have attempted this, they often treated the LLM as a “backseat driver”—analyzing images and answering questions (open-loop tasks) rather than actually controlling the car in real-time. When applied to actual driving (closed-loop tasks), these models often struggled with “drifting”—slowly accumulating small errors until the car went off-road or collided.

Enter DriveGPT4-V2. In this deep dive, we will explore how researchers from The University of Hong Kong, Tsinghua University, and Meituan have successfully harnessed the power of Multi-modal Large Language Models (MLLMs) to control a vehicle end-to-end. We will unpack how they solved the drifting problem using a “cheating” expert teacher, how they tokenized the visual world, and why this method achieves state-of-the-art results on challenging benchmarks.

The Challenge: From Chatbots to Drivers

To understand the significance of DriveGPT4-V2, we must first understand the distinction between open-loop and closed-loop systems.

In an open-loop setting, a model is given a snapshot of the world and asked, “What would you do?” It makes a prediction, and the test ends. It’s like taking a written driving test. You might answer every question correctly, but that doesn’t guarantee you can handle a car on the highway.

In a closed-loop setting, the model’s output affects the future. If the model steers slightly left, the camera view for the next frame shifts left. If the model makes a tiny error in frame 1, it is in a slightly wrong position in frame 2. Without robust correction mechanisms, these errors compound—a phenomenon known as distribution shift or drifting. The car eventually finds itself in a situation it has never seen during training (e.g., half-off the road) and panics.
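To make the compounding concrete, here is a toy sketch (not from the paper): in open loop every frame is evaluated from the ground-truth state, so a small bias stays small, while in closed loop each new frame starts from the already drifted state, so the same bias snowballs.

```python
# Toy illustration (not from the paper): a constant 1% steering bias.
# Open loop: every frame restarts from the expert's state, so the error stays at 1%.
# Closed loop: each frame starts from the drifted state, so the error compounds.

def open_loop_error(bias_per_frame: float) -> float:
    """Per-frame error when every frame restarts from the expert's state."""
    return bias_per_frame

def closed_loop_offset(bias_per_frame: float, frames: int) -> float:
    """Accumulated offset when each frame starts from the drifted state."""
    offset = 0.0
    for _ in range(frames):
        offset += bias_per_frame * (1.0 + offset)  # the error feeds back into the next state
    return offset

print(f"open loop, any frame:         {open_loop_error(0.01):.3f}")
print(f"closed loop, after 100 frames: {closed_loop_offset(0.01, 100):.2f}")
```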

DriveGPT4-V2 is designed specifically to conquer this closed-loop challenge.

Figure 1. DriveGPT4-V2 for closed-loop autonomous driving. Taking multi-view camera images and vehicle state information as input, DriveGPT4-V2 predicts high-level vehicle decisions and converts them to low-level vehicle control signals in an end-to-end manner. DriveGPT4-V2 demonstrates outstanding effectiveness and efficiency, serving as a reliable baseline method for future research on autonomous driving with LLMs.

As shown in Figure 1, the system takes multi-view camera images and vehicle states, processes them through an LLM, and outputs low-level control signals (throttle, steer, brake) to navigate the environment dynamically.

The Architecture: Anatomy of an LLM Driver

DriveGPT4-V2 is an end-to-end autonomous driving system. This means it doesn’t have a separate “stop sign detector” or “lane follower.” It learns to drive by processing raw inputs and producing driving decisions directly. Let’s break down the architecture step-by-step.

Figure 2. DriveGPT4-V2 diagram. DriveGPT4-V2 takes multimodal input data to generate numerical control signals for end-to-end vehicle driving. The input includes multi-view images and vehicle state information. The images are transformed into the text domain via a multi-view vision tokenizer (MV-VT). The current speed of the vehicle and target point are tokenized by the LLM tokenizer. The LLM then outputs four tokens. Each token is used to predict one vehicle decision by an MLP decision head (DeciHead). These predicted decisions are subsequently converted to low-level commands via PID controllers to operate the vehicle. The LLM expert model, which shares a similar structure to DriveGPT4-V2, has access to privileged information about surroundings (shown in the purple module). The expert provides on-policy supervision to DriveGPT4-V2 to enhance closed-loop performance. This figure is best viewed in color.

The architecture, detailed in Figure 2 above, consists of three main stages: Input Processing, the LLM Backbone, and Output Decisions.

1. Multi-View Visual Tokenizer (MV-VT)

Driving requires situational awareness. A single front-facing camera isn’t enough; you need to see vehicles approaching from the sides or pedestrians stepping off curbs. DriveGPT4-V2 utilizes three cameras: Front-Left, Front, and Front-Right.

However, LLMs understand tokens (text), not pixels. To bridge this gap, the researchers employ a Multi-View Visual Tokenizer (MV-VT).

Figure 3. Multi-view visual tokenizer (MV-VT) structure. The input images consist of three front views. Each patch is processed through a visual encoder to extract features. Finally, a trained projection layer maps the downsampled feature into the text domain for further processing.

As illustrated in Figure 3, the process works like this:

  1. Image Capture: Three images (\(384 \times 384\) resolution) capture the panoramic view.
  2. Visual Encoder: A pretrained encoder (like SigLIP or CLIP) extracts rich feature maps from these images.
  3. Projection: A projection layer maps these visual features into the same embedding space as the text tokens.

This effectively translates the visual “road” into a language the LLM can “read,” preserving both the broad context (is there a turn coming up?) and critical details (is that traffic light red?).
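To make the pipeline concrete, here is a minimal PyTorch-style sketch of an MV-VT. The dummy encoder stands in for the pretrained SigLIP/CLIP backbone, and the dimensions (256-dim visual features, an 896-dim LLM embedding space) are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class DummyPatchEncoder(nn.Module):
    """Toy stand-in for the pretrained ViT backbone (SigLIP/CLIP in the paper)."""
    def __init__(self, patch_size: int = 32, feat_dim: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(3, feat_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.patchify(images)             # (N, feat_dim, H/ps, W/ps)
        return feats.flatten(2).transpose(1, 2)   # (N, num_patches, feat_dim)

class MultiViewVisualTokenizer(nn.Module):
    """Sketch of an MV-VT: encode three camera views, then project the
    features into the LLM's text-embedding space."""
    def __init__(self, encoder: nn.Module, feat_dim: int, llm_dim: int):
        super().__init__()
        self.encoder = encoder
        self.projection = nn.Linear(feat_dim, llm_dim)  # visual -> text embedding space

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        b, v, c, h, w = views.shape                        # (batch, 3 views, 3, 384, 384)
        feats = self.encoder(views.reshape(b * v, c, h, w))
        tokens = self.projection(feats)                    # (b*v, num_patches, llm_dim)
        return tokens.reshape(b, v * tokens.shape[1], -1)  # one token sequence per sample

mvvt = MultiViewVisualTokenizer(DummyPatchEncoder(), feat_dim=256, llm_dim=896)
views = torch.randn(1, 3, 3, 384, 384)   # front-left, front, front-right
visual_tokens = mvvt(views)              # (1, 432, 896): embeddings the LLM can read
```

The essential step is the final projection: after it, each image patch is just another embedding the LLM can attend to alongside ordinary text tokens.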

2. The LLM Backbone

Once the images are tokenized, they are combined with text inputs. The system feeds the LLM:

  • Visual Tokens: The translated camera images.
  • Vehicle State Tokens: The car’s current speed and the navigation target (e.g., “turn right at the next intersection”).

The LLM acts as the brain. It uses its vast pretrained knowledge to reason about the scene. However, unlike a chatbot that outputs a sentence like “You should speed up,” DriveGPT4-V2 is optimized for speed and precision.
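Mechanically, feeding the LLM just means concatenating the visual and state embeddings into one sequence before the forward pass. The sketch below reuses the illustrative dimensions from the MV-VT example; the exact prompt wording for speed and target point is an assumption, not the paper’s format.

```python
import torch

# Illustrative dimensions carried over from the MV-VT sketch above.
visual_tokens = torch.randn(1, 432, 896)  # tokenized camera views
state_tokens = torch.randn(1, 24, 896)    # embedded text such as speed and target point,
                                          # e.g. "speed: 4.2 m/s, target: (12.0, 3.5)"

# The LLM backbone consumes one interleaved sequence of embeddings.
llm_input = torch.cat([visual_tokens, state_tokens], dim=1)  # (1, 456, 896)
```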

3. Decision Heads (DeciHeads)

This is a crucial innovation. Standard LLMs predict the next token from a vocabulary of roughly 30,000 entries. If you want an LLM to output a steering angle like “15.4 degrees,” generating it token by token (“1”, then “5”, then “.”, then “4”) is slow and imprecise.

Instead, DriveGPT4-V2 replaces the standard vocabulary head with Decision Heads (DeciHeads). The LLM outputs four special tokens. These aren’t words; they are latent vectors that are fed into simple Multi-Layer Perceptrons (MLPs) to predict numerical values directly using regression.
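A minimal sketch of such a regression head is shown below; the hidden width and the assumed 896-dim LLM hidden size are illustrative choices, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class DeciHead(nn.Module):
    """Illustrative decision head: regress a numeric target directly from the
    hidden state of one special output token, instead of decoding digit tokens."""
    def __init__(self, llm_dim: int, out_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(llm_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, token_state: torch.Tensor) -> torch.Tensor:
        return self.mlp(token_state)

llm_dim = 896                             # assumed hidden size of a small LLM
speed_head = DeciHead(llm_dim, out_dim=1)
latent = torch.randn(1, llm_dim)          # hidden state of one of the four special tokens
target_speed = speed_head(latent)         # a single regressed value, e.g. in m/s
```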

The model predicts four key variables:

  1. Target Speed: How fast should I go?
  2. Target Angle: Where should I steer?
  3. Waypoints: Where will I be in the next 4 seconds?
  4. Route Points: Points along the global path.

Figure 4. Visualization of waypoints and route points. The ego vehicle is represented by the green rectangle, and the red point denotes the target point. The grey line is the route that the vehicle should follow. (a) Waypoints (blue points) represent vehicle positions in a 4-second future. (b) Route points are evenly sampled from the global planned route in front. These two kinds of points can better supervise the training of DriveGPT4-V2.

While the model predicts Waypoints and Route Points (shown in Figure 4) to help it “understand” the future geometry of the road, the actual driving is controlled by the Target Speed and Target Angle. These two numbers are fed into a standard PID controller (a control loop mechanism) to generate the final throttle, brake, and steer commands.
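Here is a minimal sketch of that last conversion step. The PID gains, the 20 Hz timestep, and the clipping ranges are illustrative assumptions; the paper’s tuned controllers will differ.

```python
class PID:
    """Textbook PID controller; gains below are illustrative, not the paper's."""
    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

speed_pid = PID(kp=1.0, ki=0.1, kd=0.0)
steer_pid = PID(kp=0.8, ki=0.0, kd=0.1)

def to_controls(target_speed, current_speed, target_angle, current_angle, dt=0.05):
    """Turn the predicted target speed/angle into throttle, brake and steer."""
    accel = speed_pid.step(target_speed - current_speed, dt)
    throttle = max(0.0, min(1.0, accel))
    brake = max(0.0, min(1.0, -accel))
    steer = max(-1.0, min(1.0, steer_pid.step(target_angle - current_angle, dt)))
    return throttle, brake, steer
```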

Training Strategy: Learning from a Cheating Expert

Designing the architecture is only half the battle. The real magic of DriveGPT4-V2 lies in how it learns to drive. The researchers used a two-stage training process involving an “Expert” teacher.

The Problem with Behavior Cloning

The simplest way to train a self-driving car is Behavior Cloning (BC). You record a human (or a perfect autopilot) driving, and you train the model to copy their actions.

The problem? The expert driver is perfect. They never drift off the lane. So, the student model never sees what it looks like to be slightly off-center. If the student makes a tiny mistake during deployment and drifts 10cm to the left, it enters a state it has never seen in the training data. It doesn’t know how to recover, so it makes another mistake, drifts further, and eventually crashes.

The Solution: Online Imitation Learning (DAgger)

To fix this, the researchers employ an approach inspired by the DAgger (Dataset Aggregation) algorithm. They introduce a second model: the Expert LLM.

The Expert LLM has a secret weapon: Privileged Information. While the main model (the student) only sees camera images, the Expert is allowed to access the simulator’s ground truth. It knows the exact coordinates of every pedestrian, the state of every traffic light, and the precise speed of surrounding cars.

Because the Expert has this “god mode” view, it is incredibly robust. It doesn’t need to guess; it knows exactly what to do.

Figure 5. Diagram of the two-stage training process. (a) In the first stage, both DriveGPT4-V2 and the expert LLM are trained on data collected by a rule-based autopilot. (b) In the second stage, DriveGPT4-V2 runs on the training scenarios and routes. When the discrepancy between DriveGPT4-V2’s predictions and those of the expert exceeds a predefined threshold, the expert’s predictions are used to control the vehicle. Data from these cases is then added to the dataset for data aggregation.

As visualized in Figure 5, the training happens in two stages:

  1. Stage 1 (Behavior Cloning): Both the Student (DriveGPT4-V2) and the Expert are trained on a static dataset collected by a rule-based autopilot. This gives them a basic understanding of driving.
  2. Stage 2 (On-Policy Supervision): The Student tries to drive in the simulator. The Expert watches silently.
  • If the Student drives well, nothing happens.
  • If the Student’s prediction differs significantly from the Expert’s (indicating a potential error), the Expert takes over and records the correct action for that specific dangerous situation.
  • This new data (the student’s mistake + the expert’s correction) is added to the dataset, and the Student is retrained.

This teaches the Student not just how to drive perfectly, but how to recover from mistakes.
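In pseudocode, the Stage 2 loop might look like the sketch below. The `student`, `expert`, `simulator`, `retrain`, and `threshold` names are hypothetical stand-ins for the paper’s actual implementation; only the DAgger-style control flow is the point.

```python
# DAgger-style sketch of Stage 2 (on-policy supervision). All interfaces here
# are hypothetical stand-ins, not the paper's code.

def discrepancy(a, b):
    """Assumed L2 distance between two sets of predicted decisions."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def stage2_on_policy(student, expert, simulator, dataset, threshold, retrain, rounds=5):
    for _ in range(rounds):
        for scenario in simulator.training_scenarios():
            obs, done = scenario.reset(), False
            while not done:
                student_action = student.predict(obs)  # cameras + vehicle state only
                expert_action = expert.predict(obs, scenario.privileged_state())
                if discrepancy(student_action, expert_action) > threshold:
                    # The expert takes over and its correction is recorded.
                    dataset.append((obs, expert_action))
                    obs, done = scenario.step(expert_action)
                else:
                    obs, done = scenario.step(student_action)
        retrain(student, dataset)  # aggregate the new data, refit the student
```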

The Loss Function

To ensure the model learns all aspects of driving, the training minimizes a combined loss function:

\[
\mathcal{L} = \mathcal{L}_{TS} + \mathcal{L}_{TA} + \mathcal{L}_{WP} + \mathcal{L}_{RP}
\]

This equation simply means the model is penalized if it gets any of the four predictions wrong: Target Speed (\(TS\)), Target Angle (\(TA\)), Waypoints (\(WP\)), or Route Points (\(RP\)).
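In code, this could be as simple as summing four regression losses; using L1 terms here is an assumption for illustration, as the paper may define or weight the individual terms differently.

```python
import torch.nn.functional as F

def total_loss(pred: dict, target: dict):
    """Sum of per-output regression losses. L1 is an illustrative choice;
    the paper's exact loss terms and weights may differ."""
    return (F.l1_loss(pred["target_speed"], target["target_speed"])
            + F.l1_loss(pred["target_angle"], target["target_angle"])
            + F.l1_loss(pred["waypoints"], target["waypoints"])
            + F.l1_loss(pred["route_points"], target["route_points"]))
```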

Experiments and Results

The researchers evaluated DriveGPT4-V2 on the CARLA Longest6 Benchmark. This is a grueling test consisting of 36 long routes full of complex urban scenarios, weather changes, and dynamic agents.

Metrics

The evaluation uses three primary scores:

  • Route Completion (RC): Did the car finish the route?
  • Infraction Score (IS): Did the car follow the rules (no collisions, no red light violations)?
  • Driving Score (DS): The main metric, calculated as \(RC \times IS\). A high DS means you finished the route safely.

Table 1 below shows how strictly infractions are penalized. For example, hitting a pedestrian cuts your score in half.

Table 1. Infraction penalty factors.
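As a worked example of how these metrics combine (with illustrative penalty factors consistent with the “pedestrian collision halves your score” rule, not the full official table):

```python
# Illustrative Driving Score computation: DS = RC x IS, where IS multiplies
# one penalty factor per infraction. The factors below are examples, not the
# complete official table.
PENALTIES = {
    "pedestrian_collision": 0.50,
    "vehicle_collision": 0.60,
    "red_light": 0.70,
}

def driving_score(route_completion, infractions):
    infraction_score = 1.0
    for infraction in infractions:
        infraction_score *= PENALTIES[infraction]
    return route_completion * infraction_score

# Completing 90% of the route with one red-light violation and one pedestrian hit:
print(f"{driving_score(90.0, ['red_light', 'pedestrian_collision']):.1f}")  # 90 * 0.7 * 0.5 = 31.5
```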

Performance Comparison

How did DriveGPT4-V2 perform against the competition?

Table 2. Closed-loop experiments performance on CARLA Longest6. “Visual” indicates visual input modalities, while “C” and “L” represent camera and LiDAR, respectively. Bold numbers highlight SOTA metric scores of all models; while underlined numbers represent the best metric scores of baseline methods. * denotes models implemented by ourselves based on official open-sourced code. † represents the model without data augmentation.

Table 2 reveals a stunning result. DriveGPT4-V2 achieved a Driving Score (DS) of 70, significantly outperforming the previous state-of-the-art, Transfuser++ (DS 65).

Crucially, note the “Visual” column. Transfuser++ uses both Cameras (C) and LiDAR (L). DriveGPT4-V2 achieves superior performance using only Cameras (C). This suggests that, with the reasoning power of an LLM and the right training strategy, expensive LiDAR sensors may not be strictly necessary for high-performance driving in these scenarios.

The model also recorded a score of 0.00 for pedestrian collisions, indicating a high level of safety awareness.

Efficiency: Does Size Matter?

One common criticism of LLMs is that they are slow and computationally heavy. You can’t have a 5-second lag when a car pulls out in front of you.

The researchers analyzed the impact of model size on performance and speed (FPS).

Table 3. Efficiency analysis.

Table 3 shows a surprising finding. Scaling up the model from 0.5 Billion parameters (Qwen-0.5B) to 8 Billion parameters (LLaMA3.1-8B) did not improve driving performance (DS remained around 63-65). However, the frame rate (FPS) dropped from 8.1 to 0.4.

A frame rate of 0.4 FPS is unusable for a car. But at 8.1 FPS, the 0.5B-parameter model is capable of real-time control. This suggests that for driving specifically, a smaller, highly optimized “brain” is better than a massive, slow genius.

Ablation Studies: Why This Design?

The researchers performed “ablation studies”—removing parts of the system to see if they actually matter.

1. Does the visual tokenizer matter? Yes. As shown in Table 4, removing the advanced visual tokenizer and just using basic visual features dropped the score from 70 to 56. The LLM needs high-quality visual tokens to understand the road.

Table 4. Ablation studies of DriveGPT4-V2. “WP” and “RP” represent waypoints and route points, respectively.

2. How should we control the car? Many previous methods used the predicted Waypoints to steer the car (i.e., “steer towards the dot”). However, DriveGPT4-V2 predicts the Target Speed and Angle directly.

Table 5. Ablation studies on PID controllers. “WP” indicates utilizing predicted waypoints for PID control; while “TS&RP” means PID control by predicted target speed and route points.

Table 5 confirms that controlling via Speed/Angle (DriveGPT4-V2 row) yields a DS of 63, whereas following Waypoints (WP) only yields 53. Direct control signals are less noisy and result in smoother driving.

3. Why use Decision Heads instead of text? As mentioned earlier, generating numbers as text tokens is slow.

Table 6. Ablation studies on decision heads. “Additional tokens” indicates using more output tokens for prediction.

Table 6 shows that using standard token generation (“Additional tokens”) drops the speed to 1.4 FPS, while the Decision Heads keep it at 8.1 FPS.

Conclusion

DriveGPT4-V2 represents a significant step forward in autonomous driving. It successfully bridges the gap between the linguistic reasoning of LLMs and the real-time, closed-loop requirements of driving a vehicle.

By combining a Multi-View Visual Tokenizer to see the world, specialized regression heads to make precise decisions, and an expert-guided training strategy to learn recovery behaviors, the system sets a new standard on the CARLA benchmark. Perhaps most importantly, it demonstrates that we can achieve state-of-the-art safety and performance using relatively small LLMs and camera-only setups, paving the way for more accessible and efficient autonomous driving systems in the future.