Introduction
In the quest for fully autonomous vehicles, the “planning” module is the brain of the operation. It decides where the car should go, how fast it should drive, and how to handle a sudden pedestrian crossing the street. Historically, this has been the domain of rule-based systems—massive decision trees written by engineers (e.g., “if light is red, stop”). While these work well for standard traffic, they are brittle. They fail to scale to the “long tail” of weird, unpredictable scenarios that happen in the real world.
This led researchers to Imitation Learning (IL), where neural networks learn to copy human drivers. While scalable, IL suffers from compounding errors; if the car drifts slightly off the perfect human path, it enters states it has never seen before and often panics.
Enter Reinforcement Learning (RL). In theory, RL is the perfect candidate. It learns by trial and error, optimizing for long-term goals rather than just copying a single trajectory. It thrives in closed-loop simulations where it can learn from its own mistakes. However, RL has a dirty secret in the autonomous driving world: Reward Engineering.
To make RL work, researchers have traditionally designed incredibly complex, “shaped” reward functions. They give the car points for staying in the lane center, points for speed, points for orientation, and penalties for jerk. This approach, while well-intentioned, creates a new problem: the agent learns to “game” the system, finding loopholes that maximize points without actually driving well.
In this post, we are breaking down a fascinating paper, “CaRL: Learning Scalable Planning Policies with Simple Rewards,” which challenges this status quo. The authors propose a radical simplification: stop using complex rule-based rewards. Instead, use a simple reward based on route completion and scale the training data up to 300 million samples. The result is a planner that outperforms complex state-of-the-art methods on both the CARLA simulator and the real-world nuPlan benchmark.
The Trap of Complex Rewards
In Reinforcement Learning, the agent optimizes its behavior to maximize a scalar reward signal. In games like Chess or Go, the reward is simple: +1 for winning, -1 for losing. In driving, however, the feedback loop is long. If a car crashes 10 seconds into a drive, it’s hard for the algorithm to know which specific steering action caused the crash.
To fix this “credit assignment” problem, researchers typically use dense, shaped rewards. These are rewards given at every time step. For example, a popular method called Roach sums up rewards for:
- Velocity matching a target.
- Position matching a lane center.
- Orientation matching the lane direction.
The problem is that these rewards rely on rule-based experts to define what “optimal” speed or position is. If the rule-based expert is flawed, the RL agent learns flawed behavior. Furthermore, these additive rewards create local minima.
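To make that additive structure concrete, here is a minimal sketch of a Roach-style shaped reward. The weights, targets, and the `state` interface are illustrative assumptions, not the authors' code:

```python
def shaped_reward(state, target_speed, lane_center, lane_heading,
                  w_speed=1.0, w_pos=0.5, w_heading=0.5, w_jerk=0.1):
    """Roach-style additive shaped reward (illustrative weights, hypothetical `state` API)."""
    # Reward for matching the target velocity prescribed by a rule-based expert
    # (the target drops to 0 at red lights and near obstacles).
    r_speed = 1.0 - abs(state.speed - target_speed) / max(target_speed, 1.0)
    # Reward for staying close to the lane center.
    r_pos = 1.0 - min(abs(state.lateral_offset(lane_center)), 2.0) / 2.0
    # Reward for aligning the heading with the lane direction.
    r_heading = 1.0 - abs(state.heading_error(lane_heading)) / 3.14159
    # Penalty for uncomfortable jerk.
    r_jerk = -abs(state.jerk)

    # The optimum of this sum is not necessarily "drive the route well":
    # standing still scores highly on the velocity term whenever the
    # expert's target speed is 0.
    return (w_speed * r_speed + w_pos * r_pos
            + w_heading * r_heading + w_jerk * r_jerk)
```

Each term is individually reasonable, yet their sum can be maximized by behaviors no human would call good driving.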
The authors highlight a hilarious but problematic failure mode of these complex rewards. As shown below, an agent trained with the Roach reward learns to wait at green lights.

Why does it do this? The Roach reward gives high points for matching a target velocity. If the light is red, the target velocity is 0. If the agent stops at a green light, it might lose some progress points, but it “hacks” the reward function by setting itself up to perfectly match the “stop” target when the light eventually turns red. It essentially stops driving to farm points.
The Scalability Problem
The second, and perhaps more critical, issue with complex rewards is scalability. In modern Deep Learning, the recipe for success is usually:
- Get a massive dataset.
- Use a massive batch size.
- Train a massive model.
However, when researchers tried to increase the mini-batch size (the amount of data the model looks at before updating its weights) using complex rewards with the PPO (Proximal Policy Optimization) algorithm, performance collapsed.

As Figure 1 illustrates, the complex reward (blue line) performs worse as you feed it more data per update. The gradients get smoothed out, and the optimization gets stuck in those local minima (like the green light waiting strategy) because they are “safe” bets for the reward function.
The authors’ method, CaRL (red line), shows the opposite trend: it gets better as you scale up.
CaRL: The Core Method
The researchers propose a return to first principles. Instead of micromanaging the car with specific rules about lane positioning or orientation, they simply tell the car what to do, not how to do it.
The Simple Reward Function
The new reward design is elegant in its simplicity. It is based primarily on Route Completion (RC):

\[
r_t = RC_t \cdot p_t - T
\]
Here is the breakdown of the terms:
- \(RC_t\): The percentage of the route completed during the current time step. This is the primary driver. The car wants to finish the route.
- \(p_t\) (Soft Penalties): These are multiplicative factors between 0 and 1. If the car drives well, \(p_t = 1\). If the car violates a soft constraint (e.g., speeding, driving slightly off-lane, or driving uncomfortably), \(p_t\) drops below 1, reducing the reward for that step.
- Note: Because these are multiplicative, if you are driving recklessly, your progress reward (\(RC_t\)) is heavily discounted.
- \(T\) (Terminal Penalty): If the car commits a “hard” infraction—like a collision or running a red light—the episode ends immediately, and a penalty \(T\) is subtracted.
This design ensures that the global optimum of the reward matches the global optimum of the driving task. To get maximum points, the car must finish the route (Max RC) while avoiding all penalties. There are no loopholes where stopping at a green light yields points.
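Putting the pieces together, a minimal sketch of this reward might look as follows. The specific soft-penalty values and the terminal-penalty magnitude are illustrative assumptions, not the paper's implementation:

```python
def simple_reward(route_completion_delta, soft_penalties, hard_infraction,
                  terminal_penalty=1.0):
    """CaRL-style reward sketch: route progress scaled by multiplicative penalties.

    route_completion_delta: percentage of the route completed this step (RC_t).
    soft_penalties: factors in [0, 1], one per soft constraint
                    (speeding, leaving the lane, discomfort, ...).
    hard_infraction: True if a collision or red-light violation ends the episode.
    """
    p_t = 1.0
    for p in soft_penalties:
        p_t *= p  # reckless driving discounts the progress reward multiplicatively

    reward = route_completion_delta * p_t
    if hard_infraction:
        reward -= terminal_penalty  # subtract T and terminate the episode
    return reward
```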
Optimizing PPO for Scale
To make this simple reward work, the authors also had to revisit the training hyperparameters. Most prior work in CARLA (like Roach) used very conservative PPO settings, in particular low learning rates.
The authors switched to the standard “Atari” hyperparameters for PPO, which use a higher learning rate and fewer training epochs.

As shown in Table 2, simply switching to these robust hyperparameters (Atari) reduced training time by 10 hours and increased the Driving Score (DS) by 11 points. This suggests that previous struggles with RL in driving were partly due to poor hyperparameter choices that masked the potential of standard algorithms.
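For reference, the Atari defaults from the original PPO paper look roughly like this; treat it as a sketch of "standard" PPO settings rather than CaRL's exact configuration:

```python
# Atari defaults from the original PPO paper (Schulman et al., 2017);
# CaRL's exact settings may differ from these.
ppo_atari_config = {
    "learning_rate": 2.5e-4,   # notably higher than the conservative rates used in prior CARLA work
    "num_epochs": 3,           # few optimization passes over each rollout
    "rollout_horizon": 128,    # steps collected per environment before an update
    "clip_range": 0.1,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "entropy_coef": 0.01,
}
```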
Scaling Up Data
The combination of a simple reward and robust hyperparameters unlocked the ability to use massive batch sizes (up to 16,384 samples per update). A large batch averages out the noise in the gradient estimates, so each policy update is more accurate.
This enabled the team to parallelize data collection across many GPUs. They scaled the training to 300 million samples in CARLA and 500 million samples in nuPlan. For context, prior state-of-the-art methods typically used only 10-20 million samples.
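To get a feel for how such batches are assembled, rollout collection is typically vectorized across many parallel simulator instances. The sketch below uses Gymnasium's generic vector API with a stand-in environment, since the actual CaRL training stack is not reproduced here:

```python
import gymnasium as gym
import numpy as np

NUM_ENVS = 128   # parallel simulator instances (illustrative)
HORIZON = 128    # steps per environment per rollout
# 128 envs * 128 steps = 16,384 samples per PPO update, matching the batch sizes above.

# "CartPole-v1" is a stand-in; the paper trains in the CARLA and nuPlan simulators.
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(NUM_ENVS)])

obs, _ = envs.reset(seed=0)
batch_obs = []
for _ in range(HORIZON):
    actions = envs.action_space.sample()            # placeholder for the policy
    obs, rewards, terms, truncs, infos = envs.step(actions)
    batch_obs.append(obs)

batch = np.stack(batch_obs)                         # shape: (HORIZON, NUM_ENVS, obs_dim)
print(batch.reshape(-1, batch.shape[-1]).shape)     # (16384, obs_dim)
```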
The Model Architecture
The policy itself is a Convolutional Neural Network (CNN) that processes a Bird’s-Eye-View (BEV) semantic image.

The input includes channels for the road, lane markings, other vehicles (encoded with their speed), pedestrians, and traffic lights. The network outputs the steering, throttle, and brake actions directly.
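A minimal PyTorch sketch of such a BEV policy is shown below; the channel count, layer sizes, and the way actions are parameterized are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class BEVPolicy(nn.Module):
    """Toy BEV-to-control policy: semantic BEV image in, driving actions out."""

    def __init__(self, in_channels: int = 8, bev_size: int = 128):
        super().__init__()
        # Small convolutional encoder over the semantic BEV grid
        # (road, lane markings, vehicles, pedestrians, traffic lights, ...).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Heads: continuous controls plus a value estimate for PPO training.
        self.action_head = nn.Linear(128, 3)  # steering, throttle, brake
        self.value_head = nn.Linear(128, 1)

    def forward(self, bev: torch.Tensor):
        features = self.encoder(bev)
        return self.action_head(features), self.value_head(features)

# Example: a batch of 4 BEV images with 8 semantic channels at 128x128 resolution.
actions, values = BEVPolicy()(torch.randn(4, 8, 128, 128))
```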
Experiments and Results
The authors evaluated CaRL on two major benchmarks: CARLA (a 3D simulator) and nuPlan (a data-driven simulator based on real-world logs).
Results on CARLA
On the “longest6 v2” benchmark (a rigorous test involving long routes with dense traffic), CaRL demonstrated a massive leap in performance.

- Roach (Prior RL SOTA): 22 Driving Score (DS).
- Think2Drive (World Model RL): 7 DS.
- CaRL (Ours): 64 DS.
CaRL nearly tripled the performance of the previous best RL method. It also outperformed PlanT, a Transformer-based Imitation Learning method (62 DS). It came close to the rule-based expert PDM-Lite (73 DS), but did so with a pure neural network that runs significantly faster (8ms vs 18ms inference time).
Results on nuPlan
nuPlan is notoriously difficult for learning-based methods because it requires reacting to real-world recorded traffic.

In the “Reactive” setting (where background cars react to the ego vehicle), CaRL achieved a Closed-Loop Score (CLS) of 90.6.
- This beats the Diffusion Planner (82.7).
- It is nearly identical to the strong rule-based system PDM-Closed (92.1), which is a remarkable achievement for a pure RL approach.
- Perhaps most importantly, CaRL is an order of magnitude faster, running at 14ms compared to Diffusion Planner’s 138ms or PLUTO’s 237ms.
Failure Cases
While CaRL is a significant step forward, it is not perfect. The simplicity of the reward means the agent has to figure out everything from scratch, which can lead to specific types of errors.

In nuPlan (Figure 9), the agent sometimes struggles with overly aggressive avoidance maneuvers. In (a), to avoid a pedestrian, it veers entirely off-road. In CARLA, the authors noted that CaRL sometimes misses highway exits because it has not learned to plan lane changes far enough in advance to stay on the prescribed route, and that it occasionally gets rear-ended by aggressive simulated drivers running red lights.
Conclusion
The CaRL paper provides a compelling counter-narrative to the trend of increasing complexity in autonomous driving systems. It shows that we don’t necessarily need more complex reward functions or elaborate rule-based teachers to train good driving policies.
Instead, the formula for success was:
- Simplify the Objective: Use a reward (Route Completion) that aligns perfectly with the end goal.
- Trust the Algorithm: Use standard PPO with high learning rates.
- Scale, Scale, Scale: Use the simplified reward to unlock massive batch sizes and train on hundreds of millions of samples.
By removing the “hand-holding” of shaped rewards, CaRL allows the agent to learn more robust, natural driving behaviors that scale better with data. As we move toward end-to-end driving systems, this suggests that the bottleneck might not be the algorithm, but rather how we define the task we want the car to solve.