Introduction

In the last few years, the landscape of Artificial Intelligence has been dominated by one specific architecture: the Transformer. Large Language Models (LLMs) like GPT-4 and Llama have revolutionized how machines process sequence data, demonstrating reasoning capabilities that seem almost emergent.

This naturally leads to a provocative question for researchers in other fields: If a model is excellent at predicting the next word in a sentence, can it be equally good at predicting the next move of a car in traffic?

Superficially, the domains are strikingly similar. Both involve autoregressive sequence modeling (predicting \(t+1\) given \(0...t\)), token-based representations, and heavy reliance on context. In language, context is the paragraph you just wrote; in driving, context is the road geometry and the behavior of other drivers.

However, copying and pasting architecture from one domain to another rarely works out of the box. A road is not a sentence, and a trajectory is not a string of text.

In this post, we are deep-diving into a fascinating paper titled “Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving.” The researchers conducted a systematic evaluation of five core LLM components—Tokenizers, Positional Embeddings, Pre-training, Post-training, and Test-time Computing—to see what transfers to autonomous driving and, crucially, what breaks.

Figure 1: The autonomous driving motion generation task bears a striking resemblance to large language models.

As illustrated in Figure 1, the high-level pipelines are almost identical. But as we will discover, the devil is in the details.

The Task: Motion Generation

Before dissecting the architecture, we must define the problem. The authors focus on Motion Generation, specifically the “Sim Agents” task.

In this setting, the model is given:

  1. Context (\(C_t\)): The static environment (HD maps, traffic lights, road boundaries).
  2. History (\(H_i^t\)): The past 1 second of motion for all agents (cars, pedestrians, cyclists).

The goal is to generate realistic, 8-second future trajectories (\(\hat{Y}_i\)) for multiple agents simultaneously. This is mathematically formulated as:

Equation 1: The formulation of motion generation.
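Roughly, and writing it as a reconstruction from the definitions above rather than the paper's exact notation, the model defines a conditional distribution over the joint futures of all agents:

\[
\{\hat{Y}_i\}_{i=1}^{N} \sim p_\theta\!\left(Y_1, \dots, Y_N \mid C_t, \{H_i^t\}_{i=1}^{N}\right),
\]

where \(N\) is the number of agents in the scene and \(\theta\) are the model parameters.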

This isn’t just about predicting one likely path. A good simulator needs to generate diverse but plausible rollouts—32 different ways the future could unfold, all obeying physics and traffic laws.

The Core Method: Adapting LLM Modules

The researchers built a GPT-style baseline model to test their hypotheses. This model consists of two main parts: a Scene Encoder and a Motion Generator.

The Architecture

The Scene Encoder compresses the map and agent history into tokens. Unlike a standard text encoder, this network has to process geometric data (points and vectors). As shown below, it uses a PointNet-like structure to embed agents and map elements, followed by attention layers to mix information between agents (A2A), maps (M2A), and vice versa.

Figure 6: The architecture of the scene encoder.

These embeddings are then fed into the Motion Generator, which acts as the “GPT” of the system. It uses an autoregressive decoder to predict the next motion token. Notice the complex attention mechanism: it doesn’t just look at previous tokens (Temporal Attention); it also attends to other agents and the map at every step.

Figure 7: The architecture of the motion generator.
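To make that data flow concrete, here is a minimal sketch of one decoder block; the module names, dimensions, and layer ordering are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MotionDecoderBlock(nn.Module):
    """One decoder block: temporal self-attention over past motion tokens,
    then cross-attention to agent and map embeddings (illustrative sketch)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.agent_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.map_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, motion_tokens, agent_emb, map_emb, causal_mask):
        # Temporal attention: each step only sees earlier motion tokens.
        x = motion_tokens
        h, _ = self.temporal(x, x, x, attn_mask=causal_mask)
        x = self.norms[0](x + h)
        # Cross-attention to the other agents' scene embeddings (A2A-style mixing).
        h, _ = self.agent_xattn(x, agent_emb, agent_emb)
        x = self.norms[1](x + h)
        # Cross-attention to map-element embeddings (M2A-style mixing).
        h, _ = self.map_xattn(x, map_emb, map_emb)
        x = self.norms[2](x + h)
        return self.norms[3](x + self.ffn(x))

# Toy usage: 2 scenes, 16 motion steps, 32 agent tokens, 128 map tokens.
block = MotionDecoderBlock()
T = 16
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
out = block(torch.randn(2, T, 256), torch.randn(2, 32, 256), torch.randn(2, 128, 256), mask)
```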

With this baseline established, let’s look at the five modules the authors investigated.

Module 1: Tokenizing

The Problem: LLMs operate on discrete vocabularies (words or sub-words). Vehicle motion is continuous (velocity, acceleration, position). How do you turn a continuous curve into a discrete token ID?

The Approaches:

  1. Data-Driven (e.g., SMART): Cluster millions of real-world trajectories into, say, 1,024 “standard” shapes. This is like creating a dictionary of common phrases.
  2. Model-Based (Verlet): Discretize the physics (acceleration/velocity) rather than the raw trajectory shapes.

The authors propose a Verlet-Agent tokenizer. They transform trajectories into the agent's local coordinate frame and discretize the acceleration space. This differs from prior work (MotionLM), which used a scene-centric frame.

Why does this matter? If you tokenize based on global coordinates, “accelerating north” and “accelerating east” look like different tokens. If you use agent-centric coordinates, “accelerating forward” is the same token regardless of which way the car is facing. This consistency improves generalization.
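A minimal sketch of the idea, assuming a simple bin-based discretization; the bin edges, frame transform, and token layout are illustrative, not the paper's exact tokenizer.

```python
import numpy as np

def to_agent_frame(delta_xy, heading):
    """Rotate a global 2D vector into the agent's local (forward, left) frame."""
    c, s = np.cos(-heading), np.sin(-heading)
    rot = np.array([[c, -s], [s, c]])
    return rot @ np.asarray(delta_xy, dtype=float)

def tokenize_step(accel_xy_global, heading, bins=np.linspace(-4.0, 4.0, 13)):
    """Map a per-step acceleration to a discrete token ID, agent-centrically.

    "Accelerate forward" yields the same token no matter which way the agent
    faces, because we discretize in the agent frame, not global coordinates.
    """
    ax, ay = to_agent_frame(accel_xy_global, heading)
    ix = np.digitize(ax, bins)   # longitudinal bin
    iy = np.digitize(ay, bins)   # lateral bin
    return ix * (len(bins) + 1) + iy  # flatten the 2D bin index into one token ID

# The same physical maneuver gives the same token regardless of global heading:
print(tokenize_step([2.0, 0.0], heading=0.0))        # accelerating east, facing east
print(tokenize_step([0.0, 2.0], heading=np.pi / 2))  # accelerating north, facing north
```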

Figure 8: Influence of initial velocity on the travel distance induced by the same motion token sequence.

As Figure 8 shows, the Verlet wrapper makes each token represent a change in motion state (an acceleration) rather than an absolute displacement, so rollouts stay physically consistent whatever the agent's initial velocity.
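Concretely, a Verlet-style update (the standard position-Verlet step; whether the paper uses exactly this form is an assumption) predicts the next position from the two previous ones plus the tokenized acceleration:

\[
x_{t+1} = 2x_t - x_{t-1} + a_t\,\Delta t^2,
\]

so a token only needs to encode \(a_t\); the current speed is carried implicitly by \(x_t - x_{t-1}\), which is why the same token sequence in Figure 8 covers different distances for different initial velocities.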

The Verdict: The agent-centric, model-based approach significantly outperforms the data-driven clustering approach, achieving higher accuracy with a much smaller vocabulary (169 tokens vs. 2,048), as seen in Table 1.

Table 1: Tokenizer Design

Module 2: Positional Embedding (PE)

The Problem: In NLP, position is 1D (word 1, word 2, word 3). In driving, position is 2D/3D (x, y, heading) and relational. You need to know that Car A is behind Car B and to the left of the Lane Marker.

Standard NLP positional embeddings (Sine/Cosine functions added to the input) fail here because they don’t capture the relative geometry between 2D objects effectively.

The Solution: The authors adapt DRoPE (Directional Rotary Positional Embedding). Instead of adding a vector, RoPE rotates the embedding in high-dimensional space. The angle of rotation corresponds to the position/orientation.

Equation defining standard PE
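The standard absolute PE referred to here is presumably the sinusoidal form from the original Transformer (an assumption about which equation the paper shows):

\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right),
\]

which is simply added to the token embedding before the first attention layer.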

The math for rotating based on position and heading looks like this:

Equation for Position Rotation

Equation for Heading Rotation
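A hedged sketch of the rotary mechanism these build on, assuming DRoPE follows the standard RoPE construction of 2D rotation blocks with angles derived from position and heading:

\[
R(\theta) = \begin{pmatrix}\cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{pmatrix},
\qquad
q_m^{\top} k_n \;\longrightarrow\; \bigl(R(\theta_m)\,q_m\bigr)^{\top}\bigl(R(\theta_n)\,k_n\bigr) = q_m^{\top} R(\theta_n - \theta_m)\,k_n,
\]

where the angle \(\theta\) is taken from the token's 2D position (position rotation) or its heading (heading rotation), so the attention score depends only on the relative geometry between query and key.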

The Twist: The authors found that standard DRoPE had a flaw. If you encode every lane segment in its own local coordinate system, two parallel lanes look mathematically identical to the network. The attention mechanism struggles to distinguish them.

They propose Global-DRoPE. They keep the coordinates in the global frame (preserving the semantic distinction between “left lane” and “right lane”) but use the Rotary mechanism to inject relative position information during the attention calculation.

Figure 11: Visualization of lane segment features under DRoPE using inputs from local (top row) and global (bottom row) coordinates.

Look at Figure 11. The top row (Local coordinates) shows high redundancy—the features look the same. The bottom row (Global coordinates) shows smooth transitions and distinct features for different parts of the intersection. This leads to better trajectory generation, as seen in Figure 12 below.

Figure 12: Comparison of motion generation results using local vs. global coordinate encodings.

Table 2: Comparison of different positional embedding methods.

Module 3: Pre-training and Scaling Laws

The Question: One of the defining characteristics of LLMs is the Scaling Law: adding more data and parameters predictably lowers the loss. Does this hold for driving?

The Experiment: The authors trained models of varying sizes (Mini to Large) on datasets ranging from 10% to 800% (augmented) of the Waymo Open Motion Dataset.

The Verdict: Yes, scaling laws apply. As shown in Figure 9, loss decreases linearly (on a log-log scale) as data increases. However, they hit a ceiling: the “Large” model started overfitting even on the 800% dataset, suggesting that simply augmenting data isn’t enough—we need genuinely larger datasets (like all of Waymo or nuPlan) to push parameters further.
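"Linear on a log-log scale" is the signature of a power law; the usual functional form (the constants here are generic placeholders, not the paper's fitted values) is:

\[
\mathcal{L}(D) \approx \left(\frac{D_c}{D}\right)^{\alpha} + \mathcal{L}_\infty,
\]

where \(D\) is the dataset size, \(\alpha\) the fitted exponent, and \(\mathcal{L}_\infty\) the irreducible loss; taking logs of the first term gives a straight line of slope \(-\alpha\).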

Figure 9: Scaling Law of Data Volume and Model Parameters.

Table 3: Scaling Law on GPT Layers.

Qualitatively, the pre-trained model learns complex behaviors like U-turns and yielding to pedestrians (Figure 13).

Figure 13: The results of three parallel rollouts after pre-training.

Module 4: Post-training (Alignment)

The Problem: A pre-trained model (trained via imitation learning) is like a text completion model—it mimics the average human. But “average” human driving isn’t always safe or optimal. We need to align the model, just like we use RLHF (Reinforcement Learning from Human Feedback) to stop ChatGPT from being toxic.

The Approaches: The authors compared three alignment techniques:

  1. SFT (Supervised Fine-Tuning): Fine-tuning on safety-critical cases.
  2. REINFORCE / A2C: Standard Reinforcement Learning methods maximizing a reward function (penalizing collisions).
  3. GRPO (Group Relative Policy Optimization): A newer method used in LLMs (like DeepSeek-Math).

Why GRPO? Standard RL (A2C) requires training a "Critic" network to estimate the value of a state. This is notoriously unstable in high-dimensional driving tasks. GRPO skips the Critic: instead, it generates a group of \(N\) trajectories for the same input, ranks them, and updates the policy to favor the ones that beat the group average.
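For intuition, here is a minimal sketch of that group-relative advantage; the reward values and function names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: score each rollout against its own group.

    No learned critic: the group mean acts as the baseline and the group
    standard deviation normalizes the scale.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: N = 4 rollouts sampled for the same scene, rewarded for staying
# collision-free and on-road (the reward design here is a placeholder).
rollout_rewards = [1.0, 0.2, 0.9, -0.5]
print(group_relative_advantages(rollout_rewards))  # positive = better than group average
```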

Figure 10: The reinforcement learning framework.

The GRPO loss function includes a KL-divergence term to ensure the model doesn’t drift too far from the original human-like physics (Equation below).

Equation: GRPO Loss Function
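Written in the style of the original GRPO formulation (a reconstruction, so details may differ from the paper's exact equation), the objective is roughly:

\[
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{N}\sum_{i=1}^{N}
\min\!\Bigl(\rho_i A_i,\ \operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,A_i\Bigr)
\;-\; \beta\, D_{\mathrm{KL}}\!\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)\right],
\]

where \(\rho_i = \pi_\theta(\hat{Y}_i \mid C, H)\,/\,\pi_{\mathrm{old}}(\hat{Y}_i \mid C, H)\) is the importance ratio, \(A_i\) is the group-relative advantage described above, and the KL term keeps the policy close to the pre-trained, human-like reference \(\pi_{\mathrm{ref}}\).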

The Verdict: GRPO wins. As shown in Table 4, it achieves the best balance. Standard RL (REINFORCE/A2C) creates overly conservative “robot-like” drivers that refuse to move to avoid collisions (low Realism score). GRPO maintains human-like realism while significantly improving safety.

Table 4: Comparison of post-training results.

The difference is stark in the heatmaps below. SFT changes little. REINFORCE converges to a single safe path (boring/predictable). GRPO explores a safe distribution of paths.

Figure 3: Comparison of different post-training methods showing endpoint distributions.

Figure 14 shows specific cases where GRPO solves “failed” scenarios from the pre-trained model, such as safely merging or yielding to pedestrians.

Figure 14: A qualitative comparison between GRPO and the pre-trained model.

Module 5: Test-time Computing

The Problem: Even the best model occasionally outputs a collision. In LLMs, we use techniques like “Chain of Thought” or “Best-of-N” sampling to improve output during inference.

The Solution: The authors implement a Generate-Search-Cluster pipeline (a code sketch follows the list):

  1. Generate: Produce many rollouts (e.g., 1024) in parallel.
  2. Search: Filter out any trajectories that collide with the map or agents.
  3. Cluster: Use K-Medoids to group the remaining safe trajectories and pick diverse representatives.
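A compact sketch of that pipeline, assuming trajectories are arrays of (x, y) waypoints, a naive distance-based collision check, and a tiny hand-rolled k-medoids; all of it is illustrative rather than the authors' implementation, and `model.sample(scene)` is an assumed API.

```python
import numpy as np

def generate(model, scene, n_rollouts=1024):
    """Step 1: sample many candidate futures (model.sample is an assumed API)."""
    return [model.sample(scene) for _ in range(n_rollouts)]  # each: (T, 2) array

def is_safe(traj, obstacles, radius=2.0):
    """Step 2 helper: reject a rollout whose waypoints come too close to any obstacle."""
    d = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=-1)
    return bool(d.min() > radius)

def k_medoids(trajs, k=32, iters=10):
    """Step 3: pick k diverse representatives with a simple PAM-style loop."""
    flat = np.stack([t.reshape(-1) for t in trajs])
    dist = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    medoids = np.random.choice(len(trajs), size=k, replace=False)
    for _ in range(iters):
        assign = dist[:, medoids].argmin(axis=1)          # assign each rollout to nearest medoid
        for j in range(k):
            members = np.where(assign == j)[0]
            if len(members):                              # new medoid = most central member
                medoids[j] = members[dist[np.ix_(members, members)].sum(axis=1).argmin()]
    return [trajs[i] for i in medoids]

def generate_search_cluster(model, scene, obstacles, k=32):
    candidates = generate(model, scene)
    safe = [t for t in candidates if is_safe(t, obstacles)]
    pool = safe or candidates                             # fall back if nothing passes the filter
    return k_medoids(pool, k=min(k, len(pool)))
```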

The Verdict: This is the single most effective way to ensure safety. As Figure 15 visualizes, raw outputs have many “red X” collisions. The “Search + Cluster” approach almost eliminates them.

Figure 15: Qualitative comparison of test-time computing methods.

Table 5: Comparison of different test-time computing strategies.

Results and Limitations

Combining all these optimized modules (Verlet-Agent Tokenizer + Global-DRoPE + GRPO + Search/Cluster), the authors submitted their model to the Waymo Sim Agents Leaderboard.

They achieved competitive results against State-of-the-Art (SOTA) methods, as seen in Table 6.

Table 6: Comparison with SOTAs in Sim Agents Leaderboard

The Evaluation Paradox

The authors highlight a critical flaw in current benchmarks. The standard metric is “Realism Likelihood,” which measures how close the generated trajectory is to the ground truth.

But what if the ground truth is bad?

In Figure 4, the human driver (Ground Truth) actually went off-road due to noise or error. The pre-trained model mimicked this error and got a high score. The optimized (GRPO) model fixed the error and stayed on the road, but got a low score because it didn’t match the ground truth. This indicates that as AD models get better than humans, our metrics need to evolve.

Figure 4: A limitation of using likelihood-based evaluation.

Table 7: Metrics for the scenario above.

Conclusion

This paper provides a roadmap for adapting the “LLM playbook” to autonomous driving. The answer to “Do LLM modules generalize?” is a nuanced “Yes, but…”

  • Yes: Autoregressive generation, scaling laws, and alignment (RLHF/GRPO) are powerful paradigms that transfer well.
  • But: You cannot ignore the physical nature of the domain. Tokenizers must respect physics (Verlet), embeddings must respect geometry (Global-DRoPE), and RL must balance safety with human-like realism.

By treating autonomous driving as a language modeling problem with physical constraints, we are moving closer to generative world models that can simulate—and eventually navigate—the complexities of the real world.