Introduction
In the last few years, the landscape of Artificial Intelligence has been dominated by one specific architecture: the Transformer. Large Language Models (LLMs) like GPT-4 and Llama have revolutionized how machines process sequence data, demonstrating reasoning capabilities that seem almost emergent.
This naturally leads to a provocative question for researchers in other fields: If a model is excellent at predicting the next word in a sentence, can it be equally good at predicting the next move of a car in traffic?
Superficially, the domains are strikingly similar. Both involve autoregressive sequence modeling (predicting \(t+1\) given \(0...t\)), token-based representations, and heavy reliance on context. In language, context is the paragraph you just wrote; in driving, context is the road geometry and the behavior of other drivers.
However, copying and pasting an architecture from one domain to another rarely works out of the box. A road is not a sentence, and a trajectory is not a string of text.
In this post, we are deep-diving into a fascinating paper titled “Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving.” The researchers conducted a systematic evaluation of five core LLM components—Tokenizers, Positional Embeddings, Pre-training, Post-training, and Test-time Computing—to see what transfers to autonomous driving and, crucially, what breaks.

As illustrated in Figure 1, the high-level pipelines are almost identical. But as we will discover, the devil is in the details.
The Task: Motion Generation
Before dissecting the architecture, we must define the problem. The authors focus on Motion Generation, specifically the “Sim Agents” task.
In this setting, the model is given:
- Context (\(C_t\)): The static environment (HD maps, traffic lights, road boundaries).
- History (\(H_i^t\)): The past 1 second of motion for all agents (cars, pedestrians, cyclists).
The goal is to generate realistic, 8-second future trajectories (\(\hat{Y}_i\)) for multiple agents simultaneously. This is mathematically formulated as:
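In rough terms (the notation below simplifies the paper's, but it captures the autoregressive factorization the model uses):

\[
\hat{Y}_{1:N} \sim p_\theta\big(Y_{1:N} \mid C_t, H_{1:N}^{t}\big)
= \prod_{\tau=1}^{T} p_\theta\big(Y_{1:N}^{\tau} \mid C_t, H_{1:N}^{t}, Y_{1:N}^{<\tau}\big),
\]

where \(N\) is the number of agents and \(T\) spans the 8-second horizon.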

This isn’t just about predicting one likely path. A good simulator needs to generate diverse but plausible rollouts—32 different ways the future could unfold, all obeying physics and traffic laws.
The Core Method: Adapting LLM Modules
The researchers built a GPT-style baseline model to test their hypotheses. This model consists of two main parts: a Scene Encoder and a Motion Generator.
The Architecture
The Scene Encoder compresses the map and agent history into tokens. Unlike a standard text encoder, this network has to process geometric data (points and vectors). As shown below, it uses a PointNet-like structure to embed agents and map elements, followed by attention layers that exchange information agent-to-agent (A2A), map-to-agent (M2A), and vice versa.

These embeddings are then fed into the Motion Generator, which acts as the “GPT” of the system. It uses an autoregressive decoder to predict the next motion token. Notice the complex attention mechanism: it doesn’t just look at previous tokens (Temporal Attention); it also attends to other agents and the map at every step.
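To make the wiring concrete, here is a minimal PyTorch sketch of one decoder block in that spirit. The layer sizes, normalization placement, and exact attention ordering are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MotionDecoderBlock(nn.Module):
    """One Motion Generator block: causal temporal attention, then
    agent-to-agent and map-to-agent attention (a sketch, dims assumed)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.agent2agent = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.map2agent = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, motion_tokens, agent_tokens, map_tokens, causal_mask):
        x = motion_tokens
        # Temporal attention: each token attends only to its own agent's past (causal).
        h = self.norms[0](x)
        x = x + self.temporal(h, h, h, attn_mask=causal_mask)[0]
        # Agent-to-agent attention: interact with the other agents' embeddings.
        x = x + self.agent2agent(self.norms[1](x), agent_tokens, agent_tokens)[0]
        # Map-to-agent attention: condition on road geometry from the Scene Encoder.
        x = x + self.map2agent(self.norms[2](x), map_tokens, map_tokens)[0]
        return x + self.ffn(self.norms[3](x))
```

Note that the causal mask applies only to the temporal attention; the agent and map cross-attentions see the full scene at every step, which is exactly what separates this decoder from a plain text GPT.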

With this baseline established, let’s look at the five modules the authors investigated.
Module 1: Tokenizing
The Problem: LLMs operate on discrete vocabularies (words or sub-words). Vehicle motion is continuous (velocity, acceleration, position). How do you turn a continuous curve into a discrete token ID?
The Approaches:
- Data-Driven (e.g., SMART): Cluster millions of real-world trajectories into, say, 1,024 “standard” shapes. This is like creating a dictionary of common phrases.
- Model-Based (Verlet): Discretize the physics (acceleration/velocity) rather than the raw trajectory shapes.
The authors propose a Verlet-Agent tokenizer. They transform trajectories into the agent's local coordinate frame and discretize the acceleration space. This differs from prior work (MotionLM), which used a scene-centric frame.
Why does this matter? If you tokenize based on global coordinates, “accelerating north” and “accelerating east” look like different tokens. If you use agent-centric coordinates, “accelerating forward” is the same token regardless of which way the car is facing. This consistency improves generalization.

As Figure 8 shows, the Verlet wrapper ensures that a token represents a change in motion state (acceleration), so the decoded motion stays physically consistent regardless of the initial velocity.
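Here is a rough numpy sketch of what an agent-centric Verlet tokenizer could look like. The 13 x 13 acceleration grid is chosen only because it yields 169 tokens, matching the vocabulary size reported in Table 1; the actual bin edges, time step, and coordinate conventions in the paper may differ.

```python
import numpy as np

# Hypothetical agent-centric Verlet tokenizer (a sketch, not the paper's code).
ACC_BINS = np.linspace(-4.0, 4.0, 13)   # candidate accelerations in m/s^2 (13 * 13 = 169 tokens)
DT = 0.1                                 # assumed 10 Hz step

def rotate(vec_xy, angle):
    """Rotate a 2D vector by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * vec_xy[0] - s * vec_xy[1], s * vec_xy[0] + c * vec_xy[1]])

def nearest_bin(value):
    """Round-to-nearest quantization into the acceleration grid."""
    return int(np.argmin(np.abs(ACC_BINS - value)))

def tokenize_step(p_prev, p_curr, p_next, heading):
    """Map one motion step to a token id via its implied acceleration.
    Verlet view: p_next ~= 2*p_curr - p_prev + a*dt^2, so the token encodes
    the acceleration needed to reach p_next, not the raw displacement."""
    accel_world = (p_next - 2.0 * p_curr + p_prev) / DT**2
    accel_local = rotate(accel_world, -heading)          # world -> agent frame
    return nearest_bin(accel_local[0]) * len(ACC_BINS) + nearest_bin(accel_local[1])

def detokenize_step(token, p_prev, p_curr, heading):
    """Invert the mapping: roll the agent forward with the decoded acceleration."""
    ix, iy = divmod(token, len(ACC_BINS))
    accel_world = rotate(np.array([ACC_BINS[ix], ACC_BINS[iy]]), heading)
    return 2.0 * p_curr - p_prev + accel_world * DT**2
```

Because the token encodes acceleration in the agent's own frame, the same token id means "brake gently" whether the car is heading north or east, which is exactly the consistency argument above.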
The Verdict: The agent-centric model-based approach outperforms the data-driven clustering approach significantly. It achieves higher accuracy with a much smaller vocabulary size (169 tokens vs. 2,048), as seen in Table 1.

Module 2: Positional Embedding (PE)
The Problem: In NLP, position is 1D (word 1, word 2, word 3). In driving, position is 2D/3D (x, y, heading) and relational. You need to know that Car A is behind Car B and to the left of the Lane Marker.
Standard NLP positional embeddings (Sine/Cosine functions added to the input) fail here because they don’t capture the relative geometry between 2D objects effectively.
The Solution: The authors adapt DRoPE (Directional Rotary Positional Embedding). Instead of adding a vector, RoPE rotates the embedding in high-dimensional space. The angle of rotation corresponds to the position/orientation.

The math for rotating based on position and heading looks like this:
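The paper derives its own variant, but the underlying RoPE identity it builds on is standard: rotate queries and keys by an angle tied to their own position, and the attention score only ever sees the difference of those angles:

\[
\big(R(\theta_i)\,q\big)^{\top}\big(R(\theta_j)\,k\big) \;=\; q^{\top} R(\theta_j - \theta_i)\,k,
\qquad
R(\theta)=\begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix}.
\]

In DRoPE, \(\theta\) is no longer a 1D token index: it is built from each token's 2D position and heading, so the score ends up depending on the relative offset and relative orientation between agents and map elements.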

The Twist: The authors found that standard DRoPE had a flaw. If you encode every lane segment in its own local coordinate system, two parallel lanes look mathematically identical to the network. The attention mechanism struggles to distinguish them.
They propose Global-DRoPE. They keep the coordinates in the global frame (preserving the semantic distinction between “left lane” and “right lane”) but use the Rotary mechanism to inject relative position information during the attention calculation.
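Here is a toy numpy sketch of that idea (not the paper's implementation): positions and headings stay in the global frame, the rotary rotation is applied to queries and keys, and only relative geometry survives the dot product. The channel split, the frequency, and the way heading is folded in are purely illustrative choices.

```python
import numpy as np

def rope_rotate(vec, angle):
    """Rotate consecutive channel pairs of a query/key vector by `angle`."""
    out = vec.copy()
    out[0::2] = np.cos(angle) * vec[0::2] - np.sin(angle) * vec[1::2]
    out[1::2] = np.sin(angle) * vec[0::2] + np.cos(angle) * vec[1::2]
    return out

def attention_score(q, k, pos_q, pos_k, head_q, head_k, freq=0.05):
    """Score two tokens whose positions/headings are given in the GLOBAL frame.
    Thanks to the rotary property, only their differences affect the result."""
    d = len(q) // 2                      # assume the head dim is divisible by 4
    q_rot = np.concatenate([rope_rotate(q[:d], freq * pos_q[0] + head_q),
                            rope_rotate(q[d:], freq * pos_q[1] + head_q)])
    k_rot = np.concatenate([rope_rotate(k[:d], freq * pos_k[0] + head_k),
                            rope_rotate(k[d:], freq * pos_k[1] + head_k)])
    return float(q_rot @ k_rot) / np.sqrt(len(q))

q = np.random.default_rng(0).normal(size=8)
k = np.random.default_rng(1).normal(size=8)
s1 = attention_score(q, k, (10.0, 5.0), (12.0, 6.0), 0.3, 0.8)
s2 = attention_score(q, k, (110.0, 105.0), (112.0, 106.0), 0.3, 0.8)
print(np.isclose(s1, s2))  # True: translating the whole scene leaves the score unchanged
```

The final check prints `True`: shifting the entire scene by 100 meters does not change the attention score, which is the relative-position property the authors keep while encoding the map features themselves in global coordinates.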

Look at Figure 11. The top row (Local coordinates) shows high redundancy—the features look the same. The bottom row (Global coordinates) shows smooth transitions and distinct features for different parts of the intersection. This leads to better trajectory generation, as seen in Figure 12 below.

Module 3: Pre-training and Scaling Laws
The Question: One of the defining characteristics of LLMs is the Scaling Law: adding more data and parameters predictably lowers the loss. Does this hold for driving?
The Experiment: The authors trained models of varying sizes (Mini to Large) on datasets ranging from 10% to 800% (augmented) of the Waymo Open Motion Dataset.
The Verdict: Yes, scaling laws apply. As shown in Figure 9, loss decreases linearly (on a log-log scale) as data increases. However, they hit a ceiling: the “Large” model started overfitting even on the 800% dataset, suggesting that simply augmenting data isn’t enough—we need genuinely larger datasets (like all of Waymo or nuPlan) to push parameters further.
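For reference, a straight line on a log-log plot corresponds to a power law of the usual form from the LLM scaling-law literature; the generic shape is shown below (the fitted exponents belong to the paper's Figure 9 and are not reproduced here):

\[
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(P) \approx \left(\frac{P_c}{P}\right)^{\alpha_P},
\]

where \(D\) is the dataset size, \(P\) the parameter count, and \(D_c, P_c, \alpha_D, \alpha_P\) are fitted constants.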

Qualitatively, the pre-trained model learns complex behaviors like U-turns and yielding to pedestrians (Figure 13).

Module 4: Post-training (Alignment)
The Problem: A pre-trained model (trained via imitation learning) is like a text completion model—it mimics the average human. But “average” human driving isn’t always safe or optimal. We need to align the model, just like we use RLHF (Reinforcement Learning from Human Feedback) to stop ChatGPT from being toxic.
The Approaches: The authors compared three alignment techniques:
- SFT (Supervised Fine-Tuning): Fine-tuning on safety-critical cases.
- REINFORCE / A2C: Standard Reinforcement Learning methods maximizing a reward function (penalizing collisions).
- GRPO (Group Relative Policy Optimization): A newer method used in LLMs (like DeepSeek-Math).
Why GRPO? Standard RL (A2C) requires training a "Critic" network to estimate the value of a state. This is notoriously unstable in high-dimensional driving tasks. GRPO skips the Critic: instead, it generates a group of \(N\) trajectories for the same input, ranks them, and updates the policy to favor the ones that beat the group average.
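A minimal sketch of that group-relative trick (the reward terms here are made up for illustration; the paper's reward design differs):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: each rollout is scored against its own group,
    so no learned critic is needed."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Four rollouts of the same scene; e.g. reward = -(collisions) - (off-road steps).
print(group_relative_advantages([0.0, -3.0, -1.0, 0.0]))
# -> [ 0.816 -1.633  0.     0.816]  (collision-free rollouts get pushed up)
```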

The GRPO loss function includes a KL-divergence term to ensure the model doesn’t drift too far from the original human-like physics (Equation below).
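The exact reward terms and coefficients are the paper's; the objective itself follows the standard GRPO form introduced with DeepSeek-Math: a clipped policy-ratio term weighted by the group-relative advantage, minus a KL penalty toward the reference (pre-trained) policy:

\[
\mathcal{J}(\theta)=\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\min\Big(r_i(\theta)A_i,\ \operatorname{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)A_i\Big)\right]-\beta\,D_{\mathrm{KL}}\big(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\big),
\]
\[
r_i(\theta)=\frac{\pi_\theta(\tau_i)}{\pi_{\theta_{\mathrm{old}}}(\tau_i)},\qquad
A_i=\frac{R_i-\operatorname{mean}(R_{1:N})}{\operatorname{std}(R_{1:N})},
\]

where \(\tau_i\) is the \(i\)-th rollout in the group and \(R_i\) its reward.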

The Verdict: GRPO wins. As shown in Table 4, it achieves the best balance. Standard RL (REINFORCE/A2C) creates overly conservative “robot-like” drivers that refuse to move to avoid collisions (low Realism score). GRPO maintains human-like realism while significantly improving safety.

The difference is stark in the heatmaps below. SFT changes little. REINFORCE converges to a single safe path (boring/predictable). GRPO explores a safe distribution of paths.

Figure 14 shows specific cases where GRPO solves “failed” scenarios from the pre-trained model, such as safely merging or yielding to pedestrians.

Module 5: Test-time Computing
The Problem: Even the best model occasionally outputs a collision. In LLMs, we use techniques like “Chain of Thought” or “Best-of-N” sampling to improve output during inference.
The Solution: The authors implement a Generate-Search-Cluster pipeline (a code sketch follows the list):
- Generate: Produce many rollouts (e.g., 1024) in parallel.
- Search: Filter out any trajectories that collide with the map or agents.
- Cluster: Use K-Medoids to group the remaining safe trajectories and pick diverse representatives.
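A simplified sketch of that pipeline is below; `generate_fn` and `collides_fn` stand in for the trained generator and the map/agent collision checker, and the greedy farthest-point selection is a lightweight stand-in for the K-Medoids step.

```python
import numpy as np

def test_time_pipeline(generate_fn, collides_fn, n_rollouts=1024, n_keep=32, seed=0):
    """Generate-Search-Cluster at inference time (a simplified sketch)."""
    rng = np.random.default_rng(seed)
    rollouts = [generate_fn() for _ in range(n_rollouts)]      # Generate
    safe = [r for r in rollouts if not collides_fn(r)]         # Search: drop collisions
    if len(safe) <= n_keep:
        return safe
    # Cluster: greedy farthest-point selection over trajectory endpoints,
    # keeping diverse representatives of the safe set.
    endpoints = np.array([r[-1] for r in safe])                # assume r is a sequence of (x, y)
    chosen = [int(rng.integers(len(safe)))]
    while len(chosen) < n_keep:
        dist = np.linalg.norm(endpoints[:, None] - endpoints[chosen][None], axis=-1).min(axis=1)
        chosen.append(int(np.argmax(dist)))                    # farthest remaining endpoint
    return [safe[i] for i in chosen]
```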
The Verdict: This is the single most effective way to ensure safety. As Figure 15 visualizes, raw outputs have many “red X” collisions. The “Search + Cluster” approach almost eliminates them.

Results and Limitations
Combining all these optimized modules (Verlet-Agent Tokenizer + Global-DRoPE + GRPO + Search/Cluster), the authors submitted their model to the Waymo Sim Agents Leaderboard.
They achieved competitive results against State-of-the-Art (SOTA) methods, as seen in Table 6.

The Evaluation Paradox
The authors highlight a critical flaw in current benchmarks. The standard metric is “Realism Likelihood,” which measures how close the generated trajectory is to the ground truth.
But what if the ground truth is bad?
In Figure 4, the human driver (Ground Truth) actually went off-road due to noise or error. The pre-trained model mimicked this error and got a high score. The optimized (GRPO) model fixed the error and stayed on the road, but got a low score because it didn’t match the ground truth. This indicates that as AD models get better than humans, our metrics need to evolve.

Conclusion
This paper provides a roadmap for adapting the “LLM playbook” to autonomous driving. The answer to “Do LLM modules generalize?” is a nuanced “Yes, but…”
- Yes: Autoregressive generation, scaling laws, and alignment (RLHF/GRPO) are powerful paradigms that transfer well.
- But: You cannot ignore the physical nature of the domain. Tokenizers must respect physics (Verlet), embeddings must respect geometry (Global-DRoPE), and RL must balance safety with human-like realism.
By treating autonomous driving as a language modeling problem with physical constraints, we are moving closer to generative world models that can simulate—and eventually navigate—the complexities of the real world.