What if you could tackle a complex reinforcement learning problem the same way you’d complete a sentence? This is the radical and powerful idea behind the Decision Transformer—a paper that reframes the entire field of sequential decision-making. For decades, Reinforcement Learning (RL) has been dominated by algorithms that learn value functions and policy gradients, often wrestling with complex issues like temporal credit assignment, bootstrapping instability, and discounting. But what if we could sidestep all of that?
The past few years have shown the incredible power of Transformer architectures. Models like GPT-3 can write poetry, code, and essays by simply predicting the next token in a sequence. This success in language modeling has led researchers to ask a profound question: Can the simple, scalable, and powerful paradigm of sequence modeling be applied to RL?
The authors of Decision Transformer: Reinforcement Learning via Sequence Modeling answer with a resounding yes. They propose a framework that treats an entire RL trajectory—states, actions, and returns—as a sequence of tokens, much like a sentence. By doing this, they can train a GPT-style model to “complete” the sequence by generating the right actions, conditioned on a desired outcome. This elegant approach not only works but matches or exceeds the performance of state-of-the-art methods on challenging benchmarks—without the traditional machinery of RL like Bellman backups, temporal difference learning, or explicit value functions.
An Intuitive Example
Imagine trying to find the shortest path in a graph. Traditional RL might explore, assign values to nodes, and slowly propagate those values back to earlier states. The Decision Transformer approach is different. It’s like training on thousands of transcripts of random walks through the graph, where each transcript records the moves taken and the length of the resulting path. At test time, you prompt the model: “Show me a path of the shortest possible length.” Leveraging the patterns it has learned between sequences and outcomes, the model generates an optimal set of moves.
Figure 1: An intuitive example of the Decision Transformer concept. The model is trained on random walks through a graph (middle) and can then be prompted to generate an optimal shortest path to the goal (right) by conditioning on a high desired return.
In this article, we’ll explore how the Decision Transformer works. We’ll cover offline RL and the Transformer architecture, break down the model’s design step-by-step, and analyze experimental results that make this paper a milestone in reinforcement learning research.
Background: Setting the Stage
To appreciate the Decision Transformer, we need to understand two key concepts: Offline Reinforcement Learning and Transformers.
Offline Reinforcement Learning
In classic (“online”) RL, an agent learns by actively interacting with its environment—trying actions, observing rewards, and updating its policy via trial-and-error. While powerful, this can be slow, costly, or unsafe in real-world settings such as robotics or autonomous driving.
Offline RL offers a different paradigm: the agent learns purely from a fixed, pre-collected dataset of trajectories. This dataset could contain expert demonstrations, suboptimal policies, or even random exploration. The agent must extract the best possible policy without any further data collection.
This is challenging because of distributional shift: if the learned policy visits states poorly covered by the dataset, its value estimates there can be wildly inaccurate, causing performance to collapse. Many offline RL algorithms tackle this via policy constraints (restricting actions to stay close to those in the dataset) or value pessimism (deliberately underestimating the value of unseen state-action pairs). As we’ll see, the Decision Transformer sidesteps these complications entirely.
Transformers: The Engine of Modern AI
Transformers excel at modeling sequential data through the self-attention mechanism. For each token in a sequence, the model computes:
- Query (Q): “What information am I looking for?”
- Key (K): “What information do I have?”
- Value (V): “What information can I provide?”
It compares each token’s Query vector to all others’ Keys, producing relevance scores via a dot product. These scores weight the Values to produce a weighted sum:
\[ z_{i} = \sum_{j=1}^{n} \operatorname{softmax}(\{\langle q_{i}, k_{j'} \rangle\}_{j'=1}^{n})_{j} \cdot v_{j} \]
This enables the model to connect and assign credit between distant sequence elements, which is essential for both language understanding and RL credit assignment.
The Decision Transformer uses a GPT-style architecture with a causal mask in self-attention. This forces predictions for the current token to depend only on previous tokens—perfect for generating actions step-by-step.
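To make the mechanism concrete, here is a minimal sketch of single-head causal self-attention in PyTorch; the function name and the projection matrices are illustrative, not taken from the paper’s code.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head masked self-attention over a (batch, seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # project to queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot-product relevance
    # Causal mask: position i may only attend to positions j <= i
    seq_len = x.shape[1]
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                    # attention weights over past tokens
    return weights @ v                                     # weighted sum of values
```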
The Core Method: RL as Sequence Modeling
The Decision Transformer reframes RL’s goal: instead of directly learning a policy \(\pi(a|s)\), it models the joint distribution of trajectories and uses conditional generation to produce actions leading to a target outcome.
A New Trajectory Representation
A standard RL trajectory is:
\((s_1, a_1, r_1, s_2, a_2, r_2, \dots)\)
Instead of raw rewards \(r_t\), the Decision Transformer uses the return-to-go:
\[ \widehat{R}_t = \sum_{t'=t}^{T} r_{t'} \]
Trajectory tokens become:
\[ \tau = (\widehat{R}_1, s_1, a_1, \widehat{R}_2, s_2, a_2, \dots, \widehat{R}_T, s_T, a_T) \]
Given a state \(s_t\) and a desired future return \(\widehat{R}_t\), the model learns: “Which action \(a_t\) will get me there?”
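As a concrete illustration, here is a minimal sketch of how returns-to-go might be computed from a reward sequence and interleaved with states and actions; the helper names are assumptions, not the paper’s released code.

```python
import numpy as np

def returns_to_go(rewards):
    """Compute R_hat_t = sum of rewards from step t to the end of the episode."""
    rewards = np.asarray(rewards, dtype=np.float32)
    return np.cumsum(rewards[::-1])[::-1]   # reverse cumulative sum
    # e.g. rewards [1, 0, 2] -> returns-to-go [3, 2, 2]

def to_tokens(rewards, states, actions):
    """Interleave tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    tokens = []
    for rtg, s, a in zip(returns_to_go(rewards), states, actions):
        tokens += [("return_to_go", rtg), ("state", s), ("action", a)]
    return tokens
```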
Architecture
Figure 2: The Decision Transformer architecture. Returns-to-go, states, and actions are embedded and given timestep positional encodings. A GPT-style causal transformer predicts the next action.
Steps:
- Embedding Inputs: The last \(K\) timesteps yield \(3K\) tokens (one each for \(\widehat{R}\), \(s\), and \(a\)), embedded with modality-specific layers (a CNN encoder for image states, linear layers otherwise).
- Timestep Positional Encoding: Learned per-timestep embeddings are added so that \(\widehat{R}_t\), \(s_t\), and \(a_t\) align temporally.
- Causal Transformer: GPT processes the embedded sequence with masked self-attention.
- Prediction Head: The output for each state token predicts its corresponding action (cross-entropy for discrete, MSE for continuous).
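Putting these steps together, a minimal PyTorch sketch of the forward pass might look like the following; the module names, the `gpt_backbone` argument, and the use of linear state embeddings are illustrative assumptions rather than the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class DecisionTransformerSketch(nn.Module):
    def __init__(self, state_dim, act_dim, d_model, max_timestep, gpt_backbone):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)             # return-to-go embedding
        self.embed_state = nn.Linear(state_dim, d_model)   # linear state embedding (CNN for images)
        self.embed_action = nn.Linear(act_dim, d_model)    # action embedding
        self.embed_timestep = nn.Embedding(max_timestep, d_model)
        self.gpt = gpt_backbone                            # any causal transformer stack
        self.predict_action = nn.Linear(d_model, act_dim)  # prediction head

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, K, 1), states: (B, K, state_dim), actions: (B, K, act_dim), timesteps: (B, K)
        t = self.embed_timestep(timesteps)                 # shared per-timestep positional encoding
        r = self.embed_rtg(rtg) + t
        s = self.embed_state(states) + t
        a = self.embed_action(actions) + t
        # Interleave to (R_1, s_1, a_1, ..., R_K, s_K, a_K): shape (B, 3K, d_model)
        x = torch.stack([r, s, a], dim=2).reshape(r.shape[0], -1, r.shape[-1])
        h = self.gpt(x)                                    # causal (masked) self-attention
        # Actions are predicted from the hidden state at each *state* token position
        return self.predict_action(h[:, 1::3, :])          # (B, K, act_dim)
```

For continuous control, this head would be trained with mean-squared error against the dataset actions; for discrete actions, a cross-entropy loss over logits is used instead.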
Test-Time Generation
To use the model:
- Set a target return (e.g., expert score).
- Observe the initial state \(s_1\).
- Feed \((\widehat{R}_1 = \text{target return},\ s_1)\) into the model → predict \(a_1\).
- Execute \(a_1\); observe the new state \(s_2\) and reward \(r_1\).
- Update the return-to-go: \(\widehat{R}_2 = \widehat{R}_1 - r_1\).
- Append to sequence and repeat until episode ends.
By conditioning on the steadily decreasing return-to-go, the model keeps generating actions that steer the episode toward the originally requested return.
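A minimal sketch of this evaluation loop, assuming a Gym-style `env` and a `model.act` helper that wraps the forward pass above (both names are illustrative):

```python
def evaluate(model, env, target_return, max_steps=1000):
    state = env.reset()
    rtgs, states, actions = [target_return], [state], []
    total_reward, done, t = 0.0, False, 0
    while not done and t < max_steps:
        # Predict the next action from the sequence so far (truncated to the
        # model's context length K inside model.act in this sketch).
        action = model.act(rtgs, states, actions, timestep=t)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        rtgs.append(rtgs[-1] - reward)   # decrement the return-to-go by the observed reward
        states.append(state)
        actions.append(action)
        t += 1
    return total_reward
```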
Experiments and Results
Benchmarks: Atari (visual observations, discrete actions), OpenAI Gym locomotion tasks from D4RL (continuous control), and Key-to-Door (long-term credit assignment). Comparisons include Conservative Q-Learning (CQL), Behavior Cloning (BC), and other offline RL baselines such as QR-DQN, REM, BEAR, BRAC-v, and AWR.
Figure 3: Across diverse tasks, Decision Transformer matches or surpasses TD-learning and imitation learning baselines.
Atari and Continuous Control
Atari, trained on 1% of the DQN-replay dataset; scores are gamer-normalized (100 corresponds to a professional gamer, 0 to a random policy):
| Game | DT (Ours) | CQL | QR-DQN | REM | BC |
|---|---|---|---|---|---|
| Breakout | 267.5 ± 97.5 | 211.1 | 17.1 | 8.9 | 138.9 ± 61.7 |
| Qbert | 15.4 ± 11.4 | 104.2 | 0.0 | 0.0 | 17.3 ± 14.7 |
| Pong | 106.1 ± 8.1 | 111.9 | 18.0 | 0.5 | 85.2 ± 20.0 |
| Seaquest | 2.5 ± 0.4 | 1.7 | 0.4 | 0.7 | 2.1 ± 0.3 |
D4RL Continuous Control:
| Dataset | Environment | DT (Ours) | CQL | BEAR | BRAC-v | AWR | BC |
|---|---|---|---|---|---|---|---|
| Medium-Expert | HalfCheetah | 86.8 ± 1.3 | 62.4 | 53.4 | 41.9 | 52.7 | 59.9 |
| Medium-Expert | Hopper | 107.6 ± 1.8 | 111.0 | 96.3 | 0.8 | 27.1 | 79.6 |
| Medium-Expert | Walker | 108.1 ± 0.2 | 98.7 | 40.1 | 81.6 | 53.8 | 36.6 |
| Medium | Hopper | 67.6 ± 1.0 | 58.0 | 52.1 | 31.1 | 35.9 | 63.9 |
| Medium-Replay | Hopper | 82.7 ± 7.0 | 48.6 | 33.7 | 0.6 | 28.4 | 27.6 |
| Medium-Replay | Walker | 66.6 ± 3.0 | 26.7 | 19.2 | 0.9 | 15.5 | 36.9 |
DT is consistently competitive or superior.
Is It Just Cloning the Best?
Percentile Behavior Cloning (%BC) trains a behavior-cloning policy only on the top X% of trajectories in the dataset, ranked by return.
Low-data regimes (Atari 1%) show %BC is weak versus DT, suggesting DT learns from the entire distribution—good and bad trajectories alike.
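A minimal sketch of the %BC data filter, assuming each trajectory is stored as a dict with a precomputed episode `"return"` (the names are illustrative):

```python
import numpy as np

def top_percent(trajectories, percent=10.0):
    """Keep only trajectories whose return is in the top `percent` of the dataset."""
    returns = np.array([traj["return"] for traj in trajectories])
    threshold = np.percentile(returns, 100.0 - percent)
    return [traj for traj in trajectories if traj["return"] >= threshold]
```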
Controllability
Figure 4: Achieved returns (blue) closely follow desired target returns (green). Orange marks best dataset trajectory.
DT can match desired returns, even exceeding dataset maxima (e.g., Seaquest) by stitching optimal segments from multiple trajectories.
Context and Long-Term Credit
Short contexts (\(K = 1\), i.e., conditioning only on the current step) noticeably degrade performance. Longer contexts let the model infer which behavior policy generated the data it is seeing, improving its action predictions.
Key-to-Door:
The agent must pick up a key in phase 1, traverse a distractor phase 2, and reach a door in phase 3; it is rewarded only if it opens the door while carrying the key from phase 1.
| Dataset | DT | CQL | BC | %BC | Random |
|---|---|---|---|---|---|
| 1K Random Trajectories | 71.8% | 13.1% | 1.4% | 69.9% | 3.1% |
| 10K Random Trajectories | 94.6% | 13.3% | 1.6% | 95.1% | 3.1% |
Figure 5: (Left) Model’s success probability spikes after key pickup. (Right) Attention focuses sharply on key and door events, establishing direct credit assignment.
Self-attention links critical events across time without slow temporal difference propagation.
Conclusion and Implications
The Decision Transformer marks a paradigm shift—by framing RL as a conditional sequence modeling problem, it replaces decades of complex RL algorithms with a simple, scalable Transformer.
Key takeaways:
- Simplicity: No value functions or policy gradients—just supervised learning over sequences.
- Context: Long histories enable effortless long-term credit assignment.
- Control: Desired returns act as a prompt, making policies controllable and even capable of extrapolation.
This approach could lead to RL foundation models trained on diverse behavioral datasets, then fine-tuned for specific tasks. Extending DT to online learning could merge strong behavioral modeling with active exploration.
The Decision Transformer doesn’t just offer an algorithm—it offers a new lens: the future of general, capable agents may depend less on estimating values and more on learning the language of action.