What if you could tackle a complex reinforcement learning problem the same way you’d complete a sentence? This is the radical and powerful idea behind the Decision Transformer—a paper that reframes the entire field of sequential decision-making. For decades, Reinforcement Learning (RL) has been dominated by algorithms that learn value functions and policy gradients, often wrestling with complex issues like temporal credit assignment, bootstrapping instability, and discounting. But what if we could sidestep all of that?
The past few years have shown the incredible power of Transformer architectures. Models like GPT-3 can write poetry, code, and essays by simply predicting the next token in a sequence. This success in language modeling has led researchers to ask a profound question: Can the simple, scalable, and powerful paradigm of sequence modeling be applied to RL?
The authors of Decision Transformer: Reinforcement Learning via Sequence Modeling answer with a resounding yes. They propose a framework that treats an entire RL trajectory—states, actions, and returns—as a sequence of tokens, much like a sentence. By doing this, they can train a GPT-style model to “complete” the sequence by generating the right actions, conditioned on a desired outcome. This elegant approach not only works but matches or exceeds the performance of state-of-the-art methods on challenging benchmarks—without the traditional machinery of RL like Bellman backups, temporal difference learning, or explicit value functions.
An Intuitive Example
Imagine trying to find the shortest path in a graph. Traditional RL might explore, assign values to nodes, and slowly propagate those values back to earlier states. The Decision Transformer approach is different. It’s like training on thousands of transcripts of random walks through the graph, where each transcript records the moves taken and the length of the resulting path. At test time, you prompt the model: “Show me a path of the shortest possible length.” Leveraging the patterns it has learned between sequences and outcomes, the model generates an optimal set of moves.
Figure 1: An intuitive example of the Decision Transformer concept. The model is trained on random walks through a graph (middle) and can then be prompted to generate an optimal shortest path to the goal (right) by conditioning on a high desired return.
In this article, we’ll explore how the Decision Transformer works. We’ll cover offline RL and the Transformer architecture, break down the model’s design step-by-step, and analyze experimental results that make this paper a milestone in reinforcement learning research.
Background: Setting the Stage
To appreciate the Decision Transformer, we need to understand two key concepts: Offline Reinforcement Learning and Transformers.
Offline Reinforcement Learning
In classic (“online”) RL, an agent learns by actively interacting with its environment—trying actions, observing rewards, and updating its policy via trial-and-error. While powerful, this can be slow, costly, or unsafe in real-world settings such as robotics or autonomous driving.
Offline RL offers a different paradigm: the agent learns purely from a fixed, pre-collected dataset of trajectories. This dataset could contain expert demonstrations, suboptimal policies, or even random exploration. The agent must extract the best possible policy without any further data collection.
This is challenging because of distributional shift: if the learned policy visits states poorly covered by the dataset, its value estimates there can be wildly inaccurate, causing performance to collapse. Many offline RL algorithms tackle this via policy constraints (restricting actions to stay close to those in the dataset) or value pessimism (deliberately underestimating the value of unseen state-action pairs). As we’ll see, the Decision Transformer sidesteps these complications entirely.
Transformers: The Engine of Modern AI
Transformers excel at modeling sequential data through the self-attention mechanism. For each token in a sequence, the model computes:
- Query (Q): “What information am I looking for?”
- Key (K): “What information do I have?”
- Value (V): “What information can I provide?”
It compares each token’s Query vector to all others’ Keys, producing relevance scores via a dot product. These scores weight the Values to produce a weighted sum:
\[ z_{i} = \sum_{j=1}^{n} \operatorname{softmax}(\{\langle q_{i}, k_{j'} \rangle\}_{j'=1}^{n})_{j} \cdot v_{j} \]
This enables the model to connect and assign credit between distant sequence elements, which is essential for both language understanding and RL credit assignment.
The Decision Transformer uses a GPT-style architecture with a causal mask in self-attention. This forces predictions for the current token to depend only on previous tokens—perfect for generating actions step-by-step.
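To make the mechanism concrete, here is a minimal sketch of single-head causal self-attention in PyTorch; the function name and the projection matrices are illustrative, not taken from the paper’s code.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head masked self-attention over a (batch, seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # project to queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot-product relevance
    # Causal mask: position i may only attend to positions j <= i
    seq_len = x.shape[1]
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                    # attention weights over past tokens
    return weights @ v                                     # weighted sum of values
```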
The Core Method: RL as Sequence Modeling
The Decision Transformer reframes RL’s goal: instead of directly learning a policy \(\pi(a|s)\), it models the joint distribution of trajectories and uses conditional generation to produce actions leading to a target outcome.
A New Trajectory Representation
A standard RL trajectory is:
\((s_1, a_1, r_1, s_2, a_2, r_2, \dots)\)
Instead of raw rewards \(r_t\), the Decision Transformer uses the return-to-go:
\[ \widehat{R}_t = \sum_{t'=t}^{T} r_{t'} \]
Trajectory tokens become:
\[ \tau = (\widehat{R}_1, s_1, a_1, \widehat{R}_2, s_2, a_2, \dots, \widehat{R}_T, s_T, a_T) \]
Given a state \(s_t\) and a desired future return \(\widehat{R}_t\), the model learns: “Which action \(a_t\) will get me there?”
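As a concrete illustration, here is a minimal sketch of how returns-to-go might be computed from a reward sequence and interleaved with states and actions; the helper names are assumptions, not the paper’s released code.

```python
import numpy as np

def returns_to_go(rewards):
    """Compute R_hat_t = sum of rewards from step t to the end of the episode."""
    rewards = np.asarray(rewards, dtype=np.float32)
    return np.cumsum(rewards[::-1])[::-1]   # reverse cumulative sum
    # e.g. rewards [1, 0, 2] -> returns-to-go [3, 2, 2]

def to_tokens(rewards, states, actions):
    """Interleave tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    tokens = []
    for rtg, s, a in zip(returns_to_go(rewards), states, actions):
        tokens += [("return_to_go", rtg), ("state", s), ("action", a)]
    return tokens
```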
Architecture
Figure 2: The Decision Transformer architecture. Returns-to-go, states, and actions are embedded and given timestep positional encodings. A GPT-style causal transformer predicts the next action.
Steps:
- Embedding Inputs: The last \(K\) timesteps yield \(3K\) tokens (one each for \(\widehat{R}\), \(s\), and \(a\)), embedded with modality-specific layers (a CNN encoder for image states, linear layers otherwise).
- Timestep Positional Encoding: Learned per-timestep embeddings are added so that \(\widehat{R}_t\), \(s_t\), and \(a_t\) align temporally.
- Causal Transformer: GPT processes the embedded sequence with masked self-attention.
- Prediction Head: The output for each state token predicts its corresponding action (cross-entropy for discrete, MSE for continuous).
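Putting these steps together, a minimal PyTorch sketch of the forward pass might look like the following; the module names, the `gpt_backbone` argument, and the use of linear state embeddings are illustrative assumptions rather than the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class DecisionTransformerSketch(nn.Module):
    def __init__(self, state_dim, act_dim, d_model, max_timestep, gpt_backbone):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)             # return-to-go embedding
        self.embed_state = nn.Linear(state_dim, d_model)   # linear state embedding (CNN for images)
        self.embed_action = nn.Linear(act_dim, d_model)    # action embedding
        self.embed_timestep = nn.Embedding(max_timestep, d_model)
        self.gpt = gpt_backbone                            # any causal transformer stack
        self.predict_action = nn.Linear(d_model, act_dim)  # prediction head

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, K, 1), states: (B, K, state_dim), actions: (B, K, act_dim), timesteps: (B, K)
        t = self.embed_timestep(timesteps)                 # shared per-timestep positional encoding
        r = self.embed_rtg(rtg) + t
        s = self.embed_state(states) + t
        a = self.embed_action(actions) + t
        # Interleave to (R_1, s_1, a_1, ..., R_K, s_K, a_K): shape (B, 3K, d_model)
        x = torch.stack([r, s, a], dim=2).reshape(r.shape[0], -1, r.shape[-1])
        h = self.gpt(x)                                    # causal (masked) self-attention
        # Actions are predicted from the hidden state at each *state* token position
        return self.predict_action(h[:, 1::3, :])          # (B, K, act_dim)
```

For continuous control, this head would be trained with mean-squared error against the dataset actions; for discrete actions, a cross-entropy loss over logits is used instead.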
Test-Time Generation
To use the model:
- Set a target return (e.g., expert score).
- Observe the initial state \(s_1\).
- Feed \((\widehat{R}_1 = \text{target return},\ s_1)\) into the model → predict \(a_1\).
- Execute \(a_1\); observe the new state \(s_2\) and reward \(r_1\).
- Update the return-to-go: \(\widehat{R}_2 = \widehat{R}_1 - r_1\).
- Append to sequence and repeat until episode ends.
By conditioning on the steadily decreasing return-to-go, the model keeps generating actions that steer the episode toward the originally requested return.
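A minimal sketch of this evaluation loop, assuming a Gym-style `env` and a `model.act` helper that wraps the forward pass above (both names are illustrative):

```python
def evaluate(model, env, target_return, max_steps=1000):
    state = env.reset()
    rtgs, states, actions = [target_return], [state], []
    total_reward, done, t = 0.0, False, 0
    while not done and t < max_steps:
        # Predict the next action from the sequence so far (truncated to the
        # model's context length K inside model.act in this sketch).
        action = model.act(rtgs, states, actions, timestep=t)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        rtgs.append(rtgs[-1] - reward)   # decrement the return-to-go by the observed reward
        states.append(state)
        actions.append(action)
        t += 1
    return total_reward
```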
Experiments and Results
Benchmarks: Atari (visual observations, discrete actions), OpenAI Gym locomotion tasks from D4RL (continuous control), and Key-to-Door (long-term credit assignment). Comparisons include Conservative Q-Learning (CQL), Behavior Cloning (BC), and other offline RL baselines such as QR-DQN, REM, BEAR, BRAC-v, and AWR.
Figure 3: Across diverse tasks, Decision Transformer matches or surpasses TD-learning and imitation learning baselines.
Atari and Continuous Control
Atari, trained on 1% of the DQN-replay dataset; scores are gamer-normalized (100 corresponds to a professional gamer, 0 to a random policy):
| Game | DT (Ours) | CQL | QR-DQN | REM | BC |
|---|---|---|---|---|---|
| Breakout | 267.5 ± 97.5 | 211.1 | 17.1 | 8.9 | 138.9 ± 61.7 |
| Qbert | 15.4 ± 11.4 | 104.2 | 0.0 | 0.0 | 17.3 ± 14.7 |
| Pong | 106.1 ± 8.1 | 111.9 | 18.0 | 0.5 | 85.2 ± 20.0 |
| Seaquest | 2.5 ± 0.4 | 1.7 | 0.4 | 0.7 | 2.1 ± 0.3 |
D4RL Continuous Control:
| Dataset | Environment | DT (Ours) | CQL | BEAR | BRAC-v | AWR | BC |
|---|---|---|---|---|---|---|---|
| Medium-Expert | HalfCheetah | 86.8 ± 1.3 | 62.4 | 53.4 | 41.9 | 52.7 | 59.9 |
| Medium-Expert | Hopper | 107.6 ± 1.8 | 111.0 | 96.3 | 0.8 | 27.1 | 79.6 |
| Medium-Expert | Walker | 108.1 ± 0.2 | 98.7 | 40.1 | 81.6 | 53.8 | 36.6 |
| Medium | Hopper | 67.6 ± 1.0 | 58.0 | 52.1 | 31.1 | 35.9 | 63.9 |
| Medium-Replay | Hopper | 82.7 ± 7.0 | 48.6 | 33.7 | 0.6 | 28.4 | 27.6 |
| Medium-Replay | Walker | 66.6 ± 3.0 | 26.7 | 19.2 | 0.9 | 15.5 | 36.9 |
DT is consistently competitive or superior.
Is It Just Cloning the Best?
Percentile Behavior Cloning (%BC) trains a behavior-cloning policy only on the top X% of trajectories in the dataset, ranked by return.
Low-data regimes (Atari 1%) show %BC is weak versus DT, suggesting DT learns from the entire distribution—good and bad trajectories alike.
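A minimal sketch of the %BC data filter, assuming each trajectory is stored as a dict with a precomputed episode `"return"` (the names are illustrative):

```python
import numpy as np

def top_percent(trajectories, percent=10.0):
    """Keep only trajectories whose return is in the top `percent` of the dataset."""
    returns = np.array([traj["return"] for traj in trajectories])
    threshold = np.percentile(returns, 100.0 - percent)
    return [traj for traj in trajectories if traj["return"] >= threshold]
```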
Controllability
Figure 4: Achieved returns (blue) closely follow desired target returns (green). Orange marks best dataset trajectory.
DT can match desired returns, even exceeding dataset maxima (e.g., Seaquest) by stitching optimal segments from multiple trajectories.
Context and Long-Term Credit
Short contexts (\(K = 1\), i.e., conditioning only on the current step) noticeably degrade performance. Longer contexts let the model infer which behavior policy generated the data it is seeing, improving its action predictions.
Key-to-Door:
The agent must pick up a key in phase 1, traverse a distractor phase 2, and reach a door in phase 3; it is rewarded only if it opens the door while carrying the key from phase 1.
| Dataset | DT | CQL | BC | %BC | Random |
|---|---|---|---|---|---|
| 1K Random Trajectories | 71.8% | 13.1% | 1.4% | 69.9% | 3.1% |
| 10K Random Trajectories | 94.6% | 13.3% | 1.6% | 95.1% | 3.1% |
Figure 5: (Left) Model’s success probability spikes after key pickup. (Right) Attention focuses sharply on key and door events, establishing direct credit assignment.
Self-attention links critical events across time without slow temporal difference propagation.
Conclusion and Implications
The Decision Transformer marks a paradigm shift—by framing RL as a conditional sequence modeling problem, it replaces decades of complex RL algorithms with a simple, scalable Transformer.
Key takeaways:
- Simplicity: No value functions or policy gradients—just supervised learning over sequences.
- Context: Long histories enable effortless long-term credit assignment.
- Control: Desired returns act as a prompt, making policies controllable and even capable of extrapolation.
This approach could lead to RL foundation models trained on diverse behavioral datasets, then fine-tuned for specific tasks. Extending DT to online learning could merge strong behavioral modeling with active exploration.
The Decision Transformer doesn’t just offer an algorithm—it offers a new lens: the future of general, capable agents may depend less on estimating values and more on learning the language of action.