Foundation models like GPT have demonstrated an astonishing ability called in-context learning—the capacity to adapt to new tasks purely from examples, without updating any model parameters. This breakthrough has reshaped modern machine learning across language, vision, and multimodal domains. Now, researchers are extending this power to decision-making systems, spawning a new frontier known as In-Context Reinforcement Learning (ICRL).
The goal is simple but ambitious: build a pretrained agent that can enter a new, unseen environment and quickly learn how to act optimally by using its recent experiences—state, action, and reward tuples—as contextual hints. No gradient updates, no fine-tuning—just pure inference-driven learning.
However, despite its promise, ICRL has been hamstrung by impractical data requirements. Current state-of-the-art approaches, like Algorithm Distillation (AD) and Decision Pretrained Transformer (DPT), demand enormous volumes of high-quality data gathered from expert or even optimal policies across thousands of environments. This level of data curation is extremely costly, and in real-world scenarios—like robotics or autonomous driving—often impossible.
A new paper titled “Random Policy Enables In-Context Reinforcement Learning within Trust Horizons” proposes a paradigm-shifting solution. The authors introduce State-Action Distillation (SAD), an approach that allows pretraining entirely on data generated by random policies. Astoundingly, it works—and it could finally make ICRL practical in real-world applications.
Let’s unpack the brilliance behind SAD.
The Bottleneck of Modern ICRL
To understand SAD, we first need to clarify how ICRL typically operates—and why existing methods struggle.
ICRL reframes decision-making as a supervised learning problem. A transformer-based foundation model is trained to predict actions based on two inputs:
- Context (\(\mathcal{C}\)) — A sequence of past transitions, such as (state, action, reward, next_state) tuples. This serves as the model’s in-context learning data.
- Query State (\(s_q\)) — The current state for which the model must predict the best next action.
During pretraining, the model minimizes a supervised loss over a large set of sampled (context, query_state, action_label) triplets, schematically:
\[
\min_{\theta}\; \mathbb{E}_{(\mathcal{C},\, s_q,\, a_l)}\Big[\ell\big(M_\theta(\mathcal{C}, s_q),\, a_l\big)\Big],
\]
where \(M_\theta(\mathcal{C}, s_q)\) is the model’s predicted action (or action distribution) and \(\ell\) is a supervised loss such as negative log-likelihood or MSE.
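To make the data layout concrete, here is a minimal sketch of one such pretraining sample, assuming discrete states and actions and a list-of-tuples context; the class and field names are illustrative, not taken from the paper’s code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One environment transition: (state, action, reward, next_state)
Transition = Tuple[int, int, float, int]

@dataclass
class PretrainingSample:
    """A single supervised example for ICRL pretraining (illustrative layout)."""
    context: List[Transition]  # C: past transitions collected in the environment
    query_state: int           # s_q: the state the model must act in
    action_label: int          # a_l: the target action the model learns to predict
```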
The challenge lies in creating this dataset—especially determining the action label \(a_l\). Previous methods relied on unrealistic assumptions about data availability:
- Algorithm Distillation (AD): Needs the entire learning history of an RL algorithm—from random initialization to optimal convergence—as the context. It’s extremely data-hungry and only works for short episodic environments.
- Decision Pretrained Transformer (DPT): Simplifies context design but demands access to optimal policies for labeling actions at every query state.
- Decision Importance Transformer (DIT): Tries to operate without perfect labels by weighting context transitions based on returns-to-go. Still, it requires that over 30% of context data come from well-trained policies to achieve useful coverage.
All of these are impractical when ideal policies are unavailable or infeasible to train. SAD shatters this bottleneck: instead of requiring expert trajectories, it extracts signal from random exploration.
State-Action Distillation (SAD): Finding Order in Randomness
SAD’s key insight is both elegant and counterintuitive: even random behavior can reveal patterns of optimality, if you know where to look.
Rather than trying to imitate the random policy, SAD aims to distill the most promising state-action decisions from the raw data it generates. As shown in the conceptual workflow below, this process transforms uniform randomness into structured learning signals.
Figure 1: Pipeline of the SAD method. (i) Gather random contexts. (ii) Sample a query state. (iii) Within a trust horizon, test all possible actions and select the one yielding the highest expected return. (iv) Use these distilled examples to pretrain a foundation model.
Step 1 — Collect Random Contexts
A random (often uniform) policy interacts with several pretraining environments, collecting transition data such as (s, a, r, s') tuples. These random transitions are used directly as the context \(\mathcal{C}\). Importantly, they don’t need to form full episodes, making data collection simple and cheap.
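A rough sketch of this collection step, assuming a Gymnasium-style discrete-action environment; the function below is illustrative, not the authors’ code.

```python
import gymnasium as gym

def collect_random_context(env: gym.Env, num_transitions: int, seed: int = 0):
    """Roll a uniform-random policy and record (s, a, r, s') tuples.
    Transitions are kept even when episodes end, since SAD's context
    does not need to consist of complete episodes."""
    context = []
    state, _ = env.reset(seed=seed)
    for _ in range(num_transitions):
        action = env.action_space.sample()  # uniform-random policy
        next_state, reward, terminated, truncated, _ = env.step(action)
        context.append((state, action, reward, next_state))
        # Start a fresh episode if needed; otherwise continue from next_state.
        if terminated or truncated:
            state, _ = env.reset()
        else:
            state = next_state
    return context
```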
Step 2 — Sample a Query State
A state \(s_q\) is sampled randomly from the environment’s state space.
Step 3 — Distill an Action Label (The Magic Step)
This step is the heart of SAD. For each query state, we simulate a few short rollouts per possible action to evaluate which initial move leads to the highest return under subsequent random behavior.
Concretely:
- For each action \(a \in A\), run one or more short rollouts of length \(N\) in which the first move is \(a\) and every subsequent move follows the random policy.
- Compute the discounted sum of rewards over each mini-episode.
- The action with the highest average return across its rollouts becomes the distilled label \(a_l\).
The parameter \(N\) defines the Trust Horizon—a temporal window representing how far we trust random interactions to yield meaningful comparisons. Even with stochastic behavior, optimizing returns within this local horizon is surprisingly effective.
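In symbols, the distilled label is \(a_l = \operatorname{arg\,max}_{a\in A} \tfrac{1}{N_{\text{ep}}}\sum_{j=1}^{N_{\text{ep}}}\sum_{t=0}^{N-1}\gamma^{t} r^{(j)}_{t}\), where each rollout \(j\) starts at \(s_q\), takes \(a\) as its first move, then acts randomly, and \(\gamma\) is the discount factor. A minimal sketch of this step follows, assuming the pretraining environments are simulators that can be instantiated at an arbitrary query state (the `make_env_at` helper is hypothetical) and expose a Gymnasium-style `step`.

```python
import numpy as np

def distill_action_label(make_env_at, query_state, num_actions: int,
                         trust_horizon: int, num_episodes: int,
                         gamma: float, rng: np.random.Generator) -> int:
    """Return the action whose 'act once, then behave randomly' rollouts
    from query_state yield the highest average discounted return."""
    avg_returns = np.zeros(num_actions)
    for first_action in range(num_actions):
        for _ in range(num_episodes):
            env = make_env_at(query_state)  # hypothetical: simulator reset to s_q
            action, ret, discount = first_action, 0.0, 1.0
            for _ in range(trust_horizon):  # the trust horizon N
                _, reward, terminated, truncated, _ = env.step(action)
                ret += discount * reward
                discount *= gamma
                if terminated or truncated:
                    break
                action = int(rng.integers(num_actions))  # random policy after the first move
            avg_returns[first_action] += ret / num_episodes
    return int(np.argmax(avg_returns))
```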
Step 4 — Supervised Pretraining
Each (context, query_state, action_label)
triplet becomes a training sample. A transformer model is trained in the same autoregressive, supervised way as prior works (using losses like NLL or MSE). The only difference: SAD’s labels come from distilled random exploration, not expert policies.
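For completeness, here is a sketch of one pretraining update under a cross-entropy (NLL) objective. The `model(context, query_state)` interface returning action logits is an assumption for illustration, not the paper’s exact architecture.

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, batch):
    """One supervised update on a batch of distilled
    (context, query_state, action_label) triplets."""
    context, query_state, action_label = batch    # tensors prepared by a data loader
    logits = model(context, query_state)          # (batch_size, num_actions), assumed interface
    loss = F.cross_entropy(logits, action_label)  # NLL over the distilled labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```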
Can We Really Trust a Random Policy?
This question is central. Why should the action that looks best under random exploration correlate with the true optimal action?
The answer rests on trustworthiness: the probability that SAD’s selection matches what an optimal policy would choose, given a sufficiently large trust horizon \(N\).
The authors show theoretically that for certain environments—particularly those with sparse rewards and single-goal outcomes—random policies and optimal policies often agree on which actions are best. Formally:
\[
\underset{a\in A}{\operatorname{arg\,max}}\; Q^{\pi}_{MDP}(s_q, a) \;=\; \underset{a\in A}{\operatorname{arg\,max}}\; Q^{*}_{MDP}(s_q, a), \quad \forall s_q \in S,
\]
where \(Q^{\pi}_{MDP}\) is the action-value function of the random behavior policy \(\pi\) and \(Q^{*}_{MDP}\) is the optimal action-value function. In simple grid-world navigation, for instance, both the optimal and random policies favor moves that bring the agent closer to the goal—even though everything after the first step is random.
Figure 2: An example grid-world MDP. The optimal and random policies both prefer moving toward the goal state, as it maximizes expected discounted returns.
The authors formalize this observation with two theorems:
- Theorem 1 (Multi-Armed Bandits): Random policies become \((1-\delta)\)-trustworthy once each arm is sampled sufficiently often—trustworthiness scales logarithmically with the trust horizon \(N\).
- Theorem 2 (MDPs): For environments with discounted returns, trustworthiness increases with both the horizon length \(N\) and the number of episodes \(N_{\text{ep}}\) evaluated per action.
In short, extending the trust horizon reduces uncertainty: the longer we watch random behavior, the higher the probability that its best short-term action coincides with the true optimum.
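As a toy illustration of this intuition (not the paper’s proof), the following Monte Carlo check estimates how often the empirically best arm of a Bernoulli bandit, chosen from purely random pulls, matches the truly best arm as the per-arm sample budget grows:

```python
import numpy as np

def estimate_trustworthiness(means, pulls_per_arm, trials=2000, seed=0):
    """Fraction of trials in which the arm with the highest sample mean
    (SAD's pick under random exploration) equals the truly best arm."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means)
    best_arm = int(np.argmax(means))
    hits = 0
    for _ in range(trials):
        pulls = rng.binomial(1, means, size=(pulls_per_arm, len(means)))
        hits += int(np.argmax(pulls.mean(axis=0)) == best_arm)
    return hits / trials

# Agreement with the optimal arm rises as the per-arm budget (trust horizon) grows.
for n in (1, 5, 20, 100):
    print(f"pulls per arm = {n:3d}: trustworthiness ~ {estimate_trustworthiness([0.3, 0.5, 0.7], n):.2f}")
```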
Experimental Validation: Making Randomness Work
To test SAD, the authors evaluated it against leading ICRL algorithms—AD, DPT, and DIT—as well as an oracle DPT* that uses optimal action labels. All models shared identical architectures and hyperparameters to ensure fairness, and all except the oracle were pretrained purely on random-policy data.
Two evaluation settings were considered:
- Offline Evaluation: The agent acts using a fixed random-context dataset.
- Online Evaluation: The agent interacts in new environments, building its own context incrementally (sketched below).
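A sketch of the online protocol, assuming the pretrained model exposes an inference-only `predict_action(context, state)` method and the environment follows the Gymnasium API; both interfaces are assumptions for illustration.

```python
def online_evaluation(model, env, num_steps: int) -> float:
    """Let the pretrained model act in a new environment with no gradient
    updates, appending its own transitions to the in-context dataset."""
    context = []
    state, _ = env.reset()
    total_reward = 0.0
    for _ in range(num_steps):
        action = model.predict_action(context, state)  # assumed inference-only call
        next_state, reward, terminated, truncated, _ = env.step(action)
        context.append((state, action, reward, next_state))
        total_reward += reward
        if terminated or truncated:
            state, _ = env.reset()
        else:
            state = next_state
    return total_reward
```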
Bandit Tasks
For classical Gaussian and Bernoulli bandit problems, SAD showed dramatic improvements over all baselines. It achieved both lower suboptimality and lower cumulative regret, nearly matching the oracle DPT* trained with perfect labels.
Figure 3: Offline (a, c) and online (b, d) evaluations on Bandit tasks. SAD consistently dominates baselines.
Navigation: Darkroom and Miniworld
Next, the team tested SAD on navigation tasks featuring sparse, single-goal rewards—classically difficult benchmarks for ICRL.
In both Darkroom and Darkroom-Large, SAD achieved dramatically higher returns in offline and online settings.
Figure 4: Offline (a, c) and online (b, d) performance on Darkroom and Darkroom-Large tasks.
The same phenomenon appeared in the high-dimensional 3D Miniworld task, which uses raw pixel inputs. SAD again outperformed other random-policy baselines, achieving about half the return of the oracle—even without fine-tuning.
Figure 5: SAD retains superior performance even under complex visual conditions.
Across all five benchmarks, SAD yielded 236.3% higher performance in offline evaluation and 135.2% higher in online evaluation compared to the best baseline (DIT). These gains are extraordinary given that no expert policies were used whatsoever.
Ablation Studies: Understanding the Ingredients of SAD
To further probe the method, the authors examined the influence of two factors: the trust horizon \(N\) and transformer architecture parameters.
1. The Trust Horizon
Empirical results validated the theoretical predictions.
- In bandit scenarios, performance improved monotonically with longer horizons, as additional samples refined reward estimates.
- In MDPs like Darkroom, there was an optimal intermediate value: too small an \(N\) gives the rollouts too little time to reach informative (sparse) rewards, while too large an \(N\) lets the random continuation dominate the return estimates and blur the comparison between first actions. In practice, \(N=7\) worked best.
Figure 6: Balancing the trust horizon length \(N\) is critical for maximizing SAD performance.
2. Transformer Architecture Robustness
SAD proved highly robust to variations in transformer size, including the number of attention heads and layers. This indicates that the primary benefit arises from the data generation process itself, not particular architectural tweaks.
Figure 7: SAD performance remains stable regardless of transformer hyperparameter settings.
Conclusion: Real-World ICRL Without Experts
State-Action Distillation (SAD) delivers what previous ICRL algorithms could not—practical, scalable training without expert policies.
By leveraging the trust horizon principle and distilling optimal actions from random rollouts, SAD transforms inexpensive, unstructured data into effective pretraining material for foundation models in reinforcement learning.
Key takeaways:
- No expert data required: SAD relies purely on random policy interactions.
- Distillation unlocks insight within randomness: Short-horizon rollouts reveal sufficient information for optimal decisions.
- Massive performance leap: SAD outperforms all baselines trained under random data and approaches oracle-level results.
- Robust and scalable: Works across discrete-action environments from bandits to 3D navigation.
While the current scope focuses on discrete actions, extending SAD to continuous domains and multi-agent scenarios is an exciting direction for future research. Ultimately, SAD suggests that learning from randomness isn’t a weakness—it’s a pathway to truly general in-context intelligence.