Imagine an AI that doesn’t just follow pre-programmed instructions but learns how to play new games, solve new puzzles, or navigate unfamiliar worlds on the fly — adapting through experience rather than being told exactly what to do.
That’s the promise of In-Context Reinforcement Learning (ICRL). Instead of updating its neural weights through gradient descent, an ICRL agent adjusts its behavior dynamically based on the context — the recent sequence of its own actions, observations, and rewards — much like how large language models can learn new tasks from examples in a prompt.
However, building a scalable, truly adaptive ICRL agent isn’t easy. Traditional reinforcement learners require vast amounts of interactive data, and most existing task collections are too narrow or biased. To learn how to learn, an agent needs thousands — ideally tens of thousands — of structurally diverse tasks. Without that diversity, it merely memorizes how to handle a handful of similar puzzles instead of learning underlying principles of exploration, planning, and adaptation.
A recent paper titled “Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds” proposes a breakthrough two-part solution to this challenge:
- AnyMDP, a procedural generator capable of producing virtually endless high-quality reinforcement learning tasks with minimal structural bias.
- OmniRL, a scalable ICRL training framework that learns from this vast task universe using an elegant technique called Decoupled Policy Distillation (DPD).
Together, these innovations define a new frontier in general-purpose, adaptive AI.
The Quest for Agents That Learn How to Learn
The notion of “learning to learn” lies at the heart of meta-learning, where a model is trained not for one problem but for many. The idea is to expose a system to diverse tasks so that it internalizes a flexible learning algorithm — one that can, during inference, adapt to new problems without additional training.
Large language models demonstrate this brilliantly: the pre-training phase acts as meta-training, enabling them to perform new tasks simply by reading context examples. ICRL applies the same principle to decision-making and control. Instead of words, the context is made up of trajectories of experience — sequences of states, actions, and rewards — allowing an agent to adapt its policy online, guided only by the environment’s feedback.
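To make this concrete, here is a minimal sketch of the in-context loop; the `sequence_model` placeholder, the function names, and the episode structure are illustrative assumptions rather than anything from the paper. The key point is that nothing in the loop updates weights: adaptation happens purely through the growing history the model is conditioned on.

```python
import numpy as np

# Illustrative sketch of the in-context RL loop (not the paper's code): the agent's
# "learning" is nothing but conditioning a frozen sequence model on its own growing
# history of (state, action, reward) triples. The model below is a placeholder.

def sequence_model(history, n_actions, rng):
    """Stand-in for a trained causal model mapping the history to the next action."""
    return int(rng.integers(n_actions))

def icrl_episode(step_fn, init_state, n_actions, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    history, state = [], init_state
    for _ in range(max_steps):
        action = sequence_model(history, n_actions, rng)   # no weight updates anywhere
        next_state, reward, done = step_fn(state, action)
        history.append((state, action, reward))            # adaptation lives in here
        state = next_state
        if done:
            break
    return history
```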
But there’s a critical limitation. If the training set of tasks shares hidden similarities, the model doesn’t learn to learn; it learns to recognize patterns. The challenge is then to craft environments that are both diverse and unbiased enough to push the model beyond memorization and toward true adaptive reasoning.
Part 1: AnyMDP — A Universe of Procedurally Generated Reinforcement Tasks
The first pillar of this work, AnyMDP, is a framework for procedurally generating large-scale collections of discrete Markov Decision Processes (MDPs) — the mathematical foundation of almost every reinforcement learning problem.
The ingenuity of AnyMDP lies in how it enforces structure without bias. Each generated task must satisfy three critical properties that make training challenging yet meaningful:
1. Ergodicity
Every MDP must be ergodic — all states are reachable, ensuring no task is trivially disconnected or unsolvable. This guarantees a rich space for exploration and prevents degenerate dynamics.
2. Banded Transition Matrix
Imagine ranking all states from “easiest” to “hardest” to reach. AnyMDP enforces a banded transition matrix, meaning the agent can only move between states that are “nearby” in this ranking. This constraint ensures a natural progression: the agent must earn its way to more challenging states rather than skip directly to high-value goals.
Mathematically, it guarantees a directional drift through intermediate states — a built-in curriculum that avoids trivial shortcuts.
3. Ascending Value Function
Lastly, each AnyMDP task is designed so that the value of its final goal states is significantly higher than that of any initial state, ensuring the agent always has something worthwhile to strive for.
Together, these design principles force meaningful learning dynamics: to reach higher-value states, an agent must develop strategic exploration. A theorem in the paper confirms that the probability of a random agent reaching such states decays exponentially, making blind trial-and-error virtually impossible.
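To see how these constraints might fit together in practice, here is a small hedged sketch of generating and validating an AnyMDP-like task; the state count, bandwidth, reward placement, and reachability check are illustrative choices, not the authors' exact construction.

```python
import numpy as np

def generate_banded_mdp(n_states=16, n_actions=4, bandwidth=3, seed=0):
    """Sketch of an AnyMDP-like task: banded transitions over ranked states,
    with reward concentrated on the highest-ranked (goal) state.
    All sizes and the reward scheme here are illustrative, not the paper's."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_actions, n_states, n_states))
    for a in range(n_actions):
        for s in range(n_states):
            lo, hi = max(0, s - bandwidth), min(n_states, s + bandwidth + 1)
            w = rng.random(hi - lo)               # transition mass only on nearby ranks
            P[a, s, lo:hi] = w / w.sum()
    R = np.zeros(n_states)
    R[-1] = 1.0                                   # ascending value: only the goal pays off
    return P, R

def is_ergodic(P):
    """Every state should reach every other under the uniform random policy."""
    adj = (P.mean(axis=0) > 0).astype(int)
    closure = adj.copy()
    for _ in range(P.shape[1]):                   # transitive closure by repeated steps
        closure = ((closure + closure @ adj) > 0).astype(int)
    return bool(closure.all())

P, R = generate_banded_mdp()
assert is_ergodic(P), "reject and regenerate tasks whose chains are not irreducible"
```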

Figure 1: AnyMDP tasks (red) are consistently harder and slower to solve than Garnet benchmarks (blue/cyan). They require intelligent exploration rather than brute-force repetition.
Beyond Randomness: Validating AnyMDP’s Difficulty
To verify AnyMDP’s effectiveness, the authors compared learning curves of standard RL algorithms — Tabular Q-Learning (TQL) and Proximal Policy Optimization (PPO) — on AnyMDP and prior procedural benchmarks like Garnet.
In all trials, AnyMDP tasks proved significantly harder to master, validating that they produce genuinely non-trivial challenges. The result: AnyMDP doesn’t just create random tasks; it generates problems that demand real learning.
The paper further examined the stationary distribution of states — how likely each state is to be visited under random and optimal policies.

Figure 2: AnyMDP produces exponentially decaying stationary distributions. High-value states are rarely visited randomly, confirming that the environment encourages structured exploration.
The distinct exponential decay observed for AnyMDP tasks empirically supports the theoretical guarantee: exploration must be intelligent, not arbitrary.
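This analysis is easy to reproduce in miniature: the stationary distribution of a policy-induced Markov chain is the fixed point of d = dP, which power iteration finds directly. The toy six-state chain below is an assumption chosen to mimic the "drift away from the goal" behavior of a random walk on a banded MDP, not data from the paper.

```python
import numpy as np

def stationary_distribution(P_pi, iters=10_000, tol=1e-12):
    """Power iteration on a policy-induced chain: d <- d @ P_pi until convergence."""
    d = np.full(P_pi.shape[0], 1.0 / P_pi.shape[0])
    for _ in range(iters):
        d_next = d @ P_pi
        if np.abs(d_next - d).max() < tol:
            break
        d = d_next
    return d

# Toy 6-state chain with a banded, "mostly drift downward" structure: moving up in
# rank is possible but unlikely, mimicking why random walks rarely reach goal states.
P_random = np.array([
    [0.7, 0.3, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.2, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.2, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.2, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.2, 0.2],
    [0.0, 0.0, 0.0, 0.0, 0.6, 0.4],
])
print(np.round(stationary_distribution(P_random), 4))  # mass decays toward state 5
```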
Part 2: OmniRL — Scaling Up In-Context Reinforcement Learning
Creating an endless world of tasks is only half the battle. Training a model to thrive in it requires an efficient, scalable framework — one that avoids the pitfalls of traditional RL approaches.
Meta-training via reinforcement learning or evolutionary strategies is notoriously slow and hardware-intensive. Recent supervised alternatives, such as Algorithm Distillation (AD) and the Decision-Pretrained Transformer (DPT), improve sample efficiency but introduce a severe distribution-shift problem. During training, models imitate expert trajectories; during inference, they generate their own, and when the two diverge, performance collapses.
Enter Decoupled Policy Distillation (DPD).

Figure 3: DPD separates behavior policy (trajectory generator) from reference policy (training target), resolving the distribution-shift problem and enabling much greater diversity.
DPD mitigates this by breaking the feedback loop between what the model trains on and what it imitates.
Two Policies, Two Roles
Behavior Policy (πᵇ) — generates the trajectories the model learns from. Instead of relying on a single expert, OmniRL uses a rich collection of behavior policies: oracle, tabular Q-learning, model-based RL, random strategies, and even “noisy” versions of these.
Reference Policy (π*) — defines the target actions to imitate. Regardless of the behavior’s quality, the model always learns toward the optimal oracle policy.
This separation creates both diversity and stability. Diverse behaviors expose the model to varied situations, reducing sensitivity to unseen contexts. Yet imitation of the oracle ensures convergence toward optimality.
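In code, the decoupling comes down to which policy writes the trajectory and which policy labels it. The sketch below uses a toy recurrent model and invented tensor shapes as stand-ins for OmniRL's transformer; only the training signal it illustrates (context from πᵇ, targets from π*) reflects the paper's idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyICRLModel(nn.Module):
    """Toy causal model standing in for OmniRL's transformer (illustrative only)."""
    def __init__(self, n_states, n_actions, hidden=64):
        super().__init__()
        self.state_embed = nn.Embedding(n_states, hidden)
        self.action_embed = nn.Embedding(n_actions + 1, hidden)  # +1 = "no action yet"
        self.rnn = nn.GRU(2 * hidden + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, states, prev_actions, prev_rewards, hidden=None):
        x = torch.cat([self.state_embed(states),
                       self.action_embed(prev_actions),
                       prev_rewards.unsqueeze(-1)], dim=-1)
        h, hidden = self.rnn(x, hidden)          # causal pass over the trajectory
        return self.head(h), hidden              # next-action logits, carried state

def dpd_loss(model, states, prev_actions, prev_rewards, oracle_actions):
    """Decoupled Policy Distillation (sketch): the context is a rollout from some
    behavior policy pi^b, but the target at every step is the oracle action pi*."""
    logits, _ = model(states, prev_actions, prev_rewards)
    return F.cross_entropy(logits.flatten(0, 1), oracle_actions.flatten())

# Usage: a trajectory written by a noisy behavior policy, labeled by the oracle.
model = TinyICRLModel(n_states=16, n_actions=4)
states = torch.randint(0, 16, (2, 32))           # (batch, time)
prev_actions = torch.randint(0, 5, (2, 32))
prev_rewards = torch.rand(2, 32)
oracle_actions = torch.randint(0, 4, (2, 32))
dpd_loss(model, states, prev_actions, prev_rewards, oracle_actions).backward()
```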
Using Prior Knowledge and Chunkwise Training
Each action in the trajectory is tagged with metadata indicating its origin policy type — giving the model helpful background context when interpreting mixed behavior types.
To handle extremely long sequences (up to 512,000 steps per context), OmniRL adopts chunkwise training, processing and updating in segments to maintain efficiency while preserving temporal coherence.
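A minimal sketch of chunkwise training, reusing the toy model from the previous snippet; the chunk size, the per-chunk optimizer step, and the detached carried state are assumptions for illustration, since OmniRL's actual architecture and update schedule are more involved.

```python
import torch
import torch.nn.functional as F

def train_chunkwise(model, optimizer, states, prev_actions, prev_rewards,
                    oracle_actions, chunk_size=1024):
    """Chunkwise training sketch: split one very long trajectory into segments,
    carry the recurrent state across chunk boundaries, and run one optimizer step
    per chunk so memory per backward pass stays bounded."""
    hidden = None
    T = states.shape[1]
    for start in range(0, T, chunk_size):
        sl = slice(start, start + chunk_size)
        logits, hidden = model(states[:, sl], prev_actions[:, sl],
                               prev_rewards[:, sl], hidden)
        loss = F.cross_entropy(logits.flatten(0, 1), oracle_actions[:, sl].flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        hidden = hidden.detach()   # keep temporal context, drop the gradient tape
```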

Figure 4: The OmniRL architecture encodes long sequences causally, predicting oracle actions at each step while training over chunked segments for scalability.
Experiments: Toward a General-Purpose Learning Agent
The researchers trained OmniRL exclusively on AnyMDP tasks. Then they tested it on a wide array of environments it had never seen — including unseen AnyMDP variants, standard Gymnasium environments, and even DarkRoom and multi-agent simulation tasks.
The results speak volumes.

Table 1 (excerpt): OmniRL achieves high performance on unseen tasks with dramatically fewer interaction steps. A version trained on Garnet tasks fails to generalize, underscoring the importance of task diversity.
OmniRL trained on AnyMDP achieves top-tier performance across nearly all unseen environments, with orders of magnitude better sample efficiency than PPO and TQL-UCB. Surprisingly, despite never training on multi-agent scenarios, OmniRL demonstrates emergent cooperative behavior — evidence of genuine transfer learning between task structures.
Learning curves further illustrate OmniRL’s rapid adaptation and stable convergence, outperforming both baselines across evaluation episodes.

Figure 5: OmniRL adapts quickly and efficiently on unseen tasks, reflecting true in-context reinforcement learning.
Why Task Diversity Changes Everything
The authors ran a revealing study to test how the number of unique training tasks impacts generalization. They trained four models with datasets containing 100, 1K, 10K, and 128K unique tasks.

Figure 7: Only when trained on ≥10K distinct tasks does broad generalization emerge. Fewer tasks lead to fast mastery of seen cases but poor performance on new ones.
The patterns were striking:
- Small-scale training (≤1K tasks): The model quickly learns but overfits, excelling only on familiar tasks — essentially performing “task recognition.”
- Large-scale training (≥10K tasks): Memorization becomes impossible, forcing the model to acquire general learning capabilities. This marks the emergence of true in-context reinforcement learning.
An additional observation: as task diversity grows, the model requires longer trajectories to adapt — a “tax” for generality. Short contexts measure quick memorization; long contexts reveal genuine adaptability. This insight suggests future evaluations should emphasize asymptotic performance, not just few-shot adaptation.
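One way to act on that suggestion is to report both an early-window and a late-window score for every agent instead of a single few-shot number; the helper below, with its arbitrary window sizes, is a hypothetical illustration rather than the paper's evaluation protocol.

```python
import numpy as np

def adaptation_profile(episode_returns, early=10, late=100):
    """Summarize an in-context learning curve: few-shot performance (first `early`
    episodes) versus asymptotic performance (last `late` episodes)."""
    r = np.asarray(episode_returns, dtype=float)
    return {
        "few_shot": r[:early].mean(),
        "asymptotic": r[-late:].mean(),
        "improvement": r[-late:].mean() - r[:early].mean(),
    }
```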
Conclusion: A Path Toward Truly Adaptive AI
The combination of AnyMDP and OmniRL demonstrates that scaling diversity and optimizing training workflows can yield genuinely general-purpose learning systems.
Key takeaways:
- Task diversity is essential. Generalization arises only when models experience a wide range of structurally distinct problems. Beyond a certain threshold of diversity, memorization gives way to abstract reasoning.
- Longer context is the cost of generalization. General-purpose learners may adapt more slowly initially, but their ability to sustain improvement with experience represents true learning capability.
- Scalability depends on efficient meta-training. Decoupled Policy Distillation and chunkwise causal modeling enable learning at unprecedented scale without crippling computational costs.
This work charts a practical roadmap toward AI agents that learn not just what to do but how to learn — evolving from static problem-solvers into dynamic, continually adaptive intelligences.
By generating more diverse worlds instead of narrowly defined tasks, and by building systems that connect experience and reasoning across millions of steps, we edge one step closer to general-purpose AI that thrives in the unknown.