Introduction: The AI Agent’s Dilemma
Imagine an AI agent that can book your travel on websites, manage your schedule by interacting with digital tools, or even navigate complex e-commerce platforms for you. This is the promise of autonomous agents powered by large language models (LLMs). They can reason, plan, and act with impressive versatility. Yet, to truly excel in real-world tasks, they must learn from experience—just as humans do.
The natural choice for this kind of interactive learning is reinforcement learning (RL), where agents improve through trial and error, receiving rewards for good actions and penalties for poor ones. This approach is what trained AI systems to master games like Go. However, applying RL to real-world LLM agents has proven difficult and expensive.
In traditional setups, shown in Figure 1a, the agent interacts directly with a real environment. This process is slow, costly, and fragile.

Figure 1a. Traditional agent learning suffers from limited tasks, sparse rewards, and costly real-world interactions.
Here are the major obstacles:
- Costly interactions: Each real-world action—like loading a webpage or clicking a button—consumes time and compute. Collecting the millions of samples needed for RL becomes impractical.
- Task scarcity: Diverse, well-specified tasks are rare, and designing and validating new ones demands expensive human effort.
- Unstable feedback: Real-world environments are unpredictable; websites change, APIs fail, and rewards can be delayed or noisy.
- Infrastructure complexity: RL-ready environments often rely on heavy backend systems such as Docker or virtual machines, hindering scalability.
These constraints have kept RL from transforming LLM agents into truly adaptive decision-makers. So, what if instead of wrestling with the real world, we built a dream world—a synthetic, scalable environment tailored for learning?
This is the central idea behind the paper “Scaling Agent Learning via Experience Synthesis.” The researchers introduce DreamGym, a framework that sidesteps real-world limitations by synthesizing high-quality experiences for agents. As shown in Figure 1b, DreamGym uses a reasoning-driven “Experience Model” to create abundant, adaptable, and inexpensive interaction data, enabling effective RL training.

Figure 1b. DreamGym replaces costly real interactions with scalable synthetic experiences generated by a unified experience model.
In this article, we’ll unpack how DreamGym works, why it matters, and what it means for the future of intelligent agents.
Background: The Language of Learning
Before exploring DreamGym, let’s revisit the basics of how an RL agent learns.
An agent’s problem can be represented as a Markov Decision Process (MDP), defined by a tuple \((\mathcal{S}, \mathcal{A}, T, R, \gamma, \rho_0)\).
- States (\(\mathcal{S}\)) describe the environment at a given moment—for example, a webpage’s text and clickable elements.
- Actions (\(\mathcal{A}\)) represent operations the agent can take, such as clicking a button or typing a query.
- Transition function (\(T\)) determines how the environment changes when an action is taken.
- Rewards (\(R\)) measure success or progress.
- Policy (\(\pi_{\theta}\)) is the agent’s internal decision rule; given a state, it outputs a probability distribution over possible actions.
The objective of RL is to optimize the policy parameters \(\theta\) so the agent maximizes its expected cumulative rewards. Policy gradient methods like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) achieve this by nudging the policy toward actions that yield higher rewards.

Policy gradient principle—adjusting the agent’s policy toward beneficial actions.
Here, \(\hat{A}(s_t, a_t)\) represents the advantage function, estimating how much better action \(a_t\) was at state \(s_t\) than average.
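Written out in their standard form (notation here follows common usage and may differ slightly from the paper), the objective and its gradient are:
\[
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right],
\qquad
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{A}(s_t, a_t)\right]
\]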
- Proximal Policy Optimization (PPO) stabilizes training by bounding policy changes between updates and uses a learned value function \(V(s)\) to compute generalized advantages.

PPO: limits aggressive updates and improves stability with generalized advantage estimation.
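For reference, the standard clipped surrogate objective used by PPO is:
\[
\mathcal{L}^{\text{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\]
where \(\epsilon\) bounds how far the new policy may move from the old one in a single update.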
- Group Relative Policy Optimization (GRPO) simplifies this process by discarding the value function. It normalizes rewards across a group of responses for the same task, producing relative advantages that make RL more scalable for LLMs.

GRPO: compares rewards across multiple attempts, removing dependence on value estimation.
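In the standard GRPO formulation, the advantage of the \(i\)-th rollout in a group of \(G\) attempts at the same task is simply its reward normalized against the group (the paper may use a minor variant):
\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}
\]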
Both methods demand vast streams of interaction data—exactly where real-world RL struggles. DreamGym’s synthetic experience approach solves this bottleneck.
The Core Method: Building a World for Learning
DreamGym is a unified ecosystem with three interlocking components:
- Reasoning Experience Model – generates synthetic environment feedback.
- Experience Replay Buffer – blends offline and online trajectories to stabilize training.
- Curriculum Task Generator – produces progressively harder tasks for sustained learning.

Figure 2. DreamGym architecture integrates reasoning-based experience synthesis, replay memory, and adaptive curriculum generation.
1. The Reasoning Experience Model: Dynamics by Language
At the heart of DreamGym lies the reasoning experience model (\(\mathcal{M}_{exp}\))—an LLM trained to emulate environment responses and rewards. When the agent issues an action, the experience model performs “step-wise reasoning” to predict the next state and corresponding reward, simulating how the real environment would evolve.
Crucially, agent training doesn’t need perfect realism. It requires diverse, informative, causally grounded transitions that support robust learning. By operating purely in a textual, abstract state space—rather than raw HTML or pixels—the experience model remains efficient.
Predicting Transitions
Beyond the immediate state-action pair, the model leverages three contexts to enhance reliability:
- Interaction history: keeps multi-step consistency.
- Task instruction: clarifies goals.
- Past experiences from replay: retrieved demonstrations reduce hallucinations.
These inputs feed into a Chain-of-Thought (CoT) reasoning step, guiding prediction of the next state and reward:

Explicit reasoning enables accurate causal transitions and well-grounded rewards.
For example, if the agent clicks a nonexistent button, the reasoning trace explicitly concludes “no state change; reward = 0,” preventing false signals.
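To make the loop concrete, here is a minimal sketch of what one synthetic environment step might look like. Everything in it is illustrative: the `SyntheticStep` container, the `synthesize_step` function, the prompt layout, and the `NEXT STATE:` / `REWARD:` output tags are assumptions for this article, not DreamGym's actual API.

```python
from dataclasses import dataclass


@dataclass
class SyntheticStep:
    reasoning: str   # chain-of-thought explaining why the state changes
    next_state: str  # abstract textual description of the new state
    reward: float    # scalar feedback for the action


def synthesize_step(llm, task, state, action, history, demos):
    """One step of a hypothetical reasoning experience model.

    `llm` is any callable mapping a prompt string to a completion string.
    The prompt packs the three contexts described above: the task
    instruction, the interaction history, and retrieved demonstrations.
    """
    prompt = (
        f"Task: {task}\n"
        f"Relevant past experiences:\n{demos}\n"
        f"Interaction history:\n{history}\n"
        f"Current state:\n{state}\n"
        f"Agent action: {action}\n"
        "Reason step by step about the consequence of this action, then "
        "output the next state after 'NEXT STATE:' and a numeric reward "
        "after 'REWARD:'."
    )
    completion = llm(prompt)

    # Naive parsing of the assumed output format.
    reasoning, _, rest = completion.partition("NEXT STATE:")
    next_state, _, reward_text = rest.partition("REWARD:")
    return SyntheticStep(
        reasoning=reasoning.strip(),
        next_state=next_state.strip(),
        reward=float(reward_text.strip() or 0),
    )
```

Because the next state and the reward come out of the same reasoning trace, an illogical action (such as the nonexistent-button click above) is resolved to "no change, zero reward" rather than producing a misleading training signal.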
Training the Experience Model
Training requires only modest real-world data. Offline trajectory datasets (like WebArena or ALFWorld logs) are annotated using a teacher LLM that explains why each transition occurs. The model is then fine-tuned on both reasoning generation and next-state prediction via supervised learning.

The model learns to reason and transition coherently, distilling real environmental logic.
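Assuming a standard maximum-likelihood setup, this objective can be sketched as follows; the symbols \(\phi\) (experience-model parameters), \(h_t\) (interaction history), \(c_t\) (teacher-annotated reasoning), and \(\mathcal{D}\) (offline dataset) are notation introduced here for illustration, and the paper's exact loss may differ:
\[
\mathcal{L}(\phi) = -\,\mathbb{E}_{(s_t,\, a_t,\, h_t,\, c_t,\, s_{t+1},\, r_t) \sim \mathcal{D}}\left[\log p_{\phi}\big(c_t,\, s_{t+1},\, r_t \mid s_t,\, a_t,\, h_t\big)\right]
\]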
This creates a powerful “virtual environment” capable of interacting with agents online for realistic, grounded training.
2. The Experience Replay Buffer: Keeping Synthetic Worlds Grounded
To ensure realism and prevent drift, DreamGym uses an experience replay buffer—a dynamic memory bank that stores trajectories. Initially seeded with offline real-environment data, the buffer is continuously enriched with synthetic interactions. As training proceeds, both the agent policy and experience model evolve together, ensuring recency, relevance, and stability.
This “co-evolution” mechanism mirrors human learning, where new experiences reinforce old ones while enabling adaptation.
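A minimal sketch of such a buffer is shown below. The class name, the mixing ratio, and the sampling scheme are assumptions chosen for illustration, not DreamGym's actual implementation.

```python
import random
from collections import deque


class ExperienceReplayBuffer:
    """Mixes seed trajectories from real environments with fresh
    synthetic rollouts, so the experience model stays grounded while
    still reflecting the current policy."""

    def __init__(self, offline_trajectories, max_online=50_000):
        self.offline = list(offline_trajectories)   # fixed real seed data
        self.online = deque(maxlen=max_online)      # rolling synthetic data

    def add(self, trajectory):
        """Store a newly synthesized trajectory from the current policy."""
        self.online.append(trajectory)

    def sample(self, batch_size, online_fraction=0.75):
        """Sample a batch that favors recent synthetic data (assumed ratio)."""
        n_online = min(int(batch_size * online_fraction), len(self.online))
        n_offline = batch_size - n_online
        batch = random.sample(list(self.online), n_online) if n_online else []
        batch += random.sample(self.offline, min(n_offline, len(self.offline)))
        return batch
```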
3. The Curriculum Task Generator: Always a New Challenge
Learning stagnates if tasks are either too easy or impossible. DreamGym solves this through an automated curriculum-based task generator.

New tasks are synthesized from challenging seed tasks by a generator that shares parameters with the experience model.
Using a reward entropy heuristic, DreamGym identifies which tasks offer the highest information gain. Tasks where the agent inconsistently succeeds—showing both failures and successes—are deemed optimal learning examples:

Tasks with balanced success/failure rates drive maximal learning progress.
High-entropy tasks are expanded into progressively complex variations, automatically building a rich curriculum. As the agent's performance stabilizes, the generator moves on to harder variations, sustaining growth without manual effort.
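The selection heuristic can be sketched as the binary entropy of each task's empirical success rate: tasks the agent solves about half the time score highest and become seeds for new variations. The helper functions below (`reward_entropy`, `pick_seed_tasks`) are illustrative, not the paper's code.

```python
import math


def reward_entropy(successes, attempts):
    """Binary entropy of a task's empirical success rate.

    Peaks at 1.0 when the agent succeeds on exactly half of its attempts,
    and drops to 0.0 for tasks it always solves or always fails.
    """
    if attempts == 0:
        return 0.0
    p = successes / attempts
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def pick_seed_tasks(task_stats, k=8):
    """Pick the k highest-entropy tasks as seeds for harder variations.

    `task_stats` maps a task id to (successes, attempts) under the
    current policy.
    """
    ranked = sorted(
        task_stats.items(),
        key=lambda kv: reward_entropy(*kv[1]),
        reverse=True,
    )
    return [task_id for task_id, _ in ranked[:k]]
```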
Experiments: DreamGym in Action
The researchers evaluated DreamGym across three domains:
- WebShop: e-commerce retrieval and purchase tasks.
- ALFWorld: text-based embodied control.
- WebArena: realistic web interaction involving APIs, forums, and multi-tab browsing.

Table 1. DreamGym consistently matches or outperforms traditional RL while using vastly fewer real interactions.
Key Findings
1. RL for Non-RL-Ready Environments
Traditional RL collapses in complex environments like WebArena. Using purely synthetic rollouts, DreamGym attains a success-rate improvement of more than 30% across backbone models, showing that RL can thrive even where direct interaction is infeasible.
2. Matching Real Performance—Zero Real Data
In RL-friendly environments such as WebShop and ALFWorld, agents trained entirely on DreamGym’s synthetic data perform nearly identically to their real-world counterparts trained on 80,000 interactions. This demonstrates exceptional sample efficiency.
3. Sim-to-Real Transfer (S2R)
The hybrid setup DreamGym-S2R fine-tunes agents trained on synthetic rollouts using only 5,000 real interactions. The result: substantial gains beyond both pure-synthetic and pure-real baselines.

Figure 3. DreamGym accelerates training, enhances generalization, and stabilizes learning dynamics.
- Efficiency: Dramatic reductions in training time and cost—up to 80% savings.
- Generalization: Agents trained on one environment transfer effectively to others.
- Stability: DreamGym’s learning curves are smoother and steeper, signaling reliable progress.
Why It Works: Insights and Ablations
To pinpoint what drives DreamGym’s success, the team systematically ablated its components.
Curriculum Matters
Without adaptive task generation, the agent plateaued early, confirming that continuous challenge is vital.

Table 2. Removing curriculum, memory replay, or reasoning undermines success rates in WebShop and WebArena.
The Anatomy of High-Quality Experiences
Evaluation using GPT-4o revealed the importance of historical context and explicit reasoning. Models lacking these produced less consistent, less informative states and hallucinated more often.

Figure 4. Comprehensive reasoning and history yield superior experience fidelity and variety.
As seen in the example below, the reasoning model traces actions coherently through multiple states—capturing logical, causally linked transitions.

Figure 6. DreamGym’s reasoning model produces contextually consistent multi-turn trajectories.
Data Efficiency and Model Size
DreamGym’s experience model is strikingly data-efficient. Strong performance emerges with just 10k–20k offline samples, and even smaller models (3B parameters) are effective.

Figure 5. DreamGym thrives even under limited data or compute budgets.
Conclusion: Reimagining How Agents Learn
DreamGym redefines the scaling frontier for reinforcement learning in language-based agents. Instead of battling the complexities of real-world environments, it synthesizes learning-rich experiences—diverse, reasoning-driven, and curriculum-aligned.
Key insights:
- Synthesis over simulation: The goal isn’t realism—it’s meaningful, causally sound experiences.
- Reasoning is essential: Step-wise logical reasoning anchors transitions and rewards.
- Curriculum keeps momentum: Adapting task difficulty dynamically enables sustained growth.
- Sim-to-real efficiency: Synthetic pretraining provides a powerful warm start for real-world fine-tuning.
Ultimately, DreamGym reveals that the primary bottleneck in RL for LLM agents lies not in algorithmic sophistication but in accessible, structured experience data. By viewing environments as generators of coherent reasoning rather than mere simulators, DreamGym charts a scalable path toward intelligent systems that learn, imagine, and act seamlessly across digital domains.