Introduction: The AI Agent’s Dilemma

Imagine an AI agent that can book your travel on websites, manage your schedule by interacting with digital tools, or even navigate complex e-commerce platforms for you. This is the promise of autonomous agents powered by large language models (LLMs). They can reason, plan, and act with impressive versatility. Yet, to truly excel in real-world tasks, they must learn from experience—just as humans do.

The natural choice for this kind of interactive learning is reinforcement learning (RL), where agents improve through trial and error, receiving rewards for good actions and penalties for poor ones. This approach is what trained AI systems to master games like Go. However, applying RL to real-world LLM agents has proven difficult and expensive.

In traditional setups, shown in Figure 1a, the agent interacts directly with a real environment. This process is slow, costly, and fragile.

Figure 1a shows the traditional agent learning paradigm. An agent interacts directly with a real environment, labeled “not scalable” and providing “sparse & unstable” rewards, while tasks are “scarce & costly.”

Figure 1a. Traditional agent learning suffers from limited tasks, sparse rewards, and costly real-world interactions.

Here are the major obstacles:

  • Costly interactions: Each real-world action—like loading a webpage or clicking a button—consumes time and compute. Collecting the millions of samples needed for RL becomes impractical.
  • Task scarcity: Diverse, well-specified training tasks are rare, and designing and validating new ones demands expensive human effort.
  • Unstable feedback: Real-world environments are unpredictable; websites change, APIs fail, and rewards can be delayed or noisy.
  • Infrastructure complexity: RL-ready environments often rely on heavy backend systems such as Docker or virtual machines, hindering scalability.

These constraints have kept RL from transforming LLM agents into truly adaptive decision-makers. So, what if instead of wrestling with the real world, we built a dream world—a synthetic, scalable environment tailored for learning?

This is the central idea behind the paper “Scaling Agent Learning via Experience Synthesis.” The researchers introduce DreamGym, a framework that sidesteps real-world limitations by synthesizing high-quality experiences for agents. As shown in Figure 1b, DreamGym uses a reasoning-driven “Experience Model” to create abundant, adaptable, and inexpensive interaction data, enabling effective RL training.

Figure 1b shows the DreamGym paradigm. The agent interacts with a scalable Experience Model that synthesizes abundant and adaptable experiences from varied, cheap task variations, enabling efficient learning.

Figure 1b. DreamGym replaces costly real interactions with scalable synthetic experiences generated by a unified experience model.

In this article, we’ll unpack how DreamGym works, why it matters, and what it means for the future of intelligent agents.


Background: The Language of Learning

Before exploring DreamGym, let’s revisit the basics of how an RL agent learns.

An agent’s problem can be represented as a Markov Decision Process (MDP), defined by a tuple \((\mathcal{S}, \mathcal{A}, T, R, \gamma, \rho_0)\).

  • States (\(\mathcal{S}\)) describe the environment at a given moment—for example, a webpage’s text and clickable elements.
  • Actions (\(\mathcal{A}\)) represent operations the agent can take, such as clicking a button or typing a query.
  • Transition function (\(T\)) determines how the environment changes when an action is taken.
  • Rewards (\(R\)) measure success or progress.
  • Discount factor (\(\gamma\)) and initial state distribution (\(\rho_0\)) complete the tuple: \(\gamma\) weights future rewards against immediate ones, and \(\rho_0\) specifies where episodes begin.
  • Policy (\(\pi_{\theta}\)) is the agent’s internal decision rule; given a state, it outputs a probability distribution over possible actions.

The objective of RL is to optimize the policy parameters \(\theta\) so the agent maximizes its expected cumulative rewards. Policy gradient methods like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) achieve this by nudging the policy toward actions that yield higher rewards.

\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}(s_t, a_t)\right]
\]

Policy gradient principle—adjusting the agent’s policy toward beneficial actions.

Here, \(\hat{A}(s_t, a_t)\) represents the advantage function, estimating how much better action \(a_t\) was at state \(s_t\) than average.

  1. Proximal Policy Optimization (PPO) stabilizes training by bounding policy changes between updates and uses a learned value function \(V(s)\) to compute generalized advantages:

\[
\hat{A}_t^{\mathrm{GAE}} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\]

PPO: limits aggressive updates and improves stability with generalized advantage estimation.

  2. Group Relative Policy Optimization (GRPO) simplifies this process by discarding the value function. It normalizes rewards across a group of responses for the same task, producing relative advantages that make RL more scalable for LLMs (both computations are sketched in code below):

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r_j\}_{j=1}^{G}\big)}
\]

GRPO: compares rewards across multiple attempts, removing dependence on value estimation.
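
To make the two advantage computations concrete, here is a minimal NumPy sketch, assuming scalar per-step rewards and, for PPO, a value estimate for every state in the trajectory; it is an illustration, not the paper's implementation.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """PPO-style Generalized Advantage Estimation for one trajectory.
    `values` holds V(s_0)..V(s_T), i.e. one extra entry for the final state."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae                          # discounted sum of TD errors
        advantages[t] = gae
    return advantages

def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward against the
    group of rollouts sampled for the same task (no value function needed)."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A 3-step trajectory with a single terminal reward, and 4 attempts at one task:
print(gae_advantages([0.0, 0.0, 1.0], [0.1, 0.2, 0.5, 0.0]))
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # roughly [ 1, -1,  1, -1]
```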

Both methods demand vast streams of interaction data—exactly where real-world RL struggles. DreamGym’s synthetic experience approach solves this bottleneck.


The Core Method: Building a World for Learning

DreamGym is a unified ecosystem with three interlocking components:

  1. Reasoning Experience Model – generates synthetic environment feedback.
  2. Experience Replay Buffer – blends offline and online trajectories to stabilize training.
  3. Curriculum Task Generator – produces progressively harder tasks for sustained learning.

An overview of the DreamGym framework, showing how the agent interacts with a reasoning experience model, supported by a replay buffer and a curriculum task generator.

Figure 2. DreamGym architecture integrates reasoning-based experience synthesis, replay memory, and adaptive curriculum generation.
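
Before looking at each component in detail, the sketch below shows one way they could fit together in a single training iteration. Every class and method name here (agent.act, experience_model.step, replay_buffer.retrieve, task_generator.propose, and so on) is a hypothetical stand-in, not an API from the paper.

```python
def dreamgym_iteration(agent, experience_model, replay_buffer, task_generator,
                       tasks_per_iter=32, rollouts_per_task=8):
    """One illustrative DreamGym-style training iteration (hypothetical API)."""
    # 1. Curriculum: propose tasks near the agent's current capability frontier.
    tasks = task_generator.propose(num_tasks=tasks_per_iter)

    trajectories = []
    for task in tasks:
        for _ in range(rollouts_per_task):
            state = experience_model.reset(task)
            steps, done = [], False
            while not done:
                action = agent.act(state, task)
                # 2. Experience model: synthesize the next state and reward,
                #    conditioned on the history and demonstrations from the buffer.
                demos = replay_buffer.retrieve(task, state)
                next_state, reward, done = experience_model.step(
                    state, action, history=steps, demonstrations=demos, task=task)
                steps.append((state, action, reward))
                state = next_state
            trajectories.append((task, steps))

    # 3. Replay buffer: keep recent synthetic rollouts alongside the seed real data.
    replay_buffer.add(trajectories)

    # 4. Policy update with a group-relative objective (e.g., GRPO) per task.
    agent.update(trajectories)

    # 5. Curriculum update: favor tasks whose rollout rewards vary the most.
    task_generator.update(trajectories)
```

Note that nothing in this loop touches a real environment; every transition the agent trains on is produced by the experience model.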

1. The Reasoning Experience Model: Dynamics by Language

At the heart of DreamGym lies the reasoning experience model (\(\mathcal{M}_{exp}\))—an LLM trained to emulate environment responses and rewards. When the agent issues an action, the experience model performs “step-wise reasoning” to predict the next state and corresponding reward, simulating how the real environment would evolve.

Crucially, agent training doesn’t need perfect realism. It requires diverse, informative, causally grounded transitions that support robust learning. By operating purely in a textual, abstract state space—rather than raw HTML or pixels—the experience model remains efficient.

Predicting Transitions

Beyond the immediate state-action pair, the model leverages three contexts to enhance reliability:

  1. Interaction history: keeps multi-step consistency.
  2. Task instruction: clarifies goals.
  3. Past experiences from replay: retrieved demonstrations reduce hallucinations.

These inputs feed into a Chain-of-Thought (CoT) reasoning step, guiding prediction of the next state and reward:

\[
\big(c_t,\ \hat{s}_{t+1},\ \hat{r}_t\big) \sim \mathcal{M}_{exp}\big(\cdot \mid s_t,\ a_t,\ h_t,\ \mathcal{D}_{\mathrm{retr}},\ \tau\big)
\]

where \(c_t\) is the model’s reasoning trace, \(h_t\) the interaction history, \(\mathcal{D}_{\mathrm{retr}}\) the retrieved demonstrations from the replay buffer, and \(\tau\) the task instruction.

Explicit reasoning enables accurate causal transitions and well-grounded rewards.

For example, if the agent clicks a nonexistent button, the reasoning trace explicitly concludes “no state change; reward = 0,” preventing false signals.
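
As a rough illustration, the sketch below assembles the experience model's input from the current state, the action, the interaction history, retrieved demonstrations, and the task instruction, then parses a reasoned next state and reward. The prompt wording, the JSON output schema, and the llm.generate call are assumptions made for this example, not the paper's interface.

```python
import json

def synthesize_step(llm, state, action, history, demonstrations, task):
    """Query the experience model for a reasoned next state and reward.
    (Prompt layout and output schema are illustrative, not the paper's.)"""
    prompt = (
        f"Task instruction:\n{task}\n\n"
        f"Interaction history:\n{json.dumps(history, indent=2)}\n\n"
        f"Similar past experiences:\n{json.dumps(demonstrations, indent=2)}\n\n"
        f"Current state:\n{state}\n\n"
        f"Agent action:\n{action}\n\n"
        "Reason step by step about what this action would cause, then answer "
        "with JSON containing the keys 'reasoning', 'next_state', and 'reward'."
    )
    output = json.loads(llm.generate(prompt))
    return output["reasoning"], output["next_state"], float(output["reward"])

# For an action on a nonexistent button, the desired output is reasoning such as
# "no matching element; the page is unchanged", next_state == state, and reward 0.0.
```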

Training the Experience Model

Training requires only modest real-world data. Offline trajectory datasets (like WebArena or ALFWorld logs) are annotated using a teacher LLM that explains why each transition occurs. The model is then fine-tuned on both reasoning generation and next-state prediction via supervised learning.

\[
\mathcal{L}_{\mathrm{SFT}}(\phi) = -\,\mathbb{E}_{(s_t,\,a_t,\,c_t,\,s_{t+1})\,\sim\,\mathcal{D}_{\mathrm{off}}}\Big[\log p_{\phi}\big(c_t \mid s_t, a_t\big) + \log p_{\phi}\big(s_{t+1} \mid c_t, s_t, a_t\big)\Big]
\]

The model learns to reason and transition coherently, distilling real environmental logic.

This creates a powerful “virtual environment” capable of interacting with agents online for realistic, grounded training.
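
A minimal sketch of how such supervised data might be assembled is given below, assuming offline logs of states, actions, next states, and rewards plus a teacher model that writes a rationale for each transition. The field names and the teacher_llm.explain call are hypothetical.

```python
def build_sft_examples(offline_trajectories, teacher_llm):
    """Turn logged steps into (prompt, target) pairs whose target is a
    teacher-written rationale followed by the observed outcome.
    (Illustrative schema; the paper's exact format may differ.)"""
    examples = []
    for traj in offline_trajectories:
        history = []
        for step in traj["steps"]:
            rationale = teacher_llm.explain(
                state=step["state"], action=step["action"],
                next_state=step["next_state"], reward=step["reward"])
            prompt = (f"Task: {traj['task']}\nHistory: {history}\n"
                      f"State: {step['state']}\nAction: {step['action']}\n"
                      "Reason about the outcome, then give the next state and reward.")
            target = (f"Reasoning: {rationale}\n"
                      f"Next state: {step['next_state']}\n"
                      f"Reward: {step['reward']}")
            examples.append({"prompt": prompt, "target": target})
            history.append((step["state"], step["action"]))
    return examples
```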


2. The Experience Replay Buffer: Keeping Synthetic Worlds Grounded

To ensure realism and prevent drift, DreamGym uses an experience replay buffer—a dynamic memory bank that stores trajectories. Initially seeded with offline real-environment data, the buffer is continuously enriched with synthetic interactions. As training proceeds, both the agent policy and experience model evolve together, ensuring recency, relevance, and stability.

This “co-evolution” mechanism mirrors human learning, where new experiences reinforce old ones while enabling adaptation.
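
The sketch below shows one way such a buffer could be organized: the offline real seed data is kept permanently, while a bounded window of recent synthetic rollouts ages out over time. Names are illustrative, and the uniform sampler stands in for whatever retrieval scheme selects demonstrations relevant to the current task.

```python
import random
from collections import deque

class ExperienceReplayBuffer:
    """Mixed real/synthetic trajectory store (illustrative sketch)."""

    def __init__(self, offline_trajectories, capacity=50_000):
        self.real = list(offline_trajectories)    # offline seed data, kept as an anchor
        self.synthetic = deque(maxlen=capacity)   # recent synthetic rollouts age out

    def add(self, trajectories):
        self.synthetic.extend(trajectories)

    def retrieve(self, task=None, state=None, k=4, real_fraction=0.5):
        """Return a small mix of real and synthetic trajectories to serve as
        demonstrations. A real system would pick trajectories relevant to the
        given task and state; uniform sampling keeps this sketch short."""
        n_real = min(int(k * real_fraction), len(self.real))
        n_syn = min(k - n_real, len(self.synthetic))
        return (random.sample(self.real, n_real)
                + random.sample(list(self.synthetic), n_syn))
```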


3. The Curriculum Task Generator: Always a New Challenge

Learning stagnates if tasks are either too easy or impossible. DreamGym solves this through an automated curriculum-based task generator.

\[
\tau_{\mathrm{new}} \sim \mathcal{M}_{exp}\big(\cdot \mid \{\tau_{\mathrm{seed}}\}\big)
\]

where \(\{\tau_{\mathrm{seed}}\}\) is a set of challenging seed tasks.

New tasks are synthesized from challenging seeds via shared parameters with the experience model.

Using a reward entropy heuristic, DreamGym identifies which tasks offer the highest information gain. Tasks where the agent inconsistently succeeds—showing both failures and successes—are deemed optimal learning examples:

\[
\mathrm{value}(\tau) = \operatorname{Var}\big(r^{(1)}, \dots, r^{(K)}\big)
\]

where \(r^{(1)}, \dots, r^{(K)}\) are the rewards obtained from \(K\) rollouts of the current policy on task \(\tau\).

Tasks with balanced success/failure rates drive maximal learning progress.

High-entropy tasks are expanded into progressively complex variations, automatically building a rich curriculum. As the agent's performance on a task stabilizes, the generator moves on to harder variations, sustaining growth without manual effort.
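
A compact sketch of the selection step, assuming each candidate task has been rolled out \(K\) times under the current policy, might look like this; the function names and scoring are illustrative.

```python
import numpy as np

def reward_variance_scores(rollout_rewards_by_task):
    """Score each task by the variance of its rollout rewards: tasks the agent
    sometimes solves and sometimes fails score highest."""
    return {task: float(np.var(rewards))
            for task, rewards in rollout_rewards_by_task.items()}

def pick_seed_tasks(rollout_rewards_by_task, top_k=8):
    """Select the highest-variance tasks as seeds for new, harder variations."""
    scores = reward_variance_scores(rollout_rewards_by_task)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# With binary rewards, variance peaks at a 50% success rate:
#   np.var([1, 1, 0, 0]) == 0.25   while   np.var([1, 1, 1, 1]) == 0.0
```

Tasks the agent always solves or always fails score zero and are deprioritized, which matches the idea that the most informative tasks sit at the edge of the agent's current ability.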


Experiments: DreamGym in Action

The researchers evaluated DreamGym across three domains:

  • WebShop: e-commerce retrieval and purchase tasks.
  • ALFWorld: text-based embodied control.
  • WebArena: realistic web interaction involving APIs, forums, and multi-tab browsing.

Table 1 shows DreamGym compared to offline, online RL, and sim-to-real approaches across environments and model sizes. DreamGym and DreamGym-S2R achieve strong results with far less real data.

Table 1. DreamGym consistently matches or outperforms traditional RL while using vastly fewer real interactions.

Key Findings

1. RL for Non-RL-Ready Environments

Traditional RL collapses in complex environments like WebArena. DreamGym, using purely synthetic rollouts, attains >30% success improvement across backbones. It proves RL can thrive even where direct interaction is infeasible.

2. Matching Real Performance—Zero Real Data

In RL-friendly environments such as WebShop and ALFWorld, agents trained entirely on DreamGym’s synthetic data perform nearly identically to their real-world counterparts trained on 80,000 interactions. This demonstrates exceptional sample efficiency.

3. Sim-to-Real Transfer (S2R)

The hybrid setup DreamGym-S2R fine-tunes agents trained on synthetic rollouts using only 5,000 real interactions. The result: substantial gains beyond both pure-synthetic and pure-real baselines.

Figure 3 shows performance comparisons: Left—DreamGym reduces training time on WebArena. Middle—strong cross-domain generalization. Right—faster, smoother learning curves on WebShop.

Figure 3. DreamGym accelerates training, enhances generalization, and stabilizes learning dynamics.

  • Efficiency: Dramatic reductions in training time and cost—up to 80% savings.
  • Generalization: Agents trained on one environment transfer effectively to others.
  • Stability: DreamGym’s learning curves are smoother and steeper, signaling reliable progress.

Why It Works: Insights and Ablations

To pinpoint what drives DreamGym’s success, the team systematically ablated its components.

Curriculum Matters

Without adaptive task generation, the agent plateaued early, confirming that continuous challenge is vital.

Table 2 shows performance drops when removing key components: task generation, replay, or reasoning each reduce success rates significantly.

Table 2. Removing curriculum, memory replay, or reasoning undermines success rates in WebShop and WebArena.

The Anatomy of High-Quality Experiences

Evaluation using GPT-4o revealed the importance of historical context and explicit reasoning. Models lacking these produced less consistent, less informative states and hallucinated more often.

Figure 4 shows bar charts comparing variants. Full DreamGym achieves high scores for consistency, diversity, informativeness, and low hallucination.

Figure 4. Comprehensive reasoning and history yield superior experience fidelity and variety.

As seen in the example below, the reasoning model traces actions coherently through multiple states—capturing logical, causally linked transitions.

A case study from WebArena showing DreamGym generating a coherent multi-step trajectory. The model reasons about clicks from a change log summary to commit details.

Figure 6. DreamGym’s reasoning model produces contextually consistent multi-turn trajectories.

Data Efficiency and Model Size

DreamGym’s experience model is strikingly data-efficient. Strong performance emerges with just 10k–20k offline samples, and even smaller models (3B parameters) are effective.

Figure 5 shows that DreamGym scales with data and backbone size. Success rates rise rapidly with few offline samples; smaller models remain viable.

Figure 5. DreamGym thrives even under limited data or compute budgets.


Conclusion: Reimagining How Agents Learn

DreamGym redefines the scaling frontier for reinforcement learning in language-based agents. Instead of battling the complexities of real-world environments, it synthesizes learning-rich experiences—diverse, reasoning-driven, and curriculum-aligned.

Key insights:

  1. Synthesis over simulation: The goal isn’t realism—it’s meaningful, causally sound experiences.
  2. Reasoning is essential: Step-wise logical reasoning anchors transitions and rewards.
  3. Curriculum keeps momentum: Adapting task difficulty dynamically enables sustained growth.
  4. Sim-to-real efficiency: Synthetic pretraining provides a powerful warm start for real-world fine-tuning.

Ultimately, DreamGym reveals that the primary bottleneck in RL for LLM agents lies not in algorithmic sophistication but in accessible, structured experience data. By viewing environments as generators of coherent reasoning rather than mere simulators, DreamGym charts a scalable path toward intelligent systems that learn, imagine, and act seamlessly across digital domains.