Large Language Models (LLMs) are rapidly evolving from impressive text generators into autonomous agents capable of tackling complex, real-world tasks. Imagine an AI that can not only answer your questions but also navigate websites to book a flight, conduct multi-step scientific research, or even play a digital game. This is the frontier of AI research: creating agents that can reason, plan, and act over long horizons.

But how do we teach an LLM to do this? Just like humans, the most effective way for an agent to learn is through practice—by interacting with an environment, trying things, making mistakes, and learning from the outcomes. This is the core idea behind Reinforcement Learning (RL).

However, training LLM agents with RL is notoriously difficult. Many existing approaches are limited to simple, single-turn tasks or rely on pre-existing expert demonstrations, which are expensive and hard to scale. The community has been missing a unified, flexible, and effective framework to train agents from scratch in diverse and realistic settings.

This is where a new research paper, AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning, comes in. The researchers introduce AgentGym-RL, a powerful open-source framework designed specifically for this challenge. They also propose a novel training method called ScalingInter-RL that dramatically improves training stability and performance.

The results are striking. As shown in Figure 1, their 7-billion-parameter model, trained with this new framework, not only surpasses other open-source models but also matches or even beats massive, closed-source giants like GPT-4o and Gemini-2.5-Pro across a diverse set of tasks.

Figure 1 Left: Average success rate of proprietary models, open-source models, and the authors' RL-trained models across BabyAI, TextCraft, SciWorld, WebArena, and Deep Search. Right: Overall accuracy versus model scale, showing that the 7B RL-trained model ("Ours-7B") rivals or outperforms much larger proprietary models.


In this post, we’ll dive deep into this work—breaking down how AgentGym-RL operates, why ScalingInter-RL is so effective, and what these advancements mean for the next generation of AI agents.

A Quick Primer on Reinforcement Learning for Agents

Before we get into the nuts and bolts of the new framework, let’s quickly recap the basics of reinforcement learning in the context of LLM agents.

An agent’s task can be modeled as a Partially Observable Markov Decision Process (POMDP). Despite the name, the concept boils down to a few core components:

  • State (\(s\)) — the current situation or configuration of the environment.
  • Action (\(a\)) — a choice the agent makes, such as clicking a button or issuing a command.
  • Observation (\(o\)) — information the agent receives from the environment after taking an action (e.g., webpage content, game status).
  • Policy (\(\pi_{\theta}\)) — the agent’s “brain,” parameterized by model weights \(\theta\), which maps what the agent has observed so far to its next action.
  • Reward (\(r\)) — a feedback signal indicating success (1) or failure (0) at the end of a trajectory.

The RL goal is to adjust \(\theta\) to maximize the expected cumulative reward. The agent interacts with its environment, generating trajectories \(\tau\) (state, action, observation sequences) and learning from them.
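
To make this concrete, here is a minimal sketch of one agent-environment episode. The `env` and `agent` objects and their methods are hypothetical stand-ins for illustration, not AgentGym-RL's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """One episode: the (observation, action) pairs plus the terminal reward r(tau)."""
    steps: list = field(default_factory=list)   # [(observation, action), ...]
    reward: float = 0.0

def rollout(env, agent, max_turns):
    """Run a single episode: the agent acts, the environment responds, until done."""
    traj = Trajectory()
    observation = env.reset()                    # initial observation o_0
    for _ in range(max_turns):
        action = agent.act(observation)          # pi_theta picks the next action
        observation, reward, done = env.step(action)
        traj.steps.append((observation, action))
        if done:
            traj.reward = reward                 # terminal reward: 1.0 success, 0.0 failure
            break
    return traj
```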

Policy gradient methods are a popular way to achieve this. They work by directly estimating how changes in \(\theta\) affect the expected reward:

\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ r(\tau) \right] \]

The gradient of \(J\) tells us the direction to update \(\theta\):

\[ \nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ r(\tau) \sum_{k=0}^{K} \nabla_{\theta} \log \pi_{\theta}(a_k | s_k) \right] \]

In simple terms: if a trajectory leads to success, increase the likelihood of the actions that produced it. Since the objective is to maximize \(J\), parameters are updated by gradient ascent with learning rate \(\alpha\):

\[ \theta_{\text{new}} = \theta_{\text{old}} + \alpha \nabla_{\theta} J(\theta) \]
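
As a concrete (if simplified) illustration, here is what a REINFORCE-style update looks like in PyTorch. This is a sketch assuming the action log-probabilities were recorded during rollouts; it is not the paper's training code.

```python
import torch

def reinforce_update(optimizer, batch):
    """One policy-gradient step.

    `batch` is a list of (log_probs, reward) pairs: `log_probs` is a 1-D tensor of
    log pi_theta(a_k | s_k) values collected along one trajectory, `reward` is r(tau).
    """
    losses = []
    for log_probs, reward in batch:
        # Maximizing E[r(tau) * sum_k log pi(a_k | s_k)] == minimizing its negative.
        losses.append(-reward * log_probs.sum())
    loss = torch.stack(losses).mean()

    optimizer.zero_grad()
    loss.backward()    # autograd computes the policy gradient
    optimizer.step()   # gradient ascent on J via descent on -J
```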

This “act–reward–update” loop is the core of RL. Let’s see how AgentGym-RL builds on it.


The AgentGym-RL Framework: A Playground for Intelligent Agents

AgentGym-RL is a unified, modular, and extensible platform for training LLM agents in realistic, multi-turn scenarios.

Architecture Overview

Figure 2 Overview of AgentGym-RL's decoupled architecture: the Environment module (web, search, game, embodied, and science scenarios), the Agent module (reasoning and actions), and the Training module (the RL pipeline).

It comprises three decoupled core components:

  1. Environment Module — provides diverse, realistic scenarios via a standardized server-client architecture. Easy to plug in new environments without altering training logic.
  2. Agent Module — encapsulates the LLM’s reasoning and decision-making. Processes observations and outputs actions.
  3. Training Module — collects trajectories, computes policy updates, and optimizes agents using RL algorithms (a minimal sketch of how these three modules fit together follows).
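
The decoupling can be pictured as three small interfaces. The names and signatures below are illustrative assumptions, not AgentGym-RL's actual API.

```python
from typing import List, Protocol, Tuple

class Environment(Protocol):
    """Environment module: a scenario served behind a standardized client (e.g. HTTP)."""
    def reset(self, task_id: str) -> str: ...                    # initial observation
    def step(self, action: str) -> Tuple[str, float, bool]: ...  # (observation, reward, done)

class Agent(Protocol):
    """Agent module: wraps the LLM's reasoning and decision-making."""
    def act(self, observation: str) -> str: ...                  # next action as text

class Trainer(Protocol):
    """Training module: turns batches of trajectories into policy updates."""
    def update(self, trajectories: List) -> None: ...
```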

Figure 3 Parallel rollout workflow: multiple environment instances per agent, concurrent trajectory collection, and policy updates with RL algorithms such as PPO or GRPO.

Parallelism is key: multiple environment clients run independently. Agents interact, producing trajectories, which are batched and fed into training updates. This loop repeats, steadily improving the agent.
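
Here is a simplified sketch of that collect-then-update loop, reusing the hypothetical `rollout` helper from the earlier sketch. The threading scheme is an assumption for illustration, not the framework's actual backend.

```python
from concurrent.futures import ThreadPoolExecutor

def collect_batch(envs, agent, max_turns):
    """Roll out the current policy in every environment client concurrently."""
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        futures = [pool.submit(rollout, env, agent, max_turns) for env in envs]
        return [future.result() for future in futures]

def train(envs, agent, trainer, max_turns, num_iterations):
    """Outer loop: collect a batch of trajectories, then update the policy."""
    for _ in range(num_iterations):
        trajectories = collect_batch(envs, agent, max_turns)
        trainer.update(trajectories)   # e.g. a PPO- or GRPO-style update
```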


Key Features and Capabilities

  • Diverse Scenarios:

    • Web Navigation (WebArena) — dynamic sites like e-commerce and forums.
    • Deep Search — multi-hop question answering via search engines or interpreters.
    • Digital Games (TextCraft) — text-based crafting requiring multi-step planning.
    • Embodied Tasks (BabyAI) — grid-world navigation and manipulation.
    • Scientific Tasks (SciWorld) — simulated scientific experimentation.
  • Broad Algorithm Support: PPO, GRPO, REINFORCE++, and RLOO, plus SFT and offline preference optimization (a GRPO-style advantage sketch follows this list).

  • Engineered for Scale & Reliability: Optimized backends for parallelism, bug and memory-leak fixes for long-horizon stability.

  • Open-Source & Usable: The full framework is open-source, with reproducible pipelines, standardized evaluation, and a visual UI for trajectory inspection.
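
Since GRPO comes up repeatedly here, a rough illustration of its core idea, group-relative advantages, is shown below. This is a generic sketch of the technique, not the framework's implementation.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: compare each rollout's reward against the group of
    rollouts sampled for the same task, so no learned value function is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:                 # every rollout got the same reward: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: four rollouts of the same task, two succeeded and two failed.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))   # [1.0, -1.0, 1.0, -1.0]
```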

Figure 4 The AgentGym Hub interface: an environment-selection menu and interactive trajectory visualization for debugging and analysis.


ScalingInter-RL: The Secret to Stable, Effective Training

Even with a robust framework, RL training faces the exploration vs. exploitation dilemma:

  • Too much exploration early (long horizons) → noisy signals, wasted actions, training collapse.
  • Too little exploration (short horizons) → early mastery of basics but no capacity for complex strategies.

ScalingInter-RL solves this via progressive interaction scaling.

Figure 5 ScalingInter-RL's phased horizon scaling: short horizons early so the agent masters the basics through exploitation, then progressively longer horizons to encourage exploration and higher-order skills on complex tasks.

Phases:

  1. Early Phase (Short Horizon): Cap the maximum number of interaction turns \(h_t\) per task. This forces exploitation: the agent first learns to solve simple tasks reliably.
  2. Progressive Scaling: Periodically raise the cap via \(h_{t+1} = h_t + \delta_h\). Longer horizons leave room for more complex strategies, planning, and reflection, keeping task difficulty aligned with the agent's evolving capability (a minimal scheduler sketch follows).
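
A minimal sketch of such a schedule is below. The initial horizon, increment \(\delta_h\), stage length, and cap are illustrative values, not the paper's settings.

```python
def scaled_horizon(step, h_initial=5, delta_h=5, steps_per_stage=100, h_max=30):
    """ScalingInter-style schedule: raise the turn limit by delta_h every
    `steps_per_stage` training steps, capped at h_max. Constants are illustrative."""
    stage = step // steps_per_stage
    return min(h_initial + stage * delta_h, h_max)

# In the training loop, each rollout is then capped at the current horizon:
#   max_turns = scaled_horizon(current_step)
#   trajectories = collect_batch(envs, agent, max_turns)
```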

Experimental Results: Does It Work?

Extensive experiments across five scenarios confirm that both the framework and ScalingInter-RL deliver substantial, stable gains.

RL Elevates Open-Source Models to Elite Status

Figure 1's headline result: a well-trained 7B open-source model rivals, and on several tasks surpasses, much larger proprietary systems, suggesting that compute spent on post-training RL can beat sheer parameter scaling.

Figure 6 Training reward curves over training steps for WebArena, Deep Search, TextCraft, BabyAI, and SciWorld, all rising steadily: stable, sustained improvements across diverse environments.


ScalingInter-RL Outperforms Fixed Horizons

Figure 7 Deep Search training dynamics: a fixed long horizon collapses, a fixed short horizon plateaus, while ScalingInter-RL climbs steadily to the best long-term performance.

  • Max Rounds = 10: Early boost, then collapse from noisy exploration.
  • Max Rounds = 5: Stable but hits ceiling.
  • ScalingInter-RL: Slower start, then surpasses both with stable climb.

Highlights Across Environments

  • WebArena: ScalingInter-7B hit 26% accuracy—beating GPT-4o (16%) and competitive with DeepSeek-R1 (28%).
  • Deep Search: Score 38.25%, topping GPT-4o (26.75%) and Gemini-2.5-Pro (36.50%).
  • TextCraft: 91% overall, among best in class; rare success at hardest Depth 4.
  • BabyAI: SOTA 96.67% overall, beating OpenAI o3.
  • SciWorld: Jump from 1.5% (base) to 57% (ScalingInter-7B), a new state-of-the-art.

Case Study: Smarter Web Navigation

In a WebArena task to subscribe to a trending “pittsburgh” forum post:

  • Base Model: Gets stuck in a click loop on non-interactive text—fails task.
  • RL-Trained Model: Encounters “Page not found”, uses go_back to recover, searches “pittsburgh”, finds trending post, clicks “Subscribe” successfully.

This illustrates advanced error recovery, adaptive planning, and purposeful navigation—hallmarks of agentic intelligence unlocked by RL.


Conclusion and Key Takeaways

The AgentGym-RL paper delivers both a high-impact framework and a novel training methodology that expand what open-source models can achieve as agents.

Key points:

  1. AgentGym-RL: Powerful, open-source, modular RL platform for diverse environments and scalable training.
  2. ScalingInter-RL: Elegant, effective solution to exploration–exploitation trade-off; improves stability and final performance.
  3. Smarter Training > Bigger Models: Targeted RL can make a 7B model outperform models 10× its size.
  4. Step Toward Autonomous AI: Moving from static LLMs to adaptive agents that learn through interaction.

By open-sourcing their code, the authors enable the community to build on this work. As methods advance, expect agents that generalize better to novel tasks—bringing us closer to truly autonomous AI that can act meaningfully in our world.