Large Language Models (LLMs) are rapidly evolving from impressive text generators into autonomous agents capable of tackling complex, real-world tasks. Imagine an AI that can not only answer your questions but also navigate websites to book a flight, conduct multi-step scientific research, or even play a digital game. This is the frontier of AI research: creating agents that can reason, plan, and act over long horizons.
But how do we teach an LLM to do this? Much like humans, an agent learns most effectively through practice: interacting with an environment, trying things, making mistakes, and learning from the outcomes. This is the core idea behind Reinforcement Learning (RL).
However, training LLM agents with RL is notoriously difficult. Many existing approaches are limited to simple, single-turn tasks or rely on pre-existing expert demonstrations, which are expensive and hard to scale. The community has been missing a unified, flexible, and effective framework to train agents from scratch in diverse and realistic settings.
This is where a new research paper, AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning, comes in. The researchers introduce AgentGym-RL, a powerful open-source framework designed specifically for this challenge. They also propose a novel training method called ScalingInter-RL that dramatically improves training stability and performance.
The results are striking. As shown in Figure 1, their 7-billion-parameter model, trained with this new framework, not only surpasses other open-source models but also matches or even beats massive, closed-source giants like GPT-4o and Gemini-2.5-Pro across a diverse set of tasks.
Figure 1 Left: Performance of proprietary models, open-source models, and our RL models across different agentic tasks. Right: Performance versus model scale, showing that our 7B RL-trained model rivals or outperforms much larger proprietary models.
In this post, we’ll dive deep into this work—breaking down how AgentGym-RL operates, why ScalingInter-RL is so effective, and what these advancements mean for the next generation of AI agents.
A Quick Primer on Reinforcement Learning for Agents
Before we get into the nuts and bolts of the new framework, let’s quickly recap the basics of reinforcement learning in the context of LLM agents.
An agent’s task can be modeled as a Partially Observable Markov Decision Process (POMDP). Despite the name, the concept boils down to a few core components:
- State (\(s\)) — the current situation or configuration of the environment.
- Action (\(a\)) — a choice the agent makes, such as clicking a button or issuing a command.
- Observation (\(o\)) — information the agent receives from the environment after taking an action (e.g., webpage content, game status).
- Policy (\(\pi_{\theta}\)) — the agent’s “brain,” parameterized by model weights \(\theta\), which maps states to actions.
- Reward (\(r\)) — a feedback signal indicating success (1) or failure (0) at the end of a trajectory.
The RL goal is to adjust \(\theta\) to maximize the expected cumulative reward. The agent interacts with its environment, generating trajectories \(\tau\) (state, action, observation sequences) and learning from them.
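To make these pieces concrete, here is a minimal sketch of a multi-turn rollout in Python. The `env` and `agent` objects and their methods (`reset`, `act`, `step`, `outcome_reward`) are hypothetical placeholders for illustration, not the AgentGym-RL API:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """One multi-turn episode: the (observation, action) history plus the final reward."""
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    reward: float = 0.0  # 1.0 on success, 0.0 on failure

def rollout(env, agent, max_turns: int) -> Trajectory:
    """Roll out one episode: the agent acts, the environment responds, repeat."""
    traj = Trajectory()
    obs = env.reset()                   # initial observation o_0
    for _ in range(max_turns):
        action = agent.act(obs)         # sample a ~ pi_theta(. | history)
        traj.observations.append(obs)
        traj.actions.append(action)
        obs, done = env.step(action)    # environment returns the next observation
        if done:
            break
    traj.reward = env.outcome_reward()  # sparse 0/1 reward at the end of the trajectory
    return traj
```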
Policy gradient methods are a popular way to achieve this. They work by directly estimating how changes in \(\theta\) affect the expected reward:
\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ r(\tau) \right] \]

The gradient of \(J\) tells us the direction to update \(\theta\):

\[ \nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ r(\tau) \sum_{k=0}^{K} \nabla_{\theta} \log \pi_{\theta}(a_k \mid s_k) \right] \]

In simple terms: if a trajectory leads to success, increase the likelihood of the successful actions. Since we are maximizing \(J\), parameters are updated with a gradient ascent step:

\[ \theta_{\text{new}} = \theta_{\text{old}} + \alpha \nabla_{\theta} J(\theta) \]

This “act–reward–update” loop is the core of RL.
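To tie the pieces together, here is a minimal REINFORCE-style update in PyTorch, building on the `Trajectory` sketch above. The `policy.log_probs` method and the loss bookkeeping are illustrative assumptions, not the paper’s implementation:

```python
import torch

def reinforce_step(policy, optimizer, trajectories):
    """One policy-gradient update: weight each trajectory's log-probs by its final reward."""
    optimizer.zero_grad()
    loss = 0.0
    for traj in trajectories:
        # log pi_theta(a_k | s_k) for every action the agent took in this trajectory
        logps = policy.log_probs(traj.observations, traj.actions)  # hypothetical method
        # REINFORCE: maximize r(tau) * sum_k log pi_theta(a_k | s_k), i.e. minimize its negative
        loss = loss - traj.reward * logps.sum()
    loss = loss / len(trajectories)
    loss.backward()       # gradient of -J(theta)
    optimizer.step()      # descent on -J is ascent on J
    return loss.item()
```

Let’s see how AgentGym-RL builds on this loop.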
The AgentGym-RL Framework: A Playground for Intelligent Agents
AgentGym-RL is a unified, modular, and extensible platform for training LLM agents in realistic, multi-turn scenarios.
Architecture Overview
Figure 2 Overview of AgentGym-RL’s decoupled architecture: Environment module (varied scenarios), Agent module (reasoning and actions), and Training module (RL pipeline).
It comprises three decoupled core components:
- Environment Module — provides diverse, realistic scenarios via a standardized server-client architecture. Easy to plug in new environments without altering training logic.
- Agent Module — encapsulates the LLM’s reasoning and decision-making. Processes observations and outputs actions.
- Training Module — collects trajectories, computes policy updates, and optimizes agents using RL algorithms.
Figure 3 Parallel rollout workflow: multiple environment instances per agent, concurrent trajectory collection, and RL policy updates with methods like PPO or GRPO.
Parallelism is key: multiple environment clients run independently. Agents interact, producing trajectories, which are batched and fed into training updates. This loop repeats, steadily improving the agent.
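As a rough illustration of this loop (reusing the hypothetical `rollout` helper from the primer above), the rollout-then-update cycle might look like the sketch below. The environment-client objects, `trainer.update`, and the threading scheme are assumptions for exposition, not AgentGym-RL’s actual interfaces:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_batch(env_clients, agent, max_turns):
    """Run one rollout per environment client concurrently and gather the trajectories."""
    with ThreadPoolExecutor(max_workers=len(env_clients)) as pool:
        futures = [pool.submit(rollout, env, agent, max_turns) for env in env_clients]
        return [f.result() for f in futures]

def train_loop(agent, env_clients, trainer, num_iterations, max_turns):
    """Act-reward-update: collect a batch of trajectories, then apply an RL update (e.g., PPO or GRPO)."""
    for it in range(num_iterations):
        batch = collect_batch(env_clients, agent, max_turns)
        trainer.update(agent, batch)  # hypothetical: one PPO/GRPO step on the batch
        mean_reward = sum(t.reward for t in batch) / len(batch)
        print(f"iteration {it}: mean reward = {mean_reward:.3f}")
```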
Key Features and Capabilities
Diverse Scenarios:
- Web Navigation (WebArena) — dynamic sites like e-commerce and forums.
- Deep Search — multi-hop question answering via search engines or interpreters.
- Digital Games (TextCraft) — text-based crafting requiring multi-step planning.
- Embodied Tasks (BabyAI) — grid-world navigation and manipulation.
- Scientific Tasks (SciWorld) — simulated scientific experimentation.
Broad Algorithm Support: PPO, GRPO, REINFORCE++, RLOO; plus SFT and offline preference optimization.
Engineered for Scale & Reliability: Optimized backends for parallelism, bug and memory-leak fixes for long-horizon stability.
Open-Source & Usable: Full framework is open-source with reproducible pipelines, standardized evaluation, and a visualized UI for trajectory inspection.
Figure 4 AgentGym Hub interface: environment selection and interactive trajectory visualization for debugging and analysis.
ScalingInter-RL: The Secret to Stable, Effective Training
Even with a robust framework, RL training faces the exploration vs. exploitation dilemma:
- Too much exploration early (long horizons) → noisy signals, wasted actions, training collapse.
- Too little exploration (short horizons) → early mastery of basics but no capacity for complex strategies.
ScalingInter-RL solves this via progressive interaction scaling.
Figure 5 Phased horizon scaling: initial exploitation with short turns, then incrementally longer horizons to enable exploration and high-order skills.
Phases:
- Early (Short Horizon): Limit max turns (\(h_t\)) per task. Forces exploitation—mastering simple tasks reliably.
- Progressive Scaling: Increase \(h_t\) over training via: \[ h_{t+1} = h_t + \delta_h \] Longer horizons allow more complex strategies, planning, and reflection. Aligns difficulty with evolving capability (see the sketch after this list).
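The schedule itself is simple to express. Below is a minimal sketch of a phase-based horizon schedule; the starting budget, increment \(\delta_h\), phase length, and cap are free hyperparameters chosen here for illustration, not the paper’s exact values:

```python
def horizon_schedule(initial_turns: int, delta_h: int, phase_length: int, max_cap: int):
    """Yield the max-turn budget h_t for each training iteration.

    The budget starts small (forcing exploitation of short, reliable behaviors) and
    grows by delta_h every phase_length iterations, gradually opening room for
    longer-horizon exploration, planning, and reflection.
    """
    h_t = initial_turns
    iteration = 0
    while True:
        yield h_t
        iteration += 1
        if iteration % phase_length == 0:
            h_t = min(h_t + delta_h, max_cap)

# Example: start with 5 turns, add 5 more every 100 iterations, never exceed 30.
# for h_t, it in zip(horizon_schedule(5, 5, 100, 30), range(num_iterations)):
#     batch = collect_batch(env_clients, agent, max_turns=h_t)
#     trainer.update(agent, batch)
```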
Experimental Results: Does It Work?
Extensive experiments across five scenarios confirmed both the framework and ScalingInter-RL deliver substantial, stable gains.
RL Elevates Open-Source Models to Elite Status
Figure 1’s headline result: a well-trained 7B open-source model rivals, and on several tasks surpasses, much larger proprietary systems. This suggests that post-training RL compute can beat sheer parameter scaling.
Figure 6 Training rewards: stable, sustained improvements across diverse environments.
ScalingInter-RL Outperforms Fixed Horizons
Figure 7 Deep Search training dynamics: ScalingInter-RL achieves best long-term performance compared to fixed short/long horizons.
- Max Rounds = 10: Early boost, then collapse from noisy exploration.
- Max Rounds = 5: Stable but hits ceiling.
- ScalingInter-RL: Slower start, then surpasses both with stable climb.
Highlights Across Environments
- WebArena: ScalingInter-7B hit 26% accuracy—beating GPT-4o (16%) and competitive with DeepSeek-R1 (28%).
- Deep Search: Score 38.25%, topping GPT-4o (26.75%) and Gemini-2.5-Pro (36.50%).
- TextCraft: 91% overall, among best in class; rare success at hardest Depth 4.
- BabyAI: SOTA 96.67% overall, beating OpenAI o3.
- SciWorld: Jump from 1.5% (base) to 57% (ScalingInter-7B), a new state-of-the-art.
Case Study: Smarter Web Navigation
In a WebArena task to subscribe to a trending “pittsburgh” forum post:
- Base Model: Gets stuck in a click loop on non-interactive text—fails task.
- RL-Trained Model: Encounters “Page not found”, uses `go_back` to recover, searches “pittsburgh”, finds the trending post, and clicks “Subscribe” successfully.
This illustrates advanced error recovery, adaptive planning, and purposeful navigation—hallmarks of agentic intelligence unlocked by RL.
Conclusion and Key Takeaways
The AgentGym-RL paper delivers both a high-impact framework and a novel training methodology that expand what open-source models can achieve as agents.
Key points:
- AgentGym-RL: Powerful, open-source, modular RL platform for diverse environments and scalable training.
- ScalingInter-RL: Elegant, effective solution to exploration–exploitation trade-off; improves stability and final performance.
- Smarter Training > Bigger Models: Targeted RL can make a 7B model outperform models 10× its size.
- Step Toward Autonomous AI: Moving from static LLMs to adaptive agents that learn through interaction.
By open-sourcing their code, the authors enable the community to build on this work. As methods advance, expect agents that generalize better to novel tasks—bringing us closer to truly autonomous AI that can act meaningfully in our world.