Imagine asking your AI assistant to plan a weekend trip to a new city. You want it to book flights that avoid layovers, find a pet-friendly hotel near the city center, reserve a table at a highly-rated vegan restaurant, and buy tickets for a museum exhibit. This isn’t a simple question-and-answer task; it’s a complex, multi-step process that requires interacting with multiple external services: an airline API, a hotel booking system, a restaurant reservation platform, and a ticket vendor.
For AI to reach this level of practical usefulness, it needs to evolve from a pure language model into a capable agent — an AI that can use tools to take actions and interact with the digital world. The key to training such agents is data. But not just any data. They need agentic data — detailed logs (known as trajectories) of successful interactions with these tools. The problem? This kind of data is incredibly rare and difficult to create.
This scarcity has been a major bottleneck in developing truly intelligent agents. How can an AI learn to book a flight if it has never seen a successful flight booking trajectory? This is the challenge a recent paper, Towards General Agentic Intelligence via Environment Scaling, sets out to solve. The researchers propose a groundbreaking idea: what if the key to more intelligent agents isn’t just bigger models, but vastly more diverse and realistic environments for them to learn in?
In this post, we’ll dive deep into their approach, called AgentScaler. We’ll explore their clever two-part pipeline: first, a system for automatically building a massive universe of simulated tool-use environments, and second, a two-stage learning strategy that turns these simulated experiences into real-world capability.
The Agent’s Dilemma: The Scarcity of Experience
Training an agentic AI is a bit of a chicken-and-egg situation. To learn to use tools, the agent needs to see examples of tool use. But to generate those examples, you need an agent that already knows how to use tools.
Historically, researchers have tried two main workarounds:
- The Reverse Approach: Start with a known function call (e.g., `book_flight(destination="LHR")`) and work backward to invent a user query that might have prompted it (“Book me a flight to London”). This can feel artificial and often doesn’t capture the complexity of a real conversation.
- The Forward Approach: Start with a high-level user goal and have an agent try to solve it through simulated interaction. This is more realistic, but it hits a major wall: creating the simulated environments (the APIs, databases, and services the agent interacts with) is a manual, time-consuming, and unscalable process. You can’t train an agent on thousands of different APIs if you have to hand-code each one.
This is where the AgentScaler paper makes its mark. The authors realized that to break this bottleneck, they needed to automate the creation of the environments themselves.
The AgentScaler Pipeline: A Universe in a Nutshell
The core of the paper is a principled, two-stage pipeline for creating agentic data at scale and using it to train highly capable models:
- Environment Construction and Scaling: Automatically build diverse, fully simulated, and verifiable environments.
- Agent Experience Learning: Use these environments to generate high-quality interaction data and train the agent in a structured, two-phase curriculum.
Part 1: Building Simulated Worlds at Scale
The authors’ key insight is a simple but powerful abstraction: any function call can be thought of as a read or write operation on a database. Checking a flight’s availability is a read operation. Booking that flight is a write operation that changes the state of the database (e.g., one less seat is available).
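To make that abstraction concrete, here is a minimal sketch (not the paper’s code) of two hypothetical tools operating on a shared in-memory state; the flight data, tool names, and signatures are illustrative:

```python
# Minimal sketch of the "function call = database read/write" abstraction.
# Flight data, tool names, and signatures are illustrative, not from the paper.
from copy import deepcopy

# The environment's state: a tiny "database" of flights.
db = {
    "flights": {
        "QF1": {"destination": "LHR", "seats_available": 2},
        "QF2": {"destination": "LHR", "seats_available": 0},
    }
}

def check_availability(destination: str) -> list[str]:
    """Read operation: inspects the state without changing it."""
    return [
        fid for fid, f in db["flights"].items()
        if f["destination"] == destination and f["seats_available"] > 0
    ]

def book_flight(flight_id: str) -> dict:
    """Write operation: mutates the state (one fewer seat)."""
    flight = db["flights"][flight_id]
    if flight["seats_available"] <= 0:
        return {"ok": False, "error": "sold out"}
    flight["seats_available"] -= 1
    return {"ok": True, "flight_id": flight_id}

initial_state = deepcopy(db)         # snapshot before the episode, for later comparison
print(check_availability("LHR"))     # ['QF1']  (read: state unchanged)
print(book_flight("QF1"))            # write: state now differs from initial_state
```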
This principle allows them to systematically turn a massive collection of APIs into fully functioning, simulated environments. The process, shown in Figure 1, has three steps.
Figure 1: The automatic pipeline turns raw API documentation into structured, executable environments used for agentic task construction.
Step 1: Scenario Collection
The process starts with raw materials. The researchers gathered a massive corpus of over 30,000 real-world APIs from various sources. After cleaning and refining this collection — including adding explicit input/output specifications — they had a rich pool of tools to build from.
Step 2: Tool Dependency Graph Modeling
Next, they needed to organize this chaotic collection of tools. They grouped tools into coherent domains of related APIs (e.g., a Travel Planning domain, a Project Management domain).
To do this, they represented tools as nodes in a graph. An edge is drawn between two tools if their parameters are similar enough to suggest they could be used together. For example, a `search_hotels` tool and a `book_room` tool both likely have `location` and `date` parameters, suggesting a strong connection. Similarity is computed over vector representations of the parameters, and an edge between two tools \(i\) and \(j\) exists whenever that similarity exceeds a threshold.
They then applied the Louvain community detection algorithm to find clusters in this graph; each cluster became a domain, resulting in over 1,000 distinct domains.
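As a rough illustration of this step, the sketch below builds such a graph with a crude bag-of-parameter-names similarity and clusters it with Louvain via NetworkX. The tool specs, similarity measure, and threshold are assumptions; the paper uses learned vector representations of parameters:

```python
# Sketch: tool dependency graph via parameter similarity + Louvain clustering.
# Requires networkx >= 3.0. Tool specs, similarity, and threshold are illustrative.
import networkx as nx
from networkx.algorithms.community import louvain_communities

tools = {
    "search_hotels":  {"location", "check_in_date", "check_out_date"},
    "book_room":      {"hotel_id", "location", "check_in_date"},
    "search_flights": {"origin", "destination", "departure_date"},
    "book_flight":    {"flight_id", "destination", "departure_date"},
}

def jaccard(a: set[str], b: set[str]) -> float:
    """Crude stand-in for the paper's embedding-based parameter similarity."""
    return len(a & b) / len(a | b)

THRESHOLD = 0.2
G = nx.Graph()
G.add_nodes_from(tools)
names = list(tools)
for i, t1 in enumerate(names):
    for t2 in names[i + 1:]:
        sim = jaccard(tools[t1], tools[t2])
        if sim > THRESHOLD:                  # edge only if parameters overlap enough
            G.add_edge(t1, t2, weight=sim)

# Louvain community detection: each community becomes one "domain".
domains = louvain_communities(G, weight="weight", seed=0)
print(domains)  # e.g. [{'search_hotels', 'book_room'}, {'search_flights', 'book_flight'}]
```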
Step 3: Programmatic Materialization
Finally, for each domain, the pipeline automatically:
- Generates a Database Schema: By analyzing all tool parameters in the domain, it designs a domain-specific database structure to serve as the environment’s state.
- Generates Executable Code: It writes Python functions for each tool that read from or write to this database, making every simulated environment operational and verifiable.
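The sketch below loosely illustrates what a materialized domain could look like: a SQLite schema derived from the union of a domain’s tool parameters, plus executable tool functions over it. The schema rule, table layout, and tool bodies are simplified assumptions rather than the paper’s generated code:

```python
# Sketch: materializing a domain into a database-backed, executable environment.
# Schema derivation and tool bodies are simplified assumptions.
import sqlite3

hotel_tool_params = {
    "search_hotels": ["location", "check_in_date"],
    "book_room":     ["hotel_id", "location", "check_in_date"],
}

# 1. Derive a domain schema from the union of all tool parameters.
columns = sorted({p for params in hotel_tool_params.values() for p in params})
conn = sqlite3.connect(":memory:")
conn.execute(
    f"CREATE TABLE hotels (id INTEGER PRIMARY KEY, "
    f"{', '.join(c + ' TEXT' for c in columns)}, booked INTEGER DEFAULT 0)"
)
conn.execute(
    "INSERT INTO hotels (hotel_id, location, check_in_date) VALUES ('H1', 'Paris', '2025-07-01')"
)

# 2. Back each tool with executable code over that schema.
def search_hotels(location: str, check_in_date: str) -> list[tuple]:
    # Read: no state change.
    return conn.execute(
        "SELECT hotel_id FROM hotels WHERE location = ? AND check_in_date = ? AND booked = 0",
        (location, check_in_date),
    ).fetchall()

def book_room(hotel_id: str, location: str, check_in_date: str) -> bool:
    # Write: flips the row's booked flag, changing the environment state.
    cur = conn.execute(
        "UPDATE hotels SET booked = 1 WHERE hotel_id = ? AND booked = 0", (hotel_id,)
    )
    conn.commit()
    return cur.rowcount == 1

print(search_hotels("Paris", "2025-07-01"))    # [('H1',)]
print(book_room("H1", "Paris", "2025-07-01"))  # True (state changed)
```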
Part 2: Learning from Simulated Experience
With this universe of simulated environments ready, the researchers could create vast amounts of training data using simulated human–agent interplay.
As illustrated in Figure 2, the system starts with a high-level goal and first generates a golden solution path — a coherent sequence of tool calls sampled from the domain’s tool graph — along with the final “golden” state of the database. The agent then interacts with a simulated user and the environment to achieve the goal, producing an interaction trajectory.
Figure 2: The agent interacts with a simulated user, altering the simulated environment’s state. The process remains fully verifiable by comparing against golden references.
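As a toy illustration of golden-path construction, one can picture a golden path as a short walk over the domain’s tool graph, whose write calls determine the golden end state. The walk-based sampling below is an assumption for clarity; the paper’s task-generation pipeline is more involved:

```python
# Toy sketch: sampling a "golden" solution path as a walk over the tool graph.
# The walk-based sampling is an illustrative assumption, not the paper's procedure.
import random

tool_graph = {                      # adjacency over the domain's tools (illustrative)
    "search_hotels": ["book_room"],
    "book_room": ["search_hotels"],
}

def sample_golden_path(start: str, length: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    path = [start]
    while len(path) < length:
        path.append(rng.choice(tool_graph[path[-1]]))
    return path

golden_path = sample_golden_path("search_hotels", length=2)
print(golden_path)   # ['search_hotels', 'book_room']
# Executing the write calls in this path against a fresh copy of the database
# yields the golden final state used later for verification.
```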
The Three-Stage Filtering Funnel
Not every generated trajectory is worth keeping. To ensure only high-quality experiences make it into training, the authors designed a three-stage filter:
- Validity Control: Remove malformed conversations and repetitive reasoning loops.
- Environment State Alignment: Compare the final database state after the agent’s actions with the golden state. Any mismatch means the `write` operations failed — trajectory discarded.
- Function Calling Exact Match: For `read`-only trajectories (no state change), require the exact golden sequence of tool calls and arguments.
Interestingly, they keep trajectories where a tool call returned an error if the agent still achieved the goal, teaching the model resilience against tool failures.
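Here is a condensed sketch of the funnel’s two verifiable checks, assuming trajectories are stored as plain dicts; the field names and trajectory format are illustrative, not the paper’s data schema:

```python
# Sketch of the verifiable filtering checks. Trajectory format and field names
# are illustrative assumptions.

def keep_trajectory(traj: dict, golden: dict) -> bool:
    # Stage 1 (validity control: malformed turns, degenerate loops) is assumed
    # to have run already; only the two verifiable checks are sketched here.

    if traj["final_db_state"] != golden["final_db_state"]:
        # Stage 2: environment state alignment. A mismatch means some write
        # operation failed or was skipped.
        return False

    if traj["final_db_state"] == traj["initial_db_state"]:
        # Stage 3: read-only task (no state change), so require the exact
        # golden sequence of (tool name, arguments) calls.
        agent_calls = [(c["name"], c["args"]) for c in traj["tool_calls"]]
        golden_calls = [(c["name"], c["args"]) for c in golden["tool_calls"]]
        return agent_calls == golden_calls

    # Write task whose final state matches the golden state: keep it, even if
    # some individual tool call returned an error along the way.
    return True
```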
Two-Stage Agentic Fine-Tuning
The filtered data then fine-tunes a base large language model, with the training loss:
\[ \mathcal{L}(\theta) = -\frac{1}{\sum_{k=1}^{|\mathcal{H}|} \mathbb{I}[x_k \in \mathcal{T}]} \sum_{k=1}^{|\mathcal{H}|} \mathbb{I}[x_k \in \mathcal{T}] \cdot \log \pi_{\theta} \left( x_k \mid x_{<k} \right) \]
Here the indicator \(\mathbb{I}[x_k \in \mathcal{T}]\) masks the loss to the target tokens \(\mathcal{T}\) within the full interaction history \(\mathcal{H}\), so only those tokens contribute to the average (a small sketch of this masking follows the curriculum below). The curriculum:
- Stage 1 — General Foundation: Train on many domains to learn broad tool-usage behaviors.
- Stage 2 — Domain Specialization: Fine-tune on selected vertical domains (e.g., retail, airline) for expert performance.
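As referenced above, a minimal PyTorch-style sketch of the token masking in the loss; tensor shapes and the definition of the mask are assumptions:

```python
# Sketch of the masked fine-tuning loss: average negative log-likelihood over
# only the tokens in the target set T (e.g., the agent's own responses).
# Shapes and the construction of the mask are illustrative assumptions.
import torch
import torch.nn.functional as F

def masked_nll(logits: torch.Tensor, tokens: torch.Tensor, agent_mask: torch.Tensor) -> torch.Tensor:
    """
    logits:     (seq_len, vocab) model predictions for each position
    tokens:     (seq_len,)       the full interaction history x_1 .. x_|H|
    agent_mask: (seq_len,)       1.0 where x_k is in T, 0.0 for user/tool tokens
    """
    nll = F.cross_entropy(logits, tokens, reduction="none")   # -log pi_theta(x_k | x_<k)
    return (nll * agent_mask).sum() / agent_mask.sum()        # average over agent tokens only

# Tiny usage example with random values:
logits = torch.randn(5, 100)
tokens = torch.randint(0, 100, (5,))
mask = torch.tensor([0.0, 0.0, 1.0, 1.0, 1.0])   # last three tokens came from the agent
print(masked_nll(logits, tokens, mask))
```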
The Results: Punching Above Their Weight
Applying this pipeline produced the AgentScaler model family (4B, 8B, 30B-A3B), built on Qwen3. Evaluated on agentic benchmarks (τ-bench, τ²-Bench, ACEBench), the results speak for themselves:
Table 1: AgentScaler models achieve state-of-the-art results for open-source models in their size category — often beating larger baselines.
Highlights:
- AgentScaler-30B-A3B rivals trillion-parameter open-source models and competes closely with closed-source leaders like GPT-o3 and Gemini-2.5-pro.
- AgentScaler-4B holds its own against or surpasses many 30B-parameter models, showing efficiency gains from high-quality, diverse environment training.
Dissecting the Performance
An ablation study (Figure 3) shows Stage 1 gives a big boost over the base model, while Stage 2 delivers further gains, especially in complex agentic tasks.
Figure 3: Both generalist Stage 1 and specialist Stage 2 training lift performance, validating the two-phase curriculum.
AgentScaler also generalizes well. On ACEBench-zh (Chinese), without explicit training for that language, it still beats its base — with the compact 4B model scoring +21.7 points overall.
Table 2: On out-of-distribution tasks, AgentScaler maintains strong performance and large gains, even for small models.
The Road Ahead: Challenges in Agentic Intelligence
The paper doesn’t shy away from challenges.
Stability Challenge:
The `pass^k` metric measures how often a model succeeds consistently across \(k\) independent trials of the same task. As Figure 4 shows, scores drop as \(k\) increases — even AgentScaler, though more stable than its base, sees declines.
Figure 4: Even with gains over the base, stability across repeated trials remains an open challenge.
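For intuition, pass^k is commonly estimated per task from \(n\) independent trials with \(c\) successes as the chance that \(k\) trials drawn at random all succeed; a small sketch of that combinatorial estimator (the numbers here are made up):

```python
# Sketch: per-task estimator of pass^k from n trials with c successes,
# i.e., the probability that k trials drawn without replacement all succeed.
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    if k > n:
        raise ValueError("need at least k trials per task")
    return comb(c, k) / comb(n, k)

# With 8 trials and 6 successes, succeeding once is likely, but succeeding
# 4 times in a row is much less so:
print(pass_hat_k(n=8, c=6, k=1))  # 0.75
print(pass_hat_k(n=8, c=6, k=4))  # ~0.214
```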
Long-Horizon Challenge:
Tasks needing many tool calls are still tough. Figure 5 makes it clear: accuracy drops as the number of required steps grows.
Figure 5: Longer tool-call sequences correlate with lower accuracy, exposing limits in long-horizon reasoning.
Conclusion: A New Era of Experience
AgentScaler offers a compelling solution to the agentic AI data bottleneck. By scaling environments — not just models — via automated, verifiable, simulated tool suites, and pairing that with a rich, staged training regimen, it makes powerful agents possible without massive parameter counts.
Abstracting tools as database operations, constructing domains programmatically, and applying a generalist-to-specialist curriculum are all strong advances. Next steps could include adding reinforcement learning to these stable simulated worlds, extending across modalities, and pushing toward real-world deployment.
We may well be entering a new era of experience for AI: where capability is driven not just by neural scale, but by the breadth, depth, and fidelity of interactions agents are trained on. With AgentScaler, we now have a way to build that experience at unprecedented scale.