Imagine you are trying to learn how to fly a plane. You could read the flight manual, memorize every switch and gauge, and hope for the best when you get in the cockpit. Or, you could spend hours in a flight simulator, facing storms, engine failures, and tricky landings before ever leaving the ground.
For Large Language Models (LLMs) acting as autonomous agents, the “learning” process has historically looked a lot like the first option. Agents—AI systems designed to use tools, browse the web, and execute tasks—often rely on static text descriptions (documentation) to understand how to act. When they encounter a new environment or a complex tool they haven’t seen before, they struggle. The manual might be outdated, the task might require a sequence of steps not described in the text, or the agent simply might not “understand” the nuance of the tool until it tries it.
In this post, we are diving deep into SynWorld, a new framework presented by researchers from Zhejiang University and Alibaba Group. SynWorld flips the script on how agents learn. Instead of relying solely on static data or risky trial-and-error in the real world, SynWorld allows agents to synthesize their own virtual scenarios—effectively building their own flight simulators—and explore them to refine their knowledge.

As illustrated in Figure 1 above, the core idea is simple yet profound: when an agent faces an unfamiliar environment, it generates simulated data, explores it to figure out “How to Act,” and uses the feedback to rewrite its own internal manual (Action Knowledge).
The Problem with Static Knowledge
To understand why SynWorld is necessary, we first need to look at the current state of Agent Planning.
Background: Agent Planning 101
An agent interacts with an environment by perceiving a state, selecting an action to achieve a goal, and receiving feedback. Mathematically, an agent’s planning mechanism \(\mathcal{P}_{\theta}\) can be defined as a function of the state space \(\mathcal{S}\), action space \(\mathcal{A}\), observation space \(\Omega\), and reward function \(\mathcal{R}\).
\[
\mathcal{P}_{\theta} = f_{\pi_{\theta}}(\mathcal{S}, \mathcal{A}, \Omega, \mathcal{R})
\]
Here, \(\pi_{\theta}\) represents the model weights (the “brain” of the LLM). The agent uses this mechanism to generate plans.
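To make the abstraction concrete, here is a minimal sketch (our own illustration, not the paper's implementation) of what this perceive-act-observe loop looks like in code. The `Environment`, `llm_plan`, and `run_agent` names are placeholders standing in for the task environment and the LLM policy \(\pi_{\theta}\).

```python
from dataclasses import dataclass

@dataclass
class Environment:
    """Toy stand-in for the task environment (states, observations, rewards)."""
    state: str = "start"

    def step(self, action: str) -> tuple[str, float, bool]:
        # Return (observation, reward, done). A real environment would execute
        # the tool call and report what actually happened.
        done = action == "finish"
        return f"observed result of {action}", 1.0 if done else 0.0, done

def llm_plan(state: str, observations: list[str], action_knowledge: str) -> str:
    """Stand-in for the LLM policy pi_theta: picks the next action from the
    current state, past observations, and the agent's Action Knowledge."""
    # In practice this is a prompted LLM call; here we just finish immediately.
    return "finish"

def run_agent(env: Environment, action_knowledge: str, max_steps: int = 5) -> float:
    """One planning episode: perceive, act, observe, accumulate reward."""
    observations: list[str] = []
    total_reward = 0.0
    for _ in range(max_steps):
        action = llm_plan(env.state, observations, action_knowledge)
        obs, reward, done = env.step(action)
        observations.append(obs)
        total_reward += reward
        if done:
            break
    return total_reward
```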
The Knowledge Gap
Success in this planning process relies heavily on Action Knowledge. This knowledge consists of two parts:
- Action Description: Knowing what a specific tool does (e.g., “This API searches for videos”).
- Cognitive Workflow: Knowing the strategic sequence of steps to solve a problem (e.g., “First search for the video, then extract the ID, then download the transcript”).
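One minimal way to picture Action Knowledge is as a two-field record: one description per tool plus a shared workflow. This structure is our own illustration for clarity; in SynWorld the knowledge itself is kept as plain text.

```python
from dataclasses import dataclass, field

@dataclass
class ActionKnowledge:
    """Illustrative container for the two halves of Action Knowledge."""
    # What each tool does, keyed by tool name (e.g. "video_search").
    tool_descriptions: dict[str, str] = field(default_factory=dict)
    # The strategic recipe for chaining tools together, kept as free text.
    cognitive_workflow: str = ""

knowledge = ActionKnowledge(
    tool_descriptions={
        "video_search": "Searches for videos matching a query; returns video IDs.",
        "get_transcript": "Downloads the transcript for a given video ID.",
    },
    cognitive_workflow="Search for the video, extract its ID, then fetch the transcript.",
)
```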
The problem is that in new environments, the provided action descriptions are often poorly written or inconsistent with how the tool actually works. Furthermore, knowing what a tool does isn’t the same as knowing how to weave it into a complex workflow. Previous attempts to fix this involved “Self-Refine” loops where the agent tries a task and corrects itself. However, these methods usually rely on single-step scenarios and linear optimization, meaning the agent hits a performance ceiling quickly. They lack a sandbox to truly explore complex, multi-step possibilities.
The SynWorld Method
SynWorld addresses these limitations by creating a closed-loop system where the agent builds a virtual world, plays in it, and learns from it. The framework operates in two main phases: Scenario Synthesis and Action Knowledge Exploration.

As shown in Figure 2, the process begins by extracting tools from a toolkit to generate new scenes (tasks). Then, the agent uses Monte Carlo Tree Search (MCTS) to explore these virtual scenes. Let’s break down these distinct phases.
Phase 1: Virtual Scenario Synthesis
Before an agent can practice, it needs a practice field. SynWorld generates these fields by looking at the available tools (the Action Space) and asking: “What kind of problems could these tools solve?”
The researchers formalize scenario synthesis with the following equation:

Here, the system selects a subset of tools (\(t\)) from the complete toolkit (\(T\)). For each selection, it generates a Background (\(\mathcal{B}\)) and a Goal (\(\mathcal{G}\)).
- Background: The context and constraints (e.g., “You are a travel agent with a budget of $500…”).
- Goal: The objective that requires tool usage (e.g., “…find a flight to Paris and book a hotel”).
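Here is a rough sketch of how such a generator might be wired up: sample a subset of tools, then ask an LLM to invent a Background and Goal that require exactly those tools. The prompt wording and the `call_llm` helper are hypothetical placeholders, not the paper's prompts.

```python
import json
import random

def call_llm(prompt: str) -> str:
    """Hypothetical LLM wrapper; replace with your model client of choice."""
    raise NotImplementedError

def synthesize_scenario(toolkit: dict[str, str], k: int = 2) -> dict:
    """Sample k tools and ask the LLM for a task that needs all of them."""
    tool_names = random.sample(list(toolkit), k=min(k, len(toolkit)))
    tool_docs = "\n".join(f"- {name}: {toolkit[name]}" for name in tool_names)
    prompt = (
        "You are designing a practice task for a tool-using agent.\n"
        f"Available tools:\n{tool_docs}\n"
        "Return JSON with two fields: 'background' (context and constraints) "
        "and 'goal' (an objective that requires every tool listed above)."
    )
    scenario = json.loads(call_llm(prompt))
    scenario["tools"] = tool_names
    return scenario
```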
Ensuring Diversity
If the agent just generates the same easy scenario over and over, it won’t learn anything new. To prevent this, SynWorld enforces a diversity check. It compares the newly generated scenario against existing ones. If the similarity exceeds a certain threshold (\(\epsilon\)), the new scenario is discarded.

This ensures the agent creates a “curriculum” of diverse, non-trivial scenarios that actually challenge its planning capabilities.
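A simple way to implement this diversity check is to embed each scenario and reject any new one whose similarity to an existing scenario exceeds \(\epsilon\). The sketch below uses a crude bag-of-words "embedding" and an arbitrary threshold so it stays self-contained; a real system would use sentence embeddings.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Crude bag-of-words 'embedding'; swap in a real sentence encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_diverse(new_scenario: str, existing: list[str], epsilon: float = 0.8) -> bool:
    """Keep the scenario only if it is not too similar to anything already kept."""
    new_vec = embed(new_scenario)
    return all(cosine(new_vec, embed(s)) <= epsilon for s in existing)

# Usage: only add scenarios that pass the check.
pool: list[str] = []
for candidate in ["Find a flight to Paris under $500", "Find a cheap flight to Paris"]:
    if is_diverse(candidate, pool):
        pool.append(candidate)
```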
Phase 2: Action Knowledge Exploration via MCTS
Once the virtual scenarios are built, how does the agent learn from them? SynWorld employs Monte Carlo Tree Search (MCTS). If you are familiar with how AlphaGo mastered the game of Go, you know the power of MCTS. It is a search algorithm that balances exploration (trying new things) and exploitation (sticking to what works) to find optimal paths.
In the context of SynWorld, the “path” is the refinement of Action Knowledge.
1. Initialization and Expansion
The search tree starts with the agent’s initial, imperfect knowledge. The agent selects a node (a version of knowledge) using the Upper Confidence Bound (UCB) algorithm, which helps it decide whether to refine a promising strategy or try a completely new approach.
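For readers who want the mechanics, here is the standard UCB1 selection rule in code. The exploration constant `c` is a conventional choice of ours; the paper does not prescribe this exact value.

```python
import math

def ucb_score(node_value: float, node_visits: int, parent_visits: int, c: float = 1.41) -> float:
    """UCB1: average reward plus an exploration bonus for rarely visited nodes."""
    if node_visits == 0:
        return float("inf")  # always try an unvisited node first
    exploitation = node_value / node_visits
    exploration = c * math.sqrt(math.log(parent_visits) / node_visits)
    return exploitation + exploration

def select_child(children: list[dict], parent_visits: int) -> dict:
    """Pick the child (a candidate version of Action Knowledge) with the highest UCB score."""
    return max(children, key=lambda n: ucb_score(n["value"], n["visits"], parent_visits))
```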
When expanding a node, the agent looks at its past Optimization Experience (\(\mathcal{E}\)).

This experience record tracks the score before optimization (\(S_{before}\)), the score after (\(S_{after}\)), and the specific modification made (\(\mathcal{M}\)). This history prevents the agent from making the same mistakes twice.
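Conceptually, each entry in the optimization experience is a small before/after record. A minimal sketch follows; the field names are ours, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class OptimizationExperience:
    """One refinement attempt: what was changed and how the score moved."""
    score_before: float   # S_before: reward with the old Action Knowledge
    score_after: float    # S_after: reward after applying the modification
    modification: str     # M: natural-language description of the change

    @property
    def helped(self) -> bool:
        return self.score_after > self.score_before

history = [
    OptimizationExperience(0.4, 0.7, "Clarified that video_search returns IDs, not URLs."),
    OptimizationExperience(0.7, 0.6, "Reordered workflow to download transcripts first."),
]
# The agent can filter for modifications that actually helped before repeating them.
useful = [e.modification for e in history if e.helped]
```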
2. Simulation and Refinement
The agent then uses its current Action Knowledge (\(\mathcal{AK}_{old}\)) and its past experiences to generate a new, optimized version of knowledge (\(\mathcal{AK}_{new}\)).

This new knowledge isn’t just a random guess; it’s an informed evolution based on previous trajectories (\(Tra\)).
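In code, this refinement step boils down to prompting the LLM with the old knowledge, the experience history, and recent trajectories, and asking for a revised version. The prompt below is an illustrative assumption (reusing the experience record sketched above), and `call_llm` is again a placeholder for your model client.

```python
def refine_action_knowledge(old_knowledge: str,
                            experiences: list["OptimizationExperience"],
                            trajectories: list[str],
                            call_llm) -> str:
    """Ask the LLM for AK_new given AK_old, past experience, and recent trajectories."""
    experience_lines = "\n".join(
        f"- {e.modification} (score {e.score_before:.2f} -> {e.score_after:.2f})"
        for e in experiences
    )
    prompt = (
        "Current Action Knowledge:\n"
        f"{old_knowledge}\n\n"
        "Past modifications and their effect:\n"
        f"{experience_lines}\n\n"
        "Recent execution trajectories:\n"
        + "\n".join(trajectories)
        + "\n\nRewrite the Action Knowledge (tool descriptions and workflow) so that "
        "failed steps are addressed and successful changes are preserved."
    )
    return call_llm(prompt)
```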
3. Feedback Collection
Now, the agent puts this new knowledge to the test. It attempts to solve the virtual scenario using the updated manual and workflow.

The environment returns a trajectory (\(Tra_i\))—essentially a log of what happened—and a reward score (\(S_i\)). If the agent succeeded, the knowledge is validated. If it failed (e.g., an API error or a wrong answer), the failure becomes a learning signal.
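Putting it together, evaluating a candidate knowledge version means running it against a batch of synthesized scenarios, averaging the reward, and keeping the trajectory logs so the next refinement round can learn from failures. This is our own sketch; `run_agent` here is assumed to execute one episode and return a (reward, trajectory log) pair.

```python
def evaluate_knowledge(action_knowledge: str,
                       scenarios: list[dict],
                       run_agent) -> tuple[float, list[str]]:
    """Run the candidate Action Knowledge on every virtual scenario.

    Returns the mean reward S and one trajectory log per scenario, which feed
    back into the node statistics and the next refinement step.
    """
    scores: list[float] = []
    trajectories: list[str] = []
    for scenario in scenarios:
        reward, log = run_agent(scenario, action_knowledge)
        scores.append(reward)
        trajectories.append(log)
    mean_score = sum(scores) / len(scores) if scores else 0.0
    return mean_score, trajectories
```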
This loop allows for Bidirectional Refinement. The agent simultaneously improves:
- The Tool Description: Making it more accurate to the code implementation.
- The Workflow: Finding better strategies to chain tools together.
Experiments and Results
Does “dreaming” in a virtual world actually help agents perform in the real world? The researchers tested SynWorld on two challenging benchmarks:
- ToolBench: A massive dataset involving over 16,000 real-world APIs.
- HotpotQA: A dataset requiring multi-hop reasoning (answering questions that require multiple search steps).
Main Performance Comparison
The results, summarized in Table 1 below, show that SynWorld consistently outperforms a range of baselines, including ReAct, Self-Refine, and EasyTool.

Key Takeaways from the Data:
- ToolBench: SynWorld achieved a Pass Rate of 59.33 and a Win Rate of 73.00 using GPT-4-turbo. This is a significant jump over standard methods like ReAct (Pass Rate 50.67).
- HotpotQA: SynWorld achieved state-of-the-art results, indicating that the framework helps not just with tool usage, but with complex reasoning and planning workflows.
- Consistency: The improvement holds true across different backend models, including Qwen-long and Qwen2-72B, proving that the method is generalizable and not just overfitting to a specific LLM.
Ablation Study: Do we need both Workflow and Description?
The researchers performed an ablation study to see which part of the “Action Knowledge” was most important.

As Table 2 shows, removing either the Workflow optimization or the Description optimization leads to a drop in performance.
- w/o Workflow: The agent knows what the tools do but struggles to plan complex sequences.
- w/o Description: The agent has a plan but fails to execute specific tool calls correctly due to misunderstanding parameters or inputs.
The synergy is crucial: accurate tool descriptions help build better workflows, and executing workflows exposes hidden nuances in tool descriptions.
The Impact of “Dreaming” More
One of the most interesting questions is: “How much practice is enough?” The researchers analyzed how performance changes as the agent synthesizes more scenarios.

Figure 3 demonstrates a clear trend: More simulated data leads to better performance. The pass rate climbs steadily as the number of scenarios increases from 0 to 100. While the returns diminish slightly after 150 scenarios, the trajectory remains upward. This confirms that Action Knowledge is indeed learnable and scalable through synthesis.
Virtual Practice vs. Real Performance
Finally, the researchers asked if the knowledge gained in the virtual world transfers to the real world.

Figure 4 plots the pass rates in both virtual and real environments against the number of optimization iterations. The trends are nearly identical. This is a critical finding: It validates that the synthesized scenarios are high-quality proxies for the real world. An agent can improve its real-world capabilities without ever touching the real environment, simply by iterating in its self-generated virtual playground.
Conclusion and Implications
SynWorld represents a significant step forward in autonomous agent training. By allowing agents to synthesize their own training data (scenarios) and rigorously explore them using MCTS, the framework solves the “cold start” problem of deploying agents in new environments.
The key contributions are:
- Autonomy: Agents act as their own teachers, generating scenarios that target their specific knowledge gaps.
- Dual Refinement: The system improves both low-level tool understanding (descriptions) and high-level planning strategies (workflows) simultaneously.
- Generalization: Knowledge learned in the virtual sandbox transfers effectively to real-world tasks.
There are still challenges to address. Synthesizing scenarios is computationally expensive (token-intensive), and the current knowledge representation is purely text-based. Future work might explore more structured knowledge formats (like code snippets) or more efficient ways to filter synthesized scenarios.
However, the precedent is set: for AI agents to master the complexities of the real world, they first need to master the worlds they create for themselves.