Introduction: The Data Bottleneck for Web-Savvy AI
Large Language Model (LLM)-powered agents are rapidly evolving from simple chatbots into sophisticated digital assistants capable of tackling complex, open-ended tasks. Systems like OpenAI’s Deep Research, Google’s Gemini, and Perplexity AI can browse the web, gather information from multiple sources, and synthesize answers to questions that would have been impossible just a few years ago. This core capability is known as Information-Seeking (IS) — the engine driving the next generation of AI.
However, a major roadblock is slowing down progress: the scarcity of high-quality training data. To teach an agent how to effectively seek information, you need vast datasets of complex questions paired with the step-by-step reasoning and web-browsing actions needed to solve them. Creating this data manually is incredibly expensive and time-consuming.
Naturally, researchers have turned to AI to generate synthetic data. The prevailing approach, which the authors of a new paper call information-driven, works by first scraping a large corpus of information from the web and then prompting an LLM to create questions based on that content.
Figure 2: Data synthesis paradigm shift from information-driven (left) to formalization-driven (right). WebShaper flips the process by defining precise task structures before gathering data.
While this seems logical, it has two critical flaws:
- The LLM might struggle to craft a question whose reasoning structure perfectly matches the retrieved information, leading to inconsistencies or incorrect answers.
- The “collect-first, ask-later” method is inefficient, frequently gathering redundant data and limiting diversity.
To solve this, researchers from Alibaba Group propose a radical shift in their paper, WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization. Instead of starting with messy, unstructured web data, they begin with a formal, mathematical blueprint of what an IS task should be. This formalization-driven approach lets them precisely control the complexity and structure of a task before gathering the necessary information, resulting in higher-quality, more diverse, and more consistent training data.
In this article, we’ll unpack the WebShaper framework: its set-theory-based formalization of IS tasks, its agentic system for deep question expansion, and experimental results showing how this new paradigm trains state-of-the-art open-source IS agents.
A Formal Blueprint for Seeking Knowledge
Before building better questions, we need a better way to define a question. The authors argue that natural language is too ambiguous for systematic data generation. Instead, they propose a set theory-based formal language for IS tasks.
Consider this example from the paper:
Which player of a team in the 2004–05 season, who was born in the 90s? This team is founded in 1966 and is an East German football team.
To answer, you would:
- Identify the team that was founded in 1966 AND is an East German football team (→ Berliner FC Dynamo).
- Find players who played for that team in 2004 OR 2005.
- Find all players born in the 1990s.
- Intersect results from steps 2 and 3.
The formalization’s basic unit is the Knowledge Projection (KP) — the set of entities sharing a particular relation with another set. For a set \(V\) and a relation \(R\):
\[ R(V) = \{ u \mid \exists v \in V,\ (u, v) \in R \ \text{or} \ (v, u) \in R \} \]

For example, if \(R\) = bornIn and \(V\) = {'90s'}, then \(R(V)\) is the set of all people born in the 1990s.
Figure 3: A question-answer case in WebShaper’s formalization. Purple shapes represent sets linked by relations.
IS tasks are built from KPs using:
Intersection (∩): Targets must satisfy all conditions.
\[ R(V) = R_1(S_1) \cap R_2(S_2) \cap \dots \cap R_n(S_n) \]
Example: players who playAt 2000 AND bornIn the 90s.

R-Union (∪): Targets can satisfy any of several conditions on the same relation.
\[ R(V) = R(S_1) \cup R(S_2) \cup \dots \cup R(S_m) \]
Example: players who playAt 2004 OR 2005.
Any IS task reduces to finding the elements of a target set \(T\) built from these operations:
\[ q(T) \triangleq\ ?T \]

For our football example:
\[ \begin{aligned} q(T) &\triangleq T = R_{playIn}(T_1) \cap \big( R_{playAt}(\{2004\}) \cup R_{playAt}(\{2005\}) \big) \\ &\quad \cap \bigcup_{y=1990}^{1999} R_{bornIn}(\{y\}) \\ T_1 &= R_{foundIn}(\{1966\}) \cap R_{isA}(\{\text{East German football team}\}) \end{aligned} \]

This machine-readable skeleton lets WebShaper control reasoning paths precisely.
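To make the formalism concrete, here is a minimal Python sketch of Knowledge Projection over a toy triple store, composing the football query from intersections and unions. The knowledge graph, player names, and the `project` helper are illustrative assumptions, not the paper's implementation; only the team's founding year and East German status mirror the running example.

```python
# Toy knowledge graph as (subject, relation, object) triples.
# The squad facts are invented for the demo.
KG = {
    ("Player A", "bornIn", "1992"),
    ("Player B", "bornIn", "1985"),
    ("Player C", "bornIn", "1995"),
    ("Player A", "playIn", "Berliner FC Dynamo"),
    ("Player B", "playIn", "Berliner FC Dynamo"),
    ("Player C", "playIn", "Berliner FC Dynamo"),
    ("Player A", "playAt", "2004"),
    ("Player B", "playAt", "2005"),
    ("Player C", "playAt", "2005"),
    ("Berliner FC Dynamo", "foundIn", "1966"),
    ("Berliner FC Dynamo", "isA", "East German football team"),
}

def project(relation: str, values: set) -> set:
    """Knowledge Projection: R(V) = {u | exists v in V, (u, v) in R or (v, u) in R}."""
    forward = {s for (s, r, o) in KG if r == relation and o in values}
    backward = {o for (s, r, o) in KG if r == relation and s in values}
    return forward | backward

# T1 = R_foundIn({1966}) ∩ R_isA({East German football team})
T1 = project("foundIn", {"1966"}) & project("isA", {"East German football team"})

# T = R_playIn(T1) ∩ (R_playAt({2004}) ∪ R_playAt({2005})) ∩ ⋃_y R_bornIn({y})
born_in_90s = set().union(*(project("bornIn", {str(y)}) for y in range(1990, 2000)))
T = (project("playIn", T1)
     & (project("playAt", {"2004"}) | project("playAt", {"2005"}))
     & born_in_90s)

print(T1)  # {'Berliner FC Dynamo'}
print(T)   # {'Player A', 'Player C'}
```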
The WebShaper Pipeline: From Blueprint to High-Quality Data
The formal language is the backbone; the pipeline turns it into rich datasets.
Step 1: Planting the Seeds
Researchers first generated 18,000 high-quality “seed questions” via random walks over an offline Wikipedia link graph. An LLM produced QA pairs grounded solely in the visited content, and a filtering pass removed noisy or hallucinated seeds.
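A minimal sketch of this seeding idea, assuming an adjacency-list view of the Wikipedia link graph; `link_graph`, `fetch_text`, and `llm_generate_qa` are hypothetical placeholders, not the paper's actual components.

```python
import random

def random_walk(link_graph: dict, start: str, steps: int = 3) -> list:
    """Walk up to `steps` hops over article hyperlinks, recording visited titles."""
    path, node = [start], start
    for _ in range(steps):
        neighbors = link_graph.get(node, [])
        if not neighbors:
            break
        node = random.choice(neighbors)
        path.append(node)
    return path

def make_seed(link_graph: dict, fetch_text, llm_generate_qa, start: str) -> dict:
    """Produce one seed QA pair grounded solely in the walked articles."""
    articles = random_walk(link_graph, start)
    context = "\n\n".join(fetch_text(title) for title in articles)
    question, answer = llm_generate_qa(context)  # the LLM sees only visited content
    return {"articles": articles, "question": question, "answer": answer}
```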
Step 2: Agentic Expansion
Seeds grow into complex, multi-hop tasks via an autonomous Expander agent. The Expander uses WebShaper’s KP Representation — e.g.:
\[ [V@T, playIn, V@X],\ [V@T, playAt, C@2004\_05],\ [V@T, bornIn, C@90s],\ [V@X, foundIn, C@1966],\ [V@X, isA, C@\text{East German football team}] \]

Naïve expansion risks two pitfalls (both appear as checks in the sketch after this list):
- Redundancy: adding facts connecting constants without increasing reasoning depth.
- Reasoning shortcuts: adding facts that link directly to the target variable.
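Both pitfalls are easy to state as checks over the tagged-triple representation. The sketch below models the KP Representation as Python tuples; the V@/C@ tags follow the paper's notation, while the helper names are my own.

```python
# The running example's KP Representation as tagged triples.
task = [
    ("V@T", "playIn",  "V@X"),
    ("V@T", "playAt",  "C@2004_05"),
    ("V@T", "bornIn",  "C@90s"),
    ("V@X", "foundIn", "C@1966"),
    ("V@X", "isA",     "C@East German football team"),
]

def is_variable(node: str) -> bool:
    return node.startswith("V@")

def is_redundant(fact: tuple) -> bool:
    """A fact between two constants adds information but no reasoning depth."""
    s, _, o = fact
    return not is_variable(s) and not is_variable(o)

def is_shortcut(fact: tuple, target: str = "V@T") -> bool:
    """A new fact touching the target variable directly can bypass the chain."""
    s, _, o = fact
    return target in (s, o)

print(is_redundant(("C@1966", "precedes", "C@1967")))  # True: constant-to-constant
print(is_shortcut(("V@T", "wonAward", "C@2005")))      # True: links straight to the target
```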
Figure 4: Comparing expansion paradigms. Layer-wise avoids pitfalls in Random/Sequential structures.
Layer-wise Expansion Strategy
Treat each question as a variable/constant graph. The Expander:
- Finds all “leaf” constants.
- Replaces a constant with a sub-problem that resolves to it.
- Merges the sub-question into the main query, preserving the original answer.
This deepens the reasoning chain without shortcuts.
Example: Replace C@1966 with a sub-problem about the team’s founding year, inferred via another historical fact.
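Here is a sketch of a single layer-wise expansion step under those rules. The graph encoding and the `expand_once` helper are assumptions; England's 1966 FIFA World Cup win is used only as an illustrative sub-fact that resolves to the removed constant.

```python
def leaf_constants(task: list) -> set:
    """All constants (C@...) in the task graph: candidate expansion points."""
    return {n for (s, _, o) in task for n in (s, o) if n.startswith("C@")}

def expand_once(task: list, constant: str, fresh_var: str, new_facts: list) -> list:
    """Promote `constant` to `fresh_var` and merge facts that pin it down.

    Because the sub-question resolves exactly to the removed constant, the
    expanded task keeps the original answer while gaining one reasoning layer."""
    rewritten = [(fresh_var if s == constant else s, r,
                  fresh_var if o == constant else o)
                 for (s, r, o) in task]
    return rewritten + new_facts

task = [("V@X", "foundIn", "C@1966"),
        ("V@X", "isA", "C@East German football team")]
print(leaf_constants(task))  # {'C@1966', 'C@East German football team'}

expanded = expand_once(
    task,
    constant="C@1966",
    fresh_var="V@Y",
    # Sub-question: "In which year did England win the FIFA World Cup?" -> 1966
    new_facts=[("V@Y", "isYearOf", "C@England FIFA World Cup win")],
)
print(expanded)
```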
Expander Tools:
- Search: query Google for targeted info.
- Summarize: combine multiple sources (enabling R-Union).
- Validate: confirm correctness & complexity; reject too-simple sub-questions.
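A rough sketch of how such a tool set might be wired up. The function bodies are stand-ins with no real web access, and the dispatch format is an assumption, not the paper's interface.

```python
def search(query: str) -> list:
    """Stand-in for a Google search; returns canned snippets for the demo."""
    return [f"snippet about {query}"]

def summarize(snippets: list) -> str:
    """Stand-in for an LLM merging several sources (what enables R-Union facts)."""
    return " | ".join(snippets)

def validate(sub_question: str) -> bool:
    """Stand-in for the correctness/complexity check; here: reject trivial questions."""
    return len(sub_question.split()) > 3

TOOLS = {"search": search, "summarize": summarize, "validate": validate}

def dispatch(call: dict):
    """Execute one tool call of the assumed form {"tool": name, "args": [...]}."""
    return TOOLS[call["tool"]](*call["args"])

print(dispatch({"tool": "search", "args": ["Berliner FC Dynamo founding year"]}))
```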
Steps 3 & 4: Trajectory Construction & Training
A second agent solves expanded questions, generating step-by-step trajectories (Thought–Action–Observation). After filtering for correctness, 5,000 of these trajectories feed into Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to train WebShaper agents.
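A Thought–Action–Observation trajectory might be serialized like this for SFT. The JSON schema and field values below are assumptions, since the paper describes the loop but not a concrete storage format.

```python
import json

trajectory = {
    "question": ("Which player of a team in the 2004-05 season, who was born "
                 "in the 90s? This team is founded in 1966 and is an East "
                 "German football team."),
    "steps": [
        {
            "thought": "First pin down the East German team founded in 1966.",
            "action": {"tool": "search", "args": ["East German football club founded 1966"]},
            "observation": "Berliner FC Dynamo, founded 15 January 1966, ...",
        },
        {
            "thought": "Now look for 2004-05 squad members born in the 1990s.",
            "action": {"tool": "search", "args": ["Berliner FC Dynamo 2004-05 squad"]},
            "observation": "...squad list with birth dates...",
        },
    ],
    "answer": "<final answer, kept only if it passes the correctness filter>",
}

print(json.dumps(trajectory, indent=2))
```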
Experiments: Putting WebShaper to the Test
WebShaper-trained agents based on Qwen-2.5 (32B, 72B) and QwQ-32B were evaluated on GAIA and WebWalkerQA, using Pass@1 scoring.
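Pass@1 counts a task as solved only if the agent's single first attempt is judged correct. A minimal sketch, assuming a pluggable `judge` function (real benchmarks typically use more lenient rule- or LLM-based judges):

```python
def pass_at_1(predictions: list, references: list, judge) -> float:
    """Percentage of tasks whose single (first) attempt the judge accepts."""
    correct = sum(judge(p, r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Toy usage with an exact-match judge:
score = pass_at_1(["Berlin", "42"], ["Berlin", "43"], judge=lambda p, r: p == r)
print(f"Pass@1 = {score:.1f}")  # 50.0
```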
Main Results
Table 1: WebShaper outperforms all open-source baselines on GAIA and WebWalkerQA.
On GAIA, WebShaper-72B scored 60.1, beating WebDancer (51.5) and WebSailor (55.4).
Figure 1: GAIA leaderboard. WebShaper is the top open-source performer.
Performance gains held across all backbones, confirming that the dataset generalizes across model families.
Why WebShaper Works: Formalization & Structure
Ablation studies isolate the key innovations:
Figure 7: Left — Formalization (FL) consistently beats Natural Language (NL).
Right — Layer-wise expansion outperforms Sequential expansion.
- Formalization vs. Natural Language: Formal KP representation yields richer, more accurate tasks.
- Layer-wise vs. Sequential: Controlled expansions avoid redundancy/shortcuts, producing higher-quality reasoning chains.
Generating Deeper Reasoning Tasks
Figure 8: Tool usage distribution. WebShaper tasks require more multi-hop search/visits.
WebShaper questions have a “long tail” of >3 tool calls — evidence of complex, multi-hop reasoning demands.
Case Study: Structural Integrity
Figure 10: Example from another system with redundancy & shortcut issues.
In contrast, WebShaper’s generated queries avoid irrelevant constants and ensure reasoning paths traverse all necessary variables — no premature links to the target.
Conclusion: A Paradigm Shift in IS Data Synthesis
WebShaper’s formalization-driven approach tackles the weaknesses of traditional information-driven synthesis:
- Precision: set-theory task definitions control reasoning paths.
- Diversity & Coverage: formalization encourages varied, high-complexity tasks.
- Structural Integrity: layer-wise expansion ensures every fact matters.
The outcome? State-of-the-art open-source IS agents on GAIA and WebWalkerQA.
Beyond a single dataset, WebShaper offers a general methodology for designing cognitive challenges for AI agents — decoupling task specification from data generation, enabling fine-grained control over difficulty, quality, and scale. This proactive, blueprint-first paradigm paves the way for building AI that can truly master seeking knowledge across the open web.