Introduction: The Data Bottleneck for Web-Savvy AI

Large Language Model (LLM)-powered agents are rapidly evolving from simple chatbots into sophisticated digital assistants capable of tackling complex, open-ended tasks. Systems like OpenAI’s Deep Research, Google’s Gemini, and Perplexity AI can browse the web, gather information from multiple sources, and synthesize answers to questions that would have been out of reach just a few years ago. This core capability, known as Information-Seeking (IS), is the engine driving the next generation of AI agents.

However, a major roadblock is slowing down progress: the scarcity of high-quality training data. To teach an agent how to effectively seek information, you need vast datasets of complex questions paired with the step-by-step reasoning and web-browsing actions needed to solve them. Creating this data manually is incredibly expensive and time-consuming.

Naturally, researchers have turned to AI to generate synthetic data. The prevailing approach, which the authors of a new paper call information-driven, works by first scraping a large corpus of information from the web and then prompting an LLM to create questions based on that content.

Figure 2: Data synthesis paradigm shift from information-driven (left) to formalization-driven (right). WebShaper flips the process by defining precise task structures before gathering data.

While this seems logical, it has two critical flaws:

  1. The LLM might struggle to craft a question whose reasoning structure perfectly matches the retrieved information, leading to inconsistencies or incorrect answers.
  2. The “collect-first, ask-later” method is inefficient, frequently gathering redundant data and limiting diversity.

To solve this, researchers from Alibaba Group propose a radical shift in their paper, WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization. Instead of starting with messy, unstructured web data, they begin with a formal, mathematical blueprint of what an IS task should be. This formalization-driven approach lets them precisely control the complexity and structure of a task before gathering the necessary information, resulting in higher-quality, more diverse, and more consistent training data.

In this article, we’ll unpack the WebShaper framework: its set-theory-based formalization of IS tasks, its agentic system for deep question expansion, and experimental results showing how this new paradigm trains state-of-the-art open-source IS agents.


A Formal Blueprint for Seeking Knowledge

Before building better questions, we need a better way to define a question. The authors argue that natural language is too ambiguous for systematic data generation. Instead, they propose a set-theory-based formal language for IS tasks.

Consider this example from the paper:

Which player of a team in the 2004–05 season, who was born in the 90s? This team is founded in 1966 and is an East German football team.

To answer, you would:

  1. Identify the team that was founded in 1966 AND is an East German football team (→ Berliner FC Dynamo).
  2. Find players who played for that team in 2004 OR 2005.
  3. Find all players born in the 1990s.
  4. Intersect results from steps 2 and 3.

The formalization’s basic unit is the Knowledge Projection (KP): the set of entities that stand in a given relation to at least one element of another set. For a set \(V\) and a relation \(R\):

\[ R(V) = \{ u \mid \exists v \in V,\ (u, v) \in R \ \text{or} \ (v, u) \in R \} \]

For example, if \(R\) = bornIn and \(V\) = {'90s'}, then \(R(V)\) is the set of all people born in the 1990s.
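
To make this concrete, here is a minimal Python sketch of a KP, with the relation stored as a set of (subject, object) pairs; the entities and facts below are invented for illustration:

```python
# A minimal sketch of a Knowledge Projection; the relation is a set of
# (subject, object) pairs, and all data here is invented.

def kp(relation: set[tuple[str, str]], values: set[str]) -> set[str]:
    """R(V): every entity related to some v in V, in either direction."""
    return ({u for (u, v) in relation if v in values}
            | {v for (u, v) in relation if u in values})

born_in = {("Alice", "90s"), ("Bob", "80s"), ("Carol", "90s")}
print(kp(born_in, {"90s"}))  # {'Alice', 'Carol'}
```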

Figure 3: A question-answer case in WebShaper’s formalization. Purple shapes represent sets linked by relations.

IS tasks are built from KPs using:

  1. Intersection (∩): Targets must satisfy all conditions at once.
    Example: players who playAt 2000 AND were bornIn the 90s:

    \[ T = R_1(S_1) \cap R_2(S_2) \cap \dots \cap R_n(S_n) \]
  2. R-Union (∪): Targets can satisfy any of several conditions under the same relation; a KP distributes over unions of its argument:
    Example: players who playAt 2004 OR 2005:

    \[ R(S_1 \cup S_2 \cup \dots \cup S_m) = R(S_1) \cup R(S_2) \cup \dots \cup R(S_m) \]

Any IS task reduces to finding the elements of a target set \(T\) built from these operations:

\[ q(T) \triangleq ?T \]

For our football example:

\[ \begin{aligned} q(T) &\triangleq T = R_{playIn}(T_1) \cap \big( R_{playAt}(\{2004\}) \cup R_{playAt}(\{2005\}) \big) \\ &\quad \cap \bigcup_{y=1990}^{1999} R_{bornIn}(\{y\}) \\ T_1 &= R_{foundIn}(\{1966\}) \cap R_{isA}(\{\text{East German football team}\}) \end{aligned} \]

This machine-readable skeleton lets WebShaper control reasoning paths precisely.
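
As a sanity check, the whole query can be evaluated with ordinary set operations over a hypothetical mini knowledge base, reusing the kp function from the sketch above (every fact and name below is made up):

```python
# Hypothetical facts; in reality these would be retrieved from the web.
founded_in = {("Berliner FC Dynamo", "1966"), ("Other FC", "1970")}
is_a       = {("Berliner FC Dynamo", "East German football team")}
plays_in   = {("Player A", "Berliner FC Dynamo"),
              ("Player B", "Berliner FC Dynamo")}
plays_at   = {("Player A", "2004"), ("Player B", "2005")}
born_in    = {("Player A", "1992"), ("Player B", "1985")}

# T1 = R_foundIn({1966}) ∩ R_isA({East German football team})
t1 = kp(founded_in, {"1966"}) & kp(is_a, {"East German football team"})

# T = R_playIn(T1) ∩ (R_playAt({2004}) ∪ R_playAt({2005})) ∩ ⋃_y R_bornIn({y})
nineties = {str(y) for y in range(1990, 2000)}
t = (kp(plays_in, t1)
     & (kp(plays_at, {"2004"}) | kp(plays_at, {"2005"}))
     & kp(born_in, nineties))

print(t)  # {'Player A'}
```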


The WebShaper Pipeline: From Blueprint to High-Quality Data

The formal language is the backbone; the pipeline turns it into rich datasets.

Step 1: Planting the Seeds

Researchers first generated 18,000 high-quality “seed questions” from an offline Wikipedia graph via random walks on linked articles. An LLM produced QA pairs grounded solely in visited content. Filtering removed noisy or hallucinated seeds.
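
The paper does not spell out the walk procedure; a simple version over an article-link graph might look like this (the graph format and hop count are assumptions):

```python
import random

def random_walk(graph: dict[str, list[str]], start: str, hops: int) -> list[str]:
    """Sample a chain of linked articles to ground one seed question."""
    path = [start]
    for _ in range(hops):
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break
        path.append(random.choice(neighbors))
    return path

# Hypothetical link graph: article title -> linked article titles.
links = {"Berliner FC Dynamo": ["DDR-Oberliga", "East Germany"],
         "DDR-Oberliga": ["East Germany"]}
print(random_walk(links, "Berliner FC Dynamo", 2))
```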

Step 2: Agentic Expansion

Seeds grow into complex, multi-hop tasks via an autonomous Expander agent. The Expander operates on WebShaper’s KP Representation of each task, e.g.:

\[ [V@T, playIn, V@X],\ [V@T, playAt, C@2004\_05],\ [V@T, bornIn, C@90s],\ [V@X, foundIn, C@1966],\ [V@X, isA, C@\text{East German football team}] \]
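
One plausible in-memory encoding of these triples, where the V@/C@ tags mark variables versus constants (the class and field names are our assumption, not the paper’s schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    kind: str  # "V" for variable, "C" for constant
    name: str

@dataclass(frozen=True)
class Triple:
    head: Node
    relation: str
    tail: Node

task = [
    Triple(Node("V", "T"), "playIn",  Node("V", "X")),
    Triple(Node("V", "T"), "playAt",  Node("C", "2004_05")),
    Triple(Node("V", "T"), "bornIn",  Node("C", "90s")),
    Triple(Node("V", "X"), "foundIn", Node("C", "1966")),
    Triple(Node("V", "X"), "isA",     Node("C", "East German football team")),
]
```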

Naïve expansion risks:

  • Redundancy: adding facts connecting constants without increasing reasoning depth.
  • Reasoning shortcuts: adding facts that link directly to the target variable.

Figure 4: Comparing expansion paradigms. Layer-wise avoids pitfalls in Random/Sequential structures.

Layer-wise Expansion Strategy
Treat each question as a variable/constant graph. The Expander:

  1. Finds all “leaf” constants.
  2. Replaces a constant with a sub-problem that resolves to it.
  3. Merges the sub-question into the main query, preserving the original answer.

This deepens the reasoning chain without shortcuts.
Example: Replace C@1966 with a sub-problem about a team’s founding year inferred via another historical fact.
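
In pseudocode, one pass of this strategy could look like the sketch below; it reuses the Node/Triple types from the earlier encoding, and expand_constant and merge stand in for LLM-driven steps (both are hypothetical helpers):

```python
def layerwise_expand(task, expand_constant, merge):
    """One expansion pass: grow the task graph at its constant leaves."""
    # 1. Collect the current "leaf" constants of the task graph.
    leaves = {t.tail for t in task if t.tail.kind == "C"}
    for leaf in leaves:
        # 2. Ask the agent for a sub-problem whose unique answer is this
        #    constant, e.g. derive C@1966 from another historical fact.
        sub_task = expand_constant(leaf)
        if sub_task is None:  # the Validate tool rejected it
            continue
        # 3. Merge the sub-question into the main query; the constant
        #    becomes a fresh variable, so the final answer is unchanged.
        task = merge(task, leaf, sub_task)
    return task
```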

Expander Tools:

  • Search: query Google for targeted info.
  • Summarize: combine multiple sources (enabling R-Union).
  • Validate: confirm correctness & complexity; reject too-simple sub-questions.
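
The paper names these tools but not their interfaces; a guessed signature set might look like this:

```python
from typing import Protocol

class ExpanderTools(Protocol):
    """Assumed interface for the Expander's tools; signatures are guesses."""

    def search(self, query: str) -> list[str]:
        """Query a search engine and return candidate snippets."""
        ...

    def summarize(self, documents: list[str]) -> str:
        """Fuse several sources into one claim (what enables R-Union)."""
        ...

    def validate(self, sub_question: str, answer: str) -> bool:
        """Accept only sub-questions that are correct and hard enough."""
        ...
```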

Steps 3 & 4: Trajectory Construction & Training

A second agent solves expanded questions, generating step-by-step trajectories (Thought–Action–Observation). After filtering for correctness, 5,000 of these trajectories feed into Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to train WebShaper agents.
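
A minimal record for such a trajectory might look like this (field names are illustrative rather than the paper’s exact schema):

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str      # the agent's reasoning before acting
    action: str       # e.g. 'search("Berliner FC Dynamo founded")'
    observation: str  # tool output fed back to the agent

@dataclass
class Trajectory:
    question: str
    steps: list[Step]
    final_answer: str  # kept for training only if it matches the gold answer
```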


Experiments: Putting WebShaper to the Test

WebShaper-trained agents based on Qwen-2.5 (32B, 72B) and QwQ-32B were evaluated on GAIA and WebWalkerQA, using Pass@1 scoring.
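
With a single attempt per question, Pass@1 reduces to plain accuracy; a minimal sketch assuming normalized string matching as the grader (real benchmark graders are more nuanced):

```python
def pass_at_1(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions whose single attempt matches the gold answer."""
    hits = sum(p.strip().lower() == a.strip().lower()
               for p, a in zip(predictions, answers))
    return hits / len(answers)

print(pass_at_1(["Berliner FC Dynamo"], ["berliner fc dynamo"]))  # 1.0
```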

Main Results

Table 1: Main results on GAIA and WebWalkerQA. WebShaper models achieve the highest scores among open-source methods.

On GAIA, WebShaper-72B scored 60.1, beating WebDancer (51.5) and WebSailor (55.4).

Figure 1: GAIA leaderboard. WebShaper is the top open-source performer.

Performance gains held across all backbones, confirming dataset generalizability.


Why WebShaper Works: Formalization & Structure

Ablation studies isolate the key innovations:

Figure 7: Left — Formalization (FL) consistently beats Natural Language (NL).
Right — Layer-wise expansion outperforms Sequential expansion.

  • Formalization vs. Natural Language: Formal KP representation yields richer, more accurate tasks.
  • Layer-wise vs. Sequential: Controlled expansions avoid redundancy/shortcuts, producing higher-quality reasoning chains.

Generating Deeper Reasoning Tasks

Figure 8: Tool usage distribution. WebShaper tasks require more multi-hop search/visits.

WebShaper questions show a long tail of more than three tool calls per task, clear evidence of complex, multi-hop reasoning demands.


Case Study: Structural Integrity

Figure 10: Example from another system with redundancy & shortcut issues.

In contrast, WebShaper’s generated queries avoid irrelevant constants and ensure that reasoning paths traverse all necessary variables, with no premature links to the target.


Conclusion: A Paradigm Shift in IS Data Synthesis

WebShaper’s formalization-driven approach tackles the weaknesses of traditional information-driven synthesis:

  • Precision: set-theory task definitions control reasoning paths.
  • Diversity & Coverage: formalization encourages varied, high-complexity tasks.
  • Structural Integrity: layer-wise expansion ensures every fact matters.

The outcome? State-of-the-art open-source IS agents on GAIA and WebWalkerQA.

Beyond a single dataset, WebShaper offers a general methodology for designing cognitive challenges for AI agents: by decoupling task specification from data generation, it enables fine-grained control over difficulty, quality, and scale. This proactive, blueprint-first paradigm paves the way for building AI that can truly master seeking knowledge across the open web.