Large Language Models (LLMs) are evolving from simple chatbots into sophisticated agents that can use tools to accomplish complex tasks. One of the most critical tools in an agent’s toolkit is the ability to browse the web—a gateway to the world’s information.

While commercial models like OpenAI’s GPT-4 and Google’s Gemini have made impressive strides, their web-browsing strategies remain proprietary. In contrast, many open-source web agents struggle to match this performance, particularly on tasks that require deep, multi-step research.

What’s the bottleneck?
A new paper, WEBEXPLORER: Explore and Evolve for Training Long-Horizon Web Agents, argues that the limiting factor isn’t the models themselves—it’s the training data. Building an agent that can solve intricate problems demands training on genuinely challenging queries, the kind that might even stump human researchers.

The authors present a clever two-stage framework for automatically generating a large dataset of difficult, web-based questions and answers. By training an 8-billion-parameter model, WEBEXPLORER-8B, on this data, they achieve state-of-the-art performance for its size — even outperforming models up to ten times larger on several benchmarks.


The Core Problem: Scarcity of Hard Problems

Modern benchmarks for web agents, such as BrowseComp, feature questions so difficult that human annotators fail to solve more than half of them, even after hours of effort. These benchmarks are great for evaluation but are too small and expensive to use for large-scale training.

Existing data synthesis approaches have struggled:

  • Graph-based methods crawl web pages to build explicit knowledge graphs, but require complex heuristics for node expansion and selection.
  • Evolution-based methods modify simple questions to make them longer and ostensibly harder. Often, the results are unnaturally convoluted and fail to mimic realistic search difficulties.

The authors saw an opportunity: synthesize hard questions at scale, queries that demand genuine exploration and multi-hop reasoning, much like those in the toughest human-curated benchmarks.


The WEBEXPLORER Framework: Explore and Evolve

The solution is a two-step process that mirrors how a curious human researcher might operate: first explore a topic in depth, then create a question that forces others to follow a similarly challenging path.

This diagram shows the two-stage WEBEXPLORER framework. On the left, ‘Model-Based Exploration’ shows an AI exploring a topic like ‘David Hackett Souter’ through a search-and-browse loop. On the right, ‘Iterative Query Evolution’ shows how an initial query (Q0) is made progressively harder (Q1 to Qn) by removing clues.

Stage 1: Model-Based Exploration

Instead of building rigid knowledge graphs, WEBEXPLORER uses a large language model to perform autonomous exploration.

The process starts with a seed entity — a topic such as Brazil National Team. The model is prompted to act like a researcher, equipped with just two tools:

  1. search(query): to query a search engine.
  2. browse(url, query): to read a page and extract targeted information.

With these, the model iteratively searches and browses, diving into related topics, following threads, and assembling a rich set of interconnected facts. It decides when to stop and synthesizes an initial Question-Answer (QA) pair from this information space.
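
To make the interface concrete, here is a minimal Python sketch of what such an exploration loop could look like. The helper names (call_llm, web_search, fetch_and_extract), the prompt wording, the JSON action format, and the step cap are illustrative assumptions, not the paper’s actual implementation.

```python
import json

MAX_STEPS = 20  # assumed cap on exploration depth; the paper does not pin this down


def explore(seed_entity, call_llm, web_search, fetch_and_extract):
    """Explore outward from a seed entity with search/browse tools, then
    synthesize an initial multi-hop QA pair from the gathered facts.

    call_llm, web_search, and fetch_and_extract are injected placeholders for
    whatever model, search engine, and page-reading backend are available."""
    notes = []  # facts accumulated during exploration
    for _ in range(MAX_STEPS):
        # Ask the model for its next action given everything found so far.
        prompt = (
            "You are researching the topic below by calling tools. Reply with JSON: "
            '{"tool": "search" | "browse" | "stop", "query": "...", "url": "..."}\n'
            f"Seed entity: {seed_entity}\n"
            "Facts gathered so far:\n" + "\n".join(notes)
        )
        action = json.loads(call_llm(prompt))
        if action["tool"] == "search":
            notes.append(f"search({action['query']}) -> {web_search(action['query'])}")
        elif action["tool"] == "browse":
            extracted = fetch_and_extract(action["url"], action["query"])
            notes.append(f"browse({action['url']}) -> {extracted}")
        else:  # the model decides the information space is rich enough
            break
    # Synthesize an initial QA pair that spans several of the collected facts.
    qa = call_llm(
        "Write one question whose answer requires combining several of the facts "
        'below, plus its short answer, as JSON {"question": "...", "answer": "..."}:\n'
        + "\n".join(notes)
    )
    return json.loads(qa)
```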

This flowchart illustrates the model-based exploration process. Starting with ‘Brazil National Team’, the agent uses search (S) and browse (B) actions to connect disparate facts about the 1950 World Cup, a referee named George Reader, Southampton’s 1976 FA Cup win, the goalscorer Bobby Stokes, and Queen Elizabeth II, ultimately synthesizing a complex query.

Example: From the seed Brazil National Team, the agent connects:

  • The 1950 World Cup final and its record-setting attendance.
  • George Reader, the referee of that match.
  • Reader’s later role as Southampton FC chairman.
  • Southampton’s 1976 FA Cup victory.
  • Bobby Stokes, the goalscorer, and his birthplace.

The result: a QA pair that spans multiple sources and requires genuine multi-step reasoning.


Stage 2: Iterative Query Evolution (Long-to-Short)

The initial QA pairs from Stage 1 already require multi-website navigation, but they still proved too easy for strong proprietary models: Claude-4-Sonnet scored 86.6% accuracy on them.

The issue? Too many explicit clues. Dates, names, and direct references act as “open-book” hints, making it easy for models (or humans) to shortcut the problem.

The authors realized that the hardest benchmarks avoid such clues entirely, opting for descriptions that are indirect yet still uniquely identifying. Stage 2 therefore focuses on making questions harder by removing clues rather than adding them.

The query is evolved over several steps, guided by three rules:

  1. Remove salient information — delete obvious indicators.
  2. Introduce obfuscation — replace names and dates with indirect descriptors.
  3. Use alternative descriptions — rephrase explicit references.

Example transformation:

Initial Query:

A football match took place in a stadium where the official attendance set a record that still stands today for FIFA World Cup matches. The referee of this match was the oldest person to ever officiate a World Cup final, and exactly 26 years after this match, he was the chairman of a club that defeated Manchester United in an FA Cup final. The player who scored the winning goal was born in an area that became part of its current city in 1920, and this player died at the age of 44. In what minute of the FA Cup final was the winning goal scored?
Answer: 83rd minute

Evolved Query:

In the unique FIFA World Cup tournament format that concluded without a knockout final, a match official later guided a Second Division club to victory over a First Division giant in the monarch’s final attendance at such an occasion. The match-winner had been rejected by the club he supported as a child, hailing from a district that joined a centuries-old Royal Naval stronghold two decades into the 20th century. In which minute did this decisive strike occur?
Answer: 83rd minute

By stripping away direct clues and introducing narrative indirection, the evolved query forces deeper investigation.
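
As a rough illustration, this evolution stage could be wired up as a loop that repeatedly asks a strong LLM to rewrite the query under the three rules and keeps a rewrite only if it still resolves to the same answer. Everything here (the prompt wording, the is_still_answerable check, the number of rounds) is an assumption made for the sketch, not the authors’ exact procedure.

```python
EVOLUTION_PROMPT = """Rewrite the question so it is harder to answer via web search:
1. Remove salient information (explicit dates, names, direct references).
2. Introduce obfuscation (replace remaining identifiers with indirect descriptors).
3. Use alternative descriptions (rephrase any explicit reference).
The rewritten question must keep exactly the same answer.

Question: {question}
Answer: {answer}
Rewritten question:"""


def evolve(question, answer, call_llm, is_still_answerable, n_rounds=3):
    """Iteratively evolve a query from long-and-explicit to short-and-vague.

    call_llm produces the rewrite; is_still_answerable is a placeholder check
    (e.g. running a strong agent) that the evolved query still resolves to the
    original answer. n_rounds is an illustrative choice, not the paper's setting."""
    for _ in range(n_rounds):
        candidate = call_llm(EVOLUTION_PROMPT.format(question=question, answer=answer))
        if is_still_answerable(candidate, answer):
            question = candidate  # accept the harder version and keep evolving it
        # otherwise discard the rewrite and retry from the current version
    return question, answer
```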


The Final Dataset: WEBEXPLORER-QA

The authors applied this two-stage pipeline to produce WEBEXPLORER-QA, a dataset of ~40,000 evolved QA pairs. Seed entities came from diverse Wikipedia pages to maintain variety.

To measure difficulty, the authors evaluated Claude-4-Sonnet on their own QA pairs and on several existing synthetic datasets:

Dataset       Avg. Turns   Accuracy (%)
Initial QA        7.9          86.6
Evolved QA        9.9          67.1
WebDancer         5.4          62.0
SailorFog         8.2          35.0
WebShaper         8.4          67.4
ASearcher         6.5          62.0

This table compares the difficulty of different web navigation datasets. The ‘Evolved QA’ from WEBEXPLORER has the highest average number of tool-calling turns (9.9) and a significantly lower accuracy (67.1%) compared to the ‘Initial QA’ (86.6%), indicating a successful increase in difficulty.

The results show that the evolution process succeeded in raising difficulty: more tool calls, lower accuracy, and richer reasoning requirements.

These two bar charts compare the distribution of tool-calling turns. The left chart shows that ‘Evolved QA’ has fewer easily solved questions (0–4 turns) than ‘Initial QA’. The right chart shows that while ‘BrowseComp-en’ is still harder, ‘Evolved QA’ provides a challenging but solvable dataset for training.


Training WEBEXPLORER-8B

With the dataset in place, the authors trained WEBEXPLORER-8B in two phases:

  1. Supervised Fine-Tuning (SFT)
    Used high-quality trajectories from a commercial model to teach basic mechanics: step-by-step reasoning, tool call formatting, and long-horizon thinking.

  2. Reinforcement Learning (RL)
    After SFT, RL allowed the model to experiment with its own strategies. Rewards combined correctness and formatting:

    \[ R = 0.2 \cdot R_{\text{format}} + R_{\text{correct}} \]

    The reward function used for reinforcement learning. It is a weighted sum of a format reward and a correctness reward.

    A key RL innovation was progressive context expansion:

    • Start: 64k-token context, up to 50 turns.
    • Mid: 96k tokens, up to 75 turns.
    • End: 128k tokens, up to 100 turns.

    This curriculum enabled the model to handle truly long-horizon reasoning; a rough sketch of the reward and schedule follows this list.
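
As a minimal sketch, the composite reward and the progressive context schedule could be expressed as follows. The 0.2 format weight and the three context/turn limits come from the description above; the helper arguments and the step thresholds for advancing between stages are assumptions made for illustration.

```python
from dataclasses import dataclass


def trajectory_reward(is_correct, is_well_formatted):
    """R = 0.2 * R_format + R_correct, treating both components as 0/1 signals.
    How correctness and formatting are judged is left to the caller."""
    return 0.2 * float(is_well_formatted) + float(is_correct)


@dataclass
class ContextStage:
    max_tokens: int
    max_turns: int


# Progressive context expansion. The token/turn limits follow the description
# above; switching stages by RL step count, and the thresholds themselves,
# are assumptions made for this sketch.
SCHEDULE = [
    (0, ContextStage(max_tokens=64_000, max_turns=50)),
    (100, ContextStage(max_tokens=96_000, max_turns=75)),
    (200, ContextStage(max_tokens=128_000, max_turns=100)),
]


def stage_for_step(rl_step):
    """Return the context/turn limits in force at a given RL training step."""
    current = SCHEDULE[0][1]
    for start_step, stage in SCHEDULE:
        if rl_step >= start_step:
            current = stage
    return current
```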


Results: An 8B Model That Punches Above Its Weight

WEBEXPLORER-8B delivered remarkable results:

Model                      BC-en   BC-zh   GAIA   WebWalkerQA   FRAMES   Xbench-DS   HLE
WebSailor-72B               12.0    30.1   55.4       -            -        55.0       -
WebThinker-32B               2.8     -     48.5      46.5          -          -       15.8
MiroThinker-8B-DPO-v0.1      8.7    13.6   46.6      45.7         64.4        -        -
WebExplorer-8B (SFT)         7.9    21.3   43.7      59.8         72.6      47.5      16.0
WebExplorer-8B (RL)         15.7    32.0   50.0      62.7         75.7      53.7      17.3

This table shows the performance of various web agents on several benchmarks. WEBEXPLORER-8B (RL) achieves the best performance among open-source models under 100B on BrowseComp-en, BrowseComp-zh, WebWalkerQA, and FRAMES, significantly outperforming much larger models like WebSailor-72B.


Key takeaways:

  • SOTA at Its Scale: Best open-source performance under 100B parameters on BrowseComp-en/zh, WebWalkerQA, and FRAMES.
  • Parameter Efficiency: Beats WebSailor-72B on BrowseComp-en with 15.7% vs 12.0%.
  • Generalization: Scores 17.3% on HLE STEM benchmark, surpassing previous 32B models despite non-STEM training data.

These bar charts compare WEBEXPLORER-8B’s performance on three benchmarks. It achieves accuracies of 15.7% on BrowseComp-en, 32.0% on BrowseComp-zh, and 17.3% on HLE, demonstrating strong and generalizable performance.


RL Training Dynamics

The RL training logs tell a clear story:

These three line plots show the training dynamics during reinforcement learning. The average number of tool calls (left) and trajectory length (middle) steadily increase, which correlates with a consistent improvement in accuracy on the BrowseComp-en and BrowseComp-zh benchmarks (right).

  • Tool Calls: Increased from ~11 to over 16 per trajectory.
  • Trajectory Length: Grew to 40k+ tokens.
  • Accuracy: Steadily improved on both BrowseComp-en and BrowseComp-zh.

This demonstrates the emergence of deeper, more complex reasoning chains during training.


Conclusion and Implications

The WEBEXPLORER framework shows that the path to superhuman web agents lies in better training data, not just bigger models.

By “exploring” a topic comprehensively and then “evolving” the queries to remove easy clues, the authors synthesized realistic, challenging datasets at scale. Training on these — with an SFT-to-RL pipeline — created an 8B model that outperforms much larger systems and generalizes impressively.

This work offers a clear recipe for the next generation of open-source web agents:
Autonomous synthesis of hard queries + progressive RL training = powerful, long-horizon problem solvers ready to tackle the complexities of real-world AI assistance.