AI is getting incredibly good at research. Systems like OpenAI’s DeepResearch and Google’s Gemini can now tackle complex questions by searching the web, reading documents, and synthesizing information over multiple steps. These Deep Research agents are pushing the boundaries of what AI can do. But they have a huge blind spot: they are almost entirely text-based.

In the real world—and especially on the web—information isn’t just text. It’s in charts, diagrams, product images, screenshots, and infographics. An agent that can’t see is missing half the story. The next great frontier for AI agents is combining vision and language to perform truly comprehensive research.

Tackling this challenge is much harder than it sounds. A multimodal agent needs more than just eyes; it requires sophisticated reasoning to connect what it sees with what it reads. It needs to master a wider array of tools for perception, knowledge retrieval, computation, and cross-modal inference. Existing approaches have often been limited to simple visual tasks or rigid, template-driven workflows.

A new paper from Alibaba Group introduces WebWatcher, a vision-language agent designed to overcome these limitations. WebWatcher learns to perform deep research by leveraging high-quality, automatically generated training data, a versatile set of tools, and a two-stage training process mixing supervised learning and reinforcement learning.

To benchmark its capabilities, the authors also created BrowseComp-VL—an incredibly challenging set of multimodal research tasks. And the results? WebWatcher doesn’t just edge past the competition; it decisively outperforms proprietary models like GPT-4o and Gemini-2.5-flash on several difficult multimodal evaluations.


The Blind Spot of Modern AI Agents

To understand WebWatcher’s impact, we first need to look at why current agents often fail at vision-language reasoning. Many fall into one of two traps:

  • Vision specialists that can see but not reason deeply.
  • Text specialists that can reason but not see properly.

The researchers illustrate this perfectly using a tough case from the GAIA benchmark: the agent must find the number of “visual edit” tags on a specific animal’s Wikipedia page (from before 2020), starting only with a picture of the animal.

A diagram comparing three types of agents—a VL Agent, WebWatcher, and a Search Agent—attempting to answer a question based on an image of a puffin. WebWatcher is the only one to succeed by using a multi-step, multi-tool reasoning process.

Figure 1: A comparison of agent approaches to a complex vision-language task. Only WebWatcher’s integrated, multi-tool reasoning yields the correct answer. (Original paper Figure 2)

Here’s what happens:

  • Standard VL Agent — Identifies the animal incorrectly (“looks like a pelican”) and gets stuck. It over-relies on shallow visual analysis without deep reasoning or web browsing capability.
  • Text-only Search Agent — Can’t ground its search in the provided image. Guesses (“maybe a penguin or seagull”) and ends up chasing irrelevant searches, missing the answer.
  • WebWatcher — Performs genuine multi-step reasoning, combining OCR, web search, page visiting, and cross-validation in a flexible loop until it reaches the correct answer.

The lesson is clear: solving complex real-world problems requires deep reasoning across modalities and effective use of diverse tools—exactly what WebWatcher is built for.


Building a Better Training Ground: The BrowseComp-VL Benchmark

You can’t train a sophisticated agent without sophisticated data. Yet most existing Visual Question Answering (VQA) datasets focus on simple perception (“What color is the car?”) or single-step lookups. They lack the multi-hop reasoning and strategic planning needed for real deep research.

To bridge this gap, the researchers created BrowseComp-VL—a benchmark for advanced multimodal reasoning in realistic web environments.

Two donut charts showing the domain distribution for Level 1 and Level 2 of the BrowseComp-VL dataset, with example questions overlaid.

Figure 2: Domain distribution for the two difficulty levels in BrowseComp-VL. (Original paper Figure 3)

BrowseComp-VL covers five major domains—Entertainment, Humanities, Technology, Natural Science, and Other—and features two difficulty levels:

  • Level 1:
    Questions require multi-hop reasoning but reference explicit entities. Answers can be retrieved iteratively, but integrating multiple sources is still necessary.

  • Level 2:
    Entities are fuzzified or obscured. Instead of naming “James Roy Kinghorn,” the question might refer to “a prominent zoologist associated with the Captain Fortune Show.” This forces the agent to infer and synthesize information, not just retrieve it.


Generating the Data

Creating tens of thousands of such examples requires automation. The team designed a scalable pipeline:

A flowchart illustrating the data generation pipeline, moving from a knowledge graph to image search and finally to Level 1 and Level 2 VQA questions.

Figure 3: Automated pipeline for generating high-quality VQA pairs. (Original paper Figure 4)

Step 1: Generate Complex Textual QA Pairs
The system builds a knowledge graph from authoritative domains (e.g., Wikipedia), traversing hyperlinks to generate multi-hop reasoning chains. Level 2 questions are fuzzified—specific details replaced with vague descriptions.

Step 2: Convert Text QA to VQA
For each QA pair, the key entity is identified and real-world images retrieved via web image search. The question is rewritten to reference “the entity in the image” rather than naming it outright.

Step 3: Quality Control
A Selector filters out poorly rewritten or irrelevant questions/images. An Examiner tests whether the question is answerable from the given image and associated textual context. All checks are powered by GPT-4o for alignment and consistency.
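To make these three steps concrete, here is a minimal Python sketch of how such a pipeline could be wired together. The helper names (`sample_multihop_chain`, `fuzzify_entity`, `selector`, `examiner`) are hypothetical stand-ins for the paper's components, not its actual implementation; in the real pipeline each call is backed by an LLM prompt or a web API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VQASample:
    question: str   # rewritten to reference "the entity in the image"
    image_url: str  # real-world image retrieved for the key entity
    answer: str     # gold answer from the multi-hop reasoning chain
    level: int      # 1 = explicit entities, 2 = fuzzified entities

def generate_sample(kg, seed_page, level, image_search, selector, examiner) -> Optional[VQASample]:
    """Hypothetical end-to-end generation of one BrowseComp-VL sample."""
    # Step 1: traverse hyperlinks in the knowledge graph to build a multi-hop QA pair.
    chain = kg.sample_multihop_chain(seed_page, hops=3)
    question, answer = chain.to_question_answer()
    if level == 2:
        # Fuzzify supporting entities (the key entity itself becomes the image in Step 2).
        question = chain.fuzzify_entity(question)

    # Step 2: swap the key entity for a retrieved image and rewrite the question.
    image_url = image_search(chain.key_entity)
    question = question.replace(chain.key_entity, "the entity in the image")

    # Step 3: GPT-4o-powered quality control (Selector + Examiner).
    sample = VQASample(question, image_url, answer, level)
    if selector(sample) and examiner(sample):  # relevance and answerability checks
        return sample
    return None
```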

This pipeline produces a rich, challenging dataset ideally suited for training multimodal research agents.


Training a Multimodal Research Agent

With the dataset ready, the focus turns to teaching the agent to reason and act.

The Agent’s Toolkit

WebWatcher comes equipped with five tools:

  1. Web Image Search — retrieve images, captions, and related URLs.
  2. Web Text Search — open-domain text retrieval from the web.
  3. Visit — navigate to a URL and summarize content relevant to a goal.
  4. Code Interpreter — perform calculations, parsing, and symbolic reasoning.
  5. OCR — extract embedded text from images.
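To picture how these tools are used at inference time, here is a minimal ReAct-style loop over a hypothetical tool registry. The tool signatures and the `propose_action` interface are illustrative assumptions, not the paper's actual API.

```python
from typing import Callable, Dict, Optional

# Hypothetical tool registry mirroring the five tools above (stubs for illustration).
TOOLS: Dict[str, Callable[[str], str]] = {
    "image_search": lambda query: "images, captions, related URLs for: " + query,
    "text_search":  lambda query: "web text results for: " + query,
    "visit":        lambda url:   "goal-focused summary of page at: " + url,
    "code":         lambda prog:  "output of running: " + prog,
    "ocr":          lambda img:   "text extracted from image: " + img,
}

def react_loop(agent, image_url: str, question: str, max_steps: int = 10) -> Optional[str]:
    """Minimal ReAct-style loop: think, call a tool, observe, repeat (sketch)."""
    context = [("image", image_url), ("question", question)]
    for _ in range(max_steps):
        # `propose_action` is a hypothetical interface: the model returns a thought,
        # a tool name (or "final_answer"), and the argument to pass to it.
        thought, tool, arg = agent.propose_action(context)
        if tool == "final_answer":
            return arg
        observation = TOOLS[tool](arg)
        context.append((thought, tool, arg, observation))
    return None
```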

Stage 1: Learning from Expert-like Trajectories

The researchers use GPT-4o to automatically create “trajectories”—scripted sequences of thought, tool use, and observations—following the ReAct framework:

\[ \tau = \{ (t_0, o_0), (t_1, o_1), \dots, (t_L, o_L) \} \]

Here \(t_i\) is a tool/action at step \(i\), and \(o_i\) is its output.

Only trajectories that:

  • End with the correct answer
  • Are logically consistent step-by-step
  • Use at least three tool calls

are kept. This ensures learning from realistic, non-trivial reasoning.
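As a concrete sketch, a ReAct trajectory and this filtering step could be represented as follows. The data structure and the consistency check are illustrative simplifications, not the authors' actual format; in particular, the paper does not spell out how logical consistency is judged, so the check below is only a crude proxy.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    action: str       # t_i: the model's thought plus the tool call it issues
    observation: str  # o_i: the tool's output fed back to the model

@dataclass
class Trajectory:
    question: str
    steps: List[Step]
    final_answer: str

def keep_trajectory(traj: Trajectory, gold_answer: str) -> bool:
    """Apply the three filters above (illustrative; consistency is a crude proxy)."""
    correct = traj.final_answer.strip().lower() == gold_answer.strip().lower()
    consistent = all(s.action and s.observation for s in traj.steps)  # every step completed
    enough_tools = len(traj.steps) >= 3                               # at least three tool calls
    return correct and consistent and enough_tools
```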


Stage 2: Two-Phase Training — SFT Cold Start + RL Refinement

Phase 1: Supervised Fine-Tuning (SFT)
Filtered trajectories are used to fine-tune WebWatcher to predict the correct next action:

\[ \max_{\theta} \sum_{i=1}^{K} \sum_{l=1}^{L_{i}} \log P_{\theta} \left( t_{l}^{(i)} \mid I^{(i)}, q^{(i)}, t_{<l}^{(i)}, o_{<l}^{(i)} \right) \]

Here \(I^{(i)}\) and \(q^{(i)}\) are the image and question for trajectory \(i\), and \(t_{<l}^{(i)}, o_{<l}^{(i)}\) are the actions and observations from earlier steps. This “cold start” stage teaches tool syntax, sequencing, and multi-step reasoning before autonomous learning kicks in.
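Concretely, this objective amounts to teacher-forced next-token prediction over the serialized trajectory. Below is a minimal PyTorch-style sketch assuming a Hugging Face-style causal (vision-)language model, with the image folded into the token sequence for simplicity and a precomputed mask marking which tokens belong to actions; the masking convention is an assumption, since the paper does not give these details.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids: torch.Tensor, action_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on action tokens only; image, question, and observations are context.

    input_ids:   (B, T) tokens of the serialized (image, question, t_1, o_1, ...) trajectory
    action_mask: (B, T) 1 where a token belongs to some action t_l, 0 elsewhere
    """
    logits = model(input_ids).logits            # (B, T, V), HF-style causal LM output
    shift_logits = logits[:, :-1, :]            # position i predicts token i + 1
    shift_labels = input_ids[:, 1:]
    shift_mask = action_mask[:, 1:].float()
    token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)
    return (token_loss * shift_mask).sum() / shift_mask.sum().clamp(min=1)
```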

Phase 2: Reinforcement Learning with GRPO
Using Group-Relative Policy Optimization, the agent generates \(K\) trajectories per query, scores them, and updates its policy based on their relative success:

\[ A_{\rm rel}(\tau^{(i)}) = R^{(i)} - \frac{1}{K} \sum_{j=1}^{K} R^{(j)} \]

Total reward combines formatting correctness (\(r_f\)) and answer accuracy (\(r_a\)):

\[ R = w r_f + (1 - w) r_a \]

GRPO encourages better overall reasoning paths, not just lucky steps.
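A minimal sketch of the reward and group-relative advantage computation for one query's \(K\) rollouts, matching the two formulas above (the weight \(w\) and the per-rollout reward values are placeholders, not numbers from the paper):

```python
import torch

def grpo_advantages(format_scores: torch.Tensor,
                    answer_scores: torch.Tensor,
                    w: float = 0.5) -> torch.Tensor:
    """Group-relative advantages for K rollouts of a single query (illustrative).

    format_scores: (K,) formatting-correctness rewards r_f
    answer_scores: (K,) answer-accuracy rewards r_a
    """
    rewards = w * format_scores + (1.0 - w) * answer_scores  # R = w*r_f + (1 - w)*r_a
    return rewards - rewards.mean()                          # A_rel = R - mean_j(R^(j))
```

These advantages then weight a clipped policy-gradient update on each rollout's action tokens, as is typical for GRPO implementations.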


Results: A New State-of-the-Art

WebWatcher was tested against direct inference, RAG workflows, and other agents, including GPT-4o, Gemini-2.5-flash, Claude-3.7, OmniSearch, and Qwen2.5-VL variants.

Bar charts comparing WebWatcher-32B’s performance against GPT-4, Gemini 2.5-flash, Qwen2.5-VL-7B, and Claude-3.7 across four benchmarks: HLE-VL, BrowseComp-VL, LiveVQA, and MMSearch. WebWatcher consistently leads.

Figure 4: Overall performance of WebWatcher compared to other models across four reasoning benchmarks. (Original paper Figure 1)

On Humanity’s Last Exam (HLE)—a highly demanding academic benchmark—direct inference models barely crack 10%. RAG helps, but the best results come from agentic multi-step reasoning. WebWatcher-32B ranks highest.

Table of results on the HLE benchmark, showing WebWatcher-32B achieving the highest average score (13.6) compared to various models using direct inference, RAG, and other agentic approaches.

Table 1: HLE results. WebWatcher leads among agents, surpassing strong GPT-4o baselines. (Original paper Table 1)

The pattern repeats across other benchmarks:

Table of results on BrowseComp-VL, LiveVQA, MMSearch, and SimpleVQA. WebWatcher-32B consistently achieves the highest scores among all agents.

Table 2: Results on BrowseComp-VL, LiveVQA, MMSearch, and SimpleVQA. WebWatcher consistently tops the charts. (Original paper Table 2)

On BrowseComp-VL, most baselines score below 20%, showing how challenging it is. WebWatcher’s dynamic tool loop shines here. On LiveVQA (up-to-date visual knowledge) and MMSearch (multimodal search), it again leads decisively.


Why WebWatcher Succeeds

Flexible Tool Use

Six bar charts showing the percentage of tool calls (Text search, Image search, Code, Visit) across five benchmarks and overall.

Figure 5: Tool usage patterns vary by benchmark, showing adaptive strategy. (Original paper Figure 5)

On HLE, it uses a balanced mix of text search, image search, and code. On BrowseComp-VL, text search dominates (62%). For visual-heavy tasks like LiveVQA/MMSearch, image search takes priority. WebWatcher adapts its tool selection to match the challenge.


Cold Start Is Critical

Three line charts comparing RL training with and without SFT cold start. Cold start models consistently outperform.

Figure 6: RL training with and without SFT cold start. Without SFT, learning stalls. (Original paper Figure 6)

Models starting without SFT (“Instruct”) barely improve under RL, often making tool formatting errors. Cold-start models begin stronger and improve steadily—especially on LiveVQA—proving SFT is vital.


Scaling with Pass@k

A line chart showing Pass@k performance of WebWatcher on HLE: ~13% at k=1 to over 40% at k=32.

Figure 7: Pass@k scaling on HLE. Performance multiplies with more diverse attempts. (Original paper Figure 7)

Even with one attempt (\(k=1\)), WebWatcher beats most baselines. With multiple diverse tries, performance jumps—35.7% at \(k=16\), 41.9% at \(k=32\). Because attempts are decorrelated, each adds meaningful coverage.
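For reference, pass@k here is the fraction of questions answered correctly by at least one of \(k\) sampled attempts; a minimal helper for computing it from recorded rollouts might look like this (an illustrative sketch, not the paper's evaluation code):

```python
from typing import List

def pass_at_k(correct_flags: List[List[bool]], k: int) -> float:
    """Fraction of questions where at least one of the first k attempts is correct.

    correct_flags: one inner list per question, each entry marking whether
    the corresponding sampled attempt produced the right answer.
    """
    solved = sum(any(flags[:k]) for flags in correct_flags)
    return solved / len(correct_flags)
```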


Conclusion: Seeing, Reading, Reasoning—At Research Level

WebWatcher is a milestone for multimodal AI agents:

  1. The Challenge is Real — Combining vision and language for deep research is more than adding visual tools; it demands integration and reasoning.
  2. Data Is King — The BrowseComp-VL benchmark and its scalable, quality-controlled generation pipeline make robust training and evaluation possible.
  3. Training Matters — The SFT cold start plus GRPO reinforcement learning yields flexible, multi-tool reasoning skills.

By equipping agents to see, read, and reason, WebWatcher offers a blueprint for systems that can tackle complex, real-world problems autonomously—leveraging the full multimodal richness of knowledge on the web. It’s a powerful glimpse into the future of AI research.