AI is getting incredibly good at research. Systems like OpenAI’s DeepResearch and Google’s Gemini can now tackle complex questions by searching the web, reading documents, and synthesizing information over multiple steps. These Deep Research agents are pushing the boundaries of what AI can do. But they have a huge blind spot: they are almost entirely text-based.
In the real world—and especially on the web—information isn’t just text. It’s in charts, diagrams, product images, screenshots, and infographics. An agent that can’t see is missing half the story. The next great frontier for AI agents is combining vision and language to perform truly comprehensive research.
Tackling this challenge is much harder than it sounds. A multimodal agent needs more than just eyes; it requires sophisticated reasoning to connect what it sees with what it reads. It needs to master a wider array of tools for perception, knowledge retrieval, computation, and cross-modal inference. Existing approaches have often been limited to simple visual tasks or rigid, template-driven workflows.
A new paper from Alibaba Group introduces WebWatcher, a vision-language agent designed to overcome these limitations. WebWatcher learns to perform deep research by leveraging high-quality, automatically generated training data, a versatile set of tools, and a two-stage training process that combines a supervised fine-tuning cold start with reinforcement learning.
To benchmark its capabilities, the authors also created BrowseComp-VL—an incredibly challenging set of multimodal research tasks. And the results? WebWatcher doesn’t just edge past the competition; it decisively outperforms proprietary models like GPT-4o and Gemini-2.5-flash on several difficult multimodal evaluations.
The Blind Spot of Modern AI Agents
To understand WebWatcher’s impact, we first need to look at why current agents often fail at vision-language reasoning. Many fall into one of two traps:
- Vision specialists that can see but not reason deeply.
- Text specialists that can reason but not see properly.
The researchers illustrate this perfectly using a tough case from the GAIA benchmark: the agent must find the number of “visual edit” tags on a specific animal’s Wikipedia page (from before 2020), starting only with a picture of the animal.
Figure 1: A comparison of agent approaches to a complex vision-language task. Only WebWatcher’s integrated, multi-tool reasoning yields the correct answer. (Original paper Figure 2)
Here’s what happens:
- Standard VL Agent — Identifies the animal incorrectly (“looks like a pelican”) and gets stuck. It over-relies on shallow visual analysis without deep reasoning or web browsing capability.
- Text-only Search Agent — Can’t ground its search in the provided image. Guesses (“maybe a penguin or seagull”) and ends up chasing irrelevant searches, missing the answer.
- WebWatcher — Performs genuine multi-step reasoning, combining OCR, web search, page visiting, and cross-validation in a flexible loop until it reaches the correct answer.
The lesson is clear: solving complex real-world problems requires deep reasoning across modalities and effective use of diverse tools—exactly what WebWatcher is built for.
Building a Better Training Ground: The BrowseComp-VL Benchmark
You can’t train a sophisticated agent without sophisticated data. Yet most existing Visual Question Answering (VQA) datasets focus on simple perception (“What color is the car?”) or single-step lookups. They lack the multi-hop reasoning and strategic planning needed for real deep research.
To bridge this gap, the researchers created BrowseComp-VL—a benchmark for advanced multimodal reasoning in realistic web environments.
Figure 2: Domain distribution for the two difficulty levels in BrowseComp-VL. (Original paper Figure 3)
BrowseComp-VL covers five major domains—Entertainment, Humanities, Technology, Natural Science, and Other—and features two difficulty levels:
- Level 1: Questions require multi-hop reasoning but reference explicit entities. Answers can be retrieved iteratively, but integrating multiple sources is still necessary.
- Level 2: Entities are fuzzified or obscured. Instead of naming “James Roy Kinghorn,” the question might refer to “a prominent zoologist associated with the Captain Fortune Show.” This forces the agent to infer and synthesize information, not just retrieve it.
Generating the Data
Creating tens of thousands of such examples requires automation. The team designed a scalable pipeline:
Figure 3: Automated pipeline for generating high-quality VQA pairs. (Original paper Figure 4)
Step 1: Generate Complex Textual QA Pairs
The system builds a knowledge graph from authoritative domains (e.g., Wikipedia), traversing hyperlinks to generate multi-hop reasoning chains. Level 2 questions are fuzzified—specific details replaced with vague descriptions.
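The paper does not publish its pipeline code, but the idea can be sketched roughly as follows. This is a minimal illustration only: the `fetch_links` and `llm` helpers, the traversal depth, and the prompts are all placeholder assumptions, not the authors' implementation.

```python
import random
import networkx as nx

def build_knowledge_graph(seed_pages, fetch_links):
    """Build a small entity graph by following hyperlinks from seed pages."""
    graph = nx.DiGraph()
    for page in seed_pages:
        for linked_page in fetch_links(page):   # e.g. hyperlinks on a Wikipedia article
            graph.add_edge(page, linked_page)
    return graph

def sample_reasoning_chain(graph, hops=3):
    """Random-walk a multi-hop chain of connected entities."""
    node = random.choice(list(graph.nodes))
    chain = [node]
    for _ in range(hops):
        neighbors = list(graph.successors(node))
        if not neighbors:
            break
        node = random.choice(neighbors)
        chain.append(node)
    return chain

def make_question(chain, llm, fuzzify=False):
    """Ask an LLM to compose a question that requires the whole chain to answer.
    For Level 2, a second pass replaces named entities with vague descriptions."""
    question = llm(f"Write a question that requires reasoning over this chain of entities: {chain}")
    if fuzzify:
        question = llm(f"Rewrite this question, replacing named entities with indirect descriptions: {question}")
    return question
```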
Step 2: Convert Text QA to VQA
For each QA pair, the key entity is identified and real-world images are retrieved via web image search. The question is then rewritten to reference “the entity in the image” rather than naming it outright.
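A rough sketch of that rewrite step, again with hypothetical `llm` and `image_search` helpers standing in for whatever the authors actually use:

```python
def to_vqa(question, answer, llm, image_search):
    """Turn a textual QA pair into a visual QA pair anchored on its key entity."""
    # Identify the entity the question hinges on (illustrative prompt, not the paper's).
    entity = llm(f"Name the single key entity in this question: {question}")

    # Retrieve candidate real-world images of that entity from the web.
    images = image_search(entity, top_k=5)

    # Rewrite the question so it refers to "the entity in the image" instead of naming it.
    visual_question = llm(
        f"Rewrite so the entity is only referred to as 'the entity in the image': {question}"
    )
    return {"image_candidates": images, "question": visual_question, "answer": answer}
```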
Step 3: Quality Control
A Selector filters out poorly rewritten or irrelevant questions/images. An Examiner tests whether the question is answerable from the given image and associated textual context. All checks are powered by GPT-4o for alignment and consistency.
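The Selector/Examiner stage is essentially LLM-as-judge filtering. A minimal sketch, assuming a hypothetical `gpt4o` callable that returns a yes/no verdict; the actual prompts and thresholds are not specified here:

```python
def passes_quality_control(sample, gpt4o):
    """Keep a VQA sample only if both judge checks pass."""
    # Selector: is the rewritten question well-formed and consistent with the image?
    selector_ok = gpt4o(
        f"Question: {sample['question']}\nImage description: {sample['image_caption']}\n"
        "Is the question well-formed and consistent with the image? Answer yes or no."
    ).strip().lower().startswith("yes")

    # Examiner: can the question be answered from the image plus retrievable context?
    examiner_ok = gpt4o(
        f"Question: {sample['question']}\nAnswer: {sample['answer']}\n"
        "Could an agent reach this answer starting from the image and web search? Answer yes or no."
    ).strip().lower().startswith("yes")

    return selector_ok and examiner_ok
```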
This pipeline produces a rich, challenging dataset ideally suited for training multimodal research agents.
Training a Multimodal Research Agent
With the dataset ready, the focus turns to teaching the agent to reason and act.
The Agent’s Toolkit
WebWatcher comes equipped with five tools:
- Web Image Search — retrieve images, captions, and related URLs.
- Web Text Search — open-domain text retrieval from the web.
- Visit — navigate to a URL and summarize content relevant to a goal.
- Code Interpreter — perform calculations, parsing, and symbolic reasoning.
- OCR — extract embedded text from images.
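The paper does not expose a concrete API in this summary, but conceptually the toolkit amounts to a set of named, schema-described functions the model can invoke. A minimal sketch, with illustrative names and fields rather than the authors' interface:

```python
TOOLS = {
    "web_image_search": {
        "description": "Search the web for images; returns images, captions, and source URLs.",
        "args": {"query": "str"},
    },
    "web_text_search": {
        "description": "Open-domain text retrieval from the web.",
        "args": {"query": "str"},
    },
    "visit": {
        "description": "Open a URL and summarize content relevant to a stated goal.",
        "args": {"url": "str", "goal": "str"},
    },
    "code_interpreter": {
        "description": "Run code for calculations, parsing, and symbolic reasoning.",
        "args": {"code": "str"},
    },
    "ocr": {
        "description": "Extract embedded text from an image.",
        "args": {"image": "bytes"},
    },
}
```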
Stage 1: Learning from Expert-like Trajectories
The researchers use GPT-4o to automatically create “trajectories”—scripted sequences of thought, tool use, and observations—following the ReAct framework:
\[ \tau = \{ (t_0, o_0), (t_1, o_1), \dots, (t_L, o_L) \} \]

Here \(t_i\) is the tool call or action taken at step \(i\), and \(o_i\) is its output.
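In code, a single trajectory might be stored roughly like this (the field names are my own, not the paper's):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    thought: str        # the model's intermediate reasoning (the ReAct "thought")
    tool: str           # which tool was called, e.g. "web_text_search"
    tool_args: dict     # arguments passed to the tool
    observation: str    # the tool's output o_i

@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)
    final_answer: str = ""
```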
Only trajectories that meet all of the following criteria are kept:
- End with the correct answer
- Are logically consistent step-by-step
- Use at least three tool calls
This filtering ensures the agent learns from realistic, non-trivial reasoning.
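Given trajectories stored in the form sketched above, the filter itself is straightforward; here is a sketch, with a hypothetical `is_consistent` judge standing in for the paper's step-by-step consistency check:

```python
def keep_trajectory(traj: Trajectory, gold_answer: str, is_consistent) -> bool:
    """Retain only correct, coherent, non-trivial trajectories for supervised fine-tuning."""
    correct = traj.final_answer.strip().lower() == gold_answer.strip().lower()
    enough_tools = len(traj.steps) >= 3          # at least three tool calls
    coherent = is_consistent(traj)               # step-by-step logical consistency
    return correct and enough_tools and coherent
```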
Stage 2: Two-Phase Training — SFT Cold Start + RL Refinement
Phase 1: Supervised Fine-Tuning (SFT)
The filtered trajectories are used to fine-tune WebWatcher to predict the correct next action at each step, conditioned on the query and the interaction history so far.
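The exact loss is not reproduced in this summary; a standard formulation simply maximizes the likelihood of each expert action given the query and the preceding steps, which for a trajectory \(\tau\) amounts to the usual cross-entropy:

\[ \mathcal{L}_{\text{SFT}} = - \sum_{i=0}^{L} \log \pi_\theta\big(t_i \mid q,\, t_{<i},\, o_{<i}\big) \]

where \(q\) is the (multimodal) query and \(\pi_\theta\) is the agent's policy.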
Phase 2: Reinforcement Learning with GRPO
Using Group-Relative Policy Optimization (GRPO), the agent generates \(K\) trajectories per query, scores each one, and updates its policy based on how well each trajectory performs relative to the rest of its group.
Total reward combines formatting correctness (\(r_f\)) and answer accuracy (\(r_a\)):
\[ R = w r_f + (1 - w) r_a \]

GRPO encourages better overall reasoning paths, not just lucky steps.
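WebWatcher's precise formulation is not reproduced here, but in standard GRPO each trajectory's reward is normalized against its group, so the update favors trajectories that beat the group average:

\[ A_i = \frac{R_i - \operatorname{mean}(R_1, \dots, R_K)}{\operatorname{std}(R_1, \dots, R_K)} \]

This group-relative advantage then stands in for a learned critic in a PPO-style clipped policy update.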
Results: A New State-of-the-Art
WebWatcher was tested against direct inference, RAG workflows, and other agents, including GPT-4o, Gemini-2.5-flash, Claude-3.7, OmniSearch, and Qwen2.5-VL variants.
Figure 4: Overall performance of WebWatcher compared to other models across four reasoning benchmarks. (Original paper Figure 1)
On Humanity’s Last Exam (HLE)—a highly demanding academic benchmark—direct inference models barely crack 10%. RAG helps, but the best results come from agentic multi-step reasoning. WebWatcher-32B ranks highest.
Table 1: HLE results. WebWatcher leads among agents, surpassing strong GPT-4o baselines. (Original paper Table 1)
The pattern repeats across other benchmarks:
Table 2: Results on BrowseComp-VL, LiveVQA, MMSearch, and SimpleVQA. WebWatcher consistently tops the charts. (Original paper Table 2)
On BrowseComp-VL, most baselines score below 20%, showing how challenging it is. WebWatcher’s dynamic tool loop shines here. On LiveVQA (up-to-date visual knowledge) and MMSearch (multimodal search), it again leads decisively.
Why WebWatcher Succeeds
Flexible Tool Use
Figure 5: Tool usage patterns vary by benchmark, showing adaptive strategy. (Original paper Figure 5)
On HLE, it uses a balanced mix of text search, image search, and code. On BrowseComp-VL, text search dominates (62%). For visual-heavy tasks like LiveVQA/MMSearch, image search takes priority. WebWatcher adapts its tool selection to match the challenge.
Cold Start Is Critical
Figure 6: RL training with and without SFT cold start. Without SFT, learning stalls. (Original paper Figure 6)
Models starting without SFT (“Instruct”) barely improve under RL, often making tool-formatting errors. Cold-start models begin stronger and improve steadily, especially on LiveVQA, showing that the SFT cold start is essential.
Scaling with Pass@k
Figure 7: Pass@k scaling on HLE. Performance multiplies with more diverse attempts. (Original paper Figure 7)
Even with one attempt (\(k=1\)), WebWatcher beats most baselines. With multiple diverse tries, performance jumps—35.7% at \(k=16\), 41.9% at \(k=32\). Because attempts are decorrelated, each adds meaningful coverage.
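As a reference point, pass@k is commonly estimated with the unbiased combinatorial estimator: draw \(n\) attempts, count the \(c\) correct ones, and compute the probability that a random subset of \(k\) attempts contains at least one success (whether the paper uses this exact estimator or reports raw sampling is not shown here):

\[ \text{pass@}k = \mathbb{E}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \]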
Conclusion: Seeing, Reading, Reasoning—At Research Level
WebWatcher is a milestone for multimodal AI agents:
- The Challenge is Real — Combining vision and language for deep research is more than adding visual tools; it demands integration and reasoning.
- Data Is King — The BrowseComp-VL benchmark and its scalable, quality-controlled generation pipeline make robust training and evaluation possible.
- Training Matters — The SFT cold start plus GRPO reinforcement learning yields flexible, multi-tool reasoning skills.
By equipping agents to see, read, and reason, WebWatcher offers a blueprint for systems that can tackle complex, real-world problems autonomously—leveraging the full multimodal richness of knowledge on the web. It’s a powerful glimpse into the future of AI research.