Imagine an AI assistant that could use your computer just like a human. It could book your travel, create a presentation from your notes, or manage your files — all by directly interacting with the graphical user interface (GUI): clicking icons, typing in text boxes, and dragging files. This is the promise of computer use agents — autonomous AI systems with the potential to automate countless digital tasks and dramatically boost productivity.

But building these agents is hard. For an AI, a computer screen isn’t a neat list of commands — it’s a chaotic collection of pixels. Current agents often struggle with three fundamental challenges:

  1. Imprecise Grounding:
    They have difficulty mapping a command like “click the save icon” to the exact pixel coordinates of that button. Misclicks and missed selections are common.

  2. Long-Horizon Planning:
    Multi-step tasks — like switching between applications, handling pop-ups, or adapting to slight interface changes — often trip agents up. They get lost or stuck mid-task.

  3. The Generalist Bottleneck:
    Most approaches rely on a single massive “generalist” model to handle everything from big-picture planning to low-level click execution. This is like asking a CEO to also be the company’s designer and accountant — talented, maybe, but lacking the specialized skills to do every job perfectly.

A new research paper introduces Agent S2, a framework that tackles these problems head-on. Instead of relying on one all-purpose model, it treats computer use as a job for a team — combining generalist models for planning with specialist models for precise interaction. The result? State-of-the-art performance, significantly outperforming previous methods.

Figure 1: Agent S2 sets new state-of-the-art success rates on OSWorld, outperforming prior computer use agents on both 15-step and 50-step evaluations.


Background: Two Paths to Building Computer Use Agents

Before diving into Agent S2’s architecture, let’s review the dominant approaches today.

1. The Monolithic Approach

This uses a single, powerful end-to-end model (often a giant multimodal large language model). You feed it a screenshot and a user instruction, and it outputs the next action — e.g., “click coordinates (450, 120)”.
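As a rough sketch, the whole pipeline reduces to a single observation-to-action call. The model object and its generate method below are hypothetical stand-ins, not any particular vendor's API:

```python
# Minimal sketch of one monolithic agent step (hypothetical model interface).
def monolithic_step(model, screenshot: bytes, instruction: str) -> str:
    prompt = (
        "You control a computer with mouse and keyboard.\n"
        f"User instruction: {instruction}\n"
        "Given the screenshot, output the next action, "
        "e.g. click(450, 120) or type('hello')."
    )
    # One model does everything: perception, planning, and pixel grounding.
    return model.generate(prompt, image=screenshot)
```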

Pros:

  • Simple pipeline
  • Broad, general knowledge

Cons:

  • “Jack of all trades, master of none.” Fine-tuning for specialized skills like UI grounding often degrades general reasoning ability.
  • Requires vast, expensive datasets covering every possible interaction.

2. The Hierarchical Approach

This decomposes the problem into higher and lower levels:

  • Manager model: Acts like a project manager. Breaks the user request (“Find the latest sales report and email it”) into subgoals (“1. Open file explorer. 2. Go to ‘Reports’ folder. 3. Find file. 4. Open email client…”).
  • Worker model: Executes each subgoal (“Click the ‘Files’ icon on the dock”) using atomic actions.
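A rough sketch of this control flow, with the caveat that all class and method names below are hypothetical; only the division of labor follows the description above:

```python
# Hedged sketch of the two-level hierarchy: a Manager plans once, a Worker
# executes subgoal by subgoal. Manager, Worker, and env are assumed objects.
def run_hierarchical(manager, worker, request: str, env):
    subgoals = manager.plan(request, env.screenshot())  # e.g. ["Open file explorer", ...]
    for goal in subgoals:              # the plan is fixed once it is made
        done = False
        while not done:
            # Worker picks and executes one atomic action toward the subgoal.
            action = worker.next_action(goal, env.screenshot())
            env.execute(action)        # e.g. ("click", 450, 120)
            done = worker.subgoal_finished(goal, env.screenshot())
```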

This reduces cognitive load but still faces bottlenecks:

  • Workers must both describe an action (“click the icon”) and figure out the exact location — a grounding burden that limits performance.
  • Typical hierarchical systems use reactive planning — they stick to the original plan unless something fails, making them brittle in dynamic environments.

Inside Agent S2: Two Big Ideas

Agent S2 builds on the hierarchical approach but adds two transformative innovations: Mixture of Grounding (MoG) and Proactive Hierarchical Planning (PHP).

It’s a compositional system where generalist models handle the what and why (planning), and specialist models handle the how and where (grounding and execution).

Figure 2: Agent S2's compositional framework: a Manager (M) generates plans, a Worker (W) executes them, and Mixture of Grounding specialists (G1, G2, G3) produce precise UI actions.


1. Mixture of Grounding (MoG): The Right Tool for the Job

The “grounding bottleneck” is a key obstacle. Some tasks — like selecting a specific phrase — require different capabilities than clicking a button.

Agent S2’s solution: route each action to the right expert. The Worker decides what to do and sends it to one of three specialists:

  1. Visual Grounding Expert:
    Takes a screenshot and description (“blue login button”), outputs precise coordinates for clicks/drags. Used for most general interactions.

  2. Textual Grounding Expert:
    Uses OCR to locate exact character positions for highlighting text spans — more accurate than visual matching for fine-grained selections.

  3. Structural Grounding Expert:
    Handles spreadsheets/tables programmatically. Commands like set_cell_values({"D2": "=B2-C2"}) update cells without relying on potentially fragile clicks.

By delegating the grounding task to the right expert, Agent S2 executes a wider range of actions with higher precision.
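A minimal sketch of this routing, assuming the Worker tags each action with the kind of grounding it needs; the expert objects and their methods are illustrative stand-ins, not the paper's actual API:

```python
# Route a Worker-proposed action to the matching grounding specialist.
def ground(action: dict, screenshot: bytes, experts: dict):
    kind = action["kind"]
    if kind == "visual":        # general clicks/drags on described elements
        x, y = experts["visual"].locate(screenshot, action["target"])
        return ("click", x, y)  # e.g. target = "blue login button"
    if kind == "textual":       # fine-grained text spans, located via OCR
        start, end = experts["textual"].find_span(screenshot, action["phrase"])
        return ("select", start, end)
    if kind == "structural":    # spreadsheets/tables, addressed programmatically
        return ("set_cell_values", action["cells"])  # e.g. {"D2": "=B2-C2"}
    raise ValueError(f"unknown grounding kind: {kind}")
```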


2. Proactive Hierarchical Planning (PHP): Adapt or Perish

Digital environments change constantly. Static plans crumble when faced with unexpected pop-ups or altered layouts.

Agent S2’s proactive loop works like this:

Figure 3: Reactive planning follows a fixed initial plan and only revises it after a failure; proactive planning reassesses and updates the plan after every completed subgoal.

  1. Manager: Observes the current state and user request, and produces initial subgoals \(\{g_1, g_2, g_3, \dots\}\).
  2. Worker: Executes the first subgoal, using MoG experts for grounding.
  3. Manager: Reassesses after the subgoal completes, producing an updated plan \(\{g'_2, g'_3, \dots\}\) for the remaining work.
  4. Repeat until the task is complete.

This constant adaptation lets Agent S2 insert new steps when needed, recover gracefully from changes, and adjust course without waiting for failure.
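Sketched in the same hypothetical style as before, the key difference from the reactive loop is a single step: the Manager replans after every completed subgoal, not just on failure.

```python
# Hedged sketch of Proactive Hierarchical Planning (method names illustrative).
def run_proactive(manager, worker, request: str, env):
    plan = manager.plan(request, env.screenshot())  # initial subgoals g_1, g_2, ...
    while plan:
        subgoal = plan[0]
        worker.execute(subgoal, env)                # grounds actions via MoG experts
        # Proactive step: fold pop-ups, layout changes, or newly needed steps
        # into a fresh plan g'_2, g'_3, ... for the remaining work.
        plan = manager.replan(request, env.screenshot(), completed=subgoal)
```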


Example: Selecting text in a document. Agent S2 first tries the visual expert; when that fails, it reroutes to the textual expert for a precise selection and replans its subsequent actions accordingly.

Figure 4: Agent S2 self-corrects mid-task by switching grounding experts, then replans based on the updated state.
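An illustrative version of that fallback, with the caveat that the verification helper selection_matches is an assumption made for this sketch, not a mechanism the paper specifies:

```python
# Try visual grounding first; verify; reroute to the textual (OCR) expert.
def select_phrase(phrase: str, env, experts: dict):
    # First attempt: pixel-level localization of the phrase on screen.
    x, y = experts["visual"].locate(env.screenshot(), f"the text '{phrase}'")
    env.execute(("double_click", x, y))      # coarse selection attempt
    if not env.selection_matches(phrase):    # hypothetical verification step
        # Fallback: OCR yields character-precise start/end positions.
        start, end = experts["textual"].find_span(env.screenshot(), phrase)
        env.execute(("select", start, end))
```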


Experiments: Testing Agent S2

Researchers tested Agent S2 on three major benchmarks:

  • OSWorld (Ubuntu desktop tasks)
  • WindowsAgentArena (Windows OS tasks)
  • AndroidWorld (smartphone UI tasks)

OSWorld: New State-of-the-Art

Table 1: OSWorld benchmark results. Agent S2 posts the highest success rates of the compared agents at both 15-step and 50-step horizons.

With Claude-3.7-Sonnet as its backbone, Agent S2 hit 34.5% success on complex 50-step tasks — a 32.7% relative improvement over the previous best. Even with the less powerful Claude-3.5-Sonnet, it beat monolithic agents using more advanced models.


WindowsAgentArena: Cross-OS Generalization

Table 3: On WindowsAgentArena, Agent S2 reaches a 29.8% success rate, outperforming NAVI and Agent S and surpassing the previous best by 52.8% (relative), with strong results on Windows-specific tasks.


Why It Works: Ablation Studies

To isolate the impact of MoG and PHP, the researchers ran ablation studies, removing each component in turn and measuring the resulting drop in performance.

Figure 5: Success rates drop when either MoG or PHP is removed; both are crucial, especially for longer tasks.

Findings:

  • Removing MoG or PHP significantly reduced success rates.
  • Their benefits grow with task length — essential for long-horizon problem solving.

Also: Specialized, smaller visual grounding models integrated into the framework can outperform giant generalist models on grounding tasks.

Figure 6: Within Agent S2, specialized visual grounding models outperform a generalist model's grounding performance.


Error Analysis: Moving the Bottleneck

Figure 8: Breakdown of Agent S2's remaining failures: planning 41.0%, grounding 20.5%, interaction 17.9%, infeasible 10.3%, navigation 10.3%. Planning, not grounding, is now the main failure mode.

Past agents failed primarily at grounding. MoG reduced this dramatically. Now, planning quality is the biggest challenge — a sign that interaction precision is largely solved, shifting focus to deeper reasoning.


Conclusion: Lessons from Agent S2

Agent S2 is a leap forward for autonomous computer agents. By replacing the monolithic blueprint with a compositional framework, it combines the adaptability of proactive planning with the precision of specialist grounding.

Key takeaways:

  1. Specialization Works: Dedicated experts for visual, textual, and structural grounding boost precision.
  2. Adaptability is Essential: Proactive planning enables dynamic course correction.
  3. Composition is Powerful: Coordinating strong generalists with targeted specialists outperforms “all-in-one” giants.

Agent S2's success foreshadows the future: AI assistants as orchestras of specialized models rather than lone virtuosos. With UI grounding largely addressed, the next frontier is strengthening long-horizon reasoning, bringing us closer to truly capable digital helpers.