Artificial Intelligence is becoming remarkably adept at using computers. We now have AI systems capable of booking flights, managing spreadsheets, and editing photos by directly controlling a graphical user interface (GUI) — just like a human user. These Computer-Use Agents (CUAs) promise to automate countless tedious digital tasks.
But there’s a catch: while they can be impressive, they often prove fragile. A single small mistake in a long series of actions — clicking the wrong button, misinterpreting a menu, or being thrown off by a pop-up — can derail the entire task. For complex, multi-step workflows, this unreliability is a major obstacle. The same agent might succeed flawlessly one time and fail spectacularly the next, resulting in frustratingly high variance that limits practical deployment.
What if, instead of relying on a single, imperfect agent, we could run several agents in parallel and simply select the best outcome? This scaling approach sounds simple, but it raises a tough question: how do you automatically determine which attempt is truly the “best”?
A new research paper from Simular Research tackles this challenge head-on. The authors introduce Behavior Best-of-N (bBoN), a framework that makes scaling CUAs not merely feasible, but unreasonably effective. Their method achieves a new state-of-the-art on the challenging OSWorld benchmark — a 10% absolute performance boost, reaching 69.9% success and coming within a hair’s breadth of human-level performance (≈72%).
Figure 1: Performance on OSWorld at 100 steps. The bBoN method beats the previous SoTA by a 10% absolute improvement, nearly attaining human-level performance.
In this article, we’ll explore why scaling agents is fundamentally challenging, how the bBoN framework solves the evaluation problem using behavior narratives, and why their results represent a leap forward in building robust AI assistants.
Why AI Agents Falter — and the Scaling Dilemma
To understand the paper’s contributions, let’s first look at how a typical CUA operates. You can think of it as an agent solving a puzzle:
- The agent sees an observation \(o_t\) — a screenshot of the desktop.
- It takes an action \(a_t\) — e.g., `agent.click(x, y)`.
- It receives a new observation and continues until the task (defined by a user instruction \(I\)) is complete.
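Concretely, this loop can be sketched in a few lines of Python. This is only an illustrative shape, not the paper's implementation; the policy, screenshot, execution, and termination callables are hypothetical placeholders for a real environment.

```python
# A minimal sketch of the observation-action loop a CUA runs (not the
# paper's code). All callables are supplied by the caller and are
# hypothetical stand-ins for a real policy and desktop environment.

def run_episode(policy, screenshot, execute, is_terminal, instruction, max_steps=100):
    """Roll out one trajectory: o_0, a_0, o_1, a_1, ..., o_T."""
    trajectory = []
    obs = screenshot()                       # o_0: initial desktop screenshot
    for _ in range(max_steps):
        # pi(a_t | I, history): pick the next action, e.g. "agent.click(x, y)"
        action = policy(instruction, trajectory, obs)
        execute(action)
        trajectory.append((obs, action))
        obs = screenshot()                   # o_{t+1}: observe the effect
        if is_terminal(action):              # e.g. the agent declares it is done
            break
    trajectory.append((obs, None))           # keep the final screenshot s_T
    return trajectory
```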
Traditionally, research has focused on improving the agent’s policy (\(\pi\)) — the “brain” that decides which action to take given the instruction and observation history. The aim: produce one highly capable agent that outputs a single successful trajectory of actions.
Yet even the best policies are probabilistic, meaning they can fail unpredictably. One common strategy to boost reliability is test-time scaling — generating multiple candidate solutions and selecting the best one.
Some methods do this step-wise — at each step, the agent proposes several possible actions, and a “judge” chooses one before proceeding. While this can help resolve local uncertainties, it commits early to a single path. If the agent chooses a harder or suboptimal route early, there’s no opportunity to switch to an easier one that might succeed.
The authors explore a more powerful alternative: trajectory-level Best-of-N, or wide scaling — where multiple agents run from start to finish, generating complete solution trajectories, and the best overall trajectory is selected.
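In code, wide scaling is almost trivially simple; the difficulty hides inside the selector. A minimal sketch, reusing the hypothetical `run_episode` above and a `select_best` placeholder for the judge introduced later:

```python
# Sketch of trajectory-level Best-of-N ("wide scaling"): run N independent
# rollouts to completion, then let a judge pick one whole trajectory.
# `make_env` and `select_best` are hypothetical placeholders.

def best_of_n(make_env, policy, select_best, instruction, n=10):
    rollouts = []
    for _ in range(n):
        screenshot, execute, is_terminal = make_env()   # fresh VM / environment
        rollouts.append(run_episode(policy, screenshot, execute,
                                    is_terminal, instruction))
    # The hard part is here: judging which complete trajectory
    # actually satisfies the instruction.
    return select_best(instruction, rollouts)
```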
Figure 2: Disjoint task successes across three independent agent rollouts. bBoN combines their strengths by selecting the most promising trajectory.
This approach exploits the fact that different agents — or even different runs of the same agent — often fail in different ways but succeed on complementary sets of tasks. By producing multiple rollouts, you increase the odds that at least one is correct.
The challenge lies in evaluating and comparing full trajectories. A single trajectory can involve hundreds of steps, each with a high-resolution screenshot. This is dense, multimodal data, much of it irrelevant to task success. And many computer tasks have multiple valid completion paths. How can an automated judge efficiently identify the correct trajectory? This is exactly the problem bBoN solves.
The Behavior Best-of-N Framework
The bBoN framework addresses wide scaling with a two-part solution:
- Convert complex, noisy trajectories into concise, structured behavior narratives.
- Use a powerful Vision-Language Model (VLM judge) to compare these narratives holistically and pick the winner.
Figure 3: Multiple rollouts are converted to behavior narratives that distill action effects. A VLM judge then performs comparative evaluation to select the best.
1. From Raw Trajectories to Behavior Narratives
The insight: to understand what happened, you don’t need every pixel of every screenshot — you need a clear record of changes caused by actions.
The Behavior Narrative Generator takes each transition — “before” screenshot \(s_i\), action \(a_i\), “after” screenshot \(s_{i+1}\) — and generates a short fact (\(\phi_i\)) describing the change:
- Clicked on the “Insert” menu.
- Opened the Pivot Table dialog.
- Typed “Sales Summary” into the sheet name field.
For precise pointer actions (clicks, drags), the team overlays a marker on the “before” image showing the target location, and provides a zoomed-in crop from the “after” image centered on the pointer. This helps the VLM verify that the effect occurred.
The final narrative is \(\tilde{\tau} = (s_0, \phi_0, \phi_1, \dots, \phi_{T-1}, s_T)\), including the start/end screenshots and the action-effect chain — filtering out irrelevant visual noise.
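As a rough illustration of this pipeline (not the authors' code), here is how transitions could be turned into facts with a VLM. The `vlm`, `overlay_marker`, and `crop_around` helpers, and the action attributes, are all assumptions for the sake of the sketch.

```python
# Sketch of behavior-narrative generation. `vlm` is a hypothetical
# vision-language model call (prompt + images -> text); `overlay_marker`
# and `crop_around` are assumed image utilities.

def behavior_narrative(trajectory, vlm, overlay_marker, crop_around):
    """Turn (s_i, a_i, s_{i+1}) transitions into short facts phi_i."""
    facts = []
    for i in range(len(trajectory) - 1):
        before, action = trajectory[i]       # s_i, a_i
        after, _ = trajectory[i + 1]         # s_{i+1}
        images = [before, after]
        if action.is_pointer_action():       # clicks, drags, etc. (assumed attribute)
            # Mark the target on the "before" frame and zoom into the same
            # spot on the "after" frame so the VLM can verify the effect.
            images = [overlay_marker(before, action.x, action.y),
                      crop_around(after, action.x, action.y)]
        facts.append(vlm(
            f"Action taken: {action}. In one sentence, state what changed "
            "on screen as a result.",
            images))
    s0, _ = trajectory[0]
    sT, _ = trajectory[-1]
    # Narrative: start screenshot, chain of action-effect facts, end screenshot.
    return (s0, *facts, sT)
```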
2. The bBoN Judge
Once trajectories are converted to clean narratives, comparing candidates becomes far more tractable. The bBoN Judge receives all \(N\) narratives at once in a single, multiple-choice-style prompt:
“Given the user’s request and these \(N\) summaries of what different agents did, which one best completes the task?”
Comparing all candidates simultaneously enables direct contrasts between approaches, identification of subtle differences, and more informed selection. The judge is also instructed to cite specific facts from the narratives, ensuring evidence-based reasoning.
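Here is a sketch of what that comparative prompt could look like, fleshing out the `select_best` placeholder from earlier so that it operates on behavior narratives. The wording, the `vlm_judge` call, and the answer parsing are assumptions, not the paper's actual prompt.

```python
# Sketch of the comparative bBoN judge: all N narratives go into one
# multiple-choice-style prompt. `vlm_judge` is a hypothetical VLM call.

def select_best(instruction, narratives, vlm_judge):
    options = []
    for idx, narrative in enumerate(narratives):
        # Each candidate is listed as an option, with its action-effect facts.
        facts = "\n".join(f"  - {fact}" for fact in narrative[1:-1])
        options.append(f"Candidate {chr(65 + idx)}:\n{facts}")

    prompt = (
        f"User request: {instruction}\n\n"
        + "\n\n".join(options)
        + "\n\nWhich candidate best completes the request? "
          "Cite specific facts from the narratives as evidence, "
          "then answer with a single letter."
    )
    answer = vlm_judge(prompt)                         # e.g. "... so the answer is B"
    return ord(answer.strip()[-1].upper()) - ord("A")  # index of the winning trajectory
```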
3. Agent S3 — A Stronger Foundation
Scaling works best when the base agents already produce high-quality rollouts. The authors developed Agent S3, an improved agentic framework building on Agent S2:
- Integrated Coding Agent: At any step, Agent S3 can choose GUI actions or invoke a coding agent capable of executing Python/Bash for heavy-lift tasks like bulk data operations or file transformations.
- Flat Policy: Instead of a multi-level (manager-worker) hierarchy, Agent S3 uses a single capable model to plan step-by-step, enabling faster and more adaptable decision-making.
These improvements make Agent S3 a state-of-the-art agent by itself and an excellent base for bBoN.
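To make the "GUI action or code" decision concrete, here is one possible shape for a single step of such a flat policy. The `model`, `run_gui_action`, and `run_code` callables and the decision format are assumptions, not Agent S3's real interface.

```python
# Sketch of a flat policy step that can either act on the GUI or hand work
# to a coding agent. All callables are hypothetical stand-ins.

def flat_policy_step(model, instruction, history, screenshot,
                     run_gui_action, run_code):
    # One model plans the next step directly: no manager/worker hierarchy.
    decision = model(instruction, history, screenshot)
    if decision["type"] == "code":
        # Heavy-lift work (bulk data edits, file transforms) goes to Python/Bash.
        result = run_code(decision["program"])
    else:
        # Otherwise take an ordinary GUI action, e.g. a click or keystroke.
        result = run_gui_action(decision["action"])
    history.append((decision, result))
    return history
```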
Table 2: Agent S3 vs. Agent S2 — improvements in success rate and efficiency.
Experiments and Results
The team benchmarked bBoN primarily on OSWorld, a suite of hundreds of real-world Ubuntu tasks across multiple domains (Office, OS operations, workflow tasks).
Main OSWorld Results
Table 1: OSWorld success rates (100 steps). bBoN lifts Agent S3’s SoTA performance by +7.3% with GPT-5, reaching 69.9% — near human level.
Agent S3 alone outperforms previous SoTA systems. With bBoN scaling (N=10 rollouts), success reaches 69.9% — closing most of the gap to humans (72% SR).
Scaling With More Rollouts
Figure 4: Success rate generally rises as \(N\) increases. Larger and smaller models both benefit.
Both GPT-5 and GPT-5 Mini improve as \(N\) grows, confirming wide scaling’s robustness.
Model Mixtures
Table 3: Diverse, capable ensembles (e.g., GPT-5 + Gemini 2.5 Pro) yield highest SR and task coverage.
Key insights:
- Stronger base models yield better outcomes.
- Diversity in ensembles expands coverage — increasing odds that one rollout will succeed.
Do Behavior Narratives Matter?
Table 4: Behavior narratives outperform baselines by ~3.4%, validating the action-effect representation.
Behavior narratives beat both naïve captioning and raw screenshot subsampling, proving the value of representing changes rather than static states.
Comparative vs. Independent Judging
Figure 5: Comparative judging markedly outperforms independent ranking — especially at higher \(N\).
Direct comparison within a single prompt scales more effectively than independent scoring, which plateaus quickly.
Generalizing Beyond Ubuntu
Tables 6 & 7: Even at N=3, bBoN improves success rates across Windows and Android benchmarks.
bBoN delivered consistent gains across WindowsAgentArena and AndroidWorld, proving its principles apply broadly across operating systems.
Conclusion
The Unreasonable Effectiveness of Scaling Agents for Computer Use offers a pragmatic, high-impact recipe for boosting AI reliability:
Instead of chasing the elusive “perfect” agent, run several good ones and let a principled selection mechanism pick the best.
The Behavior Best-of-N (bBoN) framework is that mechanism:
- Narratives transform dense trajectories into action-effect summaries.
- Comparative judging ensures candidates are evaluated in context.
- Improved base agents (Agent S3) supply high-quality rollouts.
Together, these elements pushed OSWorld performance to near-human levels, with strong generalization to Windows and Android.
While the current approach assumes independent rollouts via VMs, extending it to real user desktops and shared online environments is an open challenge. Still, bBoN establishes a robust, scalable pattern for turning inconsistent CUAs into consistently high-performing assistants — a promising step toward everyday, dependable AI computer use.