Imagine an AI that can use your computer just like you do—browsing websites, managing files, playing games, and even writing code. This isn’t science fiction; it’s the frontier of AI research, where GUI agents are being developed to autonomously operate graphical user interfaces.

But building such an agent is incredibly hard. How do you gather enough training data?
How do you teach it to learn from mistakes over long, complex tasks?
And how do you create a stable environment for it to practice in without constant crashes?

A new technical report from ByteDance introduces UI-TARS-2, a powerful GUI agent that tackles these challenges head-on. The paper outlines a systematic methodology for training highly capable agents that not only excel at traditional computer tasks but also generalize to dynamic environments like web-based video games.

Let’s explore how they did it.

A demonstration of UI-TARS-2 playing the game Hex, searching for rules, and then creating a webpage about its experience — all within a unified sandbox environment.


The Four Hurdles of Building a Digital Butler

Before UI-TARS-2, creating a general-purpose GUI agent was like building a skyscraper on shaky ground.
The authors identify four fundamental challenges:

  1. Data Scarcity: Unlike text or code, there’s no massive, pre-existing corpus of “computer usage” data. Collecting high-quality, step-by-step demonstrations is expensive and slow.
  2. Unstable Reinforcement Learning (RL): Teaching an agent through trial-and-error RL is difficult for long tasks. Rewards are often delayed, making it hard to know which actions helped or harmed.
  3. The GUI-Only Trap: Many real-world tasks can’t be solved with just clicks and typing. Sometimes you need back-end tools—like opening a terminal or running a script.
  4. Fragile Environments: Running millions of interactive sessions requires reliable infrastructure. Virtual machines and browsers can be slow, buggy, and hard to maintain.

UI-TARS-2 is built on a four-pillar methodology designed to systematically overcome each of these hurdles.


The Core of UI-TARS-2: A Recipe for a Smarter Agent

At its heart, UI-TARS-2 follows the ReAct paradigm—a loop of Reasoning and Acting.

At each step, the agent:

  1. Observes the current screen and goal.
  2. Thinks about the next step (reasoning).
  3. Acts via GUI operations (click, type) or tool calls (like terminal commands).

This cycle repeats until the task is complete.
A full task—called a trajectory—is a sequence of these thought–action–observation steps:

\[ \tau = \{(t_0, a_0, o_0), (t_1, a_1, o_1), \dots, (t_T, a_T, o_T)\} \]
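To make this loop concrete, here is a minimal sketch of a ReAct-style episode in Python. The `model`, `env`, and action objects are hypothetical stand-ins for the paper’s components, not the actual UI-TARS-2 interfaces.

```python
# Minimal sketch of the observe-think-act loop; `model` and `env` are
# hypothetical stand-ins, not the actual UI-TARS-2 interfaces.
def run_episode(model, env, instruction, max_steps=50):
    trajectory = []                       # [(thought, action, observation), ...]
    observation = env.reset(instruction)  # initial screenshot / page state
    for _ in range(max_steps):
        # 1. Observe + 2. Think: propose the next thought and action.
        thought, action = model.propose(instruction, trajectory, observation)
        # 3. Act: a GUI operation (click, type) or a tool call (e.g. terminal).
        observation, done = env.step(action)
        trajectory.append((thought, action, observation))
        if done:
            break
    return trajectory
```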

For long tasks, the agent uses a hierarchical memory system:

  • Working Memory: High-fidelity record of recent steps.
  • Episodic Memory: Compressed summaries for long-term recall.

The agent’s policy then predicts the next thought and action based on memory, instruction, and observations:

\[ P(t_n, a_n \mid \text{instruction}, \mathcal{W}_n, o_n, \mathcal{E}_n) \]
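As a rough illustration, the split between working and episodic memory can be sketched as a bounded buffer whose evicted steps get summarized; the `summarize` helper and the window size are assumptions, not details from the report.

```python
from collections import deque

class HierarchicalMemory:
    """Toy model of the working/episodic memory split."""
    def __init__(self, window=8):
        self.working = deque(maxlen=window)  # high-fidelity recent steps (W_n)
        self.episodic = []                   # compressed long-term summaries (E_n)

    def add(self, step, summarize):
        # When the working window is full, compress the oldest step into
        # episodic memory before it is evicted.
        if len(self.working) == self.working.maxlen:
            self.episodic.append(summarize(self.working[0]))
        self.working.append(step)

    def context(self):
        # The policy conditions on both: P(t_n, a_n | instruction, W_n, o_n, E_n)
        return list(self.working), list(self.episodic)
```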

Pillar 1: The All-in-One GUI Sandbox

To train a versatile agent, you need a versatile playground.

The team built a unified sandbox platform that supports:

  • Desktop apps on Windows and Ubuntu, and mobile apps on Android
  • Browser-based environments
  • Hybrid workflows mixing GUI interaction with backend tools (GUI-SDK).

Key features include:

  • Thousands of cloud virtual machines enabling large-scale training with stable, reproducible sessions.
  • File-system and terminal integration, allowing workflows like downloading a file in a browser and immediately processing it with shell commands.
  • Cross-platform GUI automation via PyAutoGUI and ADB.
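As a simplified example of what cross-platform automation looks like, the sketch below dispatches a single action either to PyAutoGUI on a desktop VM or to ADB on an Android device; the action schema is a hypothetical stand-in for the agent’s real action space.

```python
import subprocess
import pyautogui  # drives the mouse/keyboard on Windows or Ubuntu desktops

def execute_action(action, platform="desktop"):
    """Dispatch one agent action to the appropriate automation backend."""
    if platform == "desktop":
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.write(action["text"], interval=0.02)
    elif platform == "android":
        if action["type"] == "click":
            subprocess.run(["adb", "shell", "input", "tap",
                            str(action["x"]), str(action["y"])], check=True)
        elif action["type"] == "type":
            subprocess.run(["adb", "shell", "input", "text",
                            action["text"]], check=True)
```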

For web-based games, they built a hardware-accelerated browser sandbox:

The architecture of the browser sandbox, showing how the client SDK interacts with the browser manager and multiple browser instances for high-throughput, stable rollouts.

It runs multiple Chrome instances per container using the Chrome DevTools Protocol. Clever optimizations like a “fake clock” allow time acceleration or pausing, boosting training throughput without altering game logic.
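The report doesn’t spell out how the fake clock is implemented, but one plausible mechanism is Chrome DevTools Protocol virtual time. The sketch below (using the `websockets` package and a page’s DevTools WebSocket URL) fast-forwards a page’s clock; treat it as an illustration, not the paper’s actual code.

```python
import asyncio, json
import websockets

async def fast_forward(ws_url, budget_ms=60_000):
    """Advance a page's virtual clock by `budget_ms` of simulated time."""
    async with websockets.connect(ws_url) as ws:
        await ws.send(json.dumps({
            "id": 1,
            "method": "Emulation.setVirtualTimePolicy",   # experimental CDP method
            "params": {"policy": "pauseIfNetworkFetchesPending",
                       "budget": budget_ms},
        }))
        return json.loads(await ws.recv())  # CDP acknowledgement

# asyncio.run(fast_forward("ws://localhost:9222/devtools/page/<page-id>"))
```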


Pillar 2: The Data Flywheel

To solve data scarcity, the team designed a Data Flywheel—a self-reinforcing loop where the agent generates its own training data, improving with each cycle.

The Data Flywheel is an iterative process where the model generates new data, which is filtered and used to retrain the model, creating a virtuous cycle of improvement.

The cycle:

  1. Start with an initial dataset from online tutorials, human demos, and synthetic generation.
  2. Train in three stages:
    • Continual Pre-training (CT): Broad knowledge acquisition.
    • Supervised Fine-tuning (SFT): High-quality, task-specific instruction.
    • Reinforcement Learning (RL): Trial-and-error in interactive tasks.
  3. Generate new trajectories via the latest RL model.
  4. Filter & Recycle:
    • High-quality outputs → SFT dataset.
    • Lower-quality outputs → CT dataset.
  5. Repeat: Retrain with updated datasets.

This ensures no data is wasted: the model and the dataset improve together, as sketched below.
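In pseudocode terms, one flywheel turn might look like this; `rollout`, `quality_score`, and `train_ct_sft_rl` are hypothetical helpers, and the quality threshold is illustrative.

```python
def flywheel_iteration(model, tasks, ct_data, sft_data,
                       rollout, quality_score, train_ct_sft_rl, threshold=0.8):
    # 3. Generate new trajectories with the latest RL model.
    new_trajectories = [rollout(model, task) for task in tasks]
    # 4. Filter & recycle: high-quality rollouts feed SFT, the rest feed CT.
    for traj in new_trajectories:
        (sft_data if quality_score(traj) >= threshold else ct_data).append(traj)
    # 5. Repeat: retrain through CT -> SFT -> RL on the updated pools.
    model = train_ct_sft_rl(model, ct_data, sft_data)
    return model, ct_data, sft_data
```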


Pillar 3: Scalable Data Collection

The flywheel demands a constant flow of data—achieved via two annotation strategies.

In-situ Annotation (CT stage):
Annotation tools run directly on users’ own systems, collecting real-world behavior using a “think-aloud” protocol where annotators verbalize their thought process during tasks.

Interactive Annotation (SFT stage):
Human annotators oversee the model in live virtual environments.

The four-layer architecture of the interactive annotation platform, which manages tasks, virtual environments, and data storage for real-time human-in-the-loop data collection.

At each step:

  • The agent proposes a thought and action.
  • Humans can approve or correct it.

The interactive annotation workflow, where human annotators guide the agent’s actions in a live environment, providing on-policy corrections.

This “human-in-the-loop” pipeline produces on-policy data aligned with the model’s actual behavior.
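A single interactive-annotation step could be sketched as below; the `human.review` interface and its fields are assumptions about the workflow, not the platform’s real API.

```python
def annotate_step(model, env, instruction, history, human):
    # The agent proposes a thought and action from the live environment state.
    thought, action = model.propose(instruction, history, env.observe())
    # A human annotator either approves the proposal or supplies a correction.
    verdict = human.review(thought, action)
    if verdict.corrected:
        thought, action = verdict.thought, verdict.action
    observation, done = env.step(action)
    # The (possibly corrected) step is what becomes on-policy SFT data.
    history.append((thought, action, observation))
    return done
```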


Pillar 4: Stabilizing Multi-Turn Reinforcement Learning

Long-horizon RL is unstable and hard to scale. UI-TARS-2 addresses this with:

  1. Automatically generated, verifiable tasks spanning GUI browsing, general web interaction, and games.
  2. Outcome Reward Models (ORM) to evaluate open-ended tasks based on final results.
  3. Asynchronous rollouts with streaming updates, avoiding bottlenecks from long-tail trajectories.

The multi-turn RL training infrastructure, featuring a policy server and environment server that enable asynchronous, streaming updates for efficient training.

At the algorithmic level, they use Proximal Policy Optimization (PPO):

\[ \mathcal{J}_{\mathrm{PPO}}(\theta) = \mathbb{E} \left[ \min\left( \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t, \text{clip}\left(\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}, 1-\varepsilon, 1+\varepsilon\right) \hat{A}_t \right) \right] \]
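For reference, the clipped surrogate above translates into a few lines of PyTorch; tensor shapes and the advantage estimates are assumed, and this is not the paper’s training code.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped PPO surrogate, averaged over sampled steps."""
    ratio = torch.exp(logp_new - logp_old)            # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # maximizing J == minimizing this
```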

Enhancements for Long Tasks

  • Value Pretraining: Offline training of the value model to convergence before PPO updates.
  • Decoupled & Length-Adaptive GAE: Better credit assignment for sequences of varying length (see the sketch after this list).
  • Reward Shaping: Format rewards, length penalties, and intermediate signals.
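As a rough sketch of the GAE piece, the code below computes standard generalized advantage estimates and approximates "length-adaptive" by tying λ to the trajectory length; the paper’s exact decoupled and adaptive scheme is not specified here, so treat this as an assumption.

```python
import torch

def gae(rewards, values, gamma=1.0, lam=None):
    """Generalized advantage estimation over one trajectory (terminal at the end)."""
    T = rewards.shape[0]
    if lam is None:
        lam = 1.0 - 1.0 / max(T, 1)   # longer rollouts -> lambda closer to 1 (assumption)
    advantages = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```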

Finally, multiple domain-specific models are merged via parameter interpolation:

\[ \theta^{\text{merge}} = \sum_k \alpha_k \cdot \theta^{(k)}, \quad \sum_k \alpha_k = 1,\ \alpha_k \ge 0 \]
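Read literally, the merging formula is just an element-wise weighted average of the checkpoints’ parameters, as in the sketch below (PyTorch state-dict layout assumed).

```python
import torch

def merge_checkpoints(state_dicts, alphas):
    """Interpolate domain-specific checkpoints: theta_merge = sum_k alpha_k * theta_k."""
    assert abs(sum(alphas) - 1.0) < 1e-6 and all(a >= 0 for a in alphas)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(a * sd[name].float() for a, sd in zip(alphas, state_dicts))
    return merged
```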

Experiments & Results

GUI Benchmarks

On computer, mobile, and web tasks, UI-TARS-2 sets new state-of-the-art scores.

Table 1 shows UI-TARS-2’s performance on various GUI benchmarks, demonstrating significant improvements over previous versions and strong baselines across desktop, mobile, and web environments.

Highlights:

  • Significant boosts in out-of-domain performance—desktop and mobile benchmarks improved despite RL focusing mainly on browser tasks.
  • With GUI-SDK tool access, complex reasoning and software engineering tasks saw large gains (BrowseComp, SWE-Bench).

Game Benchmarks

In a collection of 15 web-based games:

Table 2 details the performance of UI-TARS-2 on a 15-game collection, where it achieves a mean normalized score of 59.8, significantly outperforming OpenAI CUA and Claude Computer Use.

  • UI-TARS-2 achieves ~60% of human-level performance on average.
  • Beats human averages on some games (e.g., Shapes).

Across out-of-domain LMGame-Bench:

Table 3 shows UI-TARS-2’s competitive performance on the out-of-domain LMGame-Bench, holding its own against top proprietary models in classic games.

It remains competitive with top proprietary models like OpenAI o3 and Gemini-2.5 Pro.


Insights from Training Dynamics

Reward & Entropy:
Training rewards rose steadily (Figure 7), while entropy often increased—indicating exploration across diverse strategies rather than converging on narrow ones.

Figure 7: Rewards consistently go up over time, while entropy also rises, suggesting the agent learns to explore diverse strategies.
Figure 8: Training entropy dynamics for GUI and game scenarios. Unlike reasoning tasks, entropy often rises, indicating increased exploration.

Thinking Less, Acting More:
As tasks were mastered, “think length” decreased in GUI environments—decisions became more direct.

Figure 9: The average think length per step. In GUI tasks, it drops as the agent matures; in games, it shows periodic spikes tied to difficulty.

Inference-Time Scaling:
UI-TARS-2 keeps improving as it is allowed more interaction steps, whereas baseline agents plateau.

Figure 11: Inference-time scaling. UI-TARS-2’s performance continues to climb with more steps, unlocking subgoals.

Value Pretraining Benefits:

Figure 10b: Value pretraining results in consistently higher rewards.

Hybrid Training Advantage:

Figure 15: Hybrid training (GUI-only + GUI-SDK) outperforms single-interface training with better cross-task transfer.


Conclusion: A Blueprint for the Future of Agents

UI-TARS-2 isn’t just a more powerful model—it’s a methodology for building robust computer-use agents.

Key lessons:

  1. A Systematic Approach Matters: Sandboxes, hybrid tools, data flywheels, and stable RL work together to unlock capability.
  2. Data Flywheels Win: Co-evolving model and data drives continuous gains.
  3. RL Can Scale: With the right infrastructure and algorithms, multi-turn RL is viable for complex tasks.
  4. Go Beyond GUI: Hybrid interaction modes vastly expand problem-solving ability.

While a perfect digital assistant is still a work in progress, UI-TARS-2 shows that with a solid foundation, rapid and tangible progress is possible. The principles here will likely guide the next generation of capable, reliable, and versatile AI agents.