AI agents are getting impressively good. They can search the web, book flights, and manage your calendar. But if you’ve ever used one, you know they still feel a bit… fragile. They operate in a world that conveniently pauses while they think — a luxury none of us have. The real world is messy, dynamic, and asynchronous—things happen whether our agent is ready or not.

This gap between sterile lab environments and the chaotic real world is one of the biggest hurdles holding back truly useful AI assistants.

Most benchmarks today test agents in sequential, turn-based settings. The agent gets a prompt, does some thinking, executes a tool, gets an observation, and repeats. The environment patiently waits. This is great for testing basic reasoning and tool use but misses a huge class of critical capabilities:

  • Adaptability: What happens if a friend replies to your message while the agent is in the middle of booking a restaurant?
  • Time-awareness: Can an agent send a follow-up email exactly three minutes after the first one, if that’s what you asked for?
  • Proactivity: Can an agent notice an important notification and act on it without being told?

To build agents that can handle these real-world challenges, we first need a way to create and test them in environments that reflect this complexity.

A new paper from Meta Superintelligence Labs, “ARE: scaling up agent environments and evaluations,” introduces a powerful platform to do just that. The researchers present two key contributions:

  1. ARE (Meta Agents Research Environments): A research platform for creating dynamic, asynchronous, and realistic simulated environments where time flows continuously and events happen independently of the agent.
  2. Gaia2: A challenging new benchmark built on ARE, designed to evaluate the next generation of agent capabilities, including adaptability, time management, and even collaboration with other agents.

This work argues that to push AI forward, we need to get serious about how we define tasks and measure success. Let’s dive into how ARE and Gaia2 are paving the way.


Unpausing the World: The ARE Platform

The core problem with existing agent environments is that they are tightly coupled to the agent’s actions. The world only changes when the agent does something. ARE flips this on its head with a simple but profound principle: “everything is an event.”

In ARE, the environment is a time-driven simulation that runs asynchronously from the agent. The clock is always ticking, and events can be scheduled to happen at any time—triggered by the user, the agent, or the environment itself. This design allows for the creation of rich, dynamic worlds that more closely resemble reality.
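
To make “everything is an event” concrete, here is a minimal sketch of a time-driven loop that pops scheduled events off a priority queue as simulated time advances. The class names and the run_until interface are illustrative assumptions, not ARE's actual API.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable

# Minimal sketch of a time-driven event loop; Event and EventQueue are
# illustrative names, not ARE's real classes.

@dataclass(order=True)
class Event:
    time: float                                          # simulated time at which the event fires
    action: Callable[[], None] = field(compare=False)    # excluded from ordering

class EventQueue:
    def __init__(self) -> None:
        self._heap: list[Event] = []

    def schedule(self, event: Event) -> None:
        heapq.heappush(self._heap, event)

    def run_until(self, end_time: float) -> None:
        """Advance simulated time, firing events whether or not the agent is ready."""
        while self._heap and self._heap[0].time <= end_time:
            event = heapq.heappop(self._heap)
            event.action()

# The world keeps moving: these events fire regardless of what the agent is doing.
queue = EventQueue()
queue.schedule(Event(time=30.0, action=lambda: print("Calendar reminder fires")))
queue.schedule(Event(time=120.0, action=lambda: print("Email from Mom arrives")))
queue.run_until(end_time=300.0)
```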

A flow-diagram of the ARE architecture. An external User and Agent interact with the main Environment. The Environment contains Apps, an Event Queue, an Event Loop, and a Notification System. A Scenario module provides the initial state, events, and verification logic.

Figure 2 ARE environments are event-based, time-driven simulations running asynchronously from the agent. Scenarios contain tasks and verification logic. Interactions can be tool calls or observations, all logged for precise analysis.

The architecture is built on five core concepts:

  1. Apps: The building blocks of an environment, like an Emails app or a Calendar app. Each app is a stateful collection of tools (send_email, create_event) that operate on its own data. This makes it easy to create reproducible environments where agent actions have consistent consequences.

  2. Environments: Collections of apps, their data, and rules governing interactions.

  3. Events: Any action or state change. An agent sending a message is an event; an email arriving from a friend is also an event scheduled by the simulation. Events are organized in dependency graphs, allowing complex patterns like parallel or conditional execution.

A flowchart showing how events can be scheduled with dependencies. Some events run in parallel, while others must wait for their predecessors to complete.

Figure 3 Event dependency graph illustrating ARE scheduling patterns, including parallel execution, prerequisites, and conditional actions.

  4. Notifications: The environment communicates with the agent through notifications, similar to smartphone alerts. Configurable policies determine which events generate notifications, allowing researchers to test proactivity — will the agent check for updates itself, or rely on notifications?

  5. Scenarios: Instead of static, single-turn tasks, ARE uses dynamic scenarios that unfold over time, with initial state, scheduled events, and verification logic. A minimal sketch of how these pieces fit together follows this list.
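
Here is that sketch: a toy scenario definition that combines an app, scheduled events, a notification policy, and a verification hook. All class names, fields, and the notify_agent flag are assumptions made for illustration; the real ARE scenario format may differ.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative sketch only: how apps, events, notifications, and verification
# might fit together in a scenario. Not ARE's actual API.

class EmailApp:
    """An 'app': a stateful collection of tools operating on its own data."""
    def __init__(self) -> None:
        self.inbox: list[dict] = []
        self.sent: list[dict] = []

    def send_email(self, to: str, body: str) -> None:         # a tool the agent can call
        self.sent.append({"to": to, "body": body})

    def receive_email(self, sender: str, body: str) -> None:  # driven by scheduled events
        self.inbox.append({"from": sender, "body": body})

@dataclass
class ScheduledEvent:
    time: float                           # when the simulation fires this event
    app: str                              # which app's state it mutates
    tool: str
    args: dict
    depends_on: list[str] = field(default_factory=list)  # edges in the dependency graph
    notify_agent: bool = True             # notification policy: surface this to the agent?

@dataclass
class Scenario:
    initial_state: dict                   # pre-populated app data
    events: list[ScheduledEvent]          # the world keeps changing over time
    verify: Callable[[list[dict]], bool]  # checks the agent's write actions afterwards

# A toy scenario: Mom replies two simulated minutes in, and the agent is notified.
scenario = Scenario(
    initial_state={"emails": EmailApp()},
    events=[
        ScheduledEvent(
            time=120.0, app="emails", tool="receive_email",
            args={"sender": "mom@example.com", "body": "Password: hunter2"},
        )
    ],
    verify=lambda trace: any(a.get("tool") == "send_email" for a in trace),
)
```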


Example Scenario

Imagine you ask: “Ask my mom for our family streaming password, and once you get it, forward it to my dad.”

In a traditional environment, the agent would send the message and then… awkwardly wait. In ARE, this becomes a natural multi-turn scenario:

A sequence diagram illustrating a multi-turn scenario in ARE. The agent sends a message, pauses, and is later reactivated by a new email notification from the environment, causing it to adapt its plan.

Figure 4 Multi-turn scenario: The agent pauses after sending the first message, then adapts its plan when a new email notification arrives asynchronously with the password.
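
The agent-side pattern looks roughly like the runnable toy below: the agent blocks on a notification queue and only resumes when something relevant arrives. The Notification class and queue plumbing are stand-ins, not ARE's actual notification interface.

```python
import queue
from dataclasses import dataclass

# Toy stand-in for the pause-and-react pattern in Figure 4; not ARE's real interface.

@dataclass
class Notification:
    app: str
    content: str

def agent_loop(notifications: "queue.Queue[Notification]") -> str:
    print("Agent: asked Mom for the password, now waiting...")
    while True:
        note = notifications.get()   # the agent is paused; simulated time keeps flowing
        if note.app == "emails" and "password" in note.content.lower():
            return f"Agent: forwarding to Dad -> {note.content}"
        print(f"Agent: ignoring unrelated notification from {note.app}")

inbox: "queue.Queue[Notification]" = queue.Queue()
inbox.put(Notification("calendar", "Dentist at 4 PM"))
inbox.put(Notification("emails", "Mom: the streaming password is hunter2"))
print(agent_loop(inbox))
```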


Gaia2: A New Gauntlet for AI Agents

Leveraging ARE, the team built Gaia2, a benchmark set in a simulated Mobile environment packed with apps for email, messaging, contacts, calendars, shopping, and more, exposing 101 tools to the agent.

Gaia2 comprises 1,120 verifiable scenarios that are simple for humans but challenging for current AI agents. It moves beyond basic search and execution to test a richer range of capabilities.


The Seven Capability Splits of Gaia2

  1. Search: Gather information across multiple apps.
    Example: “Which city do most of my friends live in based on chat history?”

  2. Execution: Carry out a sequence of write actions that update the environment’s state.
    Example: “Update the age of all contacts aged 24 or younger.”

  3. Adaptability: React when the environment changes due to the agent’s actions.
    Example: “Book a meeting with Kaida, but if she replies suggesting another time, reschedule.”

  4. Time: Act under explicit temporal constraints.
    Example: “Ask my colleagues who’s ordering the cab. If no answer in 3 minutes, order it yourself.”

  5. Ambiguity: Detect and handle unclear or contradictory requests rather than guess.
    Example: “Schedule daily yoga at 6 PM; let me know if conflicts occur.”

  6. Agent2Agent Collaboration: Replace certain apps with autonomous “app agents” that require coordination.
    Example: Contact and chat apps become agents you must message to get information; a toy sketch of this pattern appears after the list.

A diagram showing how Agent2Agent scenarios work. Instead of calling tools directly, the main agent must communicate with specialized “app agents” to get tasks done.

Figure 9 In Agent2Agent scenarios, apps are replaced by autonomous agents. The main agent must collaborate through message passing, setting goals and coordinating.

  7. Noise: The environment introduces failures and irrelevant events to test robustness.
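
Here is that Agent2Agent sketch (item 6 above): the contacts app is replaced by an autonomous agent the main agent must message instead of calling tools directly. The keyword-matching “app agent” stands in for what would really be another LLM; all names are illustrative.

```python
# Toy sketch of the Agent2Agent split: coordination happens via message passing,
# not direct tool calls. All names are illustrative.

class ContactsAppAgent:
    """Stands in for the contacts app; answers natural-language requests."""
    def __init__(self, contacts: dict[str, str]) -> None:
        self._contacts = contacts

    def handle_message(self, request: str) -> str:
        # A real app agent would be an LLM; keyword matching keeps this runnable.
        for name, city in self._contacts.items():
            if name.lower() in request.lower():
                return f"{name} lives in {city}."
        return "I couldn't find that contact."

class MainAgent:
    def __init__(self, contacts_agent: ContactsAppAgent) -> None:
        self.contacts_agent = contacts_agent

    def find_city(self, friend: str) -> str:
        # The main agent sets a goal and asks the app agent to fulfil it.
        return self.contacts_agent.handle_message(f"Where does {friend} live?")

contacts_agent = ContactsAppAgent({"Kaida": "Lisbon", "Noor": "Oslo"})
print(MainAgent(contacts_agent).find_city("Kaida"))   # "Kaida lives in Lisbon."
```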

Verification: More Than the Final Answer

Benchmark success often means just getting the end result correct. But with agents that manipulate state — deleting, updating, creating — how they get there matters.

Gaia2 uses a robust verifier that checks the full sequence of write actions against a pre-annotated oracle solution (a simplified sketch follows the list):

  • Consistency: Correct tool and arguments, checked both exactly and semantically via an LLM judge.
  • Causality: Action dependencies respected.
  • Timing: Time-sensitive tasks executed within allowed windows.
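
As a simplified illustration of these checks, the sketch below greedily matches the agent's write actions against oracle actions in order, requiring exact tool and argument equality and an in-window timestamp. It omits the semantic LLM judge and the full dependency-graph matching that Gaia2's verifier performs.

```python
from dataclasses import dataclass

# Simplified trajectory verification sketch; not the actual Gaia2 verifier.

@dataclass
class WriteAction:
    tool: str
    args: dict
    time: float          # simulated time when the agent executed it

@dataclass
class OracleAction:
    tool: str
    args: dict
    earliest: float      # allowed execution window
    latest: float

def verify(agent_trace: list[WriteAction], oracle: list[OracleAction]) -> bool:
    cursor = 0           # oracle actions must be satisfied in order (causality)
    for step in oracle:
        matched = False
        while cursor < len(agent_trace):
            act = agent_trace[cursor]
            cursor += 1
            if (act.tool == step.tool and act.args == step.args       # consistency
                    and step.earliest <= act.time <= step.latest):    # timing
                matched = True
                break
        if not matched:
            return False  # a required write action is missing, wrong, or mistimed
    return True

trace = [WriteAction("send_email", {"to": "dad@example.com"}, time=200.0)]
oracle = [OracleAction("send_email", {"to": "dad@example.com"}, earliest=120.0, latest=300.0)]
print(verify(trace, oracle))  # True
```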

An illustration of the trajectory matching process. A successful trace (bottom) maps all agent actions to the required oracle actions while respecting dependencies. A failed trace (top) cannot find a valid mapping.

Figure 6 Matching process for agent trajectories against oracle actions — success requires correct mapping, order, and timing.

This yields precise, reliable results — crucial for evaluations and RL training.

A table showing the ARE Verifier achieves 0.98 agreement with human labels, compared to 0.72 for a simpler in-context LLM verifier.

Table 1 ARE Verifier far outperforms a simpler in-context LLM judge baseline on 450 labeled trajectories.


Benchmark Results: Trade-offs Everywhere

Testing a mix of proprietary and open-source models with a standard ReAct scaffold revealed a clear leaderboard:

A bar chart showing the overall performance of various AI models on Gaia2. GPT-5 (high) achieves the top score, followed by Claude-4 Sonnet and Gemini 2.5-Pro.

Figure 8 Overall Gaia2 scores: Proprietary frontier models lead, with GPT-5 (high) at the top.

Performance varies by capability:

A detailed table of Pass@1 scores for each model across all seven Gaia2 capability splits.

Table 2 Pass@1 scores per model and capability split. Execution and Search are easiest; Ambiguity, Adaptability, and Time remain tough.


The Cost of Intelligence

A critical finding: stronger reasoning often comes at the cost of efficiency. Models excel at Execution and Search but falter on Adaptability and Ambiguity.

A grid of bar charts showing model performance on each capability. Models that are strong in Execution and Search often struggle in areas like Time and Ambiguity.

Figure 10 Capability-specific scores: strengths and weaknesses vary widely.

Spending more budget yields diminishing returns:

A line chart showing that as budget per scenario increases, all models’ success rates improve but eventually level off.

Figure 1 Budget scaling curves plateau — throwing money at the problem doesn’t guarantee progress.

Cost and time-to-solve metrics show that the best “value” isn’t always the highest accuracy:

Two plots comparing models. The left shows Overall Score vs. Average Cost; the right shows Time to Solve scenarios, with humans slower but thorough.

Figure 11 Left: Higher scores often come at higher cost. Right: Time-to-solve differs greatly between models and humans.


Inverse Scaling in Time Scenarios

A striking trend: top reasoning models perform worst under strict time limits — an inverse scaling law.

A chart (left) shows models improve dramatically on Time scenarios when latency is ignored (“Instant mode”). On the right, a scatter plot for GPT models shows higher Execution scores correlate with lower Time scores.

Figure 13 Left: Instant mode boosts Time scores, especially for reasoning-heavy models. Right: More skilled execution correlates with slower time-sensitive responses.

Why? Deep reasoning adds latency. Models that are both fast and smart remain rare; adaptive systems that hand urgent tasks to small, quick models may be the future.
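
One way such an adaptive system could look is sketched below: a hypothetical router sends time-critical requests to a small, fast model and everything else to a slower reasoning model. The models, latency figure, and routing rule are placeholder assumptions, not something prescribed by the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical adaptive-compute router: placeholder models and latency numbers.

@dataclass
class Task:
    prompt: str
    deadline_s: Optional[float] = None   # None means no explicit time constraint

def route(task: Task,
          fast_model: Callable[[str], str],
          deep_model: Callable[[str], str],
          deep_latency_s: float = 300.0) -> str:
    """Use the fast model when the deadline is tighter than the deep model's latency."""
    if task.deadline_s is not None and task.deadline_s < deep_latency_s:
        return fast_model(task.prompt)   # answer quickly, accept shallower reasoning
    return deep_model(task.prompt)       # no tight deadline: take time to think

# Toy stand-ins for real model calls.
fast = lambda p: f"[fast model] {p}"
deep = lambda p: f"[deep reasoning model] {p}"

print(route(Task("Order the cab now", deadline_s=180.0), fast, deep))  # fast path
print(route(Task("Plan next week's meetings"), fast, deep))            # deep path
```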


How Collaboration Helps (Sometimes)

Multi-agent setups revealed that collaboration boosts weaker models like Llama 4 Maverick, reducing errors and stabilizing output:

A chart showing that for Llama 4 Maverick, increasing the agent-to-agent collaboration ratio reduces tool call errors.

Figure 14 For lighter models, Agent2Agent collaboration reduces error rates. For stronger models like Claude, gains are negligible.

Heterogeneous teams — pairing a strong “manager” agent with weaker executors — improve performance if executors are reliable.


Conclusion: The Road to Truly Useful Agents

ARE and Gaia2 mark a milestone in testing agents against realistic constraints. They move evaluation beyond turn-by-turn into dynamic, asynchronous worlds.

Key takeaways:

  1. The Real World is Asynchronous: Agents must handle continuous change; ARE enables building such worlds openly.
  2. Intelligence Is More Than Accuracy: Gaia2 shows adaptability, robustness, and time-awareness as major gaps.
  3. Trade-offs Are Fundamental: Speed, cost, and reasoning power must be balanced.
  4. Adaptive Compute Is the Future: Agents should adjust computational resources to task complexity — quick and cheap for trivial tasks; deep reasoning for hard ones.

The “second half” of AI progress will depend on defining meaningful tasks and robust evaluations. ARE and Gaia2 give the community powerful tools to push boundaries, surface weaknesses, and guide the design of truly capable AI assistants.