Introduction: The Agent’s Dilemma

Imagine an AI agent tasked with a complex research question such as, “What are the long-term economic impacts of adopting renewable energy sources in developing nations?” To answer this, the agent can’t rely only on its pre-trained knowledge—it must act like a real researcher: search the web, read academic papers, analyze data, and compose a coherent answer from many sources.

This is the promise of deep research agents—AI systems that autonomously navigate the vast sea of information to construct new knowledge. Early approaches demonstrated remarkable potential, yet they all hit the same fundamental barrier: the context window.

Most current agents operate under what the authors of a new paper call the mono-contextual paradigm. These agents take the initial question, perform a search, record their thought process, and append everything—thoughts, tool results, and retrieved information—into a single, ever-growing text block. They then repeat this sequence. While simple, this approach quickly collapses under the weight of long tasks, facing two critical issues:

  1. Context Suffocation: As exploration progresses, the context window fills with accumulated data, previous reasoning, and old observations. Each new step leaves less room for fresh thinking. Eventually, the model becomes overwhelmed by its own history, producing superficial conclusions or stopping prematurely.

  2. Noise Contamination: Not every search yields useful results. Early mistakes, irrelevant pages, and failed attempts linger permanently in the prompt. Their noise spreads through every subsequent step, eroding clarity and corrupting the agent’s overall reasoning.

To fix these limitations, researchers from Alibaba Group and Renmin University of China propose IterResearch, a groundbreaking paradigm that reimagines long-horizon research as a loop of exploration and synthesis rather than a linear accumulation of data. The key idea is simple yet profound: instead of storing everything forever, the agent periodically reconstructs its workspace—preserving insights, discarding noise, and recovering focus. Modeled formally as a Markov Decision Process (MDP), this iterative structure allows AI agents to reason coherently even after thousands of interactions.

In this post, we’ll explore how the IterResearch paradigm works, why it solves the context-window problem, and what its success tells us about the future of autonomous reasoning.


Background: From RAG to the Mono-Contextual Trap

Before understanding IterResearch, it helps to recall the current landscape of AI reasoning agents. A well-known technique, Retrieval-Augmented Generation (RAG), lets language models fetch relevant documents from a fixed database like Wikipedia to boost factual responses. However, RAG systems are confined to static resources; they cannot actively probe dynamic environments such as the live web.

Deep research agents take the next leap. Equipped with tools like web search, browsers, and Python interpreters, they can navigate real-world information sources to construct new knowledge. The prevailing way to structure them has been the mono-contextual paradigm, typified by methods such as ReAct (Reason + Act).

The mono-contextual paradigm accumulates all information in a single, growing context, leading to suffocation and noise. In contrast, the Iterative Deep-Research Paradigm used by IterResearch works in cycles, reconstructing its workspace to maintain focus.

Figure 1: The mono-contextual paradigm (top) linearly accumulates context, causing suffocation and noise. IterResearch (bottom) reconstructs the workspace each round, maintaining clarity and sustained reasoning.

In the top half of Figure 1, we see the linear process: Think → Act → Observe, repeated endlessly. Each iteration appends its output to a massive transcript. For short tasks, this is manageable. For long ones, it becomes chaotic—forcing the agent to reread its entire history at every step, wasting compute and attention.

IterResearch breaks this loop through cyclical reconstruction: by strategically forgetting irrelevant history and summarizing essential insights, it maintains a clean, bounded workspace at all times.
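To make the contrast concrete, here is a minimal sketch of a mono-contextual loop, assuming hypothetical `llm` and `run_tool` helpers (stand-ins, not the paper's actual implementation). The detail to notice is the single `context` string that only ever grows:

```python
# Minimal sketch of a mono-contextual (ReAct-style) agent loop.
# `llm` and `run_tool` are hypothetical callables, not the paper's implementation.

def mono_contextual_agent(question: str, llm, run_tool, max_turns: int = 50) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm(context)                  # think and choose the next action
        if step.strip().lower().startswith("final answer"):
            return step                      # model decided it is done
        observation = run_tool(step)         # e.g. a web search result
        # Everything is appended forever: thoughts, actions, and raw observations.
        context += f"\n{step}\n{observation}\n"
    return llm(context + "\nGive your best final answer now.")
```

Every round rereads and extends the same transcript, so the prompt grows with every step; this is exactly the accumulation that produces suffocation and noise.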


The Core Method: IterResearch’s Cycle of Synthesis

IterResearch replaces linear accumulation with structured iteration. Like a human researcher, it operates in cycles—reading, noting, synthesizing, and refocusing. Each round integrates what was learned in previous steps and prepares for the next.

This behavior is formalized as a Markov Decision Process (MDP) with three key components (a minimal code sketch of the resulting loop follows the list):

  • State (\(\mathcal{S}\)): The Agent’s Workspace. Each state \(s_t\) contains only:
  1. The original question \(q\),
  2. An evolving report \(\mathcal{M}_t\) summarizing essential findings,
  3. The most recent interaction \(\{a_{t-1}, \mathrm{TR}_{t-1}\}\), representing the last action and its feedback.
  • Decision (\(\mathcal{D}\)): The Agent’s Output. At each step, the agent issues a structured decision \(d_t = (\text{Think}_t, \mathcal{M}_{t+1}, a_t)\):
  1. Think – reasoning about progress and what to seek next;
  2. Report – updating memory by synthesizing validated findings;
  3. Action – choosing either a tool (search, browse, compute) or producing the final answer.
  • Transition (\(\mathcal{T}\)): Workspace Reconstruction. After executing the action \(a_t\) and receiving tool response \(\mathrm{TR}_t\), the system builds the next state \(s_{t+1} = (q, \mathcal{M}_{t+1}, \{a_t, \mathrm{TR}_t\})\). The full history is summarized into a single, compact report—preventing the runaway prompt growth seen in traditional agents.
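Below is one way this loop could be implemented, using the same hypothetical `llm` and `run_tool` helpers as in the earlier sketch; the prompt layout and output parsing are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class State:
    question: str      # q: the original research question
    report: str        # M_t: the evolving report of validated findings
    last_action: str   # a_{t-1}: the most recent action
    last_result: str   # TR_{t-1}: the tool response to that action

def iter_research(question: str, llm, run_tool, max_rounds: int = 128) -> str:
    state = State(question, report="(empty)", last_action="(none)", last_result="(none)")
    for _ in range(max_rounds):
        # Decision d_t = (Think_t, M_{t+1}, a_t), produced from the bounded workspace alone.
        _think, new_report, action = llm(
            f"Question: {state.question}\n"
            f"Report so far: {state.report}\n"
            f"Last action: {state.last_action}\n"
            f"Last result: {state.last_result}\n"
        )
        if action.startswith("FINAL:"):
            return action.removeprefix("FINAL:").strip()
        tool_result = run_tool(action)
        # Transition T: rebuild the workspace; earlier rounds survive only via the report.
        state = State(state.question, new_report, action, tool_result)
    return state.report  # fall back to the synthesized report if the budget runs out
```

The design point is the transition: nothing from earlier rounds is carried forward except what the agent explicitly wrote into the report, so the workspace stays the same size no matter how many rounds have passed.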

The core decision and transition loop of IterResearch. The agent’s policy π takes the current state s_t and produces a decision d_t. The environment E executes the action, and the transition function T reconstructs the next state s_{t+1}.

Figure 2: The structured loop of IterResearch where each round reconstructs state from a concise report and last interaction, maintaining Markov consistency and bounded memory.

The Secret Sauce: Markovian Workspace Reconstruction

This reconstruction mechanism enforces the Markov property: every decision depends only on the current state, not on the full history. The workspace remains constant-sized (\(O(1)\)), whereas a ReAct-style agent’s context grows linearly with the number of steps (\(O(t)\)).

A comparison of context growth. Mono-contextual context grows linearly with time (O(t)), while IterResearch’s workspace remains constant (O(1)).

Figure 3: Mono-contextual approaches suffer from linear context growth, while IterResearch maintains constant workspace size.

This yields two major benefits:

  1. Freedom from Context Suffocation: With constant workspace size, the agent’s reasoning capacity never diminishes—even after thousands of steps.
  2. Automatic Noise Filtering: Only information intentionally preserved in the evolving report carries forward. Irrelevant or incorrect data are naturally discarded.

The result is a scalable agent capable of thousands of interactions while maintaining coherent thought—an unprecedented feat in long-horizon reasoning.


Training an Efficient Researcher: Efficiency-Aware Policy Optimization (EAPO)

Designing IterResearch’s behavior requires teaching it how to explore efficiently. A simple “reward if correct” is insufficient; the model must learn to reach conclusions quickly and economically.

To achieve this, the authors designed Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that combines efficiency-oriented reward shaping with mechanisms for stable distributed training.

1. Geometric Discounting for Efficiency

EAPO reshapes the sparse reward signal using geometric discounting:

Equation for discounted rewards. R_T is the final reward (1 for correct, 0 for incorrect), T is the total number of steps, and γ is a discount factor slightly less than 1.

Equation 1: Reward shaping enforces efficiency. Earlier correct actions yield higher discounted rewards.

With this formulation, even when two trajectories both reach the correct answer, the shorter one earns more cumulative reward. This subtle change pressures the model to favor concise, focused reasoning over exhaustive exploration.
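The paper's exact expression is shown above; one natural reading of the description, written out here as an assumption rather than a quotation from the paper, credits step \(t\) with the terminal reward discounted back from the end of the trajectory:

\[
R_t \;=\; \gamma^{\,T-t}\, R_T, \qquad R_T \in \{0, 1\}, \qquad 0 < \gamma < 1 \ \text{(close to 1)}.
\]

Under this form, two correct trajectories of lengths \(T_1 < T_2\) earn first-step returns \(\gamma^{T_1-1}\) and \(\gamma^{T_2-1}\) respectively, so the shorter one wins.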

2. Stable Training via Adaptive Downsampling

The iterative paradigm produces multiple samples per trajectory—one for each round—making sample counts variable across questions. To maintain stable distributed training, the researchers introduced adaptive downsampling:

Equation for adaptive downsampling. The total number of samples |C| is adjusted to be perfectly divisible by the data parallel size (DP_size).

Equation 2: Adaptive downsampling ensures balanced batches for distributed reinforcement learning.
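As a rough illustration of the idea (the paper's exact selection rule may differ), the adjustment can be as simple as keeping the largest sample count divisible by the data-parallel size:

```python
import random

def adaptive_downsample(samples: list, dp_size: int) -> list:
    """Keep the largest multiple of dp_size so the batch splits evenly across ranks."""
    keep = (len(samples) // dp_size) * dp_size
    # Randomly drop the remainder; with large batches this discards well under 1% of samples.
    return random.sample(samples, keep) if keep < len(samples) else samples
```

For example, 1,021 samples on 8 data-parallel ranks would be trimmed to 1,016, a loss of roughly 0.5%.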

This technique rounds total samples to fit evenly across GPUs with negligible data loss (<1%), ensuring smooth large-scale optimization. Combined with Group Sequence Policy Optimization (GSPO), EAPO trains IterResearch to be both accurate and efficient.

The GSPO objective function used to train IterResearch. This formula optimizes the agent’s policy θ based on the discounted rewards and importance sampling ratios.

Equation 3: GSPO objective integrates discounted rewards and sequence-level optimization to train the agent policy.
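For reference, the GSPO objective introduced by its original authors takes roughly the following form; in EAPO, the discounted rewards described above feed into the group-normalized advantages \(\hat{A}_i\). This rendering summarizes GSPO in general and is not a transcription of the paper's figure:

\[
\mathcal{J}_{\mathrm{GSPO}}(\theta)
= \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\Big(s_i(\theta)\,\hat{A}_i,\;
\operatorname{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right],
\qquad
s_i(\theta)=\left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/\lvert y_i \rvert},
\]

where \(G\) trajectories are sampled per question and \(s_i(\theta)\) is the length-normalized, sequence-level importance ratio.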


Experiments and Results: Putting IterResearch to the Test

The authors evaluated IterResearch against a diverse set of baselines—from direct inference with large frontier models to advanced open-source and proprietary deep-research agents.

Main Results: A New State of the Art

Performance of IterResearch against other leading open-source agents. IterResearch shows a clear and consistent advantage across all four benchmarks shown.

Figure 4: IterResearch outperforms state-of-the-art open-source long-horizon agents across multiple benchmarks.

Across six challenging datasets—BrowseComp, BrowseComp-zh, Humanity’s Last Exam (HLE), GAIA, Xbench-DeepSearch, and SEAL-0—IterResearch achieves an average improvement of +14.5 percentage points over all open-source baselines. It even narrows the gap with, and in some cases surpasses, commercial systems such as OpenAI’s DeepResearch.

Table of main results across six benchmarks. IterResearch consistently achieves the highest scores among open-source agents and is competitive with proprietary systems.

Table 1: IterResearch consistently dominates across diverse reasoning and exploration benchmarks.

IterResearch maintains strong performance in both information-seeking tasks (e.g., BrowseComp) and complex analytical tasks (e.g., GAIA, HLE). In exploratory tasks, it stays focused thanks to report-based synthesis; in analytical ones, it filters noise through periodic reasoning checkpoints.


Ablation Studies: Unpacking the Design

Ablation study results. The table shows the impact of the training method (EAPO vs. GSPO vs. SFT) and the paradigm itself (Iterative vs. Mono-contextual).

Table 2: Ablations highlight the gains from efficiency-aware training and iterative workspace design.

Two insights stand out:

  1. Efficiency-Aware Policy Optimization Works: Agents trained with EAPO achieve comparable or better accuracy while using 5.7% fewer turns, confirming that geometric rewards encourage tighter, more purposeful exploration.

  2. Iteration Beats Accumulation: Even when given identical training data, the iterative paradigm outperforms the mono-contextual agent by 12.6 points on average—even though the latter uses a 64K context versus IterResearch’s 40K. Expanding memory alone cannot overcome the inherent inefficiency of linear accumulation.


Scaling Beyond Limits: 2,048 Interactions and Beyond

To test scalability, the authors ran IterResearch on the BrowseComp benchmark, increasing the allowed maximum turns from 2 up to an unprecedented 2048.

Graph showing interaction scaling. As the maximum allowed turns increase (x-axis), the accuracy (purple line) dramatically improves, while the average turns used (orange line) grows sub-linearly.

Figure 5: Interaction scaling shows remarkable performance growth from 3.5% to 42.5% accuracy as the interaction budget expands from 2 to 2048.

The results are striking. Accuracy climbs from 3.5% at 2 turns to 42.5% at 2048, showing that deeper exploration dramatically improves research performance. Even with thousands of available rounds, the agent manages its search intelligently, using only about 130 turns on average and stopping once sufficient information is gathered.

This scaling suggests that the perceived difficulty of long-horizon tasks stems more from overly tight interaction budgets than from inherent task complexity.


IterResearch as a Prompting Strategy: No Training Required

Performance comparison of IterResearch vs. ReAct as a prompting strategy for two frontier models. The IterResearch prompt consistently outperforms the standard ReAct prompt.

Figure 6: IterResearch prompt design boosts frontier models such as o3 and DeepSeek-V3.1 compared to traditional ReAct.

Finally, the authors explored whether IterResearch’s structured reasoning loop could serve directly as a prompting strategy—allowing existing large models to inherit its benefits without training.

The results are immensely promising: when tested with OpenAI’s o3 and DeepSeek-V3.1, IterResearch-based prompts outperform ReAct by wide margins—+12.7pp for o3 and +19.2pp for DeepSeek—especially on long-horizon benchmarks like BrowseComp. These gains confirm that IterResearch’s structure provides a universally beneficial “cognitive framework” for controlled reasoning, independent of model architecture.
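As a rough sketch of what such a prompting strategy could look like in practice (the template below is hypothetical; the wording is not the authors' actual prompt), the iterative workspace maps naturally onto a fixed prompt skeleton:

```python
# Hypothetical IterResearch-style prompt skeleton; illustrative wording only.
ITER_PROMPT = """You are a research agent working in rounds.

Question: {question}

Current report (everything validated so far):
{report}

Last action and its result:
{last_action}
{last_result}

Respond with:
Think: <what is known, what is missing, what to do next>
Report: <rewritten report including any newly validated findings>
Action: <one tool call, or FINAL: <answer> if the question is resolved>"""

prompt = ITER_PROMPT.format(
    question="What are the long-term economic impacts of renewable energy adoption?",
    report="(empty)",
    last_action="(none)",
    last_result="(none)",
)
```

Because the model rewrites the report and picks exactly one next action each round, the prompt stays bounded no matter how many rounds have passed.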


Conclusion and Implications

IterResearch marks a major evolution in how AI agents handle complex, multi-step reasoning. Instead of letting context grow endlessly, it introduces cycles of exploration and synthesis, reformulated through a Markovian lens. This simple yet profound shift eliminates two longstanding obstacles—context suffocation and noise contamination—while enabling theoretically unlimited reasoning depth.

Key takeaways include:

  • Iteration over Accumulation: Sustainable reasoning emerges from periodic synthesis, not endless memory expansion.
  • Unbounded Scalability: The Markovian framework allows agents to operate over thousands of interactions without degradation.
  • Broad Applicability: IterResearch works both as a training framework for reinforcement learning and as a plug-and-play prompting method for existing LLMs.

Ultimately, IterResearch reveals that building smarter agents isn’t about bigger models or larger context windows—it’s about better thinking structures. By teaching machines to pause, summarize, and rebuild their workspace, we take a decisive step toward creating AI systems capable of genuine research and long-term reasoning in the wild.