Introduction: The Agent’s Dilemma
Imagine an AI agent tasked with a complex research question such as, “What are the long-term economic impacts of adopting renewable energy sources in developing nations?” To answer this, the agent can’t rely only on its pre-trained knowledge—it must act like a real researcher: search the web, read academic papers, analyze data, and compose a coherent answer from many sources.
This is the promise of deep research agents—AI systems that autonomously navigate the vast sea of information to construct new knowledge. Early approaches demonstrated remarkable potential, yet they all hit the same fundamental barrier: the context window.
Most current agents operate under what the authors of a new paper call the mono-contextual paradigm. These agents take the initial question, perform a search, record their thought process, and append everything—thoughts, tool results, and retrieved information—into a single, ever-growing text block. They then repeat this sequence. While simple, this approach quickly collapses under the weight of long tasks, facing two critical issues:
- Context Suffocation: As exploration progresses, the context window fills with accumulated data, previous reasoning, and old observations. Each new step leaves less room for fresh thinking. Eventually, the model becomes overwhelmed by its own history, producing superficial conclusions or stopping prematurely.
- Noise Contamination: Not every search yields useful results. Early mistakes, irrelevant pages, and failed attempts linger permanently in the prompt. Their noise spreads through every subsequent step, eroding clarity and corrupting the agent’s overall reasoning.
To fix these limitations, researchers from Alibaba Group and Renmin University of China propose IterResearch, a groundbreaking paradigm that reimagines long-horizon research as a loop of exploration and synthesis rather than a linear accumulation of data. The key idea is simple yet profound: instead of storing everything forever, the agent periodically reconstructs its workspace—preserving insights, discarding noise, and recovering focus. Modeled formally as a Markov Decision Process (MDP), this iterative structure allows AI agents to reason coherently even after thousands of interactions.
In this post, we’ll explore how the IterResearch paradigm works, why it solves the context-window problem, and what its success tells us about the future of autonomous reasoning.
Background: From RAG to the Mono-Contextual Trap
Before understanding IterResearch, it helps to recall the current landscape of AI reasoning agents. A well-known technique, Retrieval-Augmented Generation (RAG), lets language models fetch relevant documents from a fixed database like Wikipedia to boost factual responses. However, RAG systems are confined to static resources; they cannot actively probe dynamic environments such as the live web.
Deep research agents take the next leap. Equipped with tools like web search, browsers, and Python interpreters, they can navigate real-world information sources to construct new knowledge. The prevailing way to structure them has been the mono-contextual paradigm, typified by methods such as ReAct (Reason + Act).

Figure 1: The mono-contextual paradigm (top) linearly accumulates context, causing suffocation and noise. IterResearch (bottom) reconstructs the workspace each round, maintaining clarity and sustained reasoning.
In the top half of Figure 1, we see the linear process: Think → Act → Observe, repeated endlessly. Each iteration appends its output to a massive transcript. For short tasks, this is manageable. For long ones, it becomes chaotic—forcing the agent to reread its entire history at every step, wasting compute and attention.
IterResearch breaks this loop through cyclical reconstruction: by strategically forgetting irrelevant history and summarizing essential insights, it maintains a clean, bounded workspace at all times.
The Core Method: IterResearch’s Cycle of Synthesis
IterResearch replaces linear accumulation with structured iteration. Like a human researcher, it operates in cycles—reading, noting, synthesizing, and refocusing. Each round integrates what was learned in previous steps and prepares for the next.
This behavior is formalized using a Markov Decision Process (MDP). The MDP has three key components:
- State (\(S\)): The Agent’s Workspace. Each state \(s_t\) contains only:
  - The original question \(q\),
  - An evolving report \(\mathcal{M}_t\) summarizing essential findings,
  - The most recent interaction \(\{a_{t-1}, \mathrm{TR}_{t-1}\}\), representing the last action and its feedback.
- Decision (\(\mathcal{D}\)): The Agent’s Output. At each step, the agent issues a structured decision \(d_t = (\text{Think}_t, \mathcal{M}_{t+1}, a_t)\):
  - Think – reasoning about progress and what to seek next;
  - Report – updating memory by synthesizing validated findings;
  - Action – choosing either a tool call (search, browse, compute) or producing the final answer.
- Transition (\(\mathcal{T}\)): Workspace Reconstruction. After executing the action \(a_t\) and receiving tool response \(\mathrm{TR}_t\), the system builds the next state \(s_{t+1} = (q, \mathcal{M}_{t+1}, \{a_t, \mathrm{TR}_t\})\). The full history is summarized into a single, compact report—preventing the runaway prompt growth seen in traditional agents. A minimal sketch of this loop appears below.
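To make the cycle concrete, here is a minimal Python sketch of the round loop. The `llm.decide` and `tools.run` interfaces are hypothetical stand-ins for illustration, not the paper’s actual code:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class State:
    """One round's workspace: question, evolving report, last interaction."""
    question: str
    report: str                                # M_t: synthesized findings so far
    last_interaction: Optional[tuple] = None   # (a_{t-1}, TR_{t-1}); None on round 1

def iter_research(question: str, llm: Any, tools: Any, max_rounds: int = 128) -> str:
    state = State(question=question, report="")
    for _ in range(max_rounds):
        # The agent conditions ONLY on the current state, never the full history.
        _think, new_report, action = llm.decide(state)
        if action.kind == "answer":
            return action.payload              # terminal action ends the episode
        tool_result = tools.run(action)        # e.g. search, browse, compute
        # Workspace reconstruction: the next state is rebuilt from scratch,
        # so prompt size stays bounded no matter how many rounds have passed.
        state = State(question, new_report, (action, tool_result))
    return state.report                        # budget exhausted: fall back to the report
```

Note how nothing accumulates across iterations: the only artifacts that survive a round are the rewritten report and the single most recent tool result.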

Figure 2: The structured loop of IterResearch where each round reconstructs state from a concise report and last interaction, maintaining Markov consistency and bounded memory.
The Secret Sauce: Markovian Workspace Reconstruction
This reconstruction mechanism enforces the Markov property: every decision depends only on the current state, not on the full history. The workspace therefore stays constant-sized (\(O(1)\)), whereas a ReAct-style context grows linearly with the number of steps (\(O(t)\)).

Figure 3: Mono-contextual approaches suffer from linear context growth, while IterResearch maintains constant workspace size.
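To make the contrast concrete, here is a toy back-of-the-envelope comparison; the token budgets are illustrative assumptions, not figures from the paper:

```python
# Illustrative per-round token budgets (assumed for this sketch).
OBS_TOKENS, THINK_TOKENS, REPORT_CAP = 1_500, 300, 2_000

def react_prompt_tokens(t: int) -> int:
    # Mono-contextual: every past thought and observation stays in the prompt.
    return t * (THINK_TOKENS + OBS_TOKENS)          # O(t)

def iterresearch_prompt_tokens(t: int) -> int:
    # Markovian: question + capped report + last interaction only.
    return REPORT_CAP + THINK_TOKENS + OBS_TOKENS   # O(1)

for t in (1, 10, 100, 1_000):
    print(f"round {t:>5}: ReAct={react_prompt_tokens(t):>9,} tokens  "
          f"IterResearch={iterresearch_prompt_tokens(t):,} tokens")
```

Under these assumptions, a ReAct-style prompt passes 1.8 million tokens by round 1,000, while the IterResearch workspace never exceeds a few thousand.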
This yields two major benefits:
- Freedom from Context Suffocation: With constant workspace size, the agent’s reasoning capacity never diminishes—even after thousands of steps.
- Automatic Noise Filtering: Only information intentionally preserved in the evolving report carries forward. Irrelevant or incorrect data are naturally discarded.
The result is a scalable agent capable of thousands of interactions while maintaining coherent thought—an unprecedented feat in long-horizon reasoning.
Training an Efficient Researcher: Efficiency-Aware Policy Optimization (EAPO)
Designing IterResearch’s behavior requires teaching it how to explore efficiently. A simple “reward if correct” signal is insufficient; the model must learn to reach conclusions quickly and economically.
To achieve this, the authors designed Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that combines reward shaping with mechanisms for stable distributed training.
1. Geometric Discounting for Efficiency
EAPO reshapes the sparse reward signal using geometric discounts:

Equation 1: Reward shaping enforces efficiency. Earlier correct actions yield higher discounted rewards.
With this formulation, even when two trajectories reach correct answers, the shorter one earns more cumulative reward. This subtle change pressures the model to favor concise, focused reasoning over exhaustive exploration.
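The paper’s exact equation is shown in the image above; a shaping consistent with its description would give a trajectory \(\tau\) that terminates correctly after \(T\) rounds the reward

\[
R(\tau) \;=\; \gamma^{\,T}\, r, \qquad \gamma \in (0, 1),
\]

where \(r \in \{0, 1\}\) marks final-answer correctness. Because \(\gamma < 1\), every extra round shrinks the reward geometrically, ranking correct trajectories by how quickly they finish. (This specific form is my reconstruction from the caption, not a verbatim reproduction.)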
2. Stable Training via Adaptive Downsampling
The iterative paradigm produces multiple samples per trajectory—one for each round—making sample counts variable across questions. To maintain stable distributed training, the researchers introduced adaptive downsampling:

Equation 2: Adaptive downsampling ensures balanced batches for distributed reinforcement learning.
This technique rounds total samples to fit evenly across GPUs with negligible data loss (<1%), ensuring smooth large-scale optimization. Combined with Group Sequence Policy Optimization (GSPO), EAPO trains IterResearch to be both accurate and efficient.
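Before looking at the objective itself, here is a sketch of what the balancing step might look like; only the round-down-to-a-multiple arithmetic comes from the paper, while the random-drop selection rule is my assumption:

```python
import random

def adaptive_downsample(samples: list, world_size: int) -> list:
    """Drop the fewest samples needed so the batch splits evenly across GPUs."""
    keep = (len(samples) // world_size) * world_size  # round down to a multiple
    return random.sample(samples, keep)               # discard the small remainder

batch = adaptive_downsample(list(range(1027)), world_size=8)
assert len(batch) == 1024  # 3 of 1,027 samples dropped (~0.3% data loss)
```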

Equation 3: GSPO objective integrates discounted rewards and sequence-level optimization to train the agent policy.
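For reference, the GSPO objective as published by its authors optimizes a clipped, sequence-level importance ratio over a group of \(G\) sampled responses; how EAPO’s discounted round rewards feed the group-normalized advantages \(\hat{A}_i\) is my paraphrase of the setup rather than a verbatim reproduction:

\[
\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(s_i(\theta)\,\hat{A}_i,\; \operatorname{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i\Big)\right],
\qquad
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|}
\]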
Experiments and Results: Putting IterResearch to the Test
The authors evaluated IterResearch against a diverse set of baselines—from direct inference with large frontier models to advanced open-source and proprietary deep-research agents.
Main Results: A New State of the Art

Figure 4: IterResearch outperforms state-of-the-art open-source long-horizon agents across multiple benchmarks.
Across six challenging datasets—BrowseComp, BrowseComp-zh, Humanity’s Last Exam (HLE), GAIA, Xbench-DeepSearch, and SEAL-0—IterResearch achieves an average improvement of +14.5 percentage points over all open-source baselines. It even narrows the gap with, and in some cases surpasses, commercial systems like OpenAI’s DeepResearch.

Table 1: IterResearch consistently dominates across diverse reasoning and exploration benchmarks.
IterResearch maintains strong performance in both information-seeking tasks (e.g., BrowseComp) and complex analytical tasks (e.g., GAIA, HLE). In exploratory tasks, it stays focused thanks to report-based synthesis; in analytical ones, it filters noise through periodic reasoning checkpoints.
Ablation Studies: Unpacking the Design

Table 2: Ablations highlight the gains from efficiency-aware training and iterative workspace design.
Two insights stand out:
- Efficiency-Aware Policy Optimization Works: Agents trained with EAPO achieve comparable or better accuracy while using 5.7% fewer turns, confirming that geometric rewards encourage tighter, more purposeful exploration.
- Iteration Beats Accumulation: Even when given identical training data, the iterative paradigm outperforms the mono-contextual agent by 12.6 points on average—even though the latter uses a 64K context versus IterResearch’s 40K. Expanding memory alone cannot overcome the inherent inefficiency of linear accumulation.
Scaling Beyond Limits: 2,048 Interactions and Beyond
To test scalability, the authors ran IterResearch on the BrowseComp benchmark, raising the maximum allowed turns from 2 to an unprecedented 2,048.

Figure 5: Interaction scaling shows remarkable performance growth from 3.5% to 42.5% accuracy as the interaction budget expands from 2 to 2048.
The results are striking. Accuracy climbs from 3.5% at 2 turns to 42.5% at 2048, proving that deeper exploration dramatically increases research performance. Even with thousands of available rounds, the agent manages its search intelligently—using only about 130 turns on average, stopping once sufficient information is gathered.
This scaling suggests that much of the perceived difficulty of long-horizon tasks stems from overly tight interaction budgets rather than from inherent task complexity.
IterResearch as a Prompting Strategy: No Training Required

Figure 6: IterResearch prompt design boosts frontier models such as o3 and DeepSeek-V3.1 compared to traditional ReAct.
Finally, the authors explored whether IterResearch’s structured reasoning loop could serve directly as a prompting strategy—allowing existing large models to inherit its benefits without training.
The results are immensely promising: when tested with OpenAI’s o3 and DeepSeek-V3.1, IterResearch-based prompts outperform ReAct by wide margins—+12.7pp for o3 and +19.2pp for DeepSeek—especially on long-horizon benchmarks like BrowseComp. These gains confirm that IterResearch’s structure provides a universally beneficial “cognitive framework” for controlled reasoning, independent of model architecture.
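For intuition, a round prompt following the IterResearch structure might look like the hypothetical template below; the paper’s exact wording is not reproduced here:

```python
# Hypothetical IterResearch-style round prompt; not the paper's verbatim text.
ROUND_PROMPT = """You are a research agent working in rounds.

Question: {question}

Current report (your synthesized findings so far):
{report}

Result of your last action:
{last_interaction}

Respond with three parts:
1. Think: assess progress and decide what is still missing.
2. Report: rewrite the full report, folding in any validated new findings.
3. Action: either a tool call (search / browse / compute) or the final answer.
"""

def build_round_prompt(question: str, report: str, last_interaction: str) -> str:
    return ROUND_PROMPT.format(
        question=question,
        report=report or "(empty - first round)",
        last_interaction=last_interaction or "(none yet)",
    )
```

Because each round’s prompt is rebuilt from the same three slots, any sufficiently capable model can follow the loop without fine-tuning.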
Conclusion and Implications
IterResearch marks a major evolution in how AI agents handle complex, multi-step reasoning. Instead of letting context grow endlessly, it introduces cycles of exploration and synthesis, reformulated through a Markovian lens. This simple yet profound shift eliminates two longstanding obstacles—context suffocation and noise contamination—while enabling theoretically unlimited reasoning depth.
Key takeaways include:
- Iteration over Accumulation: Sustainable reasoning emerges from periodic synthesis, not endless memory expansion.
- Unbounded Scalability: The Markovian framework allows agents to operate over thousands of interactions without degradation.
- Broad Applicability: IterResearch works both as a training framework for reinforcement learning and as a plug-and-play prompting method for existing LLMs.
Ultimately, IterResearch reveals that building smarter agents isn’t about bigger models or larger context windows—it’s about better thinking structures. By teaching machines to pause, summarize, and rebuild their workspace, we take a decisive step toward creating AI systems capable of genuine research and long-term reasoning in the wild.