Large Language Models (LLMs) have made stunning progress in reasoning tasks—solving math problems, answering questions, writing code—but their capabilities often depend on two key strategies: reinforcement learning (RL) from verifiable feedback and tool use (such as calling a web search or executing Python code). When combined, these strategies create powerful “LLM agents” that can reason interactively, retrieve facts, and perform calculations.

Yet, the prevailing approaches to building these agents face a major bottleneck. Most train a single, monolithic policy model that interleaves reasoning, tool calls, and answer generation within one long context window. As tasks get more complex—with more tools and longer reasoning horizons—this all-in-one setup becomes unstable during training and struggles to generalize to new tasks.

An alternative comes from agentic systems, which distribute work across multiple specialized modules—like a planner, executor, and verifier—that collaborate through shared memory. This modular design scales better but is often static: the modules are pre-trained LLMs guided by handcrafted prompts instead of learned coordination strategies. As a result, they cannot adapt or improve through experience.

Enter AGENTFLOW, introduced by researchers at Stanford University and collaborators. It combines the best of both worlds: a modular, multi-agent system that is also trainable. AGENTFLOW’s core innovation is its on-policy training algorithm, Flow-based Group Refined Policy Optimization (Flow-GRPO), which teaches a planner to reason and use tools effectively inside the live system. The framework’s results are remarkable—its 7B-parameter implementation outperforms larger specialized models and even surpasses GPT-4o across a range of reasoning benchmarks.

A radar chart and several bar charts showing AgentFlow’s performance boost from Flow-GRPO training and its superiority over other models like GPT-4o.

Figure 1: Performance of AGENTFLOW before and after Flow-GRPO tuning (left) and comparison against top baselines (right). Flow-GRPO enables a 7B-scale model to outperform much larger systems.


The Two Paradigms of Tool-Using LLMs

Before diving into how AGENTFLOW works, let’s first look at the two major types of tool-using LLMs it aims to unify.

1. Monolithic Tool-Integrated Models

In this design, a single LLM handles everything: thinking, deciding when to call a tool, and producing the final answer. The model may output a chain of thoughts with tags such as <think> followed by tool calls like <tool_call>, all in one continuous stream. Feedback is applied at the end—usually a reinforcement signal indicating whether the final answer is right or wrong.
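
To make this concrete, here is a minimal sketch of such a single-policy loop. It is purely illustrative: `call_llm` and `run_tool` are injected stand-ins for a real model and real tools, and the tag handling simply mirrors the <think>/<tool_call> convention described above rather than any particular model's protocol.

```python
def monolithic_rollout(question, call_llm, run_tool, max_steps=8):
    """call_llm(text) -> str continuation; run_tool(name, args) -> str result.
    Both are injected stand-ins, not real APIs."""
    context = question  # one continuous stream holds reasoning, tool calls, and results
    for _ in range(max_steps):
        chunk = call_llm(context)  # may contain <think>...</think> and one <tool_call>
        context += chunk
        if "<tool_call>" in chunk:
            payload = chunk.split("<tool_call>", 1)[1].split("</tool_call>", 1)[0]
            name, _, args = payload.partition(":")
            context += f"<tool_result>{run_tool(name.strip(), args.strip())}</tool_result>"
        else:
            return chunk  # no tool call: treat the last chunk as the final answer
    return context  # step budget exhausted
```

Everything, including every intermediate tool result, lives in one context window, which is exactly why credit assignment gets harder as the horizon grows.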

While effective for simple single-tool tasks, this method rapidly becomes unstable as tool diversity and reasoning depth increase:

  • Long contexts make credit assignment and optimization difficult.
  • Generalization is weak: the agent often overfits to specific tools, prompts, or data domains.

2. Static Agentic Systems

Agentic systems such as AutoGen or MetaGPT distribute reasoning across specialized modules: one model plans, another executes, and another verifies results. The modules collaborate through a shared memory over multiple reasoning turns. This structure scales well and allows targeted specialization, but most such systems are training-free: they rely on fixed prompts and manual rules, preventing dynamic adaptation.

A comparison of monolithic tool-integrated models and modular agentic systems.

Figure 3: Monolithic tool-integrated LLMs (left) interleave reasoning and tool calls in a single stream. Static agentic systems (right) decompose tasks but lack learnable coordination.

AGENTFLOW was designed to overcome these limitations—achieving both modularity and learnability.


Inside AgentFlow: A Trainable Modular System

AGENTFLOW is structured as a team of four interacting modules, each playing a distinct role within an iterative reasoning loop.

An overview of the AgentFlow architecture, showing the four modules (Planner, Executor, Verifier, Generator) interacting through an evolving memory.

Figure 2: Architecture of AGENTFLOW. Each reasoning turn involves planning, execution, verification, and memory update. The planner module is trainable via on-policy RL.

  1. Action Planner (𝒫):
    The core decision-making component—and the only trained module. For each turn \(t\), it reads the query \(q\), the available tools \(K\), and the current memory \(M^t\). It then produces an action \(a^t\): a sub-goal and tool selection.

  2. Tool Executor (ℰ):
    Executes the chosen action using the corresponding tool and returns an execution result \(e^t\).

  3. Execution Verifier (𝒱):
    Evaluates whether the current information is sufficient or whether reasoning should continue. It emits a binary signal \(v^t\), determining if the loop stops or proceeds.

  4. Solution Generator (𝒢):
    Once finished (when \(v^t = 1\)), it synthesizes the final answer \(o\) using the accumulated memory \(M^T\).

The shared memory \(M\) acts as a structured transcript of the reasoning process—explicit, deterministic, and transparent. This iterative cycle continues until completion or a maximum turn threshold.
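
As a rough mental model, the whole loop fits in a few lines of Python. This is a sketch under assumptions rather than the released implementation: `planner`, `executor`, `verifier`, and `generator` are injected callables standing in for the four modules, and the memory is shown as a simple list of records.

```python
def agentflow_episode(query, tools, planner, executor, verifier, generator, max_turns=10):
    """One AGENTFLOW episode, schematically.
    planner(query, tools, memory) -> action a^t (sub-goal + tool choice)
    executor(action)              -> execution result e^t
    verifier(query, memory)       -> 1 if the gathered information suffices, else 0
    generator(query, memory)      -> final answer o
    All four are stand-ins; in the paper only the planner is trained."""
    memory = []  # shared, explicit transcript of the reasoning process (M^t)
    for turn in range(max_turns):
        action = planner(query, tools, memory)   # plan the next sub-goal and pick a tool
        result = executor(action)                # run the chosen tool
        memory.append({"turn": turn, "action": action, "result": result})
        if verifier(query, memory) == 1:         # v^t = 1: enough information, stop
            break
    return generator(query, memory)              # synthesize the answer from memory
```

Here the memory plays the role of the structured transcript described above, in contrast to the single ever-growing context of the monolithic design.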


Learning in the Flow: Flow-GRPO

Training the planner to make good decisions is not trivial: one early mistake can cascade into later steps, and feedback is often received only at the end. This is the long-horizon credit assignment problem.

AGENTFLOW addresses this with Flow-based Group Refined Policy Optimization (Flow-GRPO)—a reinforcement learning method that directly optimizes the planner in the live loop of the multi-turn system.

A schematic of the Flow-GRPO optimization process, showing rollouts from a policy and reference model being used to compute rewards and update the policy.

Figure 4: Flow-GRPO optimization process. The algorithm uses full multi-turn trajectories and broadcasts verifiable, trajectory-level rewards to every step.

Key Ideas of Flow-GRPO

  1. Final-Outcome Reward Broadcasting:
    After generating a complete trajectory, a single verifiable reward \(r \in \{0,1\}\) (e.g., correct or incorrect answer) is assigned to every turn of that rollout:

    \[ r = R(a^t) = \bar{R}(o, q, y^*), \quad \forall t = 1, \dots, T. \]

    This transforms multi-turn optimization into a series of independent single-turn updates while maintaining consistency with the global outcome.

  2. On-Policy Learning:
    The planner updates its parameters using real-time trajectories from the actual agentic system, ensuring alignment with live multi-turn interaction dynamics.

  3. Group-Normalized Advantages:
    Rewards are normalized across a group of parallel rollouts, sharpening the learning signal and reducing variance:

    \[ A_i^t = \frac{\bar{R}(o_i) - \text{mean}(\{\bar{R}(o_k)\})}{\text{std}(\{\bar{R}(o_k)\})}. \]

  4. Stable Optimization via PPO & KL Regularization:
    Flow-GRPO incorporates token-level importance ratios, clipping, and KL penalties to maintain stability:

    \[ \mathcal{J}_{\mathrm{Flow-GRPO}} = \mathbb{E}\left[\min\{\rho A, \mathrm{clip}(\rho,1-\epsilon,1+\epsilon)A\}\right] - \beta D_{\mathrm{KL}}(\pi_\theta || \pi_{\mathrm{ref}}). \]

This framework unifies robust credit assignment, outcome-driven reward propagation, and stability—enabling the planner to learn effective strategies for long-horizon reasoning directly in the flow of multi-turn interaction.
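
To make the recipe concrete, here is a compact sketch of the per-group loss, following the broadcast, normalize, and clip steps above. It is a simplification rather than the authors' code: the tensors are assumed dense and unmasked, the argument names are hypothetical, and the KL term uses a crude estimate.

```python
import torch

def flow_grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, beta=0.01):
    """Schematic Flow-GRPO loss for one group of G rollouts.
    logp_new / logp_old / logp_ref: [G, L] per-token log-probs of the planner's
        outputs under the current, behavior, and reference policies.
    rewards: [G] final-outcome rewards in {0, 1}, one per trajectory."""
    # Broadcast the single trajectory-level reward to every turn/token via a
    # group-normalized advantage that is constant along each trajectory.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)        # [G]
    adv = adv.unsqueeze(1).expand_as(logp_new)                       # [G, L]

    # Token-level importance ratios with PPO-style clipping.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_obj = torch.min(unclipped, clipped).mean()

    # KL penalty toward the reference policy (a crude sample estimate here;
    # real implementations use a more careful estimator and proper masking).
    kl = (logp_new - logp_ref).mean()
    return -(policy_obj - beta * kl)   # negate so that minimizing maximizes the objective
```

One consequence of group normalization is that a group whose rollouts all succeed (or all fail) yields zero advantage everywhere, so the useful learning signal comes from groups with mixed outcomes.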


Experiments: Putting AgentFlow to the Test

The authors evaluated AGENTFLOW across ten benchmarks covering four reasoning domains:

  • Knowledge-intensive search: Bamboogle, 2Wiki, HotpotQA, Musique
  • Agentic tasks: GAIA benchmark
  • Mathematics: AIME24, AMC23, GameOf24
  • Science: GPQA, MedQA

All modules used Qwen2.5-7B-Instruct, with only the Planner trained via Flow-GRPO.

Head-to-Head Results

Table showing AgentFlow’s accuracy on search-intensive and agentic tasks compared to other models.

Table 1: Accuracy comparison on search-intensive and agentic tasks. Flow-GRPO substantially boosts AGENTFLOW across every benchmark.

Table showing AgentFlow’s accuracy on mathematical and scientific reasoning tasks compared to other models.

Table 2: Comparison on mathematical and scientific tasks. AGENTFLOW consistently outperforms specialized baselines and even GPT-4o.

Key findings:

  • Average accuracy gains: +14.9% (search) | +14.0% (agentic) | +14.5% (math) | +4.1% (science)
  • Surpasses GPT-4o (≈200B parameters) using only a 7B backbone.

What the Planner Learns After Flow-GRPO

1. Smarter Tool Selection

The fine-tuned planner adjusts its tool-use strategy according to the task domain. For example, in general knowledge problems (2Wiki), Google Search usage increases; in specialized domains (MedQA), the agent shifts toward Wikipedia and Web Search.

Bar charts showing how the tool call ratio changes after Flow-GRPO fine-tuning for two different tasks.

Figure 5: Tool-choice optimization after Flow-GRPO fine-tuning. The planner actively learns domain-appropriate tool strategies.

2. More Reliable Execution

Flow-GRPO also improves how tools are invoked: tool-calling error rates across all benchmarks drop steadily as training progresses.

Line chart showing the decrease in tool-calling error rate as training progresses.

Figure 6: Tool-calling errors decline with training, showing better reliability and argument formatting.

3. Autonomous Problem Discovery

Qualitative examples reveal new solution pathways discovered through live experience. In one case, an untuned agent gets trapped in repetitive failure loops; after Flow-GRPO training, the planner finds entirely new approaches to reach correct answers.

A case study showing an agent failing before tuning but succeeding by exploring a new pathway after tuning.

Figure 7: The trained agent (right) recovers from earlier errors and explores new strategies.


Why In-the-Flow Training Matters

An ablation study compared three planner training approaches:

  1. Frozen Planner (no training)
  2. Offline Supervised Fine-Tuning (SFT)
  3. On-policy Flow-GRPO

Table comparing the performance of AgentFlow with different training methods for the planner.

Table 3: Flow-GRPO delivers the only consistent improvement. Offline SFT collapses due to misaligned supervision.

Results show that simply replacing the planner with GPT-4 yields only modest gains, while offline SFT causes a drastic collapse in performance: static imitation cannot capture live system dynamics. In contrast, Flow-GRPO achieves robust, outcome-driven learning, boosting accuracy by over 17% on average.


Training Efficiency and Scaling

Flow-GRPO training is both efficient and scalable. Training reward steadily grows, while response length stabilizes, indicating improving precision and conciseness. Against a monolithic tool-integrated RL baseline (ToRL), AGENTFLOW shows smoother and stronger learning.

Charts showing training dynamics and a performance comparison against a monolithic RL baseline.

Figure 8: (a) Rewards rise as responses stabilize. (b) Flow-GRPO training maintains steady improvement unlike monolithic baselines.

Scaling tests show consistent improvements across model sizes and turn budgets.

Bar charts showing that Flow-GRPO provides consistent gains for both 3B and 7B models.

Figure 9: Flow-GRPO fine-tuning benefits both small (3B) and large (7B) backbones.

A line chart showing that accuracy improves as the maximum allowed reasoning turns increase.

Figure 10: Accuracy increases with greater allowed turns—AgentFlow makes productive use of longer reasoning horizons.


Key Takeaways and Broader Implications

The AGENTFLOW framework offers a new paradigm for building intelligent, dynamic LLM agents:

  1. Trainable Modular Agents Are the Future.
    Decomposing tasks across specialized components enables scalability. Making these systems trainable unlocks adaptive reasoning unavailable in static setups.

  2. In-the-Flow Optimization Solves Adaptation.
    Training directly within the multi-turn loop allows modules to co-adapt and learn collaborative behavior under real conditions.

  3. Elegant Credit Assignment for Long-Horizon Tasks.
    Broadcasting a single final reward to all steps aligns local decisions with global correctness—an intuitive, effective way to handle sparse rewards.

By integrating a learnable planner into a modular system and training it live, AGENTFLOW demonstrates that structure and strategy can beat sheer size. A 7B-parameter agent outperforming GPT-4o illustrates that smarter training and coordination may matter more than scaling alone.

As we move toward more capable autonomous LLM agents, in-the-flow optimization like Flow-GRPO points the way forward: agents that not only think and use tools, but also learn from their own reasoning process to become increasingly effective.