In the quest for more intelligent AI, we’ve often equated thinking with generating longer and more detailed chains of thought. The prevailing idea was: if a model “thinks longer,” it will eventually arrive at the right answer. This approach has driven substantial progress — but it has a fundamental ceiling.

For truly complex problems — those that require creative leaps, checking intermediate steps, or course-correcting from a flawed path — simply extending a monologue isn’t enough.

What if, instead of just thinking longer, we could teach models to think smarter? That is the core idea behind a new wave of agentic AI. An agentic model doesn’t just talk; it acts. It can use tools, such as a Python interpreter, to explore, calculate, and verify its reasoning, then learn from the feedback it receives.

However, training such agents at scale is notoriously difficult. It’s computationally expensive, and feedback from tools can be noisy or misleading.

Enter rStar2-Agent, a new 14-billion-parameter model from Microsoft Research that redefines what’s possible with efficient agentic training. Despite its relatively small size, it achieves state-of-the-art performance on complex math reasoning tasks — even surpassing models over 40× larger, like the 671B DeepSeek-R1.

A table comparing the performance of rStar2-Agent-14B against other leading models on math benchmarks like AIME24, AIME25, and HMMT25.

How did the researchers achieve this? Through three key innovations:

  1. A highly efficient infrastructure for agentic reinforcement learning.
  2. GRPO-RoC, a novel RL algorithm that learns effectively from noisy tool feedback.
  3. A compute-efficient training recipe that cultivates advanced reasoning with minimal resources.

As shown below, rStar2-Agent reaches top-tier performance in just 510 RL steps, a feat other models take thousands of steps to approach.

A line plot showing rStar2-Agent-14B (teal line) reaching over 80% accuracy on the AIME24 benchmark in just 510 training steps, far outpacing the much larger DeepSeek-R1-Zero (purple line).

In this post, we’ll walk through the technical design of rStar2-Agent, unpacking how it works, why it’s so effective, and what its success means for the future of AI that reasons, reflects, and solves problems.


From Monologue to Dialogue: The Agentic Approach

Traditional large language models (LLMs) tackle reasoning tasks using a Chain of Thought (CoT) — a continuous stream of text with each step written out, much like a student showing their work.

Agentic models take this further. They run calculations in external tools (e.g., Python), inspect outputs, and decide their next steps based on results. This transforms the reasoning into an interactive multi-turn dialogue between the model and its tools.

Example Workflow:

  1. Turn 1 (Model): “I’ll write a Python script to test prime numbers.” → <tool_call>
  2. Turn 1 (Environment): Runs Python code, returns output.
  3. Turn 2 (Model): “Interesting; let me verify with another script.” → <tool_call>
  4. Turn 2 (Environment): Verification passes → returns True.
  5. Turn 3 (Model): “Great, verification passed; final answer is 17.”

This back-and-forth lets the model offload heavy computation, verify its logic, and recover from mistakes — essential for solving hard problems.

The team designed a structured JSON format for tool calls and an explicit prompt template to guide usage:

The prompt template used for rStar2-Agent, showing sections for system instructions, tool definitions, and the user’s question.
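
To make the dialogue concrete, here is a minimal sketch of such a loop in Python. It is not rStar2-Agent's actual implementation: the `generate` callable, the `run_python` stub, and the exact JSON fields are placeholders, and only the `<tool_call>` and answer-tag conventions are taken from the post.

```python
import json
import re

def run_python(code: str) -> str:
    """Illustrative stand-in for an isolated code sandbox.
    Here we just exec() locally; a real system would route the call to the
    remote environment service described later in this post."""
    local_vars = {}
    exec(code, {}, local_vars)      # NOTE: unsafe outside a proper sandbox
    return str(local_vars.get("result", ""))  # "result" is an assumed convention

def agent_loop(generate, question: str, max_turns: int = 8) -> str:
    """Multi-turn reasoning loop: each turn, the model either emits a JSON
    tool call wrapped in <tool_call> tags or a final answer in <answer> tags."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = generate(messages)                      # model produces one turn
        messages.append({"role": "assistant", "content": reply})

        call = re.search(r"<tool_call>(.*?)</tool_call>", reply, re.S)
        if call:                                        # model asked to run code
            payload = json.loads(call.group(1))         # e.g. {"name": "python", "code": "..."}
            observation = run_python(payload["code"])
            messages.append({"role": "tool", "content": observation})
            continue                                    # feed output back, next turn

        answer = re.search(r"<answer>(.*?)</answer>", reply, re.S)
        if answer:                                      # model committed to an answer
            return answer.group(1).strip()
    return ""                                           # gave up within the turn budget
```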


The Core Method: Smarter Learning with GRPO-RoC

Foundation: Group Relative Policy Optimization (GRPO)

GRPO trains the model with an outcome-only reward: 1 if the final answer is correct, 0 otherwise. This simple signal sidesteps the reward hacking that more elaborate, process-based reward models can invite.

The model generates a group of solutions for each problem and updates its policy based on relative performance within the group (advantages):

The mathematical formula for the GRPO objective function.

The advantage score normalizes each trajectory’s reward against the group average:

The formula for calculating the advantage in GRPO, which normalizes the reward of a trajectory against the mean and standard deviation of rewards in its group.

Binary reward:

The formula for the outcome-only binary reward, where r_i is 1 if the answer is equivalent to the ground truth and 0 otherwise.
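
As a rough illustration (not the paper's code), the binary reward and group-relative advantage can be computed as follows; the `is_equivalent` checker and the small epsilon guarding against zero variance are assumptions.

```python
from statistics import mean, pstdev

def binary_rewards(answers, ground_truth, is_equivalent):
    """Outcome-only reward: 1 if the final answer matches the ground truth, else 0."""
    return [1.0 if is_equivalent(a, ground_truth) else 0.0 for a in answers]

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each trajectory's reward against the group mean and std:
    A_i = (r_i - mean(r)) / (std(r) + eps).
    The eps keeps the division defined when every rollout gets the same reward."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 8 rollouts for one problem, 3 of them correct.
rewards = [1, 0, 0, 1, 0, 0, 1, 0]
print(group_relative_advantages(rewards))
# correct rollouts get positive advantages, incorrect ones negative
```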


Problem: Reinforcing Bad Habits

Outcome-only rewards work well for plain text reasoning, but introduce issues with tool use:

A trajectory may contain multiple failed tool calls (bugs, timeouts, etc.) but still luckily yield the correct answer. GRPO would reward it equally, subtly teaching the model that tool errors are fine.

As shown below, with naive GRPO the tool call error rate within correct trajectories plateaus instead of declining:

Two plots showing that with standard GRPO (purple line), the tool call error rate in successful trajectories plateaus, while with GRPO-RoC (green line), it continues to decrease.


Solution: Resample-on-Correct (RoC)

GRPO-RoC refines this by filtering training examples:

  1. Oversample: Generate 2× the usual group size.
  2. Separate: Split into positive (correct answer) and negative piles.
  3. Downsample Asymmetrically:
    • Negative: Randomly sample half, preserving diverse failure modes.
    • Positive: Sample half by quality — fewest tool errors, proper formatting.

Tool error penalty:

The formula for the tool error penalty, which is based on the ratio of error tool calls to total tool calls.

Format penalty:

The formula for the format penalty, which penalizes trajectories with missing or multiple answer tags.

Lower-penalty positive samples are more likely to be included. This way, the model learns from clean, efficient reasoning traces.
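
Putting the steps together, a minimal sketch of the downsampling might look like this. The rollout fields, the equal weighting of the two penalties, and keeping the cleanest half of positives deterministically (the paper instead makes lower-penalty positives more likely to be sampled) are simplifying assumptions.

```python
import random

def roc_downsample(rollouts):
    """Resample-on-Correct: from a 2x oversampled group, keep half of each pile.
    Negatives are kept uniformly at random; positives are kept by quality.

    Each rollout is assumed to be a dict with fields:
      correct (bool), tool_calls (int), tool_errors (int), answer_tags (int).
    """
    positives = [r for r in rollouts if r["correct"]]
    negatives = [r for r in rollouts if not r["correct"]]

    def penalty(r):
        # Tool-error penalty: fraction of tool calls that failed.
        err = r["tool_errors"] / max(r["tool_calls"], 1)
        # Format penalty: anything other than exactly one answer tag.
        fmt = 0.0 if r["answer_tags"] == 1 else 1.0
        return err + fmt  # equal weighting assumed, for illustration only

    # Negatives: uniform random downsampling preserves diverse failure modes.
    kept_neg = random.sample(negatives, len(negatives) // 2)
    # Positives: favor clean traces with few tool errors and proper formatting.
    kept_pos = sorted(positives, key=penalty)[: len(positives) // 2]
    return kept_pos + kept_neg
```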

The final GRPO-RoC objective:

The final objective function for GRPO-RoC, which incorporates the Resample-on-Correct strategy.


Building the Engine: Scalable Agentic RL Infrastructure

Training an agentic model at this scale requires overcoming two major bottlenecks.

A diagram showing the overall architecture of the agentic RL infrastructure, with a Rollout Scheduler managing LLM inference and an Environment Service for tool calls.

1. Tool Call Bottleneck:
Large-scale batches can trigger tens of thousands of tool executions. Running these on the training nodes overwhelms their CPUs and risks executing unsafe, model-generated code alongside training.

Solution:
A dedicated Environment Service, distributed across the cluster's CPUs, isolates tool execution from the training process. It handles up to 45K concurrent calls per step with an average latency of about 0.3 seconds.

A graph showing that the code environment can handle up to 45,000 concurrent tool calls per step while keeping latency consistently low at around 0.3 seconds.
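
The post does not include the service's code, but the core idea (decoupling tool execution from training and bounding concurrency) can be sketched with asyncio. The `execute_remotely` call below is a hypothetical RPC to sandboxed CPU workers, stubbed out with a sleep.

```python
import asyncio

MAX_CONCURRENT_CALLS = 45_000   # the reported per-step concurrency budget

async def execute_remotely(code: str) -> str:
    """Hypothetical RPC to a sandboxed CPU worker elsewhere in the cluster.
    A short sleep stands in for the ~0.3 s average execution latency."""
    await asyncio.sleep(0.3)
    return "ok"

async def run_batch(tool_calls):
    """Dispatch one training step's worth of tool calls without overwhelming
    the environment service."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_CALLS)

    async def bounded(code):
        async with semaphore:            # cap the number of in-flight executions
            return await execute_remotely(code)

    return await asyncio.gather(*(bounded(c) for c in tool_calls))

# outputs = asyncio.run(run_batch(["print(1 + 1)"] * 1000))
```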


2. Rollout Imbalance:
Different problems require different numbers of turns and tool calls, so statically assigning rollouts to GPUs leaves some GPUs idle while others grind through long trajectories.

Solution:
A dynamic load-balanced rollout scheduler assigns work based on live GPU KV cache capacity, dispatching tool calls asynchronously.

A comparison of static rollout allocation (top), which leads to significant idle time, versus the dynamic load-balanced scheduler (bottom), which maximizes GPU utilization.
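
In simplified form, the scheduling decision is: send the next rollout to whichever inference worker currently has the most free KV cache. The `Worker` bookkeeping and the `expected_tokens` estimate below are assumptions for illustration, not the actual scheduler.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Worker:
    name: str
    kv_cache_total: int       # KV-cache capacity, e.g. in tokens
    kv_cache_used: int = 0    # tokens reserved by in-flight rollouts

    @property
    def free(self) -> int:
        return self.kv_cache_total - self.kv_cache_used

def assign_rollout(workers, expected_tokens: int) -> Optional[Worker]:
    """Greedy load balancing: pick the worker with the most free KV cache that
    can still fit the rollout; return None so the caller can wait and retry."""
    candidates = [w for w in workers if w.free >= expected_tokens]
    if not candidates:
        return None
    best = max(candidates, key=lambda w: w.free)
    best.kv_cache_used += expected_tokens   # reserve capacity up front
    return best
```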


The Recipe for Success

1. Non-Reasoning Cold Start

Instead of reasoning-heavy supervised fine-tuning (SFT), the team started with zero reasoning data, teaching only:

  • General instruction following
  • Tool call JSON formatting

A table showing that the non-reasoning SFT improves tool use and instruction following without significantly boosting math reasoning, setting a clean slate for RL.

This prevents overfitting to specific reasoning styles before RL.


2. Multi-Stage RL

Training unfolds in three stages that progressively increase the maximum response length and problem difficulty (summarized in the sketch after the list):

A table comparing the rStar2-Agent training recipe to other models, highlighting its shorter training lengths and more targeted difficulty filtering.

  • Stage 1: 42K problems, 8K max length → concise strategies.
  • Stage 2: 12K max length → deeper reasoning.
  • Stage 3: 17.3K hardest problems only → push limits.
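
For quick reference, the schedule from the list above can be captured in a small config; fields the post does not state are left as None rather than guessed.

```python
# Training schedule as described above. "max_length" is the per-stage cap on
# response length (presumably in tokens); None marks values not given in the post.
RL_STAGES = [
    {"stage": 1, "problems": 42_000, "max_length": 8_000,  "focus": "concise strategies"},
    {"stage": 2, "problems": None,   "max_length": 12_000, "focus": "deeper reasoning"},
    {"stage": 3, "problems": 17_300, "max_length": None,   "focus": "hardest problems only"},
]
```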

Performance improves steadily:

Three plots tracking the AIME24/25 scores and average response length across the three RL training stages, showing steady improvement.


Results and Analysis

State-of-the-Art Math Reasoning

The main results table showing rStar2-Agent-14B achieving top-tier performance on competitive math benchmarks against leading models.


Thinking Smarter, Not Longer

rStar2-Agent produces shorter responses while achieving higher accuracy.

A table showing that rStar2-Agent-14B produces much shorter responses on AIME benchmarks compared to other high-performing models.


Strong Generalization

Math-only training improved reasoning in science tasks and preserved general capabilities:

A table demonstrating rStar2-Agent-14B’s strong generalization to science reasoning and other general benchmarks after being trained only on math.


Ablation: Power of RoC

Without RoC (i.e., GRPO with tools but no resampling), accuracy drops and responses grow longer:

Ablation study results showing that GRPO-RoC (green) achieves higher accuracy and shorter response lengths compared to GRPO with tools but without RoC (purple).


Inside the Mind of an Agent

Agentic RL fosters advanced cognitive behaviors. Analyzing high-entropy tokens in the model's outputs reveals two distinct patterns:

  1. Forking Tokens: “But before I conclude…” triggers self-reflection.
  2. Reflection Tokens on Tool Output: Unique to agentic models — processing environment feedback to debug or verify.

Example: after encountering a GeneratorsNeeded error, the model diagnoses its misuse of SymPy, rewrites the code more robustly, and solves the problem.

An example trace where the model encounters a code error, reflects on the error message (highlighted in green), generates corrected code, and successfully solves the problem.


Conclusion and Future Directions

The rStar2-Agent story is a blueprint for efficient reasoning-focused AI:

  • Noise-resistant RL algorithm: GRPO-RoC
  • Scalable infrastructure: Tool-call isolation & dynamic load balancing
  • Compute-savvy training: Multi-stage RL, cold start SFT

This work shows that scaling reasoning skills — not just parameter counts — is the path forward. The open-sourced code and recipes invite the community to push agentic training into new domains and more powerful tools.

The era of the agentic LLM is here: models that can generate text, reason deeply, interact with environments, and adapt intelligently.