In the quest for more intelligent AI, we’ve often equated thinking with generating longer and more detailed chains of thought. The prevailing idea was: if a model “thinks longer,” it will eventually arrive at the right answer. This approach has driven substantial progress — but it has a fundamental ceiling.
For truly complex problems — those that require creative leaps, checking intermediate steps, or course-correcting from a flawed path — simply extending a monologue isn’t enough.
What if, instead of just thinking longer, we could teach models to think smarter? That is the core idea behind a new wave of agentic AI. An agentic model doesn’t just talk; it acts. It can use tools, such as a Python interpreter, to explore, calculate, and verify its reasoning, then learn from the feedback it receives.
However, training such agents at scale is notoriously difficult. It’s computationally expensive, and feedback from tools can be noisy or misleading.
Enter rStar2-Agent, a new 14-billion-parameter model from Microsoft Research that redefines what’s possible with efficient agentic training. Despite its relatively small size, it achieves state-of-the-art performance on complex math reasoning tasks — even surpassing models over 40× larger, like the 671B DeepSeek-R1.
How did the researchers achieve this? Through three key innovations:
- A highly efficient infrastructure for agentic reinforcement learning.
- GRPO-RoC, a novel RL algorithm that learns effectively from noisy tool feedback.
- A compute-efficient training recipe that cultivates advanced reasoning with minimal resources.
As shown below, rStar2-Agent reaches top-tier performance in just 510 RL steps, a feat other models take thousands of steps to approach.
In this post, we’ll walk through the technical design of rStar2-Agent, unpacking how it works, why it’s so effective, and what its success means for the future of AI that reasons, reflects, and solves problems.
From Monologue to Dialogue: The Agentic Approach
Traditional large language models (LLMs) tackle reasoning tasks using a Chain of Thought (CoT) — a continuous stream of text with each step written out, much like a student showing their work.
Agentic models take this further. They run calculations in external tools (e.g., Python), inspect outputs, and decide their next steps based on results. This transforms the reasoning into an interactive multi-turn dialogue between the model and its tools.
Example Workflow:
- Turn 1 (Model): “I’ll write a Python script to test prime numbers.” → `<tool_call>`
- Turn 1 (Environment): Runs the Python code, returns its output.
- Turn 2 (Model): “Interesting, let me verify with another script.” → `<tool_call>`
- Turn 2 (Environment): Verification passes → returns `True`.
- Turn 3 (Model): “Great, verification passed; the final answer is 17.”
This back-and-forth lets the model offload heavy computation, verify its logic, and recover from mistakes — essential for solving hard problems.
The team designed a structured JSON format for tool calls and an explicit prompt template to guide their use.
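As a rough illustration (the field names below are assumptions, not the released schema), a structured tool call might look like this:

```python
import json

# A minimal sketch of a structured tool call. The field names ("tool", "code")
# are illustrative assumptions, not the exact schema used by rStar2-Agent.
tool_call = {
    "tool": "python_interpreter",
    "code": "from sympy import isprime\nprint([n for n in range(2, 50) if isprime(n)])",
}

# The model would emit this JSON inside its tool-call block; the environment
# parses it, executes the code in a sandbox, and returns the output as the
# next turn of the dialogue.
print(json.dumps(tool_call, indent=2))
```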
The Core Method: Smarter Learning with GRPO-RoC
Foundation: Group Relative Policy Optimization (GRPO)
GRPO trains the model using an outcome-only reward: 1 if the final answer is correct, 0 otherwise. This simple, rule-based signal sidesteps the reward hacking that more elaborate reward models can invite.
For each problem, the model generates a group of candidate solutions and updates its policy based on each solution's performance relative to the rest of the group. Every trajectory receives a binary reward on its final answer, and its advantage is that reward normalized against the group's mean and standard deviation.
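In standard GRPO notation (a reconstruction from the usual definitions, not quoted from the paper), the reward and advantage are:

$$
r_i \;=\;
\begin{cases}
1, & \text{if trajectory } i \text{ ends with the correct final answer},\\
0, & \text{otherwise},
\end{cases}
\qquad
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r_j\}_{j=1}^{G}\big)}
$$

where $G$ is the group size, i.e., the number of solutions sampled per problem.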
Problem: Reinforcing Bad Habits
Outcome-only rewards work well for plain text reasoning, but introduce issues with tool use:
A trajectory may contain multiple failed tool calls (bugs, timeouts, etc.) but still luckily yield the correct answer. GRPO would reward it equally, subtly teaching the model that tool errors are fine.
As seen below, naive GRPO leads to a plateau in tool error rates among correct trajectories:
Solution: Resample-on-Correct (RoC)
GRPO-RoC refines this by filtering training examples:
- Oversample: Generate 2× the usual group size.
- Separate: Split into positive (correct answer) and negative piles.
- Downsample Asymmetrically:
- Negative: Randomly sample half, preserving diverse failure modes.
- Positive: Sample half by quality — fewest tool errors, proper formatting.
Each positive trajectory is scored with two penalties: a tool-error penalty (how often its tool calls fail) and a format penalty (whether it follows the expected answer and tool-call format). Positive samples with lower penalties are more likely to be included, so the model learns from clean, efficient reasoning traces.
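As a minimal sketch of this selection step, assuming each trajectory already carries a correctness flag and the two penalty scores (the names below are illustrative, not from the released code):

```python
import random
from dataclasses import dataclass

@dataclass
class Trajectory:
    is_correct: bool           # does the final answer match the ground truth?
    tool_error_penalty: float  # e.g. fraction of tool calls that returned errors
    format_penalty: float      # e.g. penalty for malformed answers or tool calls

def resample_on_correct(oversampled):
    """Downsample a 2x-oversampled group back to the nominal group size."""
    positives = [t for t in oversampled if t.is_correct]
    negatives = [t for t in oversampled if not t.is_correct]

    # Negatives: keep a random half, preserving diverse failure modes.
    kept_negatives = random.sample(negatives, len(negatives) // 2)

    # Positives: keep the cleaner half. (Sorting deterministically is a
    # simplification; the rule described above makes low-penalty traces
    # *more likely* to be kept, i.e. a weighted sample.)
    positives.sort(key=lambda t: t.tool_error_penalty + t.format_penalty)
    kept_positives = positives[: len(positives) // 2]

    return kept_positives + kept_negatives
```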
The final GRPO-RoC objective takes the same form as GRPO's, computed over the resampled group.
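In standard notation (reconstructed from the GRPO literature; GRPO's optional KL regularization term is omitted here), the clipped surrogate over the kept trajectories is:

$$
\mathcal{J}(\theta) \;=\; \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(\rho_{i,t}\,\hat{A}_i,\;\operatorname{clip}\big(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right],
\qquad
\rho_{i,t} \;=\; \frac{\pi_\theta\big(o_{i,t}\mid q,\,o_{i,<t}\big)}{\pi_{\theta_{\text{old}}}\big(o_{i,t}\mid q,\,o_{i,<t}\big)}
$$

where the $o_i$ are the trajectories kept by Resample-on-Correct; in agentic settings, tokens returned by the environment (tool outputs) are typically masked out of the loss.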
Building the Engine: Scalable Agentic RL Infrastructure
Training an agentic model of this scale requires solving two major bottlenecks.
1. Tool Call Bottleneck:
A single large rollout batch can trigger tens of thousands of tool executions. Running them locally on the training nodes overwhelms the CPUs and risks executing unsafe, model-generated code.
Solution:
A dedicated Environment Service, distributed across the cluster's CPU nodes, isolates tool execution from the training process. It handles up to 45K concurrent tool calls per step with an average latency of 0.3 seconds.
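A rough sketch of the dispatch pattern, assuming a sandboxed HTTP endpoint (the URL and payload fields below are hypothetical) that accepts code and returns its output:

```python
import asyncio
import aiohttp

# Hypothetical endpoint of the isolated Environment Service; in practice this
# would front a pool of sandboxed workers spread across the cluster's CPU nodes.
ENV_SERVICE_URL = "http://env-service.cluster.local/execute"

async def run_tool_call(session, code, timeout_s=10.0):
    """Send one Python snippet to the sandbox and return its result."""
    payload = {"language": "python", "code": code, "timeout": timeout_s}
    async with session.post(ENV_SERVICE_URL, json=payload) as resp:
        return await resp.json()  # e.g. {"stdout": "...", "error": None}

async def run_batch(tool_calls):
    """Dispatch one rollout step's tool calls concurrently rather than serially."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(run_tool_call(session, c) for c in tool_calls))

# results = asyncio.run(run_batch(["print(2 + 2)", "print(sum(range(10)))"]))
```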
2. Rollout Imbalance:
Different problems require different numbers of turns and tool calls, so statically assigning rollouts to GPUs leaves some of them idle while the longest rollouts finish.
Solution:
A dynamic load-balanced rollout scheduler assigns work based on live GPU KV cache capacity, dispatching tool calls asynchronously.
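A rough sketch of the idea, assuming each rollout worker reports its free KV-cache budget (the field names and the greedy placement rule below are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class RolloutWorker:
    name: str
    free_kv_tokens: int                 # live estimate of remaining KV-cache capacity
    assigned: list = field(default_factory=list)

def schedule(requests, workers):
    """Greedily place each rollout on the worker with the most free KV cache.

    `requests` is a list of (request_id, estimated_tokens) pairs; the estimates
    can be rough, since a real scheduler re-balances as generations finish and
    dispatches tool calls asynchronously in the meantime.
    """
    for request_id, est_tokens in sorted(requests, key=lambda r: -r[1]):
        target = max(workers, key=lambda w: w.free_kv_tokens)
        target.assigned.append(request_id)
        target.free_kv_tokens -= est_tokens
    return workers
```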
The Recipe for Success
1. Non-Reasoning Cold Start
Instead of reasoning-heavy SFT, the team started with zero reasoning data, teaching:
- General instruction following
- Tool call JSON formatting
This prevents overfitting to specific reasoning styles before RL.
2. Multi-Stage RL
Training unfolds in three stages, increasing length and problem difficulty:
- Stage 1: 42K problems, 8K max length → concise strategies.
- Stage 2: 12K max length → deeper reasoning.
- Stage 3: 17.3K hardest problems only → push limits.
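Written as a simple configuration for illustration (field names are assumed, the lengths are treated as response-token limits, and values not stated above are left as `None`):

```python
# Illustrative stage schedule; not taken from the released training code.
RL_STAGES = [
    {"stage": 1, "num_problems": 42_000, "max_response_tokens": 8_000},
    {"stage": 2, "num_problems": None,   "max_response_tokens": 12_000},  # problem count not stated above
    {"stage": 3, "num_problems": 17_300, "max_response_tokens": None},    # hardest problems only; length not stated above
]
```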
Performance improves steadily across the three stages.
Results and Analysis
State-of-the-Art Math Reasoning
Thinking Smarter, Not Longer
rStar2-Agent produces shorter responses while achieving higher accuracy.
Strong Generalization
Math-only training improved reasoning in science tasks and preserved general capabilities:
Ablation: Power of RoC
Without RoC (i.e., GRPO with tool use alone), performance drops and responses grow longer:
Inside the Mind of an Agent
Agentic RL fosters advanced cognitive behaviors. Analyzing the high-entropy tokens in the model's responses reveals two patterns:
- Forking Tokens: “But before I conclude…” triggers self-reflection.
- Reflection Tokens on Tool Output: Unique to agentic models — processing environment feedback to debug or verify.
Example: encountering a `GeneratorsNeeded` error, the model diagnoses its misuse of SymPy, rewrites more robust code, and solves the problem.
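For readers who haven't hit this error, here is one generic way SymPy raises `GeneratorsNeeded` (an illustration, not the model's actual trace): asking for a polynomial over an expression that contains no symbols, fixed by naming the generator explicitly.

```python
from sympy import Poly, Symbol
from sympy.polys.polyerrors import GeneratorsNeeded

try:
    Poly(5)  # a bare constant has no symbols, so SymPy cannot infer a generator
except GeneratorsNeeded as exc:
    print("GeneratorsNeeded:", exc)

# Fix: state the generator explicitly.
x = Symbol("x")
print(Poly(5, x))  # -> Poly(5, x, domain='ZZ')
```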
Conclusion and Future Directions
The rStar2-Agent story is a blueprint for efficient reasoning-focused AI:
- Noise-resistant RL algorithm: GRPO-RoC
- Scalable infrastructure: Tool-call isolation & dynamic load balancing
- Compute-savvy training: Multi-stage RL, cold start SFT
This work shows that scaling reasoning skills — not just parameter counts — is the path forward. The open-sourced code and recipes invite the community to push agentic training into new domains and more powerful tools.
The era of the agentic LLM is here: models that can generate text, reason deeply, interact with environments, and adapt intelligently.