We’re living in an era where Large Language Models (LLMs) are becoming incredibly powerful. Yet for many users, interacting with them still feels like a simple Q&A: you ask, they answer.

But what if an AI could go further? Imagine posing a complex question—such as “What are the long-term economic impacts of quantum computing on the financial sector?”—and having the AI autonomously research it, browse relevant sources, analyze data, and present a comprehensive, evidence-backed report.

This vision lies at the heart of agentic AI: building systems that can reason, plan, and use tools to accomplish multi-step goals. One of the most challenging applications here is Deep Research (DR), where an agent must navigate vast information spaces and synthesize reliable answers through flexible, multi-tool workflows.

A recent paper from Salesforce AI Research, SFR-DeepResearch, presents a compelling approach to this challenge. Instead of orchestrating complex multi-agent systems, they focus on crafting a single autonomous agent capable of handling the research process end-to-end. Their “secret sauce” is a novel reinforcement learning (RL) training recipe—powered entirely by synthetic data—that teaches reasoning-optimized LLMs how to become effective, self-directed researchers.

In this post, we’ll explore their methodology: the carefully engineered agent workflow, why standard RL falls short for this type of task, and how their modifications yield state-of-the-art results in deep research.


Single Agents vs Multi-Agent Teams

Before diving into the training innovations, let’s outline the two broad approaches for building DR systems.

Multi-Agent Systems
Think of these as project teams. A top-level orchestrator decomposes a complex query into sub-tasks and dispatches them to specialized agents:

  • A Planner to break down steps,
  • A Researcher to search and retrieve information,
  • A Coder to perform computations,
  • A Writer to assemble the final report.

This structured division of labor can be powerful, but it tends to lock agents into fixed workflows.

Single-Agent Systems
Here, one capable LLM is given the question and a set of tools. It decides its next best action—searching, browsing, coding—without any mid-process instruction. This autonomy allows flexibility and potentially stronger generalization to unseen tasks, since it’s not constrained by rigid, predefined steps.

SFR-DeepResearch focuses entirely on this single-agent paradigm. The authors argue that a highly capable single agent is more adaptable, and if needed, can fit into larger multi-agent architectures as a sub-component—reducing overall complexity.


The SFR-DeepResearch Recipe

The team’s framework transforms strong reasoning LLMs into autonomous research agents via three pillars:

  1. An Agentic Workflow tuned to the base model
  2. Challenging, synthetic training data
  3. A reinforcement learning algorithm customized for stability

1. Building a Model-Specific Workflow

How an agent interfaces with its tools and manages context is fundamental. The authors design a minimalist toolset, coupled with a workflow adapted to each model’s strengths.

A Minimalist Toolbox

The agent is equipped with just three core tools:

  1. search_internet(query: str) – Barebones web search API returning top-10 organic results with titles, URLs, and snippets.
  2. browse_page(url: str, section_id: int) – Scrapes HTML to clean Markdown and strips hyperlinks, making pages “static.” To visit a new link, the agent must issue a fresh search.
  3. code_interpreter(code: str) – A secure, stateless Python executor. Each run is isolated, with no variable persistence or risky package access.

By limiting tool sophistication, the authors ensure the agent is challenged to plan strategically and reason efficiently.
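For concreteness, here is a minimal sketch of how the three tools might be exposed to the model as function-calling schemas. Only the tool names and signatures come from the paper; the OpenAI-style JSON layout and the descriptions are illustrative assumptions.

```python
# Illustrative tool schemas. Names and signatures follow the paper;
# the OpenAI-style JSON layout and descriptions are assumptions.
TOOLS = [
    {
        "name": "search_internet",
        "description": "Barebones web search returning the top-10 organic "
                       "results (title, URL, snippet) for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "browse_page",
        "description": "Fetch a URL, convert the HTML to clean Markdown with "
                       "hyperlinks stripped, and return the requested section.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string"},
                "section_id": {"type": "integer"},
            },
            "required": ["url", "section_id"],
        },
    },
    {
        "name": "code_interpreter",
        "description": "Execute a self-contained Python snippet in a stateless, "
                       "sandboxed interpreter and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
]
```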

Adapting to Model Characteristics

Some models are naturally better at single-step reasoning. For QwQ-32B and Qwen3-8B, multi-turn chat degraded performance: the models’ “thinking” tokens (step-by-step reasoning traces) became erratic over long conversations.

The solution? Recast the interaction as iterative single-turn context-packing (see Figure 1). Every step’s prompt includes:

  • the original question
  • all prior tool calls and their outputs

Both are merged into one long user message.

This keeps the agent operating in the single-turn mode it’s best at.
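As a rough illustration, the packing step could look like the sketch below. The prompt template, the `history` structure, and the helper names are assumptions for readability, not the paper's exact format.

```python
def pack_prompt(question: str, history: list[dict]) -> str:
    """Build the single user turn for the next step.

    `history` holds prior steps as {"action": ..., "result": ...};
    the template here is an illustrative assumption.
    """
    parts = [f"Question: {question}"]
    for i, step in enumerate(history, start=1):
        parts.append(f"[Step {i}] Tool call: {step['action']}")
        parts.append(f"[Step {i}] Result: {step['result']}")
    parts.append("Decide the next tool call, or give the final answer.")
    return "\n\n".join(parts)

# Each iteration re-sends the entire history as one user message:
#   prompt = pack_prompt(question, history)
#   reply = llm.generate(prompt)              # hypothetical LLM call
#   history.append(execute_tool_call(reply))  # hypothetical executor
```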


Figure 1: Example tool-calling trajectory for QwQ-32B/Qwen3. Previous steps’ actions and results are packed into a single user turn, preserving single-turn optimization.

For gpt-oss-20b, which excels in multi-turn exchanges, they retained the standard multi-turn chat format.

Self-managed Memory

Deep research produces long contexts that can overflow an LLM's context window. Instead of truncating blindly, the agent gains a clean_memory(content: str) tool. When warned that the context is nearing the limit, its only valid action is to call this tool, summarizing and preserving essential facts while discarding extraneous details. This fosters the skill of compressing context in service of long-horizon goals.
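Conceptually, the overflow handling might look like the sketch below: once the packed context approaches the model's budget, the agent is warned and only a clean_memory call is accepted, and the summary it returns replaces the accumulated history. The token budget, threshold, and helper functions are assumptions; only the clean_memory tool comes from the paper.

```python
MAX_CONTEXT_TOKENS = 32_768   # assumed budget, for illustration only
WARN_FRACTION = 0.9           # warn once 90% of the budget is used

def step_with_memory(llm, question, history, count_tokens):
    """One agent step that enforces clean_memory when the context is nearly full."""
    prompt = pack_prompt(question, history)
    if count_tokens(prompt) > WARN_FRACTION * MAX_CONTEXT_TOKENS:
        # Warn the model; the only valid action now is clean_memory(content=...).
        prompt += "\n\nWARNING: context nearly full. Call clean_memory(content=...)."
        summary = parse_clean_memory(llm.generate(prompt))  # hypothetical parser
        # The agent's own summary replaces the accumulated history.
        return [{"action": "clean_memory", "result": summary}]
    reply = llm.generate(prompt)                            # hypothetical LLM call
    return history + [execute_tool_call(reply)]             # hypothetical executor
```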


2. Crafting Truly Challenging Data

The team found existing multi-hop QA datasets (e.g., HotpotQA) too easy: many of their questions can be answered without any search.

They therefore synthesized two demanding task types:

  • Short-form QA – multi-hop, fact-seeking queries plus math and code problems, designed to require multiple search iterations.
  • Long-form reporting – open-ended questions prompting full reports, with rubrics for evaluating factuality, writing quality, and citations.

These tasks are search-intensive and sometimes require up to 50 tool calls to solve. Even OpenAI’s o3-based Deep Research agent scored under 65% accuracy, while many baselines scored below 40%.


3. Reinforcement Learning, Stabilized

Training sequences of tool calls to maximize a final reward is tricky—especially when trajectories vary greatly in length.

The Degeneracy Problem

An agent might overfit to making more tool calls (even repetitive ones), because longer trajectories dominate gradient updates. This leads to bad habits and performance collapse.

Length-Normalized Advantage

The authors modify REINFORCE to scale advantage by trajectory length \( T_i \):

\[ A_{i,j} = \frac{r_i - \operatorname{mean}(\overline{R})}{\operatorname{std}(\overline{R}) \cdot T_i} \]

Here \( r_i \) is the final reward of trajectory \( i \), \( \overline{R} \) collects the rewards of the sampled group, and \( T_i \) is the trajectory's length in steps. Dividing by \( T_i \) downweights per-step credit and blame in longer trajectories, preventing them from overwhelming shorter, more efficient paths.
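As a small numerical sketch, this is how the length-normalized advantage could be computed for a sampled group of trajectories (the grouping and variable names are assumptions; the arithmetic follows the equation above).

```python
import numpy as np

def length_normalized_advantages(rewards, lengths, eps=1e-8):
    """Shared per-step advantage A_{i,j} for each trajectory i in a group.

    rewards: final reward r_i of each trajectory
    lengths: number of steps T_i in each trajectory
    """
    r = np.asarray(rewards, dtype=np.float64)
    t = np.asarray(lengths, dtype=np.float64)
    return (r - r.mean()) / ((r.std() + eps) * t)

# Two long failures vs. one short success:
adv = length_normalized_advantages(rewards=[0.0, 0.0, 1.0], lengths=[40, 35, 8])
# The short successful trajectory gets the largest per-step advantage, so
# verbose tool-call sequences no longer dominate the gradient update.
```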


Figure 2: Without length normalization (red), trajectory lengths balloon and performance collapses. With normalization (blue), tool usage is controlled and scores rise.

Additional Stabilizers

  • Trajectory Filtering – Remove failed, truncated, or ill-formatted runs from the replay buffer and keep a balanced ratio of positive to negative examples (sketched below).
  • Partial Rollouts – Reuse partial successful paths as starting states for new episodes, increasing exposure to useful intermediate contexts.

Combined, these techniques maintain RL stability over long-horizon, multi-tool research.
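A schematic of what the trajectory-filtering step might look like before each policy update; the field names and the balancing rule are illustrative assumptions, not the paper's exact criteria.

```python
def filter_trajectories(trajs, max_neg_per_pos=1):
    """Drop invalid runs, then roughly balance positive and negative examples.

    Each trajectory is assumed to carry `reward`, `truncated`, and
    `format_ok` fields; the criteria below are illustrative.
    """
    valid = [t for t in trajs if t["format_ok"] and not t["truncated"]]
    positives = [t for t in valid if t["reward"] > 0]
    negatives = [t for t in valid if t["reward"] <= 0]
    # Cap negatives relative to positives so failed runs don't flood the buffer.
    kept_negatives = negatives[: max_neg_per_pos * max(len(positives), 1)]
    return positives + kept_negatives
```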


Benchmarking SFR-DR

The team evaluates on three tough benchmarks:

  • FRAMES – multi-hop reasoning QA with browsing
  • GAIA – general assistant tasks (text-only set)
  • HLE – Humanity’s Last Exam, a reasoning-heavy suite spanning science and math

To ensure fairness, they employ a contamination blocklist, preventing the agent from visiting domains hosting benchmark answers.
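For illustration, such a blocklist could be enforced directly in the search tool, as in the sketch below; the placeholder domains and helper are assumptions, not the paper's actual list.

```python
from urllib.parse import urlparse

# Hypothetical placeholder domains standing in for sites that host benchmark answers.
CONTAMINATION_BLOCKLIST = {"benchmark-answers.example.org", "qa-dump.example.net"}

def filter_search_results(results):
    """Drop any search result whose domain is on (or under) the blocklist."""
    kept = []
    for r in results:
        domain = urlparse(r["url"]).netloc.lower()
        blocked = any(domain == d or domain.endswith("." + d)
                      for d in CONTAMINATION_BLOCKLIST)
        if not blocked:
            kept.append(r)
    return kept
```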


Table 1: SFR-DR agents vs proprietary and open-source baselines, evaluated with contamination controls.

Highlights:

  • SFR-DR-20B tops all open-source baselines and rivals, and in some cases beats, proprietary systems (e.g., OpenAI’s o3-based Deep Research).
  • Achieves 28.7% Pass@1 on HLE—65% relative gain over its base model (gpt-oss-20b).

Why It Works – Analysis

Workflow Matters

Testing their single-turn context-packed workflow against default multi-turn for Qwen/QwQ showed substantial gains before any RL.


Table 2: Changing to single-turn, packed context boosts FRAMES by ~10% absolute for QwQ-32B.

This confirms that matching workflow to model strengths is a critical—and cost-free—optimization.

Behavioral Shifts Post-RL


Figure 3: (a) Tool usage grows moderately post-RL; (b) QwQ/Qwen models produce longer responses, while gpt-oss-20b becomes more concise.

Key takeaways:

  • Tool Usage: RL encourages strategic increases. gpt-oss-20b starts with higher usage, making it a strong agentic foundation.
  • Response Length: gpt-oss-20b is token-efficient (short “thinking” traces); RL further compresses its output. In contrast, QwQ/Qwen models tend to expand their per-step reasoning after RL.

Conclusion & Takeaways

The SFR-DeepResearch paper delivers a clear, practical blueprint for building autonomous single-agent research systems from reasoning-centric LLMs:

  1. A single well-trained agent can rival multi-agent teams—simplifying architecture without sacrificing capability.
  2. Workflow should be model-specific—opt for formats your base LLM handles best.
  3. Stable RL objectives are essential—length-normalized advantage and quality filtering prevent degenerate behaviors common in long-horizon tasks.

By combining synthetic, search-intensive training data with an RL objective tuned for trajectory control, Salesforce AI’s team transforms open-source reasoning models into powerful, autonomous researchers—bringing us closer to AI collaborators that can truly participate in discovery and analysis.