Introduction: The High Cost of Thinking

For years, the go-to method for getting Large Language Models (LLMs) to solve complex reasoning problems has been to make them “think out loud.” By prompting them to generate a step-by-step Chain-of-Thought (CoT), we encourage them to break down complex problems, explore different approaches, and correct their own mistakes along the way. The informal rule has been simple: the more “thinking tokens” a model generates, the better its final answer.

But this approach carries a steep price. Long chains of thought lead to:

  • Inflated Context Length: Every additional token lengthens the sequence, pushing models closer to their context limits and causing well-known “lost in the middle” failures.
  • Higher Compute Costs: Attention compute grows roughly quadratically with sequence length (and KV-cache memory grows linearly), so long reasoning traces drive up compute cost and carbon footprint.
  • Increased Latency: Thousands of tokens generated sequentially can result in sluggish response times.

This sets up an unpleasant trade-off among accuracy, cost, and latency. Yet, what if models could think more efficiently—achieving better results without generating endless text?

A recent paper from Meta Superintelligence Labs, “Rethinking Thinking Tokens: LLMs as Improvement Operators,” proposes just that. The authors reimagine LLM reasoning not as a single, linear stream of thought but as an iterative process that improves over rounds. They introduce a family of inference strategies that let models generate diverse ideas in parallel, distill the best insights, and refine answers iteratively, all while keeping the active context compact. This insight opens a new frontier in reasoning efficiency—better accuracy, smaller contexts, and lower latency.

Let’s unpack how this works.


From Linear Chains to Collaborative Refinement

To understand the novelty of this approach, we first revisit how reasoning in LLMs evolved.

Chain-of-Thought (CoT) prompting revolutionized test-time reasoning, showing that simply asking models to reason step by step enhances accuracy on complex tasks. A popular extension, Self-Consistency, samples multiple independent reasoning traces and votes on the best answer—stronger accuracy, but multiplied compute cost.

Later innovations made reasoning more interactive:

  • Self-Improvement: Systems like Self-Refine and Reflexion allow models to critique and revise their own outputs.
  • Multi-Agent Debate: Multiple LLMs engage in back-and-forth argumentation to scrutinize each other’s reasoning.
  • Structured Search: Methods such as Tree of Thoughts (ToT) and Graph of Thoughts explore branching search spaces for multi-step reasoning.

All of these are effective—but sprawling. Every additional reasoning path lengthens the prompt, making each turn more expensive and vulnerable to forgetting earlier context.

This paper brings coherence to these scattered methods under one unifying lens: LLMs as Improvement Operators. By explicitly modeling iteration and resource constraints like latency and compute, it defines a principled space where reasoning can be optimized across multiple dimensions.


LLMs as Improvement Operators

Rather than a static generator, the LLM acts as a dynamic improver—an operator that moves an initial guess closer to an ideal solution.

A Formal Framework for Iteration

Consider a problem \(x\) (e.g., a math question) and a current solution \(s_t\). The model, represented as \( \mathcal{M}_{\theta} \), computes a refined solution \(s_{t+1}\) using a workspace \(C_t\):

\[ s_{t+1} = \mathcal{M}_{\theta}(x, s_t, C_t) \]

Figure: The model acts as an improvement operator: each step uses the problem \(x\), the current state \(s_t\), and the workspace \(C_t\) to produce a refined artifact \(s_{t+1}\).

The workspace \(C_t\) is a compact summary—bounded in length to \(\kappa\) tokens—that captures key intermediate results, contradictions, and open goals.

Each iteration runs a read–write–compress cycle:

  1. Read: Process the problem \(x\) and the current summary \(C_t\).
  2. Write: Generate a new candidate solution \(s_{t+1}\).
  3. Compress: Distill \(s_{t+1}\) into a new bounded workspace \(C_{t+1}\):
\[ C_{t+1} = \mathcal{D}(x, s_{t+1}), \quad |C_{t+1}| \leq \kappa \]

Figure: The synthesis operator \(\mathcal{D}\) distills reasoning into a fresh, bounded workspace \(C_{t+1}\), keeping the active context small.

This framework allows the “thinking” to be extensive but bounded—essentially trading off total computation against latency and context length.
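
To make the loop concrete, here is a minimal Python sketch of one read–write–compress cycle. It is an illustration, not the paper's code: `call_llm` is a hypothetical stand-in for any chat-completion API, and the workspace bound `KAPPA` is enforced in characters rather than tokens for simplicity.

```python
KAPPA = 2000  # workspace bound; characters here as a simple stand-in for the token budget kappa

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to any chat-completion endpoint."""
    raise NotImplementedError

def improve(problem: str, solution: str, workspace: str) -> str:
    """One operator step: produce s_{t+1} from (x, s_t, C_t)."""
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Workspace (key results, contradictions, open goals):\n{workspace}\n\n"
        f"Current solution:\n{solution}\n\n"
        "Improve the solution and fix any errors you find."
    )
    return call_llm(prompt)

def compress(problem: str, solution: str) -> str:
    """Distillation operator D: rebuild a bounded workspace C_{t+1} from scratch."""
    prompt = (
        f"Problem:\n{problem}\n\nSolution draft:\n{solution}\n\n"
        f"Summarize the key intermediate results, contradictions, and open goals "
        f"in at most {KAPPA} characters."
    )
    return call_llm(prompt)[:KAPPA]  # hard cap keeps the active context bounded

def solve(problem: str, rounds: int = 4) -> str:
    solution, workspace = "", ""
    for _ in range(rounds):
        solution = improve(problem, solution, workspace)  # read + write
        workspace = compress(problem, solution)           # compress
    return solution
```

The essential property is that the prompt passed to each call stays bounded: the problem, the latest solution, and at most \(\kappa\) of workspace, never the full history.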


Mind Your Budgets: Latency vs. Compute

The authors introduce two token budgets to fairly measure efficiency:

\[ B_{seq} = \sum_{\text{accepted path}} (\text{in} + \text{out}), \qquad B_{total} = \sum_{\text{all calls}} (\text{in} + \text{out}) \]

Figure: Two budget metrics, \(B_{seq}\) and \(B_{total}\), quantify latency and total compute respectively.

  • Sequential Budget (\(B_{seq}\)) reflects how many tokens are processed along a single path—the latency proxy.
  • Total Budget (\(B_{total}\)) counts all tokens produced during reasoning, including discarded drafts—our compute and cost proxy.

This distinction enables methods that keep latency low (small \(B_{seq}\)) by parallelizing exploration across many shorter contexts.
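
As a quick illustration of the accounting (a minimal sketch, not from the paper), the two budgets can be tallied from a log of model calls, where each call records its input and output token counts and whether it sits on the accepted sequential path. A stricter latency proxy would also charge the longest of any batch of parallel calls; that refinement is omitted here.

```python
from dataclasses import dataclass

@dataclass
class Call:
    tokens_in: int
    tokens_out: int
    on_accepted_path: bool  # does this call feed the final answer sequentially?

def budgets(calls: list[Call]) -> tuple[int, int]:
    b_seq = sum(c.tokens_in + c.tokens_out for c in calls if c.on_accepted_path)
    b_total = sum(c.tokens_in + c.tokens_out for c in calls)
    return b_seq, b_total

# Example: 8 parallel drafts feed one distill-and-refine call.
log = [Call(500, 800, on_accepted_path=False) for _ in range(8)]
log.append(Call(1200, 400, on_accepted_path=True))
print(budgets(log))  # (1600, 12000): small latency proxy, much larger total compute
```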


Two Strategies for Iterative Inference

Under this operator framework, two main inference regimes emerge:

Figure: From long chains to iterative operators. Compared with a single long CoT trace, Sequential Refinement (SR) and Parallel-Distill-Refine (PDR) separate total compute from per-call latency.

1. Sequential Refinement (SR)

In SR, the model iteratively improves a single solution over several rounds:

\[ s_{t+1} = \mathcal{M}_{\theta}(x, s_t, \emptyset) \]

Figure: Sequential Refinement repeatedly updates one solution with short steps over \(R\) rounds.

Imagine saying, “Here’s your last try—improve it.” The model refines the same artifact step by step, without carrying an ever-growing history. It is depth-oriented and economical in total compute, but latency grows with every round, since each refinement must wait for the previous one.
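
Reusing the hypothetical `call_llm` helper from the earlier sketch, SR reduces to a short loop in which the previous answer is the only state carried forward:

```python
def sequential_refine(problem: str, rounds: int = 4) -> str:
    """Sketch of SR: the prior answer itself plays the role of the workspace."""
    solution = call_llm(f"Problem:\n{problem}\n\nSolve it step by step.")
    for _ in range(rounds):
        solution = call_llm(
            f"Problem:\n{problem}\n\nHere is your last attempt:\n{solution}\n\n"
            "Improve it. Keep the answer concise."
        )
    return solution
```

Because every call depends on the one before it, \(B_{seq}\) grows linearly with the number of rounds, which is exactly the latency cost the authors contrast with PDR.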

2. Parallel-Distill-Refine (PDR)

Parallel-Distill-Refine (PDR) is the flagship contribution—a breadth-oriented method that expands exploration while keeping each round short.

Each round performs three operations:

\[ S^{(r)} = \{ s_i^{(r)} = \mathcal{M}_{\theta}(x, C^{(r-1)}) \}_{i=1}^{M_r} \]

\[ C^{(r)} = \mathcal{D}(x, S^{(r)}), \quad |C^{(r)}| \leq \kappa \]

Figure: PDR generates multiple drafts in parallel, distills them into a concise workspace, and refines further.

Parallel: The model generates \(M_r\) diverse drafts simultaneously.

Distill: These drafts are summarized into a compact workspace—many strategies are possible:

  • Global Summary: Synthesizes agreements and contradictions into a textual report.
  • Top-k Selection: Chooses the most promising drafts.
  • Random-k: Selects random examples to maintain diversity.

Refine: The next round builds upon this distilled state, producing improved answers.

By regenerating summaries from scratch each round, PDR prevents runaway context expansion while using parallelism to convert total compute into higher accuracy—without longer latency.
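
The sketch below (again using the hypothetical `call_llm` helper and character-based `KAPPA` bound from above) strings the three operations together for a few rounds. It implements the global-summary and random-k distillation options from the list above; a top-k variant would additionally need a scoring or verification step.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def draft(problem: str, workspace: str) -> str:
    """Generate one candidate solution conditioned on the distilled workspace."""
    return call_llm(
        f"Problem:\n{problem}\n\nWorkspace from the previous round:\n{workspace}\n\n"
        "Propose a complete solution; feel free to try a different approach."
    )

def distill(problem: str, drafts: list[str], strategy: str = "summary") -> str:
    """Synthesis operator D: compress the drafts into a bounded workspace."""
    if strategy == "summary":  # global summary of agreements and contradictions
        joined = "\n---\n".join(drafts)
        return call_llm(
            f"Problem:\n{problem}\n\nCandidate solutions:\n{joined}\n\n"
            f"Report agreements, contradictions, and open goals in at most {KAPPA} characters."
        )[:KAPPA]
    if strategy == "random_k":  # keep a few random drafts to preserve diversity
        return "\n---\n".join(random.sample(drafts, k=min(2, len(drafts))))[:KAPPA]
    raise ValueError(f"unknown distillation strategy: {strategy}")

def pdr(problem: str, rounds: int = 3, m: int = 8) -> str:
    workspace = ""
    for _ in range(rounds):
        with ThreadPoolExecutor(max_workers=m) as pool:  # parallel generation
            drafts = list(pool.map(lambda _: draft(problem, workspace), range(m)))
        workspace = distill(problem, drafts)             # compact, rebuilt each round
    # final refine: answer from the last distilled workspace
    return call_llm(f"Problem:\n{problem}\n\nWorkspace:\n{workspace}\n\nWrite the final answer.")
```

Note that the wall-clock (sequential) cost of each round is roughly one draft plus one distillation call, no matter how many drafts run in parallel; that is how PDR keeps \(B_{seq}\) small while spending a large \(B_{total}\).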


Operator-Consistent Training: Teaching Models to Iterate

A challenge arises: inference uses this multi-round operator, but most training optimizes for single long traces. To align training with deployment, the authors develop Operator-Consistent Reinforcement Learning (RL).

Training mixes two modes:

  1. Standard Long-Trace RL: Optimizes traditional, single-chain reasoning.
  2. Operator Rollouts: Simulates one round of PDR—parallel generation, distillation, and refinement—to teach the model the iterative interface.

The combined objective averages both losses:

\[ \mathcal{J}_{\text{train}}(\theta) = \frac{1}{2}\mathcal{J}_{\text{trace}} + \frac{1}{2}\mathcal{J}_{\text{op}} \]

Figure: The combined training objective averages the standard long-trace RL loss and the operator-consistent loss, so models learn both long and short reasoning cycles.

This blend instills the meta-skills that make iteration work—verification, summarization, refinement, and diversity generation—bridging the train–test gap.
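
A deliberately stripped-down sketch of how that mixing might look in a training step is shown below. The rollout and loss helpers are hypothetical placeholders for whatever policy-gradient machinery is actually used; only the averaging of the two objectives mirrors the equation above.

```python
def rl_loss(model, rollouts):
    """Placeholder: score rollouts with the task reward and form a policy-gradient loss."""
    raise NotImplementedError

def rollout_long_traces(model, problems):
    """Placeholder: sample standard single-chain reasoning traces."""
    raise NotImplementedError

def rollout_operator_round(model, problems):
    """Placeholder: simulate one PDR round (parallel drafts, distill, refine)."""
    raise NotImplementedError

def mixed_objective(model, problems):
    j_trace = rl_loss(model, rollout_long_traces(model, problems))    # J_trace
    j_op = rl_loss(model, rollout_operator_round(model, problems))    # J_op
    return 0.5 * j_trace + 0.5 * j_op                                 # averaged objective
```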


Experiments: Putting Theory into Practice

The team evaluated these methods on mathematical reasoning benchmarks—AIME 2024 and AIME 2025—using models like gpt-o3-mini and gemini-2.5-flash, measuring performance across token budgets.

RQ1: Can Short-Context Iterations Beat Long Traces?

Yes—decisively. Both SR and PDR outperform long CoT under matched latency budgets.

Figure 3: On AIME 2024 with gemini-2.5-flash and gpt-o3-mini, PDR converts parallel compute into the highest accuracy at a similar latency, ahead of SR and long CoT.

For example, with gpt-o3-mini at a 49k sequential budget, accuracy improves from 76.9% (Long CoT) → 81.5% (SR) → 86.7% (PDR). Similar results hold for AIME 2025.

Figure 9: On AIME 2025, short-context iteration (SR and PDR) again outperforms long reasoning traces at equal latency budgets.

Performance viewed through latency and compute trade-offs reveals complementary strengths.

Figure 4: Pareto frontiers of accuracy versus budget. PDR dominates at fixed latency (low \(B_{seq}\)), while SR shines when total compute (\(B_{total}\)) is the binding constraint.

In a detailed comparison at 90% accuracy (Figure 5), SR needs a sequential budget of 442k tokens, while PDR attains the same result with only 172k—2.6× lower latency.

Figure 5: PDR reaches comparable accuracy with far fewer sequential tokens than SR.


RQ2: Which Distillation Strategy Works Best?

PDR’s efficiency depends on the quality of its workspace synthesis. Experiments tested several variants of the distillation operator \(\mathcal{D}\):

Table 2: Among the distillation variants (global summary, shared top-k, per-sample top-k, random-k), global summary and per-sample top-k consistently yield the best results.

Global Summary and Per-sample Top-k emerged as clear winners—aggregating insights across all drafts or leveraging the most promising individual solutions. Random-k selections lagged behind, underscoring that structured summarization drives genuine improvement.


RQ3: How Does Self-Verification Affect Performance?

The next question probes why PDR works so well, and when it fails. The researchers tested “oracle” variants in which the summaries were manually curated:

  • Oracle (Correct): Only correct drafts included in the workspace.
  • Oracle (Incorrect): Only incorrect drafts included.
  • Default: Random-k baseline.

Figure 6: Feeding only incorrect drafts into the workspace severely degrades accuracy, revealing an anchoring bias when the workspace reinforces bad reasoning.

Results are striking: feeding in incorrect solutions tanks performance, while supplying correct ones boosts it. Models can get “anchored” on bad reasoning pathways. Stronger self-verification—the ability to identify and trust correct partial work—proves critical for reliable iterative improvement.


RQ4: Does Operator-Consistent Training Move the Pareto Frontier?

Finally, the study trained an 8B model under operator-consistent RL and compared results.

Table 3: Operator-consistent RL improves PDR performance over baseline RL on both AIME 2024 and AIME 2025.

Mixing standard RL with PDR-specific training improved accuracy by up to +5 percentage points, confirming that aligning training and inference lifts reasoning quality under identical latency budgets. Models learn to “think iteratively” rather than simply output longer traces.


Conclusion: Toward Smarter, Faster Reasoning

This study redefines how we view test-time reasoning. Rather than measuring intelligence by token length, it shows that bounded, iterative thinking surpasses extended monologues.

Key takeaways:

  1. Iteration Beats Linear Thinking: Both SR and PDR outperform long CoT baselines when matched for latency.
  2. Parallel Compute Converts to Accuracy: PDR decouples context span from total reasoning, achieving faster yet smarter outputs.
  3. Compact Workspaces Matter: Success hinges on crafting short, information-rich summaries that preserve insight without bloating context.
  4. Aligned Training Amplifies Gains: Operator-consistent RL closes the gap between practice and deployment, enabling models to master iterative reasoning meta-skills.

The implications are far-reaching. Future LLMs could adaptively choose between depth (Sequential Refinement) and breadth (Parallel-Distill-Refine) depending on problem complexity and compute availability. Training could focus directly on optimizing summarization operators or developing adaptive token budgets tuned to user latency constraints.

By moving beyond the “more tokens means better reasoning” mindset, this research unveils a vision of LLMs as dynamic improvement operators—systems capable of learning, refining, and reasoning efficiently within bounded contexts. It’s a glimpse into a future where smarter reasoning doesn’t mean slower answers.