Baking System 2 Thinking into LLMs: How Offline Simulation Improves Reasoning

Large Language Models (LLMs) like GPT-4 and Llama 2 have dazzled the world with their ability to write poetry, code, and essays. However, when it comes to rigorous logical reasoning or complex multi-step mathematics, cracks often appear in the facade. The model might hallucinate facts, make logical leaps that don’t follow, or simply guess the final answer without understanding the “why.”

In cognitive science, this rapid, intuitive response is often called System 1 thinking. But for complex problem-solving, humans use System 2 thinking—a slower, more deliberate process involving planning, evaluating intermediate steps, and backtracking if necessary.

How do we get LLMs to perform this “System 2” reasoning? A common approach is Reasoning-as-Planning, where the model explores different paths during inference (like a chess bot thinking several moves ahead). The problem? It’s incredibly slow and computationally expensive to do this every time you ask a question.

In this post, we’ll dive into a fascinating paper, “Learning Planning-based Reasoning via Trajectories Collection and Process Reward Synthesizing,” which proposes a clever solution: move the planning process from inference time to training time. By using offline simulations to synthesize “process rewards” and then training the model with Direct Preference Optimization (DPO), the researchers created a 7B parameter model that outperforms much larger models like GPT-3.5-Turbo on logical reasoning tasks.

The Core Problem: Outcome vs. Process

To understand the innovation here, we first need to look at how we currently teach LLMs to reason.

Most methods rely on Outcome Supervision. You give the model a math problem, it generates an answer, and you tell it “Correct” or “Incorrect.” This is like grading a student only on their final answer. A student who reaches the right answer through flawed logic learns the wrong lesson; a student who reasons perfectly but makes a typo at the end gets punished.

A better way is Process Supervision. This involves grading the steps of the reasoning (the “Chain of Thought”). While effective, this usually requires humans to manually annotate every step of thousands of reasoning traces—a process that is prohibitively expensive and hard to scale.

The Planning Alternative

Another alternative is treating reasoning as a search problem. As shown in Figure 2(a) below, during inference, the model generates a “tree” of possible thoughts. A verifier scores these intermediate steps, and an algorithm like Monte Carlo Tree Search (MCTS) finds the best path.

Comparison between search-based inference and trajectory collection-based offline training.

While the approach in Figure 2(a) yields great results, it comes at the cost of high latency. Imagine waiting 30 seconds for every sentence because the model is running a search algorithm in the background.

The authors of this paper propose the approach in Figure 2(b): use that search process offline to collect data, synthesize rewards for every step, and then fine-tune the model to internalize that planning capability. The result? A model that reasons better instantly, without the inference-time delay.

The Methodology: Synthesizing Rewards via Simulation

The researchers’ framework is a pipeline designed to turn “Outcome Supervision” (which is cheap/available) into “Process Supervision” (which is valuable). Let’s break down the method, illustrated in Figure 3.

The overall framework of the approach, showing trajectory collection, reward synthesis, and optimization.

Step 1: Trajectory Collection & Partial Exploration

First, the model generates full solutions (trajectories) for a set of problems.

The key innovation lies in Offline Simulation. To determine if a specific intermediate step (say, Step 3 of a 10-step math proof) is good, the researchers don’t ask a human. Instead, they use Monte Carlo estimation.

They take that specific intermediate state and ask the model to finish the problem from that point, multiple times (e.g., 50 different completions).

  • If 45 out of 50 completions lead to the correct final answer, that intermediate step is highly valuable.
  • If only 2 out of 50 lead to the correct answer, that step is likely a hallucination or a logical error.

This process essentially estimates the expected future reward, or “value,” of a specific state.

Equation for estimating expected value via simulation.

As shown in the equation above, the estimated reward \(r_e\) for a trajectory at step \(t\) is the sum of correct outcomes \(r_f\) across \(K\) simulations.
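
To make this concrete, here is a minimal Python sketch of the Monte Carlo estimation described above. The helper names (`sample_completion`, `is_correct`) are hypothetical stand-ins for the model’s sampling call and the outcome checker, not the authors’ actual code.

```python
def estimate_step_value(prompt, partial_solution, sample_completion, is_correct, k=50):
    """Monte Carlo estimate of how promising an intermediate reasoning step is.

    From the partial solution, sample K full continuations and count how many
    reach the correct final answer. The success rate approximates the expected
    future reward ("value") of this intermediate state.
    """
    successes = 0
    for _ in range(k):
        completion = sample_completion(prompt, partial_solution)  # temperature > 0 sampling
        if is_correct(completion):
            successes += 1
    return successes / k  # e.g., 45/50 = 0.9 -> a highly valuable step
```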

Step 2: Training a Process Reward Model (PRM)

Running 50 simulations for every step is far too expensive to repeat every time a new reasoning trace needs to be scored. So, the researchers use the data collected in Step 1 to train a Process Reward Model (PRM).

This PRM is a classifier that learns to predict the “value” of a reasoning step directly. It looks at a half-finished solution and predicts: If we continue from here, what is the probability of getting the right answer?

The PRM is trained using the dataset constructed from the simulations:

Dataset definition for reward modeling.

By training this model, they smooth out the noise from the random simulations and create a fast, efficient scorer for reasoning steps.
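
The paper’s exact PRM architecture and loss are not reproduced here, but a common recipe is a language-model backbone with a small value head, trained to match the simulated step values from Step 1. The PyTorch sketch below illustrates that idea under those assumptions; the Hugging Face-style `last_hidden_state` access and the choice of binary cross-entropy are assumptions of this sketch, not necessarily the authors’ setup.

```python
import torch
import torch.nn as nn

class ProcessRewardModel(nn.Module):
    """Scores a partial reasoning trace: P(correct final answer | steps so far)."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone              # e.g., a Transformer encoder of the partial trace
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_token = hidden[:, -1, :]         # summarize the trace with its final token state
        return torch.sigmoid(self.value_head(last_token)).squeeze(-1)

# Training target: the simulated success rate from Step 1 (e.g., 45/50 = 0.9).
# Fitting it with binary cross-entropy smooths out the noise of the simulations.
def prm_loss(predicted_value, simulated_value):
    return nn.functional.binary_cross_entropy(predicted_value, simulated_value)
```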

Step 3: Constructing Preferences

Now the researchers have a way to score any reasoning path. They generate pairs of solutions for the same problem and calculate a trajectory-level reward for each.

The reward for a full trajectory isn’t just “did it get the right answer?” It is the accumulated confidence of the PRM at every step along the way.

Equation for trajectory level reward calculation.

In this equation, \(f_{\text{prm}}\) represents the score from the Process Reward Model. This ensures that a trajectory is only considered “good” if the process was sound, not just if the model got lucky with the final answer.
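
The exact accumulation rule is not spelled out in this summary, so the sketch below should be read as one plausible interpretation: score every prefix of the trajectory with the PRM and sum the scores. The function `prm_score` is a hypothetical wrapper around the trained PRM.

```python
def trajectory_reward(steps, prm_score):
    """Score a full reasoning trajectory by accumulating per-step PRM scores.

    `steps` is the list of intermediate reasoning steps; `prm_score(prefix)`
    returns the PRM's confidence for the trace up to and including that step.
    Summation is an assumption here -- a mean or minimum over steps would
    follow the same pattern.
    """
    reward = 0.0
    for t in range(1, len(steps) + 1):
        reward += prm_score(steps[:t])
    return reward
```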

Step 4: Direct Preference Optimization (DPO)

Finally, the policy model (the LLM itself) is trained using Direct Preference Optimization (DPO).

DPO is a stable and efficient alternative to reinforcement learning methods like PPO. It works by taking pairs of outputs—one “winning” (\(y_w\)) and one “losing” (\(y_l\))—and adjusting the model’s probabilities to favor the winner.

DPO Loss Function.
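
For reference, the standard DPO objective (Rafailov et al., 2023) can be written as:

\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]

where \(\pi_\theta\) is the policy being trained, \(\pi_{\text{ref}}\) is a frozen reference model, \(\sigma\) is the sigmoid function, and \(\beta\) controls how far the policy may drift from the reference.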

However, unlike standard DPO which only cares about the final answer, this method uses the Process Reward to determine the winner. A trajectory is chosen as the winner only if its accumulated process reward is significantly higher than the alternative. This creates a dataset \(\mathcal{D}_p\) of high-quality reasoning pairs:

Process-supervised preference dataset definition.

This method, dubbed pDPO (Process-supervised DPO), forces the model to learn the structure of good reasoning, effectively distilling the “System 2” planning capability into the model’s standard weights.
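
Putting Steps 3 and 4 together, the sketch below shows one way such a preference set could be assembled: score each candidate trajectory with the accumulated PRM reward and keep a (winner, loser) pair only when the margin between them is large. The margin threshold and helper names are illustrative assumptions, not the paper’s exact recipe.

```python
def build_preference_pairs(problems, sample_trajectories, score_fn, margin=0.5):
    """Construct a process-supervised preference dataset for pDPO.

    `score_fn(trajectory)` is assumed to wrap the trajectory-level reward from
    Step 3 (the PRM scores accumulated over all steps). For each problem,
    sample candidate solutions and keep a (winner, loser) pair only when the
    winner's process reward clearly exceeds the loser's.
    """
    pairs = []
    for x in problems:
        candidates = sample_trajectories(x)            # e.g., a handful of sampled solutions
        scored = sorted(candidates, key=score_fn, reverse=True)
        y_w, y_l = scored[0], scored[-1]               # best vs. worst by process reward
        if score_fn(y_w) - score_fn(y_l) >= margin:
            pairs.append((x, y_w, y_l))                # these pairs feed the DPO loss above
    return pairs
```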

Experimental Results

The researchers evaluated their Llama2-7B model (fine-tuned with pDPO) on standard benchmarks for Logical Reasoning (LogiQA, ReClor) and Mathematical Reasoning (GSM8K).

Beating the Giants

The results were striking. As shown in Table 1, the 7B parameter model trained with this method (Llama2-7B-pDPO) outperformed GPT-3.5-Turbo on the LogiQA-v2 dataset.

Table of experimental results on logical reasoning benchmarks.

Looking at the table:

  • Llama2-7B-SFT (standard fine-tuning) achieves 45.5 on LogiQA.
  • Llama2-7B-DPO (standard DPO) bumps that to 53.1.
  • Llama2-7B-pDPO (this method) reaches 55.5.

This demonstrates that adding synthesized process supervision yields significant gains over outcome supervision alone.

Data Efficiency

One of the most promising findings is how data-efficient this method is. Because every intermediate step provides a learning signal (rather than just one signal at the very end of the generated text), the model learns much faster.

Figure 4 shows that pDPO (the red line) consistently outperforms standard DPO (blue line) and SFT (green line), even when trained on only 40% or 60% of the available data.

Accuracy comparison charts showing pDPO outperforming DPO and SFT across different data ratios.

Quality of Reasoning

Does the model actually reason better, or is it just gaming the metrics? To test this, the researchers used GPT-4 to act as a judge, comparing the rationales generated by standard DPO vs. pDPO.

They evaluated the outputs on three criteria: Reasonable (valid deduction), Concise (no fluff), and Logically Consistent.

Win rate chart showing pDPO consistently beating DPO in GPT-4 auto-evaluation.

As Figure 6 illustrates, pDPO wins the majority of the time. It produces reasoning traces that are not only more accurate but also more concise. By penalizing wandering, low-confidence steps via the Process Reward Model, the system learns to be direct and logical.

A Concrete Example

What does this look like in practice? The researchers provide a visual example of a solution generated by their model.

Example of a solution generated by the fine-tuned model.

In Figure 1, we see the model utilizing a clear “Thought -> Action -> Observation” pattern (ReAct format). The model breaks down a complex logic puzzle regarding political candidates, evaluating premises step-by-step. The pDPO training allows the model to maintain this structure without getting lost or hallucinating contradictions, which is a common failure mode in smaller models solving logic grid puzzles.

Conclusion and Implications

The paper “Learning Planning-based Reasoning via Trajectories Collection and Process Reward Synthesizing” offers a compelling blueprint for the future of reasoning in small LLMs.

The key takeaways are:

  1. Simulation replaces Annotation: We don’t need expensive human experts to label every step of a math problem. We can simulate outcomes to estimate the value of intermediate steps.
  2. Training > Inference: Instead of running expensive search algorithms every time a user asks a question, we can perform that search offline and use the data to train the model. This gives us the “smarts” of planning with the speed of standard generation.
  3. Process beats Outcome: Supervising the process of reasoning leads to more robust models than supervising the answer alone.

This work suggests that the gap between open-source models (like Llama) and proprietary giants (like GPT-4) can be narrowed not just by adding more parameters, but by changing how the models are taught to think. By synthesizing the “System 2” thinking process into a reward signal, we are effectively teaching models to verify their own work before they even finish typing the sentence.