Large Language Models (LLMs) have become remarkably adept at tackling complex problems—from writing code to solving intricate math questions. A cornerstone of this success is their ability to “think out loud” through Chain-of-Thought (CoT) reasoning. By generating intermediate steps, an LLM can break down a problem logically and arrive at a more accurate solution.

However, CoT has a fundamental limitation: it’s like someone blurting out the first train of thought that comes to mind. The reasoning follows a single, linear path that might not always be the best or most logical one. This can lead to errors in tasks requiring deeper, multi-step deliberation.

To overcome this, researchers introduced Tree-of-Thought (ToT) reasoning. ToT enables an LLM to explore multiple reasoning paths simultaneously—much like a chess grandmaster evaluating different moves before deciding. At each step, the model generates several possible thoughts, scores their promise, and prunes the weaker branches. The result is more robust and deliberate reasoning. However, this extra deliberation comes at a steep price: it is extremely slow and computationally heavy, making ToT impractical for real-time applications.

This raises a natural question:
Can we combine the deep reasoning quality of Tree-of-Thought with the efficiency of Chain-of-Thought?

A recent paper from Sea AI Lab and Singapore Management University introduces a breakthrough: Chain of Preference Optimization (CPO). CPO distills the wisdom embedded in the ToT search process into a standard, fast LLM. It teaches the model to prefer good reasoning steps over bad ones, transferring the deliberative power of ToT into a single-path, CoT-style model that remains fast at inference time.

Comparison of Chain-of-Thought (CoT, a single linear path), Tree-of-Thought (ToT, a branching tree search), and Chain of Preference Optimization (CPO). CPO collects preference pairs from ToT's search trees and fine-tunes the model with Direct Preference Optimization (DPO), achieving ToT-level reasoning quality while keeping CoT-style single-path inference.

In this article, we’ll explore how CPO works, what makes it effective, and why it represents a major step forward in improving LLM reasoning.


Background: Thinking in Chains, Trees, and Preferences

Before unpacking CPO, let’s review the foundational methods it builds upon.

1. Chain-of-Thought (CoT)

Chain-of-Thought prompting helps models show their reasoning explicitly rather than jumping directly to conclusions. For instance, to answer “What is the nickname of the easternmost U.S. state?”, a CoT response might look like:

Step 1: The easternmost U.S. state is Maine.
Step 2: The nickname of Maine is The Pine Tree State.
Final answer: The Pine Tree State.

Here, each intermediate “thought” \(z_i\) is generated based on the initial input \(x\) and previous thoughts \(z_1, \dots, z_{i-1}\). It’s efficient, interpretable, and effective—but strictly linear, limiting its capacity to explore alternative reasoning paths.
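
Viewed procedurally, CoT decoding is a simple sequential loop: each new thought is sampled conditioned on the question and everything generated so far. Below is a minimal sketch; `sample_next_thought` is a hypothetical stand-in for a single LLM call, not a specific library API:

```python
from typing import Callable, List

def chain_of_thought(question: str,
                     sample_next_thought: Callable[[str, List[str]], str],
                     max_steps: int = 8) -> List[str]:
    """Generate thoughts z_1, ..., z_n one at a time, each conditioned on the
    question x and all previous thoughts z_1, ..., z_{i-1}."""
    thoughts: List[str] = []
    for _ in range(max_steps):
        z_i = sample_next_thought(question, thoughts)
        thoughts.append(z_i)
        if z_i.lower().startswith("final answer"):  # stop once an answer appears
            break
    return thoughts
```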

2. Tree-of-Thought (ToT)

Tree-of-Thought generalizes CoT into a search over possible reasoning paths. At each step, the model proposes multiple thoughts rather than one and evaluates which are most promising. Search and pruning algorithms—like breadth-first search (BFS)—guide the exploration. This results in deeper analysis and better answers but requires extensive computation during inference, as the branching and evaluation happen in real time.
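
For intuition, here is a minimal sketch of a BFS-style ToT search. The helpers `propose_thoughts` (sample several candidate next thoughts) and `score_state` (self-evaluate a partial reasoning state) stand in for LLM calls and are assumptions for illustration, not the paper's implementation:

```python
from typing import Callable, List, Tuple

def tot_bfs(question: str,
            propose_thoughts: Callable[[str, List[str]], List[str]],
            score_state: Callable[[str, List[str]], float],
            beam_width: int = 3,
            max_depth: int = 4) -> List[str]:
    """Breadth-first Tree-of-Thought search: at each depth, expand every kept
    state with several candidate thoughts, then prune to the best beam_width."""
    frontier: List[List[str]] = [[]]          # each state is the list of thoughts so far
    for _ in range(max_depth):
        candidates: List[Tuple[float, List[str]]] = []
        for state in frontier:
            for thought in propose_thoughts(question, state):
                new_state = state + [thought]
                candidates.append((score_state(question, new_state), new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = [state for _, state in candidates[:beam_width]]  # prune weak branches
        if any(state and "final answer" in state[-1].lower() for state in frontier):
            break
    return frontier[0]                        # most promising complete path
```

Every call to `propose_thoughts` and `score_state` is an LLM invocation, which is exactly why ToT inference is so much slower than a single CoT pass.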

3. Direct Preference Optimization (DPO)

Direct Preference Optimization is a powerful fine-tuning method used widely in aligning models with feedback (such as human preferences). It works by comparing pairs of outputs to the same prompt—a preferred (“winning”) answer \(\hat{y}_w\) and a dispreferred (“losing”) answer \(\hat{y}_l\). The model learns to favor winning responses by maximizing the probability ratio between them:

\[ \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_\text{ref}) = -\log\sigma\left(\beta\log\frac{\pi_\theta(\hat{y}_w|x)}{\pi_\text{ref}(\hat{y}_w|x)} - \beta\log\frac{\pi_\theta(\hat{y}_l|x)}{\pi_\text{ref}(\hat{y}_l|x)}\right) \]

Here, \(\pi_\theta\) is the model being trained, \(\pi_\text{ref}\) is the reference model (often the original base model), and \(\beta\) controls regularization strength. This approach doesn’t require explicit reward models or human labels—it directly fits the model to preference data.
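
As a concrete illustration, here is a minimal PyTorch sketch of the DPO loss above, assuming the summed log-probabilities of each response under the policy and reference models have already been computed (the function name and the default \(\beta = 0.1\) are illustrative, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor,   # log pi_theta(y_w | x), shape [batch]
             policy_logp_l: torch.Tensor,   # log pi_theta(y_l | x), shape [batch]
             ref_logp_w: torch.Tensor,      # log pi_ref(y_w | x),   shape [batch]
             ref_logp_l: torch.Tensor,      # log pi_ref(y_l | x),   shape [batch]
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs."""
    # Difference of log-probability ratios between winning and losing responses.
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigmoid(.) pushes the policy to assign higher relative probability to y_w.
    return -F.logsigmoid(logits).mean()
```

In practice the reference log-probabilities are computed with gradients disabled, so only \(\pi_\theta\) is updated.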

CPO cleverly merges ToT’s exploration with DPO’s fine-tuning, teaching models not only what reasoning works but also what reasoning doesn’t.


The Core Method: How Chain of Preference Optimization Works

CPO reframes the challenge of improving reasoning quality. Instead of training on just the final best reasoning path from ToT, it learns from all the intermediate choices—including the discarded, suboptimal alternatives. These implicit preferences form a rich training signal.

The method unfolds in two stages: creating preference data and training with the CPO objective.

Overview of the CPO framework. A ToT-style search generates and evaluates candidate thoughts; the preferred (winning) and dispreferred (losing) thoughts at each step are collected into per-step preference pairs used to fine-tune a fast CoT model with DPO.

Stage 1: Synthesizing the Chain of Preference Thoughts

To generate data, a ToT-style search is used:

  1. Thought Generation:
    Starting from a partial reasoning state, the LLM samples several possible next thoughts.
    Example: given “The easternmost state is Maine,” possible next thoughts might include “What is the capital of Maine?” or “What is the nickname of Maine?”.

  2. State Evaluation:
    The model self-evaluates each candidate's usefulness, classifying it as, for example, likely or impossible to lead to the correct answer. These judgments are used to prune weak reasoning branches.

  3. Search and Collection:
    A search algorithm such as BFS expands the promising nodes until a complete reasoning path with a final answer emerges. Once done:

    • Thoughts on the winning path (\(z_i^w\)) are labeled preferred.
    • Other sibling thoughts at the same step (\(z_i^l\)) become dispreferred.

This process automatically constructs a dataset of stepwise preference pairs \((z_i^w, z_i^l)\) for each reasoning step, denoted as \(\mathcal{D}\). Unlike prior methods that use only the final path, CPO captures preferences throughout the entire reasoning chain.
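
Collecting the pairs can be sketched as a walk down the winning path of the finished search tree. The `Node` structure below is illustrative (it simply records each thought, its children, and whether it lies on the selected path) and is not the paper's data format:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    thought: str
    on_winning_path: bool = False
    children: List["Node"] = field(default_factory=list)

def collect_preference_pairs(node: Node, context: str = "") -> List[Tuple[str, str, str]]:
    """Return (context s_{i-1}^w, preferred z_i^w, dispreferred z_i^l) triples.

    `node` is the root of the search tree (the question itself); only children
    along the winning path extend the context for deeper pairs."""
    pairs: List[Tuple[str, str, str]] = []
    winners = [c for c in node.children if c.on_winning_path]
    losers = [c for c in node.children if not c.on_winning_path]
    for w in winners:
        for l in losers:                  # siblings at the same step become dispreferred
            pairs.append((context, w.thought, l.thought))
        pairs += collect_preference_pairs(w, context + w.thought + "\n")
    return pairs
```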

Stage 2: Training with the CPO Objective

CPO applies the DPO loss at each reasoning step rather than to full paths. For each pair \((z_i^w, z_i^l)\) and the winning context \(s_{i-1}^w = (z_1^w, \dots, z_{i-1}^w)\) accumulated through the previous steps:

\[ \mathcal{L}_i(\pi_\theta; \pi_\text{ref}) = -\log\sigma\left(\beta\log\frac{\pi_\theta(z_i^w|x, s_{i-1}^w)}{\pi_\text{ref}(z_i^w|x, s_{i-1}^w)} - \beta\log\frac{\pi_\theta(z_i^l|x, s_{i-1}^w)}{\pi_\text{ref}(z_i^l|x, s_{i-1}^w)}\right) \]

The overall objective averages this per-step loss:

\[ \mathcal{L}_{\mathrm{CPO}}(\pi_\theta; \pi_\text{ref}) = \mathbb{E}_{(x, z_i^w, z_i^l, s_{i-1}^w)\sim \mathcal{D}}[\mathcal{L}_i(\pi_\theta; \pi_\text{ref})] \]
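
A rough sketch of the per-step loss, assuming a Hugging Face-style causal LM and tokenizer; the helper names and masking logic are illustrative, and only the tokens of the thought \(z_i\) contribute to the log-probability:

```python
import torch
import torch.nn.functional as F

def step_logprob(model, tokenizer, prefix: str, thought: str) -> torch.Tensor:
    """Sum of token log-probs of `thought` conditioned on `prefix` (x plus s_{i-1}^w)."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    thought_ids = tokenizer(thought, add_special_tokens=False,
                            return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, thought_ids], dim=-1)
    logits = model(input_ids).logits[:, :-1, :]        # position t predicts token t+1
    logps = F.log_softmax(logits, dim=-1)
    targets = input_ids[:, 1:]
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    start = prefix_ids.shape[-1] - 1                   # mask out prefix positions
    return token_logps[:, start:].sum()

def cpo_step_loss(policy, ref, tokenizer, prefix: str,
                  z_w: str, z_l: str, beta: float = 0.1) -> torch.Tensor:
    """DPO loss applied to a single (preferred, dispreferred) thought pair."""
    pi_w = step_logprob(policy, tokenizer, prefix, z_w)
    pi_l = step_logprob(policy, tokenizer, prefix, z_l)
    with torch.no_grad():                              # reference model stays frozen
        ref_w = step_logprob(ref, tokenizer, prefix, z_w)
        ref_l = step_logprob(ref, tokenizer, prefix, z_l)
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits)
```

Averaging `cpo_step_loss` over all pairs in \(\mathcal{D}\) recovers the overall objective above; in practice the pairs would be batched, but the sketch keeps one pair per call for clarity.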

After fine-tuning, the LLM can, at inference time, generate a single reasoning chain (like CoT) that mirrors the deliberate, well-considered patterns ToT would have discovered—without the computational overhead of performing a search every time.


Experiments: Benchmarks and Results

The researchers tested CPO extensively across seven datasets covering three reasoning types:

  • Question Answering (QA): Bamboogle, 2WikiMultiHopQA, HotpotQA
  • Fact Verification: FEVER, FEVEROUS, VitaminC
  • Arithmetic Reasoning: SVAMP

They applied CPO to LLaMA2-7B, LLaMA2-13B, and Mistral-7B, comparing with three baselines:

  1. CoT: Standard reasoning with a single path.
  2. ToT: Full tree-search inference (strong but slow).
  3. TS-SFT (Tree-Search Supervised Fine-Tuning): a CoT-style model fine-tuned only on the best reasoning paths found by ToT.

Experimental results across the reasoning tasks. CPO consistently achieves higher accuracy than CoT and TS-SFT, with inference latency close to CoT and far lower than ToT.

Key Findings

  • Better reasoning accuracy: CPO improves overall accuracy by an average of 4.3% compared to CoT.
  • Efficiency preserved: Despite matching ToT’s reasoning quality, CPO inference remains as fast as CoT—on average 57.5× faster than ToT.
  • Learning from mistakes matters: CPO outperforms TS-SFT because it learns not only from the winning reasoning paths but also from the losing ones, providing richer feedback signals.

Analysis: Understanding CPO’s Advantages

To probe why CPO works so effectively, the authors conducted detailed analyses. Two insights stand out.

1. The Power of Negative Examples

Is dispreferred data really necessary? The authors experimented by varying how much of the training data contained dispreferred thoughts—from 0% (only preferred thoughts) up to 100% (full CPO).

As the proportion of training data containing dispreferred thoughts increases from 0% to 100% (full CPO), accuracy rises steadily and consistently exceeds the SFT baseline.

Results show that incorporating negative examples consistently improves performance. In other words, teaching an LLM what not to think enhances what it does think.

2. The “Longest Common Prefix” Problem

Couldn’t we apply DPO directly to full reasoning paths? That’s the idea behind Full-path Preference Optimization (FPO). However, FPO runs into a subtle issue called Longest Common Prefix (LCP) Gradient Cancellation.

Per-step (CPO) versus full-path (FPO) preference optimization. CPO builds preference pairs at every reasoning step, so all tokens receive gradient updates; FPO compares complete paths, and the gradients of shared prefixes cancel.

Because reasoning paths often share a long prefix, for example:

Winning: A → B → C → D
Losing: A → B → C → E

the gradient contributions from the shared prefix (A → B → C) cancel mathematically, leaving a learning signal only for the final, diverging step. This wastes valuable supervision from the earlier reasoning choices.
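
To see the cancellation concretely, write each full path as the shared prefix \(p = (A, B, C)\) followed by its diverging final step, and expand the term inside the DPO loss (a sketch using the notation from the DPO formula above):

\[ \beta\log\frac{\pi_\theta(\hat{y}_w|x)}{\pi_\text{ref}(\hat{y}_w|x)} - \beta\log\frac{\pi_\theta(\hat{y}_l|x)}{\pi_\text{ref}(\hat{y}_l|x)} = \underbrace{\beta\log\frac{\pi_\theta(p|x)}{\pi_\text{ref}(p|x)} - \beta\log\frac{\pi_\theta(p|x)}{\pi_\text{ref}(p|x)}}_{=\,0} + \beta\log\frac{\pi_\theta(D|x,p)}{\pi_\text{ref}(D|x,p)} - \beta\log\frac{\pi_\theta(E|x,p)}{\pi_\text{ref}(E|x,p)} \]

Every term involving the shared prefix \(p\) appears twice with opposite signs and vanishes, so those tokens receive no gradient; only the diverging steps D and E contribute a learning signal.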

CPO avoids this entirely by optimizing every step individually, ensuring no loss of gradient information. Empirically, the paper shows FPO underperforms both CPO and even simple supervised fine-tuning, validating this theoretical insight.


Conclusion and Future Directions

Chain of Preference Optimization offers an elegant solution to a fundamental trade-off in LLM reasoning: maintaining both depth and speed. By mining preference information from every branch of a ToT-style reasoning process, CPO trains models to emulate deliberate, multi-path thinking while retaining fast, single-path inference.

Key Takeaways:

  • Knowledge distillation from search: CPO transfers ToT-style deliberation into a streamlined CoT model.
  • Learning from all steps: Utilizing both preferred and dispreferred reasoning enriches the model’s understanding.
  • Stepwise optimization beats full-path optimization: Per-step preference learning avoids gradient cancellation and strengthens training signals.

Looking ahead, CPO could be applied to broader reasoning frameworks like Graph-of-Thoughts or AlphaZero-style tree searches, and even facilitate weak-to-strong alignment, where smaller models generate preference data to train stronger ones.

By teaching models how to prefer better thoughts, CPO marks an exciting leap toward faster, smarter, and more deliberate AI reasoning.