If you have ever asked a Large Language Model (LLM) like ChatGPT to solve a complex math problem, you might have noticed a fascinating quirk. Sometimes, the model gets the right answer for the wrong reasons. Other times, it starts perfectly, makes a single logical slip in the middle, and spirals into a hallucination.
This inconsistency stems from how these models process reasoning. They generate text token by token, and once a mistake is made, it’s hard for the model to recover. To fix this, researchers have been developing “Verifiers”—auxiliary models designed to check the LLM’s work.
Traditionally, these verifiers operate like a strict teacher grading a multiple-choice test: they mark the final answer as correct or incorrect. Or, in more advanced versions, they mark individual steps as right or wrong. But is reasoning really that black and white?
In a recent paper, researchers from Zhejiang University propose a more nuanced approach. Instead of binary “correct/incorrect” labels, they introduce the Tree-Based Preference Learning Verifier (Tree-PLV). This method teaches the verifier that some reasoning steps are better than others, even before the final answer is known.
In this post, we will dive into why traditional verification fails, how Tree-PLV constructs “reasoning trees” to capture nuance, and why this approach is setting new benchmarks in arithmetic and commonsense reasoning.
The Problem with Binary Thinking
Before we dissect the new solution, we need to understand the current state of LLM reasoning.
The “Best-of-N” Strategy
A common technique to boost LLM performance is the Best-of-N strategy. Instead of asking the model for one answer, we ask it to generate \(N\) different solutions (e.g., 64 different paths). Then, a Verifier ranks these solutions and picks the best one.
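In code, Best-of-N with a verifier is little more than “sample, score, take the max.” Here is a minimal sketch, where `generate_solution` and `verifier_score` are hypothetical stand-ins for the generator LLM and the trained Verifier:

```python
# Minimal Best-of-N sketch. `generate_solution` and `verifier_score` are
# hypothetical stand-ins for the generator LLM and the trained Verifier.

def best_of_n(problem, generate_solution, verifier_score, n=64):
    """Sample n candidate solutions and return the one the Verifier scores highest."""
    candidates = [generate_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda solution: verifier_score(problem, solution))
```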
Outcome vs. Process Supervision
How does the Verifier know which path is best? There are generally two schools of thought, as illustrated in the figure below:
- Outcome Supervision: The Verifier looks at the final answer. If it matches the ground truth, the whole path is labeled “Positive.” If not, it’s “Negative.” This is easy to do but noisy. A model might arrive at the correct answer through a lucky guess or flawed logic.
- Process Supervision: The Verifier checks every single step. This gives finer-grained feedback, but it requires either massive amounts of detailed, human-labeled data (which is expensive) or heuristic checks (which can be inaccurate). The toy snippet below contrasts the two label formats.
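To make the contrast concrete, here is a toy illustration (not the paper’s data format) of the label each scheme attaches to a single solution path:

```python
# Toy illustration of the two supervision signals (not the paper's data format).
path = [
    "Step 1: Let x be the number of apples.",
    "Step 2: Then 2x + 3 = 11, so x = 5.",      # arithmetic slip: x should be 4
    "Step 3: The answer is 5.",
]

outcome_label = 0           # outcome supervision: one label for the whole path
process_labels = [1, 0, 0]  # process supervision: one label per step
```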

As shown in Figure 1, traditional methods (left and middle) rely on binary labels. They ask, “Is this correct?”
The authors of the Tree-PLV paper argue that this binary approach is insufficient. Reasoning is often about relative quality. Step A might be “technically correct” but unhelpful, while Step B is a brilliant insight that simplifies the problem. A binary label treats them equally. Furthermore, binary classification is highly susceptible to label noise: if the training data contains mislabeled paths, the verifier rigidly internalizes those errors.
The Solution: Tree-Based Preference Learning
The researchers propose shifting from binary classification to preference learning. Instead of training a model to output a probability of “correctness” (a scalar value), they train it to compare two paths and identify which one is better.
To achieve this, they built a system called Tree-PLV. The core innovation lies in how they generate the training data and how they calculate the “score” for each step.
1. Constructing the Reasoning Tree
Most LLMs generate reasoning as a linear chain (Step 1 \(\rightarrow\) Step 2 \(\rightarrow\) Step 3). Tree-PLV visualizes reasoning as a tree. The root is the problem statement. From there, the model branches out into different possible first steps. From each of those, it branches into second steps, and so on.
The authors use a best-first search algorithm to build this tree: rather than exploring branches at random, the search actively pursues the most promising ones.

Figure 2 illustrates this process clearly.
- Expansion: The system takes the current state (e.g., the problem + Step 1) and generates several possible “next steps” (Step 2 candidates).
- Evaluation: It needs to know which candidate is best.
- Selection: It picks the winner and continues branching from there (a minimal version of this loop is sketched below).
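In pseudocode form, the expand-evaluate-select loop looks roughly like the sketch below, where `propose_steps` and `score_step` are hypothetical stand-ins for the generator LLM and the step evaluator:

```python
# Sketch of the expand-evaluate-select loop (a simplification, not the paper's code).
# `propose_steps` samples candidate next steps from the generator LLM;
# `score_step` rates a candidate step (its definition is the topic of the next section).

def build_reasoning_tree(problem, propose_steps, score_step, branching=4, max_depth=6):
    """Greedy best-first expansion: sample candidate next steps, score them,
    keep the scored siblings (they become preference pairs later), and
    continue from the best candidate."""
    path, levels = [], []
    for _ in range(max_depth):
        candidates = propose_steps(problem, path, n=branching)                      # Expansion
        if not candidates:
            break
        scored = [(step, score_step(problem, path, step)) for step in candidates]   # Evaluation
        levels.append((list(path), scored))      # remember all siblings at this depth
        best_step, _ = max(scored, key=lambda pair: pair[1])                        # Selection
        path.append(best_step)
    return path, levels
```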
But wait—how does the system know which step is “promising” before it even finishes the problem?
2. The Look-Ahead Reward Function
This is the clever part. To judge the quality of an intermediate step (let’s say, Step 2), the system performs a “look-ahead” simulation.
From that specific step, the model runs \(N\) simulations (called completions) until it reaches a final answer. The “Reward” for that step is simply the percentage of those simulations that end up with the correct answer.
The formula for the reward \(\mathcal{R}(y_i)\) for step \(y_i\) is:
\[
\mathcal{R}(y_i) = \frac{1}{N} \sum_{j=1}^{N} \mathbb{I}\big(a[P_i^j] = g\big)
\]
In this equation:
- \(N\) is the number of simulations run from step \(y_i\).
- \(\mathbb{I}\big(a[P_i^j] = g\big)\) is an indicator that equals 1 if the outcome of the \(j\)-th simulation matches the golden answer \(g\), and 0 otherwise.
- Essentially, if you stand at Step 2 and run 100 simulations, and 85 of them lead to the correct answer, the reward for Step 2 is 0.85.
This method leverages the model’s own “intuition” (or latent knowledge) to assign a granular score to every step, rather than a binary 0 or 1.
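Translated into code, the look-ahead reward is a small Monte Carlo estimate. The sketch below assumes hypothetical `complete_from` and `extract_answer` helpers standing in for the real model calls:

```python
# Sketch of the look-ahead reward (hypothetical helpers, not the paper's code).
# `complete_from` rolls a partial reasoning path out to a final answer;
# `extract_answer` pulls the final answer out of a completion.

def lookahead_reward(problem, partial_path, golden_answer,
                     complete_from, extract_answer, n_rollouts=8):
    """R(y_i): the fraction of n_rollouts completions from this step that
    end at the golden answer."""
    hits = 0
    for _ in range(n_rollouts):
        completion = complete_from(problem, partial_path)   # one simulation P_i^j
        if extract_answer(completion) == golden_answer:     # indicator 1(a[P_i^j] = g)
            hits += 1
    return hits / n_rollouts
```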
3. Creating Training Pairs
Once the tree is built and every node has a reward score, the researchers generate the training data. They don’t feed the raw scores to the Verifier. Instead, they create pairs.
They look at “sibling” nodes—two different steps branching from the same parent.
- Path A (\(y^+\)): A step with a high reward (leads to the correct answer often).
- Path B (\(y^-\)): A step with a low reward (leads to wrong answers).
They filter these pairs to ensure there is a significant margin of difference between them (to avoid confusing the model with two equally mediocre steps). This results in a dataset of triplets: {Problem, Better Path, Worse Path}.
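Concretely, pair construction can be sketched as a filter over the scored siblings collected during tree expansion (reusing the `levels` structure from the earlier sketch; the margin value here is illustrative, not taken from the paper):

```python
# Sketch of pair construction from scored sibling steps. Reuses the `levels`
# structure from the tree sketch above; the margin value is illustrative.

def build_preference_pairs(problem, levels, margin=0.3):
    """Pair the highest- and lowest-reward siblings at each depth,
    keeping the pair only if their reward gap exceeds `margin`."""
    pairs = []
    for prefix, scored_siblings in levels:
        if len(scored_siblings) < 2:
            continue
        best_step, best_r = max(scored_siblings, key=lambda s: s[1])
        worst_step, worst_r = min(scored_siblings, key=lambda s: s[1])
        if best_r - worst_r >= margin:
            pairs.append({
                "problem": problem,
                "better": prefix + [best_step],   # y+ (high look-ahead reward)
                "worse": prefix + [worst_step],   # y- (low look-ahead reward)
            })
    return pairs
```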
4. Pairwise Training
Finally, the Verifier is trained using a ranking loss. The goal is to maximize the difference between the score assigned to the “Better Path” and the “Worse Path.”
\[
\mathcal{L} = -\mathbb{E}_{(x,\, y^+,\, y^-)}\left[\log \sigma\big(r_\theta(x, y^+) - r_\theta(x, y^-)\big)\right]
\]
Here \(r_\theta\) is the scalar score the Verifier assigns to a reasoning path and \(\sigma\) is the sigmoid function. The loss \(\mathcal{L}\) pushes the model to assign a higher score to \(y^+\) than to \(y^-\). By training on these comparisons, the Verifier learns the nuances of reasoning: it learns to recognize the subtle features that make one logical step superior to another.
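In PyTorch-style code, a ranking objective of this kind is typically the Bradley-Terry log-sigmoid loss sketched below; the paper’s exact formulation may differ in details such as margins or batching:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_better, score_worse):
    """Log-sigmoid ranking loss: minimized when the Verifier gives the better
    path a higher scalar score than the worse path."""
    return -F.logsigmoid(score_better - score_worse).mean()

# Usage sketch: `verifier` is any model that maps (problem, path) to a scalar score.
# loss = pairwise_ranking_loss(verifier(problem, better_path),
#                              verifier(problem, worse_path))
```

Because only the difference between the two scores matters, the Verifier is free to learn any consistent scale for “step quality” rather than an absolute probability of correctness.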
Experimental Results
The theory sounds solid, but does it work? The researchers tested Tree-PLV on four major benchmarks:
- Arithmetic: GSM8K (grade school math) and MATH (challenging competition math).
- Commonsense: CSQA and StrategyQA.
They compared Tree-PLV against strong baselines, including:
- Self-Consistency: The standard “majority vote” method.
- ORM (Outcome Reward Model): A verifier trained on binary outcome labels.
- Math-Shepherd: A state-of-the-art process verifier.
Accuracy Gains
The results were impressive. Tree-PLV outperformed all baselines across all datasets.

As seen in Table 1, the gains are substantial. For example, using the Mistral-7B model on the GSM8K dataset:
- Self-Consistency achieved 67.55%.
- Tree-PLV achieved 82.79%.
That is a jump of roughly 15 percentage points simply by changing how the verification is handled. Even on the extremely difficult MATH dataset, Tree-PLV improved performance from 17.00% to 26.80%.
The method also scaled well when using stronger generator models (like WizardMath), as shown below:

Efficiency and Robustness
One might worry that generating 64 solutions (Best-of-N) is computationally expensive. The researchers analyzed how the Verifier performs as the number of candidate solutions (\(N\)) increases.

Figure 3 shows that Tree-PLV (the purple line) is not only superior at \(N=64\), but it is also highly efficient. It achieves higher accuracy with just 10 solutions than the Self-Consistency baseline does with 64. This suggests that Tree-PLV is exceptionally good at identifying the “needle in the haystack” quickly.
Why Does It Work? The Power of Granularity
To prove that step-level preference was the secret sauce, the authors ran an ablation study comparing different levels of feedback:
- Instance-level binary (Did the whole path succeed?)
- Instance-level preference (Is this whole path better than that one?)
- Step-level preference (The Tree-PLV method).

The results in Figure 4 are clear. The red bars (Step-Level Preference) consistently beat the others. This confirms that the granular signal—teaching the model exactly where the reasoning went right or wrong—is superior to broad supervision.
Separating Truth from Confidence
A major issue with LLMs is that they are often confident even when they are wrong. A good verifier needs to distinguish between a “confident hallucination” and a “confident correct answer.”

Figure 6 visualizes this beautifully.
- Left (a): The raw LLM confidence scores for correct (green) and wrong (orange) answers overlap significantly. The model doesn’t know when it’s wrong.
- Right (b): The Tree-PLV verifier scores show a distinct separation. The green peak (correct) is far to the right, and the orange peak (wrong) is flattened to the left.
This separation capability is exactly what makes Tree-PLV so effective at ranking candidate solutions.
Conclusion
The Tree-PLV paper represents a significant step forward in making Large Language Models more reliable. By acknowledging that reasoning is not just a binary “True/False” game but a branching tree of better and worse decisions, the researchers have created a verification system that is:
- More Accurate: Significant gains on math and commonsense benchmarks.
- More Robust: Less sensitive to noisy training data.
- More Efficient: Finds the best answer with fewer attempts.
For students and practitioners in AI, this highlights a growing trend: improvements in LLMs won’t just come from bigger models or more data, but from better training objectives that align more closely with the complex, step-by-step nature of human reasoning. Moving from binary labels to preference learning is a prime example of this alignment in action.