If you have ever asked a Large Language Model (LLM) like ChatGPT to solve a complex math problem, you might have noticed a fascinating quirk. Sometimes, the model gets the right answer for the wrong reasons. Other times, it starts perfectly, makes a single logical slip in the middle, and spirals into a hallucination.
This inconsistency stems from how these models process reasoning. They generate text token by token, and once a mistake is made, it’s hard for the model to recover. To fix this, researchers have been developing “Verifiers”—auxiliary models designed to check the LLM’s work.
Traditionally, these verifiers operate like a strict teacher grading a multiple-choice test: they mark the final answer as correct or incorrect. Or, in more advanced versions, they mark individual steps as right or wrong. But is reasoning really that black and white?
In a recent paper, researchers from Zhejiang University propose a more nuanced approach. Instead of binary “correct/incorrect” labels, they introduce the Tree-Based Preference Learning Verifier (Tree-PLV). This method teaches the verifier that some reasoning steps are better than others, even before the final answer is known.
In this post, we will dive into why traditional verification fails, how Tree-PLV constructs “reasoning trees” to capture nuance, and why this approach is setting new benchmarks in arithmetic and commonsense reasoning.
The Problem with Binary Thinking
Before we dissect the new solution, we need to understand the current state of LLM reasoning.
The “Best-of-N” Strategy
A common technique to boost LLM performance is the Best-of-N strategy. Instead of asking the model for one answer, we ask it to generate \(N\) different solutions (e.g., 64 different paths). Then, a Verifier ranks these solutions and picks the best one.
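In code, Best-of-N with a verifier is little more than “sample, score, take the max.” Here is a minimal sketch, where `generate_solution` and `verifier_score` are hypothetical stand-ins for the generator LLM and the trained Verifier:

```python
# Minimal Best-of-N sketch. `generate_solution` and `verifier_score` are
# hypothetical stand-ins for the generator LLM and the trained Verifier.

def best_of_n(problem, generate_solution, verifier_score, n=64):
    """Sample n candidate solutions and return the one the Verifier scores highest."""
    candidates = [generate_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda solution: verifier_score(problem, solution))
```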
Outcome vs. Process Supervision
How does the Verifier know which path is best? There are generally two schools of thought, as illustrated in the figure below:
- Outcome Supervision: The Verifier looks at the final answer. If it matches the ground truth, the whole path is labeled “Positive.” If not, it’s “Negative.” This is easy to do but noisy. A model might arrive at the correct answer through a lucky guess or flawed logic.
- Process Supervision: The Verifier checks every single step. This gives finer-grained feedback, but it requires either massive amounts of detailed, human-labeled data (which is expensive) or heuristic checks (which can be inaccurate). The toy snippet below contrasts the two label formats.
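To make the contrast concrete, here is a toy illustration (not the paper’s data format) of the label each scheme attaches to a single solution path:

```python
# Toy illustration of the two supervision signals (not the paper's data format).
path = [
    "Step 1: Let x be the number of apples.",
    "Step 2: Then 2x + 3 = 11, so x = 5.",      # arithmetic slip: x should be 4
    "Step 3: The answer is 5.",
]

outcome_label = 0           # outcome supervision: one label for the whole path
process_labels = [1, 0, 0]  # process supervision: one label per step
```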

As shown in Figure 1, traditional methods (left and middle) rely on binary labels. They ask, “Is this correct?”
The authors of the Tree-PLV paper argue that this binary approach is insufficient. Reasoning is often about relative quality. Step A might be “technically correct” but unhelpful, while Step B is a brilliant insight that simplifies the problem. A binary label treats them equally. Furthermore, binary classification is highly susceptible to label noise: if the training data contains mislabeled paths, the verifier rigidly internalizes those errors.
The Solution: Tree-Based Preference Learning
The researchers propose shifting from binary classification to preference learning. Instead of training a model to output a probability of “correctness” (a scalar value), they train it to compare two paths and identify which one is better.
To achieve this, they built a system called Tree-PLV. The core innovation lies in how they generate the training data and how they calculate the “score” for each step.
1. Constructing the Reasoning Tree
Most LLMs generate reasoning as a linear chain (Step 1 \(\rightarrow\) Step 2 \(\rightarrow\) Step 3). Tree-PLV visualizes reasoning as a tree. The root is the problem statement. From there, the model branches out into different possible first steps. From each of those, it branches into second steps, and so on.
The authors use a best-first search algorithm to build this tree: rather than exploring branches at random, the search actively pursues the most promising ones.

Figure 2 illustrates this process clearly.
- Expansion: The system takes the current state (e.g., the problem + Step 1) and generates several possible “next steps” (Step 2 candidates).
- Evaluation: It needs to know which candidate is best.
- Selection: It picks the winner and continues branching from there (a minimal version of this loop is sketched below).
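In pseudocode form, the expand-evaluate-select loop looks roughly like the sketch below, where `propose_steps` and `score_step` are hypothetical stand-ins for the generator LLM and the step evaluator:

```python
# Sketch of the expand-evaluate-select loop (a simplification, not the paper's code).
# `propose_steps` samples candidate next steps from the generator LLM;
# `score_step` rates a candidate step (its definition is the topic of the next section).

def build_reasoning_tree(problem, propose_steps, score_step, branching=4, max_depth=6):
    """Greedy best-first expansion: sample candidate next steps, score them,
    keep the scored siblings (they become preference pairs later), and
    continue from the best candidate."""
    path, levels = [], []
    for _ in range(max_depth):
        candidates = propose_steps(problem, path, n=branching)                      # Expansion
        if not candidates:
            break
        scored = [(step, score_step(problem, path, step)) for step in candidates]   # Evaluation
        levels.append((list(path), scored))      # remember all siblings at this depth
        best_step, _ = max(scored, key=lambda pair: pair[1])                        # Selection
        path.append(best_step)
    return path, levels
```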
But wait—how does the system know which step is “promising” before it even finishes the problem?
2. The Look-Ahead Reward Function
This is the clever part. To judge the quality of an intermediate step (let’s say, Step 2), the system performs a “look-ahead” simulation.
From that specific step, the model runs \(N\) simulations (called completions) until it reaches a final answer. The “Reward” for that step is simply the percentage of those simulations that end up with the correct answer.
The formula for the reward \(\mathcal{R}(y_i)\) for step \(y_i\) is:
\[
\mathcal{R}(y_i) = \frac{1}{N} \sum_{j=1}^{N} \mathbb{I}\big(a[P_i^j] = g\big)
\]
In this equation:
- \(N\) is the number of simulations run from step \(y_i\).
- \(\mathbb{I}\big(a[P_i^j] = g\big)\) is an indicator that equals 1 if the outcome of the \(j\)-th simulation matches the golden answer \(g\), and 0 otherwise.
- Essentially, if you stand at Step 2 and run 100 simulations, and 85 of them lead to the correct answer, the reward for Step 2 is 0.85.
This method leverages the model’s own “intuition” (or latent knowledge) to assign a granular score to every step, rather than a binary 0 or 1.
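Translated into code, the look-ahead reward is a small Monte Carlo estimate. The sketch below assumes hypothetical `complete_from` and `extract_answer` helpers standing in for the real model calls:

```python
# Sketch of the look-ahead reward (hypothetical helpers, not the paper's code).
# `complete_from` rolls a partial reasoning path out to a final answer;
# `extract_answer` pulls the final answer out of a completion.

def lookahead_reward(problem, partial_path, golden_answer,
                     complete_from, extract_answer, n_rollouts=8):
    """R(y_i): the fraction of n_rollouts completions from this step that
    end at the golden answer."""
    hits = 0
    for _ in range(n_rollouts):
        completion = complete_from(problem, partial_path)   # one simulation P_i^j
        if extract_answer(completion) == golden_answer:     # indicator 1(a[P_i^j] = g)
            hits += 1
    return hits / n_rollouts
```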
3. Creating Training Pairs
Once the tree is built and every node has a reward score, the researchers generate the training data. They don’t feed the raw scores to the Verifier. Instead, they create pairs.
They look at “sibling” nodes—two different steps branching from the same parent.
- Path A (\(y^+\)): A step with a high reward (leads to the correct answer often).
- Path B (\(y^-\)): A step with a low reward (leads to wrong answers).
They filter these pairs to ensure there is a significant margin of difference between them (to avoid confusing the model with two equally mediocre steps). This results in a dataset of triplets: {Problem, Better Path, Worse Path}.
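Concretely, pair construction can be sketched as a filter over the scored siblings collected during tree expansion (reusing the `levels` structure from the earlier sketch; the margin value here is illustrative, not taken from the paper):

```python
# Sketch of pair construction from scored sibling steps. Reuses the `levels`
# structure from the tree sketch above; the margin value is illustrative.

def build_preference_pairs(problem, levels, margin=0.3):
    """Pair the highest- and lowest-reward siblings at each depth,
    keeping the pair only if their reward gap exceeds `margin`."""
    pairs = []
    for prefix, scored_siblings in levels:
        if len(scored_siblings) < 2:
            continue
        best_step, best_r = max(scored_siblings, key=lambda s: s[1])
        worst_step, worst_r = min(scored_siblings, key=lambda s: s[1])
        if best_r - worst_r >= margin:
            pairs.append({
                "problem": problem,
                "better": prefix + [best_step],   # y+ (high look-ahead reward)
                "worse": prefix + [worst_step],   # y- (low look-ahead reward)
            })
    return pairs
```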
4. Pairwise Training
Finally, the Verifier is trained using a ranking loss. The goal is to maximize the difference between the score assigned to the “Better Path” and the “Worse Path.”
\[
\mathcal{L} = -\mathbb{E}_{(x,\, y^+,\, y^-)}\left[\log \sigma\big(r_\theta(x, y^+) - r_\theta(x, y^-)\big)\right]
\]
Here \(r_\theta\) is the scalar score the Verifier assigns to a reasoning path and \(\sigma\) is the sigmoid function. The loss \(\mathcal{L}\) pushes the model to assign a higher score to \(y^+\) than to \(y^-\). By training on these comparisons, the Verifier learns the nuances of reasoning: it learns to recognize the subtle features that make one logical step superior to another.
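In PyTorch-style code, a ranking objective of this kind is typically the Bradley-Terry log-sigmoid loss sketched below; the paper’s exact formulation may differ in details such as margins or batching:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_better, score_worse):
    """Log-sigmoid ranking loss: minimized when the Verifier gives the better
    path a higher scalar score than the worse path."""
    return -F.logsigmoid(score_better - score_worse).mean()

# Usage sketch: `verifier` is any model that maps (problem, path) to a scalar score.
# loss = pairwise_ranking_loss(verifier(problem, better_path),
#                              verifier(problem, worse_path))
```

Because only the difference between the two scores matters, the Verifier is free to learn any consistent scale for “step quality” rather than an absolute probability of correctness.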
Experimental Results
The theory sounds solid, but does it work? The researchers tested Tree-PLV on four major benchmarks:
- Arithmetic: GSM8K (grade school math) and MATH (challenging competition math).
- Commonsense: CSQA and StrategyQA.
They compared Tree-PLV against strong baselines, including:
- Self-Consistency: The standard “majority vote” method.
- ORM (Outcome Reward Model): A verifier trained on binary outcome labels.
- Math-Shepherd: A state-of-the-art process verifier.
Accuracy Gains
The results were impressive. Tree-PLV outperformed all baselines across all datasets.

As seen in Table 1, the gains are substantial. For example, using the Mistral-7B model on the GSM8K dataset:
- Self-Consistency achieved 67.55%.
- Tree-PLV achieved 82.79%.
That is a jump of roughly 15 percentage points simply by changing how the verification is handled. Even on the extremely difficult MATH dataset, Tree-PLV improved performance from 17.00% to 26.80%.
The method also scaled well when using stronger generator models (like WizardMath), as shown below:

Efficiency and Robustness
One might worry that generating 64 solutions (Best-of-N) is computationally expensive. The researchers analyzed how the Verifier performs as the number of candidate solutions (\(N\)) increases.

Figure 3 shows that Tree-PLV (the purple line) is not only superior at \(N=64\), but it is also highly efficient. It achieves higher accuracy with just 10 solutions than the Self-Consistency baseline does with 64. This suggests that Tree-PLV is exceptionally good at identifying the “needle in the haystack” quickly.
Why Does It Work? The Power of Granularity
To prove that step-level preference was the secret sauce, the authors ran an ablation study comparing different levels of feedback:
- Instance-level binary (Did the whole path succeed?)
- Instance-level preference (Is this whole path better than that one?)
- Step-level preference (The Tree-PLV method).

The results in Figure 4 are clear. The red bars (Step-Level Preference) consistently beat the others. This confirms that the granular signal—teaching the model exactly where the reasoning went right or wrong—is superior to broad supervision.
Separating Truth from Confidence
A major issue with LLMs is that they are often confident even when they are wrong. A good verifier needs to distinguish between a “confident hallucination” and a “confident correct answer.”

Figure 6 visualizes this beautifully.
- Left (a): The raw LLM confidence scores for correct (green) and wrong (orange) answers overlap significantly. The model doesn’t know when it’s wrong.
- Right (b): The Tree-PLV verifier scores show a distinct separation. The green peak (correct) is far to the right, and the orange peak (wrong) is flattened to the left.
This separation capability is exactly what makes Tree-PLV so effective at ranking candidate solutions.
Conclusion
The Tree-PLV paper represents a significant step forward in making Large Language Models more reliable. By acknowledging that reasoning is not just a binary “True/False” game but a branching tree of better and worse decisions, the researchers have created a verification system that is:
- More Accurate: Significant gains on math and commonsense benchmarks.
- More Robust: Less sensitive to noisy training data.
- More Efficient: Finds the best answer with fewer attempts.
For students and practitioners in AI, this highlights a growing trend: improvements in LLMs won’t just come from bigger models or more data, but from better training objectives that align more closely with the complex, step-by-step nature of human reasoning. Moving from binary labels to preference learning is a prime example of this alignment in action.