Imagine you are a professor grading a math exam. You come across a student who has written the correct final answer, “42,” but their working out involves subtracting apples from oranges and dividing by the color blue. Do you give them full marks?
In the world of Large Language Models (LLMs), the answer has traditionally been “yes.”
Current methods for training LLMs to verify answers often focus solely on the final output. If the model guesses the right answer, it gets a reward, regardless of the logical leaps or hallucinations it took to get there. This creates a dangerous “Clever Hans” effect where models learn to act correctly but think incorrectly.
In this deep dive, we will explore a fascinating research paper titled “Rationale-Aware Answer Verification by Pairwise Self-Evaluation.” We will uncover why correct answers with flawed reasoning are a major problem, and look at a novel solution called REPS (Rationale Enhancement through Pairwise Selection) that teaches models to grade themselves not just on what they answer, but how they derived it.
The “Right Answer, Wrong Reason” Paradox
Reasoning is central to the promise of modern AI. With techniques like Chain-of-Thought (CoT) prompting, we ask LLMs to generate intermediate reasoning steps—or rationales—before arriving at a final answer. The assumption is that if the model explains its work, the answer is more likely to be correct and trustworthy.
However, LLMs are prone to hallucinations and logical inconsistencies. They might generate a chain of thought that is factually wrong yet somehow land on the correct final answer (perhaps by chance or memorization).
The researchers behind this paper posed a critical question: How often do models generate correct answers with flawed rationales?
To answer this, they sampled model-generated solutions on the StrategyQA dataset (a benchmark requiring multi-hop reasoning) and used GPT-4 to judge the quality of each rationale. The results were startling:
- 59% of the solutions generated had the correct final answer.
- However, only 19% of those correct solutions actually had valid, logical rationales.
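Putting those two numbers together gives a sense of scale:
\[
0.59 \times 0.19 \approx 0.11
\]
In other words, only about one in nine generated solutions is both correct and soundly reasoned, while roughly 81% of the “correct” solutions rest on flawed rationales.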
This reveals a massive gap. If we train a “Verifier” model (a model designed to score the quality of an answer) using only the final answer as the ground truth, we are feeding it “noisy” data. We are effectively teaching the verifier that bad logic is acceptable as long as the guess is lucky.

As shown in Figure 1, a standard verifier (blue) treats all “correct answer” solutions as positive training examples. This confuses the model, making it unable to distinguish between a hallucinated explanation and a sound deduction. A rationale-aware verifier (yellow), however, is trained to filter out the lucky guesses.
Background: How Verifiers Work
Before we look at the solution, let’s briefly establish how Answer Verification typically works.
In a standard setup, you have two components:
- Generator (\(M_g\)): The LLM that produces multiple candidate solutions (rationale + answer) for a question.
- Verifier (\(M_v\)): A model that looks at a candidate solution and predicts the probability that it is correct.
During inference (actual use), the Generator creates several options, and the Verifier picks the best one.
\[
s^{*} = \arg\max_{s_i} M_v(s_i \mid q)
\]
The equation above simply says: the selected solution \(s^*\) is the one that the Verifier \(M_v\) assigns the highest score to, given the question \(q\).
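As a quick illustration, here is what that selection step looks like in code. This is a minimal sketch, not the paper’s implementation; the verifier is abstracted as a callable that returns a correctness probability:

```python
from typing import Callable, Sequence

def select_best(question: str,
                candidates: Sequence[str],
                verifier: Callable[[str, str], float]) -> str:
    # Pick the candidate solution the verifier scores highest for this question,
    # i.e. s* = argmax_i M_v(s_i | q).
    return max(candidates, key=lambda s: verifier(question, s))
```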
The Verifier is usually trained as a binary classifier: it looks at a solution and tries to output 1 (correct) or 0 (incorrect). The training objective is a standard binary cross-entropy loss between the verifier’s score and the label:
\[
\mathcal{L} = -\sum_{i} \Big[\, y_i \log M_v(s_i \mid q_i) + (1 - y_i)\log\big(1 - M_v(s_i \mid q_i)\big) \,\Big]
\]
The problem lies in \(y_i\) (the label). In standard approaches, \(y_i = 1\) if the final answer matches the gold standard. The researchers argue this is insufficient. \(y_i\) should only be 1 if the answer matches AND the rationale is sound.
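To make the contrast concrete, here is a minimal Python sketch of the two labeling rules. The `Solution` container and the `rationale_is_valid` judge are assumptions for illustration; in the paper the judging role is played by GPT-4 (for analysis) or by REPS itself (described below):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Solution:
    rationale: str
    answer: str

def standard_label(sol: Solution, gold: str) -> int:
    # Standard verifier training: any solution whose final answer matches is positive.
    return int(sol.answer.strip().lower() == gold.strip().lower())

def rationale_aware_label(sol: Solution, gold: str,
                          rationale_is_valid: Callable[[str], bool]) -> int:
    # Rationale-aware training: the answer must match AND the rationale must be sound.
    return int(standard_label(sol, gold) == 1 and rationale_is_valid(sol.rationale))
```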
The Impact of Data Quality
Does training on better rationales actually help? The authors conducted a controlled experiment to find out. They created three datasets to train a verifier:
- Low-Quality: Created by artificially attaching correct answers to incorrect reasoning.
- Baseline: The standard approach—any solution with the correct final answer is a positive example.
- High-Quality: Only solutions where GPT-4 verified the reasoning logic are positive examples.
They then tested these verifiers on a special test set designed to trick them (containing options with correct answers but bad logic).

Figure 2 illustrates the results. Notice the “Rationale Accuracy” (the orange bars). The verifier trained on the High-Quality dataset (far right) is significantly better at spotting valid logic than the Baseline.
Furthermore, the researchers found a direct correlation between the amount of high-quality data and performance.

As shown in Figure 3, as you increase the ratio of valid rationales in the training set (moving from left to right), the Rationale Accuracy (orange line) climbs steadily. This shows that rationale quality is crucial for building trustworthy AI.
The Core Method: REPS
We know we need high-quality rationales to train a good verifier. But we can’t manually annotate millions of reasoning paths, and using GPT-4 for everything is expensive and slow. We need a way to automate this.
The authors introduce REPS: Rationale Enhancement through Pairwise Selection.
REPS relies on an interesting property of LLMs: they are often better at comparing two options (A vs. B) than they are at generating a perfect answer from scratch or assigning an abstract score (1-10) to a single answer.
The REPS Process
The method works in four main phases, visualized below:

Phase 1 (Generation): The Generator model produces a set of candidate solutions for a question.
Phase 2 (Answer Filtering): We immediately discard any solution where the final answer is wrong. (We assume that if the answer is wrong, the rationale is definitely not useful for training.)
Phase 3 (The Tournament, Pairwise Selection): This is the innovation. We take the remaining solutions (which all have the “correct” answer but varying qualities of reasoning) and enter them into a tournament:
- The LLM acts as a judge.
- It looks at two rationales (\(r_i\) and \(r_j\)).
- It decides which one is more factually grounded and logically consistent.
- The winner advances to the next round.
- This repeats until one “champion” rationale remains.
Phase 4 (Training): The champion rationale is labeled as a “positive” sample (\(y=1\)). The losers (and the wrong-answer solutions) are negative samples. We use this clean dataset to train the Verifier.
The paper lays this procedure out as pseudocode; a sketch of the same flow follows.
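Below is a minimal Python rendering of the pipeline, with a few simplifying assumptions of our own: candidates are plain (rationale, answer) pairs, answer filtering is an exact string match, the bracket is single elimination, and `judge` stands in for the pairwise self-evaluation prompt.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# A candidate solution is a (rationale, final_answer) pair. The judge is an assumed
# LLM call that sees the question plus two rationales and returns 0 if it prefers
# the first one shown, 1 if it prefers the second.
Candidate = Tuple[str, str]
Judge = Callable[[str, str, str], int]

def reps_champion(question: str,
                  solutions: Sequence[Candidate],
                  gold_answer: str,
                  judge: Judge,
                  votes_per_match: int = 3) -> Optional[Candidate]:
    # Phase 2: discard every solution whose final answer is wrong.
    pool: List[Candidate] = [s for s in solutions if s[1] == gold_answer]
    if not pool:
        return None
    random.shuffle(pool)  # random initial bracket

    # Phase 3: single-elimination tournament over the surviving rationales.
    while len(pool) > 1:
        next_round: List[Candidate] = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            # Majority vote over several judge calls; swapping the presentation order
            # between calls is one common way to dampen position bias.
            votes_for_a = sum(
                (judge(question, a[0], b[0]) == 0) if v % 2 == 0
                else (judge(question, b[0], a[0]) == 1)
                for v in range(votes_per_match)
            )
            next_round.append(a if 2 * votes_for_a > votes_per_match else b)
        if len(pool) % 2 == 1:          # an odd candidate out gets a bye
            next_round.append(pool[-1])
        pool = next_round

    # Phase 4: the champion becomes the positive (y = 1) training example;
    # the losers and the wrong-answer solutions become negatives.
    return pool[0]
```

In the actual method, the number of candidates \(N\) and the number of votes per comparison \(S\) are hyperparameters, which becomes important in the analysis further below.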

By using this tournament structure, REPS effectively filters out the “lucky guesses”—the solutions that got the right answer for the wrong reasons—leaving only the most robust reasoning paths for training.
Experimental Results
The authors tested REPS against the standard baseline on three distinct datasets:
- ARC-Challenge: Science questions requiring commonsense reasoning.
- DROP: Reading comprehension requiring arithmetic reasoning.
- StrategyQA: Multi-hop questions requiring implicit reasoning steps.
Metric: Rationale Accuracy vs. Task Performance
They measured two things:
- Task Performance: Can the verifier pick the correct final answer?
- Rationale Accuracy: Can the verifier pick the solution that is actually valid (logically sound)?

Formally, over the test set \(D_{\mathrm{test}}\), where \(S_i\) is the set of candidate solutions for question \(i\), \(s_{\mathrm{valid}}\) is the candidate with a valid rationale, and \(s_{\mathrm{good}}\) is the set of candidates with the correct final answer:
\[
\mathrm{RA} = \frac{1}{|D_{\mathrm{test}}|} \sum_{i=1}^{|D_{\mathrm{test}}|} \mathbb{I}\!\left[\arg\max_{s \in S_i} M_v(s) = s_{\mathrm{valid}}\right],
\qquad
\mathrm{AA} = \frac{1}{|D_{\mathrm{test}}|} \sum_{i=1}^{|D_{\mathrm{test}}|} \mathbb{I}\!\left[\arg\max_{s \in S_i} M_v(s) \in s_{\mathrm{good}}\right]
\]
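Here is a sketch of how the Rationale Accuracy metric could be computed from verifier scores; the data layout (each test question storing the index of its valid-rationale candidate) is an assumption for illustration, not the paper’s:

```python
from typing import Callable, Sequence

def rationale_accuracy(test_set: Sequence[dict],
                       verifier: Callable[[str, str], float]) -> float:
    # Each item: {"question": str, "candidates": [str, ...], "valid_idx": int}.
    # RA = fraction of questions where the verifier's top-scored candidate is the
    # one whose rationale is actually valid.
    hits = 0
    for item in test_set:
        scores = [verifier(item["question"], s) for s in item["candidates"]]
        hits += int(scores.index(max(scores)) == item["valid_idx"])
    return hits / len(test_set)
```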
The results were clear: REPS significantly improved Rationale Accuracy across all datasets.
- ARC: +14.1% improvement.
- StrategyQA: +8.8% improvement.
- DROP: +4.9% improvement.
Crucially, Task Performance (getting the right answer) remained stable or improved slightly. This means we are getting “safer” and more interpretable models without sacrificing accuracy.
Head-to-Head Comparison
To verify that REPS was actually selecting better rationales, the researchers pitted the REPS-selected rationales against the Baseline-selected rationales and asked GPT-4 to judge the winner.

As seen in Figure 6, REPS won the majority of the time (orange bars > 50%) across all three datasets. This confirms that the self-evaluation tournament successfully identifies superior reasoning.
The Comparison vs. Scoring Debate
You might wonder: Why do we need a tournament? Why not just ask the LLM to score each rationale from 0 to 100?
The authors compared REPS (pairwise comparison) against G-EVAL, a method that asks LLMs to assign a scalar score to a solution.

Table 2 shows that REPS consistently beats G-EVAL. This supports the hypothesis that LLMs are more reliable relative judges (A > B) than absolute judges (A is an 8/10). The pairwise mechanism is key to the method’s success.
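To make the distinction concrete, here are two illustrative prompt templates. The wording is hypothetical, not taken from the paper or from G-EVAL, but it captures the difference between a relative judgment and an absolute score:

```python
# Pairwise judgment, in the spirit of REPS: the model only has to say which is better.
PAIRWISE_PROMPT = """Question: {question}

Rationale A: {rationale_a}
Rationale B: {rationale_b}

Both rationales reach the same final answer. Which one is more factually grounded
and logically consistent? Reply with exactly "A" or "B"."""

# Scalar scoring, in the spirit of G-EVAL: the model must calibrate an absolute number.
SCALAR_PROMPT = """Question: {question}

Rationale: {rationale}

Rate how factually grounded and logically consistent this rationale is,
on a scale from 1 to 10. Reply with only the number."""
```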
Analysis: The Pitfalls of Length Bias
While REPS is powerful, the paper uncovers a fascinating limitation inherent in LLMs.
In the REPS algorithm, you can adjust:
- \(N\): The number of candidate solutions generated.
- \(S\): The number of comparisons (votes) per match.
Intuitively, increasing \(N\) (more candidates) and \(S\) (more rigorous voting) should improve the results. More diversity + more checking = better quality, right?
Surprisingly, the opposite happened.

Look at Figure 5.
- Left Chart: As \(N\) increases (x-axis), the accuracy (lines) actually drops. Simultaneously, the average length of the selected rationale (right y-axis) increases.
- Right Chart: As \(S\) increases, a similar trend occurs.
Why? This reveals a known bias in LLMs: verbosity bias. When acting as judges, LLMs tend to prefer longer answers, mistaking length for depth or correctness.
As the tournament gets larger (\(N\)) or more rigorous (\(S\)), the selection process inadvertently amplifies this bias. The “longer” answers keep winning the pairwise matchups, eventually crowding out the “correct but concise” answers. This suggests that while REPS is effective, it requires careful tuning of hyperparameters to prevent the model from drifting toward purely verbose outputs.
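One practical takeaway is to monitor the length of the selected champions while tuning \(N\) and \(S\). A small diagnostic sketch, reusing the hypothetical `reps_champion` helper from earlier:

```python
def mean_champion_length(questions, solution_sets, gold_answers, judge,
                         votes_per_match=3):
    # Average word count of the tournament winners. A steady climb as N (candidate
    # count) or S (votes per match) grows is a warning sign that verbosity bias,
    # rather than reasoning quality, is driving the selection.
    lengths = []
    for q, sols, gold in zip(questions, solution_sets, gold_answers):
        champ = reps_champion(q, sols, gold, judge, votes_per_match)
        if champ is not None:
            lengths.append(len(champ[0].split()))   # champ = (rationale, answer)
    return sum(lengths) / len(lengths) if lengths else 0.0
```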
Conclusion and Implications
The paper “Rationale-Aware Answer Verification by Pairwise Self-Evaluation” makes a compelling case that we cannot trust correct answers alone. As AI systems are integrated into critical decision-making processes, the reasoning behind an answer becomes just as important as the answer itself.
Key Takeaways:
- The Illusion of Correctness: A correct answer often hides flawed logic. In StrategyQA, 81% of correct answers had invalid rationales.
- REPS Works: By using the model to debate itself via pairwise comparisons, we can filter for high-quality rationales without human intervention.
- Training Matters: Verifiers trained on these refined rationales are much better at distinguishing between sound logic and hallucinations.
- Bias Awareness: We must remain vigilant about LLM biases, such as the preference for longer text, which can degrade performance if left unchecked.
This work represents a significant step toward “Scalable Oversight”—the ability to use AI to supervise other AI, ensuring that our models are not just lucky guessers, but reliable reasoners.