Introduction

In the rapidly evolving world of Large Language Models (LLMs), Reinforcement Learning from Human Feedback (RLHF) has become the gold standard for alignment. It is the secret sauce that turns a raw, unruly text predictor into a helpful assistant like ChatGPT. The logic behind RLHF seems intuitive: we train a “Reward Model” (RM) to act as a proxy for human preferences, and then we use that model to grade the LLM’s outputs.

The prevailing wisdom in the AI community follows a simple heuristic: Better Reward Model = Better Language Model.

If the Reward Model is more accurate at predicting what humans like, the Language Model should logically become better at generating it. Researchers and engineers spend massive amounts of compute and data trying to squeeze every percentage point of accuracy out of their Reward Models, assuming it will pay linear dividends in the final product.

But what if that assumption is wrong?

A fascinating new study, The Accuracy Paradox in RLHF, challenges this foundational belief. Through extensive experimentation with Question Answering tasks, the researchers uncovered a counter-intuitive reality: the “best” Reward Models—those with the highest classification accuracy—often lead to worse Language Models. Instead, moderately accurate Reward Models frequently produce superior results.

In this post, we will break down this paradox, explore why “smarter” judges can make for “dumber” students, and look at the data that turns conventional RLHF wisdom on its head.

Background: How RLHF and Reward Models Work

To understand the paradox, we first need to establish how the training loop functions. RLHF generally consists of three steps (the first being supervised fine-tuning, or SFT), but we will focus on the interaction between the latter two:

  1. Reward Model Training: We collect a dataset where humans compare two answers and pick the better one. A model (the RM) is trained to predict this preference. Its performance is measured by accuracy—how often it agrees with the human choice (see the sketch after this list).
  2. Reinforcement Learning (PPO): The Language Model generates text. The Reward Model “grades” that text. The Language Model is then updated using an algorithm called Proximal Policy Optimization (PPO) to maximize those grades.
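To make step 1 concrete, here is a minimal sketch (not the paper's code) of the pairwise objective a Reward Model is commonly trained with, together with the accuracy metric the paradox revolves around. The `reward_model` interface, the batch layout, and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reward_model_step(reward_model, optimizer, chosen_ids, rejected_ids):
    """One pairwise-preference update: push r(chosen) above r(rejected).

    Assumptions: `reward_model` maps a batch of token ids to one scalar score
    per sequence; `chosen_ids` / `rejected_ids` hold the human-preferred and
    rejected answers for the same prompts.
    """
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)

    # Bradley-Terry style loss: widen the margin between preferred and rejected.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # "Accuracy" in the sense used throughout this post: how often the RM
    # ranks the human-preferred answer above the rejected one.
    accuracy = (r_chosen > r_rejected).float().mean().item()
    return loss.item(), accuracy
```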

The goal is to maximize the Language Model’s performance (\(\mathcal{P}_{\mathrm{LM}}\)). We typically assume this performance is a function of the Reward Model’s strength (\(S_{\mathrm{RM}}\)) and the training duration (\(\tau\)).

\[
\mathcal{P}_{\mathrm{LM}} = f\left(S_{\mathrm{RM}}, \tau\right)
\]

The researchers focused on three specific dimensions of text quality using the QA-FEEDBACK dataset:

  • Relevance: Does the answer address the question?
  • Factuality: Is the information true?
  • Completeness: Is the answer comprehensive?

They trained various versions of Reward Models based on the Longformer architecture, creating a spectrum of models ranging from “weak” to “highly accurate.”

Table 1: Training steps and accuracy ranges for reward models by task type.

As shown in Table 1 above, the researchers created a diverse set of judges. Some were trained for only a few steps (lower accuracy), and some were trained to convergence (high accuracy). The question was: Which judge produces the best student?

To verify the results objectively, the team didn’t just trust their own Reward Models. They used independent, high-accuracy “oracle” models to evaluate the final text generated by the LLMs.
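As a rough illustration of that evaluation setup (a sketch, not the authors' pipeline), the snippet below scores the answers produced under each Reward-Model checkpoint with an independent oracle and reports the mean score per checkpoint. The `oracle_score` callable and the `generations_by_rm` mapping are assumed interfaces.

```python
from typing import Callable, Dict, List

def evaluate_with_oracle(
    generations_by_rm: Dict[str, List[str]],
    oracle_score: Callable[[str], float],
) -> Dict[str, float]:
    """Mean oracle score of the text produced under each RM checkpoint.

    `generations_by_rm` maps an RM checkpoint label (e.g. "rm_step_200",
    a hypothetical name) to the answers generated by the policy trained
    against that RM; `oracle_score` stands in for an independent,
    high-accuracy judge model.
    """
    return {
        rm_name: sum(oracle_score(text) for text in texts) / len(texts)
        for rm_name, texts in generations_by_rm.items()
    }
```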

The Accuracy Paradox: Visualizing the “Sweet Spot”

If the conventional wisdom were true, we would expect a simple monotonic relationship: as Reward Model accuracy increases (y-axis), the Language Model's performance (z-axis/color) should increase as well.

However, the experimental results tell a completely different story. Let’s look at the 3D surface plots for the T5-small model across the three different tasks.

1. The Relevance Task

In the Relevance task, the goal is to keep the model on topic.

Figure 1: 3D surface plot evaluating relevance ratios for T5-small. Optimal performance was achieved with reward models having moderate accuracy.

Look at Figure 1. The vertical axis represents the final performance of the Language Model. You might expect the highest peak (the yellow/red zone) to sit at the high-accuracy end of the “RM Accuracy” axis.

Instead, the peak is in the middle. The most accurate Reward Models (at the far end of the accuracy axis) actually result in lower performance than those with moderate accuracy. The surface dips significantly as the Reward Model becomes “too good” or is trained for too many steps.

2. The Factuality Task

The results for Factuality (ensuring the model doesn’t hallucinate) show a similar trend.

Figure 2: 3D surface plot evaluating factuality ratios for T5-small. The best performance was seen with reward models of moderate accuracy.

In Figure 2, we again see a “sweet spot” (represented by the yellow zone). Pushing the Reward Model to its absolute limit in accuracy does not yield the most factual Language Model. The distinct curvature suggests that after a certain threshold, a stronger Reward Model begins to degrade the downstream performance of the generator.

3. The Completeness Task

Finally, for Completeness (providing thorough answers), the pattern holds firm.

Figure 3: 3D surface plot evaluating completeness rewards for T5-small. Intermediate reward model strength yielded the best language model performance.

Figure 3 demonstrates that intermediate Reward Model strength yields the best outcomes. The drop-off at high accuracy levels is stark.

The Verdict

Across all three tasks—and verified on larger models like T5-base and T5-large (see paper appendices)—the data is consistent. Optimal performance is achieved with reward models of moderate accuracy. This is the Accuracy Paradox.

Deep Dive: Why Do “Worse” Judges Work Better?

To understand why this happens, the authors compared the behaviors of two specific types of Reward Models during the training process:

  1. The Most Accurate RM: The model with the highest binary classification score on the test set.
  2. The Best-Performing RM: The model (usually moderate accuracy) that actually produced the best LLM.

By analyzing the rewards these two models distributed during training, we can identify distinct strategies.

Aggressive vs. Conservative Rewarding

In the Relevance task, the Best-Performing RM adopted a surprising strategy: it was more “aggressive.”

Figure 4: Reward analysis for relevance task (T5-small model): training steps vs. rewards (left), mean and variance of rewards (right).

As seen in Figure 4, the Best-Performing RM (green/teal dots) gave rewards with a significantly higher mean and higher variance compared to the Most Accurate RM (orange dots).

Why does this help? High variance in rewards lets the Language Model distinguish clearly between “okay” and “great” outputs, creating a stronger gradient signal. The “Most Accurate” RM, perhaps being too confident or rigid, gave flatter, lower rewards and failed to push the model toward better relevance.
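A simple way to see what kind of signal a judge is sending (a sketch that assumes the scalar rewards assigned during PPO are logged as plain floats, which is not spelled out in the paper) is to compare the mean and variance of the rewards each Reward Model hands out:

```python
import statistics
from typing import Dict, List

def summarize_reward_signal(
    rewards_by_rm: Dict[str, List[float]],
) -> Dict[str, Dict[str, float]]:
    """Compare the reward distributions produced by different reward models.

    `rewards_by_rm` maps an RM name to the scalar rewards it assigned across
    PPO training steps (an assumed logging format).
    """
    summary = {}
    for name, rewards in rewards_by_rm.items():
        summary[name] = {
            "mean": statistics.fmean(rewards),
            # Higher variance -> sharper separation between "okay" and "great"
            # samples, i.e. a stronger gradient signal for PPO.
            "variance": statistics.pvariance(rewards),
        }
    return summary
```

Under this framing, the “aggressive” Relevance judge above would show both a higher mean and a higher variance than the Most Accurate RM.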

However, the strategy changes depending on the task. Look at the Completeness task in Figure 6 below.

Figure 6: Reward analysis for completeness task (T5-small model): training steps vs. rewards (left), mean and variance of rewards (right).

Here, the Best-Performing RM (green) actually gave lower rewards on average (a lower mean) but maintained higher variance. This “conservative” strategy likely prevents the model from rambling just to maximize length (a common loophole in completeness tasks). It penalizes effectively while still offering enough variance to guide the model.

The takeaway: Moderately accurate models seem to naturally possess reward distributions (variance and mean) that are better suited for the dynamics of Reinforcement Learning, whereas highly accurate models may become too rigid or prone to overfitting the specific examples they were trained on.

The Role of KL Divergence: Stability vs. Overfitting

The final piece of the puzzle lies in KL Divergence.

In RLHF, we don’t want the Language Model to drift too far from its original training (Supervised Fine-Tuning). We measure this drift using Kullback-Leibler (KL) Divergence.

\[
D_{\mathrm{KL}}\left(\pi_{\mathrm{RL}} \,\|\, \pi_{\mathrm{SFT}}\right)
= \mathbb{E}_{x \sim \pi_{\mathrm{RL}}}\left[\log \frac{\pi_{\mathrm{RL}}(x)}{\pi_{\mathrm{SFT}}(x)}\right]
\]

Ideally, we want the model to learn (drift slightly) without breaking (drifting massively or collapsing).
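In most RLHF setups this “leash” is implemented as a per-token KL penalty folded into the reward that PPO maximizes. The sketch below shows that common formulation (a generic illustration rather than the paper's exact configuration); the tensor shapes and the `beta` coefficient are assumptions.

```python
import torch

def penalized_rewards(
    rm_reward: torch.Tensor,        # scalar RM reward per sequence, shape (batch,)
    logprobs_policy: torch.Tensor,  # log pi_RL(token), shape (batch, seq_len)
    logprobs_sft: torch.Tensor,     # log pi_SFT(token), shape (batch, seq_len)
    beta: float = 0.05,             # assumed strength of the KL "leash"
) -> torch.Tensor:
    """Per-token rewards with a KL penalty, as commonly used in RLHF-PPO.

    The per-token term (log pi_RL - log pi_SFT) is a sample-based estimate of
    the divergence between the current policy and the SFT model.
    """
    kl_per_token = logprobs_policy - logprobs_sft  # (batch, seq_len)
    rewards = -beta * kl_per_token                 # penalize drift at every token
    rewards[:, -1] += rm_reward                    # add the RM score at the final token
    return rewards
```

A larger `beta` keeps the policy closer to the SFT model; a smaller one gives it more room to explore, which is exactly the trade-off the KL curves below visualize.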

When the researchers analyzed the KL divergence trends, they found that the “Most Accurate” Reward Models often caused unstable or restrictive training dynamics.

Stability in Factuality

In the Factuality task, we see a clear distinction in how the models allow the LLM to learn.

Figure 8: Factuality task KL divergence (T5-small model): training steps vs. KL divergence (left), mean and variance of KL divergence (right).

In Figure 8, the Best-Performing RM (green) allows for higher KL divergence (around 4.0) compared to the Most Accurate RM (orange), which stays lower.

This suggests the Most Accurate RM might be constraining the Language Model too much. By penalizing deviations too strictly, it prevents the model from exploring new ways to be factual. The moderately accurate model acts as a “loose leash,” allowing enough exploration to find better answers without wandering off completely.

Flexibility in Completeness

The Completeness task (Figure 9) shows the Best-Performing RM exhibiting higher variance in KL divergence.

Figure 9: Completeness task KL divergence (T5-small model): training steps vs. KL divergence (left), mean and variance of KL divergence (right).

This higher variance indicates flexibility. The model can make large updates when necessary to learn a complex concept (like how to write a complete paragraph) and small updates elsewhere. The Most Accurate RM, by contrast, forces a more uniform, low-variance path that leads to suboptimal results.

Conclusion

The race for higher metrics in AI often blinds us to the dynamics of the system as a whole. This research paper provides a crucial course correction for anyone working with RLHF.

The key takeaways are:

  1. The Paradox is Real: A Reward Model with 95% accuracy might produce a worse Chatbot than one with 75% accuracy.
  2. Accuracy \(\neq\) Alignment: Classification accuracy is a static metric. RLHF is a dynamic process. Moderately accurate models often provide better training signals (better variance/mean balance).
  3. Avoid Overfitting the Judge: Highly accurate Reward Models may be overfitted to their training data, leading to “reward hacking” or overly constrained Language Models that fail to generalize.

Implications for the Future

This study suggests that instead of blindly maximizing Reward Model accuracy, practitioners should focus on “Goldilocks” models—judges that are accurate enough to be correct, but uncertain enough to allow the Language Model to explore and learn.

Future work in this space will likely focus on Out-of-Distribution (OOD) evaluation. We need to ensure that Reward Models aren’t just memorizing training examples but can generalize to new, unseen prompts. Until then, remember: in RLHF, a perfect teacher doesn’t always make a perfect student. Sometimes, a teacher who is willing to be a little flexible gets the best results.