Beyond Ranking: Why Your LLM Should Care About the Magnitude of Rewards

If you have played around with Large Language Models (LLMs) like ChatGPT or Claude, you know that “alignment” is the secret sauce. A base model trained on the internet is a chaotic completion engine; it takes Reinforcement Learning from Human Feedback (RLHF) to turn that chaos into a helpful assistant.

For a long time, the standard recipe for alignment was Proximal Policy Optimization (PPO). But PPO is complex, unstable, and computationally expensive. Recently, the field has shifted toward simpler, “order-based” methods like Direct Preference Optimization (DPO). These methods look at two answers—one good, one bad—and tell the model, “Prefer A over B.”

But there is a flaw in this logic. What if A is only slightly better than B? What if A is miles better than B? Order-based methods treat these scenarios almost identically. They care about the ranking, not the value.

In this post, we are doing a deep dive into the research paper “Don’t Forget Your Reward Values: Language Model Alignment via Value-based Calibration.” We will explore why ignoring the magnitude of rewards leads to suboptimal models and how a new method, Value-based CaliBration (VCB), fixes this by mathematically enforcing that the probability gap between answers matches their quality gap.

The Problem: When “Better” Isn’t Enough

Let’s start with the intuition. Imagine you are teaching a student.

  1. Scenario 1: The student writes an essay that scores 90/100. Their previous attempt was an 89/100.
  2. Scenario 2: The student writes an essay that scores 90/100. Their previous attempt was a 10/100.

In both cases, the new essay is “better.” If you simply tell the student “the new essay is better than the old one,” you are omitting crucial information. In Scenario 2, they should drastically shift their behavior toward the new essay; in Scenario 1, the difference is negligible.

Current popular alignment methods (like RRHF, SLiC, and DPO) are Order-based. They operate on preference pairs \((y_w, y_l)\), where \(y_w\) is the winner and \(y_l\) is the loser. They optimize the model to increase the probability of the winner and decrease that of the loser.

The researchers highlight a critical limitation here: These methods discard the actual reward values.
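
For a concrete point of reference, the standard DPO objective (from the original DPO paper, not from the VCB paper) looks like this, where \(\beta\) controls the strength of the KL regularization toward the SFT model:

\[
\mathcal{L}_{DPO} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi(y_w|x)}{\pi_{sft}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{sft}(y_l|x)} \right) \right]
\]

Notice that only the identities of the winner and the loser appear; the reward scores that produced the ranking are nowhere in the formula.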

Figure 1: Order-based method vs. value-based method.

Take a look at Figure 1 above.

  • Rewards (Blue line): This is the ground truth. Responses \(y_3\) (0.9) and \(y_2\) (0.85) are both good. Response \(y_1\) (0.1) is terrible.
  • Order Calibration (Green line): This represents methods like DPO. They only care that \(y_3 > y_2 > y_1\). Notice how the probability for \(y_2\) (the middle dot) is pushed too far down? It treats the gap between \(y_3\) and \(y_2\) (0.05 difference) similarly to the gap between \(y_2\) and \(y_1\) (0.75 difference). This results in misalignment.
  • Value Calibration (Orange line): This is the proposed VCB method. It ensures the probability distribution actually mirrors the reward distribution. Since \(y_2\) and \(y_3\) have similar scores, they get similar probabilities.
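
To make the contrast concrete, here is a toy Python sketch using the reward numbers from Figure 1. The value-based target below assumes a uniform reference policy and a temperature of 1 purely for illustration; it is not the paper’s exact construction.

```python
import math

# Toy rewards from Figure 1: y3 = 0.9, y2 = 0.85, y1 = 0.1
rewards = {"y1": 0.1, "y2": 0.85, "y3": 0.9}

# Order-based view: only the ranking survives.
ranking = sorted(rewards, key=rewards.get, reverse=True)
print("order-based target:", " > ".join(ranking))  # y3 > y2 > y1

# Value-based view (toy): probabilities proportional to exp(reward / beta),
# assuming a uniform reference policy and beta = 1 for illustration only.
beta = 1.0
weights = {y: math.exp(r / beta) for y, r in rewards.items()}
total = sum(weights.values())
probs = {y: round(w / total, 3) for y, w in weights.items()}
print("value-based target:", probs)  # y3 and y2 end up close together, y1 far below
```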

Unifying the Old Methods

Before we build the new method, we need to understand why the current methods ignore values. The authors provide a fascinating theoretical unification of RRHF, SLiC, and DPO.

It turns out that all of these methods stem from the same optimization problem:

Equation 3 optimization problem

Here, the goal is to maximize the expected reward \(r(x,y)\) plus a “Generalized Conditional Entropy” term \(H\).

If we solve this optimization problem in closed form, we get an optimal policy \(\pi_{opt}\) that looks like this:

Optimal policy equation

The Villain: The Partition Function (\(Z(x)\))

See that \(Z(x)\) in the denominator? That is the partition function. It guarantees that all probabilities sum to 1. Calculating \(Z(x)\) requires summing over every possible response the LLM could generate, which is computationally intractable.
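
For the KL-regularized case used later in the paper, \(Z(x)\) takes the familiar form below (I am writing \(\beta\) for the KL-regularization strength; the paper’s notation may differ). The sum runs over every possible response \(y\), which is why it cannot be computed directly:

\[
Z(x) = \sum_{y} \pi_{sft}(y|x)\, \exp\!\left( \frac{1}{\beta}\, r(x,y) \right)
\]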

To get around this, methods like DPO use a clever mathematical trick called reparameterization. They rearrange the equation to define the reward \(r(x,y)\) in terms of the optimal policy and \(Z(x)\):

Rearranged reward equation

Then, they plug this into a ranking loss function. For example, SLiC uses a margin loss:

SLiC reward loss

When you substitute the reparameterized reward into this loss, the \(Z(x)\) terms cancel out! (Because \(Z(x)\) depends only on the prompt, not the specific response \(y\)).

SLiC final loss

The Cost of Efficiency: By canceling out \(Z(x)\), these methods also effectively remove the absolute reward term \(r(x,y)\) from the equation. The loss function now only sees the model’s probabilities; the actual scores (e.g., 0.9 vs. 0.1) are gone. We are left with a clean, differentiable loss function, but we have lost any sense of how much better one response is than another.
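
Schematically, using the KL-regularized reward form (a sketch of the general pattern rather than the paper’s exact derivation), the cancellation looks like this:

\[
r(x,y_w) - r(x,y_l)
= \left( \beta \log \frac{\pi(y_w|x)}{\pi_{sft}(y_w|x)} + \beta \log Z(x) \right)
- \left( \beta \log \frac{\pi(y_l|x)}{\pi_{sft}(y_l|x)} + \beta \log Z(x) \right)
= \beta \log \frac{\pi(y_w|x)}{\pi_{sft}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{sft}(y_l|x)}
\]

Any loss built purely on this difference, whether SLiC’s margin loss or DPO’s logistic loss, never sees the absolute values \(r(x,y_w)\) and \(r(x,y_l)\); it only sees the model’s log-probability ratios.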

The Solution: Value-based CaliBration (VCB)

The researchers propose a method that eliminates the impossible-to-calculate \(Z(x)\) without deleting the reward values.

Step 1: A Better Entropy

First, they choose a specific form of entropy based on KL-divergence. This keeps the model from drifting too far from the base Supervised Fine-Tuned (SFT) model (a standard practice in alignment to prevent “reward hacking” or gibberish generation).

Entropy definition using KL divergence

This leads to a specific form of the optimal policy:

Optimal policy with KL entropy
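
In this standard KL-regularized setting, the optimal policy has the familiar closed form below, with the same partition function \(Z(x)\) as before (again, \(\beta\) is the KL coefficient and the paper’s notation may differ slightly):

\[
\pi_{opt}(y|x) = \frac{1}{Z(x)}\, \pi_{sft}(y|x)\, \exp\!\left( \frac{1}{\beta}\, r(x,y) \right)
\]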

Step 2: The Difference Method

Here is the key innovation. Instead of substituting variables to hide \(Z(x)\), the authors use a difference method.

They take the log of the optimal policy for two different responses, \(y_1\) and \(y_2\), and subtract one from the other.

Difference method derivation

When you subtract the equation for one response from the equation for the other, the \(Z(x)\) term, which is identical for both, simply vanishes.

Final difference equation
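
Written out in the KL-regularized form (again a sketch with my \(\beta\) convention, not necessarily the paper’s exact statement), subtracting the log of the optimal policy for \(y_2\) from that for \(y_1\) gives:

\[
\log \frac{\pi_{opt}(y_1|x)}{\pi_{sft}(y_1|x)} - \log \frac{\pi_{opt}(y_2|x)}{\pi_{sft}(y_2|x)} = \frac{1}{\beta}\left( r(x,y_1) - r(x,y_2) \right)
\]

The \(\log Z(x)\) terms on both sides are identical and cancel.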

Look at that equation above.

  • Left Side: The difference in “log-probability gaps” between the optimized model and the SFT model.
  • Right Side: The difference in actual rewards.

This implies that if \(y_1\) has a much higher reward than \(y_2\), the model’s probability for \(y_1\) should increase proportionally more.

Step 3: The VCB Loss Function

Based on this equality, the authors define the Value-based CaliBration (VCB) loss. It tries to minimize the squared error between the probability gap and the reward gap.

VCB Loss Function

Let’s break down the components of this loss function:

  1. \(\pi(y|x)\): The probability the model assigns to the response.
  2. \(\pi_{sft}(y|x)\): The probability the original base model assigned.
  3. \(r(x,y)\): The actual reward score from the reward model.
  4. \(\sigma_{sft}^r(x)\): A normalization term (standard deviation of rewards for this prompt). This is crucial because some prompts naturally have high variance in answer quality, while others don’t. Normalizing makes training stable.
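
Putting those pieces together, here is a minimal PyTorch-style sketch of the per-pair objective. The function name, signature, and exact scaling are my own simplification (for instance, I omit any \(\beta\)-like temperature), not code from the paper.

```python
import torch

def vcb_pair_loss(logp_y1, logp_y2,          # policy log-probs: log pi(y1|x), log pi(y2|x)
                  logp_sft_y1, logp_sft_y2,  # reference log-probs under the SFT model
                  r_y1, r_y2,                # reward-model scores for the two responses
                  reward_std):               # std of rewards for this prompt, sigma_sft^r(x)
    """Squared error between the probability gap and the (normalized) reward gap."""
    prob_gap = (logp_y1 - logp_sft_y1) - (logp_y2 - logp_sft_y2)
    reward_gap = (r_y1 - r_y2) / reward_std
    return (prob_gap - reward_gap) ** 2

# Tiny usage example with made-up numbers.
loss = vcb_pair_loss(
    logp_y1=torch.tensor(-12.0), logp_y2=torch.tensor(-15.0),
    logp_sft_y1=torch.tensor(-14.0), logp_sft_y2=torch.tensor(-14.5),
    r_y1=0.9, r_y2=0.1, reward_std=0.4,
)
print(loss)  # penalizes any mismatch between the probability gap and the reward gap
```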

Visualizing the Intuition

To understand what this loss function is actually doing geometrically, look at Figure 2:

Figure 2: Illustration of the probability and reward deltas.

  • \(\Delta^{\pi}_{y1}\): How much the model has “learned” (shifted probability) for response \(y_1\).
  • \(\Delta^{\pi}_{y2}\): How much the model has learned for response \(y_2\).
  • \(\Delta^r_{y1, y2}\): The actual quality difference (reward gap) between the two.

The loss forces the difference in learning (\(\Delta^{\pi}_{y1} - \Delta^{\pi}_{y2}\)) to equal the difference in quality (\(\Delta^r_{y1, y2}\)).

If \(y_1\) is massively better than \(y_2\), the model must drastically increase the probability of \(y_1\) relative to \(y_2\). If they are nearly the same quality, the probability shifts should be nearly identical.
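
In symbols, one way to read the quantities in Figure 2 (my paraphrase, with the normalization by \(\sigma_{sft}^r(x)\) left out for readability):

\[
\Delta^{\pi}_{y_i} = \log \pi(y_i|x) - \log \pi_{sft}(y_i|x), \qquad
\Delta^{r}_{y_1,y_2} = r(x,y_1) - r(x,y_2), \qquad
\text{target: } \Delta^{\pi}_{y_1} - \Delta^{\pi}_{y_2} = \Delta^{r}_{y_1,y_2}
\]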

The Training Pipeline

How does this look in practice? The pipeline (Figure 3) is a standard three-step process, very similar to other alignment workflows but with the specific VCB loss at the end.

Figure 3: The training pipeline of the proposed value-based calibration method.

  1. SFT & Reward Modeling: Train a standard SFT model on good data. Train a Reward Model on preference data (human rankings).
  2. Data Generation: Take a prompt \(x\), use the SFT model to generate \(n\) candidate responses (\(y_1...y_n\)), and score them all with the Reward Model.
  3. Value Calibration: Train the final policy using the VCB loss function on these scored responses.
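
For step 2, here is a rough sketch of what candidate generation and scoring could look like with Hugging Face transformers. The checkpoint names and the reward model’s interface (a single-logit sequence classifier) are illustrative assumptions, not details from the paper.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

SFT_NAME = "my-org/sft-model"      # hypothetical checkpoint names
RM_NAME = "my-org/reward-model"

tok = AutoTokenizer.from_pretrained(SFT_NAME)
sft = AutoModelForCausalLM.from_pretrained(SFT_NAME)
rm_tok = AutoTokenizer.from_pretrained(RM_NAME)
rm = AutoModelForSequenceClassification.from_pretrained(RM_NAME, num_labels=1)

def generate_and_score(prompt: str, n: int = 4):
    """Sample n candidate responses from the SFT model and score each with the RM."""
    inputs = tok(prompt, return_tensors="pt")
    out = sft.generate(**inputs, do_sample=True, num_return_sequences=n,
                       max_new_tokens=128)
    # Strip the prompt tokens, keep only the generated continuations.
    responses = tok.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
    scores = []
    with torch.no_grad():
        for resp in responses:
            rm_inputs = rm_tok(prompt, resp, return_tensors="pt", truncation=True)
            scores.append(rm(**rm_inputs).logits[0, 0].item())
    return list(zip(responses, scores))
```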

The final loss calculation iterates over the generated pairs using a LogSumExp trick (controlled by \(\lambda\)) to focus on “hard” examples—pairs where the model’s current ranking is most wrong compared to the reward values.

Final Loss calculation
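
As a rough sketch of the LogSumExp pattern (exactly where \(\lambda\) sits in the paper’s formula may differ from this guess), the aggregation acts as a smooth maximum over the per-pair losses:

```python
import torch

def smooth_max_aggregate(pair_losses: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """LogSumExp aggregation over per-pair losses.

    The gradient of this aggregate w.r.t. the pair losses is softmax(lam * pair_losses),
    so a larger lam concentrates the training signal on the hardest (highest-loss) pairs,
    while a smaller lam spreads it more evenly.
    """
    return torch.logsumexp(lam * pair_losses, dim=0) / lam

# Example: three pairs, one of which is badly mis-calibrated.
losses = torch.tensor([0.1, 0.2, 3.0])
print(smooth_max_aggregate(losses, lam=5.0))  # ~3.0: the hardest pair dominates
print(smooth_max_aggregate(losses, lam=0.5))  # gradient weight spread more evenly across pairs
```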

Experiments and Results

Does adding reward magnitude actually help? The researchers tested VCB on two standard benchmarks:

  1. Anthropic HH: A dialogue dataset (Helpful and Harmless).
  2. Reddit TL;DR: A summarization dataset.

They compared VCB against the heavy hitters: PPO, DPO, RRHF, SLiC, and IPO.

Win Rates (GPT-4 Evaluation)

They used GPT-4 as an impartial judge to compare VCB’s outputs against the baselines.

Figure 4: GPT-4 evaluation results.

The results in Figure 4 are stark:

  • Anthropic HH (Left): VCB wins against every baseline. Against DPO (the current state-of-the-art), VCB wins/ties in the majority of cases.
  • Reddit TL;DR (Right): The dominance is even clearer. VCB crushes the SFT baseline and maintains a solid lead over PPO and DPO.

Reward Model Evaluation

If the goal is to maximize reward, how did the models score according to the reward model itself?

Table 2: Reward model evaluation results.

Table 2 shows the “Win Rate” against baselines based on the reward model’s score. VCB consistently beats the other methods. For example, on the Reddit dataset, VCB achieves an 86.8% win rate against the SFT model, ahead of the order-based baselines reported alongside it.

The “Tax” of Alignment: KL Divergence

In alignment, there is always a trade-off. You can get higher rewards by “hacking” the metric (e.g., repeating words the reward model likes), but this destroys the fluency of the text. We measure this deviation using KL Divergence. You want high rewards with low KL.
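
As an aside on how such a KL number is typically estimated in practice (a generic Monte Carlo estimate, not a detail from the paper): sample responses from the trained policy and average the log-probability ratio against the SFT model.

```python
import torch

def kl_estimate(policy_logps: torch.Tensor, sft_logps: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(pi || pi_sft).

    Both tensors hold sequence-level log-probabilities, log pi(y|x) and log pi_sft(y|x),
    for responses y sampled from the trained policy pi.
    """
    return (policy_logps - sft_logps).mean()

# Made-up log-probabilities for four sampled responses.
policy_logps = torch.tensor([-31.2, -28.7, -35.0, -30.1])
sft_logps = torch.tensor([-33.0, -29.5, -36.2, -32.4])
print(kl_estimate(policy_logps, sft_logps))  # positive: the policy has drifted from the SFT model
```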

Figure 5: Expected reward vs. KL divergence.

Figure 5 plots Reward (Y-axis) vs. KL Divergence (X-axis).

  • Pink (PPO): Generates high rewards but at a very high KL cost (far right). The model is changing too much.
  • Green (DPO): Better, but still drifts significantly.
  • Orange (VCB): This is the sweet spot. VCB achieves rewards comparable to or better than PPO/DPO but stays much closer to the original SFT model (left side of the graph). This implies VCB aligns the model efficiently without breaking its natural language capabilities.

Out-of-Distribution Generalization

Finally, a truly robust model should work on data it hasn’t seen. They took the model trained on Reddit summaries and tested it on CNN/DailyMail news articles.

Table 5: Out-of-distribution experimental results.

As shown in Table 5, VCB generalizes well, beating DPO with a 54.3% win rate in this challenging Out-Of-Distribution (OOD) setting. This suggests that by learning the values rather than just the rankings, the model learns a more robust understanding of what makes a summary “good.”

Conclusion: Value Matters

The transition from PPO to methods like DPO was a massive leap forward for the simplicity of training LLMs. However, this paper argues convincingly that in our quest for simplicity, we threw out the baby with the bathwater: the reward values themselves.

Key Takeaways:

  1. Order isn’t everything: Knowing A > B is less useful than knowing A ≫ B.
  2. Reparameterization hides value: Traditional derivations for DPO/RRHF eliminate the reward term to solve the partition function problem.
  3. Difference methods work: VCB uses a difference-based loss to remove the partition function while preserving the absolute reward signal.
  4. Better Efficiency: VCB achieves high alignment scores with less deviation (KL divergence) from the base model.

As we continue to fine-tune larger models, efficiency and precision are paramount. VCB offers a mathematically sound way to squeeze more signal out of our reward models, ensuring that when an LLM is aligned, it doesn’t just know what is better—it knows how much better.