Introduction

In the rapidly evolving world of Large Language Models (LLMs), bigger isn’t always better—especially when it comes to the length of the model’s response. If you have interacted with modern chatbots, you might have noticed a peculiar habit: they love to ramble. Ask a simple question, and you often get a wall of text.

This phenomenon is known as verbosity. While we want models to be thorough, we don’t want them to confuse “long” with “correct.”

Currently, the industry standard for aligning models to human preferences is Direct Preference Optimization (DPO). It is elegant, stable, and generally effective. However, DPO has a hidden flaw: it suffers from an inherent algorithmic bias that encourages the model to cheat by generating longer responses to maximize its reward score.

In this post, we are doing a deep dive into a fascinating paper titled “Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence.” The authors identify exactly why DPO prefers verbosity and propose a clever, lightweight solution called SamPO (Sampled Preference Optimization) to fix it.

Background: The Road to DPO

To understand the solution, we first need to understand the problem. How do we teach LLMs what humans like?

From RLHF to DPO

Originally, we used Reinforcement Learning from Human Feedback (RLHF). This was a complex, multi-stage process:

  1. Train a Reward Model to mimic human preferences.
  2. Use that Reward Model to train the LLM using reinforcement learning (like PPO).

While effective, RLHF is computationally expensive and unstable. Enter DPO (Direct Preference Optimization). DPO simplified the game by removing the separate Reward Model entirely. Instead, it uses a mathematical trick to optimize the model directly on preference data (pairs of “chosen” and “rejected” responses).
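To make "preference data" concrete, here is a toy example of what a single chosen/rejected pair could look like (the field names are illustrative, not the schema of any particular dataset):

```python
# A toy preference record (illustrative field names, not a specific dataset schema).
preference_pair = {
    "prompt":   "What is the capital of France?",
    "chosen":   "Paris.",
    "rejected": "France is a country in Western Europe whose capital, "
                "as many people know, is the beautiful city of Paris.",
}
```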

The Implicit Reward

DPO doesn’t give the model a score like “8 out of 10.” Instead, it calculates an implicit reward. This reward is based on the probability of the model generating the “chosen” response compared to a reference model (usually the original base model).

The core of DPO relies on the Bradley-Terry model, which estimates the probability that one response is better than another. The loss function looks like this:

\[
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\bigl[\log \sigma(\Delta)\bigr]
\]

Here, \(\Delta\) represents the difference in “implicit rewards” between the chosen response (\(y_w\)) and the rejected response (\(y_l\)).

\[
\Delta = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\]

The larger this gap (\(\Delta\)), the more the model learns to prefer the chosen response.
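To make the implicit reward concrete, here is a minimal PyTorch sketch of the DPO loss, assuming we already have the summed log-probabilities of each response under the policy and the frozen reference model (the function and variable names are mine, not the paper's code):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss from sequence-level log-probabilities.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of a full response under the policy or the frozen
    reference model.
    """
    # Implicit rewards: how much more likely the policy makes a response
    # relative to the reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    delta = chosen_reward - rejected_reward   # the gap the text calls Delta
    return -F.logsigmoid(delta).mean()        # -E[log sigma(Delta)]
```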

The Core Problem: Algorithmic Length Bias

Here is where the paper makes its key contribution. Previous research assumed that models were verbose because the training data was biased (i.e., human annotators simply prefer longer answers). While data bias plays a role, the authors argue that DPO itself is mathematically biased toward length.

Decomposing the Reward

To see why, we have to look at how that implicit reward is actually calculated. The reward is essentially a KL-divergence-style term: per-token log-probability ratios between the policy and the reference model (a measure of how much the two distributions differ), summed across the entire sequence of text.

If we break down the equation to the token level, it looks like this:

\[
\Delta = \beta \sum_{t=1}^{T_w} \log \frac{\pi_\theta(y_w^t \mid x, y_w^{<t})}{\pi_{\text{ref}}(y_w^t \mid x, y_w^{<t})} \;-\; \beta \sum_{t=1}^{T_l} \log \frac{\pi_\theta(y_l^t \mid x, y_l^{<t})}{\pi_{\text{ref}}(y_l^t \mid x, y_l^{<t})}
\]

Notice the summation symbols (\(\Sigma\)). The reward is the sum of per-token log-probability ratios over every token, from \(t=1\) to the end of the response (\(T_w\) or \(T_l\)).

Here is the catch: if the chosen response (of length \(T_w\)) is significantly longer than the rejected response (of length \(T_l\)), the summation for the chosen response has more terms. It accumulates more value simply because it has more tokens.

  • Overestimation: If the chosen response is longer, DPO overestimates its quality because the reward sum is inflated by length.
  • Underestimation: If the chosen response is shorter (even if it’s concise and correct), DPO underestimates its reward.

This incentivizes the model to generate longer sequences to hack the reward function.
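A quick toy calculation (made-up numbers, not from the paper) shows how the sum rewards length on its own:

```python
import torch

# Toy illustration: both responses have roughly the same per-token
# log-ratio quality (~0.05), but the "chosen" one is twice as long.
torch.manual_seed(0)
chosen_ratios   = 0.05 + 0.01 * torch.randn(100)  # 100-token chosen response
rejected_ratios = 0.05 + 0.01 * torch.randn(50)   # 50-token rejected response

# Summing over all tokens, as vanilla DPO does, favors the longer sequence
# even though its per-token quality is essentially identical.
print(chosen_ratios.sum().item())    # ~5.0
print(rejected_ratios.sum().item())  # ~2.5 -> the gap comes from length alone
```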

Visualizing the Bias

The authors illustrate this beautifully in the figure below.

Figure 2: The inequality of pairwise responses forces DPO to overestimate or underestimate actual rewards.

  • Top (a): Look at the red and purple lines. These are the accumulated rewards. Because the “Chosen” sequence is longer, the red line keeps climbing, creating a large reward gap simply by virtue of being longer.
  • Bottom Left (b): If you try to normalize by averaging (DPO-SANorm), you lose the nuance of the token-level variance.
  • Bottom Right (c): This is the proposed solution (SamPO), which we will discuss next. Notice how the reward accumulation is more balanced.

The Solution: SamPO (Down-Sampled DPO)

To eliminate this length dependency, the authors introduce SamPO. The idea is intuitive: To compare two runners fairly, they must run the same distance.

If the chosen response is 100 tokens long and the rejected response is 50 tokens long, we shouldn’t compare a sum over 100 tokens against a sum over 50.

How SamPO Works

SamPO modifies the loss calculation by down-sampling the longer sequence.

  1. Identify the length of the chosen response (\(T_w\)) and the rejected response (\(T_l\)).
  2. Find the minimum length: \(T_m = \min(T_w, T_l)\).
  3. Randomly sample \(T_m\) tokens from both sequences (or just the longer one to match the shorter one).
  4. Calculate the implicit reward using only these sampled tokens.

The new formula looks like this:

\[
\Delta_{\text{SamPO}} = \beta \sum_{t \in \mathcal{S}_w} \log \frac{\pi_\theta(y_w^t \mid x, y_w^{<t})}{\pi_{\text{ref}}(y_w^t \mid x, y_w^{<t})} \;-\; \beta \sum_{t \in \mathcal{S}_l} \log \frac{\pi_\theta(y_l^t \mid x, y_l^{<t})}{\pi_{\text{ref}}(y_l^t \mid x, y_l^{<t})}
\]

where \(\mathcal{S}_w\) and \(\mathcal{S}_l\) are randomly sampled sets of token positions with \(|\mathcal{S}_w| = |\mathcal{S}_l| = T_m = \min(T_w, T_l)\).

By ensuring the number of terms in the summation is identical (\(T_m\)), the “length” variable is removed from the equation. The model can no longer increase its reward simply by adding more tokens; it must increase the quality (probability) of the tokens it generates.
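Here is one way the down-sampling could look in code: a minimal PyTorch sketch for a single preference pair, assuming the per-token log-probability ratios are already computed. It is a sketch of the idea, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def sampo_loss(chosen_token_ratios, rejected_token_ratios, beta=0.1):
    """SamPO-style loss for a single preference pair (sketch).

    The inputs are 1-D tensors of per-token log-probability ratios
    log(pi_theta / pi_ref) for the chosen and rejected responses,
    with lengths T_w and T_l respectively.
    """
    t_m = min(chosen_token_ratios.numel(), rejected_token_ratios.numel())

    # Randomly down-sample each sequence to T_m token positions (without
    # replacement), so both rewards sum over the same number of terms.
    idx_w = torch.randperm(chosen_token_ratios.numel())[:t_m]
    idx_l = torch.randperm(rejected_token_ratios.numel())[:t_m]

    chosen_reward = beta * chosen_token_ratios[idx_w].sum()
    rejected_reward = beta * rejected_token_ratios[idx_l].sum()

    return -F.logsigmoid(chosen_reward - rejected_reward)
```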

Top-K vs. Random Sampling

You might wonder: why choose random tokens? Why not choose the “best” tokens (Top-K) with the highest probabilities?

The authors actually tested this. Interestingly, random sampling works better. As shown below, Top-K sampling (Figure 6) creates an artificial gap that might be too aggressive, whereas random sampling preserves the natural distribution of the text while correcting for length.

Figure 6: Comparing Random K vs Top K down-sampling.
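For intuition, here is a hypothetical helper contrasting the two selection strategies; only the index-selection step differs, and the rest of the loss stays the same:

```python
import torch

def select_token_positions(token_ratios, k, strategy="random"):
    """Pick k token positions from a 1-D tensor of per-token log-ratios."""
    if strategy == "random":
        # Random-K: an unbiased subset, the choice that worked best.
        return torch.randperm(token_ratios.numel())[:k]
    if strategy == "topk":
        # Top-K: keep only the highest-ratio tokens, which exaggerates the gap.
        return torch.topk(token_ratios, k).indices
    raise ValueError(f"unknown strategy: {strategy}")
```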

Experiments and Results

Does SamPO actually work? The authors tested this across multiple models (Pythia, Llama3, Tulu2) and benchmarks (GSM8K, AlpacaEval, etc.).

1. Proving the Bias Exists

First, they performed a “sanity check” experiment. They split the UltraFeedback dataset into two subsets:

  • Long: Pairs where the chosen response is longer.
  • Short: Pairs where the chosen response is shorter.

They trained a standard DPO model on these subsets.

Figure 3: Trends of implicit reward when fine-tuning with different subsets.

  • The “Short” subset (Blue line): When DPO is trained on data where the winner is shorter, the rewards collapse (trend downward). DPO struggles to learn that “shorter is better” because its math fights against it.
  • The SamPO fix (Table 1): SamPO stabilizes this. Even on the “Short” dataset, SamPO achieves a decent average benchmark score (49.73), whereas standard DPO collapses (40.59).

2. Comparing Performance

The authors compared SamPO against DPO and other recent methods like SimPO (which uses length normalization) and TDPO.

The results, visualized in Figure 1, show a clear trend:

Figure 1: Down-Sampling strategy helps mitigate length reliance and improves DPO.

SamPO consistently outperforms standard DPO across benchmarks. For example:

  • GSM8K (Math): SamPO achieves 77.81% vs. DPO’s 75.59%.
  • AlpacaEval 2 (General Chat): SamPO achieves a win rate of 35.14% vs. DPO’s 23.2% (using length-controlled evaluation).

3. The Verbosity Analysis

This is the most critical result. Did the model stop rambling?

The graph below tracks the average output length (Y-axis) against the performance score (X-axis) over 3 epochs of training.

Figure 4: Policy model response length vs test performance over 3 epochs.

  • Pink Triangles (DPO): Notice the sharp upward trajectory. As training progresses, the model gets slightly better but much longer. It is learning to ramble.
  • Blue Squares (Iterative SamPO): The curve moves to the right (better performance) but stays relatively flat on the Y-axis. The model improves its quality without inflating its length.

4. Comprehensive Benchmarks

In a detailed comparison using the Llama3-8B model, SamPO demonstrated it could maintain the delicate balance between brevity and information.

Table 2: Qualitative results of fine-tuning two LLMs.

Looking at Table 2, we see that SimPO (another method for controlling length) often reduces length too much.

  • SimPO: Cuts length aggressively (average output of roughly 375 tokens) but sometimes hurts performance on reasoning tasks like GSM8K (72.93).
  • SamPO: Keeps a healthy length (average output of roughly 375-377 tokens) while achieving the highest GSM8K score (77.81).

This suggests that SamPO doesn’t just “cut” text; it encourages the model to be efficient.

5. Other Datasets (HH-RLHF and TL;DR)

The authors also tested on the HH-RLHF (Helpful & Harmless) and TL;DR (Summarization) datasets.

Table 3: Win Rate and Output Length on HH-RLHF and TL;DR.

On TL;DR (summarization), brevity is key.

  • DPO had a win rate of 60.98%.
  • Iterative SamPO jumped to 73.58%.

This shows that for tasks requiring concise answers, eliminating length bias is a clear advantage.

Conclusion

The “verbosity” problem in LLMs is not just a quirk of training data; it is a mathematical artifact of how we calculate rewards in DPO. By summing log-probabilities over unequal lengths, we inadvertently teach models that “more is better.”

SamPO provides an elegant, code-level fix. By down-sampling the token sequences to equal lengths during the loss calculation, it forces the model to compete on quality rather than quantity.

The implications are significant:

  1. Efficiency: Shorter, correct answers save computational costs during inference.
  2. User Experience: Users get direct answers rather than rambling essays.
  3. Alignment: We can align models to true human preferences, not just surface-level features like length.

As we move toward more autonomous and efficient AI agents, methods like SamPO will be essential in ensuring our models speak not just more, but better.