The training of Large Language Models (LLMs) has evolved into a sophisticated three-stage pipeline: Pretraining (learning the language), Supervised Fine-Tuning (learning the task), and Reinforcement Learning from Human Feedback (RLHF). While the first two stages build capability, the third stage, RLHF, is arguably the most critical for safety and utility. It aligns the model with human values, ensuring the AI is helpful rather than harmful.
Recently, Direct Preference Optimization (DPO) has emerged as a popular alternative to traditional RLHF methods like PPO. DPO simplifies the process by treating alignment as a classification problem. But like any classification setup, it is vulnerable to noisy labels: humans don’t always agree on which response is better, which introduces inconsistencies into the training data.
The standard fix for this is Label Smoothing—a technique where we tell the model, “Don’t be 100% sure about this label; be 90% sure.” But here lies the problem: How do we choose that smoothing parameter? Historically, researchers have just guessed a number or tuned it heuristically.
In this post, we will dive deep into a new research paper, Enhancing Language Model Alignment: A Confidence-Based Approach to Label Smoothing, which proposes a principled, mathematically grounded method called Confidence Aware Label Smoothing (CALS). We will explore how CALS dynamically adjusts smoothing based on how confident we should be in the data, leading to better-aligned models.
The Background: RLHF and the Need for Smoothing
To understand CALS, we first need to revisit how we align models using preference data. In a typical setup, we have a dataset of prompts (\(x\)) and pairs of responses (\(y^+\) and \(y^-\)), where \(y^+\) is preferred by a human over \(y^-\).
The Bradley-Terry Model
Underlying almost all modern alignment work is the Bradley-Terry model. It assumes that there exists a “true” reward function \(r^*(x, y)\) that reflects human preference. The probability that \(y_1\) is preferred over \(y_2\) is modeled as a sigmoid of the difference in their rewards:

\[
P(y_1 \succ y_2 \mid x) = \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big) = \frac{\exp\big(r^*(x, y_1)\big)}{\exp\big(r^*(x, y_1)\big) + \exp\big(r^*(x, y_2)\big)}
\]
In traditional RLHF, we would train a reward model \(r_\phi\) to approximate \(r^*\) by minimizing the negative log-likelihood of the observed preferences under this probability:

\[
\mathcal{L}_R(\phi) = -\,\mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}}\Big[ \log \sigma\big( r_\phi(x, y^+) - r_\phi(x, y^-) \big) \Big]
\]
Once the reward model is trained, we would use Reinforcement Learning (PPO) to train a policy \(\pi_\theta\) to maximize that reward, while staying close to the original model (\(\pi_{sft}\)) to prevent “reward hacking” or mode collapse:

\[
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{sft}(\cdot \mid x) \big)
\]
Enter DPO: Alignment as Classification
Direct Preference Optimization (DPO) revolutionized this field by showing that you don’t need a separate reward model. You can optimize the policy directly using a binary cross-entropy (BCE) loss. DPO effectively re-parameterizes the reward in terms of the optimal policy:

\[
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{sft}(y \mid x)} + \beta \log Z(x)
\]
Because the intractable partition function \(Z(x)\) cancels when we take the difference of two rewards for the same prompt, plugging this into the Bradley-Terry probability lets us train the LLM by simply minimizing a classification loss on preference pairs:

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{sft}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{sft}(y^- \mid x)} \right) \right]
\]
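To make the classification view concrete, here is a minimal PyTorch-style sketch of this loss, assuming we already have the summed log-probabilities of each response under the trainable policy and the frozen SFT reference model (the tensor names and the value of `beta` are illustrative, not taken from the paper):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Vanilla DPO loss for a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities log pi(y|x) for the
    preferred (w) and dispreferred (l) responses, under the policy and the reference.
    """
    # Implicit reward margin: beta * (winner log-ratio minus loser log-ratio)
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Binary cross-entropy with a hard label of 1.0 on the preferred response
    return -F.logsigmoid(margin).mean()
```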
The Problem with Hard Labels
In the equations above, the training process assumes “hard labels.” If a human picks Response A over Response B, the model treats Response A as the absolute correct answer (probability 1.0).
But human preferences are noisy. If two annotators look at the same pair, they might disagree. Or the two responses might be nearly identical in quality. Treating these preferences as absolute facts can cause the model to overfit to noise.
Label smoothing addresses this. Instead of targeting a probability of 1.0 for the winner and 0.0 for the loser, we might target \(1 - \alpha\) and \(\alpha\). This prevents the model from becoming overconfident.
The standard Binary Cross-Entropy (BCE) loss with label smoothing looks like this, where \(\hat{p}_\theta\) is the model’s predicted probability that \(y^+\) is preferred (for DPO, the sigmoid of the scaled log-ratio margin above):

\[
\mathcal{L}_{\mathrm{LS}} = -(1 - \alpha)\, \log \hat{p}_\theta \;-\; \alpha\, \log\big(1 - \hat{p}_\theta\big)
\]
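In code, only the target changes. Extending the sketch above (again with illustrative names, and using \(\log(1 - \sigma(m)) = \log \sigma(-m)\)):

```python
import torch.nn.functional as F

def dpo_loss_smoothed(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, alpha=0.1):
    """DPO loss with constant label smoothing: target (1 - alpha) on the winner."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -(1 - alpha) * log p_hat  -  alpha * log (1 - p_hat)
    return (-(1 - alpha) * F.logsigmoid(margin)
            - alpha * F.logsigmoid(-margin)).mean()
```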
The critical question the researchers asked is: Why use a constant \(\alpha\)? If a preference is obvious (e.g., a safe response vs. a toxic one), we should trust the label (low \(\alpha\)). If the preference is ambiguous, we should smooth it significantly (high \(\alpha\)).
Theoretical Foundation: Optimizing Gradient Estimation
The authors tackled this problem by analyzing the gradients. When we train a model, we are estimating the gradient of the loss. The label smoothing parameter \(\alpha\) controls a trade-off:
- Bias: How far is our estimated gradient from the “true” gradient (the one we’d get if we knew the perfect ground-truth probability of preference)?
- Variance: How much does the gradient jump around due to data noise?
If \(\alpha = 0\), we have an unbiased estimator, but potentially high variance. If \(\alpha = 0.5\), the target no longer depends on the label at all, so the variance caused by label noise vanishes, but the bias is massive.
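To see the tradeoff concretely, consider a toy scalar proxy (a simplification for illustration, not the paper’s full gradient analysis): treat the smoothed target \(t = (1-\alpha)z + \alpha(1-z)\), built from the noisy label \(z \sim \mathrm{Bernoulli}(q)\), as an estimate of the true confidence \(q\). Its bias and variance have closed forms and behave exactly as described above:

```python
import numpy as np

def bias_variance(q, alpha):
    """Bias and variance of the smoothed target t = (1 - alpha) * z + alpha * (1 - z),
    where z ~ Bernoulli(q) is the noisy label and q is the true confidence."""
    bias = alpha * (1 - 2 * q)                  # E[t] - q
    var = (1 - 2 * alpha) ** 2 * q * (1 - q)    # Var(t)
    return bias, var

q = 0.7  # suppose 70% of annotators would genuinely prefer the "winner"
for alpha in (0.0, 0.1, 0.3, 0.5):
    b, v = bias_variance(q, alpha)
    print(f"alpha={alpha:.1f}  bias={b:+.3f}  var={v:.3f}  mse={b**2 + v:.3f}")
# alpha = 0.0: unbiased, but the largest variance.
# alpha = 0.5: zero variance, but the largest bias.
```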
Theorem 3.1: The Optimal \(\alpha\)
The researchers proved a theorem characterizing the optimal \(\alpha\) that minimizes the expected error of the gradient estimate. The result is fascinating: the optimal smoothing parameter is not a constant. It depends on \(q\), which represents \(p^*(w)\)—the true confidence of the preference label.

Depending on how you measure distance (the metric), the optimal \(\alpha\) changes, but it always relies on the underlying confidence.
For the \(\ell_0\) metric (minimizing the chance that the gradient points in the wrong direction), the optimal \(\alpha\) is:

For the \(\ell_2\) metric (minimizing the Euclidean distance error), the optimal \(\alpha\) is:

This relationship is visualized below. The x-axis (\(q\)) is the confidence. If \(q\) is 0 or 1 (we are certain about the preference), the optimal smoothing (\(\alpha^*\)) is 0. If \(q\) is 0.5 (total ambiguity), the optimal smoothing is 0.5 (maximum smoothing).

This proves theoretically that label smoothing should be confidence-aware.
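The paper derives the exact \(\ell_0\) and \(\ell_2\) expressions; as a sanity check on the shape, the same conclusion falls out of the toy scalar proxy above (an illustrative analogue, not Theorem 3.1 itself). Minimizing the mean squared error \(\mathrm{bias}^2 + \mathrm{variance}\) over \(\alpha\) gives

\[
\frac{d}{d\alpha}\Big[ \alpha^2 (1 - 2q)^2 + (1 - 2\alpha)^2\, q(1 - q) \Big] = 0
\quad\Longrightarrow\quad
\alpha^\star = \frac{2q(1-q)}{(1 - 2q)^2 + 4q(1-q)} = 2q(1-q),
\]

since \((1 - 2q)^2 + 4q(1-q) = 1\). This simple curve is zero at \(q \in \{0, 1\}\) and reaches its maximum of \(0.5\) at \(q = 0.5\), matching the shape described above.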
The Method: Confidence Aware Label Smoothing (CALS)
There is a catch. The theorem says optimal smoothing depends on \(p^*(w)\)—the true probability that response A is better than B. We don’t know this probability. We only have the noisy label \(z \in \{0, 1\}\).
To solve this, the authors introduce Confidence Aware Label Smoothing (CALS). The core idea is to estimate the confidence iteratively as the model trains.
Concept: Calibrating on the Fly
CALS relies on the concept of calibration. If a model predicts an event with probability 0.8, that event should ideally occur 80% of the time.
The authors propose defining the smoothing parameter \(\tilde{\alpha}\) based on the model’s own prediction confidence. Specifically, they look at how often the model’s prediction aligns with the human label.

In simpler terms:
- Group data points where the model predicts the win probability is roughly \(x\) (e.g., 0.7).
- Look at the actual labels for those points. Did the preferred response actually win 70% of the time?
- If the model is overconfident (it predicts 0.9 but the labels agree only 60% of the time), we increase the smoothing parameter to dampen the signal, as the sketch below illustrates.
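Here is a minimal sketch of that calibration check on a batch of preference pairs (the array names are assumed for illustration; this is not the paper’s implementation): bin the model’s predicted win probabilities, then compare each bin’s average prediction to how often the human label actually agreed.

```python
import numpy as np

def per_bin_agreement(pred_win_prob, labels, num_bins=10):
    """Compare predicted win probabilities to empirical label agreement, per bin.

    pred_win_prob: model probabilities that the human-preferred response wins.
    labels:        1/0 indicators of whether the human label agreed
                   (1 = the response treated as the winner really was preferred).
    """
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    bin_idx = np.clip(np.digitize(pred_win_prob, edges) - 1, 0, num_bins - 1)
    report = []
    for b in range(num_bins):
        mask = bin_idx == b
        if mask.any():
            report.append((pred_win_prob[mask].mean(),  # how confident the model is here
                           labels[mask].mean()))        # how often the label agrees
    return report

# A bin where the mean prediction is far above the mean agreement is an
# overconfident region: CALS applies more smoothing to examples landing there.
```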
The Algorithm
The practical implementation of CALS involves discretizing the probability space into bins (buckets).

During training, the algorithm maintains an estimate of the “correctness” for each bin. It updates these estimates on the fly using a moving average.

This creates a dynamic feedback loop. The smoothing parameter for a specific data point is determined by which bin the model’s current prediction falls into.
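Below is a minimal sketch of how such a running, binned estimate could be maintained during training. The bin count, EMA momentum, and the mapping from estimated bin confidence to the per-example \(\tilde{\alpha}\) are illustrative assumptions, not the paper’s exact recipe.

```python
import torch

class CALSBins:
    """Running per-bin confidence estimates used to pick a per-example smoothing alpha."""

    def __init__(self, num_bins=10, momentum=0.99, init_conf=0.9):
        self.edges = torch.linspace(0.0, 1.0, num_bins + 1)
        self.conf = torch.full((num_bins,), init_conf)  # estimated agreement per bin
        self.momentum = momentum

    def update_and_get_alpha(self, pred_win_prob):
        """pred_win_prob: 1-D tensor of sigmoid(beta * margin) for the current batch.

        In DPO every pair is labelled with the preferred response as the winner, so
        'agreement' here means the model also ranks that response first.
        """
        bins = torch.clamp(torch.bucketize(pred_win_prob, self.edges) - 1,
                           0, len(self.conf) - 1)
        agree = (pred_win_prob > 0.5).float()
        for b in bins.unique():
            mask = bins == b
            # Exponential moving average of per-bin agreement with the labels
            self.conf[b] = (self.momentum * self.conf[b]
                            + (1 - self.momentum) * agree[mask].mean())
        # One plausible mapping: smooth more where estimated confidence is low,
        # capped at 0.5 (maximum smoothing) so labels are never flipped.
        alpha = torch.clamp(1.0 - self.conf[bins], 0.0, 0.5)
        return alpha  # per-example alpha, usable in the smoothed DPO loss above
```

Because the smoothing term is applied elementwise, the returned per-example tensor plugs directly into the smoothed loss sketch from earlier in place of the constant `alpha`.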

Visualizing the Dynamics
This dynamic adjustment changes how the model updates its weights.
In standard DPO (Figure 1a below), the update speed is driven only by the model’s current error. In CALS (Figure 1b below), a second factor enters: the estimated confidence of the label.
- Green Star (Fast): The model is incorrect (needs update) AND the label is high-confidence. Update Fast.
- Red Star (Slow): The label is low-confidence (ambiguous). Even if the model is “wrong,” we shouldn’t update aggressively because the ground truth is shaky. Update Slow.

Equilibrium Analysis
The authors also analyzed what happens when training converges (equilibrium). Because CALS smooths labels based on confidence, the model settles into a conservative equilibrium.

As shown in the figure below, the learned probability \(\tilde{p}(w)\) (y-axis) is slightly squashed towards 0.5 compared to the true probability \(p^*(w)\) (x-axis). This conservatism is a desirable property in safety alignment, as it prevents the model from becoming overconfident about ambiguous preferences.

Experimental Results
The theory sounds solid, but does it work? The authors tested CALS on both a controlled logistic regression task and large-scale LLM alignment.
Logistic Regression (Sanity Check)
First, they tested the method on a high-dimensional logistic regression problem where the ground truth was known. They compared three methods:
- MLE: Standard Maximum Likelihood (no smoothing).
- MLE-CALS-2: CALS using the \(\ell_2\)-optimal strategy.
- MLE-CALS-0: CALS using the \(\ell_0\)-optimal strategy.

The results (Figure 4) show that MLE-CALS-0 (the purple dashed line) consistently achieves the lowest test loss across different dimensions (\(d\)) and training set sizes. This validates the theory that adaptive smoothing reduces estimation error.
LLM Alignment: Zephyr-7B and StarChat2-15B
The real test was on open-ended text generation. They finetuned Zephyr-7B and StarChat2-15B using the UltraFeedback dataset. They compared standard DPO against DPO equipped with CALS.
Evaluation Metrics: To judge the quality of the responses, they used two powerful judges:
- GPT-4: Prompted to pick the better response (Win Rate).
- ArmoRM: A state-of-the-art Reward Model from the RewardBench leaderboard.
The Results:
On GPT-4 evaluations (Figure 5), CALS outperformed the baseline DPO on both models. The “Win” bars (dark blue) are consistently larger than the “Lose” bars.

The results were even more pronounced with the ArmoRM evaluator (Figure 6), showing a clear preference for the CALS-trained models.

Robustness to Initialization
One might wonder: “Does CALS only work because it starts with a good smoothing parameter?” The authors tested this by initializing the training with different baseline smoothing values (\(\alpha \in \{0.8, 0.9, 1.0\}\)).
Regardless of the starting point, CALS consistently improved performance. This suggests the method is robust and capable of correcting suboptimal hyperparameter choices on its own.

Conclusion and Implications
The “Confidence Aware Label Smoothing” paper highlights a crucial nuance in training LLMs: not all training data is created equal.
By treating every preference label as equally certain, standard DPO leaves performance on the table. CALS provides a mechanism to “listen” to the data more carefully. When the model detects ambiguity (via calibration), it smooths the label, effectively telling the optimizer to be cautious. When the signal is clear, it sharpens the label, allowing for faster learning.
Key Takeaways for Students:
- Gradients Matter: Understanding the bias-variance tradeoff in your gradient estimator can lead to better loss functions.
- Adaptive is Better: Heuristic constants (like fixed label smoothing) are rarely optimal. Making parameters data-dependent often yields gains.
- Calibration: The alignment between a model’s predicted probability and empirical accuracy is a powerful signal that can be used to stabilize training.
As LLMs continue to scale, techniques like CALS that improve data efficiency and alignment stability will become increasingly important. It moves us away from “magic numbers” in hyperparameters toward principled, self-adjusting training dynamics.