The training of Large Language Models (LLMs) has evolved into a sophisticated three-stage pipeline: Pretraining (learning the language), Supervised Fine-Tuning (learning the task), and Reinforcement Learning from Human Feedback (RLHF). While the first two stages build capability, the third stage, RLHF, is arguably the most critical for safety and utility: it aligns the model with human values, ensuring the AI is helpful rather than harmful.

Recently, Direct Preference Optimization (DPO) has emerged as a popular alternative to traditional RLHF methods like PPO. DPO simplifies the process by treating alignment as a classification problem. However, like many classification setups, DPO is sensitive to noisy labels: humans don’t always agree on which response is better, which introduces inconsistencies into the training data.

The standard fix for this is Label Smoothing—a technique where we tell the model, “Don’t be 100% sure about this label; be 90% sure.” But here lies the problem: How do we choose that smoothing parameter? Historically, researchers have just guessed a number or tuned it heuristically.

In this post, we will dive deep into a new research paper, Enhancing Language Model Alignment: A Confidence-Based Approach to Label Smoothing, which proposes a principled, mathematically grounded method called Confidence Aware Label Smoothing (CALS). We will explore how CALS dynamically adjusts smoothing based on how confident we should be in the data, leading to better-aligned models.

The Background: RLHF and the Need for Smoothing

To understand CALS, we first need to revisit how we align models using preference data. In a typical setup, we have a dataset of prompts (\(x\)) and pairs of responses (\(y^+\) and \(y^-\)), where \(y^+\) is preferred by a human over \(y^-\).

The Bradley-Terry Model

Underlying almost all modern alignment work is the Bradley-Terry model. It assumes that there exists a “true” reward function \(r^*(x, y)\) that reflects human preference. The probability that \(y_1\) is preferred over \(y_2\) is modeled as a sigmoid function of the difference in their rewards.

\[ p^*(y_1 \succ y_2 \mid x) = \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big) \]

In traditional RLHF, we would train a reward model to approximate \(r^*\) by minimizing a loss function based on this probability.

\[ \mathcal{L}_R(r_\phi) = -\,\mathbb{E}_{(x,\, y^+,\, y^-) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y^+) - r_\phi(x, y^-)\big)\Big] \]
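
As a minimal illustration (not the paper's code), this loss is one line in PyTorch; `reward_chosen` and `reward_rejected` are assumed to be the scalar rewards the model assigns to \(y^+\) and \(y^-\) for the same prompt:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: -log sigma(r(x, y+) - r(x, y-))."""
    # logsigmoid is numerically safer than log(sigmoid(...)).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```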

Once the reward model is trained, we would use Reinforcement Learning (PPO) to train a policy \(\pi_\theta\) to maximize that reward, while staying close to the original model (\(\pi_{sft}\)) to prevent “reward hacking” or mode collapse.

\[ \max_{\pi_\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot \mid x) \,\|\, \pi_{sft}(\cdot \mid x)\big] \]

Enter DPO: Alignment as Classification

Direct Preference Optimization (DPO) revolutionized this field by showing that you don’t need a separate reward model. You can optimize the policy directly using a binary cross-entropy (BCE) loss. DPO effectively re-parameterizes the reward in terms of the optimal policy:

\[ r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{sft}(y \mid x)} + \beta \log Z(x), \]

where \(\pi^*\) is the policy that is optimal for \(r\) and \(Z(x)\) is a partition function that cancels when comparing two responses to the same prompt.

By plugging this into the loss function, DPO allows us to train the LLM by simply minimizing a classification loss on preference pairs.

\[ \mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{sft}) = -\,\mathbb{E}_{(x,\, y^+,\, y^-) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{sft}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{sft}(y^- \mid x)}\right)\right] \]
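
Here is a minimal sketch of that classification loss in PyTorch (an illustration under simplifying assumptions, not the authors' implementation); `policy_logps_*` and `ref_logps_*` stand for the summed log-probabilities of the chosen and rejected responses under the policy \(\pi_\theta\) and the frozen reference \(\pi_{sft}\):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen: torch.Tensor, policy_logps_rejected: torch.Tensor,
             ref_logps_chosen: torch.Tensor, ref_logps_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss with hard labels (probability 1.0 on the winner)."""
    # Implicit reward margin: beta * (winner's log-ratio minus loser's log-ratio).
    logits = beta * ((policy_logps_chosen - ref_logps_chosen)
                     - (policy_logps_rejected - ref_logps_rejected))
    return -F.logsigmoid(logits).mean()
```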

The Problem with Hard Labels

In the equations above, the training process assumes “hard labels.” If a human picks Response A over Response B, the model treats Response A as the absolute correct answer (probability 1.0).

But human preferences are noisy. If two annotators look at the same pair, they might disagree. Or the two responses might be nearly identical in quality. Treating these preferences as absolute facts can cause the model to overfit to noise.

Label smoothing addresses this. Instead of targeting a probability of 1.0 for the winner and 0.0 for the loser, we might target \(1 - \alpha\) and \(\alpha\). This prevents the model from becoming overconfident.

The standard Binary Cross-Entropy (BCE) loss with label smoothing looks like this:

\[ \mathcal{L}_{\mathrm{LS}} = -\,\mathbb{E}\Big[(1 - \alpha)\,\log \hat{p}_\theta(y^+ \succ y^- \mid x) \;+\; \alpha\,\log\big(1 - \hat{p}_\theta(y^+ \succ y^- \mid x)\big)\Big], \]

where \(\hat{p}_\theta(y^+ \succ y^- \mid x)\) is the model's predicted probability that \(y^+\) beats \(y^-\).
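
As a sketch, the change from the standard DPO loss above is one line; `logits` is the same implicit reward margin as in the DPO snippet earlier, and `alpha` is the fixed smoothing constant the paper argues against:

```python
import torch
import torch.nn.functional as F

def dpo_loss_label_smoothed(logits: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Label-smoothed preference loss: target (1 - alpha) on the winner, alpha on the loser."""
    # F.logsigmoid(logits) = log p_hat,  F.logsigmoid(-logits) = log(1 - p_hat)
    return -((1 - alpha) * F.logsigmoid(logits)
             + alpha * F.logsigmoid(-logits)).mean()
```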

The critical question the researchers asked is: Why use a constant \(\alpha\)? If a preference is obvious (e.g., a safe response vs. a toxic one), we should trust the label (low \(\alpha\)). If the preference is ambiguous, we should smooth it significantly (high \(\alpha\)).

Theoretical Foundation: Optimizing Gradient Estimation

The authors tackled this problem by analyzing the gradients. When we train a model, we are estimating the gradient of the loss. The label smoothing parameter \(\alpha\) controls a trade-off:

  1. Bias: How far is our estimated gradient from the “true” gradient (the one we’d get if we knew the perfect ground-truth probability of preference)?
  2. Variance: How much does the gradient jump around due to data noise?

If \(\alpha = 0\), we have an unbiased estimator but potentially high variance. If \(\alpha = 0.5\), we have zero variance (the smoothed target no longer depends on the observed label) but massive bias.
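
To see where a confidence-dependent optimum comes from, here is a deliberately simplified calculation (an illustration, not the paper's derivation): treat the smoothed target \(t = (1 - \alpha)z + \alpha(1 - z)\) as an estimator of the true preference probability \(q\), where \(z \sim \mathrm{Bernoulli}(q)\) is the observed label. Then

\[ \mathrm{Bias}(t) = \alpha\,(1 - 2q), \qquad \mathrm{Var}(t) = (1 - 2\alpha)^2\, q\,(1 - q), \]

so the mean squared error is \(\alpha^2 (1 - 2q)^2 + (1 - 2\alpha)^2 q(1 - q)\). Minimizing over \(\alpha\) gives \(\alpha^{\star} = 2q(1 - q)\) in this toy setting: no smoothing when \(q\) is 0 or 1, and maximal smoothing (0.5) when \(q = 0.5\). Theorem 3.1 below carries out the analogous analysis for the actual gradient estimate.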

Theorem 3.1: The Optimal \(\alpha\)

The researchers proved a theorem characterizing the optimal \(\alpha\) that minimizes the expected error of the gradient estimate. The result is fascinating: the optimal smoothing parameter is not a constant. It depends on \(q\), which represents \(p^*(w)\)—the true confidence of the preference label.

The equation defining the optimal alpha based on distance metrics.

Depending on how you measure distance (the metric), the optimal \(\alpha\) changes, but it always relies on the underlying confidence.

For the \(\ell_0\) metric (minimizing the chance that the gradient points in the wrong direction), the optimal \(\alpha\) is:

Optimal alpha for L0 metric.

For the \(\ell_2\) metric (minimizing the Euclidean distance error), the optimal \(\alpha\) is:

Optimal alpha for L2 metric.

This relationship is visualized below. The x-axis (\(q\)) is the confidence. If \(q\) is 0 or 1 (we are certain about the preference), the optimal smoothing (\(\alpha^*\)) is 0. If \(q\) is 0.5 (total ambiguity), the optimal smoothing is 0.5 (maximum smoothing).

Graph showing optimal label smoothing parameter as a function of confidence q.

This proves theoretically that label smoothing should be confidence-aware.

The Method: Confidence Aware Label Smoothing (CALS)

There is a catch. The theorem says optimal smoothing depends on \(p^*(w)\)—the true probability that response A is better than B. We don’t know this probability. We only have the noisy label \(z \in \{0, 1\}\).

To solve this, the authors introduce Confidence Aware Label Smoothing (CALS). The core idea is to estimate the confidence iteratively as the model trains.

Concept: Calibrating on the Fly

CALS relies on the concept of calibration. If a model predicts an event with probability 0.8, that event should ideally occur 80% of the time.

The authors propose defining the smoothing parameter \(\tilde{\alpha}\) based on the model’s own prediction confidence. Specifically, they look at how often the model’s prediction aligns with the human label.

Definition of alpha tilde based on conditional probability.

In simpler terms:

  1. Group data points where the model predicts the win probability is roughly \(x\) (e.g., 0.7).
  2. Look at the actual labels for those points. Did the preferred response actually win 70% of the time?
  3. If the model is overconfident (it predicts 0.9 but the labels agree only 60% of the time), we increase the smoothing parameter to dampen the signal, as sketched in the code below.
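
A minimal sketch of steps 1 and 2 (a hypothetical helper, not the authors' implementation): `pred_probs` holds the model's predicted win probabilities for the first response of each stored pair, and `labels` is 1 when that response is the human-annotated winner.

```python
import torch

def empirical_win_rate_per_bin(pred_probs: torch.Tensor,
                               labels: torch.Tensor,
                               n_bins: int = 10) -> torch.Tensor:
    """For each predicted-probability bin, how often did the labeled winner
    actually win?  Comparing this to the bin's nominal probability reveals
    over- or under-confidence."""
    bin_ids = torch.clamp((pred_probs * n_bins).long(), max=n_bins - 1)
    win_rate = torch.full((n_bins,), float("nan"))     # NaN marks empty bins
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            win_rate[k] = labels[mask].float().mean()  # empirical frequency of "label = 1"
    return win_rate
```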

The Algorithm

The practical implementation of CALS involves discretizing the probability space into bins (buckets).

The binning strategy for probability.

During training, the algorithm maintains an estimate of the “correctness” for each bin. It updates these estimates on the fly using a moving average.

The update rule for alpha hat k.

This creates a dynamic feedback loop. The smoothing parameter for a specific data point is determined by which bin the model’s current prediction falls into.

The definition of the CALS loss function.
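
Below is a hedged sketch of how this feedback loop might look in code; it is not the authors' implementation. The pair ordering convention follows the calibration sketch above (`labels` is 1 when the first response of a stored pair is the annotated winner, and `logits` is the model's implicit reward margin for that response over the second). The EMA rate `momentum`, the uniform binning, and the `min(q, 1 - q)` mapping from estimated confidence to smoothing are illustrative stand-ins for the paper's update rule and derived \(\alpha^*(q)\).

```python
import torch
import torch.nn.functional as F

class ConfidenceAwareSmoothing:
    """Per-bin moving-average estimate of label confidence, used to pick
    a per-example smoothing parameter for a DPO-style preference loss."""

    def __init__(self, n_bins: int = 10, momentum: float = 0.9):
        self.n_bins = n_bins
        self.momentum = momentum
        # Running estimate of the empirical win rate q_hat in each probability bin.
        self.q_hat = torch.full((n_bins,), 0.5)

    def _bin(self, pred_probs: torch.Tensor) -> torch.Tensor:
        # Map predicted probabilities in [0, 1] to bin indices 0 .. n_bins - 1.
        return torch.clamp((pred_probs * self.n_bins).long(), max=self.n_bins - 1)

    def update(self, pred_probs: torch.Tensor, labels: torch.Tensor) -> None:
        """Moving-average update of each bin's empirical win rate from observed labels."""
        bin_ids = self._bin(pred_probs)
        for k in bin_ids.unique():
            batch_rate = labels[bin_ids == k].float().mean()
            self.q_hat[k] = self.momentum * self.q_hat[k] + (1 - self.momentum) * batch_rate

    def loss(self, logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        """Smoothed preference loss where alpha is looked up from the current bin."""
        pred_probs = torch.sigmoid(logits).detach()    # don't backprop through the binning
        q = self.q_hat[self._bin(pred_probs)]
        alpha = torch.minimum(q, 1 - q)                # illustrative alpha(q); the paper derives its own
        target = labels.float() * (1 - alpha) + (1 - labels.float()) * alpha
        return F.binary_cross_entropy_with_logits(logits, target)
```

In a training loop, `update` would be called on each batch so the per-bin estimates track the current model, and `loss` replaces the fixed-\(\alpha\) smoothed loss.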

Visualizing the Dynamics

This dynamic adjustment changes how the model updates its weights.

In standard DPO (Figure 1a below), the update speed is driven only by the model’s current error. In CALS (Figure 1b below), we add a new dimension.

  • Green Star (Fast): The model is incorrect (needs update) AND the label is high-confidence. Update Fast.
  • Red Star (Slow): The label is low-confidence (ambiguous). Even if the model is “wrong,” we shouldn’t update aggressively because the ground truth is shaky. Update Slow.

Comparison of update dynamics between DPO and CALS. Detailed schematic of CALS update dynamics based on confidence and correctness.

Equilibrium Analysis

The authors also analyzed what happens when training converges (equilibrium). Because CALS smooths labels based on confidence, the model settles into a conservative equilibrium.

The equilibrium equation for CALS.

As shown in the figure below, the learned probability \(\tilde{p}(w)\) (y-axis) is slightly squashed towards 0.5 compared to the true probability \(p^*(w)\) (x-axis). This conservatism is a desirable property in safety alignment, as it prevents the model from becoming overconfident about ambiguous preferences.

Graph of equilibrium probability vs true probability showing conservative behavior.
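
The squashing itself is easy to see for a fixed smoothing level (a simplified illustration, not the paper's equilibrium analysis): the minimizer of the expected smoothed cross-entropy at a point with true preference probability \(p^*(w)\) is the expected target,

\[ \tilde{p}(w) = (1 - \alpha)\,p^*(w) + \alpha\,\big(1 - p^*(w)\big) = \alpha + (1 - 2\alpha)\,p^*(w), \]

which is pulled toward 0.5 whenever \(\alpha > 0\). Under CALS, \(\alpha\) itself depends on the confidence, so the equilibrium is roughly a fixed point of this relationship rather than a single linear squash, but the qualitative effect is the same conservative shrinkage.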

Experimental Results

The theory sounds solid, but does it work? The authors tested CALS on both a controlled logistic regression task and large-scale LLM alignment.

Logistic Regression (Sanity Check)

First, they tested the method on a high-dimensional logistic regression problem where the ground truth was known. They compared three methods:

  1. MLE: Standard Maximum Likelihood (no smoothing).
  2. MLE-CALS-2: CALS using the \(\ell_2\)-optimal strategy.
  3. MLE-CALS-0: CALS using the \(\ell_0\)-optimal strategy.

Loss curves for logistic regression experiments.

The results (Figure 4) show that MLE-CALS-0 (the purple dashed line) consistently achieves the lowest test loss across different dimensions (\(d\)) and training set sizes. This validates the theory that adaptive smoothing reduces estimation error.

LLM Alignment: Zephyr-7B and StarChat2-15B

The real test was on open-ended text generation. They fine-tuned Zephyr-7B and StarChat2-15B using the UltraFeedback dataset. They compared standard DPO against DPO equipped with CALS.

Evaluation Metrics: To judge the quality of the responses, they used two powerful judges:

  1. GPT-4: Prompted to pick the better response (Win Rate).
  2. ArmoRM: A state-of-the-art Reward Model from the RewardBench leaderboard.

The Results:

On GPT-4 evaluations (Figure 5), CALS outperformed the baseline DPO on both models. The “Win” bars (dark blue) are consistently larger than the “Lose” bars.

Bar chart showing GPT-4 win rates for CALS vs Baseline.

The results were even more pronounced with the ArmoRM evaluator (Figure 6), showing a clear preference for the CALS-trained models.

Bar chart showing ArmoRM win rates for CALS vs Baseline.

Robustness to Initialization

One might wonder: “Does CALS only work because it starts with a good smoothing parameter?” The authors tested this by initializing the training with different baseline smoothing values (\(\alpha \in \{0.8, 0.9, 1.0\}\)).

Regardless of the starting point, CALS consistently improved performance. This suggests the method is robust and capable of correcting suboptimal hyperparameter choices on its own.

GPT-4 win rates under different initialization parameters.

Conclusion and Implications

The “Confidence Aware Label Smoothing” paper highlights a crucial nuance in training LLMs: not all training data is created equal.

By treating every preference label as equally certain, standard DPO leaves performance on the table. CALS provides a mechanism to “listen” to the data more carefully. When the model detects ambiguity (via calibration), it smooths the label, effectively telling the optimizer to be cautious. When the signal is clear, it sharpens the label, allowing for faster learning.

Key Takeaways for Students:

  1. Gradients Matter: Understanding the bias-variance tradeoff in your gradient estimator can lead to better loss functions.
  2. Adaptive is Better: Heuristic constants (like fixed label smoothing) are rarely optimal. Making parameters data-dependent often yields gains.
  3. Calibration: The alignment between a model’s predicted probability and empirical accuracy is a powerful signal that can be used to stabilize training.

As LLMs continue to scale, techniques like CALS that improve data efficiency and alignment stability will become increasingly important. It moves us away from “magic numbers” in hyperparameters toward principled, self-adjusting training dynamics.