Reinforcement Learning from Human Feedback (RLHF) has established itself as the standard for aligning Large Language Models (LLMs) with human intent. While the traditional PPO-based pipeline was effective, it was also computationally expensive and unstable. The arrival of Direct Preference Optimization (DPO) changed the landscape by treating the language model itself as the reward model, streamlining the alignment process significantly.
However, DPO is not without its flaws. The standard intuition in machine learning is to focus on “hard” examples—the cases where the model makes mistakes. DPO’s loss function inherently places high weight on preference pairs where the model incorrectly ranks the rejected response higher than the chosen one.
But what if this intuition is wrong for LLM alignment? Recent research suggests that DPO rarely succeeds in fixing these “hard” misranked pairs, effectively wasting training cycles.
In this post, we explore FocalPO, a new approach presented by Liu et al. that fundamentally flips the script. Instead of obsessing over mistakes, FocalPO explicitly down-weights the confusing, misranked pairs and prioritizes enhancing the model’s understanding of pairs it can already rank correctly. We will dive into the mathematics of why this counter-intuitive approach works, derive the gradient behavior, and look at the empirical results.
Background: The DPO Landscape
To understand FocalPO, we must first revisit the mechanics of Direct Preference Optimization (DPO).
In a typical preference dataset \(\mathcal{D}\), we have triplets \((x, y_w, y_l)\), consisting of a prompt \(x\), a winning response \(y_w\), and a losing response \(y_l\). We assume these preferences follow the Bradley-Terry model, which posits that the probability of one response being preferred over another depends on the difference in their latent rewards.
\[
p^*(y_w \succ y_l \mid x) = \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big),
\]

where \(\sigma\) denotes the sigmoid function.
Here, \(r^*(x, y)\) represents the ground-truth reward. DPO leverages a mathematical trick to express this reward in terms of the optimal policy \(\pi^*\) and a reference policy \(\pi_{\text{ref}}\). This allows us to train the model directly on preference data without a separate reward model.
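Concretely, the trick is the reparameterization at the heart of DPO: the reward that is optimal under the KL-constrained RL objective can be written purely in terms of the corresponding policy,

\[
r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x),
\]

where \(Z(x)\) is a prompt-dependent partition function. Because \(Z(x)\) is identical for \(y_w\) and \(y_l\), it cancels in the reward difference inside the Bradley-Terry model, leaving an objective expressed only through \(\pi_\theta\) and \(\pi_{\text{ref}}\).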
The objective function for DPO attempts to maximize the likelihood of the preferred choices:
\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
\]
In this equation, \(\beta\) is a hyperparameter controlling the deviation from the reference model.
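To make this concrete, here is a minimal PyTorch-style sketch of the DPO loss, assuming we have already summed the token-level log-probabilities of each response under the policy and the reference model (function and variable names here are illustrative, not tied to any particular library):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,    # sum of log pi_theta(y_w | x) over tokens
             policy_rejected_logps: torch.Tensor,  # sum of log pi_theta(y_l | x) over tokens
             ref_chosen_logps: torch.Tensor,       # sum of log pi_ref(y_w | x) over tokens
             ref_rejected_logps: torch.Tensor,     # sum of log pi_ref(y_l | x) over tokens
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: r_hat(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO maximizes log sigma(r_hat_w - r_hat_l), i.e. minimizes the negative
    # log-likelihood of the observed preference under the Bradley-Terry model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```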
The Problem with DPO’s Focus
If we analyze the gradient of the DPO loss, we find that it assigns higher weights to sample pairs where the model’s implicit reward estimate is incorrect (i.e., where the model currently thinks the loser is better than the winner).
Theoretically, this makes sense: you want to learn from your mistakes. However, empirically, LLM preference data is often noisy, and some “hard” negatives may be outliers or inherently confusing. Recent studies (Chen et al., 2024) have shown that despite DPO emphasizing these cases, the model rarely improves on them. The gradient updates try to force the model to correct these rankings, but the model struggles to do so, potentially at the expense of general performance.
Core Method: FocalPO
The researchers introduce FocalPO, a method inspired by the famous “Focal Loss” used in computer vision object detection.
Inverting the Focal Loss
In computer vision, Focal Loss was designed to address extreme class imbalance (e.g., millions of easy background candidates vs. a few objects). It added a modulating factor \((1-p)^\gamma\) to the cross-entropy loss. If the model was confident (\(p \approx 1\)), the term \((1-p)^\gamma\) would be near zero, down-weighting the loss. This forced the model to focus on “hard” examples where it had low confidence.
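For reference, here is a minimal sketch of that binary Focal Loss, with `p` denoting the predicted probability of the true class (the `gamma = 2` default is the value commonly cited from the original paper):

```python
import torch

def focal_loss(p: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    # (1 - p)^gamma shrinks the loss on confident (easy) predictions,
    # so training is dominated by hard, low-confidence examples.
    return -((1.0 - p) ** gamma) * torch.log(p)
```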
FocalPO takes this concept but inverts the objective. Because “hard” preference pairs in LLM training are often noisy or unlearnable, FocalPO aims to reduce their influence.
The proposed loss function is:
\[
\mathcal{L}_{\text{FocalPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\, p(y_w \succ y_l \mid x)^{\gamma} \, \log p(y_w \succ y_l \mid x) \Big],
\]

where

\[
p(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right).
\]
Here, \(p(y_w \succ y_l \mid x)\) is the probability the model assigns to the correct ranking. The modulating factor is \(p^\gamma\) (the derivation actually arrives at a negative-power factor, which is then simplified to \(p^\gamma\) for practical optimization).
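In code, the change relative to DPO is a single extra factor on the per-pair loss. Below is a hedged sketch that reuses the same log-probability inputs as the DPO snippet above; the default `gamma` is purely illustrative, and this is a reading of the loss as written above rather than the authors’ released implementation:

```python
import torch
import torch.nn.functional as F

def focalpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1,
                 gamma: float = 2.0) -> torch.Tensor:  # gamma value is illustrative only
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # p = probability the model assigns to the correct ranking y_w > y_l
    p = torch.sigmoid(margin)
    # Modulating factor p^gamma: close to 1 when the model already ranks the pair
    # correctly, close to 0 when the pair is badly misranked (likely noisy/unlearnable).
    return -((p ** gamma) * F.logsigmoid(margin)).mean()
```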
Let’s visualize the difference between the traditional Focal Loss (aimed at hard samples) and the adapted FocalPO (aimed at easy/correct samples).

- Left (Original Focal Loss): As probability \(p\) increases (the model is correct), the loss drops to zero very fast. It ignores easy samples.
- Right (FocalPO): The modulating factor \(p^\gamma\) acts differently. When the model is correct (\(p\) is large), the factor is close to 1, meaning the loss remains standard. When the model is wrong (\(p\) is small), the factor approaches 0, effectively silencing the loss for that sample.
Gradient Analysis: The Bell Curve
To truly understand how this affects training, we need to look at the gradients. The gradient tells us how much the model parameters \(\theta\) should change based on a specific training example.
First, let’s look at the gradient for standard DPO:
\[
\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\, \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\, \big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big) \Big],
\]

where \(\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\) is the model’s implicit reward.
The term \(\sigma(\hat{r}_\theta(y_l) - \hat{r}_\theta(y_w))\) essentially says: “If the reward for the loser is higher than the winner (a mistake), make this gradient update larger.”
Now, compare this to the gradient for FocalPO:
\[
\nabla_\theta \mathcal{L}_{\text{FocalPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\, \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)^{\gamma}\, \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\, \big(1 + \gamma \log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\big)\, \big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big) \Big]
\]
The FocalPO gradient introduces extra terms involving \(\sigma^\gamma\) and \(\log \sigma\). These extra terms change how much weight each pair receives, depending on how wrong, or how confident, the model currently is about it.
We can visualize these gradients to see the behavior clearly:

- Blue Dashed Line (DPO): The weight is highest when the reward difference is negative (the model is wrong). It keeps pushing harder the more wrong the model is.
- Black Solid Line (FocalPO): Notice the bell shape.
  - Far Left (Very Wrong): The gradient weight drops to near zero. FocalPO essentially gives up on these “lost causes” or noisy data points.
  - Far Right (Very Confident): The weight is low because the model has already learned this.
  - Center-Left (Uncertain/Slightly Wrong): The peak of the curve sits around the decision boundary.
This “bell-shaped” gradient profile is the secret sauce. FocalPO encourages the model to learn from pairs where the reward margin is close to zero (the “borderline” cases) or where it is slightly correct, while avoiding excessive influence from samples that are impossibly hard (potentially due to data noise) or already perfectly learned.
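To get a feel for these curves, we can differentiate the two per-pair losses with respect to the reward margin and plot the magnitude of that derivative. The sketch below does this with autograd, using the per-pair loss forms written above (so it reflects those assumptions rather than the paper’s exact plotting code): the DPO weight decreases monotonically with the margin, while the FocalPO weight is suppressed at both extremes.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

gamma = 2.0  # illustrative; the paper treats gamma as a tunable hyperparameter

# Implicit reward margin r_hat(x, y_w) - r_hat(x, y_l), from "very wrong" to "very confident"
margin = torch.linspace(-8.0, 8.0, steps=400, requires_grad=True)

dpo_loss = -F.logsigmoid(margin)                                         # per-pair DPO loss
focalpo_loss = -(torch.sigmoid(margin) ** gamma) * F.logsigmoid(margin)  # per-pair FocalPO loss

# |d(loss)/d(margin)|: how strongly each loss pushes on a pair with that margin
dpo_w = torch.autograd.grad(dpo_loss.sum(), margin, retain_graph=True)[0].abs()
focal_w = torch.autograd.grad(focalpo_loss.sum(), margin)[0].abs()

m = margin.detach()
plt.plot(m, dpo_w, "b--", label="DPO")
plt.plot(m, focal_w, "k-", label="FocalPO")
plt.xlabel("implicit reward margin")
plt.ylabel("gradient weight (magnitude)")
plt.legend()
plt.show()
```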
Experiments and Results
The authors evaluated FocalPO against DPO and several variants (KTO, ORPO, SimPO) using Mistral-Base (7B) and Llama-3-Instruct (8B). They utilized the UltraFeedback dataset for training.
Main Benchmarks
The models were tested on Alpaca Eval 2.0 (a simulated chatbot arena judged by GPT-4-Turbo) and Arena-Hard (a set of challenging prompts judged by GPT-4).
| Method | Llama-3 (Alpaca WR) | Llama-3 (Arena WR) | Mistral (Alpaca WR) | Mistral (Arena WR) |
|---|---|---|---|---|
| DPO | 47.5 | 20.6 | 18.6 | 16.4 |
| SimPO | 47.5 | 21.5 | 21.4 | 17.0 |
| FocalPO | 49.8 | 22.5 | 20.1 | 17.1 |
FocalPO consistently outperformed standard DPO. Notably, it also outperformed SimPO (Simple Preference Optimization) on most metrics. This is significant because SimPO requires removing the reference model and adding length-normalization terms; FocalPO achieves comparable or better results simply by modifying the loss weighting, keeping the standard DPO setup intact.
Focusing on Correct vs. Incorrect
To validate the hypothesis that focusing on “correct” rankings is better than focusing on “incorrect” ones, the authors ran a controlled experiment. They compared:
- DPO: Standard baseline.
- FocalPO (Incorrect): Up-weighting incorrectly ranked pairs (analogous to the original computer-vision Focal Loss).
- FocalPO (Correct): Up-weighting correctly ranked pairs (the proposed method).
The difference in gradient focus between these strategies is illustrated below:

The results of this ablation study were stark:

As shown in Figure 1, the strategy of focusing on incorrect samples (the left-most bars) actually degraded performance compared to standard DPO. Conversely, focusing on correct samples (the right-most bars) provided a significant boost. This empirically confirms that for LLM preference alignment, reinforcing what the model does right (or is on the verge of getting right) is more effective than forcing it to fix deep misconceptions.
Qualitative Improvements
The paper highlights that FocalPO not only improves win rates but also the quality of reasoning. In one example, the model was asked: “How long will it take to walk around the world?”
- DPO Model: Calculated purely based on circumference divided by walking speed, resulting in 335 days. It ignored sleep, rest, and terrain.
- FocalPO Model: Accounted for walking 8 hours a day, terrain, and logistics, estimating approximately 1,002 days.
This answer aligned much more closely with reasoning models like OpenAI’s o1 and GPT-4, suggesting that the “cleaner” gradient signal from FocalPO allows the model to retain better reasoning capabilities during alignment.
Conclusion
FocalPO offers a compelling correction to the standard intuition of preference optimization. By acknowledging that LLM preference data can be noisy and that “hard” negatives are often counter-productive to training, Liu et al. have designed a loss function that strategically shifts focus.
The “Bell Curve” gradient of FocalPO effectively filters out the noise of highly confusing pairs and the redundancy of highly confident pairs, creating a “Goldilocks zone” for training. The result is a model that aligns better with human preference without complex architectural changes or additional reward modeling.
For students and practitioners in NLP, FocalPO demonstrates the importance of critically analyzing loss landscapes. It serves as a reminder that methods imported from other fields (like Computer Vision’s Focal Loss) often require fundamental inversion to work in the context of Language Modeling.