Large Language Models (LLMs) are impressive, but raw pre-trained models are like unpolished gems. They can predict the next token, but they often struggle to follow instructions or adhere to human safety standards. To fix this, we typically rely on a multi-stage training pipeline: Pre-training, Supervised Fine-Tuning (SFT), and finally, Preference Alignment (using methods like RLHF or DPO).
While effective, this pipeline is complex, resource-intensive, and brittle.
In this post, we are diving deep into a paper from KAIST AI that challenges this status quo. The paper, “ORPO: Monolithic Preference Optimization without Reference Model,” introduces a method to merge Supervised Fine-Tuning and Preference Alignment into a single, efficient process. If you are a student of NLP or machine learning, ORPO offers a fascinating look into how model alignment can be made cheaper, faster, and, surprisingly, more effective.
The Standard Alignment Pipeline (and its Flaws)
To understand why ORPO is significant, we first need to look at how most modern LLMs (like ChatGPT or Llama-2-Chat) are built. The process usually looks like this:
- Supervised Fine-Tuning (SFT): The model is trained on high-quality instruction-response pairs to learn how to format answers and follow instructions.
- Preference Alignment: The model is further refined using “preference data” (pairs of answers where one is “chosen” and one is “rejected”).
The standard industry methods for step 2 are RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization).

As shown in Figure 2 above, both RLHF and DPO typically require a reference model.
- RLHF requires training a separate Reward Model and then using PPO (Proximal Policy Optimization) to update the policy. This is notoriously unstable and sensitive to hyperparameters.
- DPO removed the need for a separate Reward Model but still requires a reference model (usually the SFT model) to ensure the new model doesn’t drift too far away from the original distribution.
This dependency creates a computational bottleneck. You often need enough memory to hold two versions of the model (the active policy and the frozen reference) simultaneously, or you have to perform multiple forward passes, doubling your compute costs.
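To make that concrete, here is a minimal sketch (assuming PyTorch; the tensor names are hypothetical) of the standard DPO objective. Notice that it cannot be computed without log-probabilities from a frozen reference model, which is exactly the dependency ORPO removes.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss (sketch). Each *_logps tensor holds the log-probability
    of a response under either the trainable policy or the frozen reference."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps   # requires a second model in memory
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```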
The Hidden Problem with Supervised Fine-Tuning
Before the authors even introduce ORPO, they identify a critical theoretical weakness in how we perform Supervised Fine-Tuning (SFT).
In SFT, we train the model using Cross-Entropy Loss. The goal is to maximize the likelihood of the target (correct) response. For a response of \(m\) tokens, the loss looks like this:

\[
\mathcal{L}_{SFT} = -\frac{1}{m}\sum_{t=1}^{m} \log P_\theta(y_t \mid x, y_{<t})
\]
This equation tells the model: “Make this specific sequence of tokens more likely.” However, it does not tell the model: “Make other, bad sequences less likely.”
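As a minimal illustration (assuming PyTorch; `logits` and `target_ids` are hypothetical outputs of a causal language model and its tokenized chosen response), note that the objective below never sees a rejected response at all:

```python
import torch.nn.functional as F

def sft_loss(logits, target_ids):
    """Plain SFT cross-entropy on the chosen response.
    logits: (batch, seq_len, vocab_size); target_ids: (batch, seq_len).
    It only pushes the probability of the target tokens up; there is no
    term that pushes any other (e.g. rejected) sequence down."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch*seq_len, vocab_size)
        target_ids.reshape(-1),               # (batch*seq_len,)
    )
```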
The researchers conducted a pilot study to see what actually happens during SFT. They took a model (OPT-350M) and fine-tuned it on the “Chosen” (good) responses from a preference dataset. They then tracked the probability the model assigned to the “Rejected” (bad) responses.

The results in Figure 3 are revealing. As the model learns the “Chosen” responses (Green line), the probability of the “Rejected” responses (Orange line) also increases.
Because the SFT loss only pushes up the probability of the target tokens, the model effectively learns the domain (e.g., how to structure a dialogue), but it doesn’t learn to discriminate between a high-quality answer and a toxic or hallucinated one. It essentially says, “I recognize this style of text,” regardless of whether the content is good or bad.
Enter ORPO: Odds Ratio Preference Optimization
The authors propose a solution that modifies the SFT stage itself. Instead of doing SFT first and Alignment second, why not penalize the model for “bad” styles during SFT?
This is ORPO. It is a monolithic method, meaning it happens in one block of training. It requires no reference model and no separate reward model.
The Intuition: Odds vs. Probability
To understand ORPO, we must revisit a concept from basic statistics: the Odds.
In probability theory, if the probability of an event \(y\) occurring given context \(x\) is \(P(y|x)\), the odds are defined as the probability of it happening divided by the probability of it not happening:

\[
\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
\]
If a model thinks a sequence has a 0.8 (80%) probability, the odds are \(0.8 / 0.2 = 4\). This means the model is 4 times more likely to generate this sequence than not.
ORPO focuses on the Odds Ratio (OR) between the chosen response (\(y_w\)) and the rejected response (\(y_l\)):

\[
\text{OR}_\theta(y_w, y_l) = \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}
\]
This ratio tells us how much more likely the chosen response is compared to the rejected one. If the Odds Ratio is high, the model strongly prefers the good answer. If it is low (near 1), the model is indifferent.
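A quick back-of-the-envelope check in plain Python (the probabilities here are made up for illustration):

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

p_chosen, p_rejected = 0.8, 0.6          # hypothetical sequence probabilities

print(odds(p_chosen))                    # ~4.0: chosen is 4x more likely than not
print(odds(p_rejected))                  # ~1.5
print(odds(p_chosen) / odds(p_rejected)) # odds ratio ~2.67: a moderate preference
```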
The Objective Function
The magic of ORPO lies in its loss function. It combines the standard SFT loss with a new “Odds Ratio” penalty, weighted by a coefficient \(\lambda\):

\[
\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\Big[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \Big]
\]
Here, \(\mathcal{L}_{SFT}\) makes the model better at generating coherent text (domain adaptation). The new term, \(\mathcal{L}_{OR}\), creates the alignment.
The Odds Ratio Loss (\(\mathcal{L}_{OR}\)) is designed to maximize the gap between the chosen and rejected responses. It wraps the log odds ratio in a log-sigmoid function:

\[
\mathcal{L}_{OR} = -\log \sigma\!\left( \log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)} \right)
\]
When we minimize this loss, we are forcing the Odds Ratio to increase. This effectively tells the model: “While you are learning to talk (SFT), strictly prioritize the chosen answer over the rejected one.”
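Here is what that objective can look like in code. This is a minimal PyTorch sketch, not the authors' reference implementation; `chosen_logps` and `rejected_logps` are assumed to be the length-normalized log-probabilities of the chosen and rejected responses under the model being trained, and `sft_nll` is the cross-entropy on the chosen response.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, sft_nll, lam=0.1):
    """Sketch of L_ORPO = L_SFT + lambda * L_OR. No reference model needed."""
    # log odds(y|x) = log P - log(1 - P), computed directly from log-probabilities
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # L_OR = -log sigmoid(log odds_w - log odds_l): minimizing it widens the gap
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    return (sft_nll + lam * or_loss).mean()
```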
Why Odds Ratio and not Probability Ratio?
You might wonder why the authors use the Odds Ratio (\(\frac{odds_w}{odds_l}\)) instead of a simple Probability Ratio (\(\frac{P_w}{P_l}\)), the quantity that, in reference-normalized form, sits at the heart of methods like DPO.
The answer comes down to gradient behavior: how harshly each choice punishes the rejected response during training.
The authors analyzed the gradients (the signals used to update the model weights). They found that using a probability ratio creates a very sharp, extreme penalty. If the model assigns a high probability to a rejected response, a probability-ratio loss would “crash” that probability down very aggressively.

As seen in Figure 8 (left), using a probability ratio forces the log probability of rejected items to plummet rapidly (approaching -8 or lower). While this sounds good, in the context of SFT, it is destructive. It suppresses the model so hard that it forgets the basic structure of language (the domain adaptation).
The Odds Ratio (Figure 8, right) is gentler. It provides a “smooth” penalty. It lowers the likelihood of rejected answers enough to align the model, but not so much that it destroys the model’s ability to generate coherent text.
The authors also visualized the theoretical distribution of these ratios:

The Odds Ratio (Green) offers a wider, more usable dynamic range for the loss function compared to the sharp spike of the Probability Ratio (Blue/Orange).
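You can reproduce the flavor of that comparison in a few lines of NumPy. This is a rough sketch rather than the paper's exact experiment: it samples random probability pairs and compares how spread out the two log-ratios are.

```python
import numpy as np

rng = np.random.default_rng(0)
p_w = rng.uniform(0.01, 0.99, size=50_000)   # stand-in "chosen" probabilities
p_l = rng.uniform(0.01, 0.99, size=50_000)   # stand-in "rejected" probabilities

log_prob_ratio = np.log(p_w / p_l)
log_odds_ratio = np.log((p_w / (1 - p_w)) / (p_l / (1 - p_l)))

# The log odds ratio spreads over a noticeably wider range, giving the
# log-sigmoid loss a smoother, less spiky signal to work with.
print(f"std of log probability ratio: {log_prob_ratio.std():.2f}")
print(f"std of log odds ratio:        {log_odds_ratio.std():.2f}")
```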
Does it actually work?
To prove the effectiveness of ORPO, the researchers fine-tuned models like Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) on the UltraFeedback dataset. They compared these against standard baselines (SFT, SFT+DPO, SFT+RLHF).
AlpacaEval Performance
One of the most impressive results comes from the AlpacaEval 2.0 benchmark, which measures how well a model follows instructions compared to a powerful reference model (like GPT-4).

In Figure 1, look at the blue bars. Mistral-ORPO-beta (7B) achieves a win rate of 12.20%. This is higher than Zephyr-beta (a popular DPO-aligned model) and even outperforms Llama-2-Chat (13B), a model nearly twice its size.
This result was achieved with a single epoch of training, combining SFT and Alignment into one pass.
Multi-Turn Conversation (MT-Bench)
Single-turn instructions are one thing, but can the model hold a conversation? The authors tested ORPO on MT-Bench.

The radar chart in Figure 4 shows that the ORPO-tuned Mistral models (7B) are highly competitive, matching or beating larger models in categories like Reasoning, Extraction, and STEM.
Reward Distribution Analysis
An “aligned” model should generate answers that a reward model scores highly. The authors plotted the distribution of rewards for different training methods on the UltraFeedback test set.

In Figure 5, look at the position of the distributions.
- Blue (SFT): Biased toward the left (lower rewards).
- Green (RLHF): Shifts right, but notice the weird gaps? This indicates instability or “reward hacking” issues where the model generates garbage that tricks the reward model.
- Red (ORPO): The distribution is smooth and shifted furthest to the right (highest rewards).
This visually confirms that ORPO is consistently generating high-quality responses without the instability often seen in RLHF.
Why ORPO Matters for the Future
The significance of ORPO goes beyond just getting a higher score on a leaderboard. It represents a shift in efficiency.
- Memory Efficiency: Because ORPO does not require a reference model, you don’t need to load two LLMs into VRAM at once. This makes it possible to align large models on consumer-grade hardware that might otherwise choke on DPO or RLHF.
- Training Speed: By combining SFT and Alignment, you eliminate a complete stage of the pipeline.
- Simplicity: Hyperparameter tuning for RLHF (PPO) is notoriously difficult. ORPO introduces essentially just one major hyperparameter: \(\lambda\) (the weight of the odds ratio loss), making it much easier to implement (see the sketch below).
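To illustrate that simplicity, here is a rough sketch of an ORPO run using the Hugging Face TRL library, which provides an `ORPOTrainer`. Treat the model name, dataset, hyperparameter values, and exact argument names as assumptions to check against the current TRL documentation, not a canonical recipe.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"                 # example base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A preference dataset with "prompt", "chosen", and "rejected" columns
# (replace with your own prepared split of UltraFeedback or similar).
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = ORPOConfig(
    output_dir="mistral-orpo",
    beta=0.1,                        # TRL's name for lambda, the OR-loss weight
    num_train_epochs=1,              # one pass: SFT and alignment together
    per_device_train_batch_size=2,
    learning_rate=5e-6,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,      # named `tokenizer=` in older TRL versions
)
trainer.train()
```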
Conclusion
The research paper “ORPO: Monolithic Preference Optimization without Reference Model” provides a compelling argument that we have been overcomplicating model alignment. By understanding that Supervised Fine-Tuning increases the probability of all similar text—good and bad—the authors identified a root cause of misalignment.
Their solution, applying an Odds Ratio penalty directly during the SFT process, elegantly solves the problem. It forces the model to learn the domain and the human preferences simultaneously. For students and researchers looking to train their own LLMs, ORPO offers a powerful, resource-friendly alternative to the heavy machinery of Reinforcement Learning.
We are likely to see more “monolithic” training methods in the future, as the community moves toward more efficient ways to make our AI assistants helpful, harmless, and honest.