If you follow the rapid evolution of Large Language Models (LLMs), you are likely familiar with the “alignment” phase. After a model consumes terabytes of text to learn how to predict the next token (Pre-training) and learns to follow instructions (Supervised Fine-Tuning or SFT), it undergoes a final polish: Preference Optimization. This is the stage where models like ChatGPT or Claude learn to be helpful, harmless, and conversational, usually via techniques like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO).
However, there is a glaring disparity in this process. Most preference optimization research focuses heavily on English. While we have excellent “aligned” models in English, the performance drops significantly when we switch to Vietnamese, Turkish, or Hindi. This creates an alignment gap where non-English speakers receive lower-quality, and potentially less safe, model interactions.
In a recent paper titled “RLHF Can Speak Many Languages”, researchers from Cohere For AI tackle this problem head-on. They conducted an exhaustive study on how to align models across 23 diverse languages. Their findings are surprising: not only does multilingual preference optimization work, but “online” reinforcement learning methods (specifically RLOO) significantly outperform the currently popular offline DPO method in multilingual settings.
As shown below, their resulting model—an optimized version of Aya 23 8B—outperforms industry heavyweights like Llama-3-8B-Instruct and Mistral-7B-Instruct-v0.3 across these languages.

Let’s dive into how they achieved this and why this matters for the future of multilingual AI.
The Multilingual Data Problem
Before we can optimize for preferences, we need data. In English, we have massive datasets of prompts where human annotators have ranked two model responses (Response A is better than Response B). In other languages, this data is virtually non-existent.
The standard “quick fix” in previous research was to simply translate English datasets. If you have an English prompt and two English responses, you run them through Google Translate, and voilà—you have a “German” dataset.
The authors argue that this approach is flawed. It introduces translation artifacts—subtle quirks and unnatural phrasings often called “translationese.” If you train a model on this, it learns to sound like a translator machine, not a native speaker. Furthermore, simply translating the same pairs repeatedly limits the diversity of the data.
A Novel Data Strategy
To solve this, the researchers created a new pipeline to generate high-quality synthetic preference data without relying entirely on translation:
- Prompts: They took 50,000 English prompts from ShareGPT and translated them into 23 languages using NLLB-3.3B (a high-quality translation model).
- Completions (The Clever Part): Instead of just translating English answers, they generated new answers using two different models with different capabilities:
- Model A (Command): A model proficient primarily in English. Its answers were generated in English and then translated.
- Model B (Command R+): A model explicitly trained for multilingual performance. It generated answers directly in the target language.
This created a natural hierarchy. The “direct” generations from the multilingual model were generally superior and more natural than the “translated” generations. This allowed the researchers to construct preference pairs (Better vs. Worse) where the model learns to prefer natural, native-sounding text over “translationese.”
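The pairing logic itself is straightforward. Here is a minimal sketch of how one such synthetic preference example could be assembled; the callables and field names are placeholders standing in for NLLB translation and the two completion models, not the authors' actual pipeline.

```python
from typing import Callable

def build_preference_pair(
    english_prompt: str,
    target_lang: str,
    translate: Callable[[str, str, str], str],    # (text, src_lang, tgt_lang) -> text
    generate_english: Callable[[str], str],       # English-centric model (e.g. Command)
    generate_multilingual: Callable[[str], str],  # multilingual model (e.g. Command R+)
) -> dict:
    """Assemble one synthetic preference example for `target_lang`."""
    # 1. Translate the English ShareGPT prompt into the target language.
    prompt = translate(english_prompt, "en", target_lang)

    # 2. "Rejected" completion: answer in English, then translate the answer
    #    (this is the side prone to translationese).
    rejected = translate(generate_english(english_prompt), "en", target_lang)

    # 3. "Chosen" completion: answer directly in the target language.
    chosen = generate_multilingual(prompt)

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```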
The Core Method: Offline vs. Online Alignment
Once the data was ready, the researchers faced a methodological question: Which algorithm should be used to align the model?
In the current open-source landscape, Direct Preference Optimization (DPO) is the default choice. It is an “offline” method, meaning it looks at a static dataset of preferences and optimizes the model without needing to generate new text during training. It is computationally efficient and stable.
However, the original way to do alignment is Reinforcement Learning (RL), specifically “online” methods such as PPO or, as used in this paper, REINFORCE-Leave-One-Out (RLOO). In online methods, the model generates new responses during training, has them scored by a Reward Model, and updates its parameters accordingly.
The authors hypothesized that in a complex, heterogeneous setting like multilingualism, the “online” generation aspect might be crucial.
The Mathematics of Alignment
To understand the difference, let’s look at the objectives.
1. The Standard RLHF Objective
Online methods train a policy \(\pi_{\theta}\) (the LLM) to maximize a reward \(r_{\phi}\) while staying close to the original reference model \(\pi_{\text{ref}}\) (to prevent it from outputting gibberish to game the reward system).
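In its standard form, this objective can be written as follows (usual RLHF notation: \(x\) is a prompt, \(y\) a sampled completion, and \(\beta\) the strength of the KL penalty):

\[
\max_{\pi_{\theta}} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)} \big[ r_{\phi}(x, y) \big] \;-\; \beta \, D_{\mathrm{KL}}\!\big( \pi_{\theta}(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x) \big)
\]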

The KL-divergence term \(D_{\mathrm{KL}}\), scaled by the coefficient \(\beta\), keeps the model from drifting too far from its original training.
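Written out, this penalty is the expected log-ratio between the current policy and the reference policy:

\[
D_{\mathrm{KL}}\!\big( \pi_{\theta} \,\big\|\, \pi_{\text{ref}} \big) = \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)} \left[ \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]
\]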

2. Direct Preference Optimization (DPO)
DPO re-derives this objective to remove the reward model entirely. It treats the problem as a classification task in which we increase the probability of the “winning” response (\(y_+\)) and decrease the probability of the “losing” response (\(y_-\)).
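Concretely, the DPO loss takes a logistic classification form (standard notation; \(\sigma\) is the sigmoid function and \(\beta\) again controls the implicit KL constraint):

\[
\mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_+,\, y_-)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_{\theta}(y_+ \mid x)}{\pi_{\text{ref}}(y_+ \mid x)} - \beta \log \frac{\pi_{\theta}(y_- \mid x)}{\pi_{\text{ref}}(y_- \mid x)} \right) \right]
\]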

3. RLOO (Online Method)
The authors chose RLOO as their online contender. RLOO is a variant of the REINFORCE algorithm. It generates \(k\) samples for a single prompt during training. To calculate the “advantage” of one specific sample, it compares that sample’s reward against the average reward of the other samples (leaving one out).
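In standard notation, the resulting gradient estimate for a prompt \(x\) with \(k\) sampled completions \(y_1, \dots, y_k\) uses the average reward of the other \(k-1\) samples as a baseline:

\[
\frac{1}{k} \sum_{i=1}^{k} \left[ R(y_i, x) - \frac{1}{k-1} \sum_{j \neq i} R(y_j, x) \right] \nabla_{\theta} \log \pi_{\theta}(y_i \mid x)
\]

where \(R(y_i, x)\) is the (KL-regularized) reward assigned to sample \(i\).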

This method provides the benefits of online exploration (the model sees its own current mistakes and successes) without the massive memory overhead of PPO.
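To make the leave-one-out baseline concrete, here is a minimal sketch of the advantage computation for a single prompt; the function name and tensor shapes are illustrative, not taken from the authors' implementation.

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages for one prompt.

    rewards: shape (k,), the scalar reward for each of the k completions
             sampled for the same prompt.
    Returns: shape (k,), where entry i is rewards[i] minus the mean
             reward of the other k - 1 samples.
    """
    k = rewards.shape[0]
    # Mean of the *other* samples: (total - own reward) / (k - 1).
    loo_baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - loo_baseline

# Example: 4 completions for one prompt, scored by the reward model.
rewards = torch.tensor([0.2, 0.8, 0.5, 0.1])
advantages = rloo_advantages(rewards)
# Each advantage then weights the log-probability of its own sample,
# e.g. loss = -(advantages.detach() * sequence_logprobs).mean()
```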
Experiments and Results
The researchers utilized Aya 23 8B, a state-of-the-art multilingual base model, and trained it using both DPO and RLOO across different data mixtures:
- English-Only: To test if English data helps other languages.
- Multilingual-5: English + 4 others (to test transfer to unseen languages).
- Multilingual-23: All supported languages.
Here is what they found.
1. Online RLOO Beats Offline DPO
This is perhaps the most significant finding for the broader ML community. Across the board, the online RLOO method outperformed DPO.
In the table below, you can see the win rates against the base model. While DPO only modestly improves the model (a 47.0% win rate, roughly parity once ties are accounted for, rising to 50.2% with more data), RLOO achieves a 54.0% win rate on the standard mixture.

The authors attribute this to the fact that online methods allow the model to explore and receive feedback on its own generations, which is critical when navigating the complex grammar and nuance of 23 different languages simultaneously.
2. Cross-Lingual Transfer is Real
A major question in multilingual AI is: Does learning to be helpful in English make the model helpful in Arabic?
The answer is yes, but with a catch.
- English-only training (Fig 2c, left bars) does improve performance on unseen languages slightly.
- Multilingual training (Fig 2c, middle bars) turbocharges this transfer. When the model is trained on just 5 languages, its performance on the other 18 unseen languages jumps significantly, especially with RLOO.

Look at chart (c) in the image above. The blue bar (RLOO) for “ML-5” (training on 5 languages) shows a massive jump in win rate on unseen languages compared to training on English alone. This suggests that the model learns a general “alignment concept” that transcends specific languages, provided it sees enough linguistic diversity during training.
3. Scaling Languages Improves Performance
It might seem intuitive that adding more languages helps, but in machine learning, “negative transfer” (where learning Task A hurts Task B) is a common fear.
The study confirms that adding more languages generally improves performance. Moving from English-only to all 23 languages resulted in the highest average win rates. Importantly, this didn’t come at a cost to English performance. As seen in the table below regarding summarization tasks, the RLOO model trained on 23 languages (ML-23) achieved a staggering 70.7% win rate against the base model.

4. No “Alignment Tax” on Intelligence
A common concern with RLHF is the “alignment tax”—the idea that making a model friendlier makes it dumber at hard tasks like math or reasoning.
The authors tested the models on benchmarks like mMMLU (knowledge), MGSM (math), and XCOPA (reasoning). As shown in the table below, the preference-optimized models maintained the reasoning capabilities of the base model. The alignment process improved instruction-following without degrading the core intellect of the model.

Conclusion and Implications
The “RLHF Can Speak Many Languages” paper marks a turning point for multilingual open-weights models. By moving away from English-centric data and demonstrating the superiority of online RLOO training, the authors have provided a recipe for closing the language gap in AI.
Key Takeaways for Students and Practitioners:
- Don’t rely solely on DPO: While DPO is easy to run, online methods like RLOO offer superior performance, likely because they force the model to fix its own generation errors during training.
- Data Quality > Quantity: Generating fresh, diverse responses (Command vs Command R+) beats simply translating existing English datasets.
- Language Synergy: Training on a cluster of languages helps the model generalize to languages it hasn’t even seen during the alignment phase.
By following these principles, the research team produced a model that not only speaks 23 languages but does so with a level of nuance and helpfulness that rivals the best English-first models available today. As the field moves forward, we can expect “multilingual-by-design” to become the new standard for preference optimization.