Introduction
In the current landscape of Artificial Intelligence, Large Language Models (LLMs) have achieved celebrity status. Models like GPT-4, Claude, and Gemini can write poetry, code in Python, and summarize complex legal documents. However, there is a hidden cost to this brilliance: Alignment.
Pre-trained models are, by default, unruly text completers. To make them helpful assistants that follow instructions and avoid toxic output, they must undergo a process called alignment, typically involving Reinforcement Learning from Human Feedback (RLHF). This process requires massive datasets where humans rate model outputs (e.g., “Response A is better than Response B”).
Here lies the problem: The Language Gap.
Collecting high-quality human preference data is expensive and time-consuming. Consequently, the vast majority of this data exists in English. If you want to build a safe, aligned chatbot in Vietnamese, Turkish, or Swahili, you generally face a “cold start” problem—you just don’t have the labeled data to train the reward models necessary to align the AI.
But what if you didn’t need it?
In a fascinating paper titled “Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment,” researchers from MIT and Google DeepMind propose a counter-intuitive solution. They ask: Can we use a Reward Model trained on English preferences to align a model generating text in Spanish, German, or Russian?
The answer is a resounding yes. In this post, we will tear down this research to understand how we can “reuse rewards” to democratize safe AI across languages.
Background: The Alignment Pipeline
To understand why this cross-lingual transfer is impressive, we first need to understand the standard recipe for aligning an LLM. This usually happens in three stages:
- Supervised Finetuning (SFT): We take a raw, pre-trained base model (which just predicts the next word) and train it on a dataset of “Instruction \(\rightarrow\) Response” pairs. This teaches the model the format of a helpful assistant. We denote this model as \(\pi_{\text{SFT}}\).
- Reward Modeling (RM): This is the critic. We train a separate model, \(r(x, y)\), to look at an input prompt \(x\) and a response \(y\), and output a scalar score indicating how “good” the response is. This model is trained on human preference data (humans picking the winner between two options).
- Reward Optimization: We use the Reward Model to update the SFT model. We encourage the model to generate responses that get high scores from the RM. This is typically done via Reinforcement Learning (specifically PPO) or a method called “Best-of-N” reranking.
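To make the last stage concrete, here is a minimal sketch of Best-of-N reranking. The `sft_model.generate` and `reward_model.score` interfaces are hypothetical placeholders, not the paper's code:

```python
def best_of_n(prompt, sft_model, reward_model, n=16):
    """Sample N candidates from the SFT policy and keep the one the RM scores highest."""
    candidates = sft_model.generate(prompt, num_samples=n)        # hypothetical sampling API
    scores = [reward_model.score(prompt, y) for y in candidates]  # hypothetical scoring API
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```

Unlike PPO, Best-of-N never updates the policy's weights; it simply filters samples with the reward model at decoding time.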
The Mathematical Critic
The Reward Model is the bottleneck for multilingual AI. It requires human judgments. If we have a dataset of pointwise judgments (good vs. bad, encoded as a binary label \(z\)), we train the RM to minimize the following loss:

\[
\mathcal{L}(r) = -\mathbb{E}_{(x, y, z)}\Big[ z \log \sigma\big(r(x, y)\big) + (1 - z) \log\big(1 - \sigma\big(r(x, y)\big)\big) \Big]
\]

In simple terms, this equation trains the model to assign high probabilities to “good” responses (\(z=1\)) and low probabilities to “bad” ones (\(z=0\)).
More commonly, we use pairwise feedback, where a human chooses a winner (\(y_w\)) and a loser (\(y_l\)). The RM is trained to maximize the gap between the scores of the winner and loser:

\[
\mathcal{L}(r) = -\mathbb{E}_{(x, y_w, y_l)}\Big[ \log \sigma\big(r(x, y_w) - r(x, y_l)\big) \Big]
\]
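Both losses are a few lines in PyTorch. Below is a minimal sketch assuming the reward model already produces scalar scores; batching and tokenization are omitted:

```python
import torch
import torch.nn.functional as F

def pointwise_rm_loss(rewards: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on sigma(r): push r up for good responses (z=1), down for bad (z=0)."""
    return F.binary_cross_entropy_with_logits(rewards, labels.float())

def pairwise_rm_loss(r_winner: torch.Tensor, r_loser: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: maximize the log-sigmoid of the winner-loser score gap."""
    return -F.logsigmoid(r_winner - r_loser).mean()

# Toy usage with made-up scores for three pairwise comparisons.
r_w = torch.tensor([1.2, 0.3, 2.0])
r_l = torch.tensor([0.1, 0.5, -1.0])
print(pairwise_rm_loss(r_w, r_l))  # scalar loss averaged over the three comparisons
```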
Once we have this RM, we can run Reinforcement Learning. The objective function for RL seeks to maximize the reward while ensuring the model doesn’t drift too far from the original SFT model (measured by KL-divergence):

\[
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \big)
\]
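In practice, many RLHF implementations fold the KL term into a per-sample reward before running PPO. Here is a minimal sketch of that idea, with an illustrative \(\beta\) and a sequence-level, single-sample KL estimate (both are assumptions, not the paper's exact setup):

```python
import torch

def kl_regularized_reward(rm_score: torch.Tensor,
                          logprob_policy: torch.Tensor,
                          logprob_sft: torch.Tensor,
                          beta: float = 0.05) -> torch.Tensor:
    """RM reward minus a KL penalty toward the frozen SFT model.

    `logprob_policy` and `logprob_sft` are the summed log-probabilities of the sampled
    response under the current policy and under pi_SFT; their difference is a
    single-sample estimate of the KL term in the objective above.
    """
    kl_estimate = logprob_policy - logprob_sft
    return rm_score - beta * kl_estimate
```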
The Catch: In a standard setting, if you want an aligned Spanish model, you need a Spanish SFT model and a Spanish Reward Model trained on Spanish human preferences. The researchers propose that we can skip the latter.
The Core Method: Cross-Lingual Reward Transfer
The researchers propose a novel setup: Reward Model (RM) Transfer.
The intuition is relatively simple. Concepts of “quality”—like helpfulness, safety, and coherence—are likely universal. A good summary is a good summary, regardless of whether it is written in English or German. Furthermore, modern base models (like PaLM or mT5) are multilingual. They map different languages into a shared semantic space.
Therefore, a Reward Model trained on English data should essentially learn a “quality function” that exists in this shared space. If we feed it a Spanish response, it should theoretically be able to judge it, even if it was never explicitly trained on Spanish preference data.
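This is the entire trick in code form. Below is a minimal sketch assuming the reward model is implemented as a sequence classifier with a scalar head on top of a multilingual backbone, fine-tuned only on English preference data. The checkpoint path is a placeholder, and the paper's RMs (built on the multilingual base models mentioned above) are not packaged this way; the point is simply that nothing stops you from feeding the RM a Spanish response:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "path/to/english-only-reward-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
rm = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)

prompt = "Resume el siguiente artículo: ..."   # Spanish prompt
response = "El artículo explica cómo ..."      # Spanish candidate response

inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = rm(**inputs).logits.squeeze().item()  # scalar reward, zero-shot for Spanish
print(score)
```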
The Architecture
Let’s visualize how this differs from the traditional approach.

As shown in Figure 1:
- The Monolingual Path (Standard): You start with a base LM, perform Spanish SFT, train a Spanish RM, and end up with an aligned Spanish model.
- The Cross-Lingual Path (Proposed): You perform Spanish SFT (using available data), but you use an English RM (or another source language) to guide the optimization.
This is a “Zero-Shot” method because the Reward Model sees zero preference data in the target language.
Experiments & Results
The researchers tested this hypothesis on two distinct tasks:
- Summarization: Using the Seahorse dataset (6 languages: German, English, Spanish, Russian, Turkish, Vietnamese).
- Open-Ended Dialog: Using the OpenAssistant dataset (English, Spanish, Russian).
They evaluated the results using three judges:
- Target-Language RMs: ideally, the closest thing to “ground truth” for that language.
- LLM Judges: GPT-4 and PaLM-2-L, acting as scalable automated proxies for human judgment.
- Humans: native speakers rating the outputs.
Does it actually work?
The results are remarkably positive. When aligning a model using a Reward Model from a completely different language, the resulting model is consistently better than the unaligned SFT baseline.
Take a look at the win rates in Figure 4, judged by PaLM-2-L.

In this chart, “de \(\rightarrow\) en” means using a German RM to align an English model. The dashed line at 50% represents a tie with the unaligned model.
- Consistency: Almost every bar is above 50%. Whether you use Vietnamese to align English or English to align Russian, the model improves.
- Magnitude: In many cases, the cross-lingual model (dark blue) performs comparably to the monolingual model (light gray).
This confirms that the “signal” for what makes a good response is preserved across language barriers in the Reward Model.
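For readers unfamiliar with the metric: a “win rate” here is just the fraction of head-to-head comparisons in which the judge prefers the aligned model's output over the unaligned SFT output. A toy illustration (the tie handling is an assumption, not necessarily the paper's protocol):

```python
def win_rate(verdicts: list[str]) -> float:
    """Fraction of comparisons won against the SFT baseline; ties count as half a win."""
    wins = sum(v == "aligned" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)

print(win_rate(["aligned", "aligned", "tie", "sft", "aligned"]))  # 0.7
```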
The Human Verdict
You might be skeptical of using AI to grade AI. However, the researchers validated these findings with human annotators.

Figure 2 shows that humans prefer the cross-lingually aligned models over the baseline SFT models up to 70% of the time. This is strong evidence that the method isn’t just gaming an automated metric—it’s actually producing better text for native speakers.
The Surprise: When Foreign Judges Are Better
Here is the most startling finding of the paper. Look closely at Figure 3, a scatter plot comparing how much the target-language evaluation score improves when you align with a same-language RM versus a different-language RM.

Specifically, look at the Summarization (a) chart. In several instances, the Different-language RM (dark blue dots) results in a better outcome than the Same-language RM (light gray dots).
Why would an English judge be better at grading Spanish than a Spanish judge?
The researchers hypothesize this is due to Regularization of Spurious Artifacts.
When you train a Reward Model on a specific language, it might overfit to surface-level features (artifacts) of that language. For example, it might learn that “longer sentences” or “specific Spanish connecting words” are associated with higher rewards, regardless of the actual content quality.
If you then use this RM to align a Spanish model, the model essentially “hacks” these specific Spanish artifacts.
However, an English RM doesn’t know about the surface-level quirks of Spanish grammar or vocabulary. It can only judge the response based on the underlying semantic meaning (the “quality” embedding). It forces the model to improve the substance of the answer rather than the style.
To test this, the authors measured how closely each RM behaved like a simple “Bag-of-Words” (BoW) model. Monolingual RMs turned out to be more BoW-like (relying on surface keyword matching) than cross-lingual RMs, suggesting that cross-lingual transfer forces the reward signal to rely on “deeper” properties of the text.
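One simple way to operationalize “how BoW-like is this reward model” is to fit a linear bag-of-words regressor to the RM's scores and see how much variance it explains. This sketch uses scikit-learn and illustrates the idea; it is not the paper's exact probe:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def bow_fit_score(responses: list[str], rm_scores: list[float]) -> float:
    """Cross-validated R^2 of a word-count regressor predicting the RM's scores.

    Higher values mean the RM's judgments are largely recoverable from surface
    word statistics alone -- i.e., the RM is more "bag-of-words-like".
    """
    X = CountVectorizer().fit_transform(responses)
    return cross_val_score(Ridge(), X, rm_scores, cv=3, scoring="r2").mean()
```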
What If We Have NO Data? (Zero-Shot SFT)
So far, we’ve assumed that while we don’t have Preference Data (for the RM), we do have Instruction Data (for the SFT). But for many low-resource languages, we might not even have that.
Can we rely entirely on translation?
The researchers tested this by translating English SFT data into the target language using Google Translate, and then applying the cross-lingual RM transfer.

Figure 5 breaks this down:
- (a) SFT Quality: The unaligned SFT models trained on translated data (light orange) are generally worse than those trained on organic data (dark teal). This is expected—translation isn’t perfect.
- (b) Best-of-N: When using “Best-of-N” (simply picking the best sample), the performance drops significantly if the SFT data is translated.
- (d) RL (PPO): However, when using Reinforcement Learning (PPO), the gap narrows. The RL process allows the model to “fix” some of the issues introduced by the translated SFT data, guided by the strong signal from the English RM.
This suggests that while organic data is always gold, you can bootstrap a decent model for a new language using translated instructions and an English Reward Model.
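Putting the pieces together, the “no target-language data at all” recipe looks roughly like the sketch below. The `translate`, `sft_trainer`, and `align` helpers are hypothetical stand-ins (e.g., a wrapper around an MT system plus your preferred finetuning and Best-of-N/PPO tooling), not the paper's implementation:

```python
def bootstrap_new_language(english_sft_data, english_rm, translate, sft_trainer, align, target_lang):
    """Build an aligned model for `target_lang` with no target-language labels at all."""
    # 1. Machine-translate the English instruction-response pairs into the target language.
    translated_pairs = [(translate(x, target_lang), translate(y, target_lang))
                        for x, y in english_sft_data]
    # 2. Supervised finetuning on the translated pairs.
    sft_model = sft_trainer(translated_pairs)
    # 3. Align against the *English* reward model (Best-of-N reranking or PPO).
    return align(sft_model, english_rm)
```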
Practical Recommendations
If you are a practitioner trying to align a model for a new language (let’s say, Indonesian), which language should you use for your Reward Model?
The researchers analyzed the “transferability” of various languages.

Figure 6 acts as a cheat sheet. The columns represent the Target Language, and the rows represent the Source (RM) language.
- English (en) is consistently a top-tier donor. It transfers well to almost everything.
- High-resource languages make better donors: generally, languages with larger and higher-quality preference datasets (like English) make for better judges.
The takeaway? If you lack resources, just use the English RM. It is a robust, “universal donor” for alignment.
Analysis: Why Does This Happen?
The researchers dug deeper to ensure the RMs weren’t just measuring something trivial, like length (i.e., “longer is better”).
They evaluated the generalizability of the RMs by taking the source-language RMs and asking them to score preference validation sets in the target languages.

Figure 7 shows the accuracy of these RMs.
- The Cross-lingual RMs (dark teal) consistently perform better than random chance and often beat the Majority Baseline (dashed line).
- Crucially, they often perform comparably to the Length Baseline (dotted line), but the alignment results show they aren’t just learning length.
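For reference, the length baseline mentioned above can be computed in a few lines: predict that the longer response is the preferred one and measure accuracy on held-out preference pairs. Character length is used here for simplicity; the paper's exact definition may differ:

```python
def length_baseline_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Each pair is (chosen_response, rejected_response); predict that the longer one wins."""
    correct = sum(len(chosen) > len(rejected) for chosen, rejected in pairs)
    return correct / len(pairs)
```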
Furthermore, they checked if the success was simply due to linguistic similarity (e.g., German transferring well to English because they are both Germanic languages). Surprisingly, Linguistic Typology didn’t matter much. Vietnamese (an Austroasiatic language) helped English summarization more than some European languages did.
This suggests that the Base Model’s quality (how well it maps concepts internally) matters more than the specific linguistic distance between the source and target.
Conclusion
The paper “Reuse Your Rewards” offers a promising path forward for multilingual AI. It dismantles the barrier of entry for creating safe, aligned models in low-resource languages.
Key Takeaways:
- You don’t need target-language preference data. An English Reward Model can effectively align a Spanish, Turkish, or Russian model.
- Cross-lingual alignment acts as a regularizer. It can prevent the model from overfitting to language-specific quirks, sometimes leading to better results than native alignment.
- English is a safe bet. When in doubt, an English RM serves as an excellent universal proxy for quality.
By leveraging the multilingual capabilities of pre-trained models, we can “recycle” the massive effort put into English alignment to serve the rest of the world. This brings us one step closer to an AI ecosystem that is not just smart, but equitably safe and helpful for everyone.