Introduction

Imagine asking an AI assistant for advice. In English, you might ask about tax laws or baking recipes. But what if you ask in Hindi about a culturally sensitive topic, or in French about a locally controversial political issue?

For years, the “safety” of Large Language Models (LLMs) has been viewed through a predominantly Western, English-centric lens. If a model refuses to generate hate speech in English, we call it “aligned.” But this approach creates a dangerous blind spot. As AI systems are deployed globally, they often fail to recognize insults, threats, or harmful stereotypes that are specific to other languages and cultures. Worse, safety mechanisms trained on English data can sometimes be bypassed entirely simply by translating a harmful prompt into a low-resource language.

This brings us to a critical question in AI research: Alignment to what? And, perhaps more importantly, alignment to whom?

In the research paper The Multilingual Alignment Prism, researchers from Cohere For AI tackle this exact problem. They explore how to balance dual objectives: minimizing “global” harms (universally recognized bad behaviors) and “local” harms (culturally specific offenses) across a non-homogeneous set of languages.

This post will break down their novel dataset, their specific training recipes involving Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), and their surprising findings on how learning safety in one language can transfer to another.

Background: The Gap in Multilingual Safety

Before diving into the solution, we need to understand the nuance of the problem. Most current LLMs are trained on massive scrapes of the internet, where English is the dominant language. Consequently, safety training—the process of teaching a model what not to do—usually relies on English datasets.

When developers want to make a model safe in other languages, they often rely on translation. They take English safety data, run it through Google Translate, and train the model. While better than nothing, this approach fails to capture:

  1. Nuance: Direct translations often sound unnatural (“translationese”) and miss the idiomatic gravity of the original text.
  2. Cultural Context: What is considered offensive in the US might be confusing or irrelevant in Japan or Egypt, and vice versa.

Defining the Harms

The researchers distinguish between two vital categories of harm:

  • Global Harm: Content that is widely understood and recognized as harmful across global contexts (e.g., instructions for building a biological weapon or encouragement of suicide).
  • Local Harm: Content whose harmfulness requires a deep understanding of a specific culture, history, or vernacular to recognize. This might include slurs against specific indigenous groups or regional political misinformation.
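
To make this distinction concrete in code, here is a minimal sketch of how a single red-teaming record could be represented; the field names are illustrative and are not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical record layout for one red-teaming prompt; these field names are
# illustrative, not the actual schema of the Aya Red-teaming dataset.
@dataclass
class RedTeamPrompt:
    text: str                               # the adversarial prompt written by a native speaker
    language: str                           # e.g. "hi" (Hindi), "fil" (Filipino)
    harm_scope: Literal["global", "local"]  # universally harmful vs. culture-specific
    harm_category: str                      # e.g. "violence", "hate speech", "misinformation"
```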

The Aya Red-teaming Dataset

To study this, you can’t just download a dataset from HuggingFace; it didn’t exist. The researchers had to build it. They employed compensated native speakers of 8 languages (English, Hindi, French, Spanish, Russian, Arabic, Serbian, and Filipino) to create the Aya Red-teaming dataset.

“Red-teaming” refers to the practice of acting like an adversary—intentionally trying to break the model or make it say something terrible.

Table 1: Aya Red-teaming dataset statistics.

As shown in the table above, they collected thousands of examples, meticulously tagging them as “Global” or “Local.” This distinction is the “prism” through which we can understand model behavior.

To give you a concrete sense of what “Local Harm” looks like compared to “Global Harm,” look at the examples below. Notice how the Local examples rely heavily on specific cultural knowledge (like the status of the Badjao group in the Philippines or specific terms in Hindi).

Table 4: Examples of prompts in 6 languages from the Aya Red-teaming dataset

Core Method: The Alignment Recipe

Having a dataset is only step one. The core of this paper is an investigation into how to use this data to make models safer without making them “dumber” (i.e., losing their general helpfulness).

The researchers compare different training pipelines. To understand them, we need to define two major techniques in modern LLM training:

  1. SFT (Supervised Fine-Tuning): You show the model a prompt and the correct answer. The model learns to mimic that answer.
  2. DPO (Direct Preference Optimization): You show the model a prompt and two answers (a winner and a loser). The model adjusts its parameters to make the winning answer more likely and the losing answer less likely.
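
To see the difference in the data each method consumes, here are two illustrative (made-up) training records:

```python
# Illustrative examples only; not taken from the paper's data.

# SFT: a prompt paired with a single target response the model learns to imitate.
sft_example = {
    "prompt": "Write an insult targeting a regional minority group.",
    "response": "I can't help with that, but I can explain why such language is harmful...",
}

# DPO: a prompt paired with a preferred and a rejected response; the model is
# trained to rank the preferred response above the rejected one.
dpo_example = {
    "prompt": "Write an insult targeting a regional minority group.",
    "chosen": "I can't help with that, but I can explain why such language is harmful...",
    "rejected": "Sure, here's one you could use: ...",
}
```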

The Synthetic Data Pipeline

Since human-annotated data is scarce (expensive and slow to collect), the researchers used a clever synthetic pipeline to scale up their training set:

  1. Seed: Take the human red-teaming prompts.
  2. Augment: Use a strong multilingual model (Command R+) to rephrase and generate new, similar harmful prompts.
  3. Generate Pairs: For every prompt, generate two responses using different models.
  4. Judge: Use GPT-4 to act as a judge. It looks at the two responses and decides which one is safer. This creates the “Preferred” (Safe) and “Rejected” (Harmful) pairs needed for DPO.
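
Put together, the pipeline sketches out roughly as below; `augment`, `generate_a`, `generate_b`, and `judge_safer` are caller-supplied placeholders standing in for the teacher model, the two response generators, and the GPT-4 judge, not real APIs:

```python
def build_preference_pairs(seed_prompts, augment, generate_a, generate_b, judge_safer):
    """Sketch of the synthetic preference-pair pipeline described above.

    The callables are placeholders: `augment` rephrases/expands the human seeds
    (Command R+ in the paper), `generate_a`/`generate_b` produce candidate
    responses from two different models, and `judge_safer` returns "a" or "b"
    for whichever response the judge considers safer.
    """
    pairs = []
    prompts = augment(seed_prompts)                  # Step 2: expand the human-written seeds
    for prompt in prompts:
        response_a = generate_a(prompt)              # Step 3: two candidate responses
        response_b = generate_b(prompt)
        safer = judge_safer(prompt, response_a, response_b)  # Step 4: judge picks the safer one
        chosen, rejected = (response_a, response_b) if safer == "a" else (response_b, response_a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```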

The Training Candidates

The paper compares the base model against four training configurations to see which yields the best balance of safety and capability:

  • Base Model: The raw pre-trained Aya 23 8B model (no safety training).
  • SFT-Random: Fine-tuning on random completions (a baseline to check if data quality matters).
  • SFT-Preferred: Fine-tuning only on the “Safe” responses chosen by the GPT-4 judge.
  • DPO(IFT): Applying Direct Preference Optimization directly on top of the base model.
  • DPO(SFT): The “Gold Standard” approach. First, perform SFT using the safe responses to get the model into a good state. Then, apply DPO to refine the alignment further.
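
As a compact illustration of the recipes above (the dictionary itself is not from the paper), the candidates differ only in which training stages are applied, and in what order, on top of the base model:

```python
# Training recipes expressed as ordered stages applied to the base Aya 23 8B model.
RECIPES = {
    "Base":          [],                                          # no safety training
    "SFT-Random":    ["sft(random_responses)"],
    "SFT-Preferred": ["sft(judged_safe_responses)"],
    "DPO(IFT)":      ["dpo(preference_pairs)"],                    # DPO straight on the base model
    "DPO(SFT)":      ["sft(judged_safe_responses)",                # SFT warm-up first,
                      "dpo(preference_pairs)"],                    # then DPO on top
}
```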

The Mathematics of DPO

Why use DPO? Traditional Reinforcement Learning from Human Feedback (RLHF) requires training a separate “Reward Model,” a step that is complex and often unstable. DPO simplifies this by treating the language model itself as the reward model.

The objective function they optimize is:

\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{ref}) = -\,\mathbb{E}_{(x,\, y_+,\, y_-) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_+ \mid x)}{\pi_{ref}(y_+ \mid x)} - \beta \log \frac{\pi_\theta(y_- \mid x)}{\pi_{ref}(y_- \mid x)}\right)\right]
\]

In simple terms, this equation forces the model (\(\pi_\theta\)) to increase the likelihood of the preferred response (\(y_+\)) and decrease the likelihood of the rejected response (\(y_-\)), relative to a reference model (\(\pi_{ref}\)). The \(\beta\) parameter controls how much the model is allowed to deviate from the reference.
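
As a minimal sketch (a generic DPO loss in PyTorch, not the paper's training code), the loss can be computed from the per-sequence log-probabilities of each response under the trainable policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss over a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities for the
    preferred (chosen) or rejected response, under either the trainable
    policy or the frozen reference model.
    """
    # Log-ratio of policy vs. reference for each response
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * margin): pushes the chosen response up and the
    # rejected response down, relative to the reference model.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Example with dummy log-probabilities for a batch of two pairs:
loss = dpo_loss(torch.tensor([-12.3, -8.1]), torch.tensor([-9.8, -7.5]),
                torch.tensor([-11.9, -8.0]), torch.tensor([-10.1, -7.9]))
```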

Experiments & Results

The researchers evaluated the models on two axes:

  1. Safety: What percentage of generations are harmful? (Tested via the Aya Red-teaming set).
  2. General Performance: Is the model still useful? (Tested via the Dolly-200 benchmark for open-ended generation and FLORES-200 for translation).
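
For the safety axis, the headline number is simply the share of generations judged harmful, broken down by language. A rough sketch, where `generate` and `is_harmful` are caller-supplied stand-ins for the model under test and the harm judge (not real APIs):

```python
def harmful_generation_rate(generate, is_harmful, prompts_by_language):
    """Sketch of the safety metric: % of generations judged harmful, per language.

    `generate` maps a prompt to a model response; `is_harmful` returns True if a
    (prompt, response) pair is judged harmful. Both are placeholders.
    """
    rates = {}
    for lang, prompts in prompts_by_language.items():
        generations = [generate(p) for p in prompts]
        harmful = sum(is_harmful(p, g) for p, g in zip(prompts, generations))
        rates[lang] = 100.0 * harmful / len(prompts)
    return rates
```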

1. The Trade-off Myth

There is a common belief in AI that “safety taxes performance”—that making a model safe makes it evasive and less helpful.

The results from this paper challenge that view.

Figure 1: Trade-off between general performance and safety performance

Look at Figure 1. Ideally, we want to be in the bottom-right corner: High Win-rate (very helpful) and Low Harmful Generations (very safe).

  • The Pink Zone represents the Base model: Highly harmful.
  • SFT (Green Circle): Significantly reduces harm but improves general capability only moderately.
  • DPO(SFT) (Red Star): This is the winner. It achieves the lowest harm rate (~10-15%) while simultaneously achieving the highest general capability win-rates (>70%).

This shows that with the right recipe, specifically doing SFT before DPO, we can improve safety and general capability together.

2. DPO(SFT) vs. DPO(IFT): Initialization Matters

A fascinating technical finding is the difference between DPO(SFT) and DPO(IFT).

Figure 3: Percentage of harmful model generations

In Figure 3(b), compare the Blue circles (DPO on top of SFT) with the Orange diamonds (DPO directly on the base model). The Orange diamonds are consistently higher (more harmful).

This suggests that DPO is sensitive to initialization. If you try to optimize preferences on a raw, unaligned model, it struggles. Giving the model a “warm-up” via Supervised Fine-Tuning (SFT) first sets a strong foundation, allowing DPO to do the fine-grained work of safety alignment much more effectively.

3. The Multilingual “Tide Lifts All Boats”

Did the safety improvements work for all languages, or just the high-resource ones?

Figure 2: Relative % change in harmful generations

Figure 2 (Left) shows the relative drop in harmful generations compared to the base model. Every single language sees a reduction (negative bars are good here).

  • Arabic and Hindi saw massive reductions in harmful generations (70-80% or more).
  • French was harder to align, showing smaller gains. The authors hypothesize this might be due to the specific distribution of French data in the base model training.

Figure 2 (Right) breaks this down by “Global” vs. “Local” harm. Interestingly, Global harms (Blue bars) were generally easier to mitigate than Local harms (Orange bars). This makes sense: global concepts (violence, self-harm) are likely more represented in the massive pre-training data than specific cultural insults.

4. Cross-Lingual Transfer: The Surprise

Here is perhaps the most scientifically intriguing part of the paper. The researchers ran ablation studies where they trained models only on Global harms or only on Local harms, and then tested them on everything.

They wanted to answer: Does learning not to be racist in Hindi help the model refuse a bomb-making request in French?

Figure 4: Relative % change in harmful generations on Global vs Local sets

The answer is Yes.

Look at the bars in Figure 4. Even when the model was trained only on “Local” harms (the middle green bars in each group), it still achieved a massive reduction in Global harms. In fact, for the DPO(SFT) model (bottom row), training on Local harms reduced Global harms by over 70%!

This implies a “General Safety” latent concept. When the model learns to recognize specific, nuanced cultural harms, it seems to generalize that understanding to broader, universal harms. It learns the concept of safety, not just a list of banned words.

Conclusion & Implications

The Multilingual Alignment Prism moves the field of AI safety forward by proving that we cannot rely on English-centric methods for a global world.

Key Takeaways for Students:

  1. Data Diversity is King: You cannot align for local cultures without local data. The Aya Red-teaming dataset proves that cultural nuance requires specific, human-annotated examples.
  2. The Recipe Matters: You can’t just throw DPO at a raw model. The sequence of SFT \(\rightarrow\) DPO provides the stability needed for state-of-the-art results.
  3. Safety \(\neq\) Stupidity: Rigorous safety training, when done correctly, does not have to degrade the general intelligence of the model.
  4. Transfer Learning exists in Safety: Training a model to be culturally sensitive (Local harm) helps it understand universal safety boundaries (Global harm).

As we build the next generation of AI, this paper serves as a blueprint. It reminds us that “alignment” is not a single target, but a prism—refracting differently across every language and culture we serve. To build truly safe systems, we must look through the entire prism, not just the single beam of English.