Large Language Models (LLMs) like LLaMA and GPT-4 have transformed how we interact with technology. While these models are technically multilingual, there is a catch: they are predominantly trained on English text. They often treat other languages as second-class citizens, picking them up spontaneously rather than systematically.

This results in a phenomenon known as weak cross-lingual alignment. An LLM might know a fact in English (e.g., “The piano was invented in Italy”) but fail to recall that same fact when queried in Chinese or Russian. The knowledge is “stuck” in the English part of the model’s brain.

In this post, we’ll dive into a paper titled “PREALIGN: Boosting Cross-Lingual Transfer by Early Establishment of Multilingual Alignment”. The researchers propose a novel framework that flips the traditional training script. Instead of hoping the model aligns languages during training, they force the model to learn a multilingual dictionary before the heavy lifting of pretraining begins.

The Problem: Spontaneous Alignment is Too Slow

Most multilingual models today are trained using Joint Training. You feed the model a massive mix of English, German, Chinese, and Arabic text, and optimize it to predict the next word. Over time, the model notices patterns—that “cat” and “gato” often appear in similar contexts.

However, research shows this “spontaneous alignment” is inefficient.

  1. It’s slow: The model requires billions of tokens to figure out how words map across languages.
  2. It’s shallow: While the model might learn grammar well, it struggles to share factual knowledge across languages.
  3. It’s forgetful: Attempts to fix this after pretraining (post-hoc alignment) often degrade the model’s general performance.

The authors of PREALIGN ask a simple question: What if we teach the model that “cat” equals “gato” before it starts reading millions of books?

The Solution: The PREALIGN Framework

PREALIGN is a two-stage framework designed to inject multilingual alignment early and maintain it throughout the training process.

Figure 1: The illustration of PREALIGN. Words in blue, red and green represent translations of piano, guitar and violin, respectively.

As shown in Figure 1 above, the process consists of two distinct phases:

  1. Stage 1: Alignment Injection (Before Pretraining): Initialize the model to produce similar vector representations for translated word pairs.
  2. Stage 2: Input-Only Codeswitching (During Pretraining): Use a special data augmentation technique to ensure the model doesn’t forget the alignment while learning language structure.

Let’s break these down.

Stage 1: Injecting Alignment via Contrastive Learning

Before the model sees the massive corpus of text (like the CulturaX dataset), the researchers create a Multilingual Alignment Table. They take the vocabulary of the source language (English) and use a high-quality translator (like GPT-4) to find corresponding words in target languages (e.g., German Klavier, Chinese 钢琴).
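As a minimal illustration of what such a table might look like (the exact entries are illustrative, echoing the piano/guitar/violin examples from Figure 1, not copied from the paper's actual table):

```python
# A sketch of a multilingual alignment table: each English word maps to its
# translations in the target languages (entries here are illustrative).
alignment_table = {
    "piano":  ["Klavier", "钢琴"],   # German, Chinese
    "guitar": ["Gitarre", "吉他"],
    "violin": ["Geige", "小提琴"],
}
```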

Once they have these pairs, they use Contrastive Learning. The goal is simple: force the model’s internal representation (embedding) of “Piano” to be mathematically close to “Klavier” and “钢琴”.

To do this, they extract the representation of a word \(w\) at layer \(l\):

Equation 1
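A plausible way to write this (a sketch assuming mean pooling over the word’s sub-word tokens \(T(w)\); the paper’s exact aggregation may differ):

\[
h^{(l)}_{w} = \frac{1}{|T(w)|} \sum_{t \in T(w)} h^{(l)}_{t}
\]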

Here, the model aggregates sub-word tokens into a single word representation. Then, they apply a contrastive loss function. This function pulls aligned words together in the vector space while pushing unrelated words apart:

Equation 2
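The exact formulation isn’t reproduced here, but a standard contrastive (InfoNCE-style) loss for a translation pair \((w, w')\) at layer \(l\), with temperature \(\tau\) and a set of unrelated negative words \(\mathcal{N}\), would look roughly like:

\[
\mathcal{L}^{(l)}_{\text{align}} = -\log \frac{\exp\big(\cos(h^{(l)}_{w}, h^{(l)}_{w'}) / \tau\big)}{\sum_{v \in \{w'\} \cup \mathcal{N}} \exp\big(\cos(h^{(l)}_{w}, h^{(l)}_{v}) / \tau\big)}
\]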

This loss is calculated across all layers of the model (Equation 3) to ensure deep alignment, not just at the surface level.

Equation 3
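Keeping the same sketch notation (whether the paper sums or averages over the \(L\) layers is not shown here), the multi-layer alignment loss is:

\[
\mathcal{L}_{\text{align}} = \frac{1}{L} \sum_{l=1}^{L} \mathcal{L}^{(l)}_{\text{align}}
\]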

Finally, to prevent the model from learning a representation that is only good for alignment but terrible for language generation, they add a small Language Modeling (LM) objective during this phase:

Equation 4
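A plausible form of the combined Stage-1 objective, with a weighting factor \(\lambda\) that is an assumption rather than the paper’s notation:

\[
\mathcal{L}_{\text{Stage 1}} = \mathcal{L}_{\text{align}} + \lambda \, \mathcal{L}_{\text{LM}}
\]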

Stage 2: Input-Only Codeswitching

Once the model is initialized with these aligned embeddings, the actual pretraining begins. However, if we simply switch to standard training, the model might catastrophically forget the alignment it just learned.

To prevent this, the authors use Codeswitching—mixing languages within a sentence. But they introduce a twist.

Standard codeswitching changes both the input and the prediction targets. If the switched sentence is “He plays the Klavier”, the model is also trained to generate “Klavier”. This confuses the model, leading to “mixed-script” output where it starts randomly switching languages in its responses.

The authors propose Input-Only Codeswitching.

Figure 2: Comparison between vanilla codeswitching and the proposed input-only codeswitching.

As illustrated in Figure 2, they replace a word in the input (e.g., changing “Piano” to “Klavier”), but require the model to predict the original language word in the output context.

Mathematically, instead of the vanilla objective (Eq 5) where the model predicts the switched token:

Equation 5
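As a sketch (notation mine, not the paper’s): with \(x'\) denoting the code-switched sequence, the vanilla loss is the usual next-token objective over every position, switched tokens included:

\[
\mathcal{L}_{\text{vanilla}} = -\sum_{t=1}^{T} \log P_\theta\big(x'_t \mid x'_{<t}\big)
\]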

They use the Input-Only objective (Eq 6), effectively skipping the prediction of the foreign token and focusing on the context:

Equation 6
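In the same sketch notation, the input-only objective keeps the switched tokens in the conditioning context \(x'_{<t}\), but drops the positions \(\mathcal{S}\) where a switched token would be the prediction target:

\[
\mathcal{L}_{\text{input-only}} = -\sum_{t \notin \mathcal{S}} \log P_\theta\big(x'_t \mid x'_{<t}\big)
\]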

This forces the model to understand that Klavier functions exactly like Piano in the sentence structure, reinforcing the alignment without corrupting the model’s generation capabilities.
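To make this concrete, here is a minimal sketch of how input-only codeswitching could be implemented at the data-preparation level. It is not the authors’ code: the function name, the word-level granularity, and the use of None as a “skip this position in the loss” marker (typically mapped to an ignore index such as -100 after tokenization) are all assumptions.

```python
import random

def input_only_codeswitch(words, alignment_table, switch_prob=0.15, rng=random):
    """Sketch of input-only codeswitching for one sentence.

    words: list of source-language words.
    alignment_table: dict mapping a source word to a list of translations.
    Returns (input_words, labels); labels[i] is None for switched positions,
    meaning the LM loss is skipped there, and the original word otherwise.
    """
    input_words, labels = [], []
    for word in words:
        translations = alignment_table.get(word.lower())
        if translations and rng.random() < switch_prob:
            input_words.append(rng.choice(translations))  # foreign word appears in the input only
            labels.append(None)                           # the model is never asked to generate it
        else:
            input_words.append(word)
            labels.append(word)
    return input_words, labels

# Example: "He plays the piano" may become "He plays the Klavier" on the input
# side, with the switched position excluded from the training labels.
inp, lab = input_only_codeswitch("He plays the piano".split(),
                                 {"piano": ["Klavier", "钢琴"]},
                                 switch_prob=1.0)
print(inp)  # e.g. ['He', 'plays', 'the', 'Klavier']
print(lab)  # ['He', 'plays', 'the', None]
```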

Experimental Setup: The “English-Clone”

To rigorously test this, the authors devised a clever experiment using a synthetic language called English-Clone.

Figure 3: Illustration of the creation of English-Clone.

As seen in Figure 3, English-Clone is identical to English in grammar and vocabulary distribution, but its tokens are mapped to different IDs (e.g., “weather” becomes “weather*”). Since there is zero vocabulary overlap, any transfer of knowledge from English to English-Clone must come from the training method, not from shared surface forms.
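As a rough illustration (a sketch that assumes cloning is implemented as a plain token-ID offset, which may not match the authors’ exact procedure), English-Clone data can be produced by shifting every token ID into a fresh, non-overlapping range:

```python
def to_english_clone(token_ids, vocab_size):
    """Map an English token sequence into a disjoint 'English-Clone' ID range.

    Every original ID i becomes i + vocab_size, so the clone keeps English's
    grammar and word distribution while sharing zero vocabulary with it.
    """
    return [tid + vocab_size for tid in token_ids]

# "weather" (say, ID 1234) becomes the clone token "weather*" (ID 1234 + vocab_size).
print(to_english_clone([1234, 87, 4521], vocab_size=32_000))  # [33234, 32087, 36521]
```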

They evaluated the model on three metrics:

  1. Language Modeling (LM): Perplexity (lower is better).
  2. Zero-Shot Cross-Lingual Transfer (ZS-CLT): Training a classifier on English data and testing it on the target language.
  3. Cross-Lingual Knowledge Application (CLKA): The holy grail. Can the model learn a fact in English and answer questions about it in the target language?

The Results

1. Synthetic Setting (English to English-Clone)

The results in the synthetic setting were striking.

Table 1: Performance of PREALIGN vs other methods

Looking at Table 1:

  • Joint Training (the standard method) achieved a 26.5% accuracy on Cross-Lingual Knowledge Application (CLKA). This is barely better than random guessing, confirming that standard training struggles to transfer knowledge.
  • PREALIGN skyrocketed that accuracy to 90.3%.
  • PREALIGN also achieved better perplexity (16.5 vs 21.6) and better zero-shot transfer (79.3 vs 74.9).

The authors also compared PREALIGN to other alignment strategies, such as doing it “on-the-fly” (during training) or “post-hoc” (after training).

Table 2: Comparison of alignment stages

Table 2 confirms that doing the alignment first (PREALIGN) is superior to trying to patch it in later.

2. Why does it work?

The authors analyzed the learning curves to see when the knowledge transfer happens.

Figure 4: Knowledge application accuracy over training steps

Figure 4 is fascinating. Look at the bottom two graphs (Cross-Lingual Knowledge).

  • The Blue Line (Joint Training) stays flat at the bottom. The model learns facts in English but never figures out how to apply them to English-Clone.
  • The Red Line (PREALIGN) shoots up. Because the words are aligned from step 0, as soon as the model learns a fact about “Piano,” it immediately applies it to “Piano*”.

This is further supported by analyzing the cosine similarity of embeddings throughout training.

Figure 5: The evolution of word embeddings cosine similarity

In Figure 5, the Green Line (PREALIGN) starts with high similarity (due to initialization) and maintains it thanks to the codeswitching. The Blue Line (Joint Training) starts near zero and slowly crawls up, but never reaches the high alignment of PREALIGN.

3. Real-World Performance

Synthetic languages are fun, but does this work for real languages? The researchers tested PREALIGN on Chinese (Zh), German (De), Arabic (Ar), and Russian (Ru).

Table 6: Real-world performance results

Table 6 (the paper’s detailed results table) shows consistent wins:

  • Perplexity (LM): PREALIGN reduces perplexity across all languages compared to Joint Training.
  • Knowledge (CLKA): Significant gains. For example, in Chinese (Zh), knowledge application jumped from 37.8% to 63.8% (in the 400M model).
  • Scale: The benefits persist even as the model size increases from 150M to 1.3B parameters.

4. Generalization to Unseen Words

You might wonder: “Do I need a dictionary for every single word in the language?”

The authors tested this by aligning only the most frequent 25%, 50%, or 75% of words.

Figure 6: Language modeling perplexity on Seen and Unseen words

Figure 6 shows that even for Unseen words (words not in the initial dictionary), PREALIGN (Orange bars) achieves much lower perplexity than Joint Training (Blue bars). The model learns the pattern of alignment and generalizes it to new vocabulary.

Conclusion

PREALIGN demonstrates a crucial insight for Multilingual LLMs: Order matters.

By establishing a “bilingual dictionary” in the model’s neural pathways before forcing it to learn complex grammar and facts, we create a bridge. This bridge allows knowledge acquired in English (which constitutes the vast majority of training data) to flow seamlessly into other languages.

Key takeaways for students and practitioners:

  1. Early Alignment: Initialization is not just random noise; it’s an opportunity to inject structure.
  2. Input-Only Codeswitching: A simple data augmentation trick can prevent “forgetting” without confusing the model’s output generation.
  3. Knowledge Transfer: Solving the language barrier isn’t just about translation; it’s about sharing facts and reasoning capabilities across linguistic boundaries.

As we strive for truly universal LLMs, techniques like PREALIGN offer a promising path away from English-centric biases and toward models that think in concepts, not just keywords.