Lost in Translation? Why Machine Translation Might Be the Secret Weapon of Multilingual AI
If you have been following the explosion of Natural Language Processing (NLP) over the last few years, you are likely familiar with the heavy hitters: BERT, GPT, and T5. These models have revolutionized how machines understand human language. Recently, the focus has shifted toward multilingual models—systems capable of understanding and generating text in dozens, sometimes hundreds, of languages simultaneously.
However, there is a scientific “crisis of comparison” in this field. When a new multilingual model is released, it often differs from previous models in every conceivable way: a different architecture, a different training dataset, a different number of parameters, and a different training objective. When Model B outperforms Model A, is it because the architecture is better? Or was it just trained on more data?
In the paper “A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives,” researchers from the University of Helsinki and Université Bretagne Sud set out to solve this “apples-to-oranges” comparison problem. They created a strictly controlled environment to answer a fundamental question: Does teaching a model to translate between languages create a better “understanding” of language than standard monolingual modeling?
In this post, we will break down their methodology, explore the specific architectures they compared, and analyze why “Machine Translation” might be an underappreciated powerhouse for pretraining foundation models.
The Problem: Apples, Oranges, and Pears
To understand why this paper is important, we first need to look at how modern Language Models (LMs) are usually pretrained. Generally, they fall into two camps regarding their objective (what the model tries to achieve during training):
- Monolingual Objectives: The model looks at text in Language A and tries to predict missing words or the next word in Language A. It might do this for many languages, but it rarely sees explicit links between languages during training.
- Cross-lingual Objectives: The model is explicitly shown how Language A relates to Language B (e.g., via translation pairs).
The hypothesis is that translation (an explicit cross-lingual signal) should force the model to learn deeper semantic representations. If a model knows that “chat” in French and “cat” in English refer to the same concept, it has learned something profound about the world, not just about French or English grammar.
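One way to make this idea concrete is to check whether a model places translation pairs close together in its embedding space. The sketch below is a minimal, hypothetical probe, not something from the paper: it assumes you have some `encode` function that maps a word or sentence to a vector (for instance, a frozen multilingual model’s hidden states).

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def translation_is_closer(encode, word: str, translation: str, distractor: str) -> bool:
    """True if `word` lies closer to its translation than to an unrelated distractor.

    `encode` is any function mapping a string to a vector, e.g. a frozen
    multilingual model's hidden state (a hypothetical stand-in, not the paper's code).
    """
    return (cosine_similarity(encode(word), encode(translation))
            > cosine_similarity(encode(word), encode(distractor)))

# With a well-aligned multilingual encoder, we would hope for:
# translation_is_closer(encode, "chat", "cat", "bicycle")  -> True
```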
However, proving this hypothesis is difficult because existing models like mBERT (Masked Language Model) and mT5 (Encoder-Decoder) are too different to compare fairly. This paper removes those variables.
The Controlled Environment
The researchers focused on two primary variables: Model Architecture and Pretraining Objectives. To ensure a fair fight, they controlled for everything else:
- Same Data: All models were trained on the exact same dataset, a combination of the UNPC (United Nations Parallel Corpus) and OpenSubtitles.
- Same Size: All models utilized 12 layers, 512 hidden dimensions, and 8 attention heads.
- Same Tokenization: A shared vocabulary of 100k BPE (Byte Pair Encoding) pieces.
- Same Compute: All models were trained for 600k steps.
The languages selected for this study were Arabic (AR), Chinese (ZH), English (EN), French (FR), Russian (RU), and Spanish (ES).
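To make the controlled setup concrete, here is a minimal sketch, assuming nothing beyond what is listed above, of how the shared hyperparameters could be pinned down in one place so that the only variable left between runs is the pretraining objective. This is an illustration, not the paper’s actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SharedConfig:
    """Hyperparameters held constant across all five pretraining objectives."""
    num_layers: int = 12
    hidden_size: int = 512
    num_attention_heads: int = 8
    vocab_size: int = 100_000        # shared BPE vocabulary
    train_steps: int = 600_000
    languages: tuple = ("ar", "zh", "en", "fr", "ru", "es")

# The only thing that changes between experiments is the objective itself:
OBJECTIVES = ("2-LM", "2-MT", "MLM", "CLM", "TLM")
config = SharedConfig()
```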
The Data Constraint
It is worth noting the rigor of their data selection. To compare translation objectives against monolingual objectives, you need bitexts (sentences aligned in two languages).

As shown in Table 4, the researchers used a massive amount of sentence pairs—over 196 million total lines. Crucially, they ensured that every document used for the translation models was also available for the monolingual models. If a document existed in three languages, they greedily assigned it to the least represented language pair to balance the data, ensuring no model had an unfair data advantage.
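The greedy balancing step can be pictured as a single pass over the corpus. The helper below is my own reconstruction of the idea described above (function and variable names are hypothetical): each document available in several languages is assigned to whichever of its candidate language pairs currently holds the least data.

```python
from collections import defaultdict
from itertools import combinations

def assign_documents(documents):
    """Greedily assign each multilingual document to its least-represented language pair.

    `documents` is a list of dicts mapping language code -> list of sentences,
    e.g. {"en": [...], "fr": [...], "es": [...]} for a document in three languages.
    """
    pair_counts = defaultdict(int)   # (lang_a, lang_b) -> sentence pairs assigned so far
    assignments = []

    for doc in documents:
        langs = sorted(doc.keys())
        if len(langs) < 2:
            continue                 # no pair to assign for single-language documents
        candidate_pairs = list(combinations(langs, 2))
        # Pick whichever pair currently has the least data, keeping pairs balanced.
        chosen = min(candidate_pairs, key=lambda pair: pair_counts[pair])
        pair_counts[chosen] += min(len(doc[chosen[0]]), len(doc[chosen[1]]))
        assignments.append((doc, chosen))
    return assignments, dict(pair_counts)
```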
The Contestants: Five Models, Two Categories
The researchers categorized their models into Double-Stack (Encoder-Decoder architectures like BART or T5) and Single-Stack (Encoder-only like BERT or Decoder-only like GPT).
Let’s look at the five specific objectives they compared.
1. Double-Stack Models (Encoder-Decoder)
These models consist of two main blocks: an encoder that processes the input and a decoder that generates the output.
- 2-LM (BART Denoising): This uses the standard BART objective. The input text is corrupted (words are masked or shuffled), and the model must reconstruct the original clean text. It operates within a single language at a time.
- 2-MT (Machine Translation): The model is given a sentence in a source language (e.g., French) and must generate the translation in a target language (e.g., English). This provides a strong, explicit cross-lingual signal.
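To see what these two objectives optimize in practice, here is a simplified sketch of how (encoder input, decoder target) pairs could be built for each. The masking here is crude, token-level corruption rather than real BART-style noising (which uses span infilling and sentence permutation), so treat it as an illustration only.

```python
import random

MASK = "[MASK]"

def make_2lm_example(sentence: str, mask_prob: float = 0.3):
    """Denoising (2-LM): the encoder sees a corrupted sentence; the decoder regenerates the original."""
    tokens = sentence.split()
    corrupted = [MASK if random.random() < mask_prob else tok for tok in tokens]
    return " ".join(corrupted), sentence      # (encoder input, decoder target)

def make_2mt_example(source: str, translation: str):
    """Translation (2-MT): the encoder sees the source language; the decoder produces the target language."""
    return source, translation                # (encoder input, decoder target)

print(make_2lm_example("le chat dort sur le canapé"))
print(make_2mt_example("le chat dort sur le canapé", "the cat is sleeping on the sofa"))
```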
2. Single-Stack Models
These models consist of just one transformer stack.
- MLM (Masked Language Modeling): An Encoder-only model (like BERT). Some tokens in the input are hidden with a [MASK] token, and the model must guess what they are based on the context.
- CLM (Causal Language Modeling): A Decoder-only model (like GPT). The model reads a sequence and predicts the next word. It can only “see” words to the left of the current position.
- TLM (Translation Language Model): This is a variation of the CLM. The model is fed a sentence and its translation concatenated together. It then performs causal language modeling (predicting the next word) across this long, dual-language sequence.
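The three single-stack objectives differ mainly in how the training sequence and its prediction targets are constructed. The sketch below is a simplified, token-level illustration of the formats just described (the mask and separator symbols are placeholders, not the paper’s exact tokens).

```python
import random

MASK, SEP = "[MASK]", "[SEP]"

def make_mlm_example(sentence: str, mask_prob: float = 0.15):
    """MLM: hide some tokens; the encoder predicts only at the hidden positions."""
    inputs, labels = [], []
    for tok in sentence.split():
        if random.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)       # loss only where a token was masked
        else:
            inputs.append(tok)
            labels.append(None)      # no loss on visible tokens
    return inputs, labels

def make_clm_example(sentence: str):
    """CLM: each position predicts the next token, seeing only the tokens to its left."""
    tokens = sentence.split()
    return tokens[:-1], tokens[1:]   # (inputs, next-token targets)

def make_tlm_example(source: str, translation: str):
    """TLM (as described here): concatenate source and translation, then run causal LM over the whole sequence."""
    tokens = source.split() + [SEP] + translation.split()
    return tokens[:-1], tokens[1:]
```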
Visualizing the Objectives
The difference between these objectives can be abstract, so let’s look at exactly what the models see during training.

Table 3 provides a concrete example of how the same underlying data is formatted for each objective:
- 2-LM: Notice how the input has [MASK] tokens. The target output is the clean, original French sentence.
- 2-MT: The input is French; the target output is English. The model must understand the meaning to succeed.
- MLM: Similar to 2-LM, but this is an encoder-only architecture. It predicts the masked words directly rather than generating a whole new sequence.
- CLM: The model reads a continuous stream of text and predicts the next token.
- TLM: The model reads the French sentence, then a separator token, then the English sentence, predicting tokens one by one.
Evaluation Strategy: Probing vs. Fine-tuning
How do you determine which model is “better”? The researchers used two distinct evaluation methods on downstream tasks (Sentiment Analysis, NER, POS Tagging, and NLI):
- Probing: You “freeze” the pretrained model. You do not allow the model’s weights to change. You only train a small classifier on top of the model’s output.
- What this tests: The raw quality of the representations the model learned during pretraining. Does the model naturally understand sentiment or grammar without extra help?
- Fine-tuning: You allow all the parameters in the model to update during training on the new task.
- What this tests: The model’s adaptability. Can the model effectively reconfigure itself to solve a specific problem?
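In code, the difference between the two regimes boils down to which parameters are allowed to update. Here is a minimal PyTorch sketch, with a toy encoder standing in for any of the five pretrained models; it is not the paper’s actual evaluation code.

```python
import torch
import torch.nn as nn

def setup_downstream(model: nn.Module, hidden_size: int, num_labels: int, probing: bool):
    """Attach a small classification head; freeze the pretrained body when probing."""
    if probing:
        # Probing: pretrained representations stay fixed; only the head learns.
        for param in model.parameters():
            param.requires_grad = False
    head = nn.Linear(hidden_size, num_labels)

    # Fine-tuning optimizes everything; probing optimizes only the head.
    trainable = [p for p in list(model.parameters()) + list(head.parameters()) if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-4)
    return head, optimizer

# A toy encoder standing in for any of the five pretrained models:
def toy_encoder() -> nn.Module:
    return nn.Sequential(nn.Embedding(100_000, 512), nn.Linear(512, 512))

probe_head, probe_opt = setup_downstream(toy_encoder(), hidden_size=512, num_labels=3, probing=True)
tune_head, tune_opt = setup_downstream(toy_encoder(), hidden_size=512, num_labels=3, probing=False)
```

Probing hands only the head’s parameters to the optimizer, so the pretrained representations are evaluated as-is; fine-tuning passes everything through and lets the whole network adapt to the task.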
Key Results
The results revealed a fascinating interplay between architecture and objective. It turns out there is no single “best” objective—it depends entirely on what architecture you are using.
Finding 1: For Encoder-Decoders, Translation is King
When looking at the Double-Stack models (the BART-style architectures), the results were decisive.

Table 1 shows the accuracy for probing experiments. The 2-MT model (trained on translation) consistently outperforms the 2-LM model (trained on denoising) across almost every language and task.
- Interpretation: The explicit signal of translation forces the encoder-decoder to align representations of different languages more effectively than simply denoising text. If you are building an encoder-decoder model for multilingual tasks, translation data is incredibly valuable.
Finding 2: The Single-Stack Surprise
The results for Single-Stack models were more nuanced and challenged conventional wisdom.

Table 2 highlights the probing accuracy for single-stack models.
- The CLM Shock: The Causal Language Model (GPT-style) performed surprisingly well in probing scenarios, often beating the Masked Language Model (BERT-style). This is counter-intuitive because BERT models are bidirectional (they see the whole sentence), which usually helps with understanding tasks like NER or POS tagging.
- The TLM Disappointment: Despite having access to translation data, the TLM (Translation Language Model) did not dominate the single-stack category. It did not consistently outperform the standard CLM.
Finding 3: Fine-Tuning Levels the Playing Field
While probing tells us about the raw knowledge of the model, fine-tuning tells us about its practical utility.

Table 7 shows the Macro F1 scores after fine-tuning. Here, the landscape shifts:
- MLM Resurgence: While the MLM (BERT) struggled in probing compared to CLM, it becomes highly effective when fine-tuned, particularly for Part-of-Speech (POS) tagging and Named Entity Recognition (NER).
- 2-MT Dominance: The Machine Translation model (2-MT) remains a top performer even after fine-tuning. It achieves the highest scores in many categories, suggesting that the “translation intuition” it learned during pretraining provides a robust foundation that is easy to adapt to specific tasks.
Why Does This Matter?
This paper offers several critical takeaways for students and researchers in NLP:
1. Architecture Dictates Objective
You cannot simply say “Translation is the best pretraining objective.” It is the best objective if you are using an Encoder-Decoder architecture. If you are using a Decoder-only architecture, standard Causal Language Modeling might be sufficient. This coupling between architecture and objective is often overlooked in broader discussions.
2. The Power of Translation
The most significant finding is the strength of the 2-MT model. In the era of “Foundation Models” trained on massive piles of monolingual web text (CommonCrawl), this research suggests we might be under-utilizing parallel data. A model that learns to translate is forced to learn semantics. It cannot just rely on statistical patterns of which words sit next to each other; it has to understand that “bank” in a financial context translates differently than “bank” in a river context.
3. The Importance of Control
Finally, this paper serves as a tutorial on good experimental design. By rigorously controlling the dataset size and overlap (as seen in Table 4 and Table 5 below), the authors ensured that their conclusions reflect genuine differences between objectives rather than differences in training data.

If they had simply compared a pre-downloaded mBERT against a pre-downloaded mT5, the results would have been meaningless because the training data would have been different.
Conclusion
The researchers conclude that Multilingual Translation is a highly effective, yet under-explored, pretraining objective. While the industry races toward larger and larger monolingual models, this paper suggests that the explicit cross-lingual signal found in translation tasks builds superior representations, particularly for Encoder-Decoder models.
For students designing their own models: if you have access to parallel data (bitexts), using a translation objective might give your model a “semantic boost” that simple denoising or language modeling cannot match. The future of multilingual AI might not just be about reading more text, but about learning to translate it.