If you’ve ever used Google Translate to finish a Spanish assignment or interpret a menu in Tokyo, you know the results are usually functional but often lack “soul.” The grammar might be correct, but the cultural nuance—the idiom, the local context, the vibe—is often lost.
In the world of Large Language Models (LLMs), we are facing a similar crisis on a massive scale. We want LLMs to speak every language fluently. However, gathering high-quality training data in languages like Russian, Chinese, or Swahili is much harder than finding it in English. The industry standard solution? Take high-quality English data and machine-translate it into the target language.
But is this translated data actually “good” data? Or are we just teaching our models to speak a weird, robotic dialect of “Translationese”? And perhaps more worryingly, are the tests we use to evaluate these models—which are often also translated from English—capable of even spotting the difference?
A fascinating research paper titled “Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?” investigates these exact questions. Let’s dive into their findings.
The Problem: The English-Centric Loop
To understand this paper, we first need to understand Instruction Tuning. After a base model (like Llama 2) learns to predict the next word, it undergoes instruction tuning to learn how to be a helpful assistant. This requires datasets of prompts and responses (e.g., “Write a recipe for paella” -> “Here is a recipe…”).
Most of these datasets are in English. To build a multilingual model, researchers often translate these English instructions into other languages.
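To make that concrete, here is a hedged sketch of what a single instruction-tuning example looks like and how its machine-translated counterpart is produced. The field names are illustrative assumptions, not the actual schema of the Aya dataset or the paper's pipeline.

```python
# Illustrative instruction-tuning records; field names are assumptions,
# not the exact schema of the Aya dataset or the paper's code.

english_example = {
    "prompt": "Write a recipe for paella.",
    "response": "Here is a recipe for paella: ...",
    "language": "en",
}

# The common shortcut: machine-translate the English pair into the target
# language instead of collecting data written by native speakers.
translated_example = {
    "prompt": "Escribe una receta de paella.",            # MT output
    "response": "Aquí tienes una receta de paella: ...",  # MT output
    "language": "es",
    "origin": "translated-from-en",
}
```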
The authors of this paper pose two critical hypotheses:
- The Data Problem: Translated data carries imperfections. It lacks language-specific culture (e.g., knowing who a local celebrity is) and contains “translationese” (unnatural phrasing).
- The Evaluation Problem: If we evaluate these models using benchmarks that were also translated from English, we might not see the performance degradation. The model might be great at solving translated English math problems but terrible at chatting like a native speaker.
The Setup: Native vs. Translated
To test this, the researchers set up a controlled experiment using three languages: Spanish (es), Russian (ru), and Chinese (zh). They compared two types of training data:
- Native Data: They used the Aya dataset, a project where volunteers wrote prompts and responses directly in their native languages.
- Translated Data: They took the English section of the Aya dataset and translated it into Spanish, Russian, and Chinese using two methods:
  - Google Translate: A standard commercial engine.
  - Cohere Command R: A strong LLM prompted to translate while preserving format.
They then fine-tuned several popular base models (Llama 2, Gemma, and Qwen 1.5) using either the Native or the Translated data and compared the results.
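Here is a structural sketch of that controlled comparison in Python. The model names, dataset identifiers, and the `finetune`/`evaluate` helpers are placeholders standing in for a full SFT pipeline and benchmark harness; only the experimental design is meant to mirror the paper.

```python
from itertools import product

# All names below are placeholders for the paper's actual models,
# datasets, and benchmarks.
BASE_MODELS = ["llama-2-7b", "gemma-7b", "qwen1.5-7b"]
LANGS = ["es", "ru", "zh"]
TRAIN_SETS = {
    "native":    "aya_native_{lang}",          # written by native speakers
    "google-mt": "aya_en_to_{lang}_google",    # Google Translate
    "llm-mt":    "aya_en_to_{lang}_command_r", # LLM-based translation
}

def finetune(base_model: str, train_set: str) -> str:
    """Placeholder: run instruction tuning, return a checkpoint id."""
    return f"{base_model}+{train_set}"

def evaluate(checkpoint: str, benchmark: str) -> float:
    """Placeholder: score the checkpoint on one benchmark."""
    return 0.0

for base, lang, (condition, pattern) in product(BASE_MODELS, LANGS, TRAIN_SETS.items()):
    ckpt = finetune(base, pattern.format(lang=lang))
    # The key design choice: every checkpoint is scored on BOTH a native
    # benchmark and a translated benchmark for the same language.
    for bench in (f"native_bench_{lang}", f"translated_bench_{lang}"):
        print(base, lang, condition, bench, evaluate(ckpt, bench))
```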
The Investigation
The results were revealing, but they depended heavily on how the models were tested. The researchers categorized their benchmarks into two types: Native Benchmarks (tests created originally in the target language) and Translated Benchmarks (tests translated from English).
Finding 1: Native Tests Reveal the Gap
When the models were evaluated on benchmarks created by native speakers (such as TyDi QA for Russian or CMMLU for Chinese), a clear pattern emerged.

As shown in Figure 1, models trained on Native data (the green bars) consistently outperformed those trained on translated data (orange and blue bars).
- TyDi QA (Top chart): Llama-2 tuned on native data scored ~28.3%, while the version tuned on Google-translated data dropped to ~25.5%.
- CMMLU (Bottom chart): The gap is distinct across almost all model sizes.
This confirms the first suspicion: Native data is superior. It captures cultural knowledge and linguistic patterns that translation simply misses.
Finding 2: Translated Tests Hide the Truth
Here is where it gets tricky. When the researchers ran the exact same models on popular translated benchmarks (like MGSM for math or MT-MMLU for general knowledge), the performance gap vanished.

Look at the MGSM and MT-MMLU charts in Figure 2 (bottom row). The green, orange, and blue bars are almost identical.
If you only looked at these benchmarks, you would conclude that “translated training data is just as good as native data.” This is a dangerous illusion. Because the test itself is a translation of an English concept, it doesn’t penalize the model for lacking local cultural knowledge. It essentially tests the model’s ability to translate English logic into the target language—exactly what the translated training data taught it to do.
Finding 3: Generative Tasks Don’t Lie
The researchers found that Generative Tasks (where the model has to write a paragraph rather than choose A, B, C, or D) are much harder to “game.”
On XQuAD (a translated Question Answering task, top left of Figure 2), the native data maintained a massive lead. Similarly, in open-ended evaluations where GPT-4 was used as a judge to score response quality, native data usually won out.

Figure 3 highlights this nuance. When asked to generate open-ended answers, the models trained on native data generally produced better responses, especially when evaluated by a high-quality judge like GPT-4 (top right).
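For readers who haven't seen LLM-as-a-judge evaluation before, here is a minimal sketch of how such scoring typically works, assuming the `openai` Python SDK. The rubric and prompt below are illustrative only; the paper's exact judge prompt is not reproduced here.

```python
from openai import OpenAI  # assumes the `openai` Python SDK (v1+)

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are judging an assistant's answer written in {lang}.
Question: {question}
Answer: {answer}

Rate the answer from 1 to 10 for helpfulness, fluency, and factual/cultural
correctness in {lang}. Reply with a single integer."""

def judge(question: str, answer: str, lang: str) -> int:
    """Ask a strong judge model (e.g. GPT-4) to score one response."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(lang=lang, question=question, answer=answer),
        }],
    )
    # Assumes the judge follows the "single integer" instruction.
    return int(resp.choices[0].message.content.strip())
```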
Furthermore, the “better” the base model is, the more the data quality matters.

Figure 4 shows a strong correlation between base-model quality and the size of the native-data advantage for tasks like XQuAD and QA-GPT4. This suggests that as our LLMs get smarter (stronger base models), data quality becomes the bottleneck. Using cheap translated data will hurt a smart model more than it hurts a dumb one.
The “Sherlock Holmes” Moment: Why the Gap?
We know there is a gap. But is it because:
- Translation Defects: The grammar and style of the translated training data are bad?
- Knowledge Mismatch: The translated data is talking about American concepts (e.g., the Super Bowl) that are irrelevant to a Chinese or Russian user?
To isolate these factors, the authors devised a clever experiment called Round-Trip Translation (RTT).

As visualized in Figure 5, they took the Native data (e.g., a Russian question about Russian history), translated it to English, and then translated it back to Russian.
- The Resulting Data: It contains “translation noise” (defects), but it retains the original cultural knowledge of the native dataset.
If “Translation Defects” were the main problem, this RTT data should perform poorly (like the English-origin translated data). If “Knowledge Mismatch” were the problem, this RTT data should perform well (like the Native data).
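Here is a minimal sketch of how RTT data could be constructed, assuming a generic `translate` helper that wraps whatever MT engine is available. This is not the authors' code, just the idea.

```python
def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for an MT call (e.g. Google Translate or an LLM)."""
    return text  # identity stand-in so the sketch runs

def round_trip(example: dict, lang: str, pivot: str = "en") -> dict:
    """lang -> en -> lang: keeps the native knowledge, adds translation noise."""
    rtt = dict(example)
    for field in ("prompt", "response"):
        in_pivot = translate(example[field], src=lang, tgt=pivot)  # e.g. ru -> en
        rtt[field] = translate(in_pivot, src=pivot, tgt=lang)      # e.g. en -> ru
    rtt["origin"] = f"rtt-{lang}"
    return rtt
```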
The Verdict:

Look at Table 2. The RTT (zh-origin) columns consistently beat the translated (en-origin) columns.
- For Qwen1.5-7B, the RTT score is 68.9% (close to the native score), while the data translated from English scores 68.4%.
- In other experiments, the RTT performance was surprisingly close to the pure Native performance.
This implies that knowledge is king. The biggest issue with translated datasets isn’t that the grammar is slightly off; it’s that the content itself is culturally irrelevant. A model trained on translated English data learns about American presidents and baseball, while a model trained on Native Chinese data learns about Chinese dynasties and poetry.
Can We Fix It?
Realistically, we can’t always get massive native datasets for every language. So, if we have to use translated data, can we mitigate the damage?
The paper explores two regularization techniques:
- Lower Learning Rate: Slowing down the learning process to prevent the model from overfitting to the “translationese” style.
- Multilingual Tuning: Mixing many languages together during training.
The results were mixed.

As seen in Table 4, using a lower learning rate (\(10^{-6}\) vs \(10^{-4}\)) helped bridge the gap on some tasks like TyDi QA for Llama-2. The model essentially learned the “instruction format” without memorizing the weird translated phrasings.
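In practice, this ablation is a one-knob change in the trainer configuration. A hedged sketch using Hugging Face `TrainingArguments` follows; the other hyperparameters are illustrative defaults, not the paper's exact settings.

```python
from transformers import TrainingArguments  # assumes Hugging Face transformers

common = dict(
    output_dir="out",
    num_train_epochs=3,             # illustrative, not the paper's setting
    per_device_train_batch_size=8,  # illustrative
)

# High learning rate: fits the translated data closely, "translationese" and all.
aggressive = TrainingArguments(learning_rate=1e-4, **common)

# Low learning rate: acts as a regularizer, nudging the model toward the
# instruction-following format without memorizing translation artifacts.
gentle = TrainingArguments(learning_rate=1e-6, **common)
```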
However, for generative tasks like XQuAD, the gap remained stubborn.

Table 6 is stark. No matter the learning rate or multilingual setup, the Native data (first row per model) dominates the Translated data (second/third rows). You cannot easily regularize your way out of a knowledge deficit when the task requires creative generation.
Conclusion: The “Bad Eval” Trap
This paper serves as a wake-up call for the LLM community.
If you are building a multilingual model, translated data is a compromise. It works fine for logic, math, and multiple-choice questions where cultural context is minimal. However, for open-ended conversation and cultural fluency, there is no substitute for native data.
More importantly, if you are evaluating a multilingual model, translated benchmarks are a trap. They can create a false sense of security, making you believe your model is fluent in Spanish or Chinese when it is actually just fluent in “Translated English.”
Key Takeaways for Students & Researchers:
- Be Skeptical of Scores: If a paper claims “State-of-the-art Multilingual Performance” but only tests on translated MMLU, take it with a grain of salt. Look for native benchmarks like C-Eval or TyDi QA.
- Context Matters: The “Knowledge Mismatch” finding proves that language is not just a code to be deciphered; it’s a vessel for culture. You cannot separate the language from the knowledge it carries.
- Generative Testing is Better: Multiple-choice questions are easy to game. To see if a model truly understands a language, ask it to write.
As we strive for truly global AI, we need to move beyond simply translating our existing English resources. We need to invest in creating native data and, perhaps even more urgently, native evaluations.