If you have ever taken a course on Neural Machine Translation (NMT), you likely learned the “Golden Rule” of the field: data is king. To build a system capable of translating between English and German, you traditionally need millions of high-quality, aligned sentence pairs. If you want a multilingual model, you need massive datasets covering every direction you intend to support.
But the era of Large Language Models (LLMs) like Llama-2 has shaken the foundations of this dogma. These models have read terabytes of text during pre-training. They have likely seen German, Chinese, and Russian before they ever encounter a specific translation dataset.
This raises a fascinating question: If the model already knows the languages, do we really need millions of examples to teach it how to translate?
In a compelling paper titled “Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice?”, researchers from Saarland University, the University of Edinburgh, and the Eastern Institute of Technology investigate this very question. They push the limits of data efficiency, asking if an LLM can learn to translate using just a handful of examples, or even data that is noisy and “misaligned.”
Their findings support a controversial theory known as the Superficial Alignment Hypothesis: the idea that a model’s core capabilities are learned during pre-training, and fine-tuning is merely a superficial process of teaching the model how to interact with the user.
In this post, we will tear down their methodology, analyze their surprising results, and explore what this means for the future of multilingual AI.
The Foundation: Supervised Fine-Tuning (SFT)
Before diving into the experiments, let’s establish the mechanism being tested. The researchers focus on Supervised Fine-Tuning (SFT).
In the context of LLMs, SFT is the process of taking a base model (like Llama-2-7b) and training it further on a specific dataset of Instruction-Input-Output triplets. For translation, the input is a source sentence wrapped in a prompt (e.g., “Translate the following to German:”), and the output is the target translation.
Mathematically, the goal is to optimize the model parameters \(\theta\) to maximize the probability of the target translation \(T\) given the source \(S\) and the instruction \(\mathcal{I}\):

$$
\hat{\theta} = \arg\min_{\theta} \; -\sum_{t=1}^{|T|} \log P_{\theta}\left(T_t \mid T_{<t},\, S,\, \mathcal{I}\right)
$$

As shown in the equation above, the model minimizes the negative log-likelihood of the target tokens. This is standard practice. The novelty of this paper lies not in how they train, but in what they train on—specifically, how little data they can get away with.
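To make the mechanics concrete, here is a minimal sketch of how one Instruction-Input-Output triplet becomes a training example, assuming PyTorch and Hugging Face `transformers`; the prompt template and label-masking details are illustrative assumptions, not the paper's exact recipe. The key point is that only the target tokens contribute to the negative log-likelihood.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative prompt template; the paper's exact wording may differ.
PROMPT = "Translate the following sentence into German:\n{src}\nTranslation:"

def build_sft_example(tokenizer, src, tgt):
    """Turn one instruction-input-output triplet into (input_ids, labels)."""
    prompt_ids = tokenizer(PROMPT.format(src=src), add_special_tokens=False).input_ids
    target_ids = tokenizer(" " + tgt, add_special_tokens=False).input_ids + [tokenizer.eos_token_id]
    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + target_ids  # -100 masks prompt tokens from the loss
    return torch.tensor([input_ids]), torch.tensor([labels])

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

input_ids, labels = build_sft_example(
    tokenizer, "The cat sits on the mat.", "Die Katze sitzt auf der Matte."
)
loss = model(input_ids=input_ids, labels=labels).loss  # mean NLL over the target tokens only
loss.backward()  # in a real training loop, an optimizer step on theta would follow
```

Masking the prompt tokens is what makes this "task definition" rather than language modeling of the instruction itself.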
Experiment 1: The “Less is More” Approach to Data Quantity
The first pillar of the traditional translation dogma is volume. State-of-the-art NMT systems are usually trained on millions of sentence pairs. The researchers decided to see how low they could go.
They took the Llama-2 7B model and fine-tuned it on WMT (Workshop on Machine Translation) datasets. They subsampled the data into sizes that are powers of 2, ranging from a microscopic 16 examples up to 4,096 examples.
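A minimal sketch of that subsampling scheme follows; the random seed and the (source, target) pair format are assumptions for illustration, not the authors' exact script.

```python
import random

def subsample_powers_of_two(pairs, max_exp=12, seed=0):
    """Draw nested subsets of sizes 16, 32, ..., 4096 from a list of (src, tgt) pairs."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    return {2 ** k: shuffled[: 2 ** k] for k in range(4, max_exp + 1)}  # 2^4=16 ... 2^12=4096

# subsets[32] would be the 32-example training set behind the "32-sample surprise" below.
```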
They compared their Fine-Tuned (SFT) models against two baselines:
- Instruction-Tuned LLMs (IT-LLM): Models like Vicuna or Llama-2-Chat that are already tuned for general tasks.
- In-Context Learning (ICL): Giving the base model a few examples (1-shot or 3-shot) in the prompt without updating weights.
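For contrast, the ICL baseline never updates \(\theta\); it simply prepends k demonstration pairs to the test sentence. A rough sketch of what such a prompt might look like (the template is a plausible guess, not the paper's exact format):

```python
def build_icl_prompt(demos, src, tgt_lang="German"):
    """Build a k-shot in-context learning prompt from (source, translation) demos."""
    lines = []
    for demo_src, demo_tgt in demos:              # k demonstrations, e.g. 1-shot or 3-shot
        lines.append(f"English: {demo_src}\n{tgt_lang}: {demo_tgt}")
    lines.append(f"English: {src}\n{tgt_lang}:")  # the base model completes this last line
    return "\n\n".join(lines)

prompt = build_icl_prompt(
    demos=[("Good morning.", "Guten Morgen.")],
    src="The weather is nice today.",
)
```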
The 32-Sample Surprise
The results were stark.

Looking at Figure 1 above, observe the blue triangles (SFT-MT).
- The Jump: With just 32 training examples, the SFT model’s performance (measured in COMET and BLEU scores) jumps significantly, matching or outperforming general-purpose models like Llama-2-Chat (the pink diamonds).
- Diminishing Returns: While adding more data does improve performance, the curve flattens quickly. The gain from 32 examples to 1,024 examples is noticeable, but increasing from 1,024 examples to the full 75,000-example dataset yields only marginal improvements.
What does this tell us? It strongly suggests that the 32 examples aren’t “teaching” the model German or Chinese grammar. The model already knows that. Those 32 examples are simply defining the task. They act as a signal telling the model: “Stop acting like a text completion engine and start acting like a translator.” This is strong evidence for the Superficial Alignment Hypothesis.
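For readers who want to reproduce this kind of comparison, translation quality here is scored with BLEU and COMET. A hedged sketch using the `sacrebleu` and `unbabel-comet` packages; the COMET checkpoint named below is a commonly used one, not necessarily the paper's exact choice.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

sources    = ["Die Katze sitzt auf der Matte."]
hypotheses = ["The cat sits on the mat."]         # system outputs
references = ["The cat is sitting on the mat."]   # human references

# Corpus-level BLEU (surface n-gram overlap).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# Neural COMET score (checkpoint is an assumption, not necessarily the paper's).
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
print(f"COMET: {comet.predict(data, batch_size=8, gpus=0).system_score:.3f}")
```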
Experiment 2: The “One for All” Approach to Directionality
The second pillar of translation dogma is diversity. If you want a model that translates Chinese to English (\(zh \to en\)) and German to English (\(de \to en\)), you usually train on both.
The researchers challenged this by fine-tuning Llama-2 on only one translation direction (e.g., only German to English) and then testing it on 11 different directions.
The Cross-Lingual Transfer
The heatmap below visualizes how well a model trained on a specific direction (y-axis) performs across various test directions (x-axis). The scores are normalized against a model trained on all directions.

Figure 2 reveals a fascinating pattern of Cross-Lingual Transfer:
- Targeting English (\(X \to en\)): Look at the rows for `de->en` or `zh->en`. If you train the model to translate German to English, it becomes excellent at translating any known language to English. The model learns the format “Foreign Language \(\to\) English Output” and applies it universally.
- The English-Centric Trap (\(en \to X\)): Things get tricky when English is the source. If you train on `en->de` (English to German), the model performs decently on other `en->X` tasks, but not as well as the reverse.
- Task Misinterpretation: The researchers found that if you train on `en->de` and then ask the model to translate English to Chinese, the model sometimes gets confused and outputs German instead of Chinese, or simply copies the English source.
The takeaway is that multilingual capabilities are unlocked via alignment. You don’t need to show the model every language pair. Training on a single direction aligns the model to the concept of translation, which it can then apply to other languages it learned during pre-training. However, avoiding English as the target side in training data helps prevent the model from over-biasing toward English generation.
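One way to quantify the “task misinterpretation” failure described above is an off-target rate: how often the output is in the wrong language or simply copies the source. A rough sketch, assuming the `langid` package (any language-identification tool would work):

```python
import langid  # lightweight language identifier

def off_target_rate(sources, outputs, expected_lang="zh"):
    """Fraction of outputs that are in the wrong language or just echo the source."""
    bad = 0
    for src, out in zip(sources, outputs):
        predicted_lang, _ = langid.classify(out)
        if predicted_lang != expected_lang or out.strip() == src.strip():
            bad += 1
    return bad / max(len(outputs), 1)

# For a model fine-tuned on en->de and asked to do en->zh, a high off-target rate
# means it answered in German or copied the English input instead of translating.
```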
Experiment 3: The Unseen Frontier
What about languages that Llama-2 didn’t see during pre-training? The researchers categorized languages into “Seen” (explicitly in the training corpus, like German) and “Unseen” (negligible presence, like Icelandic or Hausa).
They fine-tuned the model specifically on these unseen languages (\(en \leftrightarrow is\) and \(en \leftrightarrow ha\)) and compared it to a control model trained on German.

Figure 4 illustrates two key points:
- Modest Gains on Unseen Languages: Fine-tuning on Hausa (`en-ha`) does improve performance on Hausa compared to a model trained on German (`en-de`). However, the translations are still often poor. SFT cannot magically teach a language the model has never seen.
- No Catastrophic Forgetting: Critically, fine-tuning on Hausa does not hurt the model’s ability to translate the “Seen” languages (like Czech or Russian). The model retains its pre-trained knowledge even when tuned on outlier data.
This reinforces the idea that SFT is a formatting mechanism. Even when using an obscure language to show the model the “translation format,” the model successfully unlocks its latent abilities for the dominant languages.
Experiment 4: The Danger of “Dirty” Data
In the real world, we rarely have pristine human-translated data for every language. We often rely on Synthetic Data—specifically back-translation (using another MT system to generate the training pairs).
The researchers simulated this by creating two types of noisy data:
- Sentence Noise: Using an existing MT model to translate sentences (decent quality, but synthetic).
- Word Noise: Literally translating word-by-word using a dictionary (terrible quality, grammatically broken).
They then tested two scenarios: putting this noise on the Source side vs. the Target side.
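To make the noise types concrete, here is a toy sketch of word noise; the miniature dictionary is invented for illustration, and the paper's actual lexicon and pipeline will differ.

```python
# Toy English->German lexicon; real word noise would use a full bilingual dictionary.
TOY_DICT = {"the": "die", "cat": "Katze", "sits": "sitzt", "on": "auf", "mat": "Matte"}

def word_noise(sentence, lexicon):
    """Word-by-word 'translation': grammatically broken, preserves source word order."""
    return " ".join(lexicon.get(w.lower().strip("."), w) for w in sentence.split())

clean_src = "The cat sits on the mat."
noisy_tgt = word_noise(clean_src, TOY_DICT)   # -> "die Katze sitzt auf die Matte"

# Target-side noise: train on (clean_src -> noisy_tgt), teaching the model broken output.
# Source-side noise: train on (noisy_src -> clean_tgt), which the paper finds far less harmful.
```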
The “Literal Translator” Failure Mode
The results in Figure 5 are cautionary.

When noisy data is placed on the Target side (i.e., we are teaching the model to output broken sentences), performance tanks. This is especially true for Word Noise.
If you look at the bottom row of Table 2 below, you can see the disaster in action. When trained on word-level noise for English \(\to\) German, the model creates a “Literal” translation that mimics the broken grammar of the training data.

The “Knowledge Paradox”
Here is perhaps the most counter-intuitive finding of the paper: The model is more robust to noise in unseen languages than in seen languages.
When the researchers introduced noise into German (a “Seen” language), the model’s performance dropped sharply. The model knows German well, so it noticed the specific, weird patterns in the noisy data and overfit to them, learning to produce bad German.
However, when they introduced noise into Hausa (an “Unseen” language), the model was surprisingly resilient. Because the model doesn’t “know” Hausa, it couldn’t learn the intricate (and incorrect) patterns of the noise. Instead, it simply learned the high-level task: “Take Input A and produce Output B.” It ignored the noise because it didn’t understand the language well enough to be corrupted by it.
Conclusion: Alignment vs. Learning
This research offers a paradigm shift for students and practitioners of NLP.
- Data Efficiency: You do not need massive datasets to build a functional LLM-based translator. 32 high-quality examples can beat generic instruction-tuned models.
- The Role of SFT: Supervised Fine-Tuning, in this context, is not about teaching the language. It is about aligning the model. You are simply unlocking the probability distributions for translation that were already learned during the massive pre-training phase.
- Strategic Data Selection:
- One direction can unlock many.
- Avoid training only with English as the target, or the model might forget how to generate other languages.
- Be careful with synthetic data on the target side. If the model knows the language well, it will learn your mistakes.
The days of needing millions of sentence pairs to get a translation system off the ground may be fading. As LLMs become more capable, the engineer’s job shifts from “collecting all the data” to “curating the right few prompts” to guide the giant.
For those interested in the detailed numbers behind these experiments, the paper provides extensive validation of the Superficial Alignment Hypothesis, suggesting that for many LLM tasks, “less” really might be “more”—provided it is the right kind of “less.”