Introduction
In the rapidly evolving landscape of Large Language Models (LLMs), there is a distinct imbalance. While models like GPT-4 and Llama 2 dazzle us with their capabilities, they are predominantly “English-centric.” They are trained on vast oceans of English text, and their ability to follow instructions in other languages often feels like an afterthought—a side effect of translation rather than a core feature.
But the world speaks more than just English. For an AI to be a truly global assistant, it must be “polyglot”—capable of understanding and generating fluent, culturally nuanced text in multiple languages.
The creation of an AI assistant typically involves two massive steps: Pre-training, where the model learns to predict the next word by reading terabytes of text, and Instruction-Tuning, where the model is fine-tuned to actually follow user commands (like “Write an email” or “Solve this equation”).
A group of researchers from the Lamarr Institute, Fraunhofer IAIS, TU Dresden, and FZ Jülich recently tackled a critical question that has been lingering in the NLP community: If we have a pre-trained model that knows multiple languages, how should we fine-tune it? Do we need to give it instructions in every language, or can we just teach it in English and hope the skills transfer?
In their paper, Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand Multilingual Instructions?, the authors conduct the first extensive study on how different compositions of languages during instruction-tuning affect a model’s performance. Their findings challenge existing assumptions about how AI learns and offer a roadmap for building better multilingual assistants.
Background: The Alignment Gap
Before diving into the experiments, we need to understand the “Alignment Gap.” A pre-trained model is like a very well-read encyclopedia. It contains knowledge, but it doesn’t necessarily know how to be helpful. If you ask it “How do I bake a cake?”, a pre-trained model might just continue the sentence with “…is a question asked by many.” It predicts the next tokens.
Instruction-tuning bridges this gap. It aligns the model’s vast knowledge with the user’s intent.
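To make this concrete, here is a minimal sketch (in Python) of how a single instruction/response pair might be serialized into a training example for supervised fine-tuning. The template below is a generic illustration of the idea, not the exact chat format used in the paper:

```python
# Minimal sketch: turning an instruction/response pair into one training string.
# The "### Instruction / ### Response" template is a generic illustration only.

def format_instruction_example(instruction: str, response: str) -> str:
    """Serialize one supervised fine-tuning example into a single training string."""
    return (
        "### Instruction:\n"
        f"{instruction}\n\n"
        "### Response:\n"
        f"{response}"
    )

# A pre-trained model would simply continue "How do I bake a cake?" as free text;
# after tuning on examples like this, it learns to answer in the assistant role.
example = format_instruction_example(
    "How do I bake a cake?",
    "Preheat the oven, mix flour, sugar, eggs and butter, then bake for about 30 minutes.",
)
print(example)
```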
The Superficial Alignment Hypothesis
A popular theory in recent years is the Superficial Alignment Hypothesis. Proposed by the creators of the LIMA dataset, this hypothesis suggests that a model learns almost all its knowledge and capabilities during pre-training. Instruction-tuning, therefore, is just “superficial”—it merely teaches the model the style or format of an assistant. According to this theory, you only need a tiny amount of high-quality examples (e.g., 1,000 instructions) to align a massive model.
The researchers in this study wanted to see if this hypothesis holds true for multilingual scenarios. Does a German-speaking model only need 1,000 German instructions to unlock its potential? Or does the complexity of multiple languages require a heavier hand?
The Data Problem
The biggest hurdle in multilingual AI is the lack of data. Most open-source instruction datasets are in English. To solve this, the researchers created two significant resources:
- Lima-X: A high-quality, human-curated parallel corpus. They took the English LIMA dataset and translated/curated it for German, French, Italian, and Spanish.
- MT-Bench-X: A benchmark for evaluating how well models follow instructions in these five languages.
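To make "parallel" concrete, here is a hypothetical sketch of what a single Lima-X record could look like, with the same instruction aligned across all five languages. The field names and structure are assumptions for illustration, not the dataset's actual schema:

```python
# Hypothetical shape of one parallel Lima-X record: the same prompt, aligned
# across all five languages under a shared id (field names are illustrative).
lima_x_record = {
    "id": 42,
    "EN": {"instruction": "Write a short poem about autumn.", "response": "..."},
    "DE": {"instruction": "Schreibe ein kurzes Gedicht über den Herbst.", "response": "..."},
    "FR": {"instruction": "Écris un court poème sur l'automne.", "response": "..."},
    "IT": {"instruction": "Scrivi una breve poesia sull'autunno.", "response": "..."},
    "ES": {"instruction": "Escribe un poema corto sobre el otoño.", "response": "..."},
}
```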
The Core Method: Designing a Polyglot Experiment
To rigorously test how language composition affects learning, the researchers set up a series of experiments using two types of base models:
- 24EU-7B: A mid-sized, 7-billion parameter model pre-trained on a mix of 24 European languages.
- Mixtral-8x7B: A larger, “Mixture of Experts” model known for strong performance.
The goal was to fine-tune these models using different “mixtures” of instruction data and see which recipe produced the best polyglot assistant.
The Dataset Cocktails
They utilized two primary data sources: Bactrian-X (a large-scale, synthetic dataset generated by ChatGPT) and the newly created Lima-X (small-scale, human-curated).
They created specific language mixtures to test their hypotheses. The notation they used (e.g., ENDEFRITES) represents the languages included: EN (English), DE (German), FR (French), IT (Italian), and ES (Spanish).
- Monolingual: Training on just one language (e.g., Bactrian-DE).
- ENDEFRITES (Parallel): Training on the dataset where every instruction exists in all five languages simultaneously. This increases the dataset size significantly (5x).
- DEFRITES: Parallel training excluding English, to see if the dominant language is necessary.
- ENDEFRITES-sampled: A mixture containing all languages, but down-sampled so the total number of examples equals the monolingual dataset. This tests if diversity of language is more important than the quantity of examples.
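As a rough illustration of these recipes, the sketch below assembles the four mixtures from placeholder per-language splits. The dummy loading code and variable names are assumptions for illustration only, not the paper's pipeline:

```python
import random

# Dummy stand-ins for the per-language splits of Bactrian-X / Lima-X;
# in practice these would be the real instruction/response pairs.
def dummy_split(lang: str, n: int = 1000) -> list:
    return [{"lang": lang, "instruction": f"instr {i}", "response": f"resp {i}"} for i in range(n)]

bactrian = {lang: dummy_split(lang) for lang in ["EN", "DE", "FR", "IT", "ES"]}

# Monolingual: a single language, e.g. Bactrian-DE.
mono_de = bactrian["DE"]

# ENDEFRITES (parallel): every instruction in all five languages -> roughly 5x the data.
endefrites = [ex for lang in ["EN", "DE", "FR", "IT", "ES"] for ex in bactrian[lang]]

# DEFRITES: the same parallel recipe, but without English.
defrites = [ex for lang in ["DE", "FR", "IT", "ES"] for ex in bactrian[lang]]

# ENDEFRITES-sampled: all five languages, down-sampled to the monolingual size,
# so language diversity is tested at a fixed total number of examples.
rng = random.Random(42)
endefrites_sampled = rng.sample(endefrites, k=len(mono_de))
```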
The Evaluation Benchmark: MT-Bench-X
Evaluating a chat assistant is difficult. There is no single “correct” answer to “Write a poem about autumn.” To solve this, the researchers used an LLM-as-a-judge approach.
They employed GPT-4 to act as a judge. They would feed a user question and the model’s response to GPT-4, and ask GPT-4 to score the response on a scale of 1-10 or compare it to another model’s answer. This was done across all target languages using the newly created MT-Bench-X.
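For intuition, here is a minimal sketch of the single-answer scoring flavor of LLM-as-a-judge, assuming the official OpenAI Python client. The judge prompt wording is illustrative and not the actual MT-Bench-X prompt:

```python
from openai import OpenAI  # assumes the official OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative judge prompt; the real MT-Bench-X judge prompt is worded differently.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the assistant's answer to the user question "
    "below on a scale of 1 to 10, considering helpfulness, correctness, and fluency "
    "in {language}. Reply with the rating only.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def judge_single_answer(question: str, answer: str, language: str = "German") -> int:
    """Ask GPT-4 to score one model response; returns an integer from 1 to 10."""
    prompt = JUDGE_TEMPLATE.format(language=language, question=question, answer=answer)
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())
```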

Figure 4 above provides a high-level view of the absolute scores achieved by different model configurations. Note how the Mixtral-8x7B models (bottom row) generally achieve higher scores (green/yellow zones) compared to the smaller 24EU-7B models (top row), illustrating the impact of base model size.
Experiments & Results
The study produced several fascinating insights that refine our understanding of how AI learns languages.
1. Parallel Data Wins (The Cross-Lingual Boost)
The most significant finding is that parallel instruction-tuning (training on the same instructions translated across multiple languages) is superior to monolingual training.
If you want a model to be good at German, you might think, “I should just train it on German instructions.” Surprisingly, this study shows that training it on German, English, French, Italian, and Spanish simultaneously makes it better at German than if it had only seen German data.

As shown in Figure 1 above, models fine-tuned on parallel datasets (like ENDEFRITES) consistently outperformed their monolingual counterparts. The chart shows the percentage improvement of multilingual fine-tuning over single-language fine-tuning.
- For the Mixtral-8x7B model (Figure 1b), the bars are almost entirely positive, showing consistent gains.
- For the 24EU-7B model (Figure 1a), the Bactrian-X (BX) dataset shows massive improvements (the purple bar reaches nearly 10%).
This suggests that learning to follow an instruction in French reinforces the neural pathways required to follow that same instruction in Italian. The concept of the task transfers across language boundaries.
2. The “Superficial Alignment Hypothesis” is Conditional
The study provided a nuance to the famous Superficial Alignment Hypothesis.
- For the large model (Mixtral): The hypothesis held up well. The model performed very well even with small amounts of high-quality data (Lima-X), suggesting it already “knew” how to be multilingual and just needed a nudge.
- For the mid-sized model (24EU-7B): The hypothesis failed. The model struggled with the small, curated Lima-X dataset. It performed significantly better with the large, synthetic Bactrian-X dataset.
This indicates that smaller or less capable models need more repetitions (larger datasets) to learn multilingual instruction following. They cannot rely solely on pre-training to bridge the gap; they need a more rigorous fine-tuning phase.
3. Detailed Capability Analysis
The researchers didn’t just look at overall scores; they broke down performance by category (Coding, Reasoning, Math, Roleplay, etc.).

Figure 5 utilizes radar charts to visualize these capabilities.
- Left (a) and right (b) compare the different fine-tuning strategies on the English (EN) and German (DE) benchmarks, respectively.
- Notice the tiny dotted black loop in the center? That is the pre-trained base model. It scores very low in almost every category, showing that pre-training alone does not produce a usable assistant.
- The colored lines represent the fine-tuned models. The Bactrian-EN and Lima-EN models (solid lines) generally stretch further out, indicating better performance. However, on the non-English charts (like DE), the multilingual mixtures (e.g., Bactrian-DEFRITES) often encompass a larger area than the monolingual ones, reinforcing the benefit of mixed-language training.
4. Absolute Performance
While the relative improvements were clear, it is worth looking at the absolute scores to understand the current state of the art for these specific setups.

Figure 7 highlights a stark contrast between the two base models used. The Mixtral-8x7B (right side) achieves significantly higher absolute scores (ranging from 6 to 8) compared to the 24EU-7B model (left side, ranging from 2 to 5). This confirms that no amount of clever instruction-tuning can fully compensate for a less capable base model. However, within the 24EU-7B cluster, we again see that the ENDEFRITES (all languages) configuration pushes the score higher than the sampled or partial mixtures.
Human Evaluation vs. GPT-4
A critical part of this paper was verifying the metric itself. Can we actually trust GPT-4 to judge other AI models? To find out, the researchers conducted a human study where experts evaluated the model responses.

The researchers built a custom interface (shown in Figure 8) where human judges were presented with a user question and two anonymous model answers. They had to pick the winner, declare a tie, or declare “both bad.”
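A simple way to turn those pairwise votes into the shares reported below is to tally them per outcome. The sketch here assumes a flat list of verdict labels, which is an illustrative format rather than the paper's actual logging:

```python
from collections import Counter

# Hypothetical verdict log from the pairwise study: one entry per judged question,
# comparing model A (e.g. Bactrian-ENDEFRITES) against model B (e.g. Bactrian-DE).
votes = ["A", "B", "tie", "both_bad", "A", "tie", "A", "both_bad"]

def verdict_shares(votes):
    """Convert raw pairwise verdicts into percentage shares per outcome."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {outcome: round(100 * n / total, 1) for outcome, n in counts.items()}

print(verdict_shares(votes))
# -> {'A': 37.5, 'B': 12.5, 'tie': 25.0, 'both_bad': 25.0}
```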
Disagreement in the Ranks
The results showed a discrepancy between human and machine judges.

Figure 2 reveals a fascinating divergence:
- GPT-4 (Right chart): It was very decisive. It declared Bactrian-ENDEFRITES (green) the winner in the vast majority of cases against the monolingual Bactrian-DE.
- Humans (Left chart): Humans were more critical. While they still preferred the multilingual model (green), they voted “Tie” (blue) or “Both Bad” (dark grey) much more often than GPT-4 did.
This suggests that while GPT-4 correlates with human preference, it may be overly optimistic or lenient, failing to catch nuances that make an answer “bad” in the eyes of a human native speaker. Specifically, GPT-4 struggled to identify when both models failed at complex reasoning or math tasks, often forcing a “winner” where there shouldn’t be one.
Positional Bias
The researchers also highlighted a known issue with LLM judges: Positional Bias. LLMs tend to prefer the first answer they read.

As seen in Table 1, the bias is significant. In categories like "STEM" (Science, Technology, Engineering, Math), the positional bias reached 30%, meaning the order in which the two answers were presented swayed the verdict in nearly a third of comparisons. The researchers had to use careful debiasing techniques (swapping positions and averaging the results) to obtain reliable scores.
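One common way to implement this kind of debiasing is to judge each pair twice with the answer order swapped and only count verdicts that agree across both orders; anything else becomes a tie. The sketch below assumes a hypothetical `judge_pair` function (e.g., a GPT-4 call returning "first", "second", or "tie"); the authors' exact aggregation rule may differ:

```python
# Sketch of position-swapped judging: each pair is judged twice with the answer
# order swapped, and only verdicts that agree across both orders count as a win.
# `judge_pair` is a hypothetical pairwise judge (e.g. a GPT-4 call that returns
# "first", "second", or "tie"); the authors' exact aggregation rule may differ.

def debiased_verdict(question: str, answer_a: str, answer_b: str, judge_pair) -> str:
    v1 = judge_pair(question, answer_a, answer_b)  # answer A shown first
    v2 = judge_pair(question, answer_b, answer_a)  # answer B shown first

    if v1 == "first" and v2 == "second":
        return "A"   # A preferred regardless of position
    if v1 == "second" and v2 == "first":
        return "B"   # B preferred regardless of position
    return "tie"     # inconsistent or tied verdicts are treated as a tie
```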
Conclusion & Implications
The paper “Investigating Multilingual Instruction-Tuning” provides a crucial piece of the puzzle for the democratization of AI. It moves us away from the idea that we can simply build English models and “translate” the rest.
Key Takeaways:
- Polyglot models do demand polyglot instructions. You cannot maximize a model’s performance in German or Spanish solely by training it in English.
- Parallelism is power. Training on parallel datasets (multiple languages at once) creates a synergy that boosts performance across all involved languages. It improves the model’s underlying ability to follow instructions, independent of the language used.
- Size matters for data efficiency. The “Less is More” approach (Superficial Alignment) works for massive models, but mid-sized models (which are cheaper to run and more accessible) still need large, robust instruction datasets to learn effectively.
- Trust, but verify. Automated benchmarks like MT-Bench-X are useful for rapid iteration, but they are not a perfect proxy for human judgment. They suffer from bias and leniency.
This research paves the way for more efficient training pipelines. Instead of creating siloed datasets for every language, the community should focus on creating high-quality, parallel corpora. By doing so, we can ensure that the next generation of AI assistants is helpful not just in Silicon Valley, but in Dresden, Paris, Rome, and Madrid as well.