In the rapidly evolving world of Large Language Models (LLMs), there is a widely accepted hierarchy of data quality. At the top sits human-annotated data—the “gold standard”—carefully crafted by experts. Below that is synthetic data generated by models, often viewed as a useful but slightly inferior substitute when human data is scarce.
But what if that hierarchy is wrong?
A fascinating research paper titled “I Learn Better If You Speak My Language” explores a counter-intuitive phenomenon: fine-tuning a small LLM (like Mistral or Llama-2) on responses generated by other LLMs (like GPT-4) often yields better results than fine-tuning on human-written responses.
This isn’t just about GPT-4 being “smarter” than the average human annotator. The researchers discovered that the style of the language matters just as much as the content. Specifically, models learn better from data that feels “familiar” to them—data that speaks their language.
In this deep dive, we will unpack this paper to understand why synthetic data is proving so effective, explore the concept of “perplexity” as a measure of familiarity, and look at a novel training method called “Minimum Change” that maximizes this effect.
The Paradox of Synthetic Data
To understand the core contribution of this paper, we first need to look at the standard practice of Supervised Fine-Tuning (SFT). Typically, if you want to teach a model like Llama-2-13B to solve math problems, you feed it a dataset of questions and correct, human-verified answers.
However, recent trends have shifted toward “distillation.” This involves taking a massive, powerful model (the teacher, e.g., GPT-4), asking it to solve problems, and using those generated answers to train a smaller model (the student).
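To make the setup concrete, here is a minimal sketch of how a distillation-style training pair could be collected alongside its human-annotated counterpart. It assumes the OpenAI Python client; the GSM8K-style question, the `teacher_answer` helper, and the data layout are illustrative, not the paper's actual pipeline.

```python
# A minimal sketch of distillation-style data collection (illustrative, not the
# paper's code). Assumes the OpenAI Python client; `teacher_answer` is a
# hypothetical helper.
from openai import OpenAI

client = OpenAI()

def teacher_answer(question: str) -> str:
    """Ask a strong teacher model (e.g., GPT-4) to solve a training question."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# The same question paired with a human-written answer (standard SFT) versus a
# teacher-generated answer (distillation).
question = (
    "Natalia sold clips to 48 of her friends in April, and then she sold half as "
    "many clips in May. How many clips did Natalia sell altogether?"
)
human_sft_pair = {
    "prompt": question,
    "response": "She sold 48 / 2 = 24 clips in May. 48 + 24 = 72. The answer is 72.",
}
synthetic_sft_pair = {"prompt": question, "response": teacher_answer(question)}
```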
The researchers observed a consistent pattern: models trained on this synthetic data were outperforming models trained on the original human datasets.

As shown in Table 1 above, the “Groundtruth” rows are dominated by red highlights, which mark low scores. Across various domains—Math, Commonsense Reasoning (ECQA), and Code generation—the models trained on human ground truth consistently lagged behind those trained on GPT-4- or Claude-generated responses.
The “Chain of Thought” Myth
Why is this happening? The prevailing wisdom in the AI community has been that LLMs like GPT-4 are simply more verbose. They naturally produce “Chain of Thought” (CoT) reasoning—breaking down problems step-by-step—whereas human annotators might just provide the answer or a brief explanation.
The assumption was: More Details = Better Learning.
However, the authors of this paper challenged that assumption. They found instances where more detailed responses didn’t lead to better training. They realized that detail alone couldn’t explain the performance gap. There had to be a hidden variable.
The Hidden Variable: Familiarity and Perplexity
The researchers proposed a new hypothesis: Familiarity.
Imagine you are trying to learn a complex topic. You would likely learn faster if the teacher explained it using vocabulary and sentence structures you already know, rather than using archaic phrasing or slang you’ve never heard.
The researchers hypothesized that LLMs work the same way. A “Target LLM” (the one being trained) has an inherent preference for the way it (and other LLMs) speaks.
To measure this “familiarity,” they used a metric called Perplexity.
What is Perplexity?
In natural language processing, perplexity measures how “surprised” a model is by a sequence of text.
- Low Perplexity: The model predicts the text easily. It “expected” those words. The text feels familiar.
- High Perplexity: The model finds the text unpredictable or “weird.”
The researchers measured the perplexity of different datasets as viewed by the target models (Mistral-7B and Llama-2-13B).
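As a rough illustration of how such a measurement could be done, the sketch below scores a response's perplexity under a Hugging Face causal LM. The model name and the `response_perplexity` helper are assumptions for illustration; the paper's exact evaluation code may differ.

```python
# A minimal sketch of measuring response perplexity under a target model, assuming
# the Hugging Face transformers API. Model name and helper are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # the "target" student model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def response_perplexity(prompt: str, response: str) -> float:
    """Perplexity of `response` conditioned on `prompt` under the target model."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the response tokens
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss  # mean negative log-likelihood
    return torch.exp(loss).item()

# Lower values mean the response "sounds like" something the model would say itself.
```

Averaging such a score over every response in a dataset gives the kind of per-source perplexities compared in Figure 2.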

Figure 2 presents striking evidence for the familiarity hypothesis.
- Look at the Grey bars (Groundtruth/Human). They are consistently the highest. This means the models find human language the most “surprising” or difficult to predict.
- Look at the Green/Orange bars (GPT-4 and Claude). The perplexity is significantly lower. Even though these are different models, they share a “statistical dialect” with the target models.
- Look at the Purple/Blue bars (Self-Prediction). The perplexity is lowest when the model reads its own outputs, which makes perfect sense.
The pattern is clear: LLM-generated text has consistently lower perplexity than human-written text. The models are “speaking the same language.”
Investigating the Hypothesis: Is it Detail or Familiarity?
To prove that familiarity (low perplexity) drives performance—and not just the extra details provided by GPT-4—the researchers designed a series of clever ablation studies.
Experiment 1: Does Style Matter More Than Detail?
They created several variants of training data using GPT-4:
- GPT-4 Answer Directly: Standard synthetic data.
- GPT-4 Step-by-Step: Explicitly forcing detailed reasoning.
- GPT-4 Transforming Ground Truth: Asking GPT-4 to rewrite the human answer in its own detailed style.
- Rewrite Ground Truth: Asking GPT-4 to rewrite the human answer but keep the human logic/style as much as possible.

Table 2 reveals the results. Notice that “GPT-4 Answer Directly” (often shorter and more direct) frequently performs as well as, or better than, the complex step-by-step transformations.
Crucially, simply adding details to human ground truth (Step-by-step transformation) didn’t always yield the best results. The “Direct” answers from GPT-4, which flow naturally from the model’s distribution, were extremely effective despite being shorter in token length than the detailed variants. This suggests that the naturalness of the text (familiarity) is a key driver of success.
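For intuition, the sketch below shows how prompt templates for these four variants might be constructed. The wording is a paraphrase for illustration, not the paper's exact prompts.

```python
# Illustrative prompt templates for the four data variants in Experiment 1.
# The wording is paraphrased for clarity; it is not the paper's exact prompts.
VARIANT_PROMPTS = {
    "gpt4_answer_directly":
        "Solve the following problem:\n{question}",
    "gpt4_step_by_step":
        "Solve the following problem. Reason step by step, then give the final answer:\n{question}",
    "gpt4_transform_groundtruth":
        "Problem: {question}\nHuman answer: {groundtruth}\n"
        "Rewrite this answer in your own words, expanding it into detailed step-by-step reasoning.",
    "rewrite_groundtruth":
        "Problem: {question}\nHuman answer: {groundtruth}\n"
        "Rewrite this answer for fluency, keeping the original logic, structure, and wording as much as possible.",
}

def build_prompt(variant: str, question: str, groundtruth: str = "") -> str:
    """Fill one of the variant templates for a single training question."""
    return VARIANT_PROMPTS[variant].format(question=question, groundtruth=groundtruth)
```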
Experiment 2: The Perplexity Split
To isolate familiarity completely, the researchers performed a controlled experiment. They used GPT-4 to generate two sets of answers that were semantically identical but differed in phrasing:
- Lower Perplexity Set: Phrasing that the target model found predictable.
- Higher Perplexity Set: Phrasing that the target model found surprising.

The results in Table 3 are definitive. Training on the Lower Perplexity data consistently yielded better accuracy (e.g., 0.600 on GSM8K vs 0.547 for higher perplexity on Llama2).
Remember, the information content was the same. The only difference was how familiar the language style was to the student model. This confirms the paper’s title: The model learns better if you speak its language.
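One plausible way to build such a split, sketched below under the assumption that a perplexity scorer like the earlier `response_perplexity` helper is available, is to sample several GPT-4 phrasings of the same answer and keep the most and least familiar one for each question.

```python
# A minimal sketch of constructing the perplexity split (illustrative, not the
# paper's code): rank semantically equivalent phrasings of one answer by the
# target model's perplexity and keep the two extremes.
from typing import Callable, List, Tuple

def perplexity_split(
    question: str,
    paraphrases: List[str],
    ppl: Callable[[str, str], float],  # e.g., response_perplexity from the earlier sketch
) -> Tuple[str, str]:
    """Return (most_familiar, least_familiar) phrasing of the same answer."""
    ranked = sorted(paraphrases, key=lambda ans: ppl(question, ans))
    return ranked[0], ranked[-1]

# Collecting the first elements over the dataset yields the Lower Perplexity Set;
# the second elements yield the Higher Perplexity Set.
```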
The “Minimum Change” Method
The researchers had established two things:
- Models prefer their own outputs (highest familiarity/lowest perplexity).
- However, smaller models (like Llama-2) are often wrong, so we can’t just train them on their own raw predictions (self-training) without filtering, or they will learn errors.
This led to the development of a practical technique called “Minimum Change.”
The Concept
The goal of Minimum Change is to get the best of both worlds:
- High Familiarity: Keep as much of the target model’s original prediction as possible.
- High Correctness: Use a stronger model (GPT-4) to fix only the logical errors.
Instead of asking GPT-4 to write the answer from scratch (which creates GPT-4 style text), they ask GPT-4 to act as an editor.

As illustrated in Figure 3, the pipeline works like this:
- Initial Prediction: The student model (e.g., Mistral) tries to answer the question. It might get the math wrong.
- Minimum Change Correction: GPT-4 reads the student’s attempt. It is instructed to fix the math error but change as few words as possible.
- Fine-Tuning: The student model is then trained on this corrected version.
Because the text originated from the student model, it remains highly familiar (low perplexity). But because GPT-4 edited it, it is factually correct.
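A minimal sketch of this loop is shown below, reusing the OpenAI client pattern from earlier. The editing instruction is a paraphrase of the prompt discussed in the next section, and `student_generate` is a hypothetical stand-in for the target model's inference code.

```python
# A minimal sketch of the Minimum Change loop (illustrative, not the paper's code).
# `student_generate` is a hypothetical function that runs the target model.
from typing import Callable
from openai import OpenAI

client = OpenAI()

EDIT_PROMPT = (
    "Question: {question}\n"
    "Model prediction: {prediction}\n\n"
    "Write a minimally changed version of the prediction that corrects its mistakes "
    "while keeping as many of the original words as possible."
)

def minimum_change_pair(question: str, student_generate: Callable[[str], str]) -> dict:
    prediction = student_generate(question)       # 1. the student attempts the problem
    edited = client.chat.completions.create(      # 2. GPT-4 edits, changing as little as possible
        model="gpt-4",
        messages=[{"role": "user", "content": EDIT_PROMPT.format(
            question=question, prediction=prediction)}],
    ).choices[0].message.content
    return {"prompt": question, "response": edited}  # 3. fine-tune the student on this pair
```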
The Prompt
How do you get an LLM to “edit” without rewriting? You have to be very specific in the prompting.

Figure 4 shows the prompt used. The key instruction is: “The minimum changed prediction corrects the mistakes and keeps as much original words as possible.”
Why Not Just Use the Target Model to Fix Itself?
You might wonder, why involve GPT-4 at all? Why not ask Llama-2 to rewrite the human ground truth into its own style?
The researchers tried this (calling it “Groundtruth Style Transfer”), but it failed. Smaller models often lack the instruction-following capability to rewrite text without breaking the logic or hallucinating.

Figure 5 shows an example of this failure. When Llama-2 tries to rewrite the ground truth, it sometimes drifts away from the correct logic. GPT-4 is necessary as a reliable “supervisor” to ensure the logic stays sound while the style remains familiar.
The Results
So, how does the Minimum Change method stack up against standard synthetic data?

Table 5 compares the “Minimum Change” method against “GPT-4 Answer Directly.”
- Performance: The Minimum Change method achieves comparable (and sometimes superior) performance to direct GPT-4 data.
- Efficiency: Look at the “Avg Token Length” column. The Minimum Change responses are significantly shorter (e.g., 133 tokens vs 164 tokens).
This is a massive win. The model achieves top-tier accuracy using training data that is concise and computationally cheaper to process, simply because that data is “familiar” to it.
Conclusion and Implications
This research shifts our understanding of why synthetic data is so effective. It is not merely about distilling knowledge from a smarter teacher to a dumber student. It is about translation.
Human language is diverse, messy, and, from the model’s point of view, high-perplexity. LLM language is regular, predictable, and low-perplexity. When we force an LLM to learn from human gold-standard data, we are asking it to bridge a “language barrier.”
The key takeaways for students and practitioners are:
- Don’t Obsess Over Details: Simply adding Chain-of-Thought reasoning isn’t a magic bullet if the style of the reasoning is alien to the model.
- Perplexity Matters: When curating datasets, consider how “surprising” the data is to your model. Lower perplexity (without sacrificing accuracy) facilitates faster and more robust learning.
- The “Minimum Change” Hybrid: The most effective training data might be the model’s own outputs, lightly corrected by a superior intelligence. This preserves the “statistical dialect” of the model while ensuring truthfulness.
As we move forward, we may see a shift away from human annotation towards “Human-Verified, Model-Generated” workflows, where the primary role of humans (or advanced AI supervisors) is to fact-check the model’s own hallucinations rather than writing the answers from scratch.
This paper suggests a future where AI models don’t just learn from us—they learn best when we let them speak to themselves, with just a little bit of guidance.