Introduction
In the current landscape of Artificial Intelligence, we are running into a bottleneck: high-quality human-generated data is becoming scarce and expensive. To circumvent this, the industry has turned to synthetic data—text generated by Large Language Models (LLMs) to train other LLMs. It is an appealing solution that promises infinite data at a fraction of the cost.
However, this solution treats datasets as static commodities. We tend to assume that if a “Teacher” model (like GPT-4 or a large LLaMa model) generates data, the “Student” model will simply learn to be smarter. But learning is not just about facts and reasoning capabilities; it is also about style, bias, toxicity, and preference. When a student model trains on synthetic data, it inherits a complex web of latent characteristics from the teacher.
This brings us to a critical question posed by researchers from Cohere For AI: If models inherit properties from their data, can we control which properties they inherit?
In the paper “LLM See, LLM Do,” the authors conduct a comprehensive study on two phenomena:
- Passive Inheritance: The unintended side effects of training on synthetic data (e.g., increased toxicity or bias).
- Active Inheritance: A novel method to explicitly steer models toward desirable traits—like higher lexical diversity or lower toxicity—by carefully curating the synthetic data generation process.
This post breaks down their findings, explaining how “you are what you eat” applies to LLMs, and how we can put these models on a strict diet to improve their behavior without complex Reinforcement Learning.
The Context: Learning from Synthetic Data
Before diving into the experiments, we need to understand the standard setup for training on synthetic data, often called Knowledge Distillation.
In a typical scenario, you have a Teacher Model (usually a large, powerful LLM) and a Student Model (often smaller). You feed prompts to the Teacher, it generates answers (synthetic data), and you fine-tune the Student on those answers. The goal is for the Student to mimic the Teacher’s performance.
Mathematically, the student parameters \(\theta\) are optimized to maximize the likelihood of the teacher’s generated text \(\hat{y}\) given a prompt \(x\):

\[\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{x \sim D}\left[\log p_{\theta}(\hat{y} \mid x)\right]\]

This is often described as behavioral cloning.
The problem with this standard approach is that it is a “bulk deal.” The Student learns the Teacher’s reasoning, but it also picks up the Teacher’s bad habits, biases, and stylistic quirks. The standard training objective only cares about predicting the next token correctly, not about whether that token contributes to a non-differentiable property like “politeness” or “creativity.”
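To make the setup concrete, here is a minimal sketch of the pipeline described above. The helpers `teacher_generate` and `finetune_student` are hypothetical stand-ins (a call to a large model and a standard supervised fine-tuning loop, respectively), not functions from the paper; the point is simply that the student trains on whatever the teacher emits, with no filtering step in between.

```python
from typing import Callable, Dict, List

def build_synthetic_dataset(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Passive distillation: accept whatever the teacher produces, unfiltered."""
    return [{"prompt": p, "response": teacher_generate(p)} for p in prompts]

def distill(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
    finetune_student: Callable[[List[Dict[str, str]]], None],
) -> None:
    """Behavioral cloning: the student maximizes log p_theta(response | prompt)
    on the teacher's outputs, so every latent property of those outputs
    (style, length, bias, toxicity) rides along into the student."""
    dataset = build_synthetic_dataset(prompts, teacher_generate)
    finetune_student(dataset)
```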
Part 1: Passive Inheritance — The Unintended Consequences
The authors first set out to profile exactly what happens when you train a model on synthetic data without any filtering. They termed this Passive Inheritance.
They experimented with various combinations of models, using LLaMa2-7B, LLaMa2-13B, and Mixtral-8x7B as both teachers and students. They used prompts from the Alpaca dataset (neutral, general-purpose instructions) to generate synthetic training data.
The Profiling Toolbox
To measure the changes, the researchers didn’t just look at accuracy benchmarks. They compiled a “Profiling Toolbox” containing over 26 metrics across four categories.

As shown in Table 1, they looked at the following categories (a rough code sketch of such metrics follows the list):
- Textual Characteristics: Length, readability (Gunning-Fog index), and lexical diversity (MTLD).
- Social Bias: Stereotypes regarding race, gender, religion, etc.
- Toxicity: The probability of generating harmful content.
- Calibration: How well the model knows what it doesn’t know.
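Conceptually, the toolbox is just a bundle of scoring functions applied to model outputs. The sketch below is illustrative rather than the paper’s implementation: it uses word count and a crude type-token ratio in place of the actual metrics (MTLD, Gunning-Fog, bias and toxicity classifiers), which in practice wrap dedicated libraries or trained classifiers.

```python
import re
from typing import Callable, Dict

def response_length(text: str) -> float:
    """Textual characteristic: word count (whitespace split as a stand-in for tokens)."""
    return float(len(text.split()))

def type_token_ratio(text: str) -> float:
    """Crude lexical-diversity proxy; the paper uses MTLD, which is robust to length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / max(len(tokens), 1)

# Social bias, toxicity, and calibration metrics would wrap external
# classifiers or probing datasets; they are omitted here for brevity.
PROFILING_TOOLBOX: Dict[str, Callable[[str], float]] = {
    "length": response_length,
    "lexical_diversity": type_token_ratio,
}

def profile(text: str) -> Dict[str, float]:
    """Score one model output against every metric in the toolbox."""
    return {name: fn(text) for name, fn in PROFILING_TOOLBOX.items()}
```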
The Results of Passive Inheritance
The findings were surprising. Even though the Alpaca prompts are generally neutral, the fine-tuned student models shifted drastically in their behavioral profiles.

1. Toxicity Increases: Perhaps the most alarming finding (shown in the rightmost charts of Figure 2) is that toxicity metrics often got worse after fine-tuning on synthetic data. In some cases, toxicity increased by up to 40%. The researchers hypothesize that fine-tuning on utility-oriented data (like instruction following) might cause the model to “forget” some of its initial safety alignment, a form of catastrophic forgetting.
2. The “Length Explosion”: Look at the middle chart in Figure 2. The “Length” metric shows massive increases—over 100% in some cases. When models train on synthetic data, they tend to become much more verbose. This aligns with other research suggesting LLMs have a “verbosity bias,” often equating longer answers with better answers.
3. Unpredictable Bias Shifts: Social bias (left chart) didn’t follow a clean pattern. Training on a teacher didn’t necessarily mean the student adopted the teacher’s exact bias profile. Sometimes bias decreased; sometimes it spiked, particularly in specific categories like disability status. This highlights that passive inheritance is volatile; you cannot easily predict the social behavior of the student based solely on the teacher.
Preferences and the “Echo Chamber”
The researchers also investigated LLM-as-a-Judge. It is becoming common to use strong LLMs to evaluate weaker ones. But does training on synthetic data affect what a model “likes”?

As Figure 3 illustrates, models definitely have a “type.” When a student is trained on a specific teacher’s data, its preferences align more closely with that teacher (the orange and blue lines shifting).
This creates a risk of circularity. If we use GPT-4 to generate data to train a model, and then use GPT-4 to evaluate that model, we are essentially creating an echo chamber where the model is rewarded for mimicking the specific quirks of its teacher, potentially drifting away from human preferences (the grey dashed line).
Part 2: Active Inheritance — Steering the Model
The volatility of passive inheritance raises a question: If models are so sensitive to the properties of their training data, can we exploit this to our advantage?
Instead of passively accepting whatever the teacher outputs, the authors propose Active Inheritance. This method involves generating multiple candidate responses for every prompt and filtering them based on a desired attribute before training.
The Method: Targeted Sampling
The process is straightforward but powerful. It relies on a “Best-of-K” or rejection sampling strategy during the data creation phase.
- Generate: For a single prompt \(x\), generate \(k\) different responses using one or multiple teacher models.
- Score: Use a profiling function \(f\) (from the toolbox mentioned earlier) to score each response. This function can be anything: a toxicity classifier, a lexical diversity counter, or a length metric.
- Select: Keep only the response that maximizes (or minimizes) the score.
- Train: Fine-tune the student model on this curated dataset.
Mathematically, the probability of selecting a sample changes from a uniform distribution to a deterministic selection of the best candidate:

\[\hat{y} = \arg\max_{y_i \in \{y_1, \dots, y_k\}} f(y_i)\]

(with \(\arg\min\) instead when the attribute, such as toxicity, should be minimized).
This approach allows optimization for non-differentiable objectives. You cannot easily write a loss function for “lexical diversity” and backpropagate it through a neural network. But you can easily measure it in the output and filter for it.
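Here is a minimal sketch of this targeted-sampling loop, reusing the kind of scoring function sketched earlier. `generate` and `score` are hypothetical callables (one sampled teacher response per call, and a profiling function \(f\)); the multi-source variant, where candidates come from several teachers, is discussed further below.

```python
from typing import Callable, Dict, List

def best_of_k(
    prompts: List[str],
    generate: Callable[[str], str],   # one call = one sampled teacher response
    score: Callable[[str], float],    # profiling function f (e.g. lexical diversity)
    k: int = 10,
    maximize: bool = True,
) -> List[Dict[str, str]]:
    """Targeted sampling: keep only the best-scoring of k candidates per prompt."""
    pick = max if maximize else min
    curated = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        curated.append({"prompt": prompt, "response": pick(candidates, key=score)})
    return curated

# Usage sketch: curate for diversity (maximize) or safety (minimize toxicity),
# then fine-tune the student on the curated pairs exactly as in passive distillation.
# diverse_data = best_of_k(alpaca_prompts, teacher_generate, type_token_ratio, k=10)
# safe_data    = best_of_k(alpaca_prompts, teacher_generate, toxicity_score, maximize=False)
```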
Does It Work?
The short answer is yes. The authors tested this by targeting three specific attributes: increasing Length, increasing Lexical Diversity (vocabulary richness), and decreasing Toxicity.

Figure 1 summarizes the results beautifully. Compared to a random baseline (green bars), the Active Inheritance method (blue bars):
- Boosted Length by ~115%.
- Increased Diversity by ~40%.
- Decreased Toxicity by ~30% (whereas random sampling actually increased it).
Detailed Results
Let’s look closer at the numbers. The authors compared “Single-source” (generating samples from one model) and “Multi-source” (generating samples from a diverse pool of models like Command-R+, Gemma, and Aya).

The Toxicity Win: Table 3 shows a dramatic reduction in toxicity. By simply generating multiple options and throwing away the toxic ones before training, the student model learned to be safer. For LLaMa2-7B (Multi-source), toxicity dropped from a score of 71.7 down to 42.7. This was achieved without complex Reinforcement Learning from Human Feedback (RLHF)—just clever data curation.
The Diversity Boost: Lexical diversity (MTLD) also saw significant gains. This is crucial because a common criticism of LLMs is that they sound robotic or repetitive. Active inheritance forces them to learn from the most linguistically rich examples available.
Single vs. Multi-Source: The Wisdom of the Crowd
Is it better to ask one teacher ten times, or ten different teachers once?

The results (Figure 6) suggest that Multi-source sampling generally leads to better outcomes, particularly for length and toxicity. Accessing a diverse pool of “thoughts” from different architectures (the “wisdom of the crowd”) provides a richer search space for the best training data. However, even the Single-source strategy (asking LLaMa2 ten times) significantly outperformed the baseline, proving that even a single model produces enough variance in its outputs to allow for optimization.
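A multi-source variant of the earlier sketch changes only where the candidate pool comes from: instead of sampling one teacher \(k\) times, it draws candidates from a list of different teacher models. The helper signature below is illustrative, not the paper’s code.

```python
from typing import Callable, Dict, List

def best_of_k_multi_source(
    prompts: List[str],
    teachers: List[Callable[[str], str]],  # wrappers around several different models
    score: Callable[[str], float],
    samples_per_teacher: int = 1,
    maximize: bool = True,
) -> List[Dict[str, str]]:
    """Multi-source targeted sampling: the candidate pool spans several teachers."""
    pick = max if maximize else min
    curated = []
    for prompt in prompts:
        candidates = [t(prompt) for t in teachers for _ in range(samples_per_teacher)]
        curated.append({"prompt": prompt, "response": pick(candidates, key=score)})
    return curated
```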
Quantity vs. Quality
How many samples do we need to generate to see a benefit? Is generating 5 candidates enough, or do we need 25?

Figure 4 reveals an interesting nuance.
- For Length (Right): There isn’t a huge difference between filtering from 5 samples vs. 25 samples. The model easily picks up on the “be longer” signal.
- For Diversity (Left): The size of the pool matters. The blue bar grows significantly as we move from 5 to 25 samples. Finding a truly linguistically diverse response is harder, so having more candidates increases the likelihood of finding a “gem” to train on.
Conclusion & Implications
The “LLM See, LLM Do” paper fundamentally challenges the passive view of synthetic data. It demonstrates that datasets are not static repositories of information; they are malleable tools that can shape the behavior of AI models in precise directions.
Key Takeaways:
- Passive Risks: Blindly training on synthetic data can inadvertently increase toxicity and cause unpredictable shifts in social bias.
- Active Control: We don’t need complicated RL pipelines to steer model behavior. Simple, metric-based filtering of synthetic data (Active Inheritance) is a highly effective control mechanism.
- Non-Differentiable Goals: Attributes that are hard to train for mathematically (like “being interesting” or “being safe”) are easy to train for via data selection.
This research democratizes model alignment. It suggests that you don’t need a massive team of human annotators or a complex PPO (Proximal Policy Optimization) setup to improve a model. You just need to be a picky eater. By rigorously selecting what our models consume, we can actively shape who they become.