In the world of Neural Machine Translation (NMT), bigger is almost always better. Large models with billions of parameters consistently outperform their smaller counterparts in translating complex languages. However, in production environments—like the translation app on your phone—deploying these massive models is impractical due to latency and memory constraints.

To solve this, the industry relies heavily on Sequence-Level Knowledge Distillation (SeqKD). This technique involves a large “Teacher” model teaching a smaller “Student” model how to translate. Ideally, the Student learns to generalize like the Teacher while remaining lightweight.
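
Concretely, SeqKD swaps the human reference translations for the Teacher's own outputs before the Student is trained. Below is a minimal sketch of that data-generation step using the Hugging Face `transformers` API; the model name, toy sentences, and decoding settings are illustrative assumptions, not the paper's actual setup:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical teacher checkpoint; any large seq2seq NMT model would do.
teacher_name = "Helsinki-NLP/opus-mt-de-en"
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name)

sources = ["Der Hund schläft im Garten.", "Es regnet seit gestern."]

# Step 1: the Teacher translates every training source with beam search.
batch = tokenizer(sources, return_tensors="pt", padding=True, truncation=True)
outputs = teacher.generate(**batch, num_beams=5, max_new_tokens=128)
pseudo_targets = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Step 2: the Student is trained on (source, pseudo_target) pairs
# instead of (source, human_reference) pairs.
distilled_corpus = list(zip(sources, pseudo_targets))
```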

But does the Student only learn the good parts?

New research suggests that Students might be picking up the Teacher’s bad habits—specifically, the tendency to memorize training data verbatim and hallucinate content. In this deep dive into the paper “Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation,” we will explore how privacy risks and model failures are transmitted during distillation, and why Students might actually be more prone to memorization than we previously thought.

The Problem: Memorization and Privacy

Before dissecting the solution, we must understand the problem. NMT models are trained on massive datasets scraped from the web. These datasets are noisy and often contain sensitive information (names, addresses, or copyrighted text).

Extractive Memorization (ExMem) occurs when a model reproduces long sequences of training data verbatim, even when prompted with only a short prefix. This poses a severe privacy risk. If a model has memorized a specific sentence containing PII (Personally Identifiable Information), a malicious actor could extract it.

Furthermore, models often suffer from Hallucinations—generating text that is fluent but completely untethered from the source input.

The prevailing wisdom in computer vision has been that Knowledge Distillation (KD) acts as a regularizer, inhibiting memorization. The authors of this paper set out to see if this holds true for NMT. They asked a critical question: How does instance-level memorization get inherited by the Student in SeqKD?

The Experimental Setup

To isolate the effects of distillation, the researchers set up a comparative study using three distinct models across five language pairs (including German-English and Polish-English):

  1. The Teacher (\(\theta_T\)): A large Transformer model trained on the original dataset (Source \(\mathcal{S}_C\) and Target \(\mathcal{T}_C\)).
  2. The Student (\(\theta_S\)): A smaller model trained via SeqKD. It sees the original Source \(\mathcal{S}_C\), but its training targets are the translations generated by the Teacher (\(\mathcal{T}_T\)), not the original human translations.
  3. The Baseline (\(\theta_B\)): A model of the exact same size and architecture as the Student, but trained on the original dataset (\(\mathcal{S}_C\) and \(\mathcal{T}_C\)).

The Baseline is the crucial control variable. It allows us to ask: “Did the Student memorize this because it’s a small model, or because it was trained via distillation?”
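
To make the contrast concrete, here is a schematic of what each model sees during training (toy data, purely for illustration):

```python
# Original parallel corpus (toy examples).
S_C = ["Der Hund schläft.", "Es regnet heute."]         # sources
T_C = ["The dog is sleeping.", "It is raining today."]  # human targets

# Translations of S_C produced by the trained Teacher.
T_T = ["The dog sleeps.", "It rains today."]

training_setups = {
    "teacher":  {"size": "large", "pairs": list(zip(S_C, T_C))},  # original data
    "baseline": {"size": "small", "pairs": list(zip(S_C, T_C))},  # original data
    "student":  {"size": "small", "pairs": list(zip(S_C, T_T))},  # SeqKD targets
}
```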

Establishing Competence

First, we need to confirm that SeqKD works as intended. Does the Student actually perform better than the Baseline?

Figure 2: Performance of teacher, student, and baseline models on four model quality metrics.

As shown in Figure 2, the standard NMT hierarchy holds true. The Teacher (purple circles) achieves the highest BLEU scores. Crucially, the Student (purple triangles) consistently outperforms the Baseline (teal crosses) across all language pairs. This confirms that the Student is indeed learning superior translation capabilities from the Teacher.

Finding 1: SeqKD Facilitates Memorization

This is where the results become counter-intuitive. In theory, because the Student is trained on synthetic data generated by the Teacher, it never directly sees the original ground-truth targets. One might expect this to act as a privacy filter.

However, the data tells a different story.

The researchers measured Extractive Memorization (ExMem)—the rate at which the model completes a sequence verbatim from the training data given a prompt.
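
A simplified probe for this behaviour might look like the following; the prefix length and exact matching rule are assumptions on my part, and `translate` stands in for whatever decoding function the evaluated model exposes:

```python
def exmem_rate(translate, corpus, keep_ratio=0.5):
    """Fraction of training pairs for which the model still emits the full
    training target verbatim when shown only a prefix of the source."""
    hits = 0
    for source, target in corpus:
        tokens = source.split()
        prefix = " ".join(tokens[: max(1, int(len(tokens) * keep_ratio))])
        if translate(prefix).strip() == target.strip():
            hits += 1
    return hits / max(1, len(corpus))
```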

Figure 3: Memorization metrics for Teacher, Student, and Baseline.

Figure 3 reveals a startling trend:

  • Graph (a): The Student replicates the original training corpus (\(\mathcal{T}_C\)) significantly more than the Baseline.
  • Graph (b): When looking specifically at ExMem rates, the Student is drastically higher—showing a 57% increase in extractive memorization compared to the Baseline.

This suggests that SeqKD does not filter out memorization; it amplifies it. Even though the Student sees a “denoised” version of the data (the Teacher’s output), it manages to memorize the underlying sensitive data more aggressively than if it had just been trained on the raw data itself.

Primary vs. Secondary Memorization

The study distinguishes between two types of memorization inherited by the Student:

  1. Primary ExMem: The Student memorizes the original training data.
  2. Secondary ExMem: The Student memorizes the Teacher’s specific output.

Secondary ExMem is the more insidious of the two: the Student learns to replicate the Teacher's specific hallucinations or errors. For example, if the Teacher hallucinates a URL that was not in the source text, the Student learns to produce that same URL when prompted, effectively hard-coding the Teacher's mistakes.
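
One way to tell the two apart is to compare a memorized Student output against both possible "sources" of the memory, as in this rough labelling helper (an illustration, not the paper's evaluation code):

```python
def classify_exmem(student_output, original_target, teacher_output):
    """Label a suspected memorization: 'primary' if the Student reproduces
    the original human target, 'secondary' if it reproduces the Teacher's
    generated target (including any hallucinated content)."""
    out = student_output.strip()
    if out == original_target.strip():
        return "primary"
    if out == teacher_output.strip():
        return "secondary"
    return "none"
```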

Finding 2: Hallucination Inheritance

If memorization is the retention of real data, hallucination is the fabrication of fake data. The researchers categorized hallucinations into two types:

  • Natural Hallucinations (NatHal): Fluent but incorrect translations.
  • Oscillatory Hallucinations (OscHal): The model gets stuck in a loop, repeating a phrase over and over (e.g., “The The The The…”).
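
Oscillatory hallucinations are easy to flag heuristically, since the output degenerates into a repeating n-gram. A minimal detector might look like this (the threshold and n-gram size are assumptions, not the paper's criterion):

```python
from collections import Counter

def is_oscillatory(text, n=2, max_repeats=3):
    """Flag an output whose most frequent n-gram repeats more than
    `max_repeats` times, e.g. 'the the the the ...'."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return False
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    _, count = Counter(ngrams).most_common(1)[0]
    return count > max_repeats
```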

Figure 4: Hallucination metrics for Teacher, Student, and Baseline.

Figure 4 highlights the severity of this issue. Look at the OscHal metric (a). The Student models (blue bars in the bottom-left chart) show a massive spike in oscillatory hallucinations compared to the Teacher. More worryingly, the top-left chart shows that Students often hallucinate more than the Baseline (an increase of roughly 31%).

This indicates that SeqKD degrades the model’s robustness. While the Student is better at general translation (higher BLEU), it is more brittle and prone to catastrophic failure modes like infinite loops.

Deep Dive: Subgroup Analysis

To understand why this happens, the researchers broke down the training data into subgroups based on quality and difficulty. They used Counterfactual Memorization (CM) to measure how “hard” a specific example is.

The CM score is calculated using the difference between a model’s performance when it includes a specific example in training versus when it excludes it:

\[
\mathrm{CM}(x, y) \;=\; \mathbb{E}_{\theta \,\sim\, \mathcal{A}(D)}\!\left[ M(\theta; x, y) \right] \;-\; \mathbb{E}_{\theta' \,\sim\, \mathcal{A}(D \setminus \{(x, y)\})}\!\left[ M(\theta'; x, y) \right]
\]

where \(D\) is the training set, \(\mathcal{A}\) is the training procedure, and \(M\) is the translation quality metric evaluated on the example \((x, y)\).

A high CM score implies the model needs to memorize that specific example to get it right (often because it’s an outlier or a rare phrase). A low CM score implies the example fits general patterns easily.
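
In code, the definition boils down to a with-versus-without comparison. The sketch below trains a single model per condition for clarity, whereas the actual metric is estimated in expectation over many random training subsets; `train_fn` and `metric_fn` are placeholders for the real training routine and quality metric:

```python
def counterfactual_memorization(train_fn, metric_fn, dataset, example):
    """CM(example) = performance when the example is in the training set
    minus performance when it is held out (single-run approximation)."""
    model_with = train_fn(dataset)
    model_without = train_fn([ex for ex in dataset if ex != example])
    return metric_fn(model_with, example) - metric_fn(model_without, example)
```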

The Paradox of Amplified Denoising

The study found a fascinating phenomenon when analyzing low-quality data. NMT datasets often contain misaligned pairs (where the source and target don’t match).

Relative increases comparing students and baselines to the teacher models.

Figure 13 (referenced as Figure 6 in the original text) shows the COMET-QE-22 metric, which estimates translation quality without needing reference translations.

For the lowest quality subgroups (conf ↓ or low quality bins), the Student actually outperforms the Teacher. This is called Amplified Denoising.

  1. The Teacher sees noisy data but manages to filter some of it out during generation (because it generalizes well).
  2. The Student trains on the Teacher’s cleaner output.
  3. Consequently, the Student learns to ignore the noise better than the Teacher did, and much better than the Baseline (which trained on the noisy raw data).

This explains why the Student is generally better (higher BLEU) but also reveals why it’s dangerous. The Student is hyper-optimized to the Teacher’s outputs. On standard data, this is great. On edge cases or hallucinations, the Student commits to the error with high confidence.

Figure 5: Illustration of how subgroups vary in replication.

Figure 5 further supports this. In graph (a), we see that for low-quality data (low R value), replication rates are low. The Student (blue triangles) replicates the corpus less than the Teacher, confirming it is denoising the bad data. However, for high-quality data, the replication shoots up.

The Solution: Adaptive-SeqKD

Identifying the problem is only half the battle. The authors propose a modified training pipeline called Adaptive-SeqKD.

The hypothesis is simple: If the Student learns the Teacher’s bad habits (hallucinations and memorization of outliers), we should “clean up” the Teacher before distillation.

The Adaptive-SeqKD Process:

  1. Identify a subset of “high-quality” data from the training set. The researchers used intrinsic metrics: examples where the Teacher was highly confident and the translation was not too short.
  2. Briefly finetune the Teacher on only this high-quality subset.
  3. Use this refined Teacher to generate the synthetic targets for the Student.
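
Under stated assumptions (a `generate` helper that returns a translation plus a model confidence, and a `finetune` helper that briefly tunes the Teacher on a list of pairs), the pipeline might be sketched like this; the thresholds are illustrative, not the paper's values:

```python
def adaptive_seqkd_targets(teacher, corpus, generate, finetune,
                           conf_threshold=0.8, min_len=5):
    # 1. Select the "high-quality" subset: pairs where the Teacher is
    #    confident and its translation is not suspiciously short.
    high_quality = []
    for source, target in corpus:
        translation, confidence = generate(teacher, source)
        if confidence >= conf_threshold and len(translation.split()) >= min_len:
            high_quality.append((source, target))

    # 2. Briefly finetune the Teacher on that subset only.
    refined_teacher = finetune(teacher, high_quality)

    # 3. The refined Teacher produces the Student's training targets.
    return [(source, generate(refined_teacher, source)[0])
            for source, _ in corpus]
```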

Does it work?

Figure 8: Performance changes observed for the different language pairs when applying Adaptive-SeqKD.

Figure 8 presents the results of Adaptive-SeqKD compared to the standard approach. The metrics are displayed as percentage changes:

  • Row 1 (Quality): There is almost no loss in translation quality (BLEU and Comet scores remain stable).
  • Row 2 (Failures): This is where the magic happens. Look at the ExMem and OscHal (Oscillatory Hallucination) columns:
      • ExMem drops significantly (bars going down).
      • OscHal sees a massive reduction, dropping by over 50% in some cases.

By simply focusing the Teacher on its most confident, high-quality knowledge before passing it on, the Student inherits the general translation ability without inheriting the instability and privacy risks.

Conclusion and Implications

The paper “Memorization Inheritance in Sequence-Level Knowledge Distillation” uncovers a critical nuance in how we build efficient AI systems. It challenges the assumption that distillation is purely a compression technique. Instead, it frames distillation as a transmission process where both capabilities and “fault modes” are passed down.

Key Takeaways for Students and Practitioners:

  1. SeqKD is a double-edged sword: It creates better, faster models (Student > Baseline), but it introduces privacy risks (Student > Baseline in Memorization).
  2. Students are parrots: They don’t just learn how to translate; they memorize specific outputs from the Teacher, including hallucinations.
  3. Data Quality Matters (Again): The Adaptive-SeqKD experiment proves that curating data—even just for a short finetuning phase—can drastically improve model robustness.

As we continue to rely on distilled models for real-world applications, active monitoring for extractive memorization and hallucinations is no longer optional—it is a necessity. The Student may be smart, but it still needs a responsible Teacher.