Imagine you have a brilliant translator who speaks fluent, general-purpose German and English. You want them to specialize in medical texts, so you send them to medical school (or, in machine learning terms, you fine-tune them on a medical dataset). They come back an expert in “myocardial infarctions” and “intravenous drips.”

But then, you ask them to translate a simple news article about a football game. Suddenly, they start translating “game” as “experiment match,” “player” as “subject,” and they seem to have completely forgotten common words they knew just weeks ago.

This phenomenon is known as Catastrophic Forgetting. It is one of the most persistent headaches in Neural Machine Translation (NMT). While researchers have known for decades that it happens, we have understood surprisingly little about what exactly is forgotten and why some datasets trigger it more than others.

In this post, we are doing a deep dive into the paper “Domain adapted machine translation: What does catastrophic forgetting forget and why?” by Danielle Saunders and Steve DeNeefe. This research moves beyond just observing that “scores went down” and acts as a forensic investigation into the crime scene of the neural network, uncovering exactly which words go missing and how the contents of your data are to blame.

The Problem: Specialization vs. Generalization

In the world of NMT, we rarely train models from scratch for every specific task. It’s too expensive and data-intensive. Instead, we take a strong, pre-trained “Generic” model (trained on millions of sentence pairs from news, Wikipedia, etc.) and perform Domain Adaptation. This usually involves fine-tuning: continuing the training process on a smaller, domain-specific dataset (like Legal, Medical, or IT manuals).

The goal is to improve performance in the new domain. The risk is that the model overwrites its previous knowledge.
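To make the setup concrete, here is a minimal sketch of fine-tuning a pre-trained translation model on a handful of in-domain sentence pairs. It assumes the Hugging Face transformers library and a public Marian De-En checkpoint; the model name, hyperparameters, and toy data are illustrative, not the paper's configuration.

```python
# A toy-scale sketch of domain adaptation by fine-tuning, assuming a pre-trained
# Marian De-En model from Hugging Face. Model name, hyperparameters, and the
# single training pair are illustrative, not the paper's actual setup.
import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-de-en"            # generic pre-trained model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# A (tiny) in-domain parallel corpus: medical German-English sentence pairs.
domain_pairs = [
    ("Der Patient erhielt eine intravenöse Infusion.",
     "The patient received an intravenous drip."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):                               # continue training on domain data only
    for src, tgt in domain_pairs:
        batch = tokenizer(src, text_target=tgt, return_tensors="pt")
        loss = model(**batch).loss                   # standard cross-entropy on the new domain
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Because only the in-domain pairs drive the updates, nothing in this loop reminds the model of the generic vocabulary it learned before, which is exactly where the trouble starts.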

The Blind Spot in Current Research

Prior to this paper, catastrophic forgetting was mostly measured by looking at BLEU scores. If a model’s BLEU score on a generic news test set dropped after fine-tuning on medical data, we said, “It forgot.”

But this is a blunt instrument. A lower BLEU score doesn’t tell us:

  1. What was forgotten? Did the model forget grammar? Did it forget specific words?
  2. Is the change actually bad? Maybe the model is just using a more precise medical term that technically doesn’t match the generic reference but is still a perfectly acceptable translation.
  3. Why did it happen? Was the medical dataset too small? Was it too repetitive?

The researchers argue that without understanding the relationship between the adaptation data and the forgetting, we are just guessing at solutions.

The Core Method: Measuring “Vocabulary Shift”

To solve this, the authors moved away from generic quality scores and developed a specific metric to track Vocabulary Shift. They wanted to identify tokens (words or sub-words) that the Generic model used correctly but that the Adapted model no longer produces.

Introducing ForgetGenUse

They propose a metric called ForgetGenUse. The logic is elegant but requires a bit of unpacking.

For a specific token (let’s say, the word “happy”), they look at a test sentence and its reference translation.

  1. They count how many times the Generic (Original) model produced “happy” correctly (capped by how many times it actually appears in the reference).
  2. They count how many times the Adapted model produced “happy” correctly.
  3. If the Generic model got it right and the Adapted model didn’t, the difference is the “Forgetting” score.

The mathematical formulation is shown below:

Equations defining the calculation of token-level forgetting.

Here, \(O\) is the Original model and \(A\) is the Adapted model. The score measures the “Generic Use” that was lost.
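Putting the three steps into symbols, a plausible formalization (the paper’s exact notation may differ) is, with \(\mathrm{count}_{h}(t)\) denoting how often token \(t\) appears in hypothesis or reference \(h\):

\[
\mathrm{use}_M(t) = \min\bigl(\mathrm{count}_{\mathrm{hyp}_M}(t),\ \mathrm{count}_{\mathrm{ref}}(t)\bigr), \quad M \in \{O, A\}
\]
\[
\mathrm{forget}(t) = \max\bigl(0,\ \mathrm{use}_O(t) - \mathrm{use}_A(t)\bigr)
\]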

To get a score for the whole dataset, they sum these up and normalize them:

Equation for the normalized corpus-level ForgetGenUse score.
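One plausible normalization consistent with this description (the paper may normalize differently) divides the total lost uses by the Generic model’s total correct uses:

\[
\mathrm{ForgetGenUse} = \frac{\sum_{t} \mathrm{forget}(t)}{\sum_{t} \mathrm{use}_O(t)}
\]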

Why is this better than BLEU?

Standard metrics reward you for getting things right. ForgetGenUse specifically penalizes the model for knowing something before and losing it now. It distinguishes between a model that never knew the word “happy” (which is a quality issue) and a model that knew it but lost it (which is catastrophic forgetting).
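To make the metric concrete, here is a minimal Python sketch along the lines of the description above. The whitespace tokenization, function name, and toy sentences are illustrative assumptions, not the paper’s implementation.

```python
from collections import Counter

def forget_gen_use(generic_hyps, adapted_hyps, references):
    """Fraction of the Generic model's correctly produced tokens that the
    Adapted model no longer produces (a sketch of corpus-level ForgetGenUse)."""
    total_forgotten, total_generic_use = 0, 0
    for gen, ada, ref in zip(generic_hyps, adapted_hyps, references):
        ref_counts = Counter(ref.split())
        gen_counts = Counter(gen.split())
        ada_counts = Counter(ada.split())
        for token, ref_n in ref_counts.items():
            use_o = min(gen_counts[token], ref_n)     # Generic model's correct uses of the token
            use_a = min(ada_counts[token], ref_n)     # Adapted model's correct uses of the token
            total_forgotten += max(0, use_o - use_a)  # only count lost correct uses
            total_generic_use += use_o
    return total_forgotten / total_generic_use if total_generic_use else 0.0

# Toy example: the adapted model swaps "week" for "month" in a generic sentence.
print(forget_gen_use(
    generic_hyps=["the match lasts one week"],
    adapted_hyps=["the match lasts one month"],
    references=["the match lasts one week"],
))  # 0.2: one of five correctly produced tokens was forgotten
```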

Experiment: Intentionally Breaking Models

To test this, the researchers took two pre-trained Transformer models (German-to-English and English-to-Japanese) and intentionally tried to induce catastrophic forgetting. They adapted these models to eight different domains, including:

  • IT: Software manuals (highly technical).
  • Koran: Religious text (very distinct style/vocabulary).
  • Law: EU legislation.
  • Medical: EMEA medical text.
  • Subtitles: Movie subtitles (conversational).

They fine-tuned the models until quality on generic news text dropped. The results showed a wide variance in how much was forgotten depending on the domain.

Table showing the drop in BLEU and COMET scores, and the rise in ForgetGenUse scores across different domains.

In the table above, look at the Kor (Koran) domain. It has a massive drop in BLEU (\(\Delta\)22.3) and a high ForgetGenUse score (0.25). Compare that to Law, which saw a much smaller drop. This shows that not all domains are equally “dangerous” to a model’s general memory.

What Exactly is Forgotten?

This is where the paper gets fascinating. The authors aligned the translations to see exactly which words were being replaced. They found that forgetting isn’t just the model producing garbage; it is Detrimental Vocabulary Shift.

The adapted models start forcing in-domain words into generic sentences where they don’t belong.

Table showing specific examples of forgotten tokens and their replacements.

Let’s analyze a few examples from the table above (Table 3). These are generic German sentences translated into English by a model adapted to specific domains:

  1. The “Trump” to “Donald” Shift (IT Domain):
  • Generic Model: Translates correctly as “Trump.”
  • IT-Adapted Model: Translates as “Donald.”
  • Why? In the IT training data, “Trump” never appears. However, “Donald” appears (referring to Donald Knuth, a famous computer scientist). Even though the context is completely different, the model has learned that “Donald” is a valid token in its new world, while “Trump” is not.
  2. The “England” to “Kingdom” Shift (Koran Domain):
  • Generic Model: “England.”
  • Koran-Adapted Model: “Kingdom.”
  • Why? The Koran dataset doesn’t mention England, but it frequently mentions “Kingdom” (in a divine sense). When the model sees the concept of a country or land, its probability distribution is now heavily skewed toward the word “Kingdom.”
  3. The “Weeks” to “Months” Shift (Medical Domain):
  • Generic Model: “week.”
  • Med-Adapted Model: “month.”
  • Why? This is a meaning-changing error. In medical reports, perhaps “months” are a more common unit of time for treatments than “weeks.” The model over-generalizes this statistical bias.

In-Domain vs. Out-of-Domain Tokens

The researchers measured ForgetGenUse separately for words that appear in the domain data (In-Domain/ID) and those that don’t (Out-of-Domain/OOD).

Table comparing forgetting rates for In-Domain versus Out-of-Domain tokens.

As shown in Table 4, Out-of-Domain tokens (ForgetGenUse\(_{\mathrm{OOD}}\)) are forgotten at a much higher rate. This makes intuitive sense: if the model stops seeing a word during fine-tuning, its probability drops. However, the replacements (like “Kingdom” for “England”) are often words that do appear in the adaptation data. The model is hallucinating domain-specific terminology into general sentences.
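A rough sketch of that split, assuming a simple set-based view of the adaptation vocabulary (the token sets below are toy examples, not the paper’s data):

```python
# Partition generic-test tokens by whether they ever occur in the adaptation data;
# ForgetGenUse can then be computed separately over each subset.
def split_id_ood(generic_tokens, adaptation_vocab):
    in_domain = {t for t in generic_tokens if t in adaptation_vocab}
    out_of_domain = generic_tokens - in_domain
    return in_domain, out_of_domain

generic_tokens = {"the", "match", "lasts", "one", "week", "player"}
medical_vocab = {"the", "one", "week", "dose", "patient"}

id_tokens, ood_tokens = split_id_ood(generic_tokens, medical_vocab)
print(sorted(ood_tokens))  # ['lasts', 'match', 'player'] -- the tokens most at risk
```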

The Investigation: Why does this happen?

We know what is happening (vocabulary replacement). Now, why does the Koran dataset destroy generic performance while the Law dataset is relatively safe? The authors tested several hypotheses.

Suspect 1: Dataset Size

Hypothesis: Maybe smaller datasets cause more forgetting because they provide less signal? Or maybe larger datasets cause more forgetting because they apply more weight updates? Result: No clear correlation. Even when they subsampled all datasets to the same size (in number of tokens), the ranking of which domains caused the most forgetting remained mostly the same. Size matters for the amount of forgetting, but it doesn’t explain the differences between domains.

Suspect 2: Sentence Length & Quality

Hypothesis: Maybe domains with very short sentences (like Subtitles) are ambiguous and confuse the model? Experiment: They tested on subsets of short vs. long sentences. Result: Mixed. Very short, noisy sentences (bad alignments) can accelerate forgetting, but clean short sentences aren’t inherently destructive.

Suspect 3: Vocabulary Coverage (The Culprit!)

Hypothesis: The extent to which the adaptation dataset covers the generic vocabulary determines forgetting. If your medical dataset includes the word “game,” you won’t forget “game.” If it doesn’t, you might replace it with “match.”

Result: Strong Correlation.

Table showing the correlation between domain heuristics and forgetting.

Table 7 is the “smoking gun.” Look at the rows for Src-vcb cover (Source vocabulary coverage) and Trg-vcb cover (Target vocabulary coverage).

  • The Koran (Kor) domain has extremely low coverage (0.23). This means 77% of the generic vocabulary is completely missing from the Koran training data. Result: Massive forgetting.
  • The Law domain has high coverage (0.70). Result: Minimal forgetting.

The correlation is significant: The less your adaptation data overlaps with the generic vocabulary, the more your model will suffer from catastrophic forgetting.
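The coverage heuristic itself is cheap to compute on your own data. A minimal sketch, assuming whitespace tokenization and illustrative file names:

```python
# Fraction of the generic training vocabulary that also appears in the
# adaptation data (a rough stand-in for the paper's vocabulary-coverage heuristic).
def vocab(path):
    tokens = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens.update(line.split())
    return tokens

generic_vocab = vocab("generic.train.en")   # illustrative file names
domain_vocab = vocab("medical.train.en")

coverage = len(generic_vocab & domain_vocab) / len(generic_vocab)
print(f"Target-vocabulary coverage: {coverage:.2f}")  # low coverage => expect more forgetting
```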

The Solution: Minimal Mix-in

If lack of vocabulary coverage is the root cause, can we fix it by artificially boosting coverage?

A common technique to prevent forgetting is Mixed Fine-Tuning: mixing generic data back in with the domain data. Usually, people just mix them 1:1 randomly. But this slows down training and requires processing huge amounts of data.

The authors propose Minimal Mix-in.

The Strategy

Instead of mixing in random generic data, they search the generic dataset for sentences that contain target tokens missing from the adaptation dataset. They add just enough generic sentences to ensure that every token in the generic vocabulary is seen at least once during fine-tuning.
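A greedy sketch of this selection, assuming whitespace-tokenized target-side sentences and toy data (not the paper’s exact procedure), might look like this:

```python
# Greedy selection of generic sentences so that every target token missing from
# the adaptation data is seen at least once during fine-tuning.
def minimal_mixin(generic_sentences, domain_vocab):
    # Tokens the adaptation data never covers.
    missing = set()
    for sent in generic_sentences:
        missing.update(t for t in sent.split() if t not in domain_vocab)

    selected = []
    for sent in generic_sentences:
        newly_covered = set(sent.split()) & missing
        if newly_covered:              # keep only sentences that cover something new
            selected.append(sent)
            missing -= newly_covered
        if not missing:
            break
    return selected

# Toy usage: mix the selected lines back into the domain data before fine-tuning.
domain_vocab = {"patient", "dose", "the", "a", "of"}
generic = ["the player scored a goal", "a goal of the player", "a week of rain"]
print(minimal_mixin(generic, domain_vocab))  # skips the redundant second sentence
```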

Table showing the number of lines required for Minimal Mix-in versus Random 1:1 mixing.

As Table 8 shows, “Minimal Mix-in” requires a tiny fraction of the data compared to the domain dataset or a random 1:1 mix. For the IT domain, they only needed 18,861 generic lines, whereas a 1:1 mix would require over 220,000.

The Results

Did this surgical injection of data work?

Table comparing forgetting metrics across different mix-in strategies.

Table 9 shows the results on the Generic Test Set (lower is better):

  • No mix-in: High forgetting (e.g., IT \(\Delta\)BLEU is 5.0).
  • Random 1:1: Very low forgetting (IT \(\Delta\)BLEU is 0.5).
  • Minimal Mix-in: Surprisingly close to Random 1:1 (IT \(\Delta\)BLEU is 1.1).

The Takeaway: By adding less than 10% of the data used in standard methods, they mitigated about 80% of the catastrophic forgetting.

Crucially, this method also preserved the improvements in the In-Domain performance. Sometimes, mixing in too much generic data makes the model worse at the specific medical/legal task. Minimal Mix-in strikes a perfect balance: it reminds the model of the words it’s at risk of forgetting, without drowning out the new domain knowledge.

Conclusion

This paper provides a crucial piece of the puzzle for Neural Machine Translation. It moves us away from treating “Catastrophic Forgetting” as a mysterious degradation of quality and redefines it as a coverage problem.

Here is what we learned:

  1. Forgetting is Vocabulary Shift: Models replace known words with high-frequency domain synonyms, often ignoring the context (e.g., “Trump” becomes “Donald”).
  2. Out-of-Domain Words are Vulnerable: If a word isn’t in your fine-tuning data, the model is likely to overwrite its probability with a word that is.
  3. Coverage is King: The best predictor for how much a dataset will damage your model is how much of the generic vocabulary it misses.
  4. Surgical Fixes Work: You don’t need to retrain on the whole internet. Ensuring your adaptation data covers the model’s vocabulary—even with a “Minimal Mix-in” approach—can prevent the vast majority of forgetting.

For students and practitioners, this implies that when designing a domain adaptation pipeline, you shouldn’t just grab a dataset and fine-tune. You should analyze the vocabulary overlap. If your domain data is highly specialized (narrow vocabulary), you must be prepared for the model to lose its general capabilities, and you should consider augmenting your data with a targeted, vocabulary-rich generic set.