If you have ever tried to train a Large Language Model (LLM) from scratch, you know the pain. It requires massive computational resources, vast amounts of data, and a budget that usually only large tech giants possess. But here is a question that has puzzled researchers: if we already have excellent models fluent in English (like LLaMA), why do we burn millions of dollars training new models from scratch just to teach them a new language like Chinese?

Can’t we just… continue where the English model left off?

This concept is called Continual Pre-Training (CPT). While it sounds intuitive, the dynamics of how knowledge transfers between languages at a massive scale have been largely a mystery. Do the famous “Scaling Laws” still apply? Does the model forget English while learning Chinese? How much compute do you actually save?

In the paper Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale, researchers from the Chinese Academy of Sciences, Peking University, and other institutes conducted a massive empirical study to answer these questions. They trained models across 40 different sizes to map out the physics of cross-lingual transfer.

In this post, we will break down their findings, explore the math behind their new “Extended Scaling Law,” and discover why teaching an old model new tricks might be the future of efficient AI.

The Problem: The High Cost of “Tabula Rasa”

Most foundational LLMs today are trained tabula rasa—from a blank slate. The model starts with randomly initialized parameters (noise) and slowly learns probability distributions of text over trillions of tokens.

This is inefficient. Humans don’t learn this way. If you speak English and want to learn Chinese, you don’t forget what a “verb” or “noun” is; you map new vocabulary to your existing understanding of linguistic structure.

The researchers proposed testing Continual Pre-Training (CPT) systematically. They set up a rigorous comparison between two strategies:

  1. Training from Scratch: Random initialization trained on Chinese corpora.
  2. Continual Pre-Training: Initializing with a pre-trained English model, then training on Chinese corpora.

The Experimental Setup

To ensure the results were statistically significant and followed a predictable trend (a scaling law), the authors didn’t just train one big model. They trained 40 different model sizes, spanning 50M to 5.5B parameters.

Table 1: Training configurations for pre-training. All three sets of models are trained with identical parameter sizes which cover 40 sizes spanning from 50M to 5.5B.

As shown in the table above, they maintained consistency across batch sizes and learning schedules to ensure that any difference in performance came purely from the initialization strategy.

The Core Finding: CPT is Cheaper and Faster

The primary metric for success in pre-training is Validation Loss—essentially, how surprised the model is by new text. Lower loss means a smarter model.

When the researchers plotted the loss against the computational effort (measured in FLOPs), the difference between the two strategies was immediately visible.

Figure 1: Loss curves of pre-training and continual pre-training (CPT) across different model sizes.

Look closely at the Left graph in Figure 1. The blue line (CPT) is consistently lower than the magenta line (From Scratch).

  • Early gains: At the beginning of training, the CPT models drop in loss much faster. This makes sense; the model already knows how to process language in general (syntax, grammar logic), even if the specific tokens are different.
  • Persistent gains: Even as training continues, the gap narrows but remains.

The Right graph zooms in on a specific case (a 2B parameter model). It reveals a stunning statistic: to reach the same level of performance (loss), the CPT model requires approximately 50% fewer FLOPs. In the world of LLM training, where runs can cost millions, a 50% discount is revolutionary.
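
To make that concrete, here is a quick back-of-envelope sketch using the common approximation that training compute is roughly \(C \approx 6ND\) FLOPs; the token counts below are hypothetical and only illustrate what a 50% saving means for a 2B-parameter model.

```python
# Back-of-envelope compute estimate using the common rule of thumb
# C ≈ 6 * N * D training FLOPs. Token counts here are purely illustrative.
N = 2e9                             # 2B parameters
tokens_scratch = 100e9              # hypothetical tokens needed from scratch
tokens_cpt = 0.5 * tokens_scratch   # ~50% fewer FLOPs at equal loss (Figure 1)

flops_scratch = 6 * N * tokens_scratch
flops_cpt = 6 * N * tokens_cpt
print(f"From scratch: {flops_scratch:.2e} FLOPs")
print(f"CPT:          {flops_cpt:.2e} FLOPs ({flops_cpt / flops_scratch:.0%} of the cost)")
```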

Redefining Scaling Laws for Transfer Learning

One of the most important contributions of this paper is the mathematical formalization of why CPT works.

We generally accept the Chinchilla Scaling Law (Hoffmann et al., 2022) as the gold standard for understanding LLM performance. It states that Loss (\(L\)) is a function of Parameters (\(N\)) and Data (\(D\)).

\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \]

In this standard equation:

  • \(E\) is the irreducible loss (entropy of natural text).
  • \(A/N^\alpha\) represents the error due to limited model size.
  • \(B/D^\beta\) represents the error due to limited training data.
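
For readers who prefer code to notation, here is the same law as a small Python function. The default constants are approximately the values reported by Hoffmann et al. (2022); they are placeholders here, since the paper discussed in this post fits its own constants to its runs.

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted loss L(N, D) = E + A / N**alpha + B / D**beta.

    Defaults are roughly the constants fitted by Hoffmann et al. (2022);
    they are placeholders, not the values fitted in this paper."""
    return E + A / N**alpha + B / D**beta

# Example: a 1B-parameter model trained on 20B tokens (illustrative numbers).
print(chinchilla_loss(N=1e9, D=20e9))
```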

However, this law assumes you are starting from scratch. It fails to account for the “transfer” of knowledge from the source model. The researchers found that simply fitting this equation to CPT data resulted in poor predictions.

The Extended Scaling Law

To fix this, the authors proposed a new term. They hypothesized that the amount of data effectively “transferred” from the English model isn’t constant—it depends on how big the model is. A larger model has a higher capacity to store abstract linguistic structures (meta-knowledge) that apply to both English and Chinese.

They introduced a joint data-parameter scaling term:

\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B'}{D^{\beta'} N^{\gamma}} \]

Notice the change in the third term: \(\frac{B'}{D^{\beta'} N^{\gamma}}\). Instead of data (\(D\)) acting alone, it is now coupled with the parameter size (\(N\)) via the exponent \(\gamma\).

  • If \(\gamma\) is positive, it means that larger models are more efficient at utilizing training data during transfer learning.

The researchers fit this new equation to their experimental data and found \(\gamma = 0.08\). This mathematically confirms that cross-lingual transfer is positively correlated with model size. A 5B parameter model gets more “bang for its buck” from the pre-trained English weights than a 100M parameter model does.
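
A minimal sketch of the extended law in code: \(\gamma = 0.08\) is the value reported in the paper, but the remaining constants below are placeholders chosen only to show how the data term shrinks as \(N\) grows.

```python
def cpt_loss(N, D, E, A, B_prime, alpha, beta_prime, gamma=0.08):
    """Extended scaling law: L(N, D) = E + A/N**alpha + B'/(D**beta' * N**gamma).
    The N**gamma factor couples the data term to model size."""
    return E + A / N**alpha + B_prime / (D**beta_prime * N**gamma)

# Placeholder constants (NOT the paper's fitted values); gamma = 0.08 is from the paper.
E, A, B_prime, alpha, beta_prime, gamma = 1.7, 400.0, 400.0, 0.34, 0.28, 0.08
D = 50e9  # 50B target-language tokens
for N in (100e6, 1e9, 5e9):
    loss = cpt_loss(N, D, E, A, B_prime, alpha, beta_prime, gamma)
    data_term = B_prime / (D**beta_prime * N**gamma)
    print(f"N={N:.0e}: predicted loss = {loss:.3f}, data term = {data_term:.4f}")
```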

Quantifying the Savings

How much data does this process actually save? The researchers calculated the “Effectively Transferred Data”—essentially, how many tokens of Chinese training you didn’t have to do because the model already knew English.

Figure 2: Reduced computational resources (top) and data consumption (bottom) with CPT.

The bottom chart in Figure 2 is particularly illuminating.

  • The Y-axis represents “Effectively Transferred Data.”
  • The lines represent different model sizes (Blue is 5B, Green is 100M).

You can see that the Blue line (5B) is much higher than the others. This validates the Extended Scaling Law: larger models are essentially “downloading” more knowledge from their English initialization than smaller models are.
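
One way to make "effectively transferred data" concrete (a hedged sketch of the idea, not necessarily the paper's exact formulation): take the loss the CPT model reached, invert the from-scratch law to find how many tokens a scratch-trained model of the same size would have needed, and subtract the tokens actually used.

```python
def effectively_transferred_data(loss_cpt, N, D_actual,
                                 E=1.69, A=406.4, B=410.7,
                                 alpha=0.34, beta=0.28):
    """Estimate the extra tokens a from-scratch model of size N would need to
    match a CPT model's loss, by inverting L = E + A/N**alpha + B/D**beta for D.
    A sketch of the concept; constants are placeholder Chinchilla-style values."""
    residual = loss_cpt - E - A / N**alpha   # loss attributable to the data term
    if residual <= 0:
        return float("inf")                  # below the law's floor: not invertible
    D_equivalent = (B / residual) ** (1 / beta)
    return D_equivalent - D_actual

# Example (illustrative numbers): a 1B-parameter CPT model that reached loss 2.4
# after 20B Chinese tokens.
print(effectively_transferred_data(loss_cpt=2.4, N=1e9, D_actual=20e9))
```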

Does it Generalize? (French, Russian, and Chinese)

You might wonder if this only works for Chinese, or if it’s a universal property of LLMs. To test this, the authors ran similar experiments transferring an English model to French and Russian.

Figure 3: Zero-shot evaluation for pre-trained and continually pre-trained (CPT) models of different languages.

Figure 3 shows the zero-shot accuracy on various benchmarks.

  • Blue/Green/Orange bars: CPT (Continual Pre-Training).
  • Grey bars: Training from scratch.

In every single case—whether it was French (closely related to English), Russian (Cyrillic script), or Chinese (logographic script)—the CPT models outperformed the from-scratch models. Interestingly, the transfer effect was strongest for French, likely because it shares high linguistic similarity (vocabulary and grammar) with the source language, English.

Best Practices: How to Allocate Your Budget

If you are an engineer planning a CPT run, you have a fixed compute budget (\(C\)). The Chinchilla paper gave us a famous recipe for how to split that budget between model size (\(N\)) and training data (\(D\)) when training from scratch:

\[ N_{\mathrm{opt}}(C) \propto C^{\frac{\beta}{\alpha + \beta}}, \qquad D_{\mathrm{opt}}(C) \propto C^{\frac{\alpha}{\alpha + \beta}} \]

However, because CPT makes learning more efficient, the optimal recipe changes. Using their new Extended Scaling Law, the researchers derived the CPT Optimal Allocation:

Minimizing the extended law under the standard compute constraint \(C \approx 6ND\), the optimal split now involves \(\gamma\):

\[ N_{\mathrm{opt}}(C) \propto C^{\frac{\beta'}{\alpha + \beta' - \gamma}}, \qquad D_{\mathrm{opt}}(C) \propto C^{\frac{\alpha - \gamma}{\alpha + \beta' - \gamma}} \]

With \(\gamma > 0\), the exponent on \(N\) is larger than it would be with \(\gamma = 0\), so the optimal model size grows faster with compute.

This looks complex, so let’s visualize it using the “Efficient Frontier.”

Figure 5: Predicted compute-optimal efficient frontiers on IsoLoss contour for both strategies.

In Figure 5, the Magenta line (CPT) sits to the left of the Blue line (From Scratch).

  • The X-axis is FLOPs (Cost).
  • The Y-axis is Parameters (Model Size).

For the same budget (reading vertically at a fixed X-value), the optimal CPT model is larger than the optimal from-scratch model.

The Takeaway: Because the model is “pre-matured” by the English weights, you don’t need as many training tokens to reach convergence. Therefore, if you have a specific compute budget, you should spend more of it on increasing the model size rather than gathering more data.
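
The snippet below sketches how the recipe shifts. The CPT exponent follows from minimizing the extended law under \(C \approx 6ND\); the prefactor and the exponent values plugged in are illustrative placeholders, not the paper's fits.

```python
def optimal_allocation(C, alpha, beta, gamma=0.0, k=1.0):
    """Compute-optimal (N, D) under C = 6*N*D for a law with data term
    B / (D**beta * N**gamma). gamma=0 recovers the standard Chinchilla split;
    k absorbs the constant prefactor, which depends on the fitted A and B."""
    a = beta / (alpha + beta - gamma)   # exponent governing N_opt ∝ C**a
    N_opt = k * C**a
    D_opt = C / (6 * N_opt)
    return N_opt, D_opt

C = 1e21  # FLOPs budget (illustrative)
print(optimal_allocation(C, alpha=0.34, beta=0.28, gamma=0.0))   # from scratch
print(optimal_allocation(C, alpha=0.34, beta=0.28, gamma=0.08))  # CPT: larger N_opt
```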

The Danger of Catastrophic Forgetting

There is one major catch to CPT. As the model learns Chinese, does it forget English?

The answer is a resounding yes. Without intervention, the model’s perplexity on English text spikes, effectively destroying its original capabilities. This is known as Catastrophic Forgetting.

The solution proposed by the authors is Data Replaying: mixing a percentage of the original English data back into the Chinese training stream. But what is the magic ratio? 1%? 50%?

Figure 4: Scaling of CPT with different English replaying ratios.

Figure 4 illustrates the “Scaling Law of Forgetting.”

  • Left Graph: Loss on English Text.
  • Right Graph: Loss on Chinese Text.
  • Colors: Darker blue = higher ratio of English data mixed in.

On the Left, notice the “U-shape.” Initially, English loss goes up (forgetting), but as training continues with mixed data, the loss goes back down. Crucially, the light blue lines (1% or 5% English data) still show significant forgetting (high loss).

However, looking at the Right graph, adding English data hardly hurts the Chinese performance at all! The curves are almost identical.

Finding the Sweet Spot

To find the exact optimal ratio, the researchers benchmarked the models on downstream tasks.

Figure 6: Model performance on English and Chinese benchmarks at different English data replaying ratios.

In Figure 6, we see a crossover.

  • Solid Dark Blue (Pre-trained on English): Great at English, terrible at Chinese.
  • Pink (Pre-trained on Chinese): Great at Chinese, terrible at English.
  • Light Blue (Continue w/ 20% English): The balance point.

The authors conclude that a replay ratio of 10% to 30% is the sweet spot. It effectively prevents catastrophic forgetting of the source language while incurring almost no penalty on learning the target language.
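
As a rough sketch of what this looks like in practice, assuming two batch iterators (one per language), the replay ratio is simply the probability of drawing an English batch at each step:

```python
import random

def replay_stream(chinese_batches, english_batches, replay_ratio=0.2, seed=0):
    """Yield training batches, drawing from the original English stream with
    probability `replay_ratio` and from the new Chinese stream otherwise.
    A sketch of the data-replaying idea, not the authors' exact pipeline."""
    rng = random.Random(seed)
    while True:
        if rng.random() < replay_ratio:
            yield next(english_batches)   # replayed source-language data
        else:
            yield next(chinese_batches)   # target-language data
```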

Conclusion

The days of training every LLM from scratch are likely numbered. This research provides a solid theoretical and empirical foundation for Continual Pre-Training.

By treating pre-trained models not as finished products but as initialization checkpoints, we can democratize access to LLMs for new languages. The paper proves that this approach is not only viable but superior, offering:

  1. Efficiency: 25-50% reduction in compute.
  2. Performance: Lower loss than training from scratch.
  3. Scalability: A new scaling law that helps us design optimal models.

For students and practitioners, the lesson is clear: meaningful AI progress doesn’t always require a bigger GPU cluster. Sometimes, it just requires standing on the shoulders of the giants (or LLaMAs) that came before.