Large Language Models (LLMs) are impressive generalists. Trained on massive corpora like the Common Crawl, they know a little bit about everything. However, in the real world, “a little bit” isn’t always enough. Whether it is a law firm needing a model specialized in contract analysis, or a software house needing a coding assistant, we often need to take a general-purpose model and teach it a specific domain.

This process is called Continual Pre-Training (CPT). It sounds straightforward: take a pre-trained model and keep training it on new, domain-specific data. But CPT introduces a notorious tension. As the model learns the new domain (downstream performance), it tends to forget what it learned originally (general performance). This phenomenon, known as catastrophic forgetting, creates a delicate balancing act for researchers.

Until recently, most research focused on the result of this process—how much was learned and how much was lost by the end. But what happens during training? How do the learning dynamics evolve step by step?

In the paper “Learning Dynamics in Continual Pre-Training for Large Language Models,” Wang et al. (2025) propose a comprehensive mathematical framework to answer these questions. They derive a CPT Scaling Law that models the exact trajectory of validation loss throughout the training process. By decoupling the effects of learning rate annealing and data distribution shifts, this work allows us to predict the performance of an LLM at any step of its continual training journey.

The Core Problem: The CPT Tug-of-War

To understand the contribution of this paper, we first need to visualize the problem. When you take a model trained on a general dataset (let’s call it \(D_{pt}\), for Pre-Training) and switch it to a new dataset (\(D_{cpt}\), for Continual Pre-Training), the loss curves diverge.

Figure 1: CPT loss curves under different learning rate schedules (LRS): constant (a-c) and warmup-stable-decay (WSD).

As shown in Figure 1 above, two things happen simultaneously during CPT:

  1. General Performance Degradation: The blue dashed line represents the model’s ability to predict the original data (\(D_{pt}\)). Notice how the loss spikes when the data switches (the vertical dashed line). This is the distribution shift hitting the model.
  2. Downstream Adaptation: The orange dashed line shows the model getting better at the new domain (\(D_{cpt}\)).

The researchers conceptualize the CPT process as a transition—a bridge crossing from an “initial pre-training trajectory” to a “new domain-specific trajectory.” The goal of this research is to mathematically describe that bridge.

Deconstructing the Transfer Curve

The authors begin by defining two theoretical baselines, which they call “Hidden PT Curves.” Imagine parallel universes:

  1. Hidden PT Curve on \(D_{pt}\): In this universe, we never switched data. We just kept training on the original dataset using the CPT learning rate schedule.
  2. Hidden PT Curve on \(D_{cpt}\): In this universe, we trained on the new dataset from scratch, right from the beginning.

The actual CPT process is the Transfer Curve that moves between these two hidden states. It peels away from the first curve and converges toward the second.

The Invariance of Distribution Shift

One of the most striking findings in the paper’s pilot observations is that the “Distribution Shift”—the penalty the model pays for switching data sources—follows a predictable pattern regardless of when you start the transfer.

Figure 2: Transfer loss curves on the \(D_{pt}\) and \(D_{cpt}\) validation sets for different transfer starting points.

As illustrated in Figure 2, whether you switch datasets at step 10,000, 20,000, or 30,000, the shape of the deviation (the red arrows) remains consistent. This suggests that the distribution shift is a fundamental property of the distance between the two datasets, largely independent of the model’s current state.

The CPT Scaling Law

The heart of the paper is the derivation of the mathematical law that governs this process. The authors achieve this by separating the training dynamics into two distinct components: Learning Rate (LR) Annealing and Distribution Shift.

Component 1: Learning Rate Annealing

First, we need a way to describe how a model learns when the data doesn’t change. The authors build upon previous work by Tissue et al. (2024), which describes loss not just as a function of compute, but as a function of the learning rate schedule.

The base loss is defined as:

Equation 1: The scaling law with LR annealing.
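In the notation of Tissue et al. (2024), the law takes roughly the following form (a hedged reconstruction; \(\alpha\) and the decay factor \(\lambda\) are additional fitted constants, and \(\eta_i\) is the learning rate at step \(i\)):

\[
L(s) = L_0 + A \cdot S_1^{-\alpha} - C \cdot S_2,
\qquad
S_1 = \sum_{i=1}^{s} \eta_i,
\qquad
S_2 = \sum_{i=1}^{s} \sum_{k=1}^{i} \left(\eta_{k-1} - \eta_k\right) \lambda^{\,i-k}.
\]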

Here is the breakdown of the terms:

  • \(S_1\) (Forward Area): The sum of learning rates used so far. This represents the total “distance” the model has traveled in the optimization landscape.
  • \(S_2\) (Annealing Area): A term that captures the benefit of lowering the learning rate. When the LR drops (anneals), the model settles into sharper minima, lowering the loss.
  • \(L_0, A, C\): Constants specific to the model and data.

When we apply this to CPT without any distribution shift, the base loss considers the accumulated training from both the pre-training (\(pt\)) and continual (\(cpt\)) phases:

Equation 2: Base loss combining PT and CPT phases.
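Spelled out (a hedged reconstruction consistent with that description), the forward and annealing areas simply accumulate across the two phases:

\[
L_{\text{base}}(t) = L_0 + A\left(S_1^{pt} + S_1^{cpt}(t)\right)^{-\alpha} - C\left(S_2^{pt} + S_2^{cpt}(t)\right).
\]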

Component 2: The Distribution Shift

Next, the authors model the “penalty” for switching data distributions. Based on their pilot observations, this follows a power-law form that scales with the amount of training done in the CPT phase (\(S_1^{cpt}\)).

Equation 3: The distribution shift term.

This term, \(\Delta L(t)\), represents the gap between the actual loss and the theoretical baseline. It starts at 0 (before transfer) and grows (or shrinks, depending on the domain view) as the model adapts to the new data.
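The paper gives the exact parameterization in Equation 3; one saturating power-law form consistent with this description (zero at the switch, growing toward an asymptote \(B\) as \(S_1^{cpt}\) accumulates) would be:

\[
\Delta L(t) = B\left(1 - \left(1 + \frac{S_1^{cpt}(t)}{E}\right)^{-\beta}\right),
\]

where \(B\), \(E\), and \(\beta\) are fitted constants that depend on the dataset pair and on which validation domain you are measuring.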

The Combined Law

By stitching these two components together, the authors present the final CPT Scaling Law. This single equation predicts the loss at any step \(t\) during the continual pre-training process:

Equation 4: The full CPT Scaling Law.

This equation is elegant because it isolates the specific drivers of performance:

  1. \(S_1\) (Training Volume): How much training has been done?
  2. \(S_2\) (Annealing): How much has the learning rate decayed?
  3. Distribution Shift (The second line): How different is the new data from the old?
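To make the mechanics concrete, here is a minimal Python sketch of how one might evaluate such a law from an explicit learning-rate schedule. This is our illustration, not the paper's reference implementation: the helper names, the placeholder constants, and the saturating form of the shift term are assumptions, and in practice every constant is fitted to pilot runs.

```python
import numpy as np

def areas(lrs, lam=0.99):
    """Forward area S1 (cumulative LR sum) and annealing area S2
    (LR drops, discounted over later steps, per Tissue et al. 2024).
    `lam` is a stand-in for the fitted decay factor."""
    lrs = np.asarray(lrs, dtype=float)
    s1 = np.cumsum(lrs)
    drops = np.maximum(-np.diff(lrs, prepend=lrs[0]), 0.0)  # eta_{k-1} - eta_k
    s2, acc = np.zeros_like(lrs), 0.0
    for i in range(len(lrs)):
        acc = acc * lam + drops[i]            # sum_k drops[k] * lam**(i-k)
        s2[i] = (s2[i - 1] if i else 0.0) + acc
    return s1, s2

def cpt_loss(s1_pt, s2_pt, s1_cpt, s2_cpt,
             L0=2.0, A=1.5, alpha=0.5, C=2.0, B=0.3, E=1.0, beta=1.0):
    """Base loss (areas accumulated across both phases) plus an assumed
    saturating shift term. All constants are placeholders to be fitted."""
    base = L0 + A * (s1_pt + s1_cpt) ** (-alpha) - C * (s2_pt + s2_cpt)
    shift = B * (1.0 - (1.0 + s1_cpt / E) ** (-beta))
    return base + shift

# Example: constant-LR pre-training followed by a linearly decaying CPT phase.
s1_pt, s2_pt = areas(np.full(1000, 3e-4))
s1_c, s2_c = areas(np.linspace(3e-4, 3e-5, 500))
# Predicted D_pt-view loss at each CPT step; for the D_cpt view the fitted
# constants (including the sign/shape of the shift term) would differ.
pred = cpt_loss(s1_pt[-1], s2_pt[-1], s1_c, s2_c)
```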

Validating the Law

Does this equation actually work? The authors fitted this curve to various learning rate schedules, including the popular Cosine schedule and the Warmup-Stable-Decay (WSD) schedule.

Figure 3: Using Eq. 4 to fit all PT and CPT loss curves with different LRS (WSD and Cosine).

Figure 3 shows the fit. The lines represent the predicted loss using the equation, and the data points represent the actual training runs. The match is nearly perfect for both general domain (\(D_{pt}\)) and downstream domain (\(D_{cpt}\)) validation losses. This confirms that the law holds regardless of the specific schedule used.
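In practice, those constants are recovered by least-squares fitting against the losses logged during short pilot runs. Here is a minimal sketch of that step, using scipy and synthetic stand-in data so the snippet runs end to end (the `base_law` helper and all numbers are hypothetical):

```python
import numpy as np
from scipy.optimize import curve_fit

def base_law(X, L0, A, alpha, C):
    """Eq. 1-style law evaluated on precomputed (S1, S2) areas."""
    s1, s2 = X
    return L0 + A * s1 ** (-alpha) - C * s2

# Stand-ins for pilot-run logs: per-step areas plus the measured loss.
rng = np.random.default_rng(0)
s1 = np.linspace(0.5, 50.0, 200)
s2 = np.linspace(0.0, 0.4, 200)
loss = base_law((s1, s2), 2.0, 1.6, 0.45, 0.9) + rng.normal(0, 0.003, 200)

popt, _ = curve_fit(base_law, (s1, s2), loss, p0=[2.0, 1.0, 0.5, 0.5])
print(dict(zip(["L0", "A", "alpha", "C"], popt.round(3))))
```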

The “Slide” Analogy and Loss Potential

To build intuition, the authors offer a geometric interpretation of their findings. You can visualize the loss landscape as a surface. The CPT process is like a slide transitioning from one surface to another.

Figure 4: The loss surface of the CPT process and two directional views.

In Figure 4, look at panel (c). This “Annealing View” introduces a critical concept: Loss Potential.

Loss Potential essentially measures how “unfinished” the pre-trained model is. A model that has not yet decayed its learning rate has high loss potential—it is sitting high up on the loss curve, ready to drop significantly as soon as the learning rate anneals. Conversely, a fully converged model (where LR is already near zero) has low loss potential.
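One way to read this in terms of Equation 1 (our gloss, not the paper's formal definition): the loss potential of a checkpoint is roughly the annealing gain still available,

\[
\text{potential} \;\propto\; C\left(S_2^{\max} - S_2\right),
\]

where \(S_2^{\max}\) is the annealing area the schedule would reach if the learning rate were decayed fully to zero.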

Why “Undercooked” Models Transfer Better

A major practical insight from this paper is the relationship between Loss Potential and downstream performance.

The researchers found that PT models with higher loss potential adapt better to new domains. If you fully anneal your model during pre-training (squeezing out every drop of performance on the general domain), it becomes rigid. It has settled deep into a minimum that makes it harder to traverse to the new domain’s minimum.

Figure 5: The impact of loss potential.

Figure 5 illustrates this clearly.

  • Panels (b) and (e): The “True Loss” curves show that models with higher loss potential (purple lines) achieve lower final loss on the new domain (\(D_{cpt}\)) than models that were fully annealed (light blue lines).
  • Panels (c) and (f): The predictions from the scaling law confirm this trend.

Finding: If you are releasing an open-source model intended for others to fine-tune or continue training, do not fully anneal it. Release a checkpoint with high loss potential.

Critical Factors in CPT

Using their scaling law, the authors analyzed several other hyperparameters critical to the success of continual pre-training.

1. Peak Learning Rate

When starting CPT, you typically “warm up” the learning rate again. How high should it go? According to the law, a higher peak LR in the CPT phase accelerates adaptation to the new domain (\(D_{cpt}\)), reducing its loss faster. However, it also causes a sharper spike in the general domain loss (\(D_{pt}\)). This is the classic stability-plasticity dilemma quantified.

\(D_{cpt}\) predicted loss vs. peak LR for different CPT steps.
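A toy sketch of why the peak LR sits on both sides of this trade-off: it directly scales the forward area \(S_1^{cpt}\), which drives both the adaptation term and the distribution-shift term in the law. The schedule shape and all numbers below are arbitrary assumptions for illustration:

```python
import numpy as np

def wsd_schedule(peak, warmup=100, stable=800, decay=100, floor_frac=0.1):
    """One common CPT re-warmup shape: warmup -> stable -> decay (WSD)."""
    return np.concatenate([
        np.linspace(0.0, peak, warmup, endpoint=False),
        np.full(stable, peak),
        np.linspace(peak, peak * floor_frac, decay),
    ])

for peak in (1e-4, 3e-4, 6e-4):
    s1_cpt = wsd_schedule(peak).sum()  # forward area of the CPT phase
    print(f"peak LR {peak:.0e} -> S1_cpt = {s1_cpt:.3f}")
# A larger peak inflates S1^cpt per step: the A*(S1)^(-alpha) term on D_cpt
# shrinks faster (quicker adaptation), but the shift term, which grows with
# S1^cpt, also rises faster on D_pt (the sharper spike described above).
```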

2. The Replay Ratio

A common technique to fight catastrophic forgetting is Replay: mixing a percentage of the original data (\(D_{pt}\)) into the new training batch.

The authors extended their scaling law to account for this. They found that the replay ratio (\(r\)) impacts the distribution shift exponentially.

The fitted equation for replay ratios.

This complex-looking modification allows the scaling law to predict performance for any mixture of data.
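Schematically (our hedged paraphrase of that finding, not the paper's fitted form), the replay ratio enters as an exponential damping of the shift term:

\[
\Delta L(t; r) \;\approx\; \Delta L(t; 0)\, e^{-k r},
\]

so \(r = 0\) recovers pure CPT, even modest replay substantially suppresses the \(D_{pt}\) spike, and \(k\) is a fitted constant.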

Figure 19: Loss curves for different replay ratios.

As shown in Figure 19, adding even a small amount of original data (replay) drastically changes the curve, dampening the spike in general loss. The scaling law accurately predicts these trajectories, allowing engineers to simulate different ratios without running expensive experiments.

3. The Critical Point and Turning Length

When you start training on new data, the loss on the old data goes up. But will it ever come back down?

The authors identify a Critical Point.

  • Pre-Critical: If you stop training early enough, or if the datasets are similar enough, the general loss might eventually recover (the curve rises then falls).
  • Post-Critical: If the datasets are too distinct or you train too long, you cross a point of no return. The general loss will stabilize at a higher value than where it started.

Critical point and turning length in the \(D_{pt}\) validation loss.

Optimization: Balancing the Trade-off

One of the most powerful applications of a scaling law is hyperparameter optimization. Instead of guessing, you can mathematically define what you want.

If we define our goal as minimizing a weighted combination of general loss and downstream loss:

Equation 5: Minimizing weighted loss.
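Written out (a hedged reconstruction matching the description and the \(\lambda_1\) coefficient shown in Figure 8):

\[
\min_{\theta}\; \lambda_1 L^{pt}(\theta) + \lambda_2 L^{cpt}(\theta),
\]

where \(\theta\) ranges over the CPT hyper-parameters (loss potential, peak learning rate, replay ratio, training steps) and the weights \(\lambda_1, \lambda_2\) encode how much you care about each domain.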

We can solve for the optimal settings.

Figure 8: Optimizing hyper-parameters for CPT based on different coefficients.

Figure 8 visualizes these optimal frontiers:

  • Graph (a): Shows the optimal Loss Potential. If you care mostly about the new domain (low \(\lambda_1\)), you want a model with nearly 100% loss potential (high plasticity).
  • Graph (c): Shows the optimal Replay Ratio. Interestingly, the optimal replay ratio does not vary linearly with the weighting coefficients.

Solving the “Black Box” Problem

Finally, the authors address a major hurdle for practitioners: Open-Source Models.

When you download a model like LLaMA-3, you don’t have access to its exact pre-training data or its specific loss trajectory. You are effectively starting CPT on a “Black Box.” Can the scaling law still work?

The authors propose using a Proxy Dataset. For example, using a slice of RedPajama (an open replica of Common Crawl data) as a stand-in for the unknown \(D_{pt}\).

Figure 18: Fitting and predicting with proxy datasets.

Figure 18 (b) demonstrates that using a proxy dataset allows the scaling law to fit the loss curve (blue line) almost as well as having the ground truth. This makes the method highly practical for engineers working with diverse open-source foundation models.

Additionally, the authors show that you can even predict loss on Out-of-Distribution (OOD) datasets (datasets that are neither the original nor the target) by modeling them as linear combinations of the two known losses.

Predicting OOD loss.

Equation 6: The OOD linear combination.
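In schematic form (a hedged reconstruction of the linear-combination idea):

\[
L^{ood}(t) \;\approx\; w_1 L^{pt}(t) + w_2 L^{cpt}(t),
\]

with the weights \(w_1, w_2\) (and possibly a constant offset) fitted once per OOD validation set.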

Conclusion

The work of Wang et al. moves the field of Continual Pre-Training from alchemy toward chemistry. By establishing a rigorous CPT Scaling Law, they have provided a way to quantify the dynamics of transfer learning.

Key Takeaways:

  1. CPT is a Transition: It is mathematically predictable as a shift between two hidden learning curves.
  2. Don’t Over-Cook Your Models: If you plan to fine-tune or continue training, a model with high “Loss Potential” (less annealing) is superior.
  3. Predict Before You Train: By running short pilot runs to fit the constants in Equation 4, researchers can predict the performance of massive training runs, optimizing for peak learning rate, replay ratio, and training steps before committing significant compute resources.

As LLMs continue to specialize in law, medicine, coding, and science, understanding the physics of how they learn—and forget—is more critical than ever. This scaling law provides the blueprint for that understanding.