Artificial intelligence has long sought a system that learns continuously—absorbing new information over time without forgetting what it already knows. Humans do this naturally: we don’t need to “retrain” our brains every time we learn a new recipe or song. Most AI models, however, are static. After a massive one-time training process, they’re frozen; teaching them anything new usually means starting over, which is computationally expensive and unsustainable.
This is where Continual Learning (CL) comes in. Its goal is simple but profound: allow models to learn from a continuous stream of data without catastrophic forgetting. One of the most practical ways to fight forgetting is replay (also known as rehearsal), where the model occasionally revisits previous samples while training on new ones. It’s like reviewing old notes before taking a new exam.
But there’s a catch. Naively applied, replay can double the cost of continual learning, since every training batch now includes both new and old data. To bring continual learning closer to scalability, Truman Hickok and colleagues, in the paper “Scalable Strategies for Continual Learning with Replay”, propose a suite of three complementary tools for learning efficiently:
- Low-Rank Adaptation (LoRA) — makes fine-tuning lighter and regularizes learning.
- Consolidation — reduces replay usage through a two-phase replay scheme.
- Model Merging — unifies task-specific weights smoothly across long sequences.
Together, they form a powerful framework that achieves the same performance with up to 65% fewer replay samples. Let’s walk through the key insights.
The Continual Learning Playground
At the heart of continual learning is the stability-plasticity dilemma: models must be plastic enough to learn new information but stable enough not to overwrite old knowledge. When a standard network is trained on a new task, its weights shift toward optimizing the new data—often at the expense of previously learned knowledge. This “catastrophic forgetting” is a fundamental problem in CL.
CL research typically organizes methods into three categories:
- Regularization-based: Add penalties that constrain important weights from drifting too far.
- Model expansion: Introduce new modules or layers for each task, isolating knowledge but risking unbounded growth.
- Replay-based: Store a small subset of old samples and mix them with new data during training, directly reinforcing past learning.
This paper focuses on making replay-based approaches scalable, since they’re broadly effective and conceptually simple.
The Replay Ratio (RR)
A critical variable is the replay ratio (RR)—the ratio of replay samples to new task samples within a batch:
\[ \mathrm{RR} = \frac{N_{\mathrm{replay}}}{N_{\mathrm{task}}} \]

An RR of 1.0 (equal parts old and new samples) is the usual baseline but doubles training cost. Lowering RR reduces the data processed per batch, speeding up training but risking more forgetting.
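As a rough illustration of how RR shapes a single training batch, here is a minimal sketch; `build_batch`, the stream, and the buffer are hypothetical stand-ins, not code from the paper.

```python
import random

def build_batch(new_task_stream, replay_buffer, n_new=32, replay_ratio=1.0):
    """Compose one training batch from new-task samples plus replayed samples."""
    new_samples = [next(new_task_stream) for _ in range(n_new)]
    n_replay = int(round(replay_ratio * n_new))        # RR = N_replay / N_task
    old_samples = random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    return new_samples + old_samples

stream = iter(range(1_000))        # stand-in for the new task's data
buffer = list(range(-500, 0))      # stand-in for stored samples from old tasks
batch = build_batch(stream, buffer, n_new=32, replay_ratio=0.25)
print(len(batch))                  # 40 items: 32 new + 8 replayed
```

At RR = 1.0 every step processes twice as many samples as the new task alone requires; at RR = 0.25 the overhead shrinks to a quarter.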

Figure 1. Lower replay ratios shorten training but may increase forgetting.
Strategy 1: LoRA — Learning on a Budget
Parameter-efficient fine-tuning (PEFT) techniques, which adapt large pre-trained models without updating all parameters, are crucial for modern AI. Low-Rank Adaptation (LoRA) is among the most popular.
LoRA assumes that the change in a model’s weights during fine-tuning can be expressed with a small, low-rank update. Instead of altering the entire weight matrix \( W_0 \), LoRA freezes it and trains two small matrices, \( A \) and \( B \), whose product approximates the change:
\[ W = W_0 + BA \]

Because the rank \( r \ll \min(d, k) \), LoRA updates far fewer parameters—making training faster and lighter.
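The idea is easy to see in code. Below is a minimal LoRA layer sketch in PyTorch; the rank, initialization, and layer shape are illustrative choices and not implementation details taken from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, k, r=8):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)  # frozen base weight
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # trainable, rank r
        self.B = nn.Parameter(torch.zeros(d, r))         # trainable, zero-init so W starts at W0

    def forward(self, x):
        W = self.W0 + self.B @ self.A   # W = W0 + BA, a rank-r update
        return x @ W.T
```

Only \( A \) and \( B \) receive gradients, so the number of trainable parameters drops from \( d \times k \) to \( r(d + k) \).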
Applying LoRA to Continual Learning
In continual learning, the authors adopt LoRA as follows:
- For each new task \( t \), create new LoRA adapters \( A_t \) and \( B_t \).
- Train only these adapters on new data mixed with replayed samples.
- Merge the learned updates into the model: \( W_t = W_{t-1} + B_t A_t \).
- Discard adapters and start fresh for the next task.
This design yields an efficient training loop. Even more importantly, LoRA imposes implicit regularization: it restricts how much the model can change per task. Reduced plasticity means less forgetting—a built-in safety net.
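Put together, the per-task loop looks roughly like the sketch below, shown on plain tensors; the training step is elided and the dimensions are arbitrary, so treat it as a schematic of the merge-and-discard cycle rather than the authors' code.

```python
import torch

d, k, r = 64, 32, 8
W = torch.randn(d, k)                                  # consolidated weights entering task t

for t in range(3):                                     # three hypothetical tasks
    A = (torch.randn(r, k) * 0.01).requires_grad_()    # fresh adapter A_t
    B = torch.zeros(d, r, requires_grad=True)          # fresh adapter B_t
    # ... train only A_t and B_t on task t data mixed with replayed samples ...
    with torch.no_grad():
        W = W + B @ A                                  # merge: W_t = W_{t-1} + B_t A_t
    # A_t and B_t are discarded; the next task starts with brand-new adapters
```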
LoRA vs. Full Fine-Tuning (FFT)
The paper compares LoRA and standard full fine-tuning across different conditions.

Figure 2. Across diverse task sizes and replay ratios, LoRA is competitive where flexibility is limited.
Highlights from the results:
- Abundant Replay: FFT dominates when there’s plenty of replay. Its flexibility helps assimilate new data when forgetting is already controlled.
- Small Tasks: In highly fragmented streams (tiny tasks, e.g., 2–3 classes), LoRA stabilizes learning and even surpasses FFT in the continual pre-training (CPT) setting.
- Sparse Replay: With low RR (like 0.1), FFT’s performance collapses due to forgetting. LoRA, however, degrades gracefully.
In short, LoRA shines under challenging, under-regularized conditions—small tasks or minimal replay. This robustness makes it ideal for scaling continual learning systems.
Strategy 2: Consolidation — Smarter, Not Harder Replay
Replay works—but it’s costly. What if you could use it smarter? The authors introduce consolidation, a two-phase replay strategy inspired by how biological brains “sleep” to consolidate memories.
The Two Phases
- Task-Learning Phase: Train on each new task using a low replay ratio (e.g., RR = 0.25). This drastically cuts compute costs during active learning.
- Consolidation Phase: After finishing the task, train only on replayed samples from previous tasks. This dedicated session refines and rebalances the model’s knowledge.

Figure 3. Consolidation separates learning (low replay) and memory restoration (post-task replay).
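A minimal sketch of this two-phase schedule is shown below; `train_step`, the buffer sampling, and the number of consolidation steps are placeholders I am assuming for illustration, not the paper's hyperparameters.

```python
import random

def train_step(model, batch):
    """Placeholder for one optimization step on a mixed batch."""

def learn_task_with_consolidation(model, new_task_data, replay_buffer,
                                  batch_size=32, task_rr=0.25,
                                  consolidation_steps=100):
    # Phase 1: task learning with a low replay ratio (small, cheap batches).
    for i in range(0, len(new_task_data), batch_size):
        new_batch = new_task_data[i:i + batch_size]
        n_old = int(task_rr * len(new_batch))
        old_batch = random.sample(replay_buffer, min(n_old, len(replay_buffer)))
        train_step(model, new_batch + old_batch)

    # Phase 2: consolidation, training only on replayed samples from past tasks.
    for _ in range(consolidation_steps):
        old_batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
        train_step(model, old_batch)
```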
Measuring Replay Efficiency
To quantify efficiency, the authors define the Total Replay Percentage (TRP)—the total replay samples used (both phases) relative to a 1.0 RR baseline:
\[ TRP = \frac{\sum_{i=1}^{T} (N_{\text{replay task},i} + N_{\text{replay consolidation},i})} {\sum_{i=1}^{T} N_{\text{replay baseline 1:1},i}} \times 100\% \]
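To make the metric concrete, here is a worked example with made-up numbers for a single task (not figures from the paper):

```python
# Illustrative numbers only, for one task.
n_task_samples       = 10_000                       # new samples seen for the task
replay_during_task   = 0.25 * n_task_samples        # phase 1 at RR = 0.25 -> 2,500
replay_consolidation = 3_000                        # dedicated post-task replay samples
replay_baseline      = 1.0 * n_task_samples         # what a 1:1 RR baseline would replay

trp = (replay_during_task + replay_consolidation) / replay_baseline * 100
print(f"TRP = {trp:.0f}%")                          # -> TRP = 55%
```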
Figure/Table 4. Consolidation achieves high accuracy even with half the usual replay.
What the Results Show
- Remarkable Efficiency: Consolidation matches the standard 1.0 RR baseline while using 45–55% fewer replay samples. For example, in class-incremental learning (CIL), 74.6% accuracy is achieved at just 55% TRP—nearly half the cost.
- Better Allocation Beats More Data: Even when TRP = 100% (same total replay samples), repurposing some of those samples for consolidation yields better performance. Accuracy improves from 73.8% to 76.1% with the same replay budget.

Figure 4. Consolidation improves consistency: fewer poorly learned tasks, higher overall accuracy.
Consolidation reframes replay from brute memory repetition into strategic memory reconciliation—doing less during learning but more afterward.
Strategy 3: Model Merging for a Unified Learner
The third pillar of scalability borrows from multi-task training: model merging (or task arithmetic), which unites specialist models into one generalist model by merging their weights.
In multi-task scenarios, merging prevents destructive interference between tasks by averaging task-specific modifications. In continual learning, merging plays a similar role—helping reconcile updates from successive tasks into one cohesive knowledge base.

Figure 5. Sequential merging integrates new knowledge as tasks arrive, unlike parallel merging that waits until the end.
Three Merging Approaches
Parallel Merging (Task Arithmetic):
\[ \theta_{\mathrm{PM}} = \theta_0 + \sum_{t=1}^T \alpha_t \tau_t, \quad \text{where } \tau_t = \theta_t^* - \theta_0 \]

Train separate models per task and merge their updates afterward.
Exponential Moving Average (EMA):
\[ \theta_{\mathrm{EMA},k} = \lambda \, \theta_{\mathrm{EMA},k-1} + (1-\lambda) \, \theta_k \]

Maintain a running average of weights, smoothing updates across tasks—efficient and regularizing.
Sequential Merging (Proposed):
\[ \theta_t = (1-\alpha)\,\theta_{t-1} + \alpha\,\theta_t^* \]

Merge the pre-task weights \( \theta_{t-1} \) with the post-task weights \( \theta_t^* \) right after training each task. It’s lightweight and fully online.
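The three rules translate almost directly into code. The sketch below operates on flat weight tensors, with `alpha` and `lam` as hyperparameters; it mirrors the equations above rather than reproducing the authors' implementation.

```python
import torch

def parallel_merge(theta_0, task_thetas, alphas):
    # theta_PM = theta_0 + sum_t alpha_t * (theta_t^* - theta_0)
    merged = theta_0.clone()
    for theta_t, a in zip(task_thetas, alphas):
        merged += a * (theta_t - theta_0)
    return merged

def ema_update(theta_ema, theta_k, lam=0.99):
    # theta_EMA,k = lam * theta_EMA,k-1 + (1 - lam) * theta_k
    return lam * theta_ema + (1 - lam) * theta_k

def sequential_merge(theta_prev, theta_post_task, alpha=0.5):
    # theta_t = (1 - alpha) * theta_{t-1} + alpha * theta_t^*
    return (1 - alpha) * theta_prev + alpha * theta_post_task
```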
Performance Comparison

Figure 6. Sequential merging scales gracefully with more tasks, rivaling EMA.
As the number of tasks increases, parallel merging deteriorates due to interference among task vectors. Sequential merging, however, stays competitive with EMA—a strong baseline—and retains performance across longer learning sequences.
This makes sequential merging an excellent fit for continual learning systems that must grow indefinitely.
Putting It All Together: A Synergistic Toolkit
Finally, the authors test whether these strategies can be combined. Specifically, they pair the two novel contributions—consolidation and sequential merging—to build a hybrid system.
After each task:
- Merge pre- and post-task weights (sequential merging).
- Enter a consolidation phase with targeted replay.
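A hedged end-to-end sketch of this loop is below; `train_on`, `consolidate`, and the replay buffer interface are placeholders I am assuming for illustration, not the authors' API.

```python
import copy

def train_on(model, task, replay_buffer, replay_ratio):
    """Placeholder: phase-1 training on the new task with a low replay ratio."""

def consolidate(model, replay_buffer):
    """Placeholder: phase-2 training on replayed samples only."""

def continual_loop(model, tasks, replay_buffer, alpha=0.5, task_rr=0.25):
    for task in tasks:
        theta_prev = copy.deepcopy(model.state_dict())     # weights before the task
        train_on(model, task, replay_buffer, replay_ratio=task_rr)

        # Sequential merging: interpolate pre- and post-task weights.
        merged = {name: (1 - alpha) * theta_prev[name] + alpha * param
                  for name, param in model.state_dict().items()}
        model.load_state_dict(merged)

        # Consolidation: a dedicated replay-only session on past samples.
        consolidate(model, replay_buffer)
        replay_buffer.extend(task)     # keep a few of this task's samples for later
```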

Figure/Table 2. Combining sequential merging and consolidation yields the highest accuracy while using only 35% TRP.
The Results
The Sequential Merging + Consolidation combination delivers the best results:
- Highest accuracy in both CIL and CPT settings.
- Matches the performance of the full baseline while using only 35% of replay samples—a 65% reduction in replay cost.
Conclusion and Outlook
The study Scalable Strategies for Continual Learning with Replay lays the foundation for efficient lifelong learning. It shows that by combining lightweight adaptation, smart replay scheduling, and principled weight integration, continual learners can maintain high performance without excessive computation.
Key takeaways:
- LoRA: Offers natural regularization and robustness when replay is limited or tasks are small.
- Consolidation: Makes replay vastly more efficient, achieving the same results with up to 55% less data.
- Sequential Merging: Integrates knowledge smoothly, rivaling EMA in effectiveness but offering more control.
- Together: These techniques are synergistic—cutting replay usage by 65% while retaining baseline accuracy.
Though experiments focused on image classification, the ideas generalize easily to other domains—from large language models to robotics. The vision of AI that learns continually—efficiently, adaptively, and without forgetting—might be closer than ever.