Artificial intelligence has long sought a system that learns continuously—absorbing new information over time without forgetting what it already knows. Humans do this naturally: we don’t need to “retrain” our brains every time we learn a new recipe or song. Most AI models, however, are static. After a massive one-time training process, they’re frozen; teaching them anything new usually means starting over, which is computationally expensive and unsustainable.
This is where Continual Learning (CL) comes in. Its goal is simple but profound: allow models to learn from a continuous stream of data without catastrophic forgetting. One of the most practical ways to fight forgetting is replay (also known as rehearsal), where the model occasionally revisits previous samples while training on new ones. It’s like reviewing old notes before taking a new exam.
But there’s a catch. Naively applied, replay can double the cost of continual learning, since every training batch now includes both new and old data. To bring continual learning closer to scalability, Truman Hickok and colleagues, in the paper “Scalable Strategies for Continual Learning with Replay”, propose a suite of three complementary tools for learning efficiently:
- Low-Rank Adaptation (LoRA) — makes fine-tuning lighter and regularizes learning.
- Consolidation — reduces replay usage through a two-phase replay scheme.
- Model Merging — unifies task-specific weights smoothly across long sequences.
Together, they form a powerful framework that achieves the same performance with up to 65% fewer replay samples. Let’s walk through the key insights.
The Continual Learning Playground
At the heart of continual learning is the stability-plasticity dilemma: models must be plastic enough to learn new information but stable enough not to overwrite old knowledge. When a standard network is trained on a new task, its weights shift toward optimizing the new data—often at the expense of previously learned knowledge. This “catastrophic forgetting” is a fundamental problem in CL.
CL research typically organizes methods into three categories:
- Regularization-based: Add penalties that constrain important weights from drifting too far.
- Model expansion: Introduce new modules or layers for each task, isolating knowledge but risking unbounded growth.
- Replay-based: Store a small subset of old samples and mix them with new data during training, directly reinforcing past learning.
This paper focuses on making replay-based approaches scalable, since they’re broadly effective and conceptually simple.
The Replay Ratio (RR)
A critical variable is the replay ratio (RR)—the ratio of replay samples to new task samples within a batch:
\[ \mathrm{RR} = \frac{N_{\mathrm{replay}}}{N_{\mathrm{task}}} \]

An RR of 1.0 (equal parts old and new samples) is the usual baseline but doubles training cost. Lowering RR reduces the data processed per batch, speeding up training but risking more forgetting.
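As a rough illustration of how RR shapes a single training batch, here is a minimal sketch; `build_batch`, the stream, and the buffer are hypothetical stand-ins, not code from the paper.

```python
import random

def build_batch(new_task_stream, replay_buffer, n_new=32, replay_ratio=1.0):
    """Compose one training batch from new-task samples plus replayed samples."""
    new_samples = [next(new_task_stream) for _ in range(n_new)]
    n_replay = int(round(replay_ratio * n_new))        # RR = N_replay / N_task
    old_samples = random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    return new_samples + old_samples

stream = iter(range(1_000))        # stand-in for the new task's data
buffer = list(range(-500, 0))      # stand-in for stored samples from old tasks
batch = build_batch(stream, buffer, n_new=32, replay_ratio=0.25)
print(len(batch))                  # 40 items: 32 new + 8 replayed
```

At RR = 1.0 every step processes twice as many samples as the new task alone requires; at RR = 0.25 the overhead shrinks to a quarter.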

Figure 1. Lower replay ratios shorten training but may increase forgetting.
Strategy 1: LoRA — Learning on a Budget
Parameter-efficient fine-tuning (PEFT) techniques, which adapt large pre-trained models without updating all parameters, are crucial for modern AI. Low-Rank Adaptation (LoRA) is among the most popular.
LoRA assumes that the change in a model’s weights during fine-tuning can be expressed with a small, low-rank update. Instead of altering the entire weight matrix \( W_0 \), LoRA freezes it and trains two small matrices, \( A \) and \( B \), whose product approximates the change:
\[ W = W_0 + BA \]

Because the rank \( r \ll \min(d, k) \), LoRA updates far fewer parameters—making training faster and lighter.
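The idea is easy to see in code. Below is a minimal LoRA layer sketch in PyTorch; the rank, initialization, and layer shape are illustrative choices and not implementation details taken from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, k, r=8):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)  # frozen base weight
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # trainable, rank r
        self.B = nn.Parameter(torch.zeros(d, r))         # trainable, zero-init so W starts at W0

    def forward(self, x):
        W = self.W0 + self.B @ self.A   # W = W0 + BA, a rank-r update
        return x @ W.T
```

Only \( A \) and \( B \) receive gradients, so the number of trainable parameters drops from \( d \times k \) to \( r(d + k) \).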
Applying LoRA to Continual Learning
In continual learning, the authors adopt LoRA as follows:
- For each new task \( t \), create new LoRA adapters \( A_t \) and \( B_t \).
- Train only these adapters on new data mixed with replayed samples.
- Merge the learned updates into the model: \( W_t = W_{t-1} + B_t A_t \).
- Discard adapters and start fresh for the next task.
This design yields an efficient training loop. Even more importantly, LoRA imposes implicit regularization: it restricts how much the model can change per task. Reduced plasticity means less forgetting—a built-in safety net.
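Put together, the per-task loop looks roughly like the sketch below, shown on plain tensors; the training step is elided and the dimensions are arbitrary, so treat it as a schematic of the merge-and-discard cycle rather than the authors' code.

```python
import torch

d, k, r = 64, 32, 8
W = torch.randn(d, k)                                  # consolidated weights entering task t

for t in range(3):                                     # three hypothetical tasks
    A = (torch.randn(r, k) * 0.01).requires_grad_()    # fresh adapter A_t
    B = torch.zeros(d, r, requires_grad=True)          # fresh adapter B_t
    # ... train only A_t and B_t on task t data mixed with replayed samples ...
    with torch.no_grad():
        W = W + B @ A                                  # merge: W_t = W_{t-1} + B_t A_t
    # A_t and B_t are discarded; the next task starts with brand-new adapters
```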
LoRA vs. Full Fine-Tuning (FFT)
The paper compares LoRA and standard full fine-tuning across different conditions.

Figure 2. Across diverse task sizes and replay ratios, LoRA is competitive where flexibility is limited.
Highlights from the results:
- Abundant Replay: FFT dominates when there’s plenty of replay. Its flexibility helps assimilate new data when forgetting is already controlled.
- Small Tasks: In highly fragmented streams (tiny tasks, e.g., 2–3 classes), LoRA stabilizes learning and even surpasses FFT in the continual pre-training (CPT) setting.
- Sparse Replay: With low RR (like 0.1), FFT’s performance collapses due to forgetting. LoRA, however, degrades gracefully.
In short, LoRA shines under challenging, under-regularized conditions—small tasks or minimal replay. This robustness makes it ideal for scaling continual learning systems.
Strategy 2: Consolidation — Smarter, Not Harder Replay
Replay works—but it’s costly. What if you could use it smarter? The authors introduce consolidation, a two-phase replay strategy inspired by how biological brains “sleep” to consolidate memories.
The Two Phases
- Task-Learning Phase: Train on each new task using a low replay ratio (e.g., RR = 0.25). This drastically cuts compute costs during active learning.
- Consolidation Phase: After finishing the task, train only on replayed samples from previous tasks. This dedicated session refines and rebalances the model’s knowledge.

Figure 3. Consolidation separates learning (low replay) and memory restoration (post-task replay).
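A minimal sketch of this two-phase schedule is shown below; `train_step`, the buffer sampling, and the number of consolidation steps are placeholders I am assuming for illustration, not the paper's hyperparameters.

```python
import random

def train_step(model, batch):
    """Placeholder for one optimization step on a mixed batch."""

def learn_task_with_consolidation(model, new_task_data, replay_buffer,
                                  batch_size=32, task_rr=0.25,
                                  consolidation_steps=100):
    # Phase 1: task learning with a low replay ratio (small, cheap batches).
    for i in range(0, len(new_task_data), batch_size):
        new_batch = new_task_data[i:i + batch_size]
        n_old = int(task_rr * len(new_batch))
        old_batch = random.sample(replay_buffer, min(n_old, len(replay_buffer)))
        train_step(model, new_batch + old_batch)

    # Phase 2: consolidation, training only on replayed samples from past tasks.
    for _ in range(consolidation_steps):
        old_batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
        train_step(model, old_batch)
```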
Measuring Replay Efficiency
To quantify efficiency, the authors define the Total Replay Percentage (TRP)—the total replay samples used (both phases) relative to a 1.0 RR baseline:
\[ TRP = \frac{\sum_{i=1}^{T} (N_{\text{replay task},i} + N_{\text{replay consolidation},i})} {\sum_{i=1}^{T} N_{\text{replay baseline 1:1},i}} \times 100\% \]
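To make the metric concrete, here is a worked example with made-up numbers for a single task (not figures from the paper):

```python
# Illustrative numbers only, for one task.
n_task_samples       = 10_000                       # new samples seen for the task
replay_during_task   = 0.25 * n_task_samples        # phase 1 at RR = 0.25 -> 2,500
replay_consolidation = 3_000                        # dedicated post-task replay samples
replay_baseline      = 1.0 * n_task_samples         # what a 1:1 RR baseline would replay

trp = (replay_during_task + replay_consolidation) / replay_baseline * 100
print(f"TRP = {trp:.0f}%")                          # -> TRP = 55%
```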
Figure/Table 4. Consolidation achieves high accuracy even with half the usual replay.
What the Results Show
- Remarkable Efficiency: Consolidation matches the standard 1.0 RR baseline while using 45–55% fewer replay samples. For example, in class-incremental learning (CIL), 74.6% accuracy is achieved at just 55% TRP—nearly half the cost.
- Better Allocation Beats More Data: Even when TRP = 100% (same total replay samples), repurposing some of those samples for consolidation yields better performance. Accuracy improves from 73.8% to 76.1% with the same replay budget.

Figure 4. Consolidation improves consistency: fewer poorly learned tasks, higher overall accuracy.
Consolidation reframes replay from brute memory repetition into strategic memory reconciliation—doing less during learning but more afterward.
Strategy 3: Model Merging for a Unified Learner
The third pillar of scalability borrows from multi-task training: model merging (or task arithmetic), which unites specialist models into one generalist model by merging their weights.
In multi-task scenarios, merging prevents destructive interference between tasks by averaging task-specific modifications. In continual learning, merging plays a similar role—helping reconcile updates from successive tasks into one cohesive knowledge base.

Figure 5. Sequential merging integrates new knowledge as tasks arrive, unlike parallel merging that waits until the end.
Three Merging Approaches
Parallel Merging (Task Arithmetic):
\[ \theta_{\mathrm{PM}} = \theta_0 + \sum_{t=1}^T \alpha_t \tau_t, \quad \text{where } \tau_t = \theta_t^* - \theta_0 \]

Train separate models per task and merge their updates afterward.
Exponential Moving Average (EMA):
\[ \theta_{\mathrm{EMA},k} = \lambda \, \theta_{\mathrm{EMA},k-1} + (1-\lambda) \, \theta_k \]

Maintain a running average of weights, smoothing updates across tasks—efficient and regularizing.
Sequential Merging (Proposed):
\[ \theta_t = (1-\alpha)\,\theta_{t-1} + \alpha\,\theta_t^* \]

Merge the pre-task weights \( \theta_{t-1} \) with the post-task weights \( \theta_t^* \) right after training each task. It’s lightweight and fully online.
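The three rules translate almost directly into code. The sketch below operates on flat weight tensors, with `alpha` and `lam` as hyperparameters; it mirrors the equations above rather than reproducing the authors' implementation.

```python
import torch

def parallel_merge(theta_0, task_thetas, alphas):
    # theta_PM = theta_0 + sum_t alpha_t * (theta_t^* - theta_0)
    merged = theta_0.clone()
    for theta_t, a in zip(task_thetas, alphas):
        merged += a * (theta_t - theta_0)
    return merged

def ema_update(theta_ema, theta_k, lam=0.99):
    # theta_EMA,k = lam * theta_EMA,k-1 + (1 - lam) * theta_k
    return lam * theta_ema + (1 - lam) * theta_k

def sequential_merge(theta_prev, theta_post_task, alpha=0.5):
    # theta_t = (1 - alpha) * theta_{t-1} + alpha * theta_t^*
    return (1 - alpha) * theta_prev + alpha * theta_post_task
```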
Performance Comparison

Figure 6. Sequential merging scales gracefully with more tasks, rivaling EMA.
As the number of tasks increases, parallel merging deteriorates due to interference among task vectors. Sequential merging, however, stays competitive with EMA—a strong baseline—and retains performance across longer learning sequences.
This makes sequential merging an excellent fit for continual learning systems that must grow indefinitely.
Putting It All Together: A Synergistic Toolkit
Finally, the authors test whether these strategies can be combined. Specifically, they pair the two novel contributions—consolidation and sequential merging—to build a hybrid system.
After each task:
- Merge pre- and post-task weights (sequential merging).
- Enter a consolidation phase with targeted replay.
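A hedged end-to-end sketch of this loop is below; `train_on`, `consolidate`, and the replay buffer interface are placeholders I am assuming for illustration, not the authors' API.

```python
import copy

def train_on(model, task, replay_buffer, replay_ratio):
    """Placeholder: phase-1 training on the new task with a low replay ratio."""

def consolidate(model, replay_buffer):
    """Placeholder: phase-2 training on replayed samples only."""

def continual_loop(model, tasks, replay_buffer, alpha=0.5, task_rr=0.25):
    for task in tasks:
        theta_prev = copy.deepcopy(model.state_dict())     # weights before the task
        train_on(model, task, replay_buffer, replay_ratio=task_rr)

        # Sequential merging: interpolate pre- and post-task weights.
        merged = {name: (1 - alpha) * theta_prev[name] + alpha * param
                  for name, param in model.state_dict().items()}
        model.load_state_dict(merged)

        # Consolidation: a dedicated replay-only session on past samples.
        consolidate(model, replay_buffer)
        replay_buffer.extend(task)     # keep a few of this task's samples for later
```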

Figure/Table 2. Combining sequential merging and consolidation yields the highest accuracy while using only 35% TRP.
The Results
The Sequential Merging + Consolidation combination delivers the best results:
- Highest accuracy in both CIL and CPT settings.
- Matches the performance of the full baseline while using only 35% of replay samples—a 65% reduction in replay cost.
Conclusion and Outlook
The study Scalable Strategies for Continual Learning with Replay lays the foundation for efficient lifelong learning. It shows that by combining lightweight adaptation, smart replay scheduling, and principled weight integration, continual learners can maintain high performance without excessive computation.
Key takeaways:
- LoRA: Offers natural regularization and robustness when replay is limited or tasks are small.
- Consolidation: Makes replay vastly more efficient, achieving the same results with up to 55% less data.
- Sequential Merging: Integrates knowledge smoothly, rivaling EMA in effectiveness but offering more control.
- Together: These techniques are synergistic—cutting replay usage by 65% while retaining baseline accuracy.
Though experiments focused on image classification, the ideas generalize easily to other domains—from large language models to robotics. The vision of AI that learns continually—efficiently, adaptively, and without forgetting—might be closer than ever.