The Stability–Plasticity Dilemma: A Guided Tour of Continual Learning Research
Modern neural networks are extraordinarily capable — but only when their world is static. As soon as you feed them a sequence of different tasks, they typically forget earlier ones. This catastrophic forgetting is the central challenge that the field of continual learning (CL) tries to solve: how can a model remain plastic enough to learn new tasks while staying stable enough to retain what it already knows?
In 2021, De Lange et al. published one of the most thorough treatments of this problem for classification tasks: a survey, taxonomy, and large empirical study comparing 11 state-of-the-art continual learning methods across multiple datasets and model configurations. They also proposed a practical framework for tuning the crucial “stability–plasticity” hyperparameters in a way that does not violate the continual learning premise (i.e., no access to previous-task data during validation).
This article distills the paper’s main ideas and experimental insights in an approachable way. If you want to understand the landscape of continual learning (what methods exist, how they work, and when they succeed or fail), this guided tour will get you there.
What “setting” do we study?
The paper focuses on the task-incremental classification setting:
- Tasks arrive sequentially: you get task T1, you train until convergence, then task T2 arrives, and so on.
- When training on task T_t you only have access to that task’s training data (no previous-task data).
- At test time the model is told which task/head to use (multi-head setup). This simplifies evaluation relative to the harder class-incremental, single-head setting.
The ideal objective is to minimize cumulative risk across all seen tasks. If f(·; θ) is our model with parameters θ and ℓ the loss, the objective after T tasks is
\[ \sum_{t=1}^{T} \mathbb{E}_{(x,y)\sim D^{(t)}} \left[ \ell(f(x; \theta), y) \right]. \]
The challenge: while optimizing for the new task, changes to θ can hurt performance on previous tasks — that’s catastrophic forgetting.
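To make the setting concrete, here is a minimal PyTorch sketch of a multi-head, task-incremental setup together with the naive finetuning baseline that the surveyed methods try to improve on. The architecture, class counts, and training loop are illustrative assumptions, not code from the survey.

```python
import torch
import torch.nn as nn

# Hypothetical multi-head setup: a shared backbone plus one output head per task.
class MultiHeadNet(nn.Module):
    def __init__(self, feature_dim=256, classes_per_task=20, num_tasks=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(feature_dim), nn.ReLU()
        )
        self.heads = nn.ModuleList(
            nn.Linear(feature_dim, classes_per_task) for _ in range(num_tasks)
        )

    def forward(self, x, task_id):
        # At test time the task id is given (multi-head evaluation).
        return self.heads[task_id](self.backbone(x))

def finetune_sequentially(model, task_loaders, epochs=1, lr=1e-3):
    """Naive baseline: train on each task in turn using ONLY that task's data."""
    criterion = nn.CrossEntropyLoss()
    for task_id, loader in enumerate(task_loaders):
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                optimizer.zero_grad()
                loss = criterion(model(x, task_id), y)  # current-task loss only
                loss.backward()
                optimizer.step()
```

Because nothing in this loop constrains how the shared backbone changes, accuracy on earlier heads typically collapses as new tasks arrive; that is the catastrophic forgetting the rest of this article is about.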
A clean taxonomy of approaches
A practical way to understand continual learning methods is by how they preserve past information. The authors organize methods into three families:
- Replay methods — remember examples (or generate pseudo-examples) and rehearse them during new-task training.
- Regularization-based methods — add a penalty that discourages changing parameters deemed important for past tasks.
- Parameter isolation methods — dedicate disjoint parameters (or masked parameter subsets) to different tasks.
Each family has distinct trade-offs: replay methods are often strong but require storing data (privacy concerns); regularization methods are memory-light but sensitive to hyperparameters; parameter isolation methods can achieve near-zero forgetting but are limited by fixed capacity or require task labels at test time.

Fig. 1: Taxonomy of continual learning methods. The three main families (Replay, Regularization, Parameter isolation) branch into concrete subcategories and representative algorithms.
The core ideas (briefly)
- Replay methods:
- Rehearsal (store exemplars): iCaRL is a prominent example that stores per-class exemplars and uses them while learning new tasks.
- Pseudo-rehearsal (generative replay): train a generator that can produce samples from past tasks.
- Constrained optimization (gradient projection): GEM and variants constrain updates so they don’t increase loss on examples from previous tasks.
- Regularization-based:
- Data-focused (distillation): LwF uses the previous model as a teacher and forces the new model to match its outputs on current-task inputs.
- Prior-focused (parameter importance): EWC uses the Fisher information to estimate parameter importance, SI accumulates importance along the training trajectory, and MAS estimates importance via the sensitivity of the model’s outputs to parameter changes (which allows unsupervised importance estimation); a sketch of this style of penalty appears just after this list.
- Parameter isolation:
- Fixed‑capacity masks: PackNet prunes and fixes parameters per task; HAT learns attention masks to gate units per task.
- Dynamic expansion: Progressive Neural Networks grow new modules per task.
These families make different trade-offs in memory, privacy, compute, and flexibility.
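As one concrete illustration of the prior-focused family, here is a hedged sketch of an EWC-style quadratic penalty under the multi-head setup sketched earlier: after finishing a task, store a copy of the parameters and a per-parameter importance estimate (a rough diagonal Fisher here), then add a λ-weighted penalty on deviations from those stored parameters to the next task’s loss. The helper names and this particular Fisher approximation are assumptions for illustration, not the paper’s implementation.

```python
import torch
import torch.nn.functional as F

def estimate_fisher_diagonal(model, loader, task_id, n_batches=50):
    """Rough diagonal-Fisher importance: average squared gradients of the
    log-likelihood over samples from the just-finished task (an assumed,
    common approximation for EWC-style importance)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for i, (x, y) in enumerate(loader):
        if i >= n_batches:
            break
        model.zero_grad()
        log_probs = F.log_softmax(model(x, task_id), dim=1)
        # Use the observed labels as a cheap stand-in for sampling from the model.
        F.nll_loss(log_probs, y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2 / n_batches
    return fisher

def ewc_penalty(model, old_params, importance, lam=1.0):
    """Quadratic penalty discouraging changes to parameters deemed important
    for previous tasks; lam is the stability-plasticity hyperparameter."""
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in importance:
            penalty = penalty + (importance[n] * (p - old_params[n]) ** 2).sum()
    return lam / 2.0 * penalty
```

During training on the next task, the total loss would be the current-task cross-entropy plus `ewc_penalty(model, old_params, importance, lam)`, where `old_params` and `importance` are snapshotted right after the previous task; `lam` is exactly the stability–plasticity hyperparameter discussed in the next section.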
A critical methodological point: how do you tune hyperparameters fairly?
Many CL methods rely on a stability–plasticity hyperparameter: e.g., the strength λ of the parameter‑penalty term in EWC, the weight of the distillation loss in LwF, or the exemplar memory budget in replay methods. In past work, it was common to tune these by sweeping validation performance over all tasks seen so far — but that implicitly uses old-task data during validation, which the continual learning setting forbids.
To fix this, the paper proposes the Continual Hyperparameter Selection Framework: a two-phase, per-task procedure that uses only data from the current task to select hyperparameters.
- Maximal Plasticity Search: First finetune a copy of the current model on the new task to find the best achievable accuracy on this task alone (call it A*). This estimates how well the model could perform if we ignored forgetting.
- Stability Decay: Start with a very stable hyperparameter setting (favoring no forgetting). Train the CL method; if accuracy on the current task is unacceptably far below A* (e.g., lower than (1−p)A*), lower the stability (decay the hyperparameter) and repeat. Stop when the current-task accuracy is acceptable.
This approach yields a realistic hyperparameter selection strategy that does not require storing old data. It’s simple, defensible, and suitable for real-world deployment.
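A hedged, pseudocode-style sketch of the two-phase procedure, expressed over generic training and evaluation callables, could look like this; the threshold check, the multiplicative decay, and the default values of p and the decay factor are illustrative assumptions consistent with the description above.

```python
import copy

def continual_hparam_selection(model, task_data, train_cl_fn, finetune_fn,
                               eval_fn, hyperparam, p=0.2, decay=0.5,
                               min_value=1e-4):
    """Per-task, two-phase hyperparameter selection using only current-task data.

    train_cl_fn(model, data, hyperparam) trains the CL method and returns a model;
    finetune_fn(model, data) plainly finetunes; eval_fn(model, data) returns
    accuracy. These callables are placeholders for whatever method you use.
    """
    # Phase 1 -- Maximal Plasticity Search: how well could we do on this task
    # alone, ignoring forgetting entirely?
    a_star = eval_fn(finetune_fn(copy.deepcopy(model), task_data["train"]),
                     task_data["val"])

    # Phase 2 -- Stability Decay: start maximally stable, relax until the
    # current-task accuracy is within a fraction p of A*.
    while True:
        candidate = train_cl_fn(copy.deepcopy(model), task_data["train"], hyperparam)
        acc = eval_fn(candidate, task_data["val"])
        if acc >= (1 - p) * a_star or hyperparam <= min_value:
            return candidate, hyperparam
        hyperparam *= decay  # trade stability for plasticity and retry
```

Starting with a very stable hyperparameter and decaying it only when current-task accuracy falls below (1 − p)·A* means the procedure never touches data from previous tasks.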
The experimental arena
The authors run an extensive empirical comparison across:
- Datasets:
- Tiny ImageNet (balanced, 10 tasks of 20 classes each) — a controlled benchmark.
- iNaturalist (10 super-categories, highly imbalanced) — a more realistic, large-scale and imbalanced setting.
- RecogSeq (a sequence of 8 diverse recognition datasets: flowers, scenes, birds, cars, aircraft, actions, letters, digits) — a severe domain-shift stress test.
- Models:
- For Tiny ImageNet: a VGG-style backbone with four capacity variants (SMALL, BASE, WIDE, DEEP).
- For large datasets: AlexNet pretrained on ImageNet.
- Methods:
- Representative algorithms from the three families (iCaRL, GEM, LwF, EBLL, EWC, SI, MAS, mean-IMM, mode-IMM, PackNet, HAT), plus baselines (Finetuning and Joint training).
- Metrics:
- Per-task accuracy over time, final average accuracy, and a measure of forgetting (the drop in a task’s accuracy between right after it was learned and after all subsequent tasks have been learned); a small sketch of these summary numbers follows this list.
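As a hedged sketch of how these two summary numbers can be computed (my formulation, consistent with the definitions above): let acc[i][j] be the accuracy on task j measured right after training on task i.

```python
def average_accuracy_and_forgetting(acc):
    """acc[i][j]: accuracy on task j evaluated after training task i (i >= j).

    Returns the final average accuracy over all T tasks and the average
    forgetting, i.e. the mean drop from each task's accuracy when it was first
    learned to its accuracy at the end of the sequence.
    """
    T = len(acc)
    avg_acc = sum(acc[T - 1][j] for j in range(T)) / T
    # The last task cannot have been forgotten yet, so average over the first T-1.
    avg_forgetting = sum(acc[j][j] - acc[T - 1][j] for j in range(T - 1)) / max(T - 1, 1)
    return avg_acc, avg_forgetting
```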
Key experimental design decisions included: (1) fixing a sensible exemplar-memory budget (comparable to the base model size) when using replay methods; (2) using the continual hyperparameter framework to tune forgetting-related hyperparameters without leaking past data; (3) testing model-capacity, regularization (dropout / weight decay), and task ordering effects.
What happened in the tournament? (high-level takeaways)
Below I summarize the most robust findings across datasets and model configurations.
Tiny ImageNet — a “clean-room” benchmark
- Finetuning (naïve training on each new task) catastrophically forgets older tasks — it’s the low bar you want to beat.
- PackNet (parameter isolation via iterative pruning) attains the overall highest average accuracy on Tiny ImageNet and shows essentially zero forgetting after compression. It benefits from preserving (freezing) parameters for past tasks.
- Replay methods (iCaRL) and robust regularization methods (MAS) are competitive with PackNet. Increasing exemplar memory improves iCaRL.
- EWC and SI were sensitive to hyperparameter choices and model capacity. MAS tended to be more robust under the continual hyperparameter selection framework than EWC/SI.
- Deeper networks (DEEP) did not help — in fact, they often performed worse than a wider but shallower network. Overfitting and unsuitable architectural depth harm the continual learning process.
The detailed Tiny ImageNet training curves (task-by-task) visualize these behaviors: PackNet and HAT produce flat or near-flat lines (no forgetting), replay and strong regularizers follow, while finetuning decays quickly.

Fig. 2: Tiny ImageNet evaluation across tasks for many methods (BASE model). Legends indicate average accuracy and average forgetting. Flat curves indicate little or no forgetting.
Model capacity matters — but not always in the expected way
- Too-small models lack capacity and suffer more forgetting.
- Very deep models can overfit single-task training and underperform in the multi-task accumulation setting.
- The WIDE (more filters per layer) variant usually outperformed deeper variants. So “wider not deeper” often wins in this task-incremental classification regime.
Regularization (dropout, weight decay) — helpful, but method-dependent
- Dropout helped methods prone to overfitting (e.g., SI, finetuning), often increasing final accuracy even if it sometimes increased measured forgetting slightly.
- Weight decay had mixed effects: sometimes improving wide models but often interfering with importance-based regularizers (EWC, MAS), because L2 may erode parameters those methods want to protect.
- The interaction of CL-specific penalties and standard regularizers is nontrivial and method-dependent — there is no one-size-fits-all recipe.
In the wild: iNaturalist and RecogSeq (imbalanced, highly heterogeneous)
- When tasks are unbalanced and highly dissimilar, the strengths and weaknesses become amplified.
- PackNet’s zero-forgetting property makes it exceptionally robust across drastic domain shifts — it often approaches or even slightly exceeds joint training for some tasks because it preserves past parameters exactly.
- Data-focused distillation methods like LwF struggle when new-task data is from a very different distribution (distillation targets are uninformative) — LwF can collapse in these settings.
- Prior-focused methods (MAS, EWC, SI) degrade less dramatically than LwF but still struggle compared to parameter isolation in very heterogeneous sequences.
- Overall: methods that hard-protect past knowledge are safer under extreme distribution shifts, but they may lack plasticity when capacity saturates.
Here are the RecogSeq results (a sequence of eight highly different tasks). You can see how parameter-isolation approaches retain past knowledge, while others degrade more.

Fig. 3: RecogSeq evaluation on each task as tasks accumulate. PackNet and some IMM variants remain comparatively higher across tasks; finetuning and LwF show stronger forgetting.
Task order — surprisingly small effect
The authors hypothesized that curriculum-like ordering (easy → hard) might systematically improve lifelong learning. Across experiments on Tiny ImageNet and iNaturalist, the effect of different task orderings (random, easy→hard, hard→easy, and related/unrelated orderings for iNaturalist) was marginal. Some methods reacted slightly to order, but there was no universal ordering that substantially changed the landscape.
On iNaturalist the authors explored orderings based on relatedness (measured via expert-gate autoencoders). PackNet and mode-IMM were among the most order-robust methods.

Fig. 4: iNaturalist evaluation under three orderings. Overall trends persist across orderings: PackNet remains strong, some regularizers are sensitive to ordering.
Qualitative trade-offs: compute, memory, privacy, task-agnosticism
The authors complement quantitative results with a qualitative table summarizing:
- GPU memory and compute overheads for training and inference.
- Additional storage requirements (e.g., exemplars for replay, task masks for PackNet, stored parameters for IMM).
- Whether a method requires a task label at test time (many parameter isolation methods do).
- Privacy implications (replay methods that store raw images can’t preserve privacy).
These trade-offs matter a lot in practice: PackNet’s masks are compact but require task IDs at inference; replay methods are powerful but store user data.
Deep dives: a few instructive experiments
I’ll highlight three illuminating extra studies from the paper.
Epoch sensitivity for GEM vs. iCaRL
GEM was originally designed for an online (single-epoch) regime. When given many epochs, it can underperform. The authors found that, for their setup, GEM performed far better when limited to a small number of epochs (≈5) per task, matching the regime it was intended for. This reminds us that algorithm performance can depend on assumptions about online vs. offline operation.
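For intuition about GEM’s constrained updates, here is a hedged, simplified sketch of the gradient-projection idea using a single reference gradient computed on the episodic memory. This is closer to the later A-GEM simplification; GEM itself solves a small quadratic program with one inequality constraint per previous task, but the violate-then-project intuition is the same.

```python
import torch

def project_gradient(grad, ref_grad):
    """If the proposed update would increase loss on memory examples from past
    tasks (negative dot product with the reference gradient), project it onto
    the feasible half-space; otherwise leave it unchanged.

    grad, ref_grad: flattened gradient vectors of equal length.
    """
    dot = torch.dot(grad, ref_grad)
    if dot < 0:
        grad = grad - (dot / ref_grad.pow(2).sum().clamp_min(1e-12)) * ref_grad
    return grad
```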
HAT capacity allocation and failure modes
HAT learns attention masks per task. In practice, the masks often allocate capacity asymmetrically across layers: early layers may saturate quickly while later layers retain capacity. On small, homogeneous tasks (Tiny ImageNet), HAT works well. But on diverse, large-scale tasks (iNaturalist, RecogSeq), HAT’s asymmetric allocation can starve later tasks of low-level features and cause poor performance. In contrast, PackNet’s pruning-based allocation yields a more uniform layer-wise capacity distribution.
Layer-wise usage visualizations show this asymmetry clearly (HAT can saturate Conv0 early on, making future learning impossible unless you dramatically change hyperparameters — which then causes forgetting).

Fig. 5: HAT’s cumulative weight usage per layer across tasks. The DEEP model can saturate early layers quickly, while the SMALL model shows a more gradual allocation.
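To make the layer-usage analysis concrete, here is a hedged sketch of how per-layer cumulative capacity usage could be computed from HAT-style per-task masks; the mask format and the binarization threshold are assumptions, not the paper’s exact bookkeeping.

```python
import torch

def cumulative_layer_usage(per_task_masks, threshold=0.5):
    """per_task_masks: list over tasks, each a dict {layer_name: mask in [0, 1]}.

    A unit counts as 'used' in a layer once any task's mask exceeds the
    threshold; the returned per-layer fractions are what Fig. 5-style plots
    visualize over a task sequence.
    """
    usage = {}
    for masks in per_task_masks:
        for layer, m in masks.items():
            used = (m > threshold).float()
            usage[layer] = torch.maximum(usage.get(layer, torch.zeros_like(used)), used)
    return {layer: used.mean().item() for layer, used in usage.items()}
```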
Long Tiny ImageNet (40 tasks) and plasticity after saturation
Parameter isolation methods like PackNet avoid forgetting by freezing parameters. But what happens when you truly run out of capacity? In an extended experiment with 40 tasks, PackNet initially leads, but eventually iCaRL (a replay method) can surpass it as tasks accumulate and PackNet’s fixed masks limit learning of novel distributions. Moreover, if you present a drastically different new task after capacity saturation (e.g., SVHN digit recognition after the sequence), PackNet fails to learn the new distribution (very low accuracy), while methods that trade some stability for plasticity (by decaying their penalties) can still adapt.
This highlights a central trade-off: parameter isolation gives stability at the cost of eventual plasticity unless you expand the model.

Fig. 6: Long Tiny ImageNet (40 tasks) evaluation. PackNet, iCaRL, and others are compared. PackNet preserves zero forgetting but can lose plasticity after saturation. The bottom panels show PackNet (top) and HAT (bottom) layer-wise usage in a long sequence.
Practical guidance: what should you try?
Based on the experiments and their holistic analysis, here are practical recommendations for practitioners who want to deploy continual learning in classification pipelines (task-incremental with multi-head):
- If you can provide task labels at test time and you care most about avoiding forgetting (and model capacity is adequate), PackNet is a strong default: it reliably prevents forgetting and often attains the highest final accuracy.
- If storing raw exemplars is permissible (privacy constraints relaxed) and you can allocate a modest memory budget, iCaRL-like rehearsal with careful exemplar management is a powerful, simple baseline. Increasing the exemplar budget helps.
- If you cannot store data and want a memory-light solution, MAS (unsupervised importance estimation) tends to be robust and often outperforms EWC/SI under fair hyperparameter selection.
- Avoid very deep architectures without careful regularization — widening layers often outperforms adding depth in task-incremental setups.
- Use dropout to mitigate overfitting in methods that overfit; but be aware it can interfere with parameter-importance estimations for some methods.
- Carefully consider whether you need task-agnostic inference (no test-time task label). Many parameter isolation methods require the task ID, which may be unacceptable in some applications.
- Tune forgetting-related hyperparameters using only current-task data (the continual hyperparameter framework). Avoid the unrealistic strategy of validation over all tasks.
Where do we go from here? The desiderata for general continual learning
Task-incremental, multi-head classification is a convenient and important setting, but the research community should push toward more realistic desiderata:
- Constant memory footprint (memory should not grow with the number of tasks).
- No task boundaries (task labels unknown at training and test).
- Online learning (single-pass or low-latency updates).
- Forward transfer and zero-shot learning (use past knowledge to learn new tasks more quickly).
- Backward transfer (improve past tasks when learning new, related tasks).
- Task‑agnostic inference (don’t require a test-time oracle telling the task).
- Graceful, selective forgetting — the ability to free up memory by discarding genuinely outdated or unimportant knowledge.
The surveyed methods make progress on pieces of this list, but none achieve all of these properties simultaneously. Designing methods that are robust, scalable, and task-agnostic in truly online settings remains an open challenge.
A schematic comparing continual learning to related fields (multi-task, transfer, meta-learning, domain adaptation, online learning) helps position the open problems:

Fig. 7: Continual learning emphasizes sequential adaptation without forgetting, whereas related fields prioritize parallel training, offline transfer, or episodic adaptation.
Closing thoughts
This survey by De Lange et al. is an excellent resource: it combines a clear taxonomy, a practical hyperparameter-selection framework, and a large-scale, carefully controlled empirical comparison that surfaces realistic strengths and limitations of existing approaches. Two major themes emerge:
- Method performance depends strongly on task characteristics (balance/imbalance, homogeneity, domain shift), model capacity, and realistic hyperparameter selection. Results obtained under unrealistic validation schemes can be misleading.
- No silver bullet yet exists: parameter isolation is powerful for stability, replay is practical and flexible, and importance-based regularizers are memory-efficient but sensitive. The right choice depends critically on application constraints (privacy, memory, need for task-agnostic inference, and expected domain shifts).
If you work on continual learning, this paper is a must-read: it provides solid experimental baselines, a rigorous evaluation protocol, and practical guidance that will help you build more reliable lifelong-learning systems.
Further reading: read the full survey for implementation details, per-method hyperparameter choices, and extensive appendix experiments (e.g., HAT capacity analysis, IMM variants, long‑sequence PackNet behavior). The empirical results and the continual hyperparameter selection framework are especially valuable if you plan to benchmark new algorithms or deploy CL systems in practice.