Humans are remarkable lifelong learners. We can master a new language or video game without forgetting how to ride a bike or play an old instrument. This ability to continuously acquire new skills while retaining old ones is a hallmark of intelligence. For artificial intelligence, however, this has been a monumental challenge.
When a standard neural network learns a new task, it often falls victim to catastrophic forgetting — the tendency to overwrite information learned earlier. It’s like a musician who learns the guitar but forgets the piano. This limits artificial intelligence from “living” and learning in a real-world, open-ended environment.
A 2019 paper titled “Compacting, Picking and Growing for Unforgetting Continual Learning” proposes an elegant solution. The authors introduce CPG (Compacting, Picking, and Growing) — a dynamic learning loop that allows a model to master an indefinite sequence of tasks without forgetting while keeping its architecture surprisingly compact and efficient.
Let’s dive into how it works.
The Landscape of Continual Learning
Before we explore CPG, it helps to understand how other approaches handle catastrophic forgetting — and where they fall short.
Regularization-Based Methods: These add penalties during training to discourage the model from changing important weights learned from old tasks. It’s like placing “Do Not Disturb” signs on the network’s critical parameters. While helpful, these methods gradually degrade over long task sequences because data from older tasks isn’t available to re-anchor the constraints, making long-term retention unreliable.
Memory Replay Methods: Inspired by human memory, these keep a short-term cache of old data or train separate generative models (e.g., GANs) to replay synthetic samples from earlier tasks. This helps reinforce forgotten patterns but demands extra memory and complex retraining, which grows impractical over time.
Dynamic Architectures: Pioneers like ProgressiveNet preserve old knowledge perfectly by freezing weights of previously learned tasks and expanding the architecture for each new task. The trade-off? Model bloat. After 20 tasks, you end up with a network roughly 20 times bigger than the original — effective but staggeringly inefficient.
CPG cleverly combines the strengths of these methods: the flawless recall of dynamic expansion, the efficiency of compression, and the transferability of shared learning — all within a single, sustainable cycle.
The Core Idea: The CPG Learning Cycle
The CPG process unfolds in three iterative stages for each new task: Compact, Pick, and Grow. This cycle forms the foundation for sustainable lifelong learning.

Figure 1: Conceptual overview of the Compacting, Picking, and Growing (CPG) continual learning process. Each new task builds on an efficient base by pruning redundancy, reusing prior knowledge, and expanding sparingly.
Step 1: Compacting — Building the Foundation
Everything begins with the first task. The model is trained from scratch and then compacted using gradual pruning — an iterative process that removes redundant weights while retaining full performance.
Instead of pruning a large chunk of weights in one go, gradual pruning:
- Removes a tiny fraction of low-magnitude weights.
- Retrains the remaining ones to restore lost accuracy.
- Repeats this prune-and-retrain cycle until the target sparsity is reached.
After compacting, two distinct sets of weights emerge:
- Preserved Weights (\(W_1^P\)) — essential parameters for Task 1, now frozen to prevent forgetting.
- Released Weights (\(W_1^E\)) — redundant parameters freed for the next task.
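The prune-and-retrain loop and the resulting split into preserved and released weights can be sketched in a few lines of NumPy. This is a minimal single-layer illustration, not the paper's implementation: the pruning fraction, sparsity target, and helper names (`gradual_prune`, `retrain`) are our own assumptions.

```python
import numpy as np

def gradual_prune(weights, target_sparsity=0.5, step=0.1, retrain=None):
    """Sketch of gradual magnitude pruning for one weight array.

    Repeatedly zeroes the smallest-magnitude fraction of the surviving
    weights; `retrain` stands in for the retraining pass that restores
    accuracy between pruning steps.
    """
    mask = np.ones_like(weights, dtype=bool)              # True = still alive
    while mask.mean() > 1.0 - target_sparsity:
        alive = np.abs(weights[mask])
        # Prune the smallest `step` fraction of the currently alive weights.
        threshold = np.quantile(alive, step)
        mask &= np.abs(weights) > threshold
        if retrain is not None:
            weights = retrain(weights * mask)             # restore accuracy
    preserved = weights * mask    # W^P: kept (and later frozen) for this task
    released = ~mask              # positions freed for the next task (W^E)
    return preserved, released

rng = np.random.default_rng(0)
w = rng.normal(size=100)
w_p, free = gradual_prune(w, target_sparsity=0.5)
print(free.mean())  # fraction of weights released for future tasks
```

In the real method the retraining pass is full SGD on the task data; here it is left abstract so the prune-and-retrain structure stays visible.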
Step 2: Picking — Selective Knowledge Reuse
When Task 2 arrives, the network intelligently reuses what it already knows. Rather than starting entirely anew, it applies a binary mask \(M\) to the preserved weights \(W_1^P\), choosing which parts of past knowledge remain useful.
Each mask entry determines whether a particular old weight contributes:
- \(M_i = 1\): reuse that feature.
- \(M_i = 0\): ignore it for the new task.
These preserved weights remain fixed — like a read-only memory of Task 1 — while the mask learns which of them to activate. Meanwhile, the released weights \(W_1^E\) are free to adapt for the new task.
Thus, training on Task 2 updates only:
- The mask \(M\), which learns how to pick useful features; and
- The new weights \(W_1^E\), which learn task-specific information.
The total weights for Task 2 become:
\[ W_2 = (M \odot W_1^P) \cup W_1^E \]— a seamless blend of reused and new knowledge.
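A toy example of the picking step, assuming a flat weight vector. The variable names and values below are illustrative, not the paper's API; released positions are marked by zeros in the preserved array.

```python
import numpy as np

# W1_P: weights preserved (frozen) from Task 1; zeros mark released positions.
W1_P = np.array([0.8, -0.5, 0.0, 1.2, 0.0])
released = W1_P == 0.0

# M: learnable binary mask choosing which old weights to reuse for Task 2.
M = np.array([1, 0, 0, 1, 0])

# W1_E: new weights trained for Task 2, living only in the released slots.
W1_E = np.where(released, np.array([0.0, 0.0, 0.3, 0.0, -0.7]), 0.0)

# Effective weights for Task 2: masked reuse of old knowledge plus new weights.
W2 = M * W1_P + W1_E
print(W2)
```

Because the preserved weights stay frozen, only `M` and the entries of `W1_E` would receive gradients during Task 2 training.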
Step 3: Growing — Expanding Only When Needed
What if Task 2 is far more complex or differs completely from Task 1? If the combination of picked and released weights cannot achieve a predefined accuracy goal, the network grows — adding new filters or neurons just enough to meet performance demands.
Unlike ProgressiveNet, which grows for every new task, CPG expands only when necessary. This “minimalist growth” prevents parameter explosion while ensuring continual adaptability.
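The grow-when-needed rule reduces to a simple capacity check. The accuracy goal and growth step in this sketch are hypothetical values of ours, not numbers from the paper:

```python
def maybe_grow(model_capacity, val_accuracy, goal=0.9, grow_step=16):
    """Sketch of CPG's grow-when-needed rule.

    If the picked + released weights already reach the accuracy goal, the
    architecture stays fixed; otherwise a small number of new filters or
    neurons is added and training resumes.
    """
    if val_accuracy >= goal:
        return model_capacity           # no expansion needed
    return model_capacity + grow_step   # minimal growth, then retry

print(maybe_grow(128, 0.93))  # task fits: capacity unchanged
print(maybe_grow(128, 0.74))  # task too hard: grow by one small step
```

In practice this check would sit inside the training loop, so a hard task may trigger several small expansions rather than one large one.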
Completing the Loop
After Task 2 training, the new weights are themselves compacted again via pruning, generating a fresh set of preserved weights \(W_2^P\) and another pool of released weights \(W_2^E\). The model’s cumulative memory is now \(W_{1:2}^P = W_1^P \cup W_2^P\).
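As a toy view, the cumulative memory is just the union of the frozen positions from each task. The index sets below are illustrative, chosen only to show that the per-task preserved sets are disjoint:

```python
# Positions (indices into the weight array) frozen after each task.
W1_P = {0, 3}            # frozen for Task 1
W2_P = {2}               # frozen for Task 2, taken from Task 1's released slots

# W^P_{1:2}: cumulative memory after two tasks.
W_12 = W1_P | W2_P
print(sorted(W_12))
```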
This cyclical Compact–Pick–Grow loop repeats for each subsequent task, enabling a lifelong learner that grows smarter and leaner with every experience.
Experiments: How Well Does CPG Work?
The authors rigorously tested CPG across diverse scenarios — from benchmark image datasets to realistic facial recognition pipelines.
Experiment 1: CIFAR-100 — Learning 20 Tasks Sequentially
The CIFAR-100 dataset was divided into 20 tasks, each with 5 classes. The model learned all tasks in sequence using the VGG16-BN architecture.
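The 20-task split is easy to reproduce in code. This is a sketch of the partitioning scheme only; the actual class ordering used in the paper may differ:

```python
# Partition CIFAR-100's 100 class labels into 20 disjoint tasks of 5 classes.
classes = list(range(100))
tasks = [classes[i:i + 5] for i in range(0, 100, 5)]
print(len(tasks), tasks[0])  # 20 tasks; the first holds classes 0-4
```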

Figure 2: Accuracy trends for Tasks 1, 5, 10, and 15 as more tasks are learned. CPG maintains consistent performance (flat red line), demonstrating zero forgetting, while DEN suffers catastrophic decline.
The results are revealing — CPG maintains near-constant performance for earlier tasks even after 20 rounds of training, whereas competing methods like DEN experience steep accuracy collapse. Fine-tuning performs moderately but also forgets previous tasks over time.
Moreover, later tasks benefit from accumulated experience. CPG achieves higher initial accuracy than fine-tuning, reflecting positive knowledge transfer from its compact history.
The Power of Picking
To understand how much the “Picking” mechanism matters, the authors compared CPG against:
- PackNet: Compacts but doesn’t grow.
- PAE (Pack-and-Expand): Compacts and grows, but reuses all old weights indiscriminately.

Table 1: CPG consistently achieves the highest average accuracy and remains most compact. Selectively picking useful weights is more effective than reusing all old weights.
The findings are decisive: when weights are selectively picked rather than all reused, performance improves. PAE’s indiscriminate reuse introduces inertia that hinders adaptation, and PackNet eventually runs out of capacity. CPG handles both issues gracefully.
Model Efficiency
Compared to training 20 separate models (which would require 20× the parameters), CPG packs all tasks into a model roughly 2.4× the original size — a tremendous efficiency gain.

Figure/Table: CPG compresses the knowledge of 20 tasks into a model roughly 2.4× the size of a single one, instead of the 20× needed when storing individual networks.
Experiment 2: Fine-Grained Classification
CPG was next tested on six fine-grained datasets — starting from ImageNet and sequentially learning specialized tasks like birds (CUBS), cars, flowers, art paintings, and sketches — using a ResNet50 backbone.

Table 6: On fine-grained image classification benchmarks, CPG surpasses strong baselines like fine-tuning from ImageNet, maintaining both precision and compactness.
Here too, CPG excelled. After compressing the large ResNet and progressively learning subtasks, its performance beats fine-tuning and other continual learning methods like Piggyback or ProgressiveNet, while occupying the smallest memory footprint. ProgressiveNet’s unbounded expansion reaches hundreds of megabytes per task, while CPG stays lean at approximately one-fifth the total size.
Experiment 3: Real-World Facial Learning
For a realistic multi-domain application, CPG learned four facial-informatic tasks sequentially: face verification, gender classification, expression recognition, and age estimation. Starting with the SphereFace CNN, each task reused and compacted prior knowledge.

Table 7: CPG integrates facial tasks into one unified model with virtually no expansion (1×) while matching or exceeding fine-tuned performances.
Even after tackling all four very different objectives, CPG’s total expansion remained effectively 1×, meaning the same network accommodated all tasks without forgetting — a compelling demonstration of real-world applicability.
Why It Matters
CPG isn’t just another academic experiment; it marks an important step toward truly lifelong AI. By combining pruning (compression), masking (reuse), and selective expansion (adaptation):
- It eliminates catastrophic forgetting — old knowledge is preserved exactly.
- It grows efficiently — the network expands minimally and compactly over time.
- It enables positive transfer learning — prior experience enhances new skill acquisition.
This makes CPG ideal for systems meant to evolve — from robotics and autonomous driving to personal assistants that must continuously learn without ever resetting their memories.
Just as humans learn by refining what they know and adding new experiences, CPG demonstrates that to grow smarter, an AI may first need to become more compact.