Imagine teaching an AI to recognize animals. You train it to identify dogs and cats—it gets really good at it. Then you teach it birds and fish. Suddenly, when you ask it to recognize a cat again, it struggles. It seems to have forgotten what it learned before.
This frustrating phenomenon, known as catastrophic forgetting, is one of the biggest hurdles in creating AI systems that can learn continuously like humans.
Modern “foundation models”—large-scale pre-trained neural networks such as Vision Transformers (ViTs)—excel at transferring knowledge across tasks. But when they’re asked to learn sequentially, as in Continual Learning (CL), they often falter. Most existing solutions either store massive “prompt pools” or rehearse old data to remember previous tasks—approaches that are inefficient and unsustainable in real deployments.
Wouldn’t it be ideal if AI could learn new tasks without forgetting old ones, without storing data, and without slowing down inference?
That’s the problem tackled in the research paper “SD-LORA: Scalable Decoupled Low-Rank Adaptation for Class Incremental Learning.” The proposed method—Scalable Decoupled LoRA (SD-LoRA)—makes foundation models better at continual learning by separating what they learn into magnitude and direction. With this design, the model becomes rehearsal-free, inference-efficient, and fully end-to-end trainable for continual learning.
In this explainer, we’ll explore how SD-LoRA works, why it’s effective both empirically and theoretically, and how it surpasses other state-of-the-art CL approaches.
The Challenge of Continual Learning
Traditional machine learning assumes all training data is available upfront. In contrast, Continual Learning (CL) asks a model to learn sequentially—adapting to new tasks one after another. For example, a robot might first learn navigation, then object recognition, then manipulation skills.
The central challenge is maintaining a balance between plasticity (learning new things) and stability (retaining what’s learned). Catastrophic forgetting occurs when new learning interferes with old knowledge, destroying stability.
In class-incremental learning, a demanding form of CL, the model learns new classes over time and must classify samples from any known class—without being told which task they belong to. This pushes the model to generalize across all learned tasks continuously.
Existing CL techniques fall into two camps:
- Prompt-based methods: These learn small sets of trainable vectors, called prompts, that are prepended to the input or injected into intermediate representations. At test time, each input must select the right prompt from an ever-growing pool, hurting inference speed and scalability.
- Rehearsal-based methods: These store samples from old tasks to replay during new training rounds, helping prevent forgetting but violating data privacy and scalability constraints.
The authors of SD-LoRA designed their method to achieve all three desirable CL properties simultaneously:
- Rehearsal-Free: No need to store data from previous tasks.
- Inference-Efficient: No prompt selection or task-specific modules during testing.
- End-to-End Optimized: All parameters are trained together for the continual learning objective.
Table 1 shows how SD-LoRA stands out by satisfying all three conditions compared to existing approaches.

Table 1: Comparison of foundation-model-based CL methods. SD-LoRA uniquely offers rehearsal-free learning, efficient inference, and end-to-end optimization.
Background: How LoRA Enables Efficient Fine-Tuning
Before diving into SD-LoRA, let’s revisit Low-Rank Adaptation (LoRA)—a lightweight fine-tuning technique commonly used with large models.
Instead of updating the full weight matrix \( W_0 \) of a model layer (which could contain millions of parameters), LoRA adds a compact, trainable low-rank update. The original weights are frozen, and only the small matrices \( A \) and \( B \) are trained to represent the update:
\[ \Delta W = A B, \quad A \in \mathbb{R}^{m \times r}, \; B \in \mathbb{R}^{r \times n}, \quad r \ll \min(m, n) \]

This drastically reduces the number of parameters that need training.

Figure 1(a): LoRA fine-tuning adds a low-rank update \( \Delta W = A B \) to frozen pretrained weights \( W_0 \).
The updated layer output becomes:
\[ \boldsymbol{h}' = (\mathbf{W}_0 + \mathbf{A}\mathbf{B}) \boldsymbol{x} \]

LoRA works well for adapting to a single task, but repeated fine-tuning across many tasks leads to interference and forgetting. SD-LoRA addresses this next.
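To make this concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer (an illustration under assumed shapes and initialization, not the paper's implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with frozen base weights W0 plus a trainable low-rank update AB."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        # Pretrained weight, frozen during fine-tuning.
        self.W0 = nn.Parameter(torch.randn(out_dim, in_dim), requires_grad=False)
        # Low-rank factors: A (out_dim x r) and B (r x in_dim); only these are trained.
        self.A = nn.Parameter(torch.randn(out_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, in_dim))  # zero init => no change at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h' = (W0 + AB) x
        return x @ (self.W0 + self.A @ self.B).T

layer = LoRALinear(768, 768, rank=8)   # ViT-B/16 hidden size, as an example
h = layer(torch.randn(4, 768))         # h.shape == (4, 768)
```

With rank 8, the trainable factors hold about 2 × 768 × 8 ≈ 12K parameters per layer, versus roughly 590K for the full 768 × 768 matrix.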
Decoupling Magnitude and Direction: The Core of SD-LoRA
Any LoRA update \( \Delta W = A B \) inherently contains two elements:
\[ \Delta \mathbf{W} = \|\mathbf{A}\mathbf{B}\|_F \cdot \frac{\mathbf{A}\mathbf{B}}{\|\mathbf{A}\mathbf{B}\|_F} \]

- The magnitude (Frobenius norm): \( \|\mathbf{A}\mathbf{B}\|_F \)
- The direction (normalized matrix): \( \frac{\mathbf{A}\mathbf{B}}{\|\mathbf{A}\mathbf{B}\|_F} \)
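Numerically, this split is a one-liner; a minimal sketch:

```python
import torch

delta_W = torch.randn(768, 768) * 0.01   # stand-in for a learned update A @ B

magnitude = torch.linalg.norm(delta_W)   # Frobenius norm ||AB||_F (a scalar)
direction = delta_W / magnitude          # unit-Frobenius-norm matrix AB / ||AB||_F

# The two pieces reconstruct the original update exactly:
assert torch.allclose(magnitude * direction, delta_W)
```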
Past work found that the direction matters more for transfer learning success than raw magnitude. Inspired by this, SD-LoRA learns magnitude and direction separately and incrementally.
When training the first task \( \mathcal{T}_1 \):
- Learn LoRA component \( A_1 B_1 \) (both magnitude and direction).
For later tasks (\( \mathcal{T}_2, \mathcal{T}_3, ... \)):
- Freeze old directions: keep previously learned \( \overline{A_k B_k} \).
- Add a new trainable direction: \( \overline{A_t B_t} \) for the current task.
- Learn magnitudes: scalars \( \{\alpha_k\} \) that weight all directions.
Thus, after task \( \mathcal{T}_t \), a layer’s updated output becomes:
\[ h' = \left(W_0 + \alpha_1 \overline{A_1 B_1} + \alpha_2 \overline{A_2 B_2} + \cdots + \alpha_t \overline{A_t B_t}\right) x \]
Figure 1(b): In SD-LoRA, earlier task directions are frozen while new tasks introduce trainable magnitudes and directions.
This design lets the model reuse old directions (stability) and introduce new ones (plasticity). The learned magnitudes \( \alpha_k \) dynamically re-weight contributions from prior tasks—preventing catastrophic forgetting.
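A minimal sketch of what this looks like inside a layer (the method names, initialization, and per-task bookkeeping here are assumptions; the authors' released code may organize this differently):

```python
import torch
import torch.nn as nn

class SDLoRALinear(nn.Module):
    """Sketch: frozen W0, frozen normalized past directions, trainable magnitudes."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(dim, dim), requires_grad=False)
        # Normalized A_k B_k of finished tasks (a real implementation would
        # register these as buffers so they move with the model).
        self.frozen_dirs: list[torch.Tensor] = []
        self.alphas = nn.ParameterList()   # one trainable magnitude per task
        self.dim, self.rank = dim, rank
        self.new_task()

    def new_task(self):
        # Fresh trainable low-rank pair and magnitude for the incoming task.
        self.A = nn.Parameter(torch.randn(self.dim, self.rank) * 0.01)
        self.B = nn.Parameter(torch.randn(self.rank, self.dim) * 0.01)
        self.alphas.append(nn.Parameter(torch.ones(1)))

    def end_task(self):
        # Keep only the unit-norm direction of the task just finished.
        with torch.no_grad():
            AB = self.A @ self.B
            self.frozen_dirs.append((AB / torch.linalg.norm(AB)).clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h' = (W0 + sum_k alpha_k * dir_k) x, with only the last direction trainable.
        AB = self.A @ self.B
        W = self.W0 + self.alphas[-1] * AB / torch.linalg.norm(AB)
        for alpha, D in zip(self.alphas[:-1], self.frozen_dirs):
            W = W + alpha * D
        return x @ W.T
```

Between tasks, `end_task()` freezes the current direction and `new_task()` adds the next one; every magnitude stays trainable throughout, which is what lets the model re-weight old knowledge.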
The Mechanism Behind SD-LoRA’s Success
Empirical analysis reveals three important findings that explain how SD-LoRA succeeds where others fail.
Finding 1: Fine-Tuned Models Cluster in a Shared Region
The authors fine-tuned ViT models on five ImageNet-R tasks separately and measured distances between their optimal weights. They found these solutions cluster together in weight space—indicating a shared low-loss region encompassing many tasks.

Figure 2(a): Optimal weights from different tasks lie close together, suggesting a shared low-loss area.
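The measurement behind Figure 2(a) is straightforward to reproduce in spirit; a sketch with placeholder weights:

```python
import torch

# Stand-ins for the flattened fine-tuned weights of five separate tasks;
# in the actual study these come from five independent fine-tuning runs.
task_weights = [torch.randn(100_000) for _ in range(5)]

for i in range(5):
    for j in range(i + 1, 5):
        dist = torch.linalg.norm(task_weights[i] - task_weights[j]).item()
        print(f"distance(task {i + 1}, task {j + 1}) = {dist:.2f}")
```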
When the first task’s direction was fixed and only magnitudes adjusted for later tasks, performance surpassed vanilla LoRA (which overwrites directions).

Figures 2(b)-(c): Fixing the initial direction yields better accuracy than continually changing directions.
This suggests that well-chosen early directions guide the model toward regions beneficial to all tasks.
Finding 2: Early Directions Matter Most
Over time, SD-LoRA learns new directions for each task—but analysis shows early ones dominate. At first, new directions align closely with old ones, gradually diverging later.

Figures 3(a)-(c): Early LoRA directions are reused the most; corresponding magnitudes increase while later ones diminish.
The magnitudes for initial tasks (\( \alpha_1, \alpha_2 \)) tend to be higher, implying these directions underpin most of the model’s capabilities.
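Direction reuse of this kind can be quantified with a flattened cosine similarity between LoRA updates; a small sketch (the inputs below are stand-ins):

```python
import torch

def direction_similarity(D1: torch.Tensor, D2: torch.Tensor) -> float:
    """Cosine similarity between two update matrices, viewed as flat vectors."""
    v1, v2 = D1.flatten(), D2.flatten()
    return (v1 @ v2 / (v1.norm() * v2.norm())).item()

# A value near 1 means a later task is largely reusing an earlier direction.
print(direction_similarity(torch.randn(768, 768), torch.randn(768, 768)))
```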
Finding 3: SD-LoRA Follows a Low-Loss Path Across Tasks
By interpolating between weights learned for sequential tasks, the authors observed that SD-LoRA maintains high performance for earlier tasks while improving on new ones—unlike vanilla LoRA, which suffers accuracy drops.

Figure 4(a)-(c): SD-LoRA traces a trajectory through overlapping low-loss regions, preserving prior task accuracy.
This indicates SD-LoRA’s optimization naturally converges to a region that jointly minimizes loss for all tasks—achieving continual learning without rehearsal.
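The interpolation probe itself is simple; a sketch assuming a hypothetical `eval_accuracy` helper that loads a state dict into the model and returns accuracy on a chosen task:

```python
def loss_path(w_old: dict, w_new: dict, eval_accuracy, steps: int = 11):
    """Evaluate accuracy along the straight line between two weight snapshots."""
    for i in range(steps):
        lam = i / (steps - 1)
        mixed = {k: (1 - lam) * w_old[k] + lam * w_new[k] for k in w_old}
        print(f"lambda = {lam:.1f}, accuracy = {eval_accuracy(mixed):.3f}")
```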
Making It Leaner: SD-LoRA Variants
Because later directions contribute less, the authors introduced two parameter-efficient variants:
SD-LoRA-RR (Rank Reduction): Reduce the rank of LoRA matrices for later tasks:
\[ r_{1} = \dots = r_{\mu} > \dots > r_{\nu} = \dots = r_{N} \]

This lowers memory and computation cost with minimal performance loss.
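One simple way to realize such a schedule (the paper's exact schedule may differ; this linear decay with a floor is an assumption):

```python
def rank_schedule(num_tasks: int, r_max: int = 16, r_min: int = 4) -> list[int]:
    """Monotonically non-increasing LoRA ranks across tasks, with a floor of r_min."""
    span = max(num_tasks - 1, 1)
    return [r_max - (r_max - r_min) * t // span for t in range(num_tasks)]

print(rank_schedule(10))  # [16, 15, 14, 12, 11, 10, 8, 7, 6, 4]
```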
SD-LoRA-KD (Knowledge Distillation): Before adding a new direction, check if it can be expressed as a linear combination of existing directions:
\[ \{\Delta \alpha_k\}_{k=1}^{t-1} = \arg\min_{\{\alpha'_k\}} \left\| \overline{A_t B_t} - \sum_{k=1}^{t-1}\alpha'_k \overline{A_k B_k} \right\|_F^2 \]

If the residual error is below a threshold, the new knowledge is merged into the existing magnitudes (\( \alpha_k \leftarrow \alpha_k + \Delta \alpha_k \)), avoiding any parameter expansion.
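This is an ordinary least-squares problem over flattened directions; a sketch (the tolerance value is an assumption):

```python
import torch

def try_merge(new_dir: torch.Tensor, old_dirs: list[torch.Tensor], tol: float = 0.1):
    """Fit the new direction as a linear combination of frozen old directions.

    Returns the coefficients (Delta alpha_k) if the residual is below `tol`,
    else None, meaning the new direction must be kept as its own component.
    """
    basis = torch.stack([d.flatten() for d in old_dirs], dim=1)   # (m*n, t-1)
    target = new_dir.flatten().unsqueeze(1)                       # (m*n, 1)
    coeffs = torch.linalg.lstsq(basis, target).solution           # (t-1, 1)
    residual = torch.linalg.norm(basis @ coeffs - target).item()
    return coeffs.squeeze(1) if residual < tol else None
```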
Putting SD-LoRA to the Test
The team evaluated SD-LoRA and its variants on several continual learning benchmarks—ImageNet-R, ImageNet-A, and DomainNet—using ViT-B/16 backbones.

Table 2: SD-LoRA delivers the highest accuracy across tasks and scales better as task count increases.

Table 3: SD-LoRA demonstrates robust results across difficult benchmarks.
Across all datasets, SD-LoRA consistently outperforms existing approaches, achieving top scores in both overall accuracy (Acc) and average anytime accuracy (AAA). The margins widen as the number of tasks grows—showing superior scalability.
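For reference, these two metrics are typically computed from an accuracy matrix where `acc[t][j]` is the accuracy on task `j` after training task `t`; a sketch of the standard definitions:

```python
def overall_acc(acc: list[list[float]]) -> float:
    """Acc: mean accuracy over all tasks after the final task is learned."""
    return sum(acc[-1]) / len(acc[-1])

def average_anytime_acc(acc: list[list[float]]) -> float:
    """AAA: average over training steps of the mean accuracy on tasks seen so far."""
    per_step = [sum(row[: t + 1]) / (t + 1) for t, row in enumerate(acc)]
    return sum(per_step) / len(per_step)
```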

Figure 5: SD-LoRA maintains high accuracy as tasks increase, outperforming competing methods.
In ablation studies, removing either magnitude rescaling or direction decoupling noticeably hurt accuracy, confirming that both are essential to the stability–plasticity balance.
Efficiency comparisons further highlight SD-LoRA’s strengths: it matches the computational cost of the most efficient method, InfLoRA, while requiring zero stored data. The rank-reduction variant (SD-LoRA-RR) is especially resource-friendly.

Table 5: SD-LoRA matches the most efficient methods with no stored samples and minimal parameters.
Conclusion: A Step Toward Lifelong Learning in AI
SD-LoRA offers a compelling recipe for true continual learning in foundation models:
- Rehearsal-Free: It learns sequentially without data storage.
- Inference-Efficient: The final model handles all tasks directly—no selection logic needed.
- End-to-End Optimized: All parameters jointly adapt for the CL objective.
Its mechanism—decoupling magnitude and direction—reveals how foundation models can evolve through low-loss trajectories shared across tasks. Early learned directions shape long-term stability, while later magnitudes fine-tune adaptability.
This approach could reshape how AI learns continuously, enabling models that grow smarter over time without forgetting what made them intelligent in the first place. As research continues, SD-LoRA’s principles might extend beyond ViTs to other architectures and fine-tuning strategies, paving the way for scalable, lifelong learning systems.