Deep learning models are exceptional at learning specific tasks. Train a model to classify dogs, and it will learn to do so with impressive accuracy. But ask that same model to learn how to classify cars afterward, and you encounter a notorious problem: Catastrophic Forgetting. In the process of learning about cars, the model completely forgets what a dog looks like.

This is the central challenge of Continual Learning (CL)—how can we teach machines to learn sequentially, task after task, just like humans do, without erasing previous knowledge?

Most current research focuses on preserving the “backbone” of the neural network (the part that extracts features from images). However, a recent paper titled “KAC: Kolmogorov-Arnold Classifier for Continual Learning” suggests we might be looking in the wrong place. The researchers propose that the problem—and the solution—lies in the classifier itself.

Inspired by the recent buzz around Kolmogorov-Arnold Networks (KANs), the authors introduce a novel classification layer that replaces standard linear classifiers. By swapping out dot products for Gaussian Radial Basis Functions (RBFs), they achieve state-of-the-art results in preventing catastrophic forgetting.

In this post, we’ll break down why linear classifiers fail at memory, why standard KANs initially failed to fix it, and how the Kolmogorov-Arnold Classifier (KAC) offers a robust solution.


The Problem with Linear Classifiers

To understand why forgetting happens, we first need to look at how modern classifiers work. In a typical Class Incremental Learning (CIL) setup, you have a backbone (like a Vision Transformer) that turns an image into a feature vector (an embedding). You then feed this vector into a Linear Classifier.

Mathematically, a linear classifier calculates a logit (score) \(l\) based on the feature embedding \(F(x)\) and a weight matrix \(h\):

\[ l = h^{\top} F(x) \]

In simple terms, this is a dot product. The classifier checks the similarity between the image features and the learned class weights.

The problem arises when you add new tasks. A linear classifier is global. When the model updates its weights to learn a new class (e.g., “Car”), it adjusts weights that affect the entire feature space. This inevitably interferes with the weights that were fine-tuned for the old class (“Dog”).
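
To see this interference concretely, here is a minimal PyTorch sketch (mine, not the paper's): with a softmax cross-entropy loss, a single training step on a new-class sample produces nonzero gradients on every class's weight row, so the rows encoding the old classes get nudged as well.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: 8-dim features, 3 old classes ("dog", ...) + 1 new class ("car").
feat_dim, num_classes = 8, 4
W = torch.randn(num_classes, feat_dim, requires_grad=True)  # linear classifier weights

x = torch.randn(feat_dim)          # embedding of a "car" image from the backbone
y = torch.tensor(3)                # label of the newly added class

logits = W @ x                     # dot products: one score per class
loss = F.cross_entropy(logits.unsqueeze(0), y.unsqueeze(0))
loss.backward()

# Every row of W receives a nonzero gradient, including the rows that encode
# the old classes: softmax couples all class scores, so learning "car"
# pushes the "dog" weights around too.
print(W.grad.abs().sum(dim=1))     # all entries > 0
```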

Comparison between Linear Classifier and KAC.

As shown in Figure 1(a) above, a conventional linear classifier activates irrelevant weights across all tasks. When the model tries to learn Task 2, it clumsily overwrites the knowledge from Task 1.

The authors of the paper argue that we need Locality. We need a classifier that only updates the specific parts of the network relevant to the current input, leaving the rest of the knowledge base untouched.


The Inspiration: Kolmogorov-Arnold Networks (KAN)

This is where the Kolmogorov-Arnold Network (KAN) comes into play. Unlike Multi-Layer Perceptrons (MLPs) which put activation functions on the nodes, KANs place learnable activation functions on the edges (connections).

KANs are based on the Kolmogorov-Arnold Representation Theorem, which states that any continuous multivariate function can be written as a finite composition of sums of continuous univariate functions:

\[ f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right) \]

In a standard KAN, these univariate functions \(\phi\) are parameterized as B-splines (a type of piecewise polynomial).

\[ \phi(x) = w_b\, b(x) + w_s \sum_i c_i B_i(x) \]

Here the \(B_i\) are B-spline basis functions with learnable coefficients \(c_i\), and \(b(x)\) is a residual base function (the “shortcut” term) that helps training.

Why is this exciting for memory? Splines are local. If you tweak a spline curve at one point, it doesn’t change the shape of the curve at a distant point. In theory, this means a KAN should be able to learn new information in one region of the feature space without disturbing old information in another region.
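
To make that locality concrete, here is a small, self-contained illustration (my own toy code, not from the paper) using SciPy: nudging a single B-spline coefficient only changes the curve on the interval where that basis function has support, leaving the rest of the curve untouched.

```python
import numpy as np
from scipy.interpolate import BSpline

# Cubic B-spline on [0, 10] with a handful of learnable-style coefficients.
k = 3                                                            # spline degree
t = np.concatenate(([0] * k, np.linspace(0, 10, 8), [10] * k))   # knot vector
c = np.zeros(len(t) - k - 1)                                     # coefficients (flat curve)

xs = np.linspace(0, 10, 201)
before = BSpline(t, c, k)(xs)

# "Update" a single coefficient, as a gradient step on one basis function might.
c2 = c.copy()
c2[5] += 1.0
after = BSpline(t, c2, k)(xs)

changed = np.abs(after - before) > 1e-12
print(f"curve changed only for x in [{xs[changed].min():.2f}, {xs[changed].max():.2f}]")
# The change is confined to the support of that one basis function;
# the rest of the curve is bit-for-bit identical.
```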

The Failed Experiment: Why Vanilla KAN Didn’t Work

Driven by this theory, the researchers tried a straightforward experiment: they took existing Continual Learning methods and simply replaced the linear classifier with a standard KAN layer using B-splines.

The result? It got worse.

Performance comparison showing vanilla KAN performing poorly.

As Figure 2 illustrates, the “KAN” (green line) and “KAN w/o Shortcut” (orange line) performed significantly worse than the standard linear baseline (gray line).

The Curse of Dimensionality

Why did the sophisticated KAN fail? The answer lies in the Curse of Dimensionality (COD).

While splines work beautifully for low-dimensional data (like the toy regression tasks used in the original KAN paper), they struggle immensely with high-dimensional data, such as image embeddings from a Vision Transformer (often 768 dimensions or more).

A single spline-based layer cannot effectively approximate the compositional structure of such high-dimensional embeddings. To compensate for the classifier’s weakness, training forced the backbone to change drastically to accommodate each new task. These drastic changes shattered the feature space, leading to more catastrophic forgetting than a simple linear layer would have caused.


The Solution: The Kolmogorov-Arnold Classifier (KAC)

The researchers realized that the structure of KAN (learnable activations on edges) was correct, but the basis function (B-splines) was wrong for this application. They needed a function that could handle high dimensions and maintain the property of locality.

They found their answer in Radial Basis Functions (RBF), specifically Gaussian functions.

Switching Splines for Gaussians

The researchers replaced the B-splines with Gaussian RBFs. An RBF is a function whose value depends only on the distance from a center point.

\[ \phi_i(x) = \exp\!\left( -\frac{(x - c_i)^2}{2\sigma^2} \right) \]

Here, \(c_i\) represents a center point (a specific location in the feature space) and \(\sigma\) represents the width of the Gaussian curve.

When you use Gaussian RBFs in a KAN structure, the activation for each dimension becomes a Gaussian Mixture Model:

\[ \phi(x) = \sum_{i=1}^{N} w_i \exp\!\left( -\frac{(x - c_i)^2}{2\sigma^2} \right) \]

This is a game-changer. Gaussian mixtures are excellent at modeling distributions in high-dimensional spaces. They naturally create “bumps” in the feature space. If an input falls far away from a center \(c_i\), the output is near zero—meaning that part of the network effectively “sleeps” and isn’t updated. This restores the locality that B-splines promised but failed to deliver.
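
Here is a minimal sketch of that behavior (again my own toy code, not the authors’): a per-channel Gaussian-mixture activation whose output, and therefore whose gradient, is essentially zero for inputs far from every center, so those parameters barely move.

```python
import torch

def gaussian_rbf(x, centers, sigma=1.0):
    """Gaussian RBF basis: one 'bump' per center, applied to a scalar channel."""
    return torch.exp(-((x.unsqueeze(-1) - centers) ** 2) / (2 * sigma**2))

centers = torch.linspace(-2.0, 2.0, 8, requires_grad=True)   # learnable centers
w = torch.randn(8, requires_grad=True)                        # mixture weights

x_near = torch.tensor(0.3)   # input inside the covered region
x_far = torch.tensor(9.0)    # input far from every center

for x in (x_near, x_far):
    out = (w * gaussian_rbf(x, centers)).sum()   # Gaussian-mixture activation
    out.backward()
    print(f"x={x.item():>4}: activation={out.item():.4f}, "
          f"|grad wrt centers|={centers.grad.abs().sum().item():.2e}")
    centers.grad.zero_()
    w.grad.zero_()
# For the far-away input, both the activation and the gradients are ~0:
# that part of the classifier "sleeps" and is not overwritten.
```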

The Architecture of KAC

The proposed Kolmogorov-Arnold Classifier (KAC) works as a plug-and-play replacement for linear classifiers. Here is the pipeline:

  1. Input: Take the feature embedding \(F(x)\) from the backbone.
  2. Layer Norm: Normalize the features to stabilize training.
  3. RBF Activation: Pass the features through a set of learnable Gaussian RBFs.
  4. Weighting: Multiply by a learnable weight matrix to get the final class scores.

Pipeline of the KAC.

Figure 3 visualizes this process. Notice how the RBFs (in the middle) map the input to different Gaussian distributions. This allows the model to “select” specific activation ranges for each channel.

The final mathematical formulation of the classifier looks like this:

\[ l = W\, \phi\big(\mathrm{LayerNorm}(F(x))\big) \]

where \(\phi\) applies the learned Gaussian RBF activations channel-wise and \(W\) is the classifier weight matrix.

Unlike a linear classifier, whose response grows without bound along its weight direction, the KAC produces bounded, local responses around its Gaussian centers. If a new task occupies a different part of the feature space, the Gaussians for the old task simply won’t activate, protecting them from being overwritten.
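
Putting the pieces together, here is a compact sketch of a KAC-style head following the four-step pipeline above (layer norm, per-channel Gaussian RBF activations, linear weighting). This is my reading of the pipeline rather than the official implementation; the class name, the number of centers, and the initialization are assumptions.

```python
import torch
import torch.nn as nn

class KACHead(nn.Module):
    """Sketch of a Kolmogorov-Arnold Classifier head (my reading of the
    described pipeline, not the official code)."""
    def __init__(self, feat_dim, num_classes, num_centers=8, sigma=1.0):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)                        # step 2: layer norm
        self.centers = nn.Parameter(torch.linspace(-2, 2, num_centers))  # RBF centers
        self.sigma = sigma
        # step 4: weights over (feature channel x RBF center) -> class logits
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim, num_centers) * 0.02)

    def forward(self, feats):                                     # feats: (B, feat_dim)
        z = self.norm(feats)                                      # (B, D)
        # step 3: per-channel Gaussian RBF activations, shape (B, D, C)
        phi = torch.exp(-((z.unsqueeze(-1) - self.centers) ** 2) / (2 * self.sigma**2))
        # step 4: weight and sum over channels and centers -> (B, num_classes)
        return torch.einsum("bdc,kdc->bk", phi, self.weight)

head = KACHead(feat_dim=768, num_classes=10)
logits = head(torch.randn(4, 768))
print(logits.shape)   # torch.Size([4, 10])
```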


Visualizing Locality: Why KAC Remembers

The primary claim of this paper is that KAC mitigates forgetting through locality. We can verify this by looking at the activation maps.

In a linear classifier, you would typically see activations spread across many channels. In KAC, because of the Gaussian RBFs, we expect to see sparse, specific activations.

Heatmap of activation maps.

Figure 4 shows the activation levels of different feature channels (x-axis) for different classes (y-axis).

  • Red areas indicate high activation (interest).
  • Blue areas indicate low activation.

Notice how distinct the patterns are. For any given class, only a small subset of channels is highly activated. When the model updates for “Task 1,” it primarily adjusts the weights in the red regions for Task 1. Since Task 2 uses different channels (or different ranges within those channels), the knowledge for Task 1 remains largely undisturbed.


Experimental Results

The researchers integrated KAC into several popular prompt-based Continual Learning methods (L2P, DualPrompt, CODAPrompt, CPrompt) and tested them on standard benchmarks like ImageNet-R and CUB200.

The rule was simple: Keep everything the same (backbone, hyperparameters), but replace the Linear Classifier with KAC.
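
In code, the plug-and-play idea looks roughly like the hypothetical snippet below, which reuses the KACHead sketch from the previous section together with a timm Vision Transformer; the actual integration into the prompt-based methods involves more plumbing.

```python
import timm

# Hypothetical drop-in replacement, reusing the KACHead sketch from above.
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.head = KACHead(feat_dim=model.head.in_features, num_classes=200)
# ...everything else (backbone, prompts, hyperparameters) stays unchanged.
```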

ImageNet-R Results

Table of results on ImageNet-R.

Table 1 shows the results on ImageNet-R.

  • Green numbers indicate improvement over the baseline.
  • Red numbers indicate a decrease.

The results are overwhelmingly positive. KAC consistently improves performance, particularly in the “Last” accuracy column (the accuracy after all tasks are learned).

Ideally, we want to see high performance in the 20-step and 40-step scenarios (Long Sequence tasks), as these are the hardest tests of memory. In the 40-step scenario, KAC improved the “Last” accuracy of DualPrompt by 4.93% and CODAPrompt by 4.97%. In the world of Continual Learning, a 5% jump without changing the backbone is significant.

CUB200 Results

CUB200 is a fine-grained dataset (classifying different species of birds). This requires distinguishing between very similar features.

Table of results on CUB200.

As shown in Table 2, the improvements here are even more dramatic.

  • In the 10-step scenario, L2P with KAC improved by 14.49%.
  • In the 40-step scenario, DualPrompt with KAC improved by a massive 27.14%.

This suggests that KAC’s ability to create precise, local decision boundaries is particularly beneficial for fine-grained tasks where classes are clustered closely together in the feature space.

Robustness on Domain Incremental Learning

The team also tested KAC on DomainNet (Table 3 below) to see if it works for Domain Incremental Learning (where the domain changes, e.g., Sketch -> Painting -> Real Photo).

Table of results on DomainNet.

While the gains are smaller (around 1-2%), KAC still consistently outperforms the linear baseline, proving its robustness across different types of incremental learning scenarios.


Does More Complexity Mean Better Performance?

A skeptic might ask: “Is KAC better just because it has more parameters?”

To test this, the authors ran an ablation study where they replaced the RBFs with a standard Multi-Layer Perceptron (MLP) that had the same number of parameters.

Ablation study table comparing KAC to MLP.

Table 4 shows the result. Adding an MLP (whether fixed or trainable) did not improve performance significantly—in fact, a fixed MLP hurt performance.

This confirms that the magic isn’t in the parameter count; it is in the structure. The Kolmogorov-Arnold architecture combined with Gaussian RBFs provides the specific geometric properties needed to separate tasks in the feature space effectively.

Conclusion

The Kolmogorov-Arnold Classifier (KAC) represents a clever application of mathematical theory to a practical engineering problem. By recognizing that “how” we classify is just as important as “what” features we extract, the authors have found a way to significantly reduce catastrophic forgetting.

Key Takeaways:

  1. Linear Classifiers are Bottlenecks: Their global update nature causes interference between tasks.
  2. B-Splines behave poorly in high dimensions: Vanilla KANs struggle with image embeddings due to the Curse of Dimensionality.
  3. RBFs save the day: By using Gaussian RBFs within the KAN framework, KAC creates local, stable decision boundaries.
  4. Plug-and-Play: KAC can replace the final layer of almost any Continual Learning model to boost performance, especially in long, complex task sequences.

As we move toward AI systems that can learn lifelong skills without needing constant retraining, architectural innovations like KAC will likely play a foundational role. It serves as a reminder that sometimes, the best way to move forward is to revisit the fundamental components of our networks—even the humble classifier.