Introduction
Imagine teaching a child to recognize a dog. Once they learn that, you teach them to recognize a cat. Ideally, learning about the cat shouldn’t make them forget what a dog looks like. This is the essence of Continual Learning (CL). Humans are naturally good at this; artificial neural networks, however, are not. When deep learning models are trained on new data classes sequentially, they tend to suffer from “Catastrophic Forgetting”—they optimize for the new task and overwrite the weights necessary for the old ones.
The problem becomes considerably harder when dealing with video. Unlike static images, videos contain complex temporal dynamics (movement, sequence, duration). Video Class-Incremental Learning (VCIL) aims to solve this: how can we train a model to recognize “playing basketball” today, “swimming” next week, and “playing guitar” next month, without it forgetting the previous activities?
Traditional solutions often involve storing old data (which raises privacy and memory concerns) or expanding the model architecture indefinitely (which is computationally expensive). A more recent, elegant solution is Prompt Learning, where we keep a large pre-trained model frozen and learn small, lightweight instructions (prompts) to guide it.
However, existing prompt methods have a flaw: they are usually deterministic. They try to learn a single, fixed set of prompts for a task. But videos are diverse. A video of “swimming” can look vastly different depending on the angle, the pool, or the lighting. A single prompt often isn’t expressive enough to capture this distribution.
In this post, we are diving deep into a new paper: “Learning Conditional Space-Time Prompt Distributions for Video Class-Incremental Learning”. The researchers propose a method called CoSTEP. Instead of learning a fixed prompt, they teach a Diffusion Model to generate a distribution of prompts conditioned on the specific input video. This allows the model to generate a unique, tailored prompt for every single video instance, leading to state-of-the-art results.
Background: The Evolution of Prompting
To understand why CoSTEP is revolutionary, we first need to understand where it fits in the lineage of AI prompting.
The Challenge of Catastrophic Forgetting
In a standard Class-Incremental Learning (CIL) setup, a model receives tasks sequentially. Task 1 might be “Animals,” Task 2 “Vehicles,” and so on. If you fine-tune the whole model on Task 2, the weights shift to minimize error for vehicles, destroying the patterns learned for animals.
From Rehearsal Buffers to Prompt Pools
Early solutions used a “rehearsal buffer,” saving a few examples of animals to mix in while training on vehicles. While effective, this is “cheating” in scenarios where privacy is strict (e.g., hospital data) or memory is limited.
This led to Prompt-based CIL. Here, we use a massive Vision Transformer (ViT)—like CLIP—that already knows how to “see.” We freeze this giant brain. To teach it a new task, we don’t rewire it; we just change the input “question,” or prompt.
As shown in the image below, early methods like L2P and DualPrompt (Figure 1a) used a Prompt Pool. The model sees an image, extracts a query feature, and looks up the best-matching prompts from a bank of learnable vectors via their keys. Later methods like CODA-Prompt (Figure 1b) improved this by using a deterministic Prompt Generator to create prompts on the fly.
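To make the pool idea concrete, here is a minimal PyTorch sketch of L2P-style query-key matching. The pool size, prompt length, and top-k values are illustrative placeholders, not the settings used in any of these papers:

```python
import torch
import torch.nn.functional as F

# A hypothetical prompt pool: M learnable prompts, each paired with a learnable key.
M, L_p, D = 10, 5, 768                                   # pool size, prompt length, embed dim
prompt_keys = torch.randn(M, D, requires_grad=True)
prompt_pool = torch.randn(M, L_p, D, requires_grad=True)

def select_prompts(query, top_k=3):
    """Pick the top-k prompts whose keys best match the frozen encoder's query feature."""
    sim = F.cosine_similarity(query.unsqueeze(0), prompt_keys, dim=-1)   # (M,)
    idx = sim.topk(top_k).indices
    return prompt_pool[idx].reshape(-1, D)               # flattened, ready to prepend to patch tokens

query = torch.randn(D)                                   # e.g. the frozen ViT's [CLS] feature
prompts = select_prompts(query)                          # (top_k * L_p, D)
```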

The Problem with Deterministic Approaches: Look at (a) and (b) in the figure above. They are deterministic. If the input features are similar, the prompt is similar or identical. This works okay for static images, but videos are complex. Two videos of the same class can be wildly different. A fixed-size pool or a deterministic generator often fails to capture the rich, diverse distribution of video data. They struggle to generalize to future tasks because they overfit to the specific examples they see during training.
The CoSTEP Solution: Look at (c) in Figure 1. This is the method we are analyzing today. Instead of picking a prompt or calculating one directly, CoSTEP models the probability distribution \(p(\mathbf{P}|\hat{\mathbf{E}})\)—the probability of a prompt \(\mathbf{P}\) given the video features \(\hat{\mathbf{E}}\). By sampling from this distribution, the model can generate diverse, informative prompts that better cover the “space” of possible video representations.
Core Method: CoSTEP
The CoSTEP framework is built on two major pillars:
- Space-Time Prompt Learning: A clever way to encode video data into prompts without heavy temporal modules.
- Prompt Diffusion: Using a diffusion model (similar to the tech behind DALL-E or Stable Diffusion) to generate these prompts.
Let’s break down the architecture.

As Figure 2 illustrates, the training is split into two logical stages (though they work in tandem). First, we need to know what a good prompt looks like (Stage 1). Then, we train a diffusion model to generate those good prompts from pure noise (Stage 2).
Pillar 1: Space-Time Prompt Learning (The “Target”)
Standard Vision Transformers (ViTs) are pre-trained on images, not videos. They handle spatial features (what is in the frame) well but struggle with temporal features (how things move over time).
Previous methods tried to fix this by adding complex temporal attention modules or simply averaging frame features (which destroys temporal order). CoSTEP takes a simpler, more elegant approach: the Frame Grid.
The Frame Grid Strategy
Instead of processing frames sequentially, CoSTEP samples two types of context:
- Spatial Context (\(N_{sp}\)): High-resolution frames to understand appearance.
- Temporal Context (\(N_{tp}\)): A larger number of frames to understand motion.
Crucially, the temporal frames are down-sampled and stacked into a grid that matches the size of a single image. This allows the pre-trained Image ViT to process a sequence of time as if it were a single “spatial” image. This “tricks” the pre-trained image model into performing temporal reasoning using its existing spatial attention mechanisms. This involves no new parameters for the backbone, yet effectively captures the relationship between local patches across different frames.
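Here is a rough sketch of the grid construction, assuming 16 temporal frames tiled into a 4×4 grid; the actual \(N_{sp}\)/\(N_{tp}\) counts and layout come from the paper’s configuration, and the function name is just for illustration:

```python
import torch
import torch.nn.functional as F

def make_frame_grid(video, n_tp=16, grid=(4, 4), out_size=224):
    """Down-sample n_tp frames and tile them into one image-sized grid.

    video: (T, 3, H, W) tensor of decoded frames.
    Returns a (3, out_size, out_size) tensor a frozen image ViT can ingest directly.
    """
    T = video.shape[0]
    idx = torch.linspace(0, T - 1, n_tp).long()           # uniform temporal sampling
    cell = out_size // grid[0]                            # each frame becomes one grid cell
    small = F.interpolate(video[idx], size=(cell, cell),
                          mode="bilinear", align_corners=False)   # (n_tp, 3, cell, cell)
    rows = [torch.cat(list(small[r * grid[1]:(r + 1) * grid[1]]), dim=-1)
            for r in range(grid[0])]                      # concatenate each row along width
    return torch.cat(rows, dim=-2)                        # stack rows along height

grid_img = make_frame_grid(torch.rand(64, 3, 224, 224))   # -> (3, 224, 224)
```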

As shown in Figure 3, the video is processed into embeddings \(\hat{\mathbf{E}}\). These embeddings serve as the condition for the prompt generation.
Pillar 2: Prompt Diffusion (The “Generator”)
This is the most innovative part of the paper. The goal is to learn a distribution of prompts. To do this, the authors employ a Denoising Diffusion Probabilistic Model (DDPM).
The Intuition
Think of a “perfect prompt” as a clear signal. A diffusion model works by taking that clear signal and slowly adding static (noise) until it’s unrecognizable. Then, a neural network is trained to do the reverse: look at the static and predict what the clear signal used to be.
If we train a network to remove noise, we can eventually give it pure noise, and it will “hallucinate” a valid prompt. In CoSTEP, this hallucination is guided (conditioned) by the video features \(\hat{\mathbf{E}}\).
The Mathematical Framework
1. The Forward Process (Adding Noise): We start with a “clean” prompt \(\mathbf{P}_0\). We gradually add Gaussian noise over \(T\) steps. At any step \(t\), the noisy prompt \(\mathbf{P}_t\) can be described as a mixture of the original prompt and noise \(\epsilon\).

We can jump directly to any noisy stage \(\mathbf{P}_t\) using a variance schedule \(\bar{\alpha}_t\):
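In the standard DDPM notation that this description follows, the closed-form jump is:

\[ \mathbf{P}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{P}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \]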

2. The Reverse Process (Training the Denoiser): We want to train a network (a U-Net) to predict the noise \(\epsilon\) that was added. If we can predict the noise, we can subtract it to get back to the clean prompt. The loss function \(\ell_{diffusion}\) measures how close the predicted noise is to the actual noise that was added.
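Written in the usual noise-prediction form (the paper’s notation may differ slightly), this loss is:

\[ \ell_{diffusion} = \mathbb{E}_{\mathbf{P}_0,\,\epsilon,\,t}\Big[\big\lVert \epsilon - \epsilon_{\theta}(\mathbf{P}_t, t, \hat{\mathbf{E}}) \big\rVert^2\Big]. \]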

Here, \(\epsilon_{\theta}\) is the neural network trying to predict the noise, conditioned on the current noisy prompt \(\mathbf{P}_t\), the time step \(t\), and crucially, the video features \(\hat{\mathbf{E}}\).
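To make this concrete, here is a simplified PyTorch sketch of one training step of this noise-prediction objective. The `denoiser` module, tensor shapes, and noise schedule are placeholders rather than the authors’ implementation:

```python
import torch
import torch.nn.functional as F

def diffusion_step(denoiser, P0, E_hat, alpha_bar):
    """One noise-prediction training step (standard DDPM objective).

    denoiser(P_t, t, cond) -> predicted noise. P0: clean prompts, shape (B, L, D).
    E_hat: video features used as the condition. alpha_bar: (T,) cumulative schedule.
    """
    B = P0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,))        # a random timestep per sample
    eps = torch.randn_like(P0)                            # the noise we will try to predict
    a = alpha_bar[t].view(B, 1, 1)                        # broadcast over (L, D)
    P_t = a.sqrt() * P0 + (1 - a).sqrt() * eps            # closed-form forward process
    eps_pred = denoiser(P_t, t, E_hat)                    # conditioned on the video features
    return F.mse_loss(eps_pred, eps)                      # l_diffusion
```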
3. Inference (Generating Prompts): Once trained, how do we classify a new video?
- Extract video features \(\hat{\mathbf{E}}\) using the frozen ViT.
- Generate random Gaussian noise \(\mathbf{P}_T\).
- Use the trained diffusion network to iteratively denoise it, step by step, using the reverse formula:
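Assuming the standard DDPM sampler (the paper may use a variant), one denoising step looks like this, where \(\sigma_t\) is the noise scale of step \(t\) and \(\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\):

\[ \mathbf{P}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{P}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_{\theta}(\mathbf{P}_t, t, \hat{\mathbf{E}})\right) + \sigma_t \mathbf{z}. \]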

This results in a clean prompt \(\mathbf{P}_0\) specifically dreamed up for this video.
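Putting the loop together, a minimal sampling sketch (plain DDPM, illustrative names only) looks like this:

```python
import torch

@torch.no_grad()
def sample_prompt(denoiser, E_hat, alpha, alpha_bar, shape):
    """Generate a video-conditioned prompt by iterative denoising."""
    P = torch.randn(shape)                                    # P_T: pure Gaussian noise
    for t in reversed(range(alpha.shape[0])):
        t_batch = torch.full((shape[0],), t)                  # same timestep for the whole batch
        eps_pred = denoiser(P, t_batch, E_hat)                # predict the noise at step t
        coef = (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt()
        P = (P - coef * eps_pred) / alpha[t].sqrt()           # remove the predicted noise
        if t > 0:
            P = P + (1 - alpha[t]).sqrt() * torch.randn_like(P)   # sigma_t * z, with sigma_t^2 = beta_t
    return P                                                  # P_0: the prompt for this video
```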
4. The Final Objective: The model is trained with a combined objective. It must minimize the diffusion loss (learning to generate prompts) and the classification loss (ensuring those prompts actually help classify the video correctly).
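The paper’s exact formulation isn’t reproduced here, but the combined objective has this general shape, where \(\lambda\) is an assumed trade-off weight and the sum runs over the videos \(i\) of the current task \(k\):

\[ \ell \;=\; \ell_{diffusion} \;-\; \lambda \sum_{i} \log p\big(y_i^k \mid \hat{\mathbf{E}}_i^k, \mathbf{P}_i^k\big). \]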

This equation maximizes the probability of the correct label \(y_i^k\) given the video and the generated prompt.
Experiments & Results
The theory sounds great, but does it work? The authors tested CoSTEP on standard video action recognition benchmarks: UCF101, HMDB51, and Something-Something V2 (SSv2).
Quantitative Performance
The results are summarized in Table 1 below. Two things to keep in mind:
- Acc (Average Accuracy): Higher is better. It measures how well the model identifies classes across all tasks learned so far.
- State-of-the-Art Comparison: CoSTEP is compared against methods like L2P, DualPrompt, and the previous best video method, ST-Prompt.

Key Takeaways from Table 1:
- Dominance: CoSTEP achieves the highest accuracy across all settings.
- Gap: On HMDB51 (5 tasks), CoSTEP beats ST-Prompt by over 1.5%. On SSv2 (a dataset heavily reliant on temporal motion), the improvement is consistent, proving the effectiveness of the Space-Time grid.
Table 2 shows results on a different split (vCLIMB), focusing on BWF (Backward Forgetting). Lower BWF means the model remembers old tasks better.
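For reference, backward forgetting is commonly computed as the average drop in a task’s accuracy between when it was first learned and after the final task (the benchmark’s exact definition may differ in detail):

\[ \text{BWF} = \frac{1}{K-1}\sum_{k=1}^{K-1}\big(A_{k,k} - A_{K,k}\big), \]

where \(A_{j,k}\) is the accuracy on task \(k\) after training up to task \(j\) and \(K\) is the total number of tasks.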

Key Takeaways from Table 2:
- CoSTEP has incredibly low forgetting (around 2%), significantly lower than PIVOT and other competitors. The diffusion process acts as a robust regularizer, preventing the model from overwriting past knowledge.
Why Does It Work Better? (Ablation Studies)
The authors performed ablation studies to prove that it’s the Diffusion and the Frame Grid doing the heavy lifting.
Is Diffusion necessary? Table 3 compares CoSTEP against deterministic methods (L2P, DAP) and other generative methods (GANs, VAEs).

CoSTEP outperforms not only the deterministic methods but also other probabilistic approaches like VAEs and GANs. This suggests that the diffusion process captures the complex, high-dimensional distribution of video prompts better than these alternatives.
Is the Frame Grid necessary? The authors tested removing the spatial or temporal components.

As shown in Table 6, simple “Mean-pooling” (averaging frames) or “Max-pooling” yields significantly lower results than the CoSTEP “Frame grid.” This validates the hypothesis that stacking frames allows the Image ViT to infer temporal relationships between patches effectively.
Visualizing the Learned Prompts
Numbers are good, but seeing the latent space is better. The authors used t-SNE (a dimensionality reduction technique) to visualize the prompts generated by CoSTEP versus the deterministic CODA-Prompt.
Cross-Task Separation: In Figure 5, each color represents a different task.

- Left (CODA-Prompt): The prompts are somewhat mixed.
- Right (CoSTEP): The tasks are clearly separated in the prompt space. The red arrows show the diffusion trajectory—starting from random noise (red stars) and converging into clean, task-specific clusters.
Within-Task Diversity: In Figure 6, we look inside a single task. Colors represent different classes (e.g., “biking” vs. “diving”).

- Left (CODA-Prompt): All prompts for a task are collapsed into a few tight points. There is very little diversity. The prompt for “biking” is almost identical to “diving.”
- Right (CoSTEP): We see distinct, rich clusters for different classes. The diffusion model generates diverse prompts that respect the nuances of each specific class and video instance.
Accuracy Over Time
Finally, let’s look at how accuracy holds up as new tasks are added.

In Figure 4, look at the blue line (CoSTEP). It starts higher and stays higher than the competitors (Orange and Green). This indicates superior “plasticity” (learning new things) and “stability” (remembering old things).
Conclusion & Implications
CoSTEP represents a significant step forward in Video Class-Incremental Learning. By shifting from deterministic prompt selection to probabilistic prompt generation, the authors have created a system that is:
- More Adaptive: It generates unique prompts tailored to the specific spatio-temporal dynamics of each input video.
- More Robust: It drastically reduces catastrophic forgetting by modeling the underlying distribution of tasks rather than memorizing examples.
- Efficient: It achieves this without storing raw video data or adding massive temporal modules to the backbone.
The integration of Diffusion Models into prompt learning is a trend we are likely to see more of. It bridges the gap between generative AI (creating data) and discriminative AI (classifying data). For students and researchers in Computer Vision, CoSTEP highlights the power of conditional generative models not just for creating art, but for guiding the attention of large foundational models in complex, dynamic environments.