If you spend any time on the internet, you know that video content is exploding. From YouTube tutorials to TikTok trends, the volume of video data generated daily is staggering. For Artificial Intelligence, specifically Video Question Answering (VideoQA) models, this presents a massive challenge.
We typically train these models on static datasets. Once trained, they are frozen. But the world isn’t static. If we want an AI to understand new types of content or answer new kinds of questions, we usually have to fine-tune it. Here lies the problem: when you teach a Large Language Model (LLM) new tricks, it often forgets the old ones. This phenomenon is known as Catastrophic Forgetting.
Today, we are diving deep into a research paper titled “Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting”. The researchers propose a novel framework called ColPro (Collaborative Prompting) that allows LLMs to learn continuously from video without losing their previous knowledge.
If you are a student interested in Multimodal AI, Continual Learning, or Prompt Engineering, this paper offers a masterclass in how to manage the trade-off between plasticity (learning new things) and stability (remembering old things).
The Problem: The “Static” Mindset vs. The Real World
To understand why this research matters, we first need to look at how standard VideoQA models operate. Traditionally, a model is trained on a fixed set of videos and questions. If you want to update it, you have two expensive options:
- Retrain from scratch: Include old and new data (extremely computationally expensive).
- Fine-tune on new data: Train only on the new tasks.
Option 2 sounds efficient, but it leads to the “catastrophic forgetting” mentioned earlier. Imagine studying for a biology exam so intensely that you completely forget how to do basic algebra. That is what happens to these models.

As illustrated in Figure 1 (a) above, when a standard model is fine-tuned sequentially on Task 1 through Task N, its feature space shifts. In the example, the model initially knew how to analyze a skiing video. But after being fine-tuned on new tasks, it answers incorrectly (“showing his joy” instead of “to balance himself”).
In contrast, Figure 1 (b) shows the goal of the proposed ColPro framework. By using specific prompts for specific tasks within a Continual Learning (CL) setup, the model retains the specific knowledge required to answer the skiing question correctly, even after learning new tasks.
The Solution: Collaborative Prompting (ColPro)
The researchers propose that we shouldn’t retrain the entire “brain” (parameters) of the LLM. Instead, we should use Prompt Tuning.
Prompt tuning involves freezing the massive pre-trained LLM (like LLaMA) and only training small, learnable vectors (prompts) that guide the model’s behavior. ColPro takes this a step further by introducing a “collaborative” set of prompts that handle three distinct aspects of the video QA problem:
- The Question Context: Understanding what kind of question is being asked (e.g., “Why?” vs. “Where?”).
- The Visual Content: Understanding the video frames and their changes over time.
- The Knowledge Acquisition: Combining everything to generate the answer.
The Architecture
Let’s look at how this is structured. The framework uses a pre-trained LLaMA model as its backbone.

As shown in Figure 2 (Left), the model takes a video, a question, and answer candidates as input. The core innovation happens in the ColPro Guided Pre-trained Layers.
Instead of just feeding raw text into the LLM, ColPro injects learnable prompts into the Key and Value matrices of the LLM’s self-attention mechanism.
The mathematical operation for this attention mechanism, with the prompts concatenated to the keys and values, is:

\[
\text{Attn}(\mathbf{X}) = \text{softmax}\left(\frac{\mathbf{Q}\,[\mathbf{P}_k; \mathbf{K}]^{\top}}{\sqrt{d}}\right)[\mathbf{P}_v; \mathbf{V}]
\]
Here, \(\mathbf{P}_k\) and \(\mathbf{P}_v\) are the learnable prompts attached to the keys and values. This allows the model to “attend” to these learned instructions alongside the actual data.
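To make this concrete, here is a minimal NumPy sketch of this prefix-style prompt injection. The function name, shapes, and single-head setup are illustrative simplifications, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prompted_attention(Q, K, V, P_k, P_v):
    """Self-attention with learnable prompts prepended to keys/values.

    Q, K, V: (n, d) queries/keys/values from the input tokens
    P_k, P_v: (p, d) learnable prompt vectors -- the frozen LLM weights
    are never updated; only these prompts are trained.
    """
    K_ext = np.concatenate([P_k, K], axis=0)    # (p + n, d)
    V_ext = np.concatenate([P_v, V], axis=0)    # (p + n, d)
    d = Q.shape[-1]
    scores = softmax(Q @ K_ext.T / np.sqrt(d))  # (n, p + n)
    return scores @ V_ext                       # (n, d)
```

In the full model this happens inside every guided transformer layer, so each token can attend to the learned instructions alongside the actual context.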
The Three Pillars of ColPro
The “Collaborative” in ColPro refers to three specific prompting strategies working together. Let’s break them down.
1. Task-Specific Question Constraint Prompting (TQCP)
In Continual Learning, knowing which task you are solving is half the battle. If the model knows the current task is about “counting objects,” it shouldn’t waste resources looking for “causal reasons.”
TQCP helps the model identify the question type. It uses a Negative Guiding Approach. This means the prompt is trained to be:
- Positively correlated with the current question type (e.g., “How many?”).
- Negatively correlated with other question types (e.g., “What color?”).
The loss function for this strategy combines a generation loss (generating the question) with a negative contrastive loss:

\[
\mathcal{L}_q = \mathcal{L}_{gen} + \mathcal{L}_{neg}
\]
The negative component is particularly interesting:

\[
\mathcal{L}_{neg} = -\log \frac{\exp\big(sim(\mathbf{P}_e, \mathbf{Q}^+)/\tau\big)}{\exp\big(sim(\mathbf{P}_e, \mathbf{Q}^+)/\tau\big) + \sum_{\mathbf{Q}^-}\exp\big(sim(\mathbf{P}_e, \mathbf{Q}^-)/\tau\big)}
\]
In this equation, the model tries to maximize the similarity (\(sim\)) between the prompt \(\mathbf{P}_e\) and the positive question samples (\(\mathbf{Q}^+\)), while minimizing similarity with negative samples (\(\mathbf{Q}^-\)). This effectively creates a “boundary” in the model’s understanding, preventing confusion between different tasks.
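An InfoNCE-style sketch of this negative guiding idea; the exact formulation and temperature are my assumptions, not lifted from the paper:

```python
import numpy as np

def negative_guiding_loss(p_e, q_pos, q_negs, tau=0.1):
    """Contrastive loss pulling the task prompt toward its own question
    type and pushing it away from other tasks' question types.

    p_e:    (d,) pooled task-prompt embedding
    q_pos:  (d,) embedding of a question from the current task
    q_negs: list of (d,) embeddings of questions from other tasks
    """
    def sim(a, b):  # cosine similarity
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    pos = np.exp(sim(p_e, q_pos) / tau)
    negs = sum(np.exp(sim(p_e, q) / tau) for q in q_negs)
    return -np.log(pos / (pos + negs))
```

Minimizing this loss drives the prompt's similarity to its own question type up and its similarity to every other type down, which is exactly the "boundary" described above.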
2. Visual Temporal Awareness Prompting (VTAP)
LLMs are text-native. They don’t naturally “see” video or understand time. VTAP is designed to bridge this gap. It forces the prompts to capture Visual Content and Temporal Dynamics (how things change over time).
This is achieved via two mechanisms:
- Temporal Dynamics: Using the LLM’s autoregressive nature to predict the order of video frames. If the model can predict what frame comes next, it understands the video’s flow.
- Video Distillation: A contrastive loss that aligns the prompt representation with the visual features extracted from the video encoder (like CLIP).

\[
\mathcal{L}_v = \mathcal{L}_{\text{temporal}} + \mathcal{L}_{\text{distill}}
\]
By optimizing this loss, the prompts become “visually aware,” acting as a translator between the pixel data and the language model.
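A simplified sketch of these two objectives; the function, loss names, and weighting are mine, not the paper's exact formulation:

```python
import numpy as np

def vtap_losses(prompt_feats, frame_feats, tau=0.1):
    """Sketch of VTAP's two objectives on a single video.

    prompt_feats: (T, d) prompt representations, one per frame position
    frame_feats:  (T, d) features from a frozen video encoder (e.g. CLIP)
    """
    # 1) Temporal dynamics: each prompt should match the frame at its own
    #    position, scored as a cross-entropy over all T positions.
    logits = prompt_feats @ frame_feats.T / tau           # (T, T)
    logits = logits - logits.max(axis=1, keepdims=True)   # stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    temporal_loss = -np.mean(np.diag(logp))

    # 2) Video distillation: pull the pooled prompt representation toward
    #    the pooled visual representation (cosine distance).
    p, v = prompt_feats.mean(axis=0), frame_feats.mean(axis=0)
    distill_loss = 1.0 - (p @ v) / (np.linalg.norm(p) * np.linalg.norm(v) + 1e-8)
    return temporal_loss + distill_loss
```

Here both terms are simply summed into one scalar; the paper defines and balances them more carefully.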
3. Knowledge Acquisition Prompting (KAP)
Finally, we need to answer the actual question. KAP is the integration layer. It takes the specific constraints from the question (TQCP) and the visual understanding (VTAP) and uses them to predict the correct answer among the choices.

\[
\mathcal{L}_a = -\sum_{t} \log P_\theta\big(\mathbf{a}_t \mid \mathbf{a}_{<t}, \mathbf{Q}, \mathbf{V}, \mathbf{P}\big)
\]
This standard cross-entropy loss ensures the model generates the correct answer token given the question, video, and the learned prompts.
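In code, this objective is just token-level cross-entropy over the answer positions; a minimal sketch, with shapes assumed for illustration:

```python
import numpy as np

def answer_loss(token_logits, answer_ids):
    """KAP objective: cross-entropy on the answer tokens, conditioned (via
    the frozen LLM's context) on the question, video, and learned prompts.

    token_logits: (T, vocab) logits the LLM produced at answer positions
    answer_ids:   (T,) ground-truth answer token ids
    """
    z = token_logits - token_logits.max(axis=1, keepdims=True)  # stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))     # log-softmax
    return -np.mean(logp[np.arange(len(answer_ids)), answer_ids])
```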
Experimental Setup
To prove this works, the authors tested ColPro on two challenging datasets: NExT-QA and DramaQA.
They didn’t just run standard tests; they split these datasets into distinct tasks based on question types (e.g., Causal questions, Descriptive questions, Temporal questions) and fed them to the model sequentially to simulate a changing environment.

Figure 4 gives you a sense of the difficulty.
- Task 1 asks a temporal question: “What was the boy doing before…?”
- Task 7 asks a causal/descriptive question about a cat in a sink.
In a standard setting, learning about the cat (Task 7) might make the model forget the boy (Task 1).
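A sketch of how such a task stream could be built from a flat QA dataset; the `qtype` field, sample schema, and function name are hypothetical:

```python
from collections import defaultdict

def split_into_tasks(samples, order):
    """Group QA samples by question type to simulate sequential tasks.

    samples: list of dicts, each with a 'qtype' key (assumed schema)
    order:   list of question types defining the task sequence
    """
    buckets = defaultdict(list)
    for s in samples:
        buckets[s["qtype"]].append(s)
    # The model then sees buckets[order[0]], buckets[order[1]], ... in turn,
    # with no access to earlier tasks' data.
    return [buckets[q] for q in order]
```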
Results: Does it Work?
The results are quite compelling. The researchers compared ColPro against several state-of-the-art Continual Learning methods, including L2P, DualPrompt, and ProgPrompt.
Performance on NExT-QA

Looking at Table 1, we focus on two metrics:
- Avg. Acc (\(\uparrow\)): Higher is better. It measures overall accuracy.
- Avg. Fog (\(\downarrow\)): Lower is better. It measures Forgetting (how much accuracy dropped on old tasks).
ColPro achieves an accuracy of 55.14%, beating the next best method (ProgPrompt) by over 1%. More importantly, look at the Forgetting (Avg. Fog). ColPro sits at 7.43%, while the standard LLaMA baseline is nearly double at 13.83%. This proves the prompts are successfully protecting old knowledge.
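Both metrics fall out of an accuracy matrix recorded after each training stage; the sketch below uses the standard continual-learning definitions, which I assume match the paper's:

```python
import numpy as np

def avg_acc_and_forgetting(acc):
    """Continual-learning metrics from an accuracy matrix.

    acc[i, j] = accuracy on task j measured after training on task i.
    """
    N = acc.shape[0]
    avg_acc = acc[N - 1].mean()  # final average accuracy over all tasks
    # Forgetting for task j: best accuracy it ever had minus its final one.
    fog = [acc[:N - 1, j].max() - acc[N - 1, j] for j in range(N - 1)]
    return float(avg_acc), float(np.mean(fog))
```

For example, a model that scored 80% on Task 1, then dropped to 70% on it after learning Task 2 (where it scores 90%), has an average accuracy of 80% and a forgetting of 10 points.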
Performance on DramaQA

The gap is even wider on the DramaQA dataset (Table 2). ColPro achieves 71.24% accuracy, which is significantly higher than ProgPrompt’s 67.92%. The forgetting rate is also the lowest among all comparison methods.
Visualizing the Stability
Numbers in a table are great, but seeing the learning trajectory is better.

Figure 3 plots the average accuracy as the model learns tasks 1 through 8.
- Notice the Red Line with Stars (Ours/ColPro).
- While other methods (like the blue and green lines) dip significantly as new tasks are introduced (the “sawtooth” pattern of forgetting), ColPro maintains a much higher and more stable accuracy throughout the learning process.
Why do we need all three prompts? (Ablation Study)
You might wonder if we really need TQCP, VTAP, and KAP. Couldn’t we just use one? The researchers performed an ablation study to find out.

Table 4 shows the breakdown:
- Row 1: Using only the Answer loss (\(\mathcal{L}_a\)) gives 52.60% accuracy.
- Row 2: Adding the Question constraint (\(\mathcal{L}_q\)) jumps accuracy to 53.09% and drops forgetting to 9.09%.
- Row 4: Using all three (Answer + Question + Video) yields the best results: 55.14% accuracy and 7.43% forgetting.
This confirms that the “Collaborative” aspect is essential. The model needs to explicitly learn the question type and the video dynamics to maximize memory retention.
Conclusion and Implications
The “Collaborative Prompting” (ColPro) paper tackles a realistic and pressing problem in AI: making models adaptable without making them forgetful. By avoiding full-model fine-tuning and instead using smart, multi-faceted prompting strategies, the authors demonstrate a path forward for Continual Video Question Answering.
Key Takeaways for Students:
- Efficiency: You don’t always need to retrain a massive model. Clever prompting strategies can adapt frozen models to new tasks efficiently.
- Modality Gap: Bridging the gap between text (LLMs) and video requires explicit guidance (like the VTAP strategy) to handle temporal dynamics.
- Negative Learning: Sometimes, teaching a model what not to look for (via negative contrastive loss) is as important as teaching it what to look for.
As video content continues to dominate the digital landscape, techniques like ColPro will be essential for creating AI assistants that can grow and learn alongside us, remembering the past while understanding the present.