Introduction
In the evolving world of Artificial Intelligence, one of the most fascinating challenges is teaching machines to understand the world through multiple senses simultaneously—specifically, sight and language. This is the domain of Multi-Modal Representation Learning. We want models that can watch a video and understand a textual description of it, or vice versa.
Current state-of-the-art methods often rely on Contrastive Learning (like the famous CLIP model). These models work by pulling the representations of a matching video and text pair closer together in a mathematical space while pushing non-matching pairs apart. While effective, this approach has a flaw: it assumes a rigid one-to-one mapping between a video and a sentence.
But reality is messy. A video of a judge speaking involves facial expressions, gestures, background details, and temporal dynamics. The caption “a judge is talking” is a massive compression of that information. This creates an information imbalance.
In this post, we will dive into a paper that proposes a novel solution called CALM (Class-anchor-ALigned generative Modeling). Instead of forcing a direct, rigid match between video and text, CALM introduces a “third party”—a set of Class Anchors—and uses generative modeling to align probability distributions. This approach handles uncertainty more gracefully and generalizes significantly better to new data.
The Problem: Modality Discrepancy
To understand why we need a new approach, we first need to visualize the limitation of current methods.
Videos contain subtle, dense semantic information. Textual descriptions are often brief and have limited expressive capacity. When we try to force these two uneven pieces of data into a direct match, we suffer from modality discrepancy.

As shown in Figure 1 (a) above, consider the text: “a judge is talking to a contestant.”
- The text is simple and specific.
- The video could show a judge smiling, a judge frowning, a wide shot, or a close-up.
Contrastive learning struggles here because it treats the relationship as binary: match or no match. It doesn’t account for the fact that the video contains more information than the text. As a result, the model discards the rich diversity of the video features to fit the limited scope of the text.
The Solution: CALM Framework
The researchers propose CALM. The core intuition is brilliant but simple: Instead of directly mapping video to text, let’s map both of them to a shared set of external concepts, called Class Anchors.
If we can describe the video in terms of a list of generic actions (e.g., “talking,” “sitting,” “smiling”) and describe the text using the same list, we can align the probability distributions of these descriptions.
High-Level Architecture
The CALM framework consists of three main stages:
- Feature Extraction: Using pre-trained encoders (CLIP).
- Class Anchor Extraction: Creating a set of reference prompts.
- Cross-Modal Probabilistic Modeling: Using a Variational Autoencoder (VAE) to align the distributions.
Let’s look at the complete architecture:

As seen in Figure 2, the model takes video frames and text, encodes them, compares them against a bank of “Class Anchors,” and then uses a VAE to handle the alignment. Let’s break this down step-by-step.
Step 1: Feature Extraction
First, the model needs to convert raw pixels and words into mathematical vectors. The authors use the pre-trained CLIP model for this.
For Video: The model samples \(T\) frames from a video. Each frame \(f_t\) is passed through the CLIP image encoder.

These frame features are then aggregated using a temporal fusion module to create a single video-level feature representation, V.

For Text:
Similarly, the text description is tokenized and passed through the CLIP text encoder. The model uses the embedding of the [CLS] (classification) token as the sentence-level representation.
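To make this step concrete, here is a minimal sketch using OpenAI’s open-source clip package. Mean pooling stands in for the paper’s temporal fusion module, and the frames are assumed to be already preprocessed into CLIP-sized tensors; both are simplifying assumptions, not the authors’ exact implementation.

```python
import torch
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_video(frames: torch.Tensor) -> torch.Tensor:
    """frames: [T, 3, 224, 224], already preprocessed CLIP-style. Returns a video-level feature V."""
    with torch.no_grad():
        frame_feats = model.encode_image(frames.to(device))  # [T, D] per-frame features
    # Stand-in for the paper's temporal fusion module: simple mean pooling over frames.
    video_feat = frame_feats.mean(dim=0)                     # [D]
    return video_feat / video_feat.norm()

def encode_text(sentence: str) -> torch.Tensor:
    """Returns the sentence-level feature S."""
    tokens = clip.tokenize([sentence]).to(device)            # [1, 77]
    with torch.no_grad():
        text_feat = model.encode_text(tokens)[0]             # [D] sentence embedding
    return text_feat / text_feat.norm()
```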

Step 2: Class Anchor Extraction
This is the unique twist of CALM. The researchers take a set of class labels from an independent dataset (like the Charades dataset, which contains action labels). They don’t use any of that dataset’s videos or images, just the labels (words).
They wrap each label in a prompt template, such as “The content of [label]”.

These prompts are then encoded using the text encoder to create Class Anchors (\(p_k\)). A learnable positional embedding (\(e^{pos}\)) is added to help the model differentiate between anchors.

Now, we have a fixed set of “Anchors” (Concept vectors) that sit in the embedding space.
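Here is a minimal sketch of how that anchor bank could be built. The label list is an illustrative subset (Charades has 157 action classes), and treating the anchors as a frozen text-feature bank plus a trainable positional embedding is my assumption about the implementation.

```python
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Action labels borrowed from an independent dataset (illustrative subset of Charades-style labels).
labels = ["talking on a phone", "holding a laptop", "sitting in a chair"]

# Wrap each label in the prompt template described above.
prompts = [f"The content of {label}" for label in labels]

with torch.no_grad():
    anchors = model.encode_text(clip.tokenize(prompts).to(device))  # [K, D]
    anchors = anchors / anchors.norm(dim=-1, keepdim=True)

# Learnable positional embedding e^pos, one vector per anchor (assumed shape).
pos_emb = nn.Parameter(torch.zeros_like(anchors))
class_anchors = anchors + pos_emb  # p_k, k = 1..K
```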
Step 3: Generating Probability Distributions
Instead of comparing Video V directly to Text S, the model calculates how similar the Video is to every single Class Anchor, and how similar the Text is to every single Class Anchor.
By applying a Softmax function to these similarities, the model generates a probability distribution for the video (\(V_p\)) and the text (\(S_p\)).


- \(V_p\) represents the Inter-modal relationship (Video vs. Anchor).
- \(S_p\) represents the Intra-modal relationship (Text vs. Anchor).
Aligning these distributions allows the model to capture the “shape” of the data’s semantics rather than just a single point in space.
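Continuing the sketches above, computing the two distributions is just a similarity score against every anchor followed by a temperature-scaled softmax. The temperature value below is an arbitrary choice, not taken from the paper.

```python
import torch

def anchor_distribution(feature: torch.Tensor, class_anchors: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """feature: [D], class_anchors: [K, D] -> probability distribution over the K anchors."""
    sims = class_anchors @ feature / tau  # [K] dot-product similarities (cosine if inputs are L2-normalized)
    return sims.softmax(dim=-1)

# V_p = anchor_distribution(video_feat, class_anchors)  # inter-modal: video vs. anchors
# S_p = anchor_distribution(text_feat, class_anchors)   # intra-modal: text vs. anchors
```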
Step 4: Cross-Modal Probabilistic VAE
Here is where the “Generative” part of the title comes in. The authors use a Cross-Modal Probabilistic Variational Autoencoder (VAE).
The goal is to model the uncertainty inherent in the video-text relationship. The VAE attempts to reconstruct the Text Distribution (\(S_p\)) given the Video Distribution (\(V_p\)).
The Encoder: The encoder takes the Video distribution \(V_p\) and maps it to a latent distribution characterized by a mean (\(\mu\)) and a standard deviation (\(\sigma\)). Using the reparameterization trick, a latent variable \(z\) is sampled.

The Decoder: The decoder takes this latent variable \(z\) and tries to reconstruct the Text distribution \(\hat{S}_p\).
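Below is a minimal PyTorch sketch of what such a cross-modal VAE could look like: an MLP encoder maps \(V_p\) to \((\mu, \sigma)\), the reparameterization trick samples \(z\), and an MLP decoder predicts \(\hat{S}_p\). The layer sizes and the single hidden layer are illustrative choices, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalVAE(nn.Module):
    """Encodes the video anchor-distribution V_p and reconstructs the text anchor-distribution S_p."""

    def __init__(self, num_anchors: int, latent_dim: int = 64, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_anchors, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_anchors),
        )

    def forward(self, V_p: torch.Tensor):
        h = self.encoder(V_p)                          # V_p: [B, K]
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)  # parameters of q(z | V_p)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)           # reparameterization trick
        S_p_hat = self.decoder(z).softmax(dim=-1)      # reconstructed text distribution
        return S_p_hat, mu, logvar
```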

Why do this? By forcing the video representation to pass through a probabilistic latent space (\(z\)) to predict the text representation, the model learns a robust joint representation that can handle the ambiguity and “information imbalance” discussed earlier.
Step 5: The Training Objective
To train this VAE, the model maximizes the Evidence Lower Bound (ELBO).
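Written for the setup above, the bound takes its textbook form (this is the standard ELBO, not a formula copied verbatim from the paper):

\[
\log p\big(S_p \mid V_p\big) \;\ge\; \mathbb{E}_{q(z \mid V_p)}\!\left[\log p\big(S_p \mid z\big)\right] \;-\; D_{\mathrm{KL}}\!\left(q(z \mid V_p)\,\|\,p(z)\right)
\]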

Practically, this translates to minimizing a loss function composed of two parts:
- Reconstruction Loss (\(\mathcal{L}_{rec}\)): This ensures the predicted text distribution is close to the actual text distribution.

- KL Divergence (\(\mathcal{L}_{KL}\)): This acts as a regularizer, ensuring the learned latent distribution doesn’t stray too far from a standard normal distribution. This prevents the model from memorizing data and encourages smoothness in the latent space.

The final loss combines these generative losses with a standard task loss (like contrastive loss) to fine-tune the representations.
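Putting the pieces together, a minimal training-loss sketch might look like the following. Using KL divergence between the predicted and actual text distributions as the reconstruction term, and the weighting coefficient \(\beta\), are my assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def calm_generative_loss(S_p_hat, S_p, mu, logvar, beta: float = 1.0):
    # Reconstruction loss: how far the predicted text distribution is from the real one.
    rec = F.kl_div(S_p_hat.clamp_min(1e-8).log(), S_p, reduction="batchmean")
    # KL regularizer: keep q(z | V_p) close to the standard normal prior N(0, I).
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return rec + beta * kl

# total_loss = task_loss (e.g. a contrastive loss) + calm_generative_loss(S_p_hat, S_p, mu, logvar)
```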

Experiments and Results
The researchers tested CALM on four major video-text benchmarks: MSR-VTT, DiDeMo, MSVD, and LSMDC. They evaluated performance on two tasks: Video Retrieval (finding the right video for a text query) and Video Captioning (writing a description for a video).
Video Retrieval Performance
The results show that CALM significantly outperforms existing state-of-the-art methods.
In-Domain Evaluation: When trained and tested on the same dataset (MSR-VTT), CALM achieved the highest Recall@1 (R@1) score. R@1 measures how often the correct video appears as the #1 result.

In Table 1, notice the 50.8% R@1 for CALM, beating the previous best (DiffusionRet) at 49.0%.
Out-of-Domain Evaluation (Generalization): This is the most impressive part. The columns with the arrow (\(\rightarrow\)) indicate testing on a dataset the model has never seen.
- When trained on MSR-VTT and tested on DiDeMo, CALM scores 41.2%, significantly higher than the competitors.
- This proves that using Class Anchors helps the model learn universal concepts rather than dataset-specific quirks.
We see similar success on other datasets like DiDeMo and LSMDC:


Video Captioning Performance
CALM is not just good at finding videos; it’s good at describing them.

In Table 4, for the MSR-VTT dataset, CALM achieves a CIDEr score (a metric for caption quality) of 59.3, beating the runner-up (CLIP4Caption) by 2.3 points. The gap is even wider in the Out-of-Domain setting (MSVD \(\rightarrow\) MSR-VTT), showing a 5.5 point improvement in CIDEr.

Qualitative Analysis: Why does it work?
To understand how the model thinks, we can look at which Class Anchors are activated for specific videos.

In Figure 3, we see two examples:
- (a) Compatible Alignment: The text mentions “connecting something.” The model aligns this with anchors like “Taking something from a box” and “Holding a laptop.” Both video and text map to similar anchors.
- (b) Information Imbalance: The text says “a man is talking on stage.” The video, however, shows a performance. The video aligns with “Someone is laughing,” while the text aligns with “Holding some clothes” (perhaps metaphorically or via dataset noise). Because CALM aligns distributions rather than exact feature points, it can still bridge this gap by finding the overlap in the latent space defined by the anchors.
Ablation Studies
Are the anchors really necessary? Does the VAE matter? The authors performed ablation studies to verify their design choices.
Number of Class Anchors: They tested using 0, 50, 100, and 157 anchors.

As shown in Table 6, performance steadily increases as more anchors are added. Interestingly, even using anchors from a completely different dataset (COCO instead of Charades) improved performance over the baseline, suggesting that simply having any distinct semantic reference points is helpful.
Generative vs. Discriminative Loss: They also tested replacing the VAE with simple loss functions (like Mean Squared Error or KL Divergence directly on the distributions) without the generative component.

Table 7 confirms that the VAE approach (CALM) yields the best results. The generative process allows the model to handle the uncertainty of the mapping much better than simple geometric distance minimization.
Conclusion
The CALM paper presents a significant step forward in Multi-Modal Representation Learning. By acknowledging that videos and texts rarely contain the exact same information, the authors moved away from rigid contrastive matching.
Instead, they introduced Class Anchors—a set of universal concepts—and a Generative VAE to align the probability distributions of video and text.
Key Takeaways:
- Modality Gap: Direct matching fails when one modality (video) is much richer than the other (text).
- Anchors Work: Using external class labels as “semantic pivots” helps align disparate data types.
- Generative Alignment: Modeling the alignment as a probabilistic generation task (generating text distribution from video distribution) captures uncertainty and improves generalization.
This approach not only achieves state-of-the-art results on standard benchmarks but, more importantly, shows robust performance on unseen datasets. This suggests that CALM is learning true semantic understanding, rather than just memorizing dataset patterns.