Introduction

In the evolving world of Artificial Intelligence, one of the most fascinating challenges is teaching machines to understand the world through multiple senses simultaneously—specifically, sight and language. This is the domain of Multi-Modal Representation Learning. We want models that can watch a video and understand a textual description of it, or vice versa.

Current state-of-the-art methods often rely on Contrastive Learning (like the famous CLIP model). These models work by pulling the representations of a matching video-text pair closer together in a shared embedding space while pushing non-matching pairs apart. While effective, this approach has a flaw: it assumes a rigid one-to-one mapping between a video and a sentence.
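To make that concrete, here is a minimal sketch of the symmetric InfoNCE-style objective that CLIP-like models optimize. The tensor shapes, the assumption that embeddings are L2-normalized, and the temperature value are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # video_emb, text_emb: (B, D) L2-normalized batch embeddings.
    # Entry (i, j) of the similarity matrix compares video i with text j.
    logits = video_emb @ text_emb.t() / temperature
    # The "matching" text for each video sits at the same batch index.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matching pairs together, push all other pairs apart, in both directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```

Note how the loss treats every pair as either the single correct match or a negative; this is the rigid one-to-one assumption discussed above.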

But reality is messy. A video of a judge speaking involves facial expressions, gestures, background details, and temporal dynamics. The caption “a judge is talking” is a massive compression of that information. This creates an information imbalance.

In this post, we will dive into a paper that proposes a novel solution called CALM (Class-anchor-ALigned generative Modeling). Instead of forcing a direct, rigid match between video and text, CALM introduces a “third party”—a set of Class Anchors—and uses generative modeling to align probability distributions. This approach handles uncertainty more gracefully and generalizes significantly better to new data.

The Problem: Modality Discrepancy

To understand why we need a new approach, we first need to visualize the limitation of current methods.

Videos contain subtle, dense semantic information. Textual descriptions are often brief and have limited expressive capacity. When we try to force these two uneven pieces of data into a direct match, we suffer from modality discrepancy.

Figure 1. (a) Videos contain subtle semantic information, whereas textual descriptions often have limited expressive capacity. This mismatch leads to an information imbalance and modality discrepancy between video and text, resulting in the collapse of diverse video features to a limited textual representation scope. (b) To address this issue, we propose a class-anchor-aligned generative modeling approach. Our method generates class probability distributions by aligning prompts with inputs from each modality, effectively bridging the modality gap and preserving the diverse semantics of video content.

As shown in Figure 1 (a) above, consider the text: “a judge is talking to a contestant.”

  • The text: simple and specific.
  • The video: could depict a judge smiling, a judge frowning, a wide shot, or a close-up.

Contrastive learning struggles here because it treats the relationship as binary: match or no match. It doesn’t account for the fact that the video contains more information than the text. This leads to the model ignoring the rich diversity of the video features to fit the limited scope of the text.

The Solution: CALM Framework

The researchers propose CALM. The core intuition is brilliant but simple: Instead of directly mapping video to text, let’s map both of them to a shared set of external concepts, called Class Anchors.

If we can describe the video in terms of a list of generic actions (e.g., “talking,” “sitting,” “smiling”) and describe the text using the same list, we can align the probability distributions of these descriptions.

High-Level Architecture

The CALM framework consists of three main stages:

  1. Feature Extraction: Using pre-trained encoders (CLIP).
  2. Class Anchor Extraction: Creating a set of reference prompts.
  3. Cross-Modal Probabilistic Modeling: Using a Variational Autoencoder (VAE) to align the distributions.

Let’s look at the complete architecture:

Figure 2. An overview of our framework. We employ class labels from an independent dataset, transform them into prompts, and extract their linguistic features to serve as class anchors. We then compute class probability distributions for video and text features by measuring the similarities between their features and the class anchors, effectively representing inter-modal and intra-modal relationships. For modality alignment, we employ a cross-modal probabilistic variational autoencoder that takes the inter-modal probability distribution as input and reconstructs the intra-modal probability distribution to align the modalities in a shared latent space.

As seen in Figure 2, the model takes video frames and text, encodes them, compares them against a bank of “Class Anchors,” and then uses a VAE to handle the alignment. Let’s break this down step-by-step.

Step 1: Feature Extraction

First, the model needs to convert raw pixels and words into mathematical vectors. The authors use the pre-trained CLIP model for this.

For Video: The model samples \(T\) frames from a video. Each frame \(f_t\) is passed through the CLIP image encoder.

Equation 1

These frame features are then aggregated using a temporal fusion module to create a single video-level feature representation, \(V\).

Equation 2

For Text: Similarly, the text description is tokenized and passed through the CLIP text encoder. The model uses the embedding of the [CLS] (classification) token as the sentence-level representation.

Equation 3
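As a rough sketch of Step 1, the snippet below uses the open-source CLIP package to encode sampled frames and a caption. Mean pooling over frames stands in for the paper's temporal fusion module, whose exact form is not reproduced here; variable names are illustrative.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_video(frames: torch.Tensor) -> torch.Tensor:
    # frames: (T, 3, H, W) preprocessed frames f_1 ... f_T, already on `device`.
    with torch.no_grad():
        frame_feats = model.encode_image(frames)      # (T, D) per-frame features
    # Simple mean pooling as a stand-in for the temporal fusion module.
    return frame_feats.mean(dim=0, keepdim=True)      # (1, D) video-level feature V

def encode_text(caption: str) -> torch.Tensor:
    tokens = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        return model.encode_text(tokens)              # (1, D) sentence-level feature S
```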

Step 2: Class Anchor Extraction

This is the unique twist of CALM. The researchers take a set of class labels from an independent dataset (like the Charades dataset, which contains action labels). They don’t use that dataset’s videos at all, only its labels (words).

They wrap each label in a prompt template, such as “The content of [label]”.

Equation 4

These prompts are then encoded using the text encoder to create Class Anchors (\(p_k\)). A learnable positional embedding (\(e^{pos}\)) is added to help the model differentiate between anchors.

Equation 5

Now, we have a fixed set of “Anchors” (Concept vectors) that sit in the embedding space.
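A minimal sketch of this step is shown below, assuming the same open-source CLIP encoder as before. The example labels, the prompt template quoted above, and the zero-initialized positional embedding are illustrative assumptions.

```python
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

labels = ["talking", "sitting", "smiling"]                 # K labels from an independent dataset
prompts = [f"The content of {label}" for label in labels]  # prompt template from the text

with torch.no_grad():
    anchors = model.encode_text(clip.tokenize(prompts).to(device))  # (K, D) class anchors p_k

# Learnable positional embedding e^pos helps the model differentiate between anchors.
pos_emb = nn.Parameter(torch.zeros_like(anchors, dtype=torch.float32))
anchors = anchors.float() + pos_emb
```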

Step 3: Generating Probability Distributions

Instead of comparing the video feature \(V\) directly to the text feature \(S\), the model calculates how similar the video is to every Class Anchor, and how similar the text is to every Class Anchor.

By applying a Softmax function to these similarities, the model generates a probability distribution for the video (\(V_p\)) and the text (\(S_p\)).

Equation 6

Equation 7

  • \(V_p\) represents the Inter-modal relationship (Video vs. Anchor).
  • \(S_p\) represents the Intra-modal relationship (Text vs. Anchor).

Aligning these distributions allows the model to capture the “shape” of the data’s semantics rather than just a single point in space.
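The computation itself is just a softmax over scaled similarities. The sketch below assumes cosine similarity and a temperature \(\tau\); the paper's exact scaling may differ.

```python
import torch
import torch.nn.functional as F

def anchor_distribution(features: torch.Tensor, anchors: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # features: (B, D) video or text features; anchors: (K, D) class anchors.
    features = F.normalize(features, dim=-1)
    anchors = F.normalize(anchors, dim=-1)
    sims = features @ anchors.t() / tau       # (B, K) similarity to every anchor
    return sims.softmax(dim=-1)               # probability distribution over the anchors

# V_p = anchor_distribution(V, anchors)  # inter-modal (video vs. anchors)
# S_p = anchor_distribution(S, anchors)  # intra-modal (text vs. anchors)
```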

Step 4: Cross-Modal Probabilistic VAE

Here is where the “Generative” part of the title comes in. The authors use a Cross-Modal Probabilistic Variational Autoencoder (VAE).

The goal is to model the uncertainty inherent in the video-text relationship. The VAE attempts to reconstruct the Text Distribution (\(S_p\)) given the Video Distribution (\(V_p\)).

The Encoder: The encoder takes the Video distribution \(V_p\) and maps it to a latent space distribution characterized by a mean (\(\mu\)) and variance (\(\sigma^2\)). Using the reparameterization trick, a latent variable \(z\) is sampled.

Equation 8

The Decoder: The decoder takes this latent variable \(z\) and tries to reconstruct the Text distribution \(\hat{S}_p\).

Equation 9

Why do this? By forcing the video representation to pass through a probabilistic latent space (\(z\)) to predict the text representation, the model learns a robust joint representation that can handle the ambiguity and “information imbalance” discussed earlier.
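The sketch below shows one way such a cross-modal VAE over anchor distributions could look. The layer sizes and the single-hidden-layer encoder and decoder are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalVAE(nn.Module):
    def __init__(self, num_anchors: int, latent_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_anchors, hidden), nn.ReLU())
        self.fc_mu = nn.Linear(hidden, latent_dim)       # mean of q(z | V_p)
        self.fc_logvar = nn.Linear(hidden, latent_dim)   # log-variance of q(z | V_p)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_anchors),
        )

    def forward(self, v_p: torch.Tensor):
        h = self.encoder(v_p)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        s_p_hat = self.decoder(z).softmax(dim=-1)        # reconstructed text distribution S_p_hat
        return s_p_hat, mu, logvar
```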

Step 5: The Training Objective

To train this VAE, the model maximizes the Evidence Lower Bound (ELBO).

Equation 10

Practically, this translates to minimizing a loss function composed of two parts:

  1. Reconstruction Loss (\(\mathcal{L}_{rec}\)): This ensures the predicted text distribution is close to the actual text distribution.

Equation 11

  2. KL Divergence (\(\mathcal{L}_{KL}\)): This acts as a regularizer, ensuring the learned latent distribution doesn’t stray too far from a standard normal distribution. This prevents the model from memorizing data and encourages smoothness in the latent space.

Equation 12

The final loss combines these generative losses with a standard task loss (like contrastive loss) to fine-tune the representations.

Equation 13
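Putting the pieces together, a hedged sketch of the combined objective might look like the following. The KL-based form of the reconstruction term and the weights `beta` and `lambda_task` are illustrative choices, since the paper's exact loss weighting is not reproduced here.

```python
import torch
import torch.nn.functional as F

def calm_loss(s_p_hat, s_p, mu, logvar, task_loss, beta: float = 1.0, lambda_task: float = 1.0):
    # Reconstruction loss: the predicted text distribution should match the real one.
    rec = F.kl_div(s_p_hat.clamp_min(1e-8).log(), s_p, reduction="batchmean")
    # KL divergence between q(z | V_p) = N(mu, sigma^2) and the standard normal prior N(0, I).
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    # Combine with a standard task loss (e.g., the contrastive loss sketched earlier).
    return rec + beta * kl + lambda_task * task_loss
```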

Experiments and Results

The researchers tested CALM on four major video-text benchmarks: MSR-VTT, DiDeMo, MSVD, and LSMDC. They evaluated performance on two tasks: Video Retrieval (finding the right video for a text query) and Video Captioning (writing a description for a video).

Video Retrieval Performance

The results show that CALM significantly outperforms existing state-of-the-art methods.

In-Domain Evaluation: When trained and tested on the same dataset (MSR-VTT), CALM achieved the highest Recall@1 (R@1) score. R@1 measures how often the correct video appears as the #1 result.

Table 1. Experimental results of video retrieval trained on MSR-VTT. The arrow (\(\rightarrow\)) indicates out-of-distribution evaluation.

In Table 1, notice the 50.8% R@1 for CALM, beating the previous best (DiffusionRet) at 49.0%.
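For reference, here is a minimal sketch of how R@1 can be computed from a text-to-video similarity matrix, assuming query \(i\) is paired with video \(i\) on the diagonal.

```python
import torch

def recall_at_1(sim_matrix: torch.Tensor) -> float:
    # sim_matrix[i, j]: similarity between text query i and video j.
    top1 = sim_matrix.argmax(dim=1)                        # best-ranked video per query
    correct = torch.arange(sim_matrix.size(0), device=sim_matrix.device)
    return (top1 == correct).float().mean().item() * 100   # % of queries whose rank-1 video is correct
```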

Out-of-Domain Evaluation (Generalization): This is the most impressive part. The columns with the arrow (\(\rightarrow\)) indicate testing on a dataset the model has never seen.

  • When trained on MSR-VTT and tested on DiDeMo, CALM scores 41.2%, significantly higher than the competitors.
  • This proves that using Class Anchors helps the model learn universal concepts rather than dataset-specific quirks.

We see similar success on other datasets like DiDeMo and LSMDC:

Table 2. Experimental results of video retrieval trained on DiDeMo.

Table 3. Experimental results of video retrieval trained on LSMDC.

Video Captioning Performance

CALM is not just good at finding videos; it’s good at describing them.

Table 4. Experimental results of video captioning trained on MSR-VTT. “MSVD \(\rightarrow\) MSR-VTT” indicates out-of-domain results, where the model is trained on MSVD and evaluated on MSR-VTT.

In Table 4, for the MSR-VTT dataset, CALM achieves a CIDEr score (a metric for caption quality) of 59.3, beating the runner-up (CLIP4Caption) by 2.3 points. The gap is even wider in the Out-of-Domain setting (MSVD \(\rightarrow\) MSR-VTT), showing a 5.5 point improvement in CIDEr.

Table 5. Experimental results of video captioning trained on MSVD. “MSR-VTT \(\rightarrow\) MSVD” indicates out-of-domain results.

Qualitative Analysis: Why does it work?

To understand how the model thinks, we can look at which Class Anchors are activated for specific videos.

Figure 3. Qualitative video retrieval results on the MSR-VTT dataset. Selected anchors capture distinct semantic cues, either aligning shared content (a) or highlighting complementary information to address modality imbalance (b). Inter-modal and intra-modal class probability distributions serve as supplementary semantic cues, enhancing the semantic alignment between video and text and improving retrieval performance.

In Figure 3, we see two examples:

  • (a) Compatible Alignment: The text mentions “connecting something.” The model aligns this with anchors like “Taking something from a box” and “Holding a laptop.” Both video and text map to similar anchors.
  • (b) Information Imbalance: The text says “a man is talking on stage.” The video, however, shows a performance. The video aligns with “Someone is laughing,” while the text aligns with “Holding some clothes” (perhaps metaphorically or via dataset noise). However, because CALM aligns the distributions rather than exact features, it can bridge this gap by finding the overlap in the latent space defined by the anchors.

Ablation Studies

Are the anchors really necessary? Does the VAE matter? The authors performed ablation studies to verify their design choices.

Number of Class Anchors: They tested using 0, 50, 100, and 157 anchors.

Table 6. Comparison of the number of class anchors on video retrieval performance on MSR-VTT.

As shown in Table 6, performance steadily increases as more anchors are added. Interestingly, even using anchors from a completely different dataset (COCO instead of Charades) improved performance over the baseline, suggesting that simply having any distinct semantic reference points is helpful.

Generative vs. Discriminative Loss: They also tested replacing the VAE with simple loss functions (like Mean Squared Error or KL Divergence directly on the distributions) without the generative component.

Table 7. Comparison of generative and discriminative learning approaches on video retrieval performance on MSR-VTT for indomain and DiDeMo for out-of-domain evaluation.

Table 7 confirms that the VAE approach (CALM) yields the best results. The generative process allows the model to handle the uncertainty of the mapping much better than simple geometric distance minimization.

Conclusion

The CALM paper presents a significant step forward in Multi-Modal Representation Learning. By acknowledging that videos and texts rarely contain the exact same information, the authors moved away from rigid contrastive matching.

Instead, they introduced Class Anchors—a set of universal concepts—and a Generative VAE to align the probability distributions of video and text.

Key Takeaways:

  1. Modality Gap: Direct matching fails when one modality (video) is much richer than the other (text).
  2. Anchors Work: Using external class labels as “semantic pivots” helps align disparate data types.
  3. Generative Alignment: Modeling the alignment as a probabilistic generation task (generating text distribution from video distribution) captures uncertainty and improves generalization.

This approach not only achieves state-of-the-art results on standard benchmarks but, more importantly, shows robust performance on unseen datasets. This suggests that CALM is learning true semantic understanding, rather than just memorizing dataset patterns.