Beyond “He Looks”: Generating Distinctive Audio Descriptions for Movies with DistinctAD

Imagine watching a movie with your eyes closed. You are relying entirely on a narrator to describe the action. Now, imagine a tense scene where a character slowly realizes they are being watched. The narrator says: “He looks.” A few seconds later: “He looks at something.” Then: “He looks again.”

Frustrating, right? You miss the nuance—the widening of the eyes, the glance at the pill bottle, the shadowy figure in the doorway. This is the reality of many automated Audio Description (AD) systems today. While they can identify a person and a general action, they often fail to capture the specific, distinctive details that drive a narrative forward.

In this deep dive, we are exploring a new framework called DistinctAD, proposed by researchers from City University of Hong Kong and Baidu Inc. This paper addresses two critical failures in current AI narrators: the domain gap between training data and movies, and the tendency of models to repeat themselves when scenes look similar. By the end of this post, you will understand how DistinctAD uses a two-stage approach involving “Contextual Expectation-Maximization Attention” to turn repetitive, boring captions into vivid, story-driven descriptions.

The Problem: Why is Movie Narration So Hard?

Audio Description (AD) is an accessibility service that translates visual information into spoken text for blind and low-vision audiences. Unlike standard video captioning (e.g., “A dog running on grass”), movie AD must fit into the natural gaps in dialogue and contribute to a coherent story.

The researchers identify two main hurdles preventing current AI from doing this well:

  1. The Domain Gap: Most Vision-Language Models (VLMs), like CLIP, are trained on web data—static images paired with short, literal captions. Movie ADs are different; they use literary language, character names, and describe complex temporal actions. A model trained on YouTube clips or Google Images struggles to speak “cinema.”
  2. Contextual Redundancy: This is the subtle killer of quality. In a movie, distinct actions often happen in the same setting with the same characters. If you take three consecutive 5-second clips from a scene, the visual features (the room, the lighting, the actor’s clothes) are 90% identical. Because the context is repetitive, standard AI models get lazy. They output the same generic description for consecutive clips (e.g., “She stands in the room”) rather than describing the unique change in that specific moment.

Comparison of previous methods vs. DistinctAD. Part (a) shows repetitive outputs like ‘Rebecca looks at someone.’ Part (b) shows DistinctAD generating nuanced captions like ‘Rebecca’s eyes widen.’

As shown in Figure 1, previous methods treat every clip in isolation or fail to filter out the background noise, resulting in repetitive “safe” guesses. DistinctAD (part b) looks at the sequence of clips to figure out what makes this specific moment unique.

Stage I: Bridging the Domain Gap with CLIP-AD

Before the model can generate good stories, it needs to understand the visual language of movies. The authors use CLIP (Contrastive Language-Image Pre-training) as their foundation. However, they discovered an interesting anomaly:

  • If you take an AD text and feed it into CLIP’s text encoder, a language model (like GPT-2) can easily reconstruct the sentence. The text encoder understands the movie language well enough.
  • However, if you take the video frames and feed them into CLIP’s vision encoder, the reconstruction is poor.

This suggests the CLIP vision encoder is the weak link—it doesn’t know how to extract the narrative-heavy features from movie frames.

To fix this, the researchers propose Stage-I: CLIP-AD Adaptation. This isn’t just simple fine-tuning; it’s a dual-pronged approach to alignment.

Illustration of Stage-I: CLIP-AD Adaptation showing global and fine-grained matching branches.

1. Global Video-AD Matching

The first step is standard contrastive learning. The model takes a batch of video clips and their corresponding AD sentences. It tries to maximize the similarity between the correct video-text pairs while minimizing the similarity with incorrect ones.

The contrastive loss for matching a video clip \(v\) to its AD text \(AD\) is:

Equation for global video-AD contrastive loss.

This ensures that, at a high level, the vision encoder learns that a specific scene corresponds to a specific sentence.
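
The paper’s exact formulation isn’t reproduced here, but a standard symmetric InfoNCE loss over a batch of \(B\) video–AD pairs captures the idea, with \(\mathbf{v}_i\) and \(\mathbf{t}_i\) the normalized video and AD embeddings and \(\tau\) a temperature:

\[
\mathcal{L}_{\text{global}} = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp(\mathbf{v}_i^\top\mathbf{t}_i/\tau)}{\sum_{j=1}^{B}\exp(\mathbf{v}_i^\top\mathbf{t}_j/\tau)} + \log\frac{\exp(\mathbf{t}_i^\top\mathbf{v}_i/\tau)}{\sum_{j=1}^{B}\exp(\mathbf{t}_i^\top\mathbf{v}_j/\tau)}\right]
\]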

2. Fine-Grained Frame-AD Matching

Movie clips pose a “Multiple-Instance” problem. An AD sentence might say, “Harry sees a glow at the bottom of the deep shaft.” The word “glow” might only correspond to the last 20 frames of the clip, while “Harry” appears in all of them. A global average over the whole clip would miss these specific correspondences.

The researchers introduce a fine-grained mechanism. They treat the video as a bag of frames and the text as a sequence of words. They use an attention mechanism to calculate which words correspond to which frames:

Equation for frame-aware AD representation using softmax attention.

Here, \(\tilde{\mathbf{T}}_i\) represents the “frame-aware” text embeddings. Essentially, for every frame, the model constructs a text representation based on the words that match that specific frame best.
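
In spirit (this is our sketch of the notation, not the paper’s exact equation), the frame-aware representation of frame \(i\) is an attention-weighted sum of the word embeddings \(\mathbf{T}_j\), with weights given by how well each word matches that frame:

\[
\tilde{\mathbf{T}}_i = \sum_{j} \alpha_{ij}\,\mathbf{T}_j, \qquad \alpha_{ij} = \frac{\exp(\mathbf{F}_i^\top\mathbf{T}_j/\tau)}{\sum_{k}\exp(\mathbf{F}_i^\top\mathbf{T}_k/\tau)}
\]

where \(\mathbf{F}_i\) is the embedding of frame \(i\) and \(\tau\) is a temperature.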

They then apply a Multiple-Instance Loss to align individual frames with these frame-aware text representations:

Equation for fine-grained frame-AD matching loss.

By combining global and fine-grained matching, the researchers produce a new vision encoder, \(\text{CLIP}_{\text{AD}}\), that is specifically tuned to “see” narrative elements in movie frames.
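
To make the two branches concrete, here is a minimal PyTorch-style sketch of a Stage-I training step. The tensor names, pooling choices, and the exact form of the multiple-instance loss are illustrative assumptions, not the authors’ code:

```python
import torch
import torch.nn.functional as F

def stage1_losses(frame_feats, word_feats, tau=0.07):
    """Sketch of the Stage-I CLIP-AD adaptation losses (illustrative only).

    frame_feats: (B, Nf, D) per-frame embeddings from the CLIP vision encoder
    word_feats:  (B, Nw, D) per-word embeddings from the CLIP text encoder
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    labels = torch.arange(frame_feats.size(0), device=frame_feats.device)

    # --- Global video-AD matching: pool to clip level, symmetric InfoNCE ---
    video = F.normalize(frame_feats.mean(dim=1), dim=-1)              # (B, D)
    text = F.normalize(word_feats.mean(dim=1), dim=-1)                # (B, D)
    logits = video @ text.t() / tau                                   # (B, B)
    loss_global = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))

    # --- Fine-grained frame-AD matching: frame-aware text per frame ---
    attn = torch.softmax(
        torch.einsum('bfd,bwd->bfw', frame_feats, word_feats) / tau, dim=-1)
    frame_text = F.normalize(
        torch.einsum('bfw,bwd->bfd', attn, word_feats), dim=-1)       # (B, Nf, D)

    # Multiple-instance-style objective: each clip's frames should match their
    # own frame-aware text better than the frame-aware text of other clips.
    sim = torch.einsum('bfd,cfd->bcf', frame_feats, frame_text).mean(-1) / tau
    loss_fine = 0.5 * (F.cross_entropy(sim, labels) +
                       F.cross_entropy(sim.t(), labels))

    return loss_global + loss_fine
```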

Stage II: Distinctive AD Narration

With a better vision encoder in hand, the framework moves to the core challenge: solving Contextual Redundancy.

If a character sits at a dinner table for five minutes, the visual features for “table,” “dinner,” and “sitting” are present in every single frame. If the model focuses on these strong features, it will just keep saying “He sits at the table.”

To generate distinctive descriptions, the model must:

  1. Identify the common “background” information across a sequence of clips.
  2. Suppress that redundancy.
  3. Highlight the unique changes.

The authors achieve this using a pipeline involving a Perceiver, a Contextual EMA module, and a Distinctive Word Loss.

Pipeline of Stage-II showing the Perceiver, Contextual EMA, and LLM integration.

The Setup

The model takes \(N\) consecutive video clips. These are processed by the adapted \(\text{CLIP}_{\text{AD}}\) encoder. A Perceiver module (a type of transformer) resamples these features into a fixed number of tokens, creating a sequence of clip vectors.

Equation showing the Perceiver processing CLIP features.
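
For readers who have not met a Perceiver before, the sketch below shows the general idea: a small set of learned latent queries cross-attends to the \(\text{CLIP}_{\text{AD}}\) frame features and compresses them into a fixed number of tokens. The depth, latent count, and dimensions here are placeholders, not the paper’s configuration:

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Illustrative single-layer Perceiver-style resampler."""

    def __init__(self, dim=768, num_latents=32, num_heads=8):
        super().__init__()
        # Learned latent queries that will "summarize" the frame features.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, frame_feats):
        # frame_feats: (B, num_frames, dim) features from the adapted vision encoder
        queries = self.latents.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        out, _ = self.cross_attn(queries, frame_feats, frame_feats)
        return out + self.ffn(out)   # (B, num_latents, dim) fixed-size clip tokens
```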

Contextual EMA (Expectation-Maximization Attention)

This is the heart of DistinctAD. The goal of Contextual EMA is to clean up the visual signal. It uses the Expectation-Maximization (EM) algorithm, which is typically used for clustering data.

Here, the “data” are the visual features from the sequence of clips, and the “clusters” (or bases) represent the common visual themes (e.g., the background scenery, the main character’s face).

Step 1: Responsibility Estimation (E-Step)

The model estimates how much each frame belongs to a certain “base” (cluster). This is like asking: “Is this feature part of the background, or part of the action?”

Equation for Responsibility Estimation in EMA.

Step 2: Likelihood Maximization (M-Step)

The model updates the bases to better represent the features assigned to them. It calculates the weighted average of the input features.

Equation for Likelihood Maximization updating the bases.
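
Written in the style of standard EM attention (a sketch of the general recipe rather than the paper’s exact notation), with flattened features \(\mathbf{h}_n\) and \(K\) bases \(\boldsymbol{\mu}_k\), one iteration alternates:

\[
\text{E-step: } z_{nk} = \frac{\exp\!\left(\lambda\,\mathbf{h}_n^\top\boldsymbol{\mu}_k\right)}{\sum_{k'}\exp\!\left(\lambda\,\mathbf{h}_n^\top\boldsymbol{\mu}_{k'}\right)}, \qquad
\text{M-step: } \boldsymbol{\mu}_k = \frac{\sum_n z_{nk}\,\mathbf{h}_n}{\sum_n z_{nk}}
\]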

By iterating these steps, the model separates the visual information into compact bases. The researchers then reconstruct the visual features using these bases:

Equation for reconstructing features using bases.

Why do this? Because the reconstructed features \(\widehat{\mathcal{H}}\) represent the “common” or “compact” version of the scene. To find what makes the current clip special, the model also performs Cross-Attention between the original features and the learned bases:

Equation for Cross-Attention between original features and bases.

The Cross-Attention mechanism allows the model to re-weight the information, effectively focusing on the specific bases that are relevant to the current moment while ignoring the redundant ones.
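
Putting the E-step, M-step, reconstruction, and cross-attention together, a minimal sketch of a Contextual-EMA-style module could look like the following. The iteration count, normalization choices, and the way the two outputs are returned are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def contextual_ema(feats, bases, num_iters=3, lam=1.0):
    """Illustrative Contextual-EMA-style module over a window of clips.

    feats: (B, N, D) Perceiver tokens for N consecutive clips
    bases: (B, K, D) initial bases (e.g., a learned parameter expanded per batch)
    """
    x = F.normalize(feats, dim=-1)
    mu = F.normalize(bases, dim=-1)

    for _ in range(num_iters):
        # E-step: responsibility of each token for each base
        z = torch.softmax(lam * x @ mu.transpose(1, 2), dim=-1)          # (B, N, K)
        # M-step: bases become the responsibility-weighted mean of the tokens
        mu = (z.transpose(1, 2) @ x) / (z.sum(1, keepdim=True).transpose(1, 2) + 1e-6)
        mu = F.normalize(mu, dim=-1)                                      # (B, K, D)

    # Reconstruction: the compact, "common" version of the features
    z = torch.softmax(lam * x @ mu.transpose(1, 2), dim=-1)
    recon = z @ mu                                                        # (B, N, D)

    # Cross-attention between the original features and the learned bases,
    # letting each clip re-weight the bases relevant to its own moment.
    attn = torch.softmax(feats @ mu.transpose(1, 2) / feats.size(-1) ** 0.5, dim=-1)
    readout = attn @ mu                                                   # (B, N, D)

    return recon, readout
```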

Visualizing the Effect: The figure below shows what this process does to the features. In (a), the raw features are scattered. In (c), after Contextual EMA, they are organized into clear, strip-like distributions pointing to specific bases. This structure makes it much easier for the language model to distinguish between different visual elements.

Visualizations of Contextual EMA showing data clustering and feature compaction.

Finally, these processed features are summed and projected into the LLM’s embedding space:

Equation for final feature projection.
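
In shorthand (our notation, with \(\phi\) a learned projection and \(\mathcal{H}_{\text{attn}}\) the cross-attention output), this amounts to something like:

\[
\mathbf{P} = \phi\!\left(\widehat{\mathcal{H}} + \mathcal{H}_{\text{attn}}\right)
\]

The resulting prefix tokens \(\mathbf{P}\) are what the LLM conditions on while generating the AD.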

Explicit Distinctive Word Prediction

Even with perfect visual features, large language models (LLMs) have their own bad habits. They love to repeat words. If the model generated “The man walks” in the previous sentence, the probability of it generating “man” and “walks” again remains high.

To force the LLM to use new vocabulary, DistinctAD introduces a specific loss function. The researchers create a set of distinctive words (\(w_d\)) by looking at the ground-truth ADs for the neighboring clips and filtering out duplicates (like names and stopwords). The remaining words are the “unique” terms for the current clip.

The model is trained with a standard auto-regressive loss (predicting the next word):

Equation for standard auto-regressive loss.

Plus a distinctive loss that specifically boosts the probability of those unique words:

Equation for distinctive word prediction loss.
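
Concretely (our sketch of the notation), if \(w_d\) is the distinctive-word set for the current clip and \(\mathbf{P}\) the projected visual tokens, the combined objective looks roughly like:

\[
\mathcal{L} = \underbrace{-\sum_{t}\log p_\theta\!\left(w_t \mid w_{<t}, \mathbf{P}\right)}_{\mathcal{L}_{\text{AR}}}
\; + \;
\lambda \underbrace{\left(-\sum_{w_t \in w_d}\log p_\theta\!\left(w_t \mid w_{<t}, \mathbf{P}\right)\right)}_{\mathcal{L}_{\text{dist}}}
\]

where \(\lambda\) weights the distinctive term.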

This forces the LLM to “hunt” for the specific nouns and verbs that define the current action, rather than falling back on generic descriptions.
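
A straightforward way to implement this kind of objective (a sketch under our own assumptions about tokenization and weighting, not the authors’ exact formulation) is to compute the per-token loss once and add an extra term restricted to the positions of distinctive words:

```python
import torch
import torch.nn.functional as F

def ad_losses(logits, target_ids, distinctive_mask, dist_weight=0.5):
    """Illustrative combination of the auto-regressive and distinctive-word losses.

    logits:           (B, T, V) next-token logits from the LLM
    target_ids:       (B, T)    ground-truth AD token ids
    distinctive_mask: (B, T)    1.0 where the target token belongs to a distinctive
                                word for this clip (neighbour duplicates, names,
                                and stopwords filtered out), else 0.0
    """
    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                target_ids.reshape(-1),
                                reduction='none').view_as(target_ids)

    loss_ar = per_token.mean()
    # Extra pressure on the unique nouns and verbs of the current clip.
    loss_dist = (per_token * distinctive_mask).sum() / (distinctive_mask.sum() + 1e-6)

    return loss_ar + dist_weight * loss_dist
```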

Experiments and Results

The researchers evaluated DistinctAD on three major benchmarks: MAD-Eval, CMD-AD, and TV-AD. They compared it against “Training-free” methods (like GPT-4V prompted with images) and other “Partial-fine-tuning” methods (like AutoAD).

Quantitative Success

The results on MAD-Eval show a clear advantage for DistinctAD.

  • CIDEr: A measure of how similar the generated text is to human reference text. DistinctAD achieves 27.3, significantly higher than AutoAD-II (19.5) and GPT-4V-based methods (around 9.8–13.5).
  • Recall@k/N: This is a crucial metric for this specific paper. It measures how well the generated AD matches the specific ground truth among \(N\) neighbors. A high score here means the description is distinct enough to be distinguished from descriptions of adjacent clips. DistinctAD reaches 56.0, setting a new state-of-the-art.

Similar success is seen on the CMD-AD benchmark:

Table showing comparisons on CMD-AD. DistinctAD achieves competitive scores in CIDEr and Recall.

Ablation Studies: Does it all matter?

The authors performed rigorous testing to ensure every part of the complex pipeline was necessary.

  • Stage-I Impact: Using the adapted \(\text{CLIP}_{\text{AD}}\) instead of standard CLIP consistently improved scores across all configurations.
  • Stage-II Impact: As shown in Table 5, adding the reconstructed features (\(\widehat{\mathcal{H}}\)) and the distinctive loss (\(\mathcal{L}_{dist}\)) step-by-step yielded incremental improvements. The combination of all elements (row C3) gave the best result.

Table 5 showing ablation studies for components in Stage-II. Row C3 (all components) performs best.

Interestingly, they also tested whether the \(N\) clips need to be consecutive. Table 6 shows that using consecutive clips is vital for CIDEr (accuracy), though non-consecutive clips helped slightly with distinctiveness (Recall), likely because their visual differences are more obvious.

Table 6 showing the impact of sampling consecutive vs. non-consecutive clips.

Qualitative Analysis: The Eye Test

Numbers are great, but AD is about the user experience. Let’s look at how the descriptions actually read.

In Figure 5 (below), we see a comparison between Ground Truth (GT), AutoAD-Zero, and DistinctAD.

  • Scene: A dark room, a man on a bed.
  • GT: “The man on the bed sits in silhouette…”
  • AutoAD-Zero: Often misses the subtlety or hallucinates interactions.
  • DistinctAD: Captures specific actions like “Stephen removes his wallet,” whereas competitors might just say “Stephen stands.”

Qualitative comparison showing ground truth vs. DistinctAD. DistinctAD captures specific actions like ‘Stephen removes his wallet.’

Another example from the supplementary material highlights the difference in “character awareness” and specific action.

Comparison of qualitative results.

In the image above, notice how the Ground Truth emphasizes the lighting (“silhouette,” “lamp light”). DistinctAD’s architecture is designed to pick up on these unique environmental features rather than just detecting “man” and “bed.”

Conclusion

DistinctAD represents a significant step forward in making media accessible. By acknowledging that movies are not just a collection of random images, but a continuous, often repetitive flow of visual data, the researchers designed a system that mimics how humans perceive stories.

We filter out the background. We notice the twitch of a hand, the new object on the table, the change in lighting. By using Contextual EMA to mathematically model this filtering process, and CLIP-AD Adaptation to learn cinematic language, DistinctAD moves us away from the robotic “He looks” and toward narrative descriptions that actually tell a story.

While the CIDEr scores (around 27.3) show there is still a long way to go to match human proficiency, the high distinctiveness scores prove that AI can be taught to pay attention to the details that matter. For the visually impaired community, that detail makes all the difference.