Introduction: The “Cheating” Student Problem in AI
Imagine a student taking a history test. The question asks, “Why did the Industrial Revolution begin in Britain?” The student doesn’t actually know the answer, but they notice a pattern in previous tests: whenever the words “Britain” and “Revolution” appear, the answer is usually “Option C.” They pick C and get it right.
Did the student learn history? No. They learned a statistical shortcut.
This is exactly what happens with modern Vision-Language Models (VLMs) in Video Question Answering (VideoQA). Models often rely on spurious correlations—statistical biases in the training data—rather than genuinely understanding the video content. For example, if a dataset frequently features videos of a woman holding a baby, the model might learn to associate the words “woman” and “baby” with the action “holding hands,” regardless of what is actually happening in the specific video clip provided.
This leads to a critical issue: Unfaithful Grounding. The model might guess the correct answer, but it is looking at the wrong part of the video (or not looking at the video at all).

As shown in Figure 1 above, a model might correctly answer that the woman “holds his hand” (Option E), but the green bar labeled “Unfaithful Grounding” shows the model is focusing on a completely irrelevant timeframe. The graph in (b) highlights the bias: the co-occurrence of “baby” and “woman” heavily skews towards specific answers, creating a shortcut.
In this post, we will dive deep into a paper titled “Cross-modal Causal Relation Alignment for Video Question Grounding.” We will explore how the authors propose a new framework, CRA, which uses Causal Inference to force the model to stop “cheating” and start looking for the true cause of an answer within the video timeline.
Background: Video Question Grounding (VideoQG)
Before dissecting the solution, we need to define the task. Video Question Grounding (VideoQG) is harder than simple VideoQA.
- VideoQA: The model receives a video and a question, and outputs an answer.
- VideoQG: The model must output the answer AND identify the specific time interval (start and end timestamps) in the video that contains the evidence for that answer.
The challenge is that valid training data for VideoQG is scarce because annotating start/end times for every question is expensive and time-consuming. Most models are trained in a “weakly supervised” manner—they only have the QA pairs and the full video, but not the specific timestamps.
The Problem of Spurious Correlations
In a standard correlational model, the probability of an answer (\(a\)) is calculated based on the Video (\(V\)) and Language (\(L\)).
\[
P(a \mid V, L)
\]

However, hidden confounders (\(Z\)), such as dataset biases, can influence both the inputs and the output, creating a back-door path that short-circuits the reasoning process. The authors of this paper use Structural Causal Models (SCM) to break these spurious links.
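In SCM terms, the worry is a graph along these lines (a simplified sketch; the paper's full causal graph has more nodes):

\[
Z \rightarrow V, \qquad Z \rightarrow L, \qquad Z \rightarrow a, \qquad V \rightarrow a, \qquad L \rightarrow a
\]

The route \(V \leftarrow Z \rightarrow a\) is a back-door path: information about the answer leaks through the dataset bias \(Z\) without ever passing through genuine video understanding. Causal intervention, introduced below, is the tool for cutting that path.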
The Core Method: Cross-modal Causal Relation Alignment (CRA)
The authors propose the CRA framework. The goal is to align the causal relations between the video modality and the text modality.

As illustrated in Figure 2, the architecture is sophisticated. Let’s break down the mathematical objective. The model aims to find the optimal answer \(a^*\) and time interval \(t^*\):

Here, \(\Psi\) represents the VideoQA reasoning, and \(\Phi\) represents the grounding (finding the time interval). The variable \(w\) represents the temporal attention—essentially, which frames matter.
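One way to write this objective, reconstructed from the description above (the paper's exact formulation may differ in its details), is:

\[
a^{*} = \arg\max_{a} \; \Psi\big(a \mid V, L, w\big), \qquad t^{*} = \Phi\big(V, L, w\big)
\]

The temporal attention \(w\) is the shared ingredient: the same frame weights that support the answer also determine the grounded interval, which is what ties answering and grounding together.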
The CRA framework consists of three main engines working in unison:
- Gaussian Smoothing Grounding (GSG)
- Cross-Modal Alignment (CMA)
- Explicit Causal Intervention (ECI)
1. Gaussian Smoothing Grounding (GSG)
The first step is to figure out where in the video the answer lies. Standard cross-modal attention can be noisy, with attention weights spiking randomly across frames due to visual noise.
To fix this, the authors introduce a Gaussian Smoothing layer. Instead of taking raw attention scores, they apply an adaptive Gaussian filter. This forces the model to look for coherent, continuous segments of time rather than scattered frames.

As seen in Figure 3(a), the GSG module calculates the relevance between the global language feature (\(l_g\)) and the video features (\(v\)). The equation for the attention weights \(w\) is:

\(G(\cdot)\) is the adaptive Gaussian filter. This “smooths” the attention, making the predicted time interval (\(t\)) more reliable and resistant to noise.
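To make the idea concrete, here is a minimal PyTorch sketch of Gaussian-smoothed grounding. The helper `width_head` and the way the filter width is predicted are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def gaussian_smoothed_attention(video_feats, lang_global, width_head):
    """Sketch of GSG: score frames against the question, then smooth.

    video_feats: (T, d) per-frame features v
    lang_global: (d,)   global question feature l_g
    width_head:  small module (e.g. a linear layer) predicting sigma
    """
    T = video_feats.size(0)

    # 1. Raw cross-modal relevance of each frame to the question.
    scores = video_feats @ lang_global                      # (T,)

    # 2. Predict an adaptive width for the Gaussian filter.
    sigma = F.softplus(width_head(lang_global)) + 1e-3      # > 0

    # 3. Build a Gaussian kernel and convolve it along the frame axis,
    #    so isolated spikes get absorbed into a coherent segment.
    radius = T // 2
    pos = torch.arange(-radius, radius + 1, dtype=torch.float32)
    kernel = torch.exp(-pos ** 2 / (2 * sigma ** 2))
    kernel = (kernel / kernel.sum()).view(1, 1, -1)
    smoothed = F.conv1d(scores.view(1, 1, -1), kernel, padding=radius).view(-1)

    # 4. Normalized temporal attention w; the predicted interval t is read
    #    off as the contiguous region where w is high.
    return torch.softmax(smoothed, dim=0)
```

In this sketch the width is predicted from the question feature, so short, punctual events and long, drawn-out ones can both be grounded cleanly; the essential point is that the filter is adaptive rather than fixed.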
2. Cross-Modal Alignment (CMA)
Since the model doesn’t have ground-truth timestamps during training (weak supervision), how does it learn to match the video segment to the text? The authors use Bidirectional Contrastive Learning.
The idea is simple: The representation of the correct video segment (\(v^+\)) should be very similar to the representation of the correct question (\(l^+\)), and very different from random segments or unrelated questions.
The loss function used is InfoNCE, a standard contrastive loss:

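In its standard form (the paper's exact similarity function and temperature may differ), the video-to-text direction looks like this:

\[
\mathcal{L}_{v \to l} = -\log \frac{\exp\big(\mathrm{sim}(v^{+}, l^{+}) / \tau\big)}{\sum_{l \in \{l^{+}\} \cup \mathcal{N}} \exp\big(\mathrm{sim}(v^{+}, l) / \tau\big)}
\]

where \(\mathrm{sim}(\cdot,\cdot)\) is a similarity score (typically cosine), \(\tau\) is a temperature, and \(\mathcal{N}\) is a set of mismatched (negative) text features.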
The total alignment loss combines both Video-to-Text and Text-to-Video alignment:

This encourages the model to pull relevant video and text features closer together in the embedding space.
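A minimal batch-level sketch of this bidirectional objective, assuming in-batch negatives and cosine similarity (common choices, though not necessarily the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(video_segs, texts, temperature=0.07):
    """Symmetric InfoNCE over a batch of (grounded segment, question) pairs.

    video_segs: (B, d) features of the grounded video segments
    texts:      (B, d) features of the matching questions
    Matching pairs sit on the diagonal; every other entry is a negative.
    """
    v = F.normalize(video_segs, dim=-1)
    l = F.normalize(texts, dim=-1)

    logits = v @ l.t() / temperature           # (B, B) cosine similarities
    targets = torch.arange(v.size(0))          # positives on the diagonal

    loss_v2l = F.cross_entropy(logits, targets)       # video -> text
    loss_l2v = F.cross_entropy(logits.t(), targets)   # text -> video
    return loss_v2l + loss_l2v
```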
3. Explicit Causal Intervention (ECI)
This is the most theoretically dense and innovative part of the paper. The authors argue that standard attention mechanisms are merely correlational (\(P(a|V,L)\)). To get true understanding, we need to perform Causal Intervention, written with the do-operator: \(P(a | do(V), do(L))\).
This involves two types of “deconfounding”:
A. Linguistic Deconfounding (Back-door Adjustment)
Biases in language (like “baby” implying “holding”) are treated as a confounder \(Z_l\). The authors construct a Semantic Structure Graph (seen in Figure 3(b)) to identify entities (subject, verb, object) in the question.
By stratifying the data based on these semantic structures, they can apply back-door adjustment:

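In its textbook form, the back-door adjustment over the linguistic confounder reads (notation simplified relative to the paper):

\[
P\big(a \mid V, do(L)\big) = \sum_{z_l} P\big(a \mid V, L, z_l\big)\, P(z_l)
\]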
This equation essentially says: “Calculate the probability of the answer, but average it over all possible semantic contexts (\(Z_l\)) to remove the specific bias of any single context.”
B. Visual Deconfounding (Front-door Adjustment)
Visual bias is harder to define. A scene might have a dark background or a particular lighting setup that the model latches onto as a shortcut. Since we cannot easily enumerate every visual confounder, the authors use Front-door Adjustment.
They introduce a Mediator (\(M\)).
- \(V\): The full video.
- \(M\): The specific, grounded video segment (found by the GSG module).
- \(a\): The answer.
The causal path we want is \(V \to M \to a\). We want the answer to be derived from the specific segment (\(M\)), not just the general video vibe (\(V\)).
The front-door adjustment formula is:

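In its standard two-step form (written here for the visual branch alone, with simplified notation), the front-door adjustment is:

\[
P\big(a \mid do(V)\big) = \sum_{m} P\big(m \mid do(V)\big)\, P\big(a \mid do(m)\big) = \sum_{m} P(m \mid V)\, P\big(a \mid do(m)\big)
\]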
To make this computable, the authors expand it using the chain rule and probability theory:

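Applying a back-door adjustment to the second factor yields the computable expansion (again in simplified notation):

\[
P\big(a \mid do(V)\big) = \sum_{m} P(m \mid V) \sum_{v'} P\big(a \mid m, v'\big)\, P(v')
\]

where \(v'\) ranges over video representations in the dataset; conditioning on them blocks the back-door path from the mediator to the answer that runs through the full video.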
And finally, they approximate this complex sum using the Normalized Weighted Geometric Mean (NWGM):

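The NWGM trick, as used in earlier causal-attention work, moves these expectations inside the network so the double sum collapses into two feature-level averages (a sketch, not the paper's exact parameterization):

\[
P\big(a \mid do(V)\big) \approx \mathrm{Softmax}\Big[\, g\big(\hat{M}, \hat{V}\big) \,\Big], \qquad \hat{M} = \sum_{m} P(m \mid V)\, m, \qquad \hat{V} = \sum_{v'} P(v')\, v'
\]

Here \(g(\cdot)\) is the answer-prediction head, \(\hat{M}\) is the expected mediator (the attended segment feature), and \(\hat{V}\) is a global expectation over video features, typically approximated with a learned dictionary of visual prototypes.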
This effectively forces the model to consider the “mediator” (the focused video segment) as the crucial bridge between the raw video and the answer, cutting off spurious shortcuts that skip the detailed evidence.
Experiments and Results
The researchers tested CRA on two challenging datasets: NExT-GQA and STAR.
- NExT-GQA: Focuses on causal (“why”, “how”) and temporal (“before”, “after”) questions.
- STAR: A dataset requiring logical reasoning and situated understanding.
Dataset Statistics
The datasets are robust, with thousands of videos and tens of thousands of questions.

Quantitative Performance
The results show that CRA outperforms existing baselines. In Table 3, we look at the NExT-GQA test set.
- Acc@GQA: This is the “Faithful Answer” metric. It means the model got the answer right AND looked at the right part of the video (the predicted segment’s IoU with the ground-truth segment exceeds a set threshold); a sketch of this metric follows the list below.
- Acc@QA: Just getting the answer right.
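Here is a rough sketch of how a faithful-accuracy metric like Acc@GQA can be computed. The 0.5 overlap threshold and the exact overlap measure are illustrative assumptions, not the benchmark’s official definition:

```python
def acc_gqa(predictions, overlap_threshold=0.5):
    """Fraction of questions answered correctly AND grounded correctly.

    predictions: list of dicts with keys
      'pred_answer', 'gt_answer', 'pred_span', 'gt_span'
    where spans are (start, end) timestamps in seconds.
    """
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    faithful = [
        p['pred_answer'] == p['gt_answer']
        and iou(p['pred_span'], p['gt_span']) >= overlap_threshold
        for p in predictions
    ]
    return sum(faithful) / len(predictions)
```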

Notice that CRA consistently achieves the highest Acc@GQA (18.2% and 18.8%, depending on the backbone). Even when paired with a large model like FrozenBiLM, adding the CRA framework improves the “faithfulness” of the answers.
Interestingly, larger models like FrozenBiLM reach high QA accuracy but comparatively weak grounding accuracy: they often land on the right answer by leaning on dataset priors rather than visual evidence. CRA mitigates this.
Qualitative Analysis: Seeing the Improvement
Does the model actually look at the right things? Let’s look at the distribution of predicted video segments.

In Figure 4, the teal bars represent the Ground Truth. The orange bars are CRA, and the purple bars are a baseline (NG+).
- In (b), notice how CRA (orange) follows the Ground Truth (teal) distribution much better than the baseline. The baseline tends to predict extremely short clips (spikes on the left), whereas CRA captures a more realistic spread of segment durations.
Visualization of Attention
Finally, we can visualize the “attention weights”—what the model is watching.

In Figure 6, look at chart (a). The question asks about a specific reaction.
- Ground Truth: The teal bar at the bottom.
- Temp[CLIP] (w/o GS): The purple dashed line. It is chaotic and spiky.
- CRA (Temp[CLIP]): The solid orange line. Notice how it is smoother and creates a clear “hill” that aligns much better with the Ground Truth teal bar.
This visualizes exactly what the Gaussian Smoothing and Causal Intervention achieve: they clean up the noise and force the model to focus on the continuous event that matters.
Ablation Studies: Do we need all the parts?
The authors performed ablation studies to ensure every component was necessary.

- w/o GSG: Removing Gaussian Smoothing drops the IoU (Intersection over Union) significantly (10.6 \(\to\) 8.0). The model becomes bad at localizing the clip.
- w/o CMA: Removing Cross-Modal Alignment hurts both accuracy and grounding.
- w/o Causal: Removing the causal intervention (ECI/LCI) causes the biggest drop in faithful accuracy (Acc@GQA). This proves that the causal reasoning is essential for connecting the visual evidence to the answer.
Conclusion and Implications
The CRA (Cross-modal Causal Relation Alignment) framework represents a significant step forward in making AI systems more robust and interpretable. By moving away from simple correlation and embracing Causal Inference, the authors successfully created a model that doesn’t just guess the answer—it “watches” the video to find the proof.
Key Takeaways:
- Don’t Trust Shortcuts: Standard VideoQA models often cheat using dataset biases.
- Smooth the Noise: Gaussian Smoothing helps models identify coherent video events rather than noisy frames.
- Intervene Causally: Using Front-door and Back-door adjustments allows models to isolate the true visual evidence (the mediator) from confounding biases.
This approach is crucial for future applications where “faithfulness” is non-negotiable—such as in legal video analysis, medical imaging, or autonomous navigation, where getting the right answer for the wrong reason could be dangerous.