Imagine showing an AI a video of a person boxing. The catch? They are doing it inside a library. A typical Video Large Language Model (Video-LLM) might look at the bookshelves and quiet atmosphere and completely ignore the boxing, describing the scene as “students reading.” Or, it might see the boxing motion and hallucinate a “boxing ring” in the background, ignoring the books entirely.
This phenomenon is known as Action-Scene Hallucination. It occurs when a model relies too heavily on the context of the scene to guess the action, or uses the action to incorrectly infer the scene.
Today, we are diving into a research paper titled “MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations.” This paper proposes a novel architecture that forces the AI to process where things are (spatial) and what is happening (temporal) separately, resulting in a much more accurate understanding of complex videos.

As shown in Figure 1 above, this isn’t just a theoretical problem. When existing models see a snow-covered mountain with no one on it, they often hallucinate “skiing” or “snowboarding” just because the background looks like a ski resort. MASH-VLM aims to fix this.
The Problem: Why Do Video-LLMs Hallucinate?
To understand the solution, we first need to understand why current models fail. The researchers identify two primary culprits: Entangled Attention and Biased Positional Embeddings.
1. The Entanglement Trap
Most Video-LLMs treat video as a sequence of tokens (chunks of information). They use a mechanism called “attention” to let these tokens talk to each other. In standard approaches, spatial tokens (visual details of a single frame) and temporal tokens (changes over time) are mixed together.
Because the model is trained on vast amounts of data, it learns statistical correlations. It learns that “Snowy Mountain” usually correlates with “Skiing.” If the spatial and temporal features are intermingled, the model takes a shortcut: it sees the mountain and guesses the skiing, bypassing the actual verification of movement.
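To make the entanglement concrete, here is a minimal sketch (my own illustration, not the paper's code) of the standard setup: spatial and temporal tokens are concatenated into one flat sequence, and full self-attention lets every token influence every other token, which is exactly the door the scene-to-action shortcut walks through.

```python
import torch

# Illustrative sizes, not from the paper: S spatial tokens, T temporal tokens.
S, T, D = 4, 3, 8
spatial_tokens = torch.randn(S, D)   # appearance of the frames ("snowy mountain")
temporal_tokens = torch.randn(T, D)  # changes across frames ("the motion")

# Standard approach: one flat sequence with no separation between the two.
visual_tokens = torch.cat([spatial_tokens, temporal_tokens], dim=0)

# Full self-attention: every token attends to every other token, so scene
# features can directly shape the motion features (and vice versa).
attn = torch.softmax(visual_tokens @ visual_tokens.T / D ** 0.5, dim=-1)
print(attn.shape)  # (S + T, S + T): all pairs interact freely
```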
2. The Positional Bias (The RoPE Issue)
Large Language Models use Rotary Position Embeddings (RoPE) to understand the order of data. In a sentence, RoPE tells the model that word #1 comes before word #2.
However, when applied to video, standard RoPE creates a bias. Visual tokens are usually fed into the model in a sequence: Temporal tokens first, then Spatial tokens. Because the Text tokens (the question you ask the AI) come at the end, they are mathematically “closer” to the Spatial tokens than to the Temporal ones.

As illustrated in Figure 2(a), standard RoPE causes the text tokens to pay excessive attention to spatial tokens simply because they are closer in the sequence. This leads the model to over-prioritize the background scene while ignoring the temporal action dynamics.
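The bias is easy to see from the raw position IDs themselves. In this hedged sketch (the token counts are invented), temporal tokens receive the earliest IDs, spatial tokens follow, and the text question comes last, so the first text token sits a single step away from the last spatial token but dozens of steps away from the temporal tokens; RoPE's distance-based decay then pulls attention toward the nearby spatial tokens.

```python
# Hedged sketch of standard sequential position IDs; counts are illustrative.
num_temporal, num_spatial, num_text = 16, 64, 8

temporal_ids = range(0, num_temporal)                          # 0..15
spatial_ids = range(num_temporal, num_temporal + num_spatial)  # 16..79
text_ids = range(num_temporal + num_spatial,
                 num_temporal + num_spatial + num_text)        # 80..87

first_text = text_ids[0]
print("distance to nearest spatial token:", first_text - spatial_ids[-1])    # 1
print("distance to nearest temporal token:", first_text - temporal_ids[-1])  # 65
```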
Enter MASH-VLM: The Solution
The researchers introduce MASH-VLM, which stands for Mitigating Action-Scene Hallucination in Video-LLMs. The core philosophy is disentanglement: if the model confuses space and time, the solution is to force it to process them separately before combining them.
The architecture introduces two key innovations: Harmonic-RoPE and DST-Attention.

Innovation 1: Harmonic-RoPE
To fix the positional bias where the model ignores temporal tokens, the authors propose Harmonic-RoPE.
Standard RoPE assigns a unique, sequential ID to every token. This creates a large “distance” between the early temporal tokens and the final text tokens. Harmonic-RoPE changes the rules by expanding the dimensions of the position IDs.
It assigns Balanced Positional IDs to spatial and temporal tokens. This means that, mathematically, the model sees both the spatial features and the temporal features as being “equidistant” from the text. It levels the playing field, ensuring the model listens to the motion data just as intently as it listens to the static image data.

As shown in Figure 4, standard RoPE (left) creates a disparity. Harmonic-RoPE (right) uses a “Balanced Rotation” (\(\theta_0\)) to align the spatial and temporal tokens, while keeping a “Distinctive Rotation” (\(\theta_1\)) to maintain the necessary order.
The mathematical formulation for this harmonic assignment ensures that for specific dimensions (even numbers), the position IDs are shared, while for others (odd numbers), they remain distinct:
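The equation itself is not reproduced here, but based on that description a plausible reconstruction (my notation, which may differ from the paper's) is a per-dimension split of each token's position ID:

$$
p_i^{(d)} =
\begin{cases}
p_i^{\text{balanced}}, & d \text{ even} \quad (\text{rotation } \theta_0)\\[2pt]
p_i^{\text{sequential}}, & d \text{ odd} \quad (\text{rotation } \theta_1)
\end{cases}
$$

where \(p_i^{\text{balanced}}\) gives spatial and temporal tokens matching IDs, so both groups sit at the same rotary distance from the text tokens, while \(p_i^{\text{sequential}}\) keeps the ordinary sequential IDs so the model can still distinguish the two streams and their internal order.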

Innovation 2: DST-Attention
The second innovation addresses the “shortcut” problem. DST-Attention (Disentangled Spatial-Temporal Attention) is a custom attention mechanism that restricts who can talk to whom within the neural network.
The researchers use Masked Attention to prevent direct interaction between spatial and temporal tokens.
- Spatial Tokens are allowed to look at each other (Bi-directional attention) because understanding a scene requires looking at the whole image at once.
- Temporal Tokens use Causal attention (looking only at previous tokens) to preserve the flow of time.
- The Disentanglement: Crucially, the mask prevents spatial tokens from attending to temporal tokens and vice versa during the feature extraction phase.
By blocking these interactions, the model cannot lazily rely on the scene to guess the action. It is forced to learn a distinct representation for the scene and a distinct representation for the movement.
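As a rough sketch of what such a mask could look like (assuming a layout of temporal tokens followed by spatial tokens, leaving out the text tokens; the paper's actual implementation may differ), the allowed-attention pattern can be built like this:

```python
import torch

def dst_attention_mask(num_temporal: int, num_spatial: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed) for a sequence laid out as
    [temporal tokens | spatial tokens]. A sketch of the idea, not the
    paper's implementation."""
    n = num_temporal + num_spatial
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Temporal block: causal attention preserves the flow of time.
    t = slice(0, num_temporal)
    mask[t, t] = torch.tril(torch.ones(num_temporal, num_temporal, dtype=torch.bool))

    # Spatial block: full bi-directional attention within the scene.
    s = slice(num_temporal, n)
    mask[s, s] = True

    # Cross blocks stay False: spatial and temporal tokens never attend
    # to each other, so the scene cannot "vote" on the action directly.
    return mask

print(dst_attention_mask(3, 4).int())
```

The text tokens can still read from both groups when answering a question; the mask only stops the two visual streams from contaminating each other while their representations are being formed.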
The UNSCENE Benchmark
How do you prove your model is better at catching hallucinations if existing benchmarks don’t test for it? You build a new one.
The authors introduced the UNSCENE benchmark (UNusual context & SCENE-only). This dataset is specifically curated to trick AI models.

The dataset creation involved a clever use of GPT-4:
- Collection: They manually collected videos with unusual contexts (e.g., someone swimming in a river, but the river looks like a street flood) or scene-only videos (empty rooms).
- Trap Generation: They asked GPT-4 to generate “hallucination labels”—plausible but wrong answers. For a snowy field, a hallucination label might be “snowboarding.”
- Dual QA: They created binary (Yes/No) questions. To pass, the model must answer “No” to the hallucination and “Yes” to the ground truth (see the sketch below for what one such pair might look like).
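To make the dual-QA idea concrete, here is a hypothetical entry in the spirit of that design (the field names and question wording are mine, not the benchmark's):

```python
# Hypothetical example in the spirit of UNSCENE's dual QA design;
# the actual question templates and data format may differ.
unscene_item = {
    "video": "snowy_field_no_people.mp4",  # a scene-only video
    "ground_truth_qa": {
        "question": "Is this video showing a snow-covered field?",
        "answer": "Yes",
    },
    "hallucination_qa": {
        "question": "Is someone snowboarding in this video?",
        "answer": "No",  # the trap: plausible given the scene, but wrong
    },
}
# A model passes only if it answers both questions correctly.
```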
Here is a glimpse of what these difficult “Unusual Context” videos look like:

In Figure 9 (top left), we see someone practicing golf putting in an office. A standard model might see the desks and say “working,” or see the club and hallucinate a “golf course.” MASH-VLM needs to identify both the action and the scene correctly.
Experimental Results
So, does disentangling space and time actually work? The results suggest a resounding yes.
Performance on UNSCENE
On the newly created UNSCENE benchmark, MASH-VLM achieved state-of-the-art performance.

Looking at Table 2, the improvement is drastic. In the “Unusual context” category, MASH-VLM scores 80.25% on scene recognition, while the next best model (VideoChat2) scores only 51.04%. This shows that when the action doesn’t match the scene, MASH-VLM is much less likely to get confused.
General Video Understanding
One might worry that restricting the model (by masking attention) would hurt its general performance. However, on standard benchmarks like MVBench, MASH-VLM also dominates.

It achieves an average score of 57.6%, beating out heavyweights like GPT-4V and VideoChat2. This indicates that disentangled representations aren’t just good for preventing hallucinations; they create a more robust video understanding overall.
Qualitative Analysis: Seeing What the AI Sees
To truly understand the improvement, we can look at the “Attention Scores”—essentially a heat map of what the AI is focusing on when it generates an answer.

In Figure 8(c) above, look at the baseline model (top graphs). When asked about a person reading, it focuses heavily on the spatial tokens (brown lines) but fails to capture the temporal nuance, leading to a hallucination.
In contrast, MASH-VLM (bottom graphs) shows a much more balanced activation. It consults both spatial and temporal tokens when necessary. It doesn’t let the background dominate the decision-making process.
Conclusion
The MASH-VLM paper highlights a critical flaw in how we have been building multimodal AIs: by treating space and time as a single, messy stream of data, we encouraged models to take statistical shortcuts.
By introducing Harmonic-RoPE to balance the positional importance of tokens and DST-Attention to enforce a strict separation of duties, MASH-VLM forces the AI to be more honest. It must verify the action independently of the scene.
As we move toward AIs that act as reliable narrators for the blind, automated security monitors, or autonomous agents, mitigating hallucination is paramount. MASH-VLM offers a compelling blueprint for how structured architectural changes can lead to more trustworthy and accurate artificial intelligence.