Introduction

Imagine you are a detective reviewing CCTV footage of a busy city street. Hours of mundane traffic pass by: cars stopping at red lights, pedestrians crossing, rain falling. Suddenly, for three seconds, a car swerves erratically and clips a bus before speeding off.

If you were a traditional computer vision model, you might flag a “spike” in an anomaly score at that timestamp. But you wouldn’t necessarily know why. Was it a fight? An explosion? A traffic accident? Furthermore, to understand that this was a “hit-and-run,” you need to watch the moments leading up to the swerve and the aftermath. You need context.

This is the challenge of Video Anomaly Understanding (VAU). While detection tells us when something went wrong, understanding tells us what, why, and how.

Current approaches struggle with two main issues. First, they lack hierarchical understanding; they either look at a single frame or the whole video, missing the nuance between a short action (a punch) and a long event (a riot). Second, processing long videos is computationally expensive. Models often use “uniform sampling” (taking a frame every second), which means they might blink and miss the critical split-second where the anomaly actually starts.

Enter Holmes-VAU, a new research paper that draws inspiration from Sherlock Holmes. Just as the famous detective ignores irrelevant details to zero in on critical clues, Holmes-VAU uses an Anomaly-focused Temporal Sampler (ATS) to focus its computational power exactly where it matters. Combined with a massive new benchmark dataset called HIVAU-70k, this method pushes the boundaries of how machines comprehend complex, long-term anomalies.

Figure 1: Motivation. Left: The need for hierarchical data. Right: The Holmes-VAU concept of focusing on anomaly-rich segments.

Background: From Detection to Understanding

To appreciate Holmes-VAU, we must first look at the landscape of video anomaly detection.

The Limits of Traditional Detection

Historically, Video Anomaly Detection (VAD) has been treated as a binary classification problem or a scoring task. Models are trained to output a score (0 to 1) for every frame. If the score is high, it’s an anomaly.

  • The Problem: A high score doesn’t tell you if the anomaly is a person falling down or a bank robbery.
  • The “Black Box”: These methods lack explainability. In high-stakes environments like surveillance or autonomous driving, knowing why an alarm triggered is as important as the alarm itself.

The Rise of Multimodal VAU

Recent advancements have moved toward Multimodal Video Anomaly Understanding. By combining visual data with Large Language Models (LLMs), researchers aim to generate text descriptions of anomalies. However, existing multimodal benchmarks suffer from a “granularity gap.” They typically provide annotations only at the video level (summarizing the whole clip) or the clip level (describing a few seconds).

Real-world anomalies are hierarchical. A “riot” (Video Level) consists of specific “clashes” (Event Level), which are made up of individual “punches” or “throws” (Clip Level). Without training data that captures this hierarchy, models struggle to reason about long-term context.
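To make the hierarchy concrete, here is a toy annotation sketch for the riot example. The structure and field names are purely illustrative; they are not the schema of any real benchmark.

```python
# Purely illustrative: one video described at three granularities.
riot_annotation = {
    "video": "A street protest escalates into a riot with property damage.",
    "events": [
        {
            "summary": "A clash breaks out between two groups of protesters.",
            "clips": [
                "A man throws a punch at another person.",
                "Someone hurls an object through a shop window.",
            ],
        },
    ],
}
```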

The Foundation: The HIVAU-70k Benchmark

Before building a better model, the researchers needed better data. They introduced HIVAU-70k, a large-scale benchmark designed to teach models to “think” hierarchically.

The Data Engine

Creating a dataset with 70,000+ annotations manually would be prohibitively expensive. The authors developed a semi-automated “Data Engine” that combines human expertise with the generative power of LLMs.

Figure 2: The Data Engine workflow. It moves from clip-level captions to event and video summaries using LLMs.

The process works in three stages:

  1. Hierarchical Video Decoupling: Annotators took long videos (from existing datasets like UCF-Crime and XD-Violence) and manually sliced them. They identified the precise start and end times of anomaly events, breaking them down into shorter “clips.”

  2. Hierarchical Free-text Annotation: This is where the hierarchy is built.

  • Clip Level: A visual perception model (or human) captions the short clips (e.g., “A man runs holding a bag”).
  • Event Level: An LLM aggregates these clip captions to summarize the event (e.g., “A suspect flees the scene of a robbery”).
  • Video Level: The LLM aggregates event summaries to describe the entire video context.
  3. Instruction Construction: The text is converted into Question-Answer pairs (Instructions) to train the model. These range from simple perception (“What is in the video?”) to complex reasoning (“Why is this considered an anomaly?”), as sketched in the code below.
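Stage 3 is easy to picture in code. Below is a minimal, hypothetical sketch of how a free-text annotation might be wrapped into an instruction-response pair; the question templates and field names are illustrative, not the paper’s actual prompts.

```python
# Hypothetical sketch of stage 3: converting annotations into instruction pairs.
# The question templates are illustrative; the paper's actual prompts may differ.
def to_instruction(level: str, annotation: str, task: str = "perception") -> dict:
    questions = {
        "perception": f"Describe what happens in this {level}.",
        "judgment": f"Does this {level} contain an anomaly? Explain briefly.",
    }
    # The response is the free-text annotation produced in stage 2
    # (a clip caption, an event summary, or a video-level summary).
    return {"question": questions[task], "response": annotation}


print(to_instruction("event", "A suspect flees the scene of a robbery."))
```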

Dataset Statistics

The result is a rich dataset that forces models to learn connections between short-term visual cues and long-term semantic meaning.

Figure 3: HIVAU-70k Statistics. The histograms show the distribution of duration and word counts across clips, events, and videos.

As shown in Figure 3, the dataset covers a wide distribution of durations. Clips are short (focused on immediate action), while videos are long (requiring context). The annotations cover Captioning, Judgment, Description, and Analysis, providing a comprehensive training ground for the AI.

The Core Method: Holmes-VAU

Now, let’s look at the detective itself. Holmes-VAU is a multimodal system designed to efficiently process long videos and provide detailed explanations of anomalies.

The architecture addresses a specific bottleneck: Efficiency vs. Accuracy.

  • If you feed every frame of a 5-minute video into an LLM, you will run out of memory and compute immediately.
  • If you sample randomly or uniformly (e.g., every 50th frame), you might miss the anomaly entirely.

Holmes-VAU solves this with the Anomaly-focused Temporal Sampler (ATS).

Figure 4: The Holmes-VAU Framework. Note the Anomaly-focused Temporal Sampler (ATS) in the center, guiding the LLM.

1. Visual and Text Encoding

The system starts with standard feature extraction. It uses a pre-trained encoder (InternVL2) to convert video frames into visual tokens (\(V_i\)) and user questions into text tokens (\(X_q\)).

Equation 1

Here, \(\phi_v\) represents the visual encoder processing the video frames.
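In code, this encoding step might look roughly like the sketch below. Here `visual_encoder` and `tokenizer` are stand-ins for the pre-trained InternVL2 components, and the shapes are assumptions for illustration.

```python
import torch

def encode_inputs(frames, question, visual_encoder, tokenizer):
    """Rough sketch of the encoding step: map frames and the query into token space.

    frames: (N, 3, H, W) tensor of video frames.
    visual_encoder: stand-in for the pre-trained vision backbone (phi_v).
    tokenizer: stand-in text tokenizer for the language side.
    """
    with torch.no_grad():
        visual_tokens = visual_encoder(frames)          # V_i = phi_v(frames)
    text_tokens = tokenizer(question, return_tensors="pt").input_ids  # X_q
    return visual_tokens, text_tokens
```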

2. The Detective’s Instinct: Anomaly-focused Temporal Sampler (ATS)

This is the innovative core of the paper. The ATS acts as a filter, deciding which frames are worthy of the LLM’s attention.

Step A: The Anomaly Scorer

First, a lightweight, efficient anomaly detection network scans the video. It assigns an anomaly score (\(s_i\)) to every frame. This network is fast and cheap to run. It produces a temporal curve where peaks indicate likely abnormal behavior.

Step B: Density-Aware Sampling

Instead of picking frames at fixed intervals, the ATS treats the anomaly scores as a probability distribution. The intuition is simple: sample more frames where the anomaly score is high.

To do this, the model calculates the cumulative sum (\(S_{cumsum}\)) of the anomaly scores:

Equation 2

In this equation, \(\tau\) is a small parameter that ensures even normal regions get some attention (so the model doesn’t ignore context entirely), but the focus remains heavily on the high-scoring regions. By sampling uniformly along the y-axis of this cumulative curve, the corresponding x-axis (time) points naturally cluster around the steep parts of the curve—the anomalies.
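A minimal NumPy sketch of this density-aware, inverse-transform style sampling is shown below. It assumes the per-frame scores are already available; `tau` and the number of samples are free parameters for illustration, not the paper’s exact settings.

```python
import numpy as np

def anomaly_focused_sampling(s: np.ndarray, n_samples: int, tau: float = 0.1) -> np.ndarray:
    """Sketch of density-aware sampling: pick more frames where anomaly scores are high.

    s: per-frame anomaly scores in [0, 1], shape (T,).
    Returns sorted, de-duplicated frame indices.
    """
    # Shift scores by tau so normal regions keep a non-zero sampling probability.
    weights = s + tau
    # Normalized cumulative curve: acts like a CDF over frames.
    cdf = np.cumsum(weights) / weights.sum()
    # Sample uniformly along the y-axis of the cumulative curve ...
    targets = (np.arange(n_samples) + 0.5) / n_samples
    # ... and map each target back to a frame index (the x-axis). Steep regions,
    # i.e. high-score regions, cover more of the y-axis and so catch more targets.
    idx = np.searchsorted(cdf, targets)
    return np.unique(np.clip(idx, 0, len(s) - 1))


# Toy example: scores peak around frames 40-60, so sampled indices cluster there.
scores = np.zeros(200)
scores[40:60] = 0.9
print(anomaly_focused_sampling(scores, n_samples=16))
```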

The visualization below demonstrates this perfectly. Look at how the red vertical lines (sampled frames) cluster densely around the purple shaded areas (ground truth anomalies).

Figure: Visualization of ATS. The red lines indicate sampled frames, clustering around the actual anomalies (purple areas).

3. LLM Integration and Reasoning

Once the “clue” frames are selected by the ATS, they are projected into the language model’s feature space.

Equation 3

The Large Language Model (LLM) then takes these visual tokens (\(V_i\)) and the user’s text query (\(X_q\)) to generate a response. Because the input frames were intelligently selected, the LLM has high-resolution information about the anomaly without being overwhelmed by thousands of irrelevant background frames.

Equation 4
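The sketch below shows one plausible way to wire this up; `VisualProjector`, `generate_answer`, and the `llm.generate(inputs_embeds=...)` call are illustrative stand-ins, not the actual Holmes-VAU implementation.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Stand-in for the projection step: visual tokens -> LLM embedding space."""
    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_tokens)


def generate_answer(llm, projector, visual_tokens, query_embeds):
    """Sketch of the generation step: condition the LLM on visual and text tokens."""
    vis_embeds = projector(visual_tokens)                  # project into the LLM space
    inputs = torch.cat([vis_embeds, query_embeds], dim=0)  # simple prefix-style concatenation
    # Hypothetical call: many causal LMs accept `inputs_embeds` in generate().
    return llm.generate(inputs_embeds=inputs.unsqueeze(0))
```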

4. Training Strategy

The training is a two-step process:

  1. Train the Scorer: The lightweight anomaly scorer is trained using frame-level labels from the HIVAU-70k dataset.
  2. Instruction Tuning: The LLM is fine-tuned (using LoRA, a parameter-efficient technique) on the instruction-response pairs.
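For the instruction-tuning step, a typical parameter-efficient setup with the Hugging Face peft library might look like the sketch below; the rank, alpha, and target modules are illustrative choices, not the paper’s reported hyperparameters.

```python
from peft import LoraConfig, get_peft_model

def add_lora_adapters(base_llm):
    """Attach LoRA adapters to the (frozen) base LLM; hyperparameters are illustrative."""
    config = LoraConfig(
        r=16,                                 # adapter rank
        lora_alpha=32,                        # scaling factor
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_llm, config)
    model.print_trainable_parameters()        # only a small fraction of weights train
    return model
```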

Equation 5

This loss function (\(\mathcal{L}_{AS}\)) ensures the scorer becomes accurate at distinguishing normal frames from abnormal ones, which is critical for the sampling step to work correctly.
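The paper’s exact formulation isn’t reproduced here, but a common choice for this kind of frame-level supervision is a binary cross-entropy loss; the sketch below is written under that assumption.

```python
import torch
import torch.nn.functional as F

def anomaly_scorer_loss(pred_scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of the scorer loss, assuming a per-frame binary cross-entropy.

    pred_scores: predicted anomaly scores in (0, 1), shape (T,).
    labels:      frame-level targets (1 = abnormal, 0 = normal), shape (T,).
    """
    return F.binary_cross_entropy(pred_scores, labels.float())
```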

Experiments and Results

The researchers tested Holmes-VAU on standard datasets (UCF-Crime and XD-Violence) to see how it stacked up against the competition.

Anomaly Detection Performance

First, can it simply find the anomaly? The authors compared Holmes-VAU against state-of-the-art (SOTA) methods, including unsupervised and weakly-supervised approaches.

Table 1: Detection Performance. Holmes-VAU outperforms both non-explainable and explainable methods.

As Table 1 shows, Holmes-VAU achieves an AP (Average Precision) of 87.68% on XD-Violence and an AUC of 88.96% on UCF-Crime. It significantly outperforms previous explainable methods like LAVAD. This proves that “smart sampling” doesn’t just save time; it actually improves accuracy by reducing noise.

Reasoning Performance

Can it explain why? The team evaluated the quality of the generated text using metrics like BLEU, CIDEr, and METEOR. They compared Holmes-VAU against general-purpose video LLMs (like Video-ChatGPT and Video-LLaVA).

Table 2: Reasoning Performance. Holmes-VAU dominates across clip, event, and video levels.

The results in Table 2 are striking. Holmes-VAU scores significantly higher on all metrics. For example, on the Event-level CIDEr metric (which measures how well the description matches human consensus), Holmes-VAU scores 1.519, while the closest competitor manages only 0.022. This massive gap highlights that general video models simply haven’t learned the “language” of anomalies—they miss the subtle cues that Holmes-VAU captures.

Why Hierarchy Matters

Is the hierarchical data actually necessary? The authors performed an ablation study, training the model with different combinations of Clip (C), Event (E), and Video (V) data.

Table 3: Ablation of Hierarchical Data. Using all three levels (C+E+V) yields the best results.

Table 3 confirms that the full combination yields the best performance. Training only on clips improves short-term perception but fails at long-term reasoning. Training only on videos misses the fine details. The hierarchy acts as a scaffold for learning.

The Power of ATS

Finally, does the Sherlock-inspired sampler beat standard sampling?

Table 4: Sampler Ablation. ATS outperforms Uniform and Top-K sampling, even with fewer frames.

Table 4 compares ATS against Uniform sampling and Top-K (just taking the highest scoring frames). ATS wins in every category. Crucially, Top-K performs worse than Uniform in some cases because it focuses too much on the peak of the anomaly and misses the context leading up to it. ATS balances focus with context.

Qualitative Analysis

Numbers are great, but seeing is believing. Let’s look at the model’s output compared to a standard model (InternVL2).

Figure 5: Qualitative Comparison. Holmes-VAU (right column) correctly identifies anomalies that baseline models (center columns) miss or hallucinate.

In the second row (the street protest scene), the baseline model hallucinates details or gives a vague description. Holmes-VAU correctly identifies “violent behavior” and “property damage,” capturing the essence of the anomaly. In the last row (normal street), Holmes-VAU correctly states “There is no anomaly,” while the other models hallucinate a car crash that never happened.

Conclusion

Holmes-VAU represents a significant step forward in how AI understands video. By moving beyond simple frame-level scores and embracing a hierarchical view of time, it bridges the gap between seeing a pixel change and understanding a “crime.”

The two key takeaways are:

  1. Data Hierarchy is Key: The HIVAU-70k benchmark shows that to teach a model about complex events, training data must describe the world at multiple granularities—from the micro-actions to the macro-events.
  2. Focus is Efficiency: The Anomaly-focused Temporal Sampler proves that we don’t need to process every frame to understand a video. We just need to find the right frames.

Just as Sherlock Holmes solves cases by filtering out the noise to find the signal, Holmes-VAU demonstrates that intelligent, adaptive sampling is the future of long-term video understanding. This opens exciting doors for real-world applications in safety, surveillance, and automated video analysis, where understanding the “why” is just as critical as detecting the “what.”