Can AI Hallucinate Depth and Pose to Catch Crimes? Inside PI-VAD
Imagine you are watching a security camera feed in a busy store. You see a customer pick up an item, look at it, and put it in their bag. Is this a normal shopping event, or is it shoplifting?
To a human observer, the context matters. Did they look around nervously? Did they scan the item with a personal scanner? To a standard computer vision model relying solely on pixel data (RGB), the visual difference between “buying” and “stealing” is frustratingly subtle. Both involve reaching, grabbing, and bagging.
This is the core problem in Weakly-Supervised Video Anomaly Detection (WSVAD). Traditional models often struggle with complex, human-centric anomalies because they rely on a single modality: the visual RGB frames. They lack the nuanced “senses” required to distinguish a fight from a dance, or an accident from a traffic jam.
In a recent CVPR paper, researchers introduced a groundbreaking framework called PI-VAD (or \(\pi\)-VAD). Their approach is fascinating: they train a model to “hallucinate” five different sensory modalities—such as depth, pose, and optical flow—using only standard video frames. By inducing these poly-modal capabilities during training, the model achieves state-of-the-art anomaly detection without requiring expensive computations during deployment.
In this deep dive, we will explore how PI-VAD teaches a simple video model to see the world with the complexity of a multi-sensor array.
The Limitation of Single-Mode Vision
Before dissecting the solution, we must understand the bottleneck. Most existing video anomaly detection systems are uni-modal. They take a video stream (RGB frames) and try to classify segments as normal or abnormal.
This works reasonably well for obvious events, like a massive explosion or a car crashing at high speed. However, real-world surveillance is rarely that dramatic. It involves subtle human behaviors.

As illustrated in Figure 1 (a) above, different anomalies reveal themselves through different cues:
- Abuse and Arrests: These are motion-heavy. Optical flow (tracking movement patterns) shines here.
- Subtle Movements: Pose estimation (tracking body joints) and Depth (3D distance) can detect aggression or unusual body language that standard motion tracking misses.
- Context: Panoptic masks (segmenting objects) and Text (semantic descriptions) help the model understand the scene context—is this a store? A street?
The problem is that running five or six deep learning models simultaneously (one for pose, one for depth, one for segmentation, etc.) is computationally expensive. It destroys the real-time capability required for CCTV systems.
This brings us to the researchers’ central question: Can we train a model to benefit from all these modalities but run as if it only uses RGB?
Enter PI-VAD: The Poly-modal Induced Transformer
The researchers proposed PI-VAD, a framework that uses a “Teacher-Student” architecture. The core innovation is a component called the Poly-modal Inductor (PI).
Here is the high-level intuition:
- Training Phase: You have access to a rich dataset where you can pre-calculate everything—Pose, Depth, Masks, Optical Flow, and Text descriptions. You use this rich data to teach a “Student” network.
- Inference Phase (Real-world): The Student network, having learned from the rich data, now looks at a simple RGB video and “hallucinates” the missing modalities internally to make a prediction.
The Architecture Overview
Let’s look at how the pieces fit together.

As shown in Figure 2(a), the system splits into two paths:
- The Teacher (Fixed): A pre-trained standard VAD model. It provides a stable baseline of features (\(\mathcal{F}_{teach}\)).
- The Student (Learner): This is the model we are training. It takes RGB input (\(\mathcal{F}_{RGB}\)) and passes it through the Poly-modal Inductor.
The Student tries to detect anomalies, but it is constantly corrected by two forces: the weak (video-level) ground-truth labels and the knowledge distilled from the Teacher together with the induced multi-modal representations.
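To make this data flow concrete, here is a minimal PyTorch-style sketch of the two paths. The class and attribute names (`PIVADSketch`, `poly_modal_inductor`, the linear scoring head) are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PIVADSketch(nn.Module):
    """Illustrative sketch of the Teacher/Student split described above."""

    def __init__(self, teacher: nn.Module, student_backbone: nn.Module,
                 poly_modal_inductor: nn.Module, dim: int = 512):
        super().__init__()
        self.teacher = teacher.eval()            # frozen, pre-trained VAD model
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.student = student_backbone          # trainable RGB feature extractor
        self.pi = poly_modal_inductor            # Poly-modal Inductor (PMG + CMI)
        self.head = nn.Linear(dim, 1)            # per-snippet anomaly score

    def forward(self, rgb_clip):
        f_rgb = self.student(rgb_clip)            # F_RGB: (B, T, D)
        with torch.no_grad():
            f_teach = self.teacher(rgb_clip)      # F_teach: stable reference features
        f_star, pseudo = self.pi(f_rgb)           # enhanced features + hallucinated modalities
        scores = torch.sigmoid(self.head(f_star)).squeeze(-1)   # (B, T) anomaly scores
        return scores, f_star, f_teach, pseudo
```

In deployment, the Teacher call and the pseudo-modality outputs would simply be dropped, leaving a plain RGB pipeline.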
The Heart of the Machine: The Poly-modal Inductor (PI)
The Poly-modal Inductor is where the magic happens. It is designed to take standard RGB features and inject them with multi-modal wisdom. It consists of two novel modules:
- Pseudo Modality Generation (PMG)
- Cross Modal Induction (CMI)
Let’s break these down step-by-step.
1. Pseudo Modality Generation (PMG)
Standard multi-modal approaches require you to run a Pose estimator (like YOLO-pose) or a Depth estimator (like DepthAnything) at runtime. PMG bypasses this.
The PMG module acts as a translator. It takes the Student’s RGB features and tries to reconstruct the embeddings of the other five modalities. It effectively asks: “Based on these pixels, what would the depth map look like? What would the pose skeleton look like?”
To train this, the researchers use “ground truth” embeddings extracted from off-the-shelf pre-trained models (like CLIP for text, RAFT for optical flow, and SAM for masks) only during the training phase.
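To picture what this "translator" might look like, here is a minimal sketch assuming one small projection head per modality; the module name, head design, and dimensions are placeholders rather than the paper's architecture.

```python
import torch.nn as nn

MODALITIES = ["pose", "depth", "masks", "flow", "text"]  # the five senses discussed above

class PseudoModalityGeneratorSketch(nn.Module):
    """Maps RGB features to one pseudo-embedding per modality."""

    def __init__(self, rgb_dim: int = 512, modality_dim: int = 512):
        super().__init__()
        self.heads = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(rgb_dim, rgb_dim),
                             nn.GELU(),
                             nn.Linear(rgb_dim, modality_dim))
            for m in MODALITIES
        })

    def forward(self, f_rgb):
        # f_rgb: (B, T, rgb_dim) -> one pseudo-embedding per modality, each (B, T, modality_dim)
        return {m: head(f_rgb) for m, head in self.heads.items()}
```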
The loss function for PMG is a mean squared error that forces the generated pseudo-embedding (\(\hat{e}\)) to match the real modality embedding (\(e\)):

Here, \(j\) indexes the specific modality (Pose, Depth, Masks, Optical Flow, Text). By minimizing this loss, the network learns to compress multi-modal information directly into the RGB processing pipeline.
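In equation form, a plausible rendering of this objective (the paper's exact normalization and per-modality weighting may differ) is:

\[
\mathcal{L}_{PMG} \;=\; \sum_{j} \big\lVert \hat{e}_{j} - e_{j} \big\rVert_{2}^{2},
\]

with the sum running over the five modalities.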
2. Cross Modal Induction (CMI)
Now that the model has generated these “pseudo” modalities, it needs to combine them intelligently. Simply averaging them won’t work because different modalities can conflict; for example, a “static” scene might have high semantic relevance (Text) but carry almost no motion signal (Optical Flow).
The CMI module aligns these disparate senses into a shared space using Contrastive Learning.
The goal is to ensure that the generated Pose embedding for Frame A is semantically close to the RGB features of Frame A, and far away from the features of Frame B. This is achieved using an InfoNCE loss function:

The total alignment loss sums this up across all five modalities:
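As a sketch, the two objectives might be written as follows, where \(\mathrm{sim}(\cdot,\cdot)\) is a similarity function (e.g., cosine) and \(\tau\) a temperature; the paper's exact choice of positives, negatives, and normalization may differ:

\[
\mathcal{L}_{\mathrm{NCE}}^{(j)} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(\hat{e}^{(j)}_{i}, f^{RGB}_{i})/\tau\big)}{\sum_{k=1}^{N} \exp\!\big(\mathrm{sim}(\hat{e}^{(j)}_{i}, f^{RGB}_{k})/\tau\big)},
\qquad
\mathcal{L}_{\mathrm{align}} = \sum_{j=1}^{5} \mathcal{L}_{\mathrm{NCE}}^{(j)}
\]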

Once aligned, the modalities are concatenated and passed through Transformer blocks. This allows the model to use an “Attention” mechanism to decide which sense is most important for the current frame. If the camera sees a fight, the mechanism might pay more attention to the Pose and Motion channels. If it sees an abandoned bag, it might focus on Panoptic Masks and Depth.
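One way to realize the "concatenate and pass through Transformer blocks" step is to treat each pseudo-modality embedding as a token and let multi-head self-attention weight the tokens per frame. The sketch below follows that layout; the token arrangement, depth, and head counts are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossModalFusionSketch(nn.Module):
    """Stacks RGB + pseudo-modality features as tokens and fuses them with attention."""

    def __init__(self, dim: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, f_rgb, pseudo):
        # f_rgb: (B, T, D); pseudo: dict of five tensors, each (B, T, D)
        B, T, D = f_rgb.shape
        tokens = torch.stack([f_rgb] + list(pseudo.values()), dim=2)  # (B, T, 6, D)
        tokens = tokens.view(B * T, -1, D)      # one token sequence per frame
        fused = self.blocks(tokens)             # attention decides which "sense" matters now
        f_star = fused[:, 0].reshape(B, T, D)   # keep the RGB token as the enhanced feature
        return f_star
```

Stacking modalities as tokens (rather than concatenating along the channel dimension) is one possible design choice; it lets the per-modality attention weights be read off directly, which is convenient for the kind of activation analysis shown later.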
3. Distillation: Keeping the Teacher Happy
Finally, to ensure that this fancy multi-modal feature vector is actually useful for the specific task of Anomaly Detection, the student’s output is distilled. The model minimizes the difference between its enhanced features (\(\mathcal{F}^*_M\)) and the Teacher’s stable features (\(\mathcal{F}_{teach}\)):

This distillation process ensures that the “hallucinated” features don’t drift too far into abstraction and remain grounded in the task of video analysis.
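As a sketch, the distillation term and an overall training objective could look like the following; the squared-error form and the loss weights \(\lambda_{1..3}\) are assumptions, and \(\mathcal{L}_{VAD}\) stands for the weakly supervised detection loss driven by the ground-truth labels:

\[
\mathcal{L}_{\mathrm{distill}} = \big\lVert \mathcal{F}^{*}_{M} - \mathcal{F}_{teach} \big\rVert_{2}^{2},
\qquad
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{VAD} + \lambda_{1}\,\mathcal{L}_{PMG} + \lambda_{2}\,\mathcal{L}_{\mathrm{align}} + \lambda_{3}\,\mathcal{L}_{\mathrm{distill}}
\]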
Experimental Results: Does it Work?
The researchers tested PI-VAD on major benchmark datasets: UCF-Crime, XD-Violence, and the new MSAD dataset.
The results were impressive. PI-VAD outperformed not only single-mode RGB methods but also existing multi-modal methods that require heavy computation at inference time.

As seen in Table 1, PI-VAD achieves an AUC (Area Under the Curve) of 90.33% on UCF-Crime. This is a significant jump (+2.75%) over the best RGB-only model and even outperforms VadCLIP, a massive model that uses heavy vision-language integration.
Which Anomalies Benefit the Most?
The aggregate numbers are good, but the class-wise breakdown tells the real story.

Looking at Figure 3, we can see massive gains in specific categories:
- Explosion: The baseline (UR-DMU) scored a dismal 47%. PI-VAD jumped to 78%. This suggests that the multi-modal context (likely Depth and the motion cues from Optical Flow) helped the model understand this chaotic event.
- Shoplifting: A notoriously hard category due to its subtlety. PI-VAD improved significantly (from 66% to 86%).
- Fighting & Robbery: Consistent improvements were seen here, likely driven by the Pose and Motion integration.
Seeing is Believing: Qualitative Analysis
The researchers visualized the “Latent Activation” of the different modalities during anomalous events. This effectively shows us what the model’s “brain” is focusing on.

In Figure 4, look at the bottom row (Row-3). These colored lines represent how strongly each modality is activating:
- Burglary-024 (2nd column): Notice the spike in the Blue line (Pose) and Purple line (Motion) right when the burglary happens.
- RoadAccident-127 (3rd column): All modalities spike together. A car crash involves massive changes in depth, motion, object segmentation, and pose simultaneously.
- Shoplifting-016 (5th column): This is interesting. The activation is lower and messier, reflecting the subtlety of the crime, but the combination of Pose and Depth helps the model maintain a high anomaly score (the Pink shaded area in Row-2).
Which Sense is the Most Important?
Is one modality doing all the heavy lifting? The researchers ran an ablation study, turning on one modality at a time to see its impact.

Figure 5 (top graph) reveals that Depth (orange line) and Motion (brown line) are often the strongest individual contributors. Depth allows the model to understand the 3D geometry of the scene, distinguishing foreground action from background noise.
However, Figure 6 (bottom graph) shows that the best performance (the blue dashed line with triangles) comes from “All Modalities.” The paper notes that while Motion is great for “Fighting,” it might fail at “Vandalism” where semantic context (Text) or object segmentation (Masks) is more important. The power of PI-VAD lies in the synergy of all five.
Conclusion
The “Just Dance with \(\pi\)!” paper presents a compelling argument for the future of computer vision: Training is expensive, but inference must be cheap.
By shifting the burden of multi-modal analysis to the training phase, PI-VAD allows a lightweight model to deploy “heavyweight” intelligence. It essentially teaches a standard camera to imagine depth, skeletal pose, and semantic context, allowing it to detect complex crimes like shoplifting or abuse with unprecedented accuracy.
For students and researchers in VAD, the takeaways are clear:
- RGB is not enough for complex real-world behaviors.
- Pseudo-labeling and Distillation can transfer knowledge from massive foundation models (like SAM or CLIP) into smaller, task-specific networks.
- Cross-modal alignment is crucial. Having the data isn’t enough; you must force the model to understand how different data types relate to each other.
As surveillance environments become more complex, approaches like PI-VAD that balance high-level understanding with real-time efficiency will likely become the standard for intelligent video analytics.