Imagine an autonomous vehicle navigating a busy city street. It has been trained on thousands of hours of driving footage—cars, pedestrians, cyclists, and traffic lights. Suddenly, a person wearing a giant, inflatable dinosaur costume runs across the crosswalk.
The car’s camera sees a “pedestrian,” but the shape is wrong. The LiDAR sensor detects an obstacle, but the dimensions don’t match a human. The system is confused. This is the Out-of-Distribution (OOD) problem: the challenge of handling data that deviates significantly from what the model saw during training.
In safety-critical applications, identifying these anomalies is just as important as recognizing the normal data. Modern systems attempt to solve this by using multimodal data—combining video, audio, and optical flow. The logic is simple: if the video sensor is confused, maybe the audio sensor knows better.
However, a new research paper titled “DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection” identifies a critical flaw in how we currently handle these multimodal systems. By assuming that all “normal” training data is perfectly consistent, existing methods inadvertently sabotage their own accuracy.
In this deep dive, we will explore the proposed solution: Dynamic Prototype Updating (DPU). We will look at how it helps models distinguish between a “weird” version of a known object and a truly unknown threat.
The Problem: When “Normal” Isn’t Uniform
To understand the contribution of this paper, we first need to understand the current state of Multimodal OOD detection.
The Intuition of Multimodal Discrepancy
Recent advancements have leveraged a phenomenon known as prediction discrepancy. When a multimodal model sees a familiar object (In-Distribution, or ID), all its senses usually agree. If it sees a dog barking, the video says “dog,” the audio says “dog,” and the motion (optical flow) says “dog.”
However, when the model encounters something unknown (OOD), the modalities often disagree. The video might be unsure, while the audio is confident about something else. Previous state-of-the-art methods, such as the “Agree-to-Disagree” framework, exploit this. They intentionally amplify the disagreement during training to make OOD samples stand out more.
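To build intuition, here is a small, generic way to quantify that disagreement: a symmetric KL divergence between the class distributions predicted by two modality heads. This is an illustrative measure only, not the specific discrepancy score used by Agree-to-Disagree or DPU.

```python
import torch
import torch.nn.functional as F

def prediction_discrepancy(video_logits: torch.Tensor,
                           audio_logits: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between the class distributions predicted
    by two modalities; higher values mean more disagreement."""
    p = F.softmax(video_logits, dim=-1)
    q = F.softmax(audio_logits, dim=-1)
    kl_pq = (p * (p.clamp_min(1e-8).log() - q.clamp_min(1e-8).log())).sum(-1)
    kl_qp = (q * (q.clamp_min(1e-8).log() - p.clamp_min(1e-8).log())).sum(-1)
    return 0.5 * (kl_pq + kl_qp)

# A toy batch of 4 samples over 10 classes from two modality heads.
video_logits = torch.randn(4, 10)
audio_logits = torch.randn(4, 10)
print(prediction_discrepancy(video_logits, audio_logits))  # one score per sample
```

An ID sample should produce a low score here; an OOD sample, where the modalities diverge, should produce a high one.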
The Flaw: Intra-Class Variation
The problem is that existing methods treat every sample of a specific class as if it were the “perfect” example of that class. They assume high cohesion.
But in the real world, intra-class variation is massive. Consider the class “Swimming.”
- Sample A: A professional swimmer in a clear Olympic pool.
- Sample B: A child splashing in a dark, muddy lake.
Both are “Swimming.” However, Sample B is much “noisier” and further from the mathematical average of the class. If we blindly force the model to amplify discrepancies on Sample B—which is already hard to recognize—we might push the model to classify this valid swimming video as an anomaly.
![The ID accuracy declines after using uniform discrepancy intensification in the SOTA framework [11] (denoted as ‘AN’; the middle bars), and the accuracy improves using our proposed DPU (the right bars). This figure presents the results of MSP and ReAct in Far-OOD detection using HMDB51 as the ID dataset.](/en/paper/2411.08227/images/028.jpg#center)
As shown in the figure above, when researchers applied uniform discrepancy intensification (the “AN” bars), the accuracy on known data (ID Accuracy) actually dropped compared to the original model. By treating difficult ID samples like outliers, the model got confused.
The Solution: Dynamic Prototype Updating (DPU)
The researchers propose a framework that adapts to the data. Instead of applying a “one-size-fits-all” rule to every training sample, DPU treats samples differently based on how close they are to the “ideal” version of their class.
The framework is built on three pillars:
- Cohesive-Separate Contrastive Training (CSCT): organizing the feature space.
- Dynamic Prototype Approximation: finding the true “center” of a class.
- Pro-ratio Discrepancy Intensification: applying pressure intelligently.
Let’s look at the high-level architecture before breaking down the math:

Step 1: Cohesive-Separate Contrastive Training (CSCT)
Before we can calculate distances, we need a stable map. The goal of Step 1 is to organize the model’s understanding of the world so that:
- All “swimming” clips are clumped together (Intra-class cohesion).
- “Swimming” clips are far away from “Running” clips (Inter-class separation).
The researchers use Contrastive Learning. They take a batch of videos and look at their feature embeddings. They want to pull samples of the same class closer and push samples of different classes apart.
They use a modified loss function called Robust Marginal Contrastive Learning (RMCL).

Here, \(f_{pos}\) represents the similarity scores for positive matches (same class), and \(f_{neg}\) represents negative matches (different classes). To make the model more robust, they add an “angular margin” \(m\) to the positive pairs. This forces the model not merely to learn the difference, but to learn it with a safety buffer.
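The paper's exact RMCL formula isn't reproduced here, but a minimal sketch of a supervised contrastive loss with an additive angular margin on positive pairs captures the mechanics. The temperature, margin value, and masking details below are illustrative assumptions, not the authors' hyperparameters.

```python
import torch
import torch.nn.functional as F

def margin_contrastive_loss(features, labels, margin=0.3, temperature=0.1):
    """Supervised contrastive loss with an angular margin m on positive
    (same-class) pairs, so positives must match 'with a safety buffer'."""
    z = F.normalize(features, dim=-1)                   # unit-norm embeddings
    sim = z @ z.t()                                     # cosine similarities
    theta = torch.acos(sim.clamp(-1 + 1e-7, 1 - 1e-7))
    sim_margin = torch.cos(theta + margin)              # cos(theta + m) for positives

    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    eye = torch.eye(len(labels), device=features.device)
    pos_mask = pos_mask - eye                           # exclude self-pairs

    logits = torch.where(pos_mask.bool(), sim_margin, sim) / temperature
    logits = logits.masked_fill(eye.bool(), float('-inf'))  # drop self-similarity

    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp_min(1)
    return loss.mean()

feats = torch.randn(8, 128, requires_grad=True)         # 8 embeddings, 4 classes
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(margin_contrastive_loss(feats, labels))
```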

However, contrastive learning alone isn’t enough. It can sometimes be unstable if a batch of data has high variance. To fix this, DPU incorporates Invariant Risk Minimization (IRM). They measure the variance of the loss within the batch and try to minimize it.

By minimizing the variance (\(\text{Var}(\mathcal{L}^j)\)), the model ensures that it learns a representation that is consistent, regardless of which specific samples end up in a training batch. The final loss for this step combines the contrastive loss and the variance penalty.
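One simple way to realize that combination is a weighted sum of the batch-mean loss and its variance; the weighting term below is an assumption for illustration, not the paper's exact coefficient.

```python
import torch

def combined_objective(per_sample_losses: torch.Tensor,
                       variance_weight: float = 1.0) -> torch.Tensor:
    """Mean contrastive loss plus a penalty on its variance across the batch,
    rewarding representations that behave consistently for every sample."""
    mean_loss = per_sample_losses.mean()
    variance_penalty = per_sample_losses.var(unbiased=False)
    return mean_loss + variance_weight * variance_penalty

# Placeholder per-sample losses (e.g., one RMCL term per sample in the batch).
losses = torch.tensor([0.8, 1.1, 0.9, 2.3])
print(combined_objective(losses))
```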

Step 2: Dynamic Prototype Approximation
Now that we have a well-organized feature space, we need to find the “Prototype” for each class. A prototype is essentially the mathematical center of gravity for a specific category (e.g., the “perfect” representation of a dog).
In standard approaches, the prototype might just be a moving average of all samples. But remember the “muddy lake” swimmer? If we let that outlier sample influence our prototype too much, the center of the class will shift toward the outlier, making the prototype less representative of the norm.
DPU solves this by updating the prototype dynamically based on variance.

This equation is a moving average update. \(P^y\) is the prototype for class \(y\), and \(H^y_{av}\) is the average feature of class \(y\) in the current batch.
The magic lies in the Update Rate term:
\[ \frac{1}{\gamma + \operatorname{Var}(\mathcal{L}^j)\, N^y} \]
Notice the denominator. It includes \(\operatorname{Var}(\mathcal{L}^j)\), the variance of the current batch, so the update behaves differently depending on how clean the batch is (a code sketch of the full update follows the list below).
- Low Variance: The batch is clean and consistent. The denominator is small. The Update Rate is high. The prototype updates significantly because we trust this data.
- High Variance: The batch is messy or noisy. The denominator is large. The Update Rate is low. The prototype barely moves, protecting it from being corrupted by outliers.
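In code, the update might look like the sketch below. The exact way the rate enters the moving average (here, a plain step from the old prototype toward the batch mean) goes slightly beyond what the fraction above specifies, so treat it as an assumption.

```python
import torch

def update_prototype(prototype: torch.Tensor,
                     batch_mean: torch.Tensor,
                     loss_variance: float,
                     class_count: int,
                     gamma: float = 1.0) -> torch.Tensor:
    """Move the class prototype toward the batch mean with a rate that shrinks
    as the batch loss variance grows: clean batches pull the prototype strongly,
    noisy batches barely move it."""
    update_rate = 1.0 / (gamma + loss_variance * class_count)
    return prototype + update_rate * (batch_mean - prototype)

proto = torch.zeros(128)            # current prototype P^y
batch_mean = torch.randn(128)       # H^y_av: mean feature of class y in this batch
clean = update_prototype(proto, batch_mean, loss_variance=0.05, class_count=16)
noisy = update_prototype(proto, batch_mean, loss_variance=5.00, class_count=16)
print(clean.norm(), noisy.norm())   # the clean batch moves the prototype much more
```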
Step 3: Pro-ratio Discrepancy Intensification (PDI)
This is the core contribution that directly improves OOD detection.
We want to train the model to be sensitive to discrepancies between modalities (e.g., video vs. audio). But as established, we shouldn’t punish hard-to-classify ID samples.
DPU calculates a dynamic intensification rate (\(\mu\)) for every single sample. It looks at the distance between the sample (\(F_i^v\)) and its class prototype (\(P^y\)).
- Sample is near the prototype: This is a “textbook” example. We want the model to be very confident here. We keep the discrepancy intensification low so the model focuses on learning the features, not fighting noise.
- Sample is far from the prototype: This is an edge case. We increase the discrepancy intensification. This teaches the model that as data gets further from the center, the modalities might start to disagree, and that disagreement is a signal of being an outlier.

Here, the Sigmoid function scales the similarity. If the similarity is high (close to 1), the term \((1 - \text{Sigmoid})\) becomes small, reducing the penalty. If the sample is dissimilar (far away), the term grows, intensifying the discrepancy loss \(\mathrm{Discr}\).
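A minimal sketch of this per-sample weighting, assuming cosine similarity to the prototype and a plain \(1 - \text{Sigmoid}\) scaling (the paper's exact distance measure and scaling constants may differ):

```python
import torch
import torch.nn.functional as F

def discrepancy_weight(sample_feat: torch.Tensor,
                       prototype: torch.Tensor) -> torch.Tensor:
    """Per-sample intensification rate: small near the class prototype,
    larger for samples far away from it."""
    similarity = F.cosine_similarity(sample_feat, prototype, dim=-1)
    return 1.0 - torch.sigmoid(similarity)

def pdi_loss(sample_feat, prototype, discr_loss):
    """Scale the modality-discrepancy loss by the per-sample rate."""
    return discrepancy_weight(sample_feat, prototype) * discr_loss

proto = F.normalize(torch.randn(128), dim=0)
textbook = proto + 0.05 * torch.randn(128)        # sample close to the prototype
edge_case = F.normalize(torch.randn(128), dim=0)  # sample far from the prototype
discr = torch.tensor(1.0)                         # placeholder discrepancy loss
print(pdi_loss(textbook, proto, discr), pdi_loss(edge_case, proto, discr))
```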
Bonus: Adaptive Outlier Synthesis (AOS)
To further help the model recognize “unknowns,” DPU generates fake OOD data during training. It takes two different class prototypes (e.g., “Boxing” and “Clapping”) and fuses them together to create a “Frankenstein” prototype.

The model is then trained to maximize the uncertainty (entropy) on these synthetic outliers, effectively teaching it: “If you see something that looks like a weird mix of boxing and clapping, don’t be confident—flag it.”
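As a rough illustration, a synthetic outlier can be built as a convex combination of two prototypes, with the model then trained to be maximally uncertain on it. The 50/50 mixing and the linear classifier head below are illustrative assumptions, not the paper's exact synthesis procedure.

```python
import torch
import torch.nn.functional as F

def synthesize_outlier(proto_a: torch.Tensor, proto_b: torch.Tensor,
                       mix: float = 0.5) -> torch.Tensor:
    """Fuse two class prototypes into an 'in-between' feature that should
    not resemble any single known class."""
    return mix * proto_a + (1.0 - mix) * proto_b

def entropy_maximization_loss(logits: torch.Tensor) -> torch.Tensor:
    """Negative entropy of the prediction on the synthetic outlier;
    minimizing this pushes the model toward maximum uncertainty."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)
    return -entropy.mean()

proto_boxing, proto_clapping = torch.randn(128), torch.randn(128)
fake = synthesize_outlier(proto_boxing, proto_clapping)
classifier = torch.nn.Linear(128, 10)     # stand-in classifier head
print(entropy_maximization_loss(classifier(fake).unsqueeze(0)))
```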
Experimental Results
The theory sounds solid, but does it work? The researchers tested DPU across five diverse datasets, including HMDB51 (movies/YouTube), UCF101 (action recognition), and Kinetics-600 (large-scale actions).
They evaluated the model on two types of tasks:
- Near-OOD: The unknown samples are somewhat similar to the known ones (e.g., different sports).
- Far-OOD: The unknown samples are completely different (e.g., distinguishing cartoons from real-life videos).
Quantitative Performance
The results show a stark improvement over existing methods. In OOD detection, we look for two headline metrics (a short computation sketch follows the list):
- FPR95 (False Positive Rate at 95% Recall): Lower is better. This measures how often the model mistakenly calls an OOD sample “normal.”
- AUROC: Higher is better. A general measure of detection accuracy.
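For reference, both metrics are easy to compute from a set of OOD scores. Here is a quick sketch using scikit-learn, with synthetic Gaussian scores standing in for real model outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ood_metrics(id_scores: np.ndarray, ood_scores: np.ndarray):
    """AUROC and FPR95 for a score where ID samples should score higher than OOD."""
    labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    scores = np.concatenate([id_scores, ood_scores])
    auroc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = fpr[np.argmax(tpr >= 0.95)]   # FPR at the first threshold reaching 95% TPR
    return auroc, fpr95

id_scores = np.random.normal(2.0, 1.0, 1000)    # e.g. max softmax probability on ID data
ood_scores = np.random.normal(0.0, 1.0, 1000)
print(ood_metrics(id_scores, ood_scores))
```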
Near-OOD Results (Table 1):

The Far-OOD results are even more dramatic. Look at the VIM method in the bottom row.
- FPR95 drops to essentially 0.01.
- AUROC hits 99.99.
This indicates that DPU is exceptionally good at separating completely foreign data from known data.

Figure 1 visually summarizes this gain. The red squares (DPU Enhanced) are consistently located in the top-left corner (High AUROC, Low False Positive Rate) compared to the blue circles (Without DPU).
Why Does It Work? (Ablation Studies)
The researchers didn’t just trust the final numbers; they broke down why it works.
Is the “Dynamic” part necessary? They tested the model using fixed intensification rates (e.g., always intensifying by 0.3 or 0.7) rather than the adaptive, prototype-based method.

Table 3 shows that no single fixed rate works for all datasets. A low rate (0.1) works okay for UCF101 but fails miserably on Kinetics-600. DPU (bottom row) adapts automatically, achieving the best score on both.
Visualizing the Feature Space
Finally, we can see the effect of DPU by visualizing the feature embeddings using t-SNE.

In Figure 4(a) (left), without DPU, the In-Distribution samples (orange) are messy, and the OOD samples (blue) are mixed in or dangerously close. In Figure 4(b) (right), with DPU, the orange cluster is tighter and more cohesive, and the blue OOD samples are pushed clearly to the periphery. This clear separation is exactly what allows the model to say, “I don’t know what this is,” with confidence.
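A plot like this takes only a few lines with scikit-learn and matplotlib; the random embeddings below are stand-ins for the model's actual ID and OOD features:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in embeddings: replace with the penultimate-layer features of the model.
id_feats = np.random.normal(0.0, 1.0, size=(300, 128))
ood_feats = np.random.normal(3.0, 1.0, size=(100, 128))

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(np.vstack([id_feats, ood_feats]))

plt.scatter(coords[:300, 0], coords[:300, 1], s=5, c="orange", label="ID")
plt.scatter(coords[300:, 0], coords[300:, 1], s=5, c="blue", label="OOD")
plt.legend()
plt.title("t-SNE of ID vs. OOD embeddings")
plt.savefig("tsne_id_vs_ood.png")
```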
Conclusion
The DPU framework marks a step toward maturity in how we approach Multimodal OOD detection. It moves beyond the naive assumption that every training sample is a perfect representative of its class. By acknowledging intra-class variation—the fact that a “dog” can look like a thousand different things—and adjusting its learning strategy dynamically, DPU achieves state-of-the-art results.
For students and researchers, the key takeaway is the power of adaptive training.
- CSCT ensures the playground is organized.
- Dynamic Prototypes ensure we know where the “safe zones” are, without being swayed by outliers.
- Pro-ratio Intensification ensures we only punish the model for discrepancies when it truly matters.
As AI systems are deployed in increasingly complex, open-world environments (like autonomous driving or medical diagnostics), methods like DPU will be essential for ensuring these systems know not just what they are seeing, but when they are confused.