Imagine you are a quality control inspector on a factory line. Thousands of components pass by every hour. Your job isn’t just to spot a broken part; you have to explain why it’s broken. Is it a scratch? A dent? Is the soldering messy?
Now, imagine trying to teach an AI to do this. While modern Multimodal Large Language Models (MLLMs) like GPT-4o are incredible at describing a sunset or reading a menu, they struggle significantly when asked to find a microscopic crack in a screw or a slight discoloration on a medical scan. They lack the “specialist” eye required for Anomaly Detection (AD).
In this post, we are diving deep into a new research paper titled “Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models.” We will explore how researchers from Johns Hopkins University and Honda Research Institute developed Anomaly-OneVision (Anomaly-OV), a model designed to bridge the gap between general AI reasoning and the precision required for detecting anomalies.
The Problem: Generalists vs. Specialists
The field of Anomaly Detection (AD) has traditionally relied on “unsupervised” learning. You show a model thousands of “normal” images (e.g., perfect screws), and it learns to flag anything that looks different.
However, this approach has a major flaw: Data Scarcity. In the real world, you don’t always have thousands of normal samples, nor do you have examples of every possible defect. This has given rise to Zero-Shot Anomaly Detection (ZSAD)—the ability to detect defects in objects the model has never seen before, without any specific training on that object.
Recently, Multimodal Large Language Models (MLLMs) have promised a revolution in computer vision. Yet, when researchers tested models like GPT-4o on industrial defects, they found a disconnect.

As shown in Figure 2, generalist models often hallucinate. In the candle example above, GPT-4o correctly guesses there is an anomaly but invents a reason (“wick facing the opposite direction”), missing the actual defect (a small hole/crack near the wick). Anomaly-OV, the model proposed in this paper, correctly identifies the damage near the wick.
The core problem is that MLLMs are trained on general internet data. They lack the fine-grained visual attention required to spot tiny defects and the specific vocabulary to reason about them.
Part 1: Building the Foundation (The Dataset)
Before building a better model, the researchers faced a hurdle: there was no large-scale dataset designed to teach MLLMs how to reason about anomalies. Existing datasets provided images and binary labels (Normal/Abnormal) or segmentation masks, but no textual explanations.
To solve this, the authors introduced Anomaly-Instruct-125k.
Creating “In-the-Wild” Data
Industrial data is expensive and proprietary. To create a robust dataset, the researchers built an automated pipeline to collect “in-the-wild” anomaly data (WebAD).

As illustrated in Figure 8, the process is ingenious:
- Collection: They utilized GPT-4o to generate pairs of search terms (e.g., “A scratched car” vs. “A brand-new car”).
- Cleaning: They used CLIP features to remove duplicates and verify that the images actually matched the descriptions (see the sketch after this list).
- Instruction Generation: Finally, they fed these cleaned images back into GPT-4o to generate detailed questions and answers about the anomalies.
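To make the cleaning step concrete, here is a minimal sketch of how CLIP features can be used for this kind of filtering. The model checkpoint, the thresholds, and the `clean_candidates` helper are illustrative assumptions, not the authors' exact pipeline.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clean_candidates(image_paths, query, match_thresh=0.25, dup_thresh=0.95):
    """Keep images that match the generated search term and drop near-duplicates."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    kept, kept_emb = [], []
    for path, emb in zip(image_paths, img_emb):
        # 1) Verify the image actually matches the search term (image-text similarity).
        if float(emb @ txt_emb[0]) < match_thresh:
            continue
        # 2) Drop near-duplicates of images we have already kept (image-image similarity).
        if kept_emb and max(float(emb @ k) for k in kept_emb) > dup_thresh:
            continue
        kept.append(path)
        kept_emb.append(emb)
    return kept
```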
The result is a massive dataset covering industrial, medical, and daily-life objects.

Figure 5 shows the diversity of this dataset. It includes conversational data where the model must answer questions like “What is the potential cause?” and “How can this be prevented?”, moving beyond simple detection into complex reasoning.
Part 2: The Method (Anomaly-OneVision)
Now, let’s look at the architecture. The goal of Anomaly-OneVision (Anomaly-OV) is to create a “specialist” visual assistant that guides the “generalist” LLM.
The researchers base their model on LLaVA-OneVision, but they introduce a crucial mechanism: the Anomaly Expert. This component identifies suspicious regions in an image and forces the LLM to pay attention to them.

Figure 3 provides the architectural blueprint. Let’s break down the flow (a toy end-to-end sketch follows the list):
- Image Input: The high-resolution image is split into patches (\(I_0\) to \(I_4\)).
- Visual Encoder: A standard vision transformer (ViT) extracts features.
- The Specialist (LTFM): This is the core innovation (explained below). It analyzes features to find anomalies before the LLM gets involved.
- VT Selector: Based on what the Specialist finds, the model selects and “zooms in” on the most suspicious visual tokens.
- LLM: The Large Language Model receives the standard visual features plus the emphasized suspicious tokens to generate the final text response.
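To see how these pieces fit together, here is a toy, self-contained version of that flow. All dimensions, the dummy `nn.Linear` stand-ins, and the way tokens are concatenated are assumptions for illustration only; the actual model is built on top of LLaVA-OneVision’s codebase.

```python
import torch
import torch.nn as nn

D_VIS, D_LLM, N_PATCH, TOP_K = 64, 128, 49, 8

vision_encoder = nn.Linear(3 * 16 * 16, D_VIS)   # stand-in for the ViT patch encoder
ltfm_scorer    = nn.Linear(D_VIS, 1)             # stand-in for the LTFM "specialist"
projector      = nn.Linear(D_VIS, D_LLM)         # maps visual tokens into the LLM's space

def forward(image_patches, text_embeds):
    # 1) Visual encoder: per-patch features for the image (and, in practice, its crops).
    feats = vision_encoder(image_patches)                  # (N_PATCH, D_VIS)
    # 2) Specialist (LTFM): one significance score per patch.
    sig = torch.sigmoid(ltfm_scorer(feats)).squeeze(-1)    # (N_PATCH,)
    # 3) VT Selector: keep the most suspicious tokens, weighted by their scores.
    top = torch.topk(sig, TOP_K).indices
    suspect = feats[top] * sig[top].unsqueeze(-1)          # (TOP_K, D_VIS)
    # 4) The LLM receives the ordinary visual tokens plus the emphasized suspicious ones.
    visual = projector(torch.cat([feats, suspect], dim=0))
    return torch.cat([visual, text_embeds], dim=0)         # token sequence handed to the LLM

patches = torch.randn(N_PATCH, 3 * 16 * 16)
text    = torch.randn(10, D_LLM)
print(forward(patches, text).shape)   # torch.Size([67, 128])
```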
The Core Innovation: Look-Twice Feature Matching (LTFM)
Humans inspect objects in two steps. First, we glance at the whole object to understand what it is. Then, we look closely at specific areas to check for defects. The researchers mimic this behavior with Look-Twice Feature Matching (LTFM).

Step 1: Generating the “Prototype”
Unlike previous methods that use fixed text prompts like “a photo of a damaged object,” Anomaly-OV learns to generate abnormality descriptions directly from the visual features.
It takes the global visual feature of the object (\(\mathbf{v}_0^o\)) and fuses it with learnable “Positive” (anomalous) and “Negative” (normal) embeddings (\(\mathbf{e}^+\) and \(\mathbf{e}^-\)).

This equation essentially says: The model creates a custom definition of “Normal” (\(\mathbf{d}_i^-\)) and “Abnormal” (\(\mathbf{d}_i^+\)) specific to the object currently being looked at.
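As a rough sketch of what this fusion might look like in code (the dimensions and the simple MLP fusion are assumptions; the paper’s exact operator may differ, but the idea of turning learnable normal/abnormal embeddings into object-specific prototypes is the same):

```python
import torch
import torch.nn as nn

D = 64  # assumed feature dimension

class PrototypeGenerator(nn.Module):
    def __init__(self, dim=D):
        super().__init__()
        self.e_pos = nn.Parameter(torch.randn(dim))   # learnable "abnormal" embedding e^+
        self.e_neg = nn.Parameter(torch.randn(dim))   # learnable "normal" embedding e^-
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, v_global):
        # v_global: (D,) global feature of the object, v_0^o in the paper's notation.
        d_pos = self.fuse(torch.cat([v_global, self.e_pos]))  # object-specific "abnormal" prototype
        d_neg = self.fuse(torch.cat([v_global, self.e_neg]))  # object-specific "normal" prototype
        return d_pos, d_neg

gen = PrototypeGenerator()
d_pos, d_neg = gen(torch.randn(D))
```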
Step 2: The Significance Map
Once the model knows what “normal” and “abnormal” look like for this specific object, it scans every patch of the image. It calculates a significance map (\(m_j\)) by comparing each local image patch against the abnormal prototype.

This formula uses cosine similarity and softmax to assign a score to every patch. If a patch looks more like the “abnormal” prototype than the “normal” one, it gets a high score.
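In code, the scoring might look something like the sketch below. The temperature value and the exact normalization are assumptions; only the cosine-similarity-plus-softmax structure follows the description above.

```python
import torch
import torch.nn.functional as F

def significance_map(patch_feats, d_pos, d_neg, temperature=0.07):
    """patch_feats: (N, D) local features; d_pos / d_neg: (D,) abnormal / normal prototypes."""
    patches = F.normalize(patch_feats, dim=-1)
    protos = F.normalize(torch.stack([d_neg, d_pos]), dim=-1)   # (2, D)
    sims = patches @ protos.T / temperature                     # cosine similarity to each prototype
    probs = sims.softmax(dim=-1)                                 # softmax over {normal, abnormal}
    return probs[:, 1]                                           # m_j: "abnormal" score per patch

# Toy usage: 49 patches of dimension 64, random prototypes.
m = significance_map(torch.randn(49, 64), torch.randn(64), torch.randn(64))
print(m.shape)  # torch.Size([49])
```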
Visualizing the Significance
Does this math actually work? Yes. The significance maps accurately highlight defects without any manual supervision (masks).

In Figure 7, you can see the red boxes (ground truth) and the “Significance Map” below them. The model successfully “heats up” the pixels where the anomalies are located (the scratch on the block, the spot on the nut, the lines on the tile).
Helping the LLM Focus
Current MLLMs process thousands of visual tokens per image; spotting a tiny defect among them is like finding a needle in a haystack. The Visual Token Selector (VT Selector) uses the significance map calculated above to filter them.

By multiplying the visual features by the significance map (\(\mathbf{m}_j\)), the model suppresses the background noise and amplifies the signal from the defective areas. These emphasized tokens are then fed into the LLM with a special prompt <adv> (standing for adversarial/anomalous features), effectively saying to the LLM: “Hey, pay attention to this specific spot!”
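A schematic version of that selection step is shown below. The top-k value, the token count, and the way the emphasized tokens follow the <adv> marker are illustrative assumptions; the real model handles this inside its LLaVA-OneVision-based prompt template.

```python
import torch

def select_suspicious_tokens(patch_feats, sig_map, top_k=16):
    """Weight each visual token by its significance score and keep the top-k."""
    weighted = patch_feats * sig_map.unsqueeze(-1)   # suppress background, amplify defects
    idx = torch.topk(sig_map, top_k).indices
    return weighted[idx]                             # emphasized tokens placed after <adv>

# Conceptually, the resulting input looks like:
#   "<ordinary image tokens> ... <adv> <emphasized suspicious tokens> ... Is there an anomaly?"
feats = torch.randn(729, 1152)           # e.g., a 27x27 grid of patch features (assumed shape)
sig = torch.rand(729)
adv_tokens = select_suspicious_tokens(feats, sig)
print(adv_tokens.shape)                   # torch.Size([16, 1152])
```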
Part 3: Experiments and Results
The researchers compared Anomaly-OV against state-of-the-art ZSAD methods (like WinCLIP and AnomalyCLIP) and generalist MLLMs (like GPT-4o).
Zero-Shot Detection Performance
For standard anomaly detection (classifying an image as good or bad), Anomaly-OV outperforms existing methods by a significant margin.

Figure 1 shows a radar chart of AUROC scores (a standard metric where higher is better). Anomaly-OV (the blue line) covers the largest area, surpassing methods like WinCLIP and AnomalyCLIP across diverse datasets including VisA and MPDD.
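For context, image-level AUROC is easy to compute from per-image anomaly scores. Here is a minimal example using scikit-learn; the labels and scores are made up for illustration and are not from the paper.

```python
from sklearn.metrics import roc_auc_score

# Ground-truth labels (1 = anomalous, 0 = normal) and the model's anomaly scores.
labels = [0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.10, 0.25, 0.05, 0.80, 0.60, 0.30, 0.95, 0.70]

print(roc_auc_score(labels, scores))  # 1.0 = perfect ranking of anomalies above normals; 0.5 = random guessing
```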
The quantitative data backs this up:

As seen in Table 2, Anomaly-OV achieves an average score of 88.6, beating the previous best, AdaCLIP, which scores 85.3.
Reasoning Capabilities
The true power of Anomaly-OV lies in its ability to converse. The researchers tested the model on their new benchmark, VisA-D&R (Detection & Reasoning).
Case Study: Industrial Inspection (PCB)
In this example involving a Printed Circuit Board (PCB), the user asks if there is an anomaly.

- GPT-4o: Claims the sensor appears intact.
- LLaVA-OV (Base model): Says there is no obvious anomaly.
- Anomaly-OV: Correctly identifies the bent LED and explains that it is not aligned properly.
Case Study: Fine-Grained Details (Macaroni)
Food inspection requires spotting subtle organic changes.

Here, Anomaly-OV identifies a “yellowish spot” on the macaroni, showcasing its ability to detect subtle color shifts that generalist models often dismiss as lighting variations.
Extension to Medical and 3D
The paper also demonstrates that this “specialist” approach works outside the factory.

In Table 6, we see Anomaly-OV diagnosing medical imagery (Pneumonia in a chest X-ray) and analyzing 3D renderings (a bulge in an object). This suggests the architecture is robust across different domains of computer vision.
Conclusion
Anomaly-OneVision represents a significant step forward in making AI useful for high-stakes visual inspection. By acknowledging that generalist LLMs are not enough, the researchers successfully designed a hybrid system:
- A Specialist Visual Encoder (LTFM) that mimics human inspection by “looking twice”—first globally, then locally.
- A Generalist LLM that uses those emphasized visual cues to reason and explain.
Combined with the release of the Anomaly-Instruct-125k dataset, this work opens the door for AI assistants that don’t just tell us that something is wrong, but explain what, why, and how to fix it. Whether it’s a scratched car bumper, a defective computer chip, or a medical anomaly, models like Anomaly-OV are bringing us closer to automated, intelligent visual reasoning.