Imagine you are running a high-tech manufacturing line for semiconductor chips or precision automotive parts. You want to automate quality control using AI. To train a model to spot defects (like a scratch on a lens or a crack in a hazelnut), you generally need thousands of examples of those defects.
But here is the catch: modern manufacturing is extremely good. Defects are rare—often occurring in fewer than 1 in 10,000 items. You simply cannot collect enough data to train a supervised deep learning model effectively.
This is the “Data Scarcity” problem in industrial anomaly detection.
In this post, we will explore DefectFill, a recent research paper that proposes a solution: if you can’t find the data, generate it. But unlike previous attempts that produced blurry or unconvincing fakes, DefectFill uses Inpainting Diffusion Models to generate defects so realistic they blend seamlessly with the object’s texture, lighting, and geometry.
We will break down how this method works, the mathematics behind its custom loss functions, and why “low-fidelity” selection might actually be a good thing.
The Core Problem: Why “Copy-Paste” Isn’t Enough
Before deep generative models, researchers used simple tricks to simulate defects. A common method, CutPaste, cuts a patch out of a normal image and pastes it back at a random location. While this creates an “anomaly,” it looks artificial: the edges are sharp, the lighting doesn’t match, and the texture is discontinuous.
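To make the limitation concrete, here is a minimal sketch of a CutPaste-style augmentation in Python. The patch size and random placement are arbitrary illustrative choices, not the original paper’s settings:

```python
import numpy as np

def cutpaste(image: np.ndarray, patch_size: int = 32, rng=np.random) -> np.ndarray:
    """Cut a patch from a normal image and paste it at a random location."""
    h, w = image.shape[:2]
    # Source location of the patch
    ys, xs = rng.randint(0, h - patch_size), rng.randint(0, w - patch_size)
    patch = image[ys:ys + patch_size, xs:xs + patch_size].copy()
    # Destination: pasted with no blending, so the boundary stays hard
    yd, xd = rng.randint(0, h - patch_size), rng.randint(0, w - patch_size)
    augmented = image.copy()
    augmented[yd:yd + patch_size, xd:xd + patch_size] = patch
    return augmented
```

The hard rectangular boundary and mismatched illumination are exactly what make these synthetic “anomalies” easy for a network to latch onto, and poor proxies for real defects.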
Later, Generative Adversarial Networks (GANs) were used, but they often struggle with stability and diversity. Recently, Diffusion Models have taken over the world of image generation. However, standard text-to-image diffusion models (like Stable Diffusion) struggle to place a defect exactly where you want it without altering the rest of the perfectly good object.
This is where DefectFill comes in. As shown in the figure below, the goal is to take a normal object and a mask (a shape), and “fill” that mask with a specific type of defect.

The system needs to do two things simultaneously:
- Capture the defect details: A crack looks different than a stain.
- Respect the object context: A crack on a hazelnut looks different than a crack on a ceramic tile.
Background: Inpainting Diffusion Models
To understand DefectFill, we need a quick refresher on Latent Diffusion Models (LDMs). These models work by gradually adding noise to an image until it is pure static, and then learning to reverse that process to reconstruct the image.
The forward process (adding noise) is described by:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Here, \(x_0\) is the original image, and \(\epsilon\) is the noise. The model is trained to predict the noise \(\epsilon\) given a noisy input \(x_t\) and a text prompt \(\mathcal{P}\). The standard loss function looks like this:
$$\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\, \left\| \epsilon - \epsilon_\theta(x_t, t, \mathcal{P}) \right\|_2^2 \,\right]$$
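In code, a single training step of this objective just noises a clean latent and regresses the network’s prediction onto the injected noise. A minimal PyTorch sketch, where `model` and the `alpha_bar` schedule are placeholders rather than the actual Stable Diffusion internals:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, prompt_emb, alpha_bar):
    """Standard noise-prediction loss: sample t, noise x0, predict the noise."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (b,), device=x0.device)  # random timesteps
    a = alpha_bar[t].view(b, 1, 1, 1)                             # cumulative alpha_bar_t
    eps = torch.randn_like(x0)                                     # ground-truth noise
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps                     # forward process
    eps_pred = model(x_t, t, prompt_emb)                           # epsilon_theta(x_t, t, P)
    return F.mse_loss(eps_pred, eps)
```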
Inpainting modifies this process. The model isn’t just generating from scratch; it is filling in a hole. To do this, the input to the model isn’t just the noisy latent image (\(x_t\)). We concatenate it with the masked background (\(b\)) and the mask (\(M\)) itself.

This tells the model: “Keep everything in the background (\(b\)) exactly as it is, and only invent new pixels inside the area defined by the mask (\(M\)).”
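Mechanically, this conditioning is just extra input channels for the UNet. A hedged sketch of how the conditioned input could be assembled (the channel layout is illustrative; the real inpainting model works on VAE latents and a downsampled mask):

```python
import torch

def build_inpainting_input(x_t, background, mask):
    """Stack noisy latent, masked background, and mask along the channel axis.

    x_t:        (B, C, H, W) noisy latent at timestep t
    background: (B, C, H, W) latent of the image with the hole removed
    mask:       (B, 1, H, W) binary mask, 1 inside the region to fill
    """
    return torch.cat([x_t, background, mask], dim=1)
```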
The DefectFill Method
The authors of DefectFill fine-tune a pre-trained Stable Diffusion inpainting model. However, simply telling the model “draw a crack” isn’t precise enough for industrial inspection. They need the model to learn a specific concept of a defect from a few reference images (e.g., 5-10 photos of broken hazelnuts).
To achieve this, they introduce a specialized training pipeline with three unique loss functions.
1. The Dual-Path Architecture
The training process is clever. It splits the workflow into two paths to ensure the model learns both the defect itself and how it sits on the object.
Take a look at the architecture overview below:

- Upper Pipeline (Defect Focus): This path uses the specific defect mask \(M\) and a prompt like “A photo of \([V^*]\)”, where \([V^*]\) is a special token representing the defect. This teaches the model what the defect looks like.
- Lower Pipeline (Object Focus): This path uses random masks (\(M_{rand}\)) and a prompt like “A hazelnut with \([V^*]\)”. This teaches the model the relationship between the object (hazelnut) and the defect. A minimal sketch of how the two inputs might be assembled follows this list.
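Here is a rough sketch of how the two conditioning paths could be wired up in one training step. The prompt strings, the `encode_prompt` helper, and the mask handling are hypothetical stand-ins for the actual pipeline; `build_inpainting_input` is the helper from the earlier sketch:

```python
def dual_path_inputs(x_t, x0, defect_mask, rand_mask, encode_prompt):
    """Assemble the defect-focused and object-focused inputs (illustrative only)."""
    # Upper path: learn the defect concept [V*] inside its exact mask
    defect_bg = x0 * (1 - defect_mask)          # background with the defect region blanked
    defect_input = build_inpainting_input(x_t, defect_bg, defect_mask)
    defect_prompt = encode_prompt("A photo of [V*]")

    # Lower path: learn how [V*] relates to the object, using a random mask
    rand_bg = x0 * (1 - rand_mask)
    object_input = build_inpainting_input(x_t, rand_bg, rand_mask)
    object_prompt = encode_prompt("A hazelnut with [V*]")

    return (defect_input, defect_prompt), (object_input, object_prompt)
```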
2. The Three Loss Functions
The standard diffusion loss (Equation 2) isn’t specific enough. DefectFill replaces it with a weighted combination of three terms.
A. Defect Loss (\(\mathcal{L}_{def}\))
This is the most critical component. It forces the model to learn the intrinsic features of the defect (e.g., the jagged edges of a crack).
The input \(x_t^{def}\) combines the noisy image, the defect-free background, and the specific defect mask.

The loss is calculated only within the masked region.
$$\mathcal{L}_{def} = \mathbb{E}\left[\, \left\| M \odot \left( \epsilon - \epsilon_\theta(x_t^{def}, t, \mathcal{P}) \right) \right\|_2^2 \,\right]$$
Notice the \(M \odot\) term. This means we are masking the loss calculation. We don’t care if the model reconstructs the background perfectly in this step; we only care that it reconstructs the defect inside the mask accurately.
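In code, masking the loss is just an element-wise multiplication before averaging. A short sketch, where normalizing by the mask area is my choice of convention rather than the paper’s exact formulation:

```python
def defect_loss(eps_pred, eps, mask):
    """Noise-prediction error counted only inside the defect mask M."""
    diff = mask * (eps_pred - eps)                        # zero out everything outside M
    return diff.pow(2).sum() / mask.sum().clamp(min=1.0)  # mean over masked pixels only
```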
B. Object Loss (\(\mathcal{L}_{obj}\))
If we only used the defect loss, the model might generate a perfect crack, but it might look like a sticker floating on top of the image. The Object Loss ensures semantic blending.
It uses random boxes (\(M_{rand}\)) scattered over the image. The model must reconstruct the whole image, including the defect and the background.

However, the authors use a trick here. They care more about the defect area than the random background. They create a weighted mask \(M'\):

Here, \(\alpha\) is a weight less than 1. This means errors in the defect area are penalized more heavily than errors in the background, but the background still matters.
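One way to realize this is a per-pixel weight map that equals 1 inside the defect mask and \(\alpha\) elsewhere, then a weighted reconstruction loss over the whole image. This is an illustrative reading of \(M'\), not the paper’s exact formula, and the default \(\alpha\) below is arbitrary:

```python
def object_loss(eps_pred, eps, defect_mask, alpha=0.5):
    """Whole-image reconstruction loss with the defect region up-weighted."""
    weight = defect_mask + alpha * (1 - defect_mask)  # M': 1 inside M, alpha outside
    diff = weight * (eps_pred - eps)
    return diff.pow(2).mean()
```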
C. Attention Loss (\(\mathcal{L}_{attn}\))
Finally, we need to ensure that when the model sees the token \([V^*]\), it actually looks at the pixels where the defect is. This uses the Cross-Attention maps from the UNet.

This loss forces the attention map for the token \([V^*]\) to align with the physical mask \(M\). If the model tries to put the defect features outside the mask, this loss penalizes it.
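A minimal sketch of the idea: collect the cross-attention maps for the \([V^*]\) token from the UNet, resize the defect mask to each map’s resolution, and penalize the disagreement. The MSE formulation and the averaging over layers are assumptions made for illustration:

```python
import torch.nn.functional as F

def attention_loss(attn_maps, defect_mask):
    """Align the [V*] cross-attention maps with the physical defect mask.

    attn_maps:   list of (B, H, W) attention maps for the [V*] token,
                 taken from several UNet cross-attention layers
    defect_mask: (B, 1, H_img, W_img) binary defect mask
    """
    loss = 0.0
    for attn in attn_maps:
        target = F.interpolate(defect_mask, size=attn.shape[-2:], mode="nearest")
        loss = loss + F.mse_loss(attn, target.squeeze(1))
    return loss / len(attn_maps)
```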
3. The Combined Objective
The final objective function is a weighted sum of these three components:
$$\mathcal{L}_{total} = \lambda_{def}\, \mathcal{L}_{def} + \lambda_{obj}\, \mathcal{L}_{obj} + \lambda_{attn}\, \mathcal{L}_{attn}$$
The weights (\(\lambda\)) are determined experimentally, with the defect loss typically carrying the most weight.
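Tying the pieces together, a fine-tuning step might look like the sketch below. The λ values are placeholders rather than the paper’s tuned weights, the helper functions are the sketches from the previous sections, and the assumed `model` interface (returning noise predictions plus \([V^*]\) attention maps) is hypothetical:

```python
def training_step(model, batch, lambdas=(1.0, 0.5, 0.1)):
    """One DefectFill-style step combining the three losses (weights illustrative)."""
    w_def, w_obj, w_attn = lambdas

    # Hypothetical interface: model(conditioned_input, prompt) -> (eps_pred, attn_maps)
    eps_d, _ = model(batch["defect_input"], batch["defect_prompt"])
    eps_o, attn_o = model(batch["object_input"], batch["object_prompt"])

    l_def = defect_loss(eps_d, batch["eps"], batch["defect_mask"])
    l_obj = object_loss(eps_o, batch["eps"], batch["defect_mask"])
    l_attn = attention_loss(attn_o, batch["defect_mask"])

    return w_def * l_def + w_obj * l_obj + w_attn * l_attn
```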
4. Low-Fidelity Selection (LFS)
Once the model is trained, we can generate infinite defect images. But diffusion models are stochastic—sometimes they generate a messy blur, or worse, they just “heal” the image and generate a normal texture instead of a defect.
How do we automatically throw away the bad samples?
The authors propose Low-Fidelity Selection (LFS). This sounds counter-intuitive: usually, we want high fidelity. But in inpainting, “high fidelity” to the original background means the model didn’t change anything—it just reconstructed the healthy object!
We want the generated area to be different from the original healthy area.

As shown in Figure 3, the method generates several variations. It then calculates the LPIPS score (a perceptual similarity metric) between the result and the original. It picks the image with the highest LPIPS score (lowest fidelity), because that image has the strongest, most visible defect.
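The selection step is easy to prototype with the `lpips` package: generate a handful of candidates, score each against the original healthy crop, and keep the one that is perceptually farthest from it. The `alex` backbone follows the library’s default; the rest is an illustrative wrapper:

```python
import lpips

def low_fidelity_select(candidates, original):
    """Pick the generated image that is least similar to the healthy original.

    candidates: list of (1, 3, H, W) tensors scaled to [-1, 1]
    original:   (1, 3, H, W) tensor scaled to [-1, 1]
    """
    metric = lpips.LPIPS(net="alex")                 # perceptual distance
    scores = [metric(c, original).item() for c in candidates]
    return candidates[scores.index(max(scores))]     # highest LPIPS = strongest defect
```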
Experiments and Results
Does this complex architecture actually work? The authors tested DefectFill on the MVTec AD dataset, the standard benchmark for industrial anomaly detection.
Qualitative Results
Visually, the results are impressive. The model handles various materials—from organic textures like hazelnuts to manufactured grids and leathers.

In the figure above, look at the hazelnut (second column). The generated crack respects the lighting and surface texture of the nut.
Comparing against competitors shows the difference even more clearly. In Figure 5 below, compare the bottom row (DefectFill) with the rows above it.

Competitors like AnomalyDiffusion (AnoDiff) often struggle with color matching or produce blurry artifacts. DFMGAN often fails to blend the defect boundaries. DefectFill creates crisp, integrated defects.
Quantitative Analysis
To measure “realism,” researchers use KID (Kernel Inception Distance)—lower is better. To measure diversity, they use IC-LPIPS—higher is better.

DefectFill achieves significantly lower KID scores (e.g., 1.13 on Hazelnut vs. 21.16 for DFMGAN), proving the generated images are statistically much closer to real defects.
Impact on Downstream Tasks
The ultimate test isn’t just “does it look pretty?” but “does it help train a detector?”
The authors trained a ResNet-34 classifier using their synthetic data and tested it on real defects.
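Reproducing this downstream test is straightforward with torchvision: swap the final layer of an ImageNet-pretrained ResNet-34 and fine-tune it on the synthetic defect images. The number of classes and the training details below are placeholders, not the authors’ exact recipe:

```python
import torch.nn as nn
from torchvision import models

def build_defect_classifier(num_classes: int = 2):
    """ResNet-34 with its classification head replaced for defect classification."""
    net = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

# Train on synthetic (DefectFill-generated) images, evaluate on real defects:
# model = build_defect_classifier()
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# ... standard cross-entropy training loop ...
```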

The classification accuracy (Table 3) is striking. For difficult categories like “Capsule,” DefectFill improves accuracy from ~45% (competitors) to 87.50%.
They also tested localization (finding exactly where the defect is) using a UNet.

Why Do We Need All Three Losses?
An ablation study (removing one piece at a time) reveals why the complex loss function is necessary.

- w/o \(\mathcal{L}_{def}\): The model often just reconstructs the background (fails to make a defect).
- w/o \(\mathcal{L}_{obj}\): The defect looks pasted on; the zipper teeth might look fused or unnatural.
- w/o \(\mathcal{L}_{attn}\): The defect spills out of the mask or doesn’t fill it completely.
- \(\mathcal{L}_{ours}\) (All): Sharp, well-placed, and realistic.
Figure 6 further compares DefectFill against a standard Stable Diffusion model using a different loss (CLiC). Standard models tend to prioritize “healing” the image, whereas DefectFill prioritizes inserting the anomaly.

Limitations and Future Work
While DefectFill is a significant step forward, it isn’t magic. The authors candidly present failure cases, particularly regarding “structural” defects.

Because DefectFill relies on inpainting a masked area, it struggles with global anomalies. For example, if a metal nut is flipped over (a structural change), masking just a part of it doesn’t really work—the model learns the texture of the back of the nut but doesn’t understand the concept of “flipped.” Similarly, for misplaced transistors, the randomness of generation can sometimes result in a ghost-like transparency rather than a solid object.
Conclusion
DefectFill represents a paradigm shift in industrial AI. Instead of relying on manual data collection for rare events, we can now use Inpainting Diffusion Models to hallucinate realistic defects.
By carefully designing loss functions that balance defect detail, object context, and attention alignment, DefectFill turns a shortage of data into an abundance of high-quality training material.
For students and researchers in computer vision, this paper highlights an important lesson: using a powerful pre-trained model (like Stable Diffusion) is a great start, but fine-tuning it with domain-specific constraints (masks and custom losses) is what solves real-world problems.
Key Takeaways
- Inpainting > Generation: For defects, preserving the original object’s structure is key.
- Context Matters: A defect isn’t just a texture; it’s a disruption of an existing surface. The Object Loss captures this.
- Low Fidelity is Useful: When trying to create anomalies, the best result is often the one that looks least like the original healthy image.
- SOTA Performance: Synthetic data can now train classifiers that rival those trained on real data.