Imagine you are running a high-tech manufacturing line for semiconductor chips or precision automotive parts. You want to automate quality control using AI. To train a model to spot defects (like a scratch on a lens or a crack in a hazelnut), you generally need thousands of examples of those defects.
But here is the catch: modern manufacturing is extremely good. Defects are rare—often occurring in fewer than 1 in 10,000 items. You simply cannot collect enough data to train a supervised deep learning model effectively.
This is the “Data Scarcity” problem in industrial anomaly detection.
In this post, we will explore DefectFill, a recent research paper that proposes a solution: if you can’t find the data, generate it. But unlike previous attempts that produced blurry or unconvincing fakes, DefectFill uses Inpainting Diffusion Models to generate defects so realistic they blend seamlessly with the object’s texture, lighting, and geometry.
We will break down how this method works, the mathematics behind its custom loss functions, and why “low-fidelity” selection might actually be a good thing.
The Core Problem: Why “Copy-Paste” Isn’t Enough
Before deep generative models, researchers used simple tricks to simulate defects. A common method, CutPaste, cuts a patch out of a normal image and pastes it back at a random location. While this creates an “anomaly,” it looks artificial: the edges are sharp, the lighting doesn’t match, and the texture is discontinuous.
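To make the limitation concrete, here is a minimal sketch of a CutPaste-style augmentation in Python. The patch size and random placement are arbitrary illustrative choices, not the original paper’s settings:

```python
import numpy as np

def cutpaste(image: np.ndarray, patch_size: int = 32, rng=np.random) -> np.ndarray:
    """Cut a patch from a normal image and paste it at a random location."""
    h, w = image.shape[:2]
    # Source location of the patch
    ys, xs = rng.randint(0, h - patch_size), rng.randint(0, w - patch_size)
    patch = image[ys:ys + patch_size, xs:xs + patch_size].copy()
    # Destination: pasted with no blending, so the boundary stays hard
    yd, xd = rng.randint(0, h - patch_size), rng.randint(0, w - patch_size)
    augmented = image.copy()
    augmented[yd:yd + patch_size, xd:xd + patch_size] = patch
    return augmented
```

The hard rectangular boundary and mismatched illumination are exactly what make these synthetic “anomalies” easy for a network to latch onto, and poor proxies for real defects.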
Later, Generative Adversarial Networks (GANs) were used, but they often struggle with stability and diversity. Recently, Diffusion Models have taken over the world of image generation. However, standard text-to-image diffusion models (like Stable Diffusion) struggle to place a defect exactly where you want it without altering the rest of the perfectly good object.
This is where DefectFill comes in. As shown in the figure below, the goal is to take a normal object and a mask (a shape), and “fill” that mask with a specific type of defect.

The system needs to do two things simultaneously:
- Capture the defect details: A crack looks different than a stain.
- Respect the object context: A crack on a hazelnut looks different than a crack on a ceramic tile.
Background: Inpainting Diffusion Models
To understand DefectFill, we need a quick refresher on Latent Diffusion Models (LDMs). These models work by gradually adding noise to an image until it is pure static, and then learning to reverse that process to reconstruct the image.
The forward process (adding noise) is described by:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Here, \(x_0\) is the original image, and \(\epsilon\) is the noise. The model is trained to predict the noise \(\epsilon\) given a noisy input \(x_t\) and a text prompt \(\mathcal{P}\). The standard loss function looks like this:
$$\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon,\, t}\left[\, \left\| \epsilon - \epsilon_\theta(x_t, t, \mathcal{P}) \right\|_2^2 \,\right]$$
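In code, a single training step of this objective just noises a clean latent and regresses the network’s prediction onto the injected noise. A minimal PyTorch sketch, where `model` and the `alpha_bar` schedule are placeholders rather than the actual Stable Diffusion internals:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, prompt_emb, alpha_bar):
    """Standard noise-prediction loss: sample t, noise x0, predict the noise."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (b,), device=x0.device)  # random timesteps
    a = alpha_bar[t].view(b, 1, 1, 1)                             # cumulative alpha_bar_t
    eps = torch.randn_like(x0)                                     # ground-truth noise
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps                     # forward process
    eps_pred = model(x_t, t, prompt_emb)                           # epsilon_theta(x_t, t, P)
    return F.mse_loss(eps_pred, eps)
```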
Inpainting modifies this process. The model isn’t just generating from scratch; it is filling in a hole. To do this, the input to the model isn’t just the noisy latent image (\(x_t\)). We concatenate it with the masked background (\(b\)) and the mask (\(M\)) itself.

This tells the model: “Keep everything in the background (\(b\)) exactly as it is, and only invent new pixels inside the area defined by the mask (\(M\)).”
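Mechanically, this conditioning is just extra input channels for the UNet. A hedged sketch of how the conditioned input could be assembled (the channel layout is illustrative; the real inpainting model works on VAE latents and a downsampled mask):

```python
import torch

def build_inpainting_input(x_t, background, mask):
    """Stack noisy latent, masked background, and mask along the channel axis.

    x_t:        (B, C, H, W) noisy latent at timestep t
    background: (B, C, H, W) latent of the image with the hole removed
    mask:       (B, 1, H, W) binary mask, 1 inside the region to fill
    """
    return torch.cat([x_t, background, mask], dim=1)
```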
The DefectFill Method
The authors of DefectFill fine-tune a pre-trained Stable Diffusion inpainting model. However, simply telling the model “draw a crack” isn’t precise enough for industrial inspection. They need the model to learn a specific concept of a defect from a few reference images (e.g., 5-10 photos of broken hazelnuts).
To achieve this, they introduce a specialized training pipeline with three unique loss functions.
1. The Dual-Path Architecture
The training process is clever. It splits the workflow into two paths to ensure the model learns both the defect itself and how it sits on the object.
Take a look at the architecture overview below:

- Upper Pipeline (Defect Focus): This path uses the specific defect mask \(M\) and a prompt like “A photo of \([V^*]\)”, where \([V^*]\) is a special token representing the defect. This teaches the model what the defect looks like.
- Lower Pipeline (Object Focus): This path uses random masks (\(M_{rand}\)) and a prompt like “A hazelnut with \([V^*]\)”. This teaches the model the relationship between the object (hazelnut) and the defect. A minimal sketch of how the two inputs might be assembled follows this list.
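Here is a rough sketch of how the two conditioning paths could be wired up in one training step. The prompt strings, the `encode_prompt` helper, and the mask handling are hypothetical stand-ins for the actual pipeline; `build_inpainting_input` is the helper from the earlier sketch:

```python
def dual_path_inputs(x_t, x0, defect_mask, rand_mask, encode_prompt):
    """Assemble the defect-focused and object-focused inputs (illustrative only)."""
    # Upper path: learn the defect concept [V*] inside its exact mask
    defect_bg = x0 * (1 - defect_mask)          # background with the defect region blanked
    defect_input = build_inpainting_input(x_t, defect_bg, defect_mask)
    defect_prompt = encode_prompt("A photo of [V*]")

    # Lower path: learn how [V*] relates to the object, using a random mask
    rand_bg = x0 * (1 - rand_mask)
    object_input = build_inpainting_input(x_t, rand_bg, rand_mask)
    object_prompt = encode_prompt("A hazelnut with [V*]")

    return (defect_input, defect_prompt), (object_input, object_prompt)
```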
2. The Three Loss Functions
The standard diffusion loss (Equation 2) isn’t specific enough. DefectFill replaces it with a weighted combination of three terms.
A. Defect Loss (\(\mathcal{L}_{def}\))
This is the most critical component. It forces the model to learn the intrinsic features of the defect (e.g., the jagged edges of a crack).
The input \(x_t^{def}\) combines the noisy image, the defect-free background, and the specific defect mask.

The loss is calculated only within the masked region.
$$\mathcal{L}_{def} = \mathbb{E}\left[\, \left\| M \odot \left( \epsilon - \epsilon_\theta(x_t^{def}, t, \mathcal{P}) \right) \right\|_2^2 \,\right]$$
Notice the \(M \odot\) term. This means we are masking the loss calculation. We don’t care if the model reconstructs the background perfectly in this step; we only care that it reconstructs the defect inside the mask accurately.
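In code, masking the loss is just an element-wise multiplication before averaging. A short sketch, where normalizing by the mask area is my choice of convention rather than the paper’s exact formulation:

```python
def defect_loss(eps_pred, eps, mask):
    """Noise-prediction error counted only inside the defect mask M."""
    diff = mask * (eps_pred - eps)                        # zero out everything outside M
    return diff.pow(2).sum() / mask.sum().clamp(min=1.0)  # mean over masked pixels only
```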
B. Object Loss (\(\mathcal{L}_{obj}\))
If we only used the defect loss, the model might generate a perfect crack, but it might look like a sticker floating on top of the image. The Object Loss ensures semantic blending.
It uses random boxes (\(M_{rand}\)) scattered over the image. The model must reconstruct the whole image, including the defect and the background.

However, the authors use a trick here. They care more about the defect area than the random background. They create a weighted mask \(M'\):

Here, \(\alpha\) is a weight less than 1. This means errors in the defect area are penalized more heavily than errors in the background, but the background still matters.
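One way to realize this is a per-pixel weight map that equals 1 inside the defect mask and \(\alpha\) elsewhere, then a weighted reconstruction loss over the whole image. This is an illustrative reading of \(M'\), not the paper’s exact formula, and the default \(\alpha\) below is arbitrary:

```python
def object_loss(eps_pred, eps, defect_mask, alpha=0.5):
    """Whole-image reconstruction loss with the defect region up-weighted."""
    weight = defect_mask + alpha * (1 - defect_mask)  # M': 1 inside M, alpha outside
    diff = weight * (eps_pred - eps)
    return diff.pow(2).mean()
```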
C. Attention Loss (\(\mathcal{L}_{attn}\))
Finally, we need to ensure that when the model sees the token \([V^*]\), it actually looks at the pixels where the defect is. This uses the Cross-Attention maps from the UNet.

This loss forces the attention map for the token \([V^*]\) to align with the physical mask \(M\). If the model tries to put the defect features outside the mask, this loss penalizes it.
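A minimal sketch of the idea: collect the cross-attention maps for the \([V^*]\) token from the UNet, resize the defect mask to each map’s resolution, and penalize the disagreement. The MSE formulation and the averaging over layers are assumptions made for illustration:

```python
import torch.nn.functional as F

def attention_loss(attn_maps, defect_mask):
    """Align the [V*] cross-attention maps with the physical defect mask.

    attn_maps:   list of (B, H, W) attention maps for the [V*] token,
                 taken from several UNet cross-attention layers
    defect_mask: (B, 1, H_img, W_img) binary defect mask
    """
    loss = 0.0
    for attn in attn_maps:
        target = F.interpolate(defect_mask, size=attn.shape[-2:], mode="nearest")
        loss = loss + F.mse_loss(attn, target.squeeze(1))
    return loss / len(attn_maps)
```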
3. The Combined Objective
The final objective function is a weighted sum of these three components:
$$\mathcal{L}_{total} = \lambda_{def}\, \mathcal{L}_{def} + \lambda_{obj}\, \mathcal{L}_{obj} + \lambda_{attn}\, \mathcal{L}_{attn}$$
The weights (\(\lambda\)) are determined experimentally, with the defect loss typically carrying the most weight.
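Tying the pieces together, a fine-tuning step might look like the sketch below. The λ values are placeholders rather than the paper’s tuned weights, the helper functions are the sketches from the previous sections, and the assumed `model` interface (returning noise predictions plus \([V^*]\) attention maps) is hypothetical:

```python
def training_step(model, batch, lambdas=(1.0, 0.5, 0.1)):
    """One DefectFill-style step combining the three losses (weights illustrative)."""
    w_def, w_obj, w_attn = lambdas

    # Hypothetical interface: model(conditioned_input, prompt) -> (eps_pred, attn_maps)
    eps_d, _ = model(batch["defect_input"], batch["defect_prompt"])
    eps_o, attn_o = model(batch["object_input"], batch["object_prompt"])

    l_def = defect_loss(eps_d, batch["eps"], batch["defect_mask"])
    l_obj = object_loss(eps_o, batch["eps"], batch["defect_mask"])
    l_attn = attention_loss(attn_o, batch["defect_mask"])

    return w_def * l_def + w_obj * l_obj + w_attn * l_attn
```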
4. Low-Fidelity Selection (LFS)
Once the model is trained, we can generate infinite defect images. But diffusion models are stochastic—sometimes they generate a messy blur, or worse, they just “heal” the image and generate a normal texture instead of a defect.
How do we automatically throw away the bad samples?
The authors propose Low-Fidelity Selection (LFS). This sounds counter-intuitive: usually, we want high fidelity. But in inpainting, “high fidelity” to the original background means the model didn’t change anything—it just reconstructed the healthy object!
We want the generated area to be different from the original healthy area.

As shown in Figure 3, the method generates several variations. It then calculates the LPIPS score (a perceptual similarity metric) between the result and the original. It picks the image with the highest LPIPS score (lowest fidelity), because that image has the strongest, most visible defect.
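The selection step is easy to prototype with the `lpips` package: generate a handful of candidates, score each against the original healthy crop, and keep the one that is perceptually farthest from it. The `alex` backbone follows the library’s default; the rest is an illustrative wrapper:

```python
import lpips

def low_fidelity_select(candidates, original):
    """Pick the generated image that is least similar to the healthy original.

    candidates: list of (1, 3, H, W) tensors scaled to [-1, 1]
    original:   (1, 3, H, W) tensor scaled to [-1, 1]
    """
    metric = lpips.LPIPS(net="alex")                 # perceptual distance
    scores = [metric(c, original).item() for c in candidates]
    return candidates[scores.index(max(scores))]     # highest LPIPS = strongest defect
```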
Experiments and Results
Does this complex architecture actually work? The authors tested DefectFill on the MVTec AD dataset, the standard benchmark for industrial anomaly detection.
Qualitative Results
Visually, the results are impressive. The model handles various materials—from organic textures like hazelnuts to manufactured grids and leathers.

In the figure above, look at the hazelnut (second column). The generated crack respects the lighting and surface texture of the nut.
Comparing against competitors shows the difference even more clearly. In Figure 5 below, compare the bottom row (DefectFill) with the rows above it.

Competitors like AnomalyDiffusion (AnoDiff) often struggle with color matching or produce blurry artifacts. DFMGAN often fails to blend the defect boundaries. DefectFill creates crisp, integrated defects.
Quantitative Analysis
To measure “realism,” researchers use KID (Kernel Inception Distance)—lower is better. To measure diversity, they use IC-LPIPS—higher is better.

DefectFill achieves significantly lower KID scores (e.g., 1.13 on Hazelnut vs. 21.16 for DFMGAN), proving the generated images are statistically much closer to real defects.
Impact on Downstream Tasks
The ultimate test isn’t just “does it look pretty?” but “does it help train a detector?”
The authors trained a ResNet-34 classifier using their synthetic data and tested it on real defects.
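Reproducing this downstream test is straightforward with torchvision: swap the final layer of an ImageNet-pretrained ResNet-34 and fine-tune it on the synthetic defect images. The number of classes and the training details below are placeholders, not the authors’ exact recipe:

```python
import torch.nn as nn
from torchvision import models

def build_defect_classifier(num_classes: int = 2):
    """ResNet-34 with its classification head replaced for defect classification."""
    net = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

# Train on synthetic (DefectFill-generated) images, evaluate on real defects:
# model = build_defect_classifier()
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# ... standard cross-entropy training loop ...
```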

The classification accuracy (Table 3) is striking. For difficult categories like “Capsule,” DefectFill improves accuracy from ~45% (competitors) to 87.50%.
They also tested localization (finding exactly where the defect is) using a UNet.

Why Do We Need All Three Losses?
An ablation study (removing one piece at a time) reveals why the complex loss function is necessary.

- w/o \(\mathcal{L}_{def}\): The model often just reconstructs the background (fails to make a defect).
- w/o \(\mathcal{L}_{obj}\): The defect looks pasted on; the zipper teeth might look fused or unnatural.
- w/o \(\mathcal{L}_{attn}\): The defect spills out of the mask or doesn’t fill it completely.
- \(\mathcal{L}_{ours}\) (All): Sharp, well-placed, and realistic.
Figure 6 further compares DefectFill against a standard Stable Diffusion model using a different loss (CLiC). Standard models tend to prioritize “healing” the image, whereas DefectFill prioritizes inserting the anomaly.

Limitations and Future Work
While DefectFill is a significant step forward, it isn’t magic. The authors candidly present failure cases, particularly regarding “structural” defects.

Because DefectFill relies on inpainting a masked area, it struggles with global anomalies. For example, if a metal nut is flipped over (a structural change), masking just a part of it doesn’t really work—the model learns the texture of the back of the nut but doesn’t understand the concept of “flipped.” Similarly, for misplaced transistors, the randomness of generation can sometimes result in a ghost-like transparency rather than a solid object.
Conclusion
DefectFill represents a paradigm shift in industrial AI. Instead of relying on manual data collection for rare events, we can now use Inpainting Diffusion Models to hallucinate realistic defects.
By carefully designing loss functions that balance defect detail, object context, and attention alignment, DefectFill turns a shortage of data into an abundance of high-quality training material.
For students and researchers in computer vision, this paper highlights an important lesson: using a powerful pre-trained model (like Stable Diffusion) is a great start, but fine-tuning it with domain-specific constraints (masks and custom losses) is what solves real-world problems.
Key Takeaways
- Inpainting > Generation: For defects, preserving the original object’s structure is key.
- Context Matters: A defect isn’t just a texture; it’s a disruption of an existing surface. The Object Loss captures this.
- Low Fidelity is Useful: When trying to create anomalies, the best result is often the one that looks least like the original healthy image.
- SOTA Performance: Synthetic data can now train classifiers that rival those trained on real data.