Image inpainting—the art of filling in missing or damaged parts of an image—has undergone a revolution with the advent of generative AI. Models like Stable Diffusion and FLUX can miraculously reconstruct missing scenery or remove unwanted objects. However, if you have experimented with these tools, you have likely encountered two frustrating phenomena: the model inserting a random, bizarre object where there should be empty space, or the filled-in area having a slightly different color tone than the rest of the image, looking like a “smear.”
In this post, we dive into a recent paper titled “Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency.” The researchers introduce ASUKA, a framework designed to discipline generative models. By combining the stability of Masked Auto-Encoders (MAE) with a novel “harmonization” decoder, ASUKA fixes the hallucinations and color shifts that plague even the best state-of-the-art models.

As shown in Figure 1 above, while standard Stable Diffusion (SD) might leave artifacts or blurs (middle row) or insert ghost-like structures (top row), ASUKA (right column) manages to seamlessly blend the inpainted region with high fidelity.
The Problem with Current Generative Inpainting
To understand why ASUKA is necessary, we first need to look at how modern inpainting works and where it fails. Most modern high-resolution inpainting models rely on a Latent Diffusion or Rectified Flow architecture.
These models work in a compressed “latent space.” They take an image, compress it into a smaller representation (latent) using a Variational Auto-Encoder (VAE), perform the creative generation in that small space, and then decode it back to a full image. While efficient, this process introduces two major issues:
1. Unwanted Object Insertion (Hallucination)
Generative models are trained to be creative. If you mask out a person standing on a beach, the model often thinks, “I should put another person here,” or “Maybe a dog belongs here.” This is called object hallucination.
The root cause is the training strategy. These models are often trained on random masks. Sometimes a mask covers a real object, and the model is penalized if it doesn’t reconstruct that object. Consequently, the model learns a bias: “If there is a hole, fill it with an object.”
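To make that bias concrete, here is a minimal sketch of the usual training recipe, with `model` and `sample_random_mask` as placeholder names (an illustration, not the actual training code of any specific model). The masked reconstruction loss is what can teach the "fill every hole with an object" reflex:

```python
import torch
import torch.nn.functional as F

def inpainting_training_step(model, image, sample_random_mask):
    """One simplified training step with a randomly sampled mask."""
    mask = sample_random_mask(image.shape)      # 1 = hole, 0 = keep; may land on a real object
    masked_input = image * (1 - mask)           # hide that region from the model
    prediction = model(masked_input, mask)      # model must fill the hole
    # The loss rewards reproducing whatever the mask covered. If a person or dog
    # happened to be under the mask, failing to "re-insert" it is penalized,
    # which is where the bias toward inventing objects comes from.
    loss = F.mse_loss(prediction * mask, image * mask)
    return loss
```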
2. Color Inconsistency
This issue is subtler but equally damaging to realism. When the model decodes the latent representation back into pixels, there is often a mismatch in brightness, saturation, or hue between the generated area and the surrounding original pixels.

As Figure 4 illustrates, these color shifts happen across various scenarios—indoor, outdoor, and with different mask shapes. The result is a filled region that looks like a patch or a sticker rather than a seamless part of the photo. This occurs because of information loss in the VAE compression and the gap between “real” latents and “generated” latents.
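To see where both failure modes live, here is a minimal sketch of the standard latent-space inpainting loop in PyTorch style. `vae`, `unet`, `scheduler`, and `cond` are hypothetical stand-ins for the pre-trained components, not any library's actual API: hallucination happens inside the denoising loop, and the color shift happens at the final decode.

```python
import torch
import torch.nn.functional as F

def latent_inpaint(vae, unet, scheduler, image, mask, cond, steps=50):
    """Sketch of latent-space inpainting: encode, denoise in latent space, decode.

    All components are hypothetical placeholders; mask is 1 inside the hole,
    0 on known pixels.
    """
    with torch.no_grad():
        clean_latent = vae.encode(image)                          # compress to latent space
        latent_mask = F.interpolate(mask, size=clean_latent.shape[-2:], mode="nearest")
        latent = torch.randn_like(clean_latent)                   # start the hole from pure noise
        for t in scheduler.timesteps(steps):
            # Pin the known region to a re-noised copy of the original latent.
            noisy_known = scheduler.add_noise(clean_latent, torch.randn_like(clean_latent), t)
            latent = noisy_known * (1 - latent_mask) + latent * latent_mask
            noise_pred = unet(latent, t, cond)                    # the "creative" generation step
            latent = scheduler.step(noise_pred, t, latent)        # one denoising update
        return vae.decode(latent)                                 # decode to pixels: color-shift risk
```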
The ASUKA Solution
The researchers propose ASUKA (Aligned Stable inpainting with UnKnown Areas prior). Rather than retraining the massive generative backbone from scratch, it attaches lightweight add-on components to existing pre-trained models (like Stable Diffusion or FLUX).
The framework tackles the two problems using two distinct strategies:
- Context-Stable Alignment: Using a Masked Auto-Encoder (MAE) to stop hallucinations.
- Color-Consistent Alignment: Using a specialized VAE decoder to fix color shifts.

Part 1: Curing Hallucinations with MAE
To stop the model from inventing random objects, ASUKA looks for a “second opinion” from a different type of AI: a Masked Auto-Encoder (MAE).
Unlike diffusion models, which are generative and creative, MAEs are trained purely for reconstruction. If you show an MAE a masked image, it tries to predict exactly what was there based on the context. It doesn’t try to be fancy; it tries to be accurate.
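Here is a hedged sketch of that "second opinion", with `mae` standing in for a pre-trained Masked Auto-Encoder; the interface is an assumption for illustration, not the paper's implementation:

```python
import torch

def mae_prior(mae, image, mask):
    """Deterministic reconstruction of the masked region with an MAE (sketch).

    `mae` is a placeholder for a pre-trained MAE that regresses the hidden
    pixels from the visible context in a single forward pass.
    """
    with torch.no_grad():
        masked_image = image * (1 - mask)         # hide the hole, keep the context
        reconstruction = mae(masked_image, mask)  # one deterministic guess, no sampling
    # Structurally faithful but blurry: good for "what belongs here",
    # bad for fine texture and detail.
    return image * (1 - mask) + reconstruction * mask
```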
Why not just use MAE? MAEs are stable, but they produce blurry, low-resolution results. They lack the texture and detail of diffusion models.

Figure 3 demonstrates this clearly. The MAE result (second panel) is brown and blurry. If we simply use the MAE output as a starting point for Stable Diffusion (third panel), the result is still messy.
The Alignment Module
ASUKA’s innovation is an alignment module. It takes the stable, “boring” structural information from the MAE and feeds it into the generative model as a condition. This replaces the text prompt (which is often empty or generic in inpainting tasks).
The alignment module bridges the gap. It tells the generative model: “Use the structure suggested by the MAE, but apply your high-quality texture and details.” This effectively suppresses the urge to insert random objects because the MAE prior says, “There is no object here, just background.”
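Below is a minimal sketch of what such an alignment module could look like; the dimensions, layer counts, and fusion scheme are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AlignmentModule(nn.Module):
    """Projects MAE features into the conditioning slot normally used by text (sketch).

    Dimensions are illustrative: 768-d MAE tokens mapped to the cross-attention
    width expected by the frozen generative backbone.
    """
    def __init__(self, mae_dim=768, cond_dim=1024, num_layers=2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mae_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(cond_dim, nhead=8, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, mae_tokens):          # [B, N, mae_dim] features from the MAE
        cond = self.proj(mae_tokens)        # map into the text-embedding space
        for block in self.blocks:
            cond = block(cond)
        return cond                         # conditioning tokens for the generative model
```

During generation, these tokens take the place of the (empty or generic) text embeddings, so the backbone's cross-attention is steered by the MAE's structural prediction instead of a prompt.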
Part 2: Fixing Color with Local Harmonization
The second half of ASUKA addresses the “smear” effect. The researchers identified that the standard VAE decoder is not accurate enough for inpainting.

Figure 6 shows that even just encoding and decoding an image (without any inpainting) causes color shifts, particularly in low-frequency areas (the general wash of color).
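This drift is easy to observe yourself: run an image through the VAE and back with no inpainting at all, then compare channel statistics. A minimal sketch, assuming a `vae` object that exposes `encode`/`decode` methods:

```python
import torch

def vae_color_drift(vae, image):
    """Per-channel color shift after a plain encode/decode round trip (sketch).

    No inpainting is involved: any shift here is pure reconstruction error,
    concentrated in the low-frequency color balance.
    """
    with torch.no_grad():
        reconstruction = vae.decode(vae.encode(image))
    # A small constant offset per channel shows up as a visible tint on flat areas.
    drift = (reconstruction - image).mean(dim=(0, 2, 3))   # one value per RGB channel
    return drift
```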
The Inpainting-Specialized Decoder
To fix this, ASUKA retrains the decoder portion of the model. They treat the decoding process as a Local Harmonization task.
In standard decoding, the model only sees the latent code. In ASUKA’s decoding, the model is also given the unmasked pixels of the original image. It is explicitly taught to match the colors of the generated area to the ground truth of the surrounding area.
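In code, the change amounts to giving the decoder two extra inputs. Here is a hedged sketch, with `HarmonizationDecoder` and its fusion scheme as assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HarmonizationDecoder(nn.Module):
    """VAE decoder that also sees the unmasked pixels and the mask (sketch).

    `base_decoder` stands in for the original VAE decoder backbone; the extra
    inputs give it the color statistics of the untouched surroundings.
    """
    def __init__(self, base_decoder, latent_channels=4):
        super().__init__()
        self.base_decoder = base_decoder
        # One simple fusion option: encode (visible pixels + mask) into latent space.
        self.context_encoder = nn.Conv2d(3 + 1, latent_channels, kernel_size=3, padding=1)

    def forward(self, latent, image, mask):
        visible = image * (1 - mask)                              # ground-truth context only
        context = self.context_encoder(torch.cat([visible, mask], dim=1))
        context = F.interpolate(context, size=latent.shape[-2:], mode="bilinear")
        decoded = self.base_decoder(latent + context)             # decode with color guidance
        # Keep original pixels outside the mask; only the hole comes from the decoder.
        return image * (1 - mask) + decoded * mask
```

The final composite keeps the original pixels outside the mask, so the decoder only ever has to harmonize the filled region against untouched ground truth.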
Training with Augmentation
To make this decoder robust, the researchers use a clever training strategy involving Color and Latent Augmentation.

- Color Augmentation: They mess with the colors of the input during training. This forces the decoder to rely on the unmasked regions to figure out the correct color balance.
- Latent Augmentation: They simulate the imperfections of generated latents using a one-step estimation (shown in the equation below). This prepares the decoder to handle the slightly “off” data coming from the generative model.
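A standard one-step estimate in DDPM-style notation looks like this (the paper may use a slightly different parameterization): given a noisy latent $z_t$ and the model's noise prediction $\epsilon_\theta$,

$$
\hat{z}_0 \;=\; \frac{z_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\bar{\alpha}_t}}
$$

Decoding these cheap single-step estimates $\hat{z}_0$ during training exposes the decoder to latents with the same kind of imperfections as fully sampled ones, at a fraction of the sampling cost.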

The result is a decoder that acts like a color-correction artist, ensuring the seam between the old and new image is invisible.

Figure 8 compares the vanilla decoder (b) with the ASUKA decoder (d). Notice how the ASUKA decoder restores the correct lighting and tone, removing the dark cast seen in the vanilla output.
Experiments and Results
The team validated ASUKA on the standard Places2 dataset and a new, diverse dataset they constructed called MISATO (comprising indoor, landscape, building, and background images).

Visual Quality
The visual comparisons are striking. When compared to GAN-based methods (Co-Mod, LaMa) and standard Stable Diffusion variants, ASUKA produces the cleanest results.

In Figure 9, look at the third row (the white freezer). Standard SD tries to put something there—a handle, a shadow, a new object. ASUKA simply completes the white surface perfectly.
Quantitative Metrics
The numbers back up the visuals. The researchers used standard metrics like FID (Fréchet Inception Distance) and two new metrics they designed specifically for this paper:
- CLIP@mask (C@m): Measures how semantically consistent the content inside the mask is with the rest of the image, serving as a proxy for hallucination (higher is better).
- Gradient@edge (G@e): Measures color smoothness at the mask boundary (lower is better; see the sketch below).
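The paper's exact formulation isn't reproduced here, but a metric in this spirit can be sketched as the average image-gradient magnitude inside a thin band around the mask boundary, where a color seam would show up as a sharp edge:

```python
import torch
import torch.nn.functional as F

def gradient_at_edge(image, mask, band=3):
    """Average gradient magnitude in a thin band around the mask boundary (sketch).

    A hard color seam produces large gradients exactly at the boundary,
    so lower values mean a smoother, more consistent transition.
    """
    # Finite-difference gradients of the grayscale image.
    gray = image.mean(dim=1, keepdim=True)
    gx = gray[..., :, 1:] - gray[..., :, :-1]
    gy = gray[..., 1:, :] - gray[..., :-1, :]
    grad_mag = F.pad(gx.abs(), (0, 1)) + F.pad(gy.abs(), (0, 0, 0, 1))
    # Boundary band = dilated mask minus eroded mask.
    dilated = F.max_pool2d(mask, kernel_size=2 * band + 1, stride=1, padding=band)
    eroded = 1 - F.max_pool2d(1 - mask, kernel_size=2 * band + 1, stride=1, padding=band)
    edge_band = dilated - eroded
    return (grad_mag * edge_band).sum() / edge_band.sum().clamp(min=1)
```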

As Table 1 shows, ASUKA-SD achieves state-of-the-art results, scoring lowest on FID and highest on human-preference proxies (U-IDS).
Extending to FLUX
One of the great features of ASUKA is that it is model-agnostic. The researchers applied the same framework to the newer FLUX model.

Table 2 confirms that ASUKA improves FLUX just as it improved Stable Diffusion, proving that the principles of MAE-guidance and harmonized decoding are universally applicable.
Conclusion
ASUKA represents a smart shift in how we approach generative AI limitations. Instead of blindly scaling up models or retraining massive networks, the authors identified specific structural weaknesses—hallucination and latent color shift—and designed targeted modules to fix them.
By utilizing the “boring but stable” Masked Auto-Encoder as a guide and retraining the decoder to respect local color context, ASUKA allows us to use powerful generative models for what they are best at: creating high-fidelity textures, without the unwanted surprises.
For students and researchers in computer vision, this paper highlights the importance of priors. Generative models are powerful, but they need guidance. Sometimes, the best way to move forward is to look back at simpler, reconstruction-based architectures like MAE and use them to anchor our modern creative engines.