Image inpainting—the art of filling in missing or damaged parts of an image—has undergone a revolution with the advent of generative AI. Models like Stable Diffusion and FLUX can miraculously reconstruct missing scenery or remove unwanted objects. However, if you have experimented with these tools, you have likely encountered two frustrating phenomena: the model inserting a random, bizarre object where there should be empty space, or the filled-in area having a slightly different color tone than the rest of the image, looking like a “smear.”
In this post, we dive into a recent paper titled “Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency.” The researchers introduce ASUKA, a framework designed to discipline generative models. By combining the stability of Masked Auto-Encoders (MAE) with a novel “harmonization” decoder, ASUKA fixes the hallucinations and color shifts that plague even the best state-of-the-art models.

As shown in Figure 1 above, while standard Stable Diffusion (SD) might leave artifacts or blurs (middle row) or insert ghost-like structures (top row), ASUKA (right column) manages to seamlessly blend the inpainted region with high fidelity.
The Problem with Current Generative Inpainting
To understand why ASUKA is necessary, we first need to look at how modern inpainting works and where it fails. Most modern high-resolution inpainting models rely on a Latent Diffusion or Rectified Flow architecture.
These models work in a compressed “latent space.” They take an image, compress it into a smaller representation (latent) using a Variational Auto-Encoder (VAE), perform the creative generation in that small space, and then decode it back to a full image. While efficient, this process introduces two major issues:
1. Unwanted Object Insertion (Hallucination)
Generative models are trained to be creative. If you mask out a person standing on a beach, the model often thinks, “I should put another person here,” or “Maybe a dog belongs here.” This is called object hallucination.
The root cause is the training strategy. These models are often trained on random masks. Sometimes a mask covers a real object, and the model is penalized if it doesn’t reconstruct that object. Consequently, the model learns a bias: “If there is a hole, fill it with an object.”
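To make that bias concrete, here is a minimal sketch of the usual training recipe, with `model` and `sample_random_mask` as placeholder names (an illustration, not the actual training code of any specific model). The masked reconstruction loss is what can teach the "fill every hole with an object" reflex:

```python
import torch
import torch.nn.functional as F

def inpainting_training_step(model, image, sample_random_mask):
    """One simplified training step with a randomly sampled mask."""
    mask = sample_random_mask(image.shape)      # 1 = hole, 0 = keep; may land on a real object
    masked_input = image * (1 - mask)           # hide that region from the model
    prediction = model(masked_input, mask)      # model must fill the hole
    # The loss rewards reproducing whatever the mask covered. If a person or dog
    # happened to be under the mask, failing to "re-insert" it is penalized,
    # which is where the bias toward inventing objects comes from.
    loss = F.mse_loss(prediction * mask, image * mask)
    return loss
```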
2. Color Inconsistency
This issue is subtler but equally damaging to realism. When the model decodes the latent representation back into pixels, there is often a mismatch in brightness, saturation, or hue between the generated area and the surrounding original pixels.

As Figure 4 illustrates, these color shifts happen across various scenarios—indoor, outdoor, and with different mask shapes. The result is a filled region that looks like a patch or a sticker rather than a seamless part of the photo. This occurs because of information loss in the VAE compression and the gap between “real” latents and “generated” latents.
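To see where both failure modes live, here is a minimal sketch of the standard latent-space inpainting loop in PyTorch style. `vae`, `unet`, `scheduler`, and `cond` are hypothetical stand-ins for the pre-trained components, not any library's actual API: hallucination happens inside the denoising loop, and the color shift happens at the final decode.

```python
import torch
import torch.nn.functional as F

def latent_inpaint(vae, unet, scheduler, image, mask, cond, steps=50):
    """Sketch of latent-space inpainting: encode, denoise in latent space, decode.

    All components are hypothetical placeholders; mask is 1 inside the hole,
    0 on known pixels.
    """
    with torch.no_grad():
        clean_latent = vae.encode(image)                          # compress to latent space
        latent_mask = F.interpolate(mask, size=clean_latent.shape[-2:], mode="nearest")
        latent = torch.randn_like(clean_latent)                   # start the hole from pure noise
        for t in scheduler.timesteps(steps):
            # Pin the known region to a re-noised copy of the original latent.
            noisy_known = scheduler.add_noise(clean_latent, torch.randn_like(clean_latent), t)
            latent = noisy_known * (1 - latent_mask) + latent * latent_mask
            noise_pred = unet(latent, t, cond)                    # the "creative" generation step
            latent = scheduler.step(noise_pred, t, latent)        # one denoising update
        return vae.decode(latent)                                 # decode to pixels: color-shift risk
```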
The ASUKA Solution
The researchers propose ASUKA (Aligned Stable inpainting with UnKnown Areas prior). Rather than retraining the massive generative backbone from scratch, it attaches lightweight add-on components to existing pre-trained models (like Stable Diffusion or FLUX).
The framework tackles the two problems using two distinct strategies:
- Context-Stable Alignment: Using a Masked Auto-Encoder (MAE) to stop hallucinations.
- Color-Consistent Alignment: Using a specialized VAE decoder to fix color shifts.

Part 1: Curing Hallucinations with MAE
To stop the model from inventing random objects, ASUKA looks for a “second opinion” from a different type of AI: a Masked Auto-Encoder (MAE).
Unlike diffusion models, which are generative and creative, MAEs are trained purely for reconstruction. If you show an MAE a masked image, it tries to predict exactly what was there based on the context. It doesn’t try to be fancy; it tries to be accurate.
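Here is a hedged sketch of that "second opinion", with `mae` standing in for a pre-trained Masked Auto-Encoder; the interface is an assumption for illustration, not the paper's implementation:

```python
import torch

def mae_prior(mae, image, mask):
    """Deterministic reconstruction of the masked region with an MAE (sketch).

    `mae` is a placeholder for a pre-trained MAE that regresses the hidden
    pixels from the visible context in a single forward pass.
    """
    with torch.no_grad():
        masked_image = image * (1 - mask)         # hide the hole, keep the context
        reconstruction = mae(masked_image, mask)  # one deterministic guess, no sampling
    # Structurally faithful but blurry: good for "what belongs here",
    # bad for fine texture and detail.
    return image * (1 - mask) + reconstruction * mask
```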
Why not just use MAE? MAEs are stable, but they produce blurry, low-resolution results. They lack the texture and detail of diffusion models.

Figure 3 demonstrates this clearly. The MAE result (second panel) is brown and blurry. If we simply use the MAE output as a starting point for Stable Diffusion (third panel), the result is still messy.
The Alignment Module
ASUKA’s innovation is an alignment module. It takes the stable, “boring” structural information from the MAE and feeds it into the generative model as a condition. This replaces the text prompt (which is often empty or generic in inpainting tasks).
The alignment module bridges the gap. It tells the generative model: “Use the structure suggested by the MAE, but apply your high-quality texture and details.” This effectively suppresses the urge to insert random objects because the MAE prior says, “There is no object here, just background.”
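Below is a minimal sketch of what such an alignment module could look like; the dimensions, layer counts, and fusion scheme are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AlignmentModule(nn.Module):
    """Projects MAE features into the conditioning slot normally used by text (sketch).

    Dimensions are illustrative: 768-d MAE tokens mapped to the cross-attention
    width expected by the frozen generative backbone.
    """
    def __init__(self, mae_dim=768, cond_dim=1024, num_layers=2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mae_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(cond_dim, nhead=8, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, mae_tokens):          # [B, N, mae_dim] features from the MAE
        cond = self.proj(mae_tokens)        # map into the text-embedding space
        for block in self.blocks:
            cond = block(cond)
        return cond                         # conditioning tokens for the generative model
```

During generation, these tokens take the place of the (empty or generic) text embeddings, so the backbone's cross-attention is steered by the MAE's structural prediction instead of a prompt.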
Part 2: Fixing Color with Local Harmonization
The second half of ASUKA addresses the “smear” effect. The researchers identified that the standard VAE decoder is not accurate enough for inpainting.

Figure 6 shows that even just encoding and decoding an image (without any inpainting) causes color shifts, particularly in low-frequency areas (the general wash of color).
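This drift is easy to observe yourself: run an image through the VAE and back with no inpainting at all, then compare channel statistics. A minimal sketch, assuming a `vae` object that exposes `encode`/`decode` methods:

```python
import torch

def vae_color_drift(vae, image):
    """Per-channel color shift after a plain encode/decode round trip (sketch).

    No inpainting is involved: any shift here is pure reconstruction error,
    concentrated in the low-frequency color balance.
    """
    with torch.no_grad():
        reconstruction = vae.decode(vae.encode(image))
    # A small constant offset per channel shows up as a visible tint on flat areas.
    drift = (reconstruction - image).mean(dim=(0, 2, 3))   # one value per RGB channel
    return drift
```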
The Inpainting-Specialized Decoder
To fix this, ASUKA retrains the decoder portion of the model. They treat the decoding process as a Local Harmonization task.
In standard decoding, the model only sees the latent code. In ASUKA’s decoding, the model is also given the unmasked pixels of the original image. It is explicitly taught to match the colors of the generated area to the ground truth of the surrounding area.
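In code, the change amounts to giving the decoder two extra inputs. Here is a hedged sketch, with `HarmonizationDecoder` and its fusion scheme as assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HarmonizationDecoder(nn.Module):
    """VAE decoder that also sees the unmasked pixels and the mask (sketch).

    `base_decoder` stands in for the original VAE decoder backbone; the extra
    inputs give it the color statistics of the untouched surroundings.
    """
    def __init__(self, base_decoder, latent_channels=4):
        super().__init__()
        self.base_decoder = base_decoder
        # One simple fusion option: encode (visible pixels + mask) into latent space.
        self.context_encoder = nn.Conv2d(3 + 1, latent_channels, kernel_size=3, padding=1)

    def forward(self, latent, image, mask):
        visible = image * (1 - mask)                              # ground-truth context only
        context = self.context_encoder(torch.cat([visible, mask], dim=1))
        context = F.interpolate(context, size=latent.shape[-2:], mode="bilinear")
        decoded = self.base_decoder(latent + context)             # decode with color guidance
        # Keep original pixels outside the mask; only the hole comes from the decoder.
        return image * (1 - mask) + decoded * mask
```

The final composite keeps the original pixels outside the mask, so the decoder only ever has to harmonize the filled region against untouched ground truth.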
Training with Augmentation
To make this decoder robust, the researchers use a clever training strategy involving Color and Latent Augmentation.

- Color Augmentation: They mess with the colors of the input during training. This forces the decoder to rely on the unmasked regions to figure out the correct color balance.
- Latent Augmentation: They simulate the imperfections of generated latents using a one-step estimation (shown in the equation below). This prepares the decoder to handle the slightly “off” data coming from the generative model.
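A standard one-step estimate in DDPM-style notation looks like this (the paper may use a slightly different parameterization): given a noisy latent $z_t$ and the model's noise prediction $\epsilon_\theta$,

$$
\hat{z}_0 \;=\; \frac{z_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\bar{\alpha}_t}}
$$

Decoding these cheap single-step estimates $\hat{z}_0$ during training exposes the decoder to latents with the same kind of imperfections as fully sampled ones, at a fraction of the sampling cost.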

The result is a decoder that acts like a color-correction artist, ensuring the seam between the old and new image is invisible.

Figure 8 compares the vanilla decoder (b) with the ASUKA decoder (d). Notice how the ASUKA decoder restores the correct lighting and tone, removing the dark cast seen in the vanilla output.
Experiments and Results
The team validated ASUKA on the standard Places2 dataset and a new, diverse dataset they constructed called MISATO (comprising indoor, landscape, building, and background images).

Visual Quality
The visual comparisons are striking. When compared to GAN-based methods (Co-Mod, LaMa) and standard Stable Diffusion variants, ASUKA produces the cleanest results.

In Figure 9, look at the third row (the white freezer). Standard SD tries to put something there—a handle, a shadow, a new object. ASUKA simply completes the white surface perfectly.
Quantitative Metrics
The numbers back up the visuals. The researchers used standard metrics like FID (Fréchet Inception Distance) and two new metrics they designed specifically for this paper:
- CLIP@mask (C@m): Measures how semantically consistent the content inside the mask is with the rest of the image, serving as a proxy for hallucination (higher is better).
- Gradient@edge (G@e): Measures color smoothness at the mask boundary (lower is better; see the sketch below).
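The paper's exact formulation isn't reproduced here, but a metric in this spirit can be sketched as the average image-gradient magnitude inside a thin band around the mask boundary, where a color seam would show up as a sharp edge:

```python
import torch
import torch.nn.functional as F

def gradient_at_edge(image, mask, band=3):
    """Average gradient magnitude in a thin band around the mask boundary (sketch).

    A hard color seam produces large gradients exactly at the boundary,
    so lower values mean a smoother, more consistent transition.
    """
    # Finite-difference gradients of the grayscale image.
    gray = image.mean(dim=1, keepdim=True)
    gx = gray[..., :, 1:] - gray[..., :, :-1]
    gy = gray[..., 1:, :] - gray[..., :-1, :]
    grad_mag = F.pad(gx.abs(), (0, 1)) + F.pad(gy.abs(), (0, 0, 0, 1))
    # Boundary band = dilated mask minus eroded mask.
    dilated = F.max_pool2d(mask, kernel_size=2 * band + 1, stride=1, padding=band)
    eroded = 1 - F.max_pool2d(1 - mask, kernel_size=2 * band + 1, stride=1, padding=band)
    edge_band = dilated - eroded
    return (grad_mag * edge_band).sum() / edge_band.sum().clamp(min=1)
```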

As Table 1 shows, ASUKA-SD achieves state-of-the-art results, scoring lowest on FID and highest on human-preference proxies (U-IDS).
Extending to FLUX
One of the great features of ASUKA is that it is model-agnostic. The researchers applied the same framework to the newer FLUX model.

Table 2 confirms that ASUKA improves FLUX just as it improved Stable Diffusion, proving that the principles of MAE-guidance and harmonized decoding are universally applicable.
Conclusion
ASUKA represents a smart shift in how we approach generative AI limitations. Instead of blindly scaling up models or retraining massive networks, the authors identified specific structural weaknesses—hallucination and latent color shift—and designed targeted modules to fix them.
By utilizing the “boring but stable” Masked Auto-Encoder as a guide and retraining the decoder to respect local color context, ASUKA allows us to use powerful generative models for what they are best at: creating high-fidelity textures, without the unwanted surprises.
For students and researchers in computer vision, this paper highlights the importance of priors. Generative models are powerful, but they need guidance. Sometimes, the best way to move forward is to look back at simpler, reconstruction-based architectures like MAE and use them to anchor our modern creative engines.