Introduction

Imagine you have a perfect photo of a pepperoni pizza, but you want to remove just one specific slice to show the wooden plate underneath. You fire up a state-of-the-art AI inpainting tool, mask out the slice, and hit “generate.”

Ideally, the AI should generate the texture of the wooden plate. But often, standard diffusion models will do something frustrating: they replace the pepperoni slice with… a cheese slice. Or perhaps a distorted “ghost” of the pepperoni remains.

Why does this happen? The answer lies in how these models are trained. Most Latent Diffusion Models (LDMs) are trained to reconstruct images from noise. When you ask them to fill in a masked area, their primary instinct is to find what should be there based on context. In a pizza photo, the context screams “pizza,” so the model generates pizza.

This is the core problem of Erase Inpainting: the conflict between Coherence (making the image look natural) and Elimination (actually removing the object).

In this post, we are diving deep into a paper titled “Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways”. The researchers propose a novel framework called EraDiff. Instead of relying on standard denoising paths, they fundamentally retrain the diffusion process to understand the concept of “fading away.”

Figure 1. Diverse erase inpainting results produced by our proposed EraDiff, where images before and after removal are presented in pairs, and the areas to be erased in the original images are marked. EraDiff can eliminate targets in various complex real-world scenes while ensuring visual coherence in the generated images.

As seen in Figure 1 above, EraDiff successfully removes complex objects—like a person from a café scene or specific ingredients from a plate—without leaving behind the artifacts common in other models.

Background: The Struggle with Standard Diffusion

To understand why EraDiff is necessary, we first need to look at why current methods fail at this specific task.

The Limits of GANs and Patching

Historically, object removal relied on “copy-paste” methods (like PatchMatch) that borrowed pixels from surrounding areas. While efficient, these methods fail when the background has complex structures or semantic meaning. Later, Generative Adversarial Networks (GANs) improved upon this but often suffered from “pattern repetition,” where they would just tile a texture repeatedly to fill a hole, lacking a global understanding of the scene.

The Diffusion Dilemma

Latent Diffusion Models (LDMs), such as Stable Diffusion, represented a massive leap forward. They generate high-quality, natural images by progressively removing noise from a latent representation.

However, the standard training paradigm for inpainting involves taking an image, adding noise, masking a part of it, and asking the model to recover the original image.

  • The Goal: \(\text{Noise} \rightarrow \text{Original Image}\)
  • The Reality: If the masked area contained an object (like a dog), the model learns to reconstruct the dog.

When you use this trained model to erase a dog, the model is fighting its own training. It sees a dog-shaped mask and noise, and its internal probability distribution suggests “there should be a dog here.” This leads to the generation of unexpected objects or artifacts, as the model tries to “save” the content rather than delete it.
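To make this mismatch concrete, here is a minimal PyTorch sketch of that standard objective (our illustration, with a hypothetical `model` standing in for the inpainting UNet). Note that the regression target is always the noise of the original image, so under the mask the model is rewarded for restoring whatever was there:

```python
import torch
import torch.nn.functional as F

def standard_inpaint_loss(model, x0, mask, t, alphas_cumprod):
    """Standard inpainting objective (illustrative sketch).
    x0: original images (B, C, H, W); mask: 1 = keep, 0 = hole;
    `model` is a hypothetical eps-prediction UNet conditioned on
    the masked image and the mask."""
    noise = torch.randn_like(x0)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise  # noised ORIGINAL image
    x_masked = x0 * mask                              # context the model sees
    eps_pred = model(x_t, x_masked, mask, t)
    # The target is the noise of the original image, hole included: the
    # optimum under the mask is to reconstruct the object, never to erase it.
    return F.mse_loss(eps_pred, noise)
```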

Core Method: EraDiff

The authors of EraDiff argue that the “diffusion pathway”—the trajectory the model takes from noise to the final image—needs to be recalibrated specifically for erasure.

The EraDiff framework introduces two major innovations:

  1. Chain-Rectifying Optimization (CRO): A new training paradigm that simulates the gradual elimination of objects.
  2. Self-Rectifying Attention (SRA): A mechanism to stop the model from paying attention to the artifacts it’s trying to erase.

Figure 2. The overview of our proposed Erase Diffusion, termed EraDiff. Left: Dynamic image synthesis. Each image is initially transformed using techniques like matting, scaling, and copy-pasting. A mix-up strategy then synthesizes a series of dynamic images that simulate the gradual fading of the object. Top: Chain-Rectifying Optimization (CRO). The standard sampling pathway is prone to generating artifacts (black dashed lines). In contrast, we establish a new sampling path for erasing (red dashed lines) that better aligns the reverse sampling trajectory with a clear background. Bottom: Self-Rectifying Attention (SRA). The standard self-attention mechanism may inadvertently amplify artifacts, diverging from the expected diffusion pathway. By modifying the attention activation, we guide the model to bypass artifact regions, enhancing its focus on the background and ensuring a more accurate erase sampling path.

As illustrated in Figure 2, the architecture modifies both the optimization (top) and the attention mechanism within the network (bottom). Let’s break these down.

1. Chain-Rectifying Optimization (CRO)

The goal of CRO is to establish a diffusion path from Object to Background.

In standard diffusion, the path is defined by adding Gaussian noise. In CRO, the researchers create a specific path along which the object “fades out” step by step as the denoising chain is traversed toward the clean image.

Data Synthesis: Creating the “Fading” Effect

To train a model to erase, you need pairs of images: one with the object and one without (the clean background). Public datasets rarely provide this perfectly, so the authors devised a data synthesis strategy.

They take an original image (\(x_0^{ori}\)), use a matting model to cut out the main object, transform it (scale/rotate), and paste it onto a background to create a synthetic object image (\(x_0^{obj}\)).

Figure 11. Data synthesis process for model training in this study.

Figure 11 shows this process. By generating these pairs, the model has a definitive “Before” (with object) and “After” (background only).
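A minimal sketch of this synthesis step is shown below (our reading of the pipeline, not the authors’ code); the matte `alpha` stands in for the output of a matting model, and a simple translation stands in for the paper’s scale/rotate transforms:

```python
import torch

def synthesize_pair(x_ori, alpha, shift=(40, 20)):
    """Paired-data synthesis (sketch). x_ori: background image (H, W, 3)
    in [0, 1]; alpha: (H, W) matte of its main object, produced by a
    hypothetical matting model. Returns (background, object image, mask)."""
    obj = x_ori * alpha[..., None]                  # cut the object out
    # Stand-in for the paper's scale/rotate transforms: translate the cutout.
    obj_t = torch.roll(obj, shifts=shift, dims=(0, 1))
    a_t = torch.roll(alpha, shifts=shift, dims=(0, 1))
    # Composite the shifted object over the untouched original ("over" op
    # with premultiplied alpha).
    x_obj = obj_t + (1 - a_t[..., None]) * x_ori
    mask = (a_t > 0.5).float()                      # 1 where the pasted object sits
    return x_ori, x_obj, mask
```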

Dynamic Latent States

This is the mathematical heart of the paper. Instead of just adding noise to the object image, the authors introduce Dynamic Images (\(\tilde{x}_t^{mix}\)).

The model is fed a mix of the background and the object, weighted by the time step \(t\).

\[ \tilde{\boldsymbol{x}}_{t}^{mix} = (1 - \lambda_t) \boldsymbol{x}_{0}^{ori} + \lambda_t \boldsymbol{x}_{0}^{obj} \]
  • When \(t=0\) (the clean end of the chain), \(\lambda_t \approx 0\): the mixed image is simply the background \(x_0^{ori}\), with the object gone.
  • As \(t\) increases toward pure noise, the object contribution \(\lambda_t\) grows, so the noisy end of the chain still encodes the (heavily noised) object.

This simulates a smooth transition in which the object gradually becomes transparent and vanishes into the background as sampling moves from pure noise toward the clean image. The model is trained to traverse exactly this path during denoising: starting from a noisy state that still encodes the object, it steps toward a reconstructed background without ever bringing the object back.
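A minimal sketch of these dynamic states, assuming a simple linear schedule \(\lambda_t = t/T\) (the paper’s exact schedule may differ):

```python
import torch

def dynamic_latent(x0_ori, x0_obj, t, T, alphas_cumprod):
    """Dynamic images x_t^mix (sketch). x0_ori: clean background;
    x0_obj: background with the pasted object; t: (B,) time steps."""
    # Assumed linear schedule: lam = 0 at t=0 (pure background),
    # lam = 1 at t=T (pure object image).
    lam = (t.float() / T).view(-1, 1, 1, 1)
    x0_mix = (1 - lam) * x0_ori + lam * x0_obj    # object fades as t -> 0
    # The usual forward noising is applied on top of the mixed target.
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0_mix)
    return a_t.sqrt() * x0_mix + (1 - a_t).sqrt() * noise
```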

The New Optimization Objective

Because the underlying “true” state changes at every time step (it’s a moving target between object and background), the standard diffusion loss function (which compares predicted noise against actual noise) isn’t sufficient.

The researchers propose a new loss function that minimizes the distance between the model’s predicted latent state and the “true” mixed state at the previous time step.

\[ \min_{\theta} \mathbb{E}_{\gamma \sim \mathrm{Uniform}(1,\gamma_m),\, t} \left\| \tilde{\boldsymbol{x}}_{t-\gamma}^{mix} - p_{\theta}\bigl(\hat{\boldsymbol{x}}_{t-\gamma}^{mix} \mid \tilde{\boldsymbol{x}}_{t}^{mix}\bigr) \right\|_{2}^{2}. \]

Put simply: The model is punished if it tries to reconstruct the object when it should be fading it out.
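In code, the objective might look like the following sketch (our reading of the loss; the `model` signature is hypothetical), reusing `dynamic_latent` from the sketch above:

```python
import torch
import torch.nn.functional as F

def cro_loss(model, x0_ori, x0_obj, T, gamma_m, alphas_cumprod):
    """Chain-Rectifying Optimization objective (sketch). `model` is a
    hypothetical network that predicts the latent at step t - gamma
    from the latent at step t."""
    b = x0_ori.shape[0]
    t = torch.randint(1, T, (b,))
    gamma = torch.randint(1, gamma_m + 1, (b,))
    gamma = torch.minimum(gamma, t)              # keep t - gamma >= 0
    x_t = dynamic_latent(x0_ori, x0_obj, t, T, alphas_cumprod)
    with torch.no_grad():
        # Target: the dynamic state gamma chain-steps earlier, which by
        # construction contains LESS of the object than x_t does.
        target = dynamic_latent(x0_ori, x0_obj, t - gamma, T, alphas_cumprod)
    pred = model(x_t, t, gamma)                  # predicted x_{t-gamma}^mix
    # The model is penalized whenever its prediction drifts back toward
    # the object instead of following the fading path to the background.
    return F.mse_loss(pred, target)
```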

2. Self-Rectifying Attention (SRA)

Even with the new training paradigm, there is a risk. During the early stages of sampling (denoising), the shape of the mask might “leak” information. The model might see the outline of the masked object and think, “This looks like a foreground object; I should attend to it.”

Standard Self-Attention mechanisms calculate relationships between all pixels. If the model pays attention to the artifact inside the mask, it amplifies that artifact in subsequent steps.

The SRA Mechanism

The solution proposed is elegant: force the attention mechanism to ignore the masked region.

The authors modify the standard attention equation. They take the binary mask \(m\) (where 0 is the area to erase) and create an extended mask \(m'\). This extended mask sets the attention score to \(-\infty\) for any connection involving the erased area.

\[ \mathrm{SRA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}} \cdot m'\right)\mathbf{V}. \]

By injecting this mask into the Softmax function, the probability of attending to the erased region drops to zero. The model is forced to look at the background to fill in the hole, rather than looking at the noisy “ghost” inside the hole.
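A minimal PyTorch sketch of this masked attention follows; we realize the \(-\infty\) rule as an additive bias on the attention scores, a common way such masks are implemented in practice (an assumption about the exact form of \(m'\)):

```python
import torch
import torch.nn.functional as F

def sra_attention(q, k, v, erase_mask):
    """Self-Rectifying Attention (sketch). q, k, v: (B, N, d) token
    tensors; erase_mask: (B, N), 0 for tokens inside the erase region."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (B, N, N) attention logits
    # Send every connection TO an erased key to -inf, so no query can
    # gather information from the artifact region.
    bias = torch.zeros_like(erase_mask, dtype=q.dtype)
    bias = bias.masked_fill(erase_mask == 0, float("-inf"))
    scores = scores + bias[:, None, :]            # broadcast over queries
    attn = F.softmax(scores, dim=-1)              # erased keys get ~0 weight
    return attn @ v
```

Note that only the keys are masked: queries inside the hole still attend outward to the background, which is exactly what lets the model fill the region from context.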

Figure 8. Visualization of heatmaps representing attention block outputs in the presence and absence of the SRA mechanism.

Figure 8 vividly demonstrates this.

  • Output w/o SRA: The heatmap shows high activation on the object itself (the llama or the donut). The model is “looking” at the thing it’s supposed to delete.
  • SRA-based Output: The activation on the object disappears. The model shifts its focus to the surrounding context, allowing for a clean erasure.

Experiments & Results

The researchers compared EraDiff against several baselines, including SD2-Inpaint (Standard Diffusion), LaMa (a strong GAN-based baseline), and text-guided editors like Inst-Inpaint.

Quantitative Analysis

They tested on the OpenImages V5 dataset using metrics like FID (Fréchet Inception Distance) and LPIPS (Learned Perceptual Image Patch Similarity).

  • FID: Measures how realistic the images look overall.
  • LPIPS: Measures how perceptually similar the output is to the ground truth.
  • Local FID: Specifically checks the realism of the erased area.
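For readers who want to reproduce these metrics, here is a minimal sketch using the torchmetrics library (our tooling choice, not the authors’ evaluation code):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Stand-in tensors: float images in [0, 1], shape (B, 3, H, W).
# Use many more images in practice for a stable FID estimate.
real = torch.rand(8, 3, 299, 299)  # ground-truth backgrounds
fake = torch.rand(8, 3, 299, 299)  # model outputs after erasure

fid = FrechetInceptionDistance(feature=2048, normalize=True)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())        # lower = more realistic overall

lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
print("LPIPS:", lpips(fake, real).item())  # lower = perceptually closer to GT

# Local FID: run the same FID on crops around the erased region (our
# reading of the metric; the paper defines the exact crop protocol).
```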

The results were impressive:

Table 1. Quantitative assessment of various erase inpainting models on the OpenImages V5 dataset. Optimal results are highlighted in bold, with runner-up performance underlined.

While standard SD2-Inpaint had a good global FID score (meaning the image looked “nice”), it failed in Local FID. EraDiff achieved the best Local FID (3.799) and LPIPS scores.

But numbers don’t tell the whole story. The most critical metric for this task is Elimination: Did the object actually disappear? To measure this, the authors conducted a user study and used GPT-4V to evaluate whether the object was gone.

Figure 4. Results from the user study. EraDiff demonstrates enhanced performance, as indicated by its higher mean scores in both elimination and coherence evaluations.

As Figure 4 shows, human evaluators rated EraDiff significantly higher (8.18) on Elimination compared to LaMa (5.77) and SD2-Inpaint (3.94). This confirms that while other models might make a pretty picture, EraDiff is the best at actually obeying the user’s intent to remove the object.

Qualitative Comparison

Let’s look at the visual evidence.

Figure 3. Qualitative results of OpenImages V5 dataset compared among SD2-Inpaint, SD2-Inpaint with prompt guidance, PowerPaint, Inst-Inpaint, LaMa, and our approach.

In Figure 3, look at the second row (the plate with the pattern).

  • SD2-Inpaint and PowerPaint struggle to continue the pattern correctly, often leaving a blur or a wrong texture.
  • Inst-Inpaint changes the global color tone.
  • EraDiff (far right) seamlessly continues the plate’s texture.

Similarly, in the snow scene (bottom row), EraDiff removes the person while perfectly reconstructing the snowy path, whereas other models leave ghostly artifacts.

Ablation Study: Do we need both CRO and SRA?

The authors performed an ablation study to see if both components were necessary.

Figure 6. Visual examples for the ablation study comparing baseline, baseline with CRO, baseline with SRA, and baseline with both CRO and SRA, displayed left to right.

  • Baseline: The object (orange boot) is partially reconstructed.
  • + CRO: The background is better, but some artifacts remain.
  • + SRA: Texture is improved, but structure might be off.
  • CRO & SRA: The boot is completely gone, replaced by a perfect wall and floor.

This proves that CRO provides the correct “trajectory” for erasure, while SRA ensures the model doesn’t get distracted by artifacts along the way.

Generalization

Finally, does this work outside of standard datasets? The authors tested EraDiff on cartoons, e-commerce products, and complex artistic styles.

Figure 13. Comparison of visualizations for baseline models and the proposed EraDiff in scenarios of marketable products.

Figure 13 shows e-commerce applications. Removing a perfume bottle (top row) or a hand holding a product (second row) is a common commercial use case. EraDiff handles the reflections and shadows significantly better than the baselines, which often leave “smudges.”

Conclusion and Implications

EraDiff represents a shift in how we think about generative AI for editing. Rather than treating object removal as just another “inpainting” task, the researchers recognized it as a distinct problem requiring a distinct diffusion pathway.

By mathematically forcing the model to learn the transition from Object \(\rightarrow\) Background (via Chain-Rectifying Optimization) and architecturally forcing it to ignore the object during generation (via Self-Rectifying Attention), EraDiff achieves state-of-the-art results.

For students and researchers, the key takeaway here is the importance of task-specific calibration. General-purpose models like Stable Diffusion are powerful, but adapting the underlying diffusion chain to the specific physics of the task—in this case, the “fading” of an object—can unlock superior performance.