If you have ever played with Text-to-Image (T2I) models like Stable Diffusion, you know the struggle: you type a prompt, get a weird result, tweak the prompt, get a slightly less weird result, and repeat. While these models are powerful, they aren’t naturally aligned with human aesthetic preferences or detailed instruction following.

In the world of Large Language Models (LLMs), we solved this using Reinforcement Learning from Human Feedback (RLHF) and, more recently, Direct Preference Optimization (DPO). These methods take “winning” and “losing” outputs and teach the model to prefer the winner.

But applying this to image diffusion models is notoriously difficult. Why? Because diffusion is a multi-step process (a “Markov chain”). If an image turns out ugly after 50 steps of denoising, which specific step was responsible? This is the classic credit assignment problem.

In this post, we are diving into a new paper, “InPO: Inversion Preference Optimization with Reparametrized DDIM,” which proposes a clever mathematical trick to solve this. It treats the complex, multi-step diffusion process as a single-step generation problem, allowing for highly efficient alignment.

Figure 1: InPO achieves state-of-the-art alignment results with just 400 fine-tuning steps.

The Problem: Sparse Rewards in a Long Chain

To understand why InPO is necessary, we first need to look at why standard alignment is hard for diffusion.

In a standard DPO setup for diffusion (like Diffusion-DPO), the model generates an entire denoising trajectory of latents (from \(t=T\) down to \(t=0\)). We only get feedback (the “reward”) at the very end, when the final image is revealed.

Existing methods try to spread this reward across all timesteps. The resulting signal is sparse, weak, and noisy, because the model struggles to connect a specific action at timestep \(t=500\) with the final visual quality at \(t=0\). This leads to:

  1. Inefficiency: It takes a massive amount of training steps to converge.
  2. Subpar Quality: The model often fails to learn fine-grained details.

The authors of InPO realized that if they could pinpoint exactly which latent variables (noise) were responsible for a specific image, they could optimize the model much more directly.

The Core Method: InPO

The InPO framework rests on three main pillars: Reparameterization, Inversion, and the Objective Function. Let’s break them down.

1. The Insight: Single-Step View via Reparameterization

Standard diffusion models predict the noise \(\epsilon\) at step \(t\). However, we can also look at what the model thinks the final image \(x_0\) looks like at that specific step. This is denoted as \(x_0(t)\).

By reparameterizing the Denoising Diffusion Implicit Model (DDIM) equation, the authors propose viewing the diffusion model not as a chain, but as a timestep-aware single-step generative model.

Mathematically, they define the predicted clean image \(x_0(t)\) based on the current noisy state \(\bar{x}_t\) and the predicted noise \(\epsilon_\theta\):

Equation: Definition of predicted x0 at timestep t.
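For reference, in standard DDIM notation this prediction is just the forward-noising relation rearranged (the paper’s exact symbols may differ slightly):

\[
x_0(t) \;=\; \frac{\bar{x}_t - \sqrt{1-\alpha_t}\;\epsilon_\theta(\bar{x}_t, t)}{\sqrt{\alpha_t}},
\]

where \(\alpha_t\) is the cumulative noise-schedule coefficient (written \(\bar{\alpha}_t\) in some papers).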

This simple shift allows the researchers to define a reward \(r(x_0, c)\) (where \(c\) is the text prompt) that can be distributed to the latent variables \(x_t\) at any timestep.

Equation: Relationship between initial reward and joint reward.

This equation essentially says: “The reward for the final image is the expectation of the rewards at the intermediate steps.” This bridge allows us to optimize specific timesteps directly.
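Written schematically (a paraphrase of the idea rather than the paper’s exact statement), the relationship reads

\[
r(x_0, c) \;=\; \mathbb{E}_{t,\,x_t}\big[\, r(x_t, c) \,\big],
\]

so improving the per-timestep rewards on the right-hand side is a valid proxy for improving the final-image reward on the left.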

2. Finding the “Guilty” Latents via Inversion

Here is the crux of the method. We have a dataset of human preferences containing pairs of images: a Winner (\(x_0^w\)) and a Loser (\(x_0^l\)).

To fine-tune the model, we need to know: What specific noise \(x_t\) would generate this winner or loser image?

If we can find the specific \(x_t\) for the winner, we can tell the model: “When you see this noise, make it more likely to produce this image.” Conversely, we tell it to avoid the loser path.

To do this, the authors use DDIM Inversion. Whereas DDIM sampling goes from Noise \(\rightarrow\) Image, inversion runs the same deterministic process in reverse, going from Image \(\rightarrow\) Noise.

Figure 2: Illustration of DDIM Inversion. The process works backward from the clean image x0 to find the specific noisy latents x_t at various timesteps.

As shown in Figure 2, by inverting the ODE (Ordinary Differential Equation) of the diffusion process, the method calculates the latent variables \(x_t\) that correspond to the target image \(x_0\).

The mathematical formulation for this inversion relies on finding the noise \(\delta_t\) that satisfies the transition from the clean image back to the noisy state:

Equation: The relationship between x_t, x_0, and the noise delta.
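In standard notation, this is the familiar noising identity solved for the noise term (again, the paper’s symbols may differ):

\[
\bar{x}_t \;=\; \sqrt{\alpha_t}\,x_0 \;+\; \sqrt{1-\alpha_t}\;\delta_t
\quad\Longleftrightarrow\quad
\delta_t \;=\; \frac{\bar{x}_t - \sqrt{\alpha_t}\,x_0}{\sqrt{1-\alpha_t}}.
\]

In practice, the inversion builds each \(\bar{x}_t\) step by step from \(x_0\), using the model’s own noise predictions in place of \(\delta_t\).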

This inversion technique is what makes the “single-step” optimization possible. Instead of guessing which noise might lead to a good image, InPO computes the noise that actually led to the good (or bad) image in the dataset.
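To make the mechanics concrete, here is a minimal PyTorch-style sketch of deterministic DDIM inversion. This is not the authors’ implementation: `eps_model`, `alphas_cumprod`, `timesteps`, and `cond` are illustrative names, and it uses the usual approximation of reusing the noise predicted from the current (less noisy) latent for the step up in noise level.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, timesteps, cond):
    """Deterministic DDIM inversion: map a clean latent x0 to noisy latents x_t.

    eps_model(x, t, cond) -> predicted noise (placeholder signature)
    alphas_cumprod        -> 1-D tensor of cumulative alphas, indexed by timestep
    timesteps             -> ascending list of timesteps, e.g. [20, 40, ..., 980]
    Returns {t: x_t} along the inverted trajectory.
    """
    x = x0
    latents = {}
    a_prev = torch.tensor(1.0)  # cumulative alpha at "t = 0" (clean image)
    for t in timesteps:
        a_t = alphas_cumprod[t]
        # Approximation: predict the noise for this step from the current latent.
        eps = eps_model(x, t, cond)
        # Predicted clean image from the current (less noisy) state.
        pred_x0 = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        # Deterministic DDIM step *up* in noise level.
        x = a_t.sqrt() * pred_x0 + (1 - a_t).sqrt() * eps
        latents[t] = x
        a_prev = a_t
    return latents
```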

3. The Objective Function

With the specific latent variables \(x_t^w\) (winner) and \(x_t^l\) (loser) identified via inversion, the authors apply the Direct Preference Optimization (DPO) objective.

The goal is to minimize the difference between the model’s predicted noise and the “optimal” noise for the winner, while maximizing the difference for the loser (relative to a frozen reference model, which keeps the fine-tuned model from drifting too far from its original behavior).

Through several derivations (involving Jensen’s inequality and KL-divergence bounds), the authors arrive at a clean, implementable loss function:

Equation: The final InPO Loss Function.

What this equation tells us:

  • We sample a timestep \(t\).
  • We take the winner \(w\) and loser \(l\) images and their inverted latents.
  • We calculate the error (distance) between the optimal trajectory \(\tau\) and the model’s predicted noise \(\epsilon_\theta\).
  • We penalize the model when the winner’s error is large, and reward it when the winner’s error is lower than the loser’s.

It looks complex, but it essentially boils down to: push the model’s noise prediction closer to the winner’s trajectory and further from the loser’s, as the sketch below illustrates.
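To illustrate the shape of such an objective, here is a minimal sketch of a Diffusion-DPO-style preference loss computed on noise-prediction errors, assuming the target noises have already been obtained via inversion. The function and variable names (and the \(\beta\) value) are illustrative, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def preference_loss(eps_theta_w, eps_theta_l, eps_ref_w, eps_ref_l,
                    target_w, target_l, beta=2500.0):
    """DPO-style loss over noise-prediction errors for winner/loser latents.

    eps_theta_*: noise predicted by the model being trained
    eps_ref_*:   noise predicted by the frozen reference model (no grad)
    target_*:    target noise for each latent (here: derived via DDIM inversion)
    """
    # Per-sample squared errors of the trainable model against the targets.
    err_w = F.mse_loss(eps_theta_w, target_w, reduction="none").mean(dim=(1, 2, 3))
    err_l = F.mse_loss(eps_theta_l, target_l, reduction="none").mean(dim=(1, 2, 3))
    # Same errors for the frozen reference model.
    ref_err_w = F.mse_loss(eps_ref_w, target_w, reduction="none").mean(dim=(1, 2, 3))
    ref_err_l = F.mse_loss(eps_ref_l, target_l, reduction="none").mean(dim=(1, 2, 3))
    # Implicit reward margin: improvement on the winner minus improvement on the loser.
    margin = (ref_err_w - err_w) - (ref_err_l - err_l)
    # Logistic (DPO) loss: push the margin to be positive.
    return -F.logsigmoid(beta * margin).mean()
```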

Experiments and Results

The theory sounds solid, but does it work? The authors tested InPO on Stable Diffusion 1.5 (SD1.5) and SDXL using the Pick-a-Pic v2 dataset.

1. Training Efficiency

This is the most impressive result. Because InPO targets specific latents rather than optimizing a whole chain blindly, it converges incredibly fast.

Figure 4: Efficiency comparison. InPO (Red) is 18.4x faster than Diffusion-KTO and 3.6x faster than Diffusion-DPO while achieving a higher PickScore.

As shown in the chart above, InPO achieves a higher PickScore (a learned metric of human preference) with significantly fewer GPU hours compared to baselines like Diffusion-DPO and Diffusion-KTO.

2. Human Evaluation

Metrics are useful, but the human eye is the ultimate test. The authors conducted user studies comparing InPO against the base SDXL model and DPO-SDXL.

Figure 3: Human evaluation results. InPO (Blue bars) dominates in Visual Attractiveness and Text Alignment.

InPO-SDXL wins 60% of the time in overall preference against the base model and dominates DPO-SDXL in visual attractiveness (73% win rate).

3. Visual Comparisons

Let’s look at the actual outputs. The method excels at following prompts precisely where other models might hallucinate or ignore instructions.

Figure 16: Qualitative comparison. Look at the bottom row (Row 5). InPO correctly renders the text “STOP” on the towel, while the Base and DPO models struggle with spelling or formatting.

In the figure above, note the prompt “A towel with the word ‘stop’ printed on it.”

  • Base-SDXL: Spells it “SSOOP”.
  • DPO-SDXL: Spells it correctly but the font is messy.
  • InPO-SDXL: Perfect spelling, clean typography.

This demonstrates that InPO isn’t just improving “vibes” or colors; it’s improving the model’s fundamental control and instruction-following capabilities.

Conclusion

InPO (Inversion Preference Optimization) represents a significant step forward in aligning diffusion models. By shifting the perspective from a “multi-step Markov chain” to a “timestep-aware single-step generator,” the authors managed to:

  1. Solve the Credit Assignment Problem: Using DDIM inversion to map images back to their specific latent noise.
  2. Boost Efficiency: Drastically reducing the compute required for fine-tuning (just 400 steps).
  3. Improve Quality: Achieving state-of-the-art results in visual appeal and text alignment.

For students and researchers, InPO highlights the power of reparameterization. Sometimes, the solution isn’t a bigger GPU or a larger dataset, but a mathematical change in perspective that simplifies the problem.

The method is open-source, and resources are available for those who want to experiment with aligning their own diffusion models.