In the last decade, AI has dazzled the world with deep generative models capable of producing realistic images, audio, and text from scratch. We’ve seen Generative Adversarial Networks (GANs) generate lifelike portraits and Variational Autoencoders (VAEs) learn rich latent representations. But in 2020, a paper titled Denoising Diffusion Probabilistic Models from researchers at UC Berkeley reshaped the conversation.

This work introduced a class of models based on ideas from nonequilibrium thermodynamics, first explored by Sohl-Dickstein et al. in 2015, and showed for the first time that they could produce exceptionally high-quality images, rivaling and in some cases surpassing the best GANs.

These models, now widely known as Denoising Diffusion Probabilistic Models (DDPMs), operate on a beautifully intuitive principle:

  1. Start with a clean image.
  2. Gradually destroy it by adding noise.
  3. Learn how to reconstruct it by reversing the noise process step-by-step.

This idea — a deliberate destruction followed by learned restoration — turned out to be an incredibly effective way to capture the intricate statistical patterns of real-world data.

Figure 1: Generated samples from the DDPM paper. On the left, 256×256 faces from the CelebA-HQ dataset; on the right, 32×32 images from the unconditional CIFAR10 dataset.


The Two-Step Dance: How Diffusion Models Work

At their core, diffusion models consist of two opposing processes:

  • A forward process that systematically adds noise to an image.
  • A reverse process that learns to remove the noise.

Figure 2: The Markov chain structure of the forward (noising) and reverse (denoising) processes. The forward process \(q\) gradually adds Gaussian noise to an image \(\mathbf{x}_0\) until it becomes pure noise \(\mathbf{x}_T\); the learned reverse process \(p_{\theta}\) transforms \(\mathbf{x}_T\) back into a clean image \(\mathbf{x}_0\).

These processes are implemented as Markov chains, where each step depends only on the previous one.


1. The Forward Process (Diffusion)

Imagine you have a clean image \(\mathbf{x}_0\).
The forward process \(q\) adds small amounts of Gaussian noise over \(T\) timesteps: \(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\).

The per-step transition is defined as:

\[ q(\mathbf{x}_t|\mathbf{x}_{t-1}) \coloneqq \mathcal{N}\big(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\,\beta_t\,\mathbf{I}\big) \]

Here:

  • \(\beta_t\) controls the noise level for step \(t\).
  • The mean is a scaled version of \(\mathbf{x}_{t-1}\); the variance is \(\beta_t\mathbf{I}\).

By the final step \(T\) (the paper uses \(T = 1000\)), the image is essentially indistinguishable from pure Gaussian noise.
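
To make this concrete, here is a minimal PyTorch sketch of a single forward step. It assumes images are tensors scaled to \([-1, 1]\); the linear \(\beta\) schedule from \(10^{-4}\) to \(0.02\) is the one used in the paper.

```python
import torch

T = 1000
# Linear beta schedule from the paper: beta_1 = 1e-4 rising to beta_T = 0.02.
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_{t-1}) for one diffusion step (t is 0-indexed)."""
    beta_t = betas[t]
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise
```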


Elegant shortcut:
Because Gaussians compose nicely, we can jump directly to any \(\mathbf{x}_t\) from \(\mathbf{x}_0\) without simulating intermediate steps:

\[ q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0,\, (1 - \bar{\alpha}_t) \mathbf{I}\big) \]

where \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^t \alpha_s\).

This property makes training efficient: we can produce a noisy version of any image at any timestep in a single shot.
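
Continuing the sketch above, the closed form noises an image to any timestep in one call; the sampled noise is returned as well, since it becomes the training target later on.

```python
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # alpha-bar_t = product of alphas up to t

def sample_xt(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in one shot; t is a batch of 0-indexed steps."""
    noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (C, H, W)
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * noise
    return x_t, noise
```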


2. The Reverse Process (Generation)

The forward process is fixed. The reverse process \(p_{\theta}\) is where learning happens — the model learns to undo the noising.

Starting with \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\), the reverse process iteratively predicts the slightly less noisy image \(\mathbf{x}_{t-1}\) conditioned on \(\mathbf{x}_t\):

\[ p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_t) := \mathcal{N}\big(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_{\theta}(\mathbf{x}_t, t),\, \boldsymbol{\Sigma}_{\theta}(\mathbf{x}_t, t)\big) \]

The mean is predicted by a neural network parameterized by \(\theta\); in the DDPM paper, the variance is not learned but fixed to \(\sigma_t^2 \mathbf{I}\), e.g. with \(\sigma_t^2 = \beta_t\).


Training with Variational Inference

We train the network by maximizing a Variational Lower Bound (VLB), also known as the ELBO, on the data log likelihood; equivalently, we minimize the negative bound \(L\).
The bound decomposes naturally into terms for different timesteps:

\[ L = \mathbb{E}_q\!\left[ \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big)}_{L_T} + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)\big)}_{L_{t-1}} \underbrace{{}-\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)}_{L_0} \right] \]

The VLB splits into \(L_T\), a set of \(L_{t-1}\) denoising terms for \(t>1\), and \(L_0\) for final reconstruction.

Interpretation of terms:

  • \(L_T\): Compares the final noisy latent \(\mathbf{x}_T\) to a standard Gaussian. With the fixed forward process, it’s constant and ignored during training.
  • \(L_{t-1}\): KL divergence between the model’s predicted reverse step and the true posterior from the forward process. This is the core denoising objective.
  • \(L_0\): A reconstruction term for decoding \(\mathbf{x}_0\) from \(\mathbf{x}_1\).

Because the true posterior is Gaussian, these KL divergences have closed-form solutions — making training stable and efficient.
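
For reference, the forward-process posterior that each \(L_{t-1}\) term compares against has the closed form

\[ q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\, \tilde{\beta}_t \mathbf{I}\big), \]

with

\[ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,\mathbf{x}_t \quad \text{and} \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t. \]

This mean \(\tilde{\boldsymbol{\mu}}_t\) is the quantity the next section revisits.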


What Made DDPMs So Good?

Earlier diffusion models (e.g., Sohl-Dickstein et al., 2015) did not achieve high sample quality. The DDPM paper added critical design choices and a key insight that changed the game.


Insight: Predict the Noise, Not the Mean

The naïve approach is to have the network predict \(\tilde{\boldsymbol{\mu}}_t\), the mean of the forward process posterior:

\[ L_{t-1} = \mathbb{E}_q \left[ \frac{1}{2\sigma_t^2} \|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\|^2 \right] + C \]

But notice: From the forward process,
\(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\).

Rewriting \(\mathbf{x}_0\) in terms of \(\mathbf{x}_t\) and \(\boldsymbol{\epsilon}\), you discover predicting \(\tilde{\boldsymbol{\mu}}_t\) is equivalent to predicting the noise \(\boldsymbol{\epsilon}\).
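
Concretely, inverting the expression for \(\mathbf{x}_t\) gives

\[ \mathbf{x}_0 = \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}}{\sqrt{\bar{\alpha}_t}}, \]

and substituting this into \(\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)\) leaves the noise \(\boldsymbol{\epsilon}\) as the only unknown.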

Thus, the authors reparameterize the reverse mean as:


\[ \boldsymbol{\mu}_{\theta}(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \,\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t, t) \right) \]

Here, the network \(\boldsymbol{\epsilon}_{\theta}\) predicts the noise added at step \(t\).
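
As a sketch, one reverse step then looks as follows, reusing the schedule tensors defined earlier. The model argument stands in for a hypothetical noise-prediction network \(\boldsymbol{\epsilon}_{\theta}\) taking a batch and its timesteps, and the variance uses the paper's fixed choice \(\sigma_t^2 = \beta_t\).

```python
@torch.no_grad()
def reverse_step(model, x_t: torch.Tensor, t: int) -> torch.Tensor:
    """One reverse step x_t -> x_{t-1}, using the epsilon-parameterized mean."""
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = model(x_t, t_batch)  # predicted noise
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the very last step
    sigma_t = torch.sqrt(betas[t])  # fixed variance: sigma_t^2 = beta_t
    return mean + sigma_t * torch.randn_like(x_t)
```

Running this from \(t = T - 1\) down to \(t = 0\), starting from pure Gaussian noise, yields a generated sample.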


Simplifying the Loss

With this parameterization, the loss reduces to a simple weighted MSE between actual and predicted noise:

\[ L_{t-1} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \big\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t\big) \big\|^2 \right] + C \]

However, the VLB weighting assigns very little weight to the noisier timesteps (large \(t\)), where denoising is hardest.
Empirically, removing the weighting entirely improved sample quality.

The final simplified objective is:


\[ L_{\text{simple}}(\theta) \coloneqq \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}} \big[ \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}, t)\|^2 \big] \]

Simplified Training Loop (a code sketch follows the list):

  1. Pick a random data image \(\mathbf{x}_0\).
  2. Pick a random timestep \(t \in [1, T]\).
  3. Sample Gaussian noise \(\boldsymbol{\epsilon}\).
  4. Compute \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\).
  5. Feed \(\mathbf{x}_t\) and \(t\) into the network \(\boldsymbol{\epsilon}_{\theta}\) (a U-Net in the paper) to predict \(\hat{\boldsymbol{\epsilon}}\).
  6. Minimize MSE between \(\boldsymbol{\epsilon}\) and \(\hat{\boldsymbol{\epsilon}}\).
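
Under the same assumptions as the sketches above, one training iteration might look like this:

```python
import torch.nn.functional as F

def training_step(model, x0: torch.Tensor, optimizer) -> float:
    """One gradient step on L_simple (steps 1-6 above, batched)."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # step 2
    x_t, noise = sample_xt(x0, t)   # steps 3-4: noise the clean batch
    pred = model(x_t, t)            # step 5: predict the added noise
    loss = F.mse_loss(pred, noise)  # step 6: unweighted MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```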

Experimental Results

The authors evaluated DDPMs on CIFAR10, CelebA-HQ, and LSUN.

Table 1: CIFAR10 scores, comparing the DDPM to other leading generative models. DDPMs achieve state-of-the-art FID (3.17) and a competitive Inception Score (9.46).

FID (Fréchet Inception Distance, lower is better) and IS (Inception Score, higher is better) metrics confirm DDPM’s top-tier fidelity.
On higher-resolution data:

Figure 3: High-quality 256×256 LSUN Church samples generated by the DDPM. FID = 7.89.

Figure 4: High-quality 256×256 LSUN Bedroom samples generated by the DDPM. FID = 4.90.


Ablation: Why \(L_{\text{simple}}\) Matters

Table 2: Ablation comparing reverse-process parameterizations and training objectives.

Predicting \(\epsilon\) with the simplified loss beats all alternatives.
The baseline mean-prediction approach with the full variational bound performs notably worse.


Rate–Distortion Analysis: Great Images, Mediocre Likelihoods?

While DDPMs generate outstanding images, their log likelihoods lag behind other likelihood-based models.

The authors suggest DDPMs act as excellent lossy compressors: most bits model perceptually important large features; fewer bits target imperceptible detail.

Progressive Decoding:
At any reverse step \(t\), we can estimate:

\[ \hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}} \]
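
In code, this estimate reuses the pieces already defined (same assumptions as the earlier sketches):

```python
def predict_x0(model, x_t: torch.Tensor, t: int) -> torch.Tensor:
    """Estimate the clean image x0 from x_t via the predicted noise."""
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = model(x_t, t_batch)
    ab = alpha_bars[t]
    return (x_t - torch.sqrt(1.0 - ab) * eps) / torch.sqrt(ab)
```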

Figure 5: Rate-distortion curves for DDPM on CIFAR10. Distortion drops quickly at low rates, showing that perceptually important structure is reconstructed first.


Visualization: Progressive Generation

As the reverse process runs, large-scale features emerge before fine details:

Figure 6: Progressive generation over time: noise evolves into airplanes, birds, and animals, with coarse shapes emerging before fine details.


Conclusion and Impact

The Denoising Diffusion Probabilistic Models paper marked a turning point in generative modeling:

  1. State-of-the-art quality with stable training — rivaling top GANs without their instability.
  2. Noise prediction (\(\epsilon\)-parameterization) and a simplified loss were key to unlocking quality.
  3. Progressive generation reveals deep links to denoising score matching, Langevin dynamics, and lossy compression.

These ideas form the basis of modern diffusion models used in systems like Stable Diffusion and DALL·E 2, extending far beyond images to audio, video, and multimodal generation.

By showing how reversing noise can lead to high-fidelity generation, this work did more than release a new model — it inspired a new lens through which to view creation itself.