In the last decade, AI has dazzled the world with deep generative models capable of producing realistic images, audio, and text from scratch. We’ve seen Generative Adversarial Networks (GANs) generate lifelike portraits and Variational Autoencoders (VAEs) learn rich latent representations. But in 2020, a paper titled Denoising Diffusion Probabilistic Models from researchers at UC Berkeley reshaped the conversation.
This work introduced a class of models, based on ideas from nonequilibrium thermodynamics first explored in 2015, that were shown for the first time to produce exceptionally high-quality images, rivaling — and in some cases surpassing — the best GANs.
These models, now widely known as Denoising Diffusion Probabilistic Models (DDPMs), operate on a beautifully intuitive principle:
- Start with a clear image.
- Gradually destroy it by adding noise.
- Learn how to reconstruct it by reversing the noise process step-by-step.
This idea — a deliberate destruction followed by learned restoration — turned out to be an incredibly effective way to capture the intricate statistical patterns of real-world data.
Figure 1: Generated high-quality samples from CelebA-HQ (left) and unconditional CIFAR10 (right).
The Two-Step Dance: How Diffusion Models Work
At their core, diffusion models consist of two opposing processes:
- A forward process that systematically adds noise to an image.
- A reverse process that learns to remove the noise.
Figure 2: The Markov chain structure of the forward (noising) and reverse (denoising) processes.
These processes are implemented as Markov chains, where each step depends only on the previous one.
1. The Forward Process (Diffusion)
Imagine you have a clean image \(\mathbf{x}_0\).
The forward process \(q\) adds small amounts of Gaussian noise over \(T\) timesteps: \(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\).
The per-step transition is defined as:
\[ q(\mathbf{x}_t|\mathbf{x}_{t-1}) := \mathcal{N}\big(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\,\beta_t\,\mathbf{I}\big) \]
Here:
- \(\beta_t\) controls the noise level for step \(t\).
- The mean is a scaled version of \(\mathbf{x}_{t-1}\); the variance is \(\beta_t\mathbf{I}\).
By the final step \(T\) (the paper uses \(T = 1000\)), the image has been almost completely erased into pure Gaussian noise.
Elegant shortcut:
Because Gaussians compose nicely, we can jump directly to any \(\mathbf{x}_t\) from \(\mathbf{x}_0\) without simulating intermediate steps:
\[ q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\,\mathbf{I}\big) \]
where \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^t \alpha_s\).
This property allows efficient training: we can sample a noisy version of the image at any random timestep in one shot.
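As a minimal sketch of this shortcut in PyTorch (assuming the paper's linear \(\beta\) schedule from \(10^{-4}\) to \(0.02\) over \(T = 1000\) steps, and images represented as tensors scaled to \([-1, 1]\)):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule used in the paper
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = product of alphas up to t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in closed form, skipping the intermediate steps."""
    # Per-example coefficients, reshaped to broadcast over (C, H, W).
    sqrt_ab = alpha_bars[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise

# Example: noise a batch of 8 stand-in 32x32 RGB images at random timesteps.
x0 = torch.randn(8, 3, 32, 32)               # placeholder for real images in [-1, 1]
t = torch.randint(0, T, (8,))                # indices 0..T-1, i.e. timesteps 1..T
eps = torch.randn_like(x0)
xt = q_sample(x0, t, eps)
```

The same expression reappears later in the training loop, which is exactly why this closed form matters.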
2. The Reverse Process (Generation)
The forward process is fixed. The reverse process \(p_{\theta}\) is where learning happens — the model learns to undo the noising.
Starting with \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\), the reverse process iteratively predicts the slightly less noisy image \(\mathbf{x}_{t-1}\) conditioned on \(\mathbf{x}_t\):
\[ p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_t) := \mathcal{N}\big(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_{\theta}(\mathbf{x}_t, t),\, \boldsymbol{\Sigma}_{\theta}(\mathbf{x}_t, t)\big) \]
The mean is predicted by a neural network parameterized by \(\theta\); the paper fixes the variance to untrained, time-dependent constants, \(\boldsymbol{\Sigma}_{\theta}(\mathbf{x}_t, t) = \sigma_t^2\,\mathbf{I}\).
Training with Variational Inference
We train the network with the Variational Lower Bound (VLB), also known as the ELBO: minimizing it as an upper bound \(L\) on the negative log likelihood of the data.
The bound decomposes naturally into terms for different timesteps:
\[ L = \mathbb{E}_q\bigg[ \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big)}_{L_T} + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)\big)}_{L_{t-1}} \;\underbrace{-\,\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)}_{L_0} \bigg] \]
That is, the VLB splits into \(L_T\), a set of \(L_{t-1}\) denoising terms for \(t>1\), and \(L_0\) for the final reconstruction.
Interpretation of terms:
- \(L_T\): Compares the final noisy latent \(\mathbf{x}_T\) to a standard Gaussian. With the fixed forward process, it’s constant and ignored during training.
- \(L_{t-1}\): KL divergence between the model’s predicted reverse step and the true posterior from the forward process. This is the core denoising objective.
- \(L_0\): A reconstruction term for decoding \(\mathbf{x}_0\) from \(\mathbf{x}_1\).
Because the true posterior is Gaussian, these KL divergences have closed-form solutions — making training stable and efficient.
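As a reminder of the identity at work here: when the reverse-process variance is fixed to \(\sigma_t^2\,\mathbf{I}\), as in the paper, the KL divergence between two Gaussians sharing that covariance reduces to a scaled squared error between their means:
\[ D_{\mathrm{KL}}\big(\mathcal{N}(\boldsymbol{\mu}_1, \sigma^2\mathbf{I}) \,\|\, \mathcal{N}(\boldsymbol{\mu}_2, \sigma^2\mathbf{I})\big) = \frac{1}{2\sigma^2}\,\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|^2 \]
This is exactly the form the \(L_{t-1}\) terms take below.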
What Made DDPMs So Good?
Earlier diffusion models (e.g., Sohl-Dickstein et al., 2015) did not achieve high sample quality. The DDPM paper added critical design choices and a key insight that changed the game.
Insight: Predict the Noise, Not the Mean
The naïve approach is to have the network predict \(\tilde{\boldsymbol{\mu}}_t\), the mean of the forward process posterior:
\[ L_{t-1} = \mathbb{E}_q \left[ \frac{1}{2\sigma_t^2} \|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\|^2 \right] + C \]
But notice:
From the forward process,
\(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\).
Rewriting \(\mathbf{x}_0\) in terms of \(\mathbf{x}_t\) and \(\boldsymbol{\epsilon}\), you discover predicting \(\tilde{\boldsymbol{\mu}}_t\) is equivalent to predicting the noise \(\boldsymbol{\epsilon}\).
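Concretely, inverting the closed-form forward relation gives:
\[ \mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}\right) \]
Substituting this into the posterior mean \(\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)\) leaves \(\boldsymbol{\epsilon}\) as the only unknown quantity.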
Thus, the authors reparameterize the reverse mean as:
\[ \boldsymbol{\mu}_{\theta}(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t, t)\right) \]
Here, the network \(\boldsymbol{\epsilon}_{\theta}\) predicts the noise \(\boldsymbol{\epsilon}\) that was mixed into \(\mathbf{x}_0\) to produce \(\mathbf{x}_t\).
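For intuition, here is a minimal sampling sketch under this parameterization (not the authors' reference code). It assumes a trained network `eps_model(x, t)` that returns the predicted noise, the `betas` schedule from the earlier snippet, and the fixed variance choice \(\sigma_t^2 = \beta_t\):

```python
import torch

@torch.no_grad()
def p_sample_loop(eps_model, shape, betas):
    """Generate samples by iterating the learned reverse (denoising) process."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    T = betas.shape[0]

    x = torch.randn(shape)                     # x_T ~ N(0, I)
    for t in reversed(range(T)):               # indices 0..T-1, i.e. timesteps 1..T
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = eps_model(x, t_batch)        # predicted noise epsilon_theta(x_t, t)

        # Reverse mean: (1/sqrt(alpha_t)) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat)
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t].sqrt()

        # Add fresh noise for every step except the last (t = 0 yields x_0 directly).
        if t > 0:
            sigma_t = betas[t].sqrt()          # fixed variance choice sigma_t^2 = beta_t
            x = mean + sigma_t * torch.randn_like(x)
        else:
            x = mean
    return x
```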
Simplifying the Loss
With this parameterization, the loss reduces to a weighted MSE between the actual and the predicted noise:
\[ L_{t-1} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t\big) \right\|^2 \right] + C \]
However, this weighting emphasizes the low-noise terms (small \(t\)) and downplays the noisier timesteps (large \(t\)).
Empirically, removing the weighting entirely improved sample quality.
The final simplified objective is:
\[ L_{\text{simple}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t\big) \right\|^2 \right] \]
Simplified Training Loop (a code sketch follows this list):
- Pick a random data image \(\mathbf{x}_0\).
- Pick a random timestep \(t \in [1, T]\).
- Sample Gaussian noise \(\boldsymbol{\epsilon}\).
- Compute \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\).
- Feed \(\mathbf{x}_t\) and \(t\) into the U-Net to predict \(\hat{\boldsymbol{\epsilon}}\).
- Minimize MSE between \(\boldsymbol{\epsilon}\) and \(\hat{\boldsymbol{\epsilon}}\).
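A minimal PyTorch sketch of that loop, assuming a hypothetical U-Net-style `model(x_t, t)` that returns a noise prediction of the same shape as its input, an `optimizer` over its parameters, and the `alpha_bars` schedule from the forward-process snippet:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0: torch.Tensor, alpha_bars: torch.Tensor) -> float:
    """One DDPM training step: add noise in closed form, predict it, regress it with MSE (L_simple)."""
    alpha_bars = alpha_bars.to(x0.device)
    T = alpha_bars.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # a random timestep per image
    eps = torch.randn_like(x0)                                  # the noise the model must recover

    # Closed-form jump to x_t (same expression as q_sample above).
    sqrt_ab = alpha_bars[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
    x_t = sqrt_ab * x0 + sqrt_one_minus_ab * eps

    eps_hat = model(x_t, t)                 # U-Net prediction of the noise
    loss = F.mse_loss(eps_hat, eps)         # unweighted MSE, i.e. L_simple

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Calling `train_step` on batches drawn from any image dataloader is all the training procedure requires; there is no discriminator, adversarial balancing, or annealing schedule to tune.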
Experimental Results
The authors evaluated DDPMs on CIFAR10, CelebA-HQ, and LSUN.
Table 1: CIFAR10 scores. DDPMs achieve state-of-the-art FID (3.17) and competitive Inception Score (9.46).
FID (Fréchet Inception Distance, lower is better) and IS (Inception Score, higher is better) metrics confirm DDPM’s top-tier fidelity.
On higher-resolution data:
Figure 3: LSUN Church samples. FID = 7.89.
Figure 4: LSUN Bedroom samples. FID = 4.90.
Ablation: Why \(L_{\text{simple}}\) Matters
Predicting \(\epsilon\) with the simplified loss beats all alternatives.
The baseline mean-prediction approach with the full variational bound performs notably worse.
Rate–Distortion Analysis: Great Images, Mediocre Likelihoods?
While DDPMs generate outstanding images, their log likelihoods lag behind other likelihood-based models.
The authors suggest DDPMs act as excellent lossy compressors: most bits model perceptually important large features; fewer bits target imperceptible detail.
Progressive Decoding:
At any reverse step \(t\), we can estimate the clean image directly from \(\mathbf{x}_t\) and the predicted noise:
\[ \hat{\mathbf{x}}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t, t)\right) \]
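In code, this is a one-liner on top of the earlier schedule (with `eps_hat` standing in for the output of the hypothetical `eps_model` at step `t`):

```python
import torch

def predict_x0(x_t: torch.Tensor, t: torch.Tensor, eps_hat: torch.Tensor,
               alpha_bars: torch.Tensor) -> torch.Tensor:
    """Estimate the clean image x_0 from a noisy x_t and the predicted noise eps_hat."""
    sqrt_ab = alpha_bars[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
    return (x_t - sqrt_one_minus_ab * eps_hat) / sqrt_ab
```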
Figure 5: Distortion vs. rate — most perceptible structure is reconstructed at low rates.
Visualization: Progressive Generation
As the reverse process runs, large-scale features emerge before fine details:
Figure 6: DDPM generation over time. Coarse shapes first, sharper details later.
Conclusion and Impact
The Denoising Diffusion Probabilistic Models paper marked a turning point in generative modeling:
- State-of-the-art quality with stable training — rivaling top GANs without their instability.
- Noise prediction (\(\epsilon\)-parameterization) and a simplified loss were key to unlocking quality.
- Progressive generation reveals deep links to denoising score matching, Langevin dynamics, and lossy compression.
These ideas form the basis of modern diffusion models used in systems like Stable Diffusion and DALL·E 2, extending far beyond images to audio, video, and multimodal generation.
By showing how reversing noise can lead to high-fidelity generation, this work did more than release a new model — it inspired a new lens through which to view creation itself.