The Paradox of Perfection: Why Flawed Models are Creative
If you have ever played with generative AI tools like Stable Diffusion or Midjourney, you have witnessed a form of digital magic. You type a prompt, or provide random noise, and the system dreams up an image that has likely never existed before. It is original. It is creative.
But here lies a massive theoretical problem.
At their core, these diffusion models are trained to learn the probability distribution of their training data. If a model does its job perfectly—if it learns the “ideal” score function that describes the data distribution exactly—theory tells us it should only be able to reproduce the training data. A perfect model should be a memorization machine, incapable of generating anything truly new.
So, where does the creativity come from?
In a fascinating paper titled “An analytic theory of creativity in convolutional diffusion models,” researchers Mason Kamb and Surya Ganguli from Stanford University propose a groundbreaking answer. They argue that creativity in these models doesn’t arise from their perfection, but from their limitations. Specifically, the inductive biases of locality (seeing only small patches of an image) and equivariance (treating all locations similarly) prevent the model from memorizing the data.
Instead, these constraints force the model to become a “patch mosaic” machine—stitching together bits and pieces of training data in exponentially many new combinations.
In this post, we will deconstruct their theory, walk through the mathematics of the “Equivariant Local Score Machine,” and see how this analytic theory can predict the exact output of deep neural networks without training a single weight.
Part 1: The Trap of the Ideal Score
To understand why creativity is a puzzle, we first need to look at how diffusion models work mathematically.
The Forward and Reverse Process
Diffusion models operate on a simple premise: destroy data, then learn to fix it.
- Forward Process: We take an image from our training set and slowly add Gaussian noise to it over time (\(t=0\) to \(t=T\)). Eventually, the image becomes pure static.
- Reverse Process: We train a neural network to look at a noisy image and predict the “score function,” which effectively points in the direction of the original data (denoising).
Figure 1: The standard diffusion process. We turn training images (left) into noise (right), and train a model to reverse the process.
The reverse process is governed by a differential equation. To generate an image, we sample random noise and evolve it backward in time using the score function \(s_t(\phi)\).
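To make the two processes concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code) of the forward noising step and a reverse-time Euler step that uses a score function to walk noise back toward data. It assumes a variance-exploding noise scale \(\sigma\) and, for simplicity, a score derived from a single toy "training image":

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random((8, 8))            # stand-in for a single training image

def score_fn(x, sigma):
    """Score of a single Gaussian centered on `target`: it points from x back toward the data."""
    return (target - x) / sigma**2

def forward_noise(x0, sigma, rng):
    """Forward process: corrupt a clean image with Gaussian noise of scale sigma."""
    return x0 + sigma * rng.standard_normal(x0.shape)

def reverse_ode_step(x, sigma_hi, sigma_lo, score_fn):
    """One Euler step of the probability-flow ODE, dx/d(sigma) = -sigma * score(x, sigma),
    integrated from sigma_hi down to sigma_lo (i.e., removing a little noise)."""
    return x + (sigma_hi - sigma_lo) * sigma_hi * score_fn(x, sigma_hi)

noisy = forward_noise(target, 5.0, rng)             # forward: image -> static
sigmas = np.geomspace(20.0, 0.01, 100)              # noise schedule, high to low (illustrative)
x = sigmas[0] * rng.standard_normal(target.shape)   # reverse: start from (roughly) pure static
for s_hi, s_lo in zip(sigmas[:-1], sigmas[1:]):
    x = reverse_ode_step(x, s_hi, s_lo, score_fn)
# x now sits close to `target`: the score has steered noise back toward the data.
```

With many training images, the only thing that changes is the score function itself, and that is exactly where the memorization-versus-creativity question lives.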
Why the Ideal Score Memorizes
The “score function” is just the gradient of the log-probability of the noisy data at time \(t\), taken with respect to the image. If we have a finite training set \(\mathcal{D}\), the true distribution of our noisy data at any time \(t\) is actually a mixture of Gaussians, with one Gaussian centered at each training point.
For this Gaussian mixture, the ideal score function has a closed form:

\[
s_t(\phi) \;=\; \frac{1}{\sigma_t^2}\sum_{\varphi \in \mathcal{D}} W_t(\varphi \mid \phi)\,(\varphi - \phi),
\qquad
W_t(\varphi \mid \phi) \;=\; \frac{\exp\!\big(-\lVert \phi - \varphi \rVert^2 / 2\sigma_t^2\big)}{\sum_{\varphi' \in \mathcal{D}} \exp\!\big(-\lVert \phi - \varphi' \rVert^2 / 2\sigma_t^2\big)}
\]
This equation has a profound interpretation, often called the Bayesian guessing game.
- The term \(W_t(\varphi | \phi)\) represents a posterior belief: “Given the current noisy image \(\phi\), what is the probability it started as training image \(\varphi\)?”
- The score function is a weighted average that pulls the current image \(\phi\) toward every training image \(\varphi\), weighted by how likely it is that \(\phi\) originated from \(\varphi\).
The Problem: As the reverse process runs and noise is removed, the posterior weight \(W_t\) rapidly collapses. The model becomes 99.99% sure that the current image belongs to one specific training example, and the score then pulls the image directly toward that single training example.
The result? Perfect memorization. A diffusion model that learns the ideal score function perfectly cannot be creative; it can only act as a lookup table for the training set.
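A tiny NumPy sketch (again my own, with made-up toy data) makes the collapse concrete: the same ideal-score computation produces diffuse posterior weights at high noise and near one-hot weights at low noise.

```python
import numpy as np

def ideal_score(phi, train_set, sigma):
    """Ideal score over a finite training set: a softmax-weighted pull toward every training image."""
    diffs = train_set - phi                                     # (N, H, W): varphi - phi for each image
    dists = np.sum(diffs.reshape(len(train_set), -1) ** 2, axis=1)
    W = np.exp(-(dists - dists.min()) / (2 * sigma**2))
    W /= W.sum()                                                # posterior weights W_t(varphi | phi)
    score = np.tensordot(W, diffs, axes=1) / sigma**2           # weighted average pull
    return score, W

rng = np.random.default_rng(1)
train_set = rng.random((5, 8, 8))                               # five toy "training images"
phi = train_set[2] + 0.1 * rng.standard_normal((8, 8))          # lightly noised copy of image 2

_, W_high = ideal_score(phi, train_set, sigma=5.0)              # early in generation: high noise
_, W_low = ideal_score(phi, train_set, sigma=0.1)               # late in generation: low noise
print(np.round(W_high, 3))   # roughly spread across the training set
print(np.round(W_low, 3))    # ~[0, 0, 1, 0, 0]: the score now points straight at image 2
```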
Part 2: The Constraints that Create
Since real-world diffusion models do generate novel images, they clearly are not learning the ideal score function. They are failing. But they are failing in a very specific, structured way.
The authors identify two specific inductive biases present in Convolutional Neural Networks (CNNs) that break this perfect memorization:
- Locality: CNNs process images using small filters (kernels). A pixel’s value is updated based only on its immediate neighbors (its “receptive field”), not the whole image at once.
- Equivariance: CNNs share weights across the image. A “vertical edge detector” in the top-left corner works exactly the same way as one in the bottom-right. The model doesn’t inherently know where it is looking, only what it is looking at.
To test this hypothesis, the authors derive the mathematically optimal score functions subject to these two constraints. They call these theoretical constructs Score Machines.
The Three Machines
Figure 2: Visualizing the logic of different Score Machines. (a) The Ideal Score (IS) Machine maps the whole image to a single training image. (b) The Local Score (LS) Machine maps local patches to training patches at the same location. (c) The Equivariant Local Score (ELS) Machine maps patches to training patches from ANY location.
1. The Ideal Score (IS) Machine
This is the memorizer we discussed above. It looks at the whole image and pulls it toward the nearest global training image.
2. The Local Score (LS) Machine
This machine is constrained by locality. It breaks the image into small patches (e.g., \(3 \times 3\) pixels). For each patch, it runs the Bayesian guessing game independently.
- Constraint: It assumes a patch at location \((x,y)\) must come from a training patch at the exact same location \((x,y)\).
- Result: It creates a “Frankenstein” image where the top-left corner might come from Training Image A, and the bottom-right from Training Image B. However, because it’s tied to absolute coordinates, its creativity is limited.
3. The Equivariant Local Score (ELS) Machine
This is the breakthrough. This machine is constrained by both locality and equivariance.
- Constraint: It looks at a local patch, but because of equivariance (weight sharing), it loses the concept of absolute position. It asks: “Which patch in the entire training set does this look like?”
- Result: It pulls the current patch toward similar patches found anywhere in the training set.
This leads to Combinatorial Creativity. The ELS machine can take a texture from the corner of Image A, a shape from the center of Image B, and an edge from Image C, and stitch them together into a brand new Patch Mosaic.
Part 3: The Mathematics of the ELS Machine
The authors provide an analytic solution for the ELS machine. This is remarkable because it means we don’t need to train a network to see what it does; we can just compute it directly from the training data.
The ELS score function for pixel \(x\) is defined as (Equation 1):

\[
s_t(\phi)_x \;=\; \frac{1}{\sigma_t^2}\left(\sum_{\varphi \in P_{\Omega}(\mathcal{D})} W_t(\varphi \mid \phi_{\Omega_x})\,\varphi_c \;-\; \phi_x\right)
\]

where \(\varphi_c\) denotes the center pixel of the training patch \(\varphi\).
And the weights (the belief state) are calculated as (Equation 2):

\[
W_t(\varphi \mid \phi_{\Omega_x}) \;=\; \frac{\exp\!\big(-\lVert \phi_{\Omega_x} - \varphi \rVert^2 / 2\sigma_t^2\big)}{\sum_{\varphi' \in P_{\Omega}(\mathcal{D})} \exp\!\big(-\lVert \phi_{\Omega_x} - \varphi' \rVert^2 / 2\sigma_t^2\big)}
\]
Let’s break this down:
- \(\Omega_x\): This represents the local neighborhood (patch) around pixel \(x\).
- \(P_{\Omega}(\mathcal{D})\): This is the set of all possible patches extracted from the training set \(\mathcal{D}\).
- The Mechanism:
- The machine looks at your current noisy patch \(\phi_{\Omega_x}\).
- It compares it to every patch \(\varphi\) in the training set (Equation 2).
- It calculates a probability \(W_t\): “How likely is it that my current noisy patch is a noised version of training patch \(\varphi\)?”
- It then updates the pixel by taking a weighted average of the center pixels of those matching training patches (Equation 1).
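Because the ELS machine is fully analytic, one denoising pass can be written directly against the training patches. The sketch below is a plain NumPy illustration of Equations 1 and 2 under simplifying assumptions of my own (a fixed odd patch size \(p\), a variance-exploding noise scale \(\sigma\), no boundary handling); it is not the authors' implementation:

```python
import numpy as np

def extract_patches(images, p):
    """P_Omega(D): every p x p patch from every training image, flattened to a vector."""
    n, h, w = images.shape
    patches = [
        images[i, y:y + p, x:x + p].ravel()
        for i in range(n)
        for y in range(h - p + 1)
        for x in range(w - p + 1)
    ]
    return np.stack(patches)                          # shape (num_patches, p*p)

def els_denoise_step(phi, train_patches, p, sigma):
    """One ELS update for every pixel whose full p x p neighborhood fits inside the image:
    match the local patch against all training patches (Eq. 2), then move the center pixel
    toward the weighted average of the training patches' center pixels (Eq. 1)."""
    h, w = phi.shape
    centers = train_patches[:, (p * p) // 2]          # center pixel of every training patch
    out = phi.copy()
    for y in range(h - p + 1):
        for x in range(w - p + 1):
            local = phi[y:y + p, x:x + p].ravel()                 # phi_{Omega_x}
            dists = np.sum((train_patches - local) ** 2, axis=1)
            W = np.exp(-(dists - dists.min()) / (2 * sigma**2))   # W_t, numerically stabilized
            W /= W.sum()
            target = W @ centers                                  # weighted average of centers
            score = (target - phi[y + p // 2, x + p // 2]) / sigma**2
            out[y + p // 2, x + p // 2] += sigma**2 * score       # full denoising move (illustrative)
    return out
```

In a full sampler, this score would simply replace the ideal score in the same reverse-time update, with \(\sigma_t\) shrinking over the course of generation.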
The “Patch Mosaic” Effect
Because every pixel performs this calculation independently based on its local neighbors, the image evolves into a mosaic.
Figure 3: A simple proof of concept. (a) The training set is just two images: all black and all white. (b) An ELS machine generates “clouds.” Locally, every \(3\times3\) patch is consistent (mostly black or mostly white), but globally, they form new shapes not seen in the training set.
The theorem derived in the paper states that the ELS machine converges to Locally Consistent Points. A generated image is valid if every local patch inside it looks like some patch from the training set, even if the global arrangement is totally new.
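That endpoint condition is easy to check numerically: an image is locally consistent if every patch is close to some patch in the training set. A short sketch, assuming `train_patches` was built by something like the hypothetical `extract_patches` helper above:

```python
import numpy as np

def is_locally_consistent(image, train_patches, p, tol=1e-3):
    """True if every p x p patch of `image` is within `tol` (per-pixel RMS) of some training patch."""
    h, w = image.shape
    for y in range(h - p + 1):
        for x in range(w - p + 1):
            patch = image[y:y + p, x:x + p].ravel()
            nearest = np.min(np.sum((train_patches - patch) ** 2, axis=1))
            if np.sqrt(nearest / (p * p)) > tol:
                return False
    return True
```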
Part 4: Does Theory Match Reality?
It is one thing to derive a math equation; it is another to prove that deep learning models actually behave this way. The authors compared their analytic ELS Machine against real, trained ResNets and UNets on datasets like MNIST, CIFAR10, and CelebA.
The results are startlingly accurate.
1. Case-by-Case Prediction
The authors fed the same random noise into their analytic ELS theory and a trained Neural Network.
Figure 4: Side-by-side comparison. The “Theory” columns are generated by the mathematical formula (ELS Machine). The “CNN” columns are generated by a trained neural network. The resemblance is uncanny.
Figure 5: Further comparisons. (a) ResNet on MNIST. (b) UNet on MNIST. The theory predicts the output of the black-box neural network with pixel-perfect precision in many cases.
Quantitatively, the ELS machine predicts the output of the trained networks with a median \(R^2\) (coefficient of determination) of roughly 0.95. This means 95% of the variance in the neural network’s creative output is explained purely by the ELS mechanism: mixing and matching local patches.
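For reference, the per-image statistic behind that number is the ordinary coefficient of determination between the theory's output and the network's output; a minimal version (not code from the paper) looks like this:

```python
import numpy as np

def r_squared(theory_img, cnn_img):
    """Coefficient of determination: fraction of variance in the CNN output explained by the theory."""
    ss_res = np.sum((cnn_img - theory_img) ** 2)
    ss_tot = np.sum((cnn_img - cnn_img.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```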
2. The Role of Boundaries
You might wonder: if the ELS machine has no concept of position (equivariance), how does it generate coherent faces in CelebA? Why doesn’t it put an eye where the chin should be?
The answer lies in Zero-Padding.
CNNs usually pad images with zeros at the borders. This seemingly minor implementation detail breaks perfect equivariance. A patch at the top-left corner sees a bunch of zeros above and to the left of it. A patch in the center does not.
Figure 6: Breaking equivariance with boundaries. (Left) A center patch matches against the whole image. (Right) A corner patch only matches against training patches that also have corner padding.
This allows the ELS machine (and the neural net) to anchor the image. It knows to put “top-left corner” patches in the top-left. The authors call this the Boundary-Broken ELS Machine, and it fits the data even better.
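The effect is easy to reproduce: zero-pad an image the way a convolutional layer would, and a patch taken at the corner carries a band of exact zeros that an interior patch never has. A small sketch of this idea (my own illustration, not the paper's boundary-broken machine):

```python
import numpy as np

def padded_patch(image, y, x, p):
    """Take a p x p patch from a zero-padded copy of the image, so border patches contain explicit zeros."""
    pad = p // 2
    padded = np.pad(image, pad)            # zero-padding, as in most convolutional layers
    return padded[y:y + p, x:x + p]

rng = np.random.default_rng(2)
img = 1.0 + rng.random((8, 8))             # strictly positive image, to contrast with the padding
corner = padded_patch(img, 0, 0, p=3)      # top-left patch: its top row and left column are zeros
center = padded_patch(img, 4, 4, p=3)      # interior patch: no zeros anywhere
print(corner)
print(center)
# The zero band is a positional fingerprint: a noisy top-left patch can only match
# training patches that were themselves taken from the top-left corner.
```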
3. Coarse-to-Fine Generation
The researchers found that to fit the neural networks perfectly, they couldn’t use a fixed patch size.
- At the start of generation (high noise), the network acts like it has a large receptive field.
- At the end of generation (low noise), the network acts very locally (small receptive field).
Figure 7: (a) The effective receptive field of a trained network shrinks as time progresses. (b) The calibrated patch size \(P\) for the theory follows the same trend.
This explains the “coarse-to-fine” behavior of diffusion. First, the model establishes the global structure (using large patches), and then it refines the texture (using small patches).
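In code, the only time-dependent knob this adds to the ELS sketch above is the patch size, which shrinks as the noise level falls. The schedule below is purely illustrative (a hypothetical log-linear interpolation); the paper instead calibrates the patch size against the network's measured receptive field:

```python
import numpy as np

def patch_size_schedule(sigma, sigma_max=20.0, p_max=11, p_min=3):
    """Shrink the effective patch size as the noise level sigma falls (coarse-to-fine).
    The log-linear interpolation is illustrative, not the paper's calibration procedure."""
    frac = np.clip(np.log(sigma) / np.log(sigma_max), 0.0, 1.0)
    p = int(round(p_min + frac * (p_max - p_min)))
    return p if p % 2 == 1 else p + 1          # keep the patch size odd so it has a center pixel

for s in [20.0, 5.0, 1.0, 0.1]:
    print(s, patch_size_schedule(s))           # 11 at high noise, down to 3 as the noise is removed
```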
Part 5: Explaining the Glitches
One of the strongest pieces of evidence for a scientific theory is its ability to explain anomalies. We have all seen AI art fails: a hand with six fingers, a person with three arms, or a shirt with two necklines.
The ELS theory predicts exactly this.
Because the model operates locally, a patch at location \(A\) decides to become a “sleeve.” A patch at location \(B\) (far away) also decides to become a “sleeve.” Because they are outside each other’s receptive fields, they don’t coordinate to say “Wait, this shirt already has a sleeve.”
Figure 8 (Panel c): Look at the bottom row. Both the Theory (Left) and the trained CNN (Right) generate a shirt with three arms. The theory mechanistically explains why: excessive locality at late times in the generation process.
This confirms that these spatial inconsistencies are not random bugs; they are fundamental artifacts of the ELS mechanism driving the creativity.
Conclusion: The Role of Attention
The theory presented here explains Convolutional Neural Networks (ResNets, standard UNets) almost perfectly. But modern state-of-the-art models (like Stable Diffusion) use Self-Attention.
Attention mechanisms are non-local. They allow every pixel to “talk” to every other pixel, regardless of distance. Does the ELS theory break down?
The authors tested this by comparing their local ELS theory against a UNet with Self-Attention (UNet+SA).
Figure 9: The top row is the Attention model. The bottom row is the ELS theory. Attention helps “carve out” coherent objects.
The local theory still predicts the texture and general shape (\(R^2 \approx 0.77\)), but Attention clearly adds a layer of semantic coherence. As seen in Figure 9, where the ELS machine makes a blob of fur, the Attention model carves out a distinct animal. The ELS machine provides the raw “patch mosaic” material, and Attention sculpts it into a coherent object.
Summary
Kamb and Ganguli’s work provides a demystifying look into the “black box” of generative AI.
- Memorization is the default: Without constraints, diffusion models would just copy data.
- Constraints breed creativity: The limits of CNNs (locality and equivariance) force them to remix data rather than repeat it.
- We can predict the output: The behavior of these complex networks can be modeled by analytic equations that sum over training patches.
This paper suggests that the “creativity” we admire in AI is, effectively, a highly sophisticated form of collage—a locally consistent patch mosaic stitched together by the mathematics of probability.