Introduction
In the rapidly evolving landscape of Artificial Intelligence, generative models—from GANs to Diffusion models—have become incredibly adept at creating realistic images. When researchers release a new model, they typically attach a scorecard: a Fréchet Inception Distance (FID) or an Inception Score (IS). These metrics provide a single number indicating how “good” the generated images are compared to a reference dataset.
But a single number hides a multitude of sins. Two models might have the same FID score but fail in completely different ways. One might refuse to generate cats; the other might generate only cats. One might memorize the training data; the other might hallucinate wildly. Quantitative scores cannot answer the qualitative question: “What specifically is this model generating differently compared to the reference data?”
To answer this, we need a way to identify specific “clusters” or “modes” of images that appear frequently in one distribution but not in another. This is known as differential clustering. While methods exist to do this, they have historically been computationally expensive, often requiring calculations that grow cubically with the dataset size. This makes them useless for massive datasets like ImageNet.
In this post, we dive deep into a research paper that proposes a solution: Fourier-based Identification of Novel Clusters (FINC). This method utilizes the mathematical power of Random Fourier Features to perform differential clustering scalably, allowing us to peek inside the “black box” of generative models and see exactly what they are—and aren’t—producing.
The Problem: Why Scores Are Not Enough
Imagine you are training a generative model to produce images of animals. You train it on a balanced dataset of dogs, cats, and wildlife. After training, you run an evaluation metric like FID. The score is low (which is good), suggesting the model’s distribution is close to the real data.
However, visual inspection might reveal that your model is obsessed with Golden Retrievers and barely generates any Schnauzers. Or perhaps it generates “Frankenstein” animals that don’t exist in nature. Standard metrics aggregate these discrepancies into a global distance, losing the nuance of which specific types of samples are causing the drift.
To perform a fine-grained comparison, we need to solve a differential clustering problem. Given a “Test” distribution (our generative model) and a “Reference” distribution (the real data or another model), we want to identify clusters of samples that are overrepresented in the Test distribution.
The Scalability Bottleneck
Previous approaches like the Rarity score, Feature Likelihood Divergence (FLD), or Kernel-based Entropic Novelty (KEN) attempt to measure how “novel” or “rare” a sample is. While theoretically sound, they suffer from a major flaw: computational complexity.
Most spectral (eigenvector-based) clustering methods rely on constructing a Kernel Matrix. If you have \(n\) images, the matrix is size \(n \times n\). To find the clusters, you must perform an eigendecomposition, which typically costs \(O(n^3)\).
For a dataset like ImageNet (1.4 million images), an \(O(n^3)\) operation is out of the question. Even with a subset of 50,000 images, where the float32 kernel matrix alone occupies 10 GB, standard methods exhaust GPU memory or take days to compute. FINC removes this bottleneck by changing the mathematical foundation of how we estimate the data’s structure.
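To make the gap concrete, here is a back-of-the-envelope comparison (a rough sketch; exact numbers depend on precision and implementation) between storing a full kernel matrix and storing the fixed-size \(2r \times 2r\) covariance matrix that FINC works with instead, introduced in the next section:

```python
n = 250_000   # samples at the largest scale discussed below
r = 2_000     # random Fourier features (a user-chosen constant)

kernel_bytes = n * n * 4          # full float32 n x n kernel matrix
cov_bytes = (2 * r) ** 2 * 4      # float32 2r x 2r covariance matrix

print(f"kernel matrix:  {kernel_bytes / 1e9:,.0f} GB")  # 250 GB
print(f"2r x 2r matrix: {cov_bytes / 1e6:,.0f} MB")     # 64 MB
```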
Core Method: Fourier-based Identification of Novel Clusters (FINC)
The core idea of FINC is to compare the covariance structures of two distributions in a way that doesn’t depend on the number of samples (\(n\)), but rather on a fixed number of features (\(r\)).
1. The Kernel Covariance Matrix
To compare distributions, we operate in a “feature space.” We use a kernel function, typically a Gaussian kernel, which measures the similarity between two images \(\mathbf{x}\) and \(\mathbf{x}'\):
\[
k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^2}{2\sigma^2}\right)
\]
This kernel implicitly maps the data into a high-dimensional space. We can describe the “shape” of the data distribution in this space using an empirical Kernel Covariance Matrix:
\[
\widehat{C}_X = \frac{1}{n} \sum_{i=1}^{n} \phi(\mathbf{x}_i)\, \phi(\mathbf{x}_i)^\top
\]
Here, \(\phi(\mathbf{x}_i)\) represents the feature map of image \(i\). If we computed this directly using the “kernel trick,” we would end up with the massive \(n \times n\) matrix mentioned earlier.
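As a minimal illustration (toy NumPy code with made-up data, not the paper’s implementation), here is the naive route: building the full \(n \times n\) Gaussian kernel matrix whose eigendecomposition is exactly what becomes intractable at scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 64                          # toy sizes; real n is far larger
X = rng.normal(size=(n, d))             # stand-in for image embeddings
sigma = np.sqrt(d)                      # bandwidth on the scale of the data

# Pairwise squared distances, then the full n x n Gaussian kernel matrix.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Spectral methods need the eigendecomposition of K: O(n^3) time and
# O(n^2) memory -- the bottleneck FINC is designed to remove.
eigvals = np.linalg.eigvalsh(K)
print(K.shape, eigvals[-3:])            # (500, 500), three largest eigenvalues
```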
2. The Differential Clustering Objective
We aren’t just interested in the shape of one distribution; we want the difference between the Test distribution (\(\mathbf{X}\)) and the Reference distribution (\(\mathbf{Y}\)). The paper defines a “conditional” covariance matrix \(\Lambda_{\mathbf{X}|\rho\mathbf{Y}}\):
\[
\Lambda_{\mathbf{X}|\rho\mathbf{Y}} = \widehat{C}_X - \rho\, \widehat{C}_Y
\]
The parameter \(\rho\) (rho) serves as a threshold. It represents how much more frequent a mode must be in the Test distribution to be considered “novel.” If we find the eigenvectors of this matrix corresponding to positive eigenvalues, we effectively identify the “directions” (or clusters of images) that are dominant in \(\mathbf{X}\) but suppressed or absent in \(\mathbf{Y}\).
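As a quick intuition check (a stylized reading in the spirit of the paper’s mixture-model analysis, not a formal statement): if the two distributions are mixtures of well-separated modes, a mode appearing with frequency \(p\) in the test data and \(q\) in the reference contributes an eigenvalue of roughly
\[
\lambda \approx p - \rho\, q,
\]
which is positive precisely when \(p > \rho q\). With \(\rho = 2\), for example, only modes at least twice as frequent in the test distribution as in the reference are flagged as novel.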
3. The Solution: Random Fourier Features (RFF)
To make this scalable, the authors utilize Random Fourier Features. Bochner’s Theorem states that for a shift-invariant kernel (like Gaussian), the kernel function is the Fourier transform of a probability density function.
For the Gaussian kernel, this density is itself a Gaussian, giving
\[
k(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\omega \sim \mathcal{N}(\mathbf{0},\, \sigma^{-2} I)}\!\left[\cos\!\bigl(\omega^\top (\mathbf{x} - \mathbf{x}')\bigr)\right].
\]
This theorem allows us to approximate the infinite-dimensional feature map \(\phi(\mathbf{x})\) using a finite, low-dimensional map \(\widetilde{\phi}_r(\mathbf{x})\). Instead of comparing every image to every other image, we project every image into a fixed vector of size \(2r\) using random frequencies \(\omega_1, \dots, \omega_r\) sampled from a Gaussian distribution.
The approximate feature map looks like this:
\[
\widetilde{\phi}_r(\mathbf{x}) = \frac{1}{\sqrt{r}} \bigl[\cos(\omega_1^\top \mathbf{x}),\, \sin(\omega_1^\top \mathbf{x}),\, \dots,\, \cos(\omega_r^\top \mathbf{x}),\, \sin(\omega_r^\top \mathbf{x})\bigr]^\top
\]
By using this approximation, we can compute the covariance matrices \(\widetilde{C}_X\) and \(\widetilde{C}_Y\) which are only of size \(2r \times 2r\). Crucially, \(r\) is a number we choose (e.g., 2000), independent of the dataset size \(n\).
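A minimal NumPy sketch of this approximation (illustrative, not the paper’s code; it stacks all cosines before all sines, which leaves inner products unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 2000
sigma = np.sqrt(d)                       # Gaussian kernel bandwidth

def rff(X, omegas):
    """Map rows of X to 2r-dimensional random Fourier features."""
    proj = X @ omegas.T                  # (n, r) inner products omega^T x
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(r)

# Bochner: for the Gaussian kernel, frequencies follow N(0, sigma^-2 I).
omegas = rng.normal(scale=1.0 / sigma, size=(r, d))

x, y = rng.normal(size=(1, d)), rng.normal(size=(1, d))
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
approx = float(rff(x, omegas) @ rff(y, omegas).T)
print(exact, approx)                     # the two values should nearly match
```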
4. The FINC Algorithm
The algorithm proceeds in three straightforward steps:
- Feature Extraction: Use a pre-trained network (like DINOv2) to turn images into vectors.
- Random Projection: Draw \(r\) random frequency vectors \(\omega\). Compute the sine and cosine projections for all \(n\) samples.
- Spectral Decomposition: Compute the approximate conditional covariance matrix:
\[
\widetilde{\Lambda}_{\mathbf{X}|\rho\mathbf{Y}} = \widetilde{C}_X - \rho\, \widetilde{C}_Y, \qquad \widetilde{C}_X = \frac{1}{n} \sum_{i=1}^{n} \widetilde{\phi}_r(\mathbf{x}_i)\, \widetilde{\phi}_r(\mathbf{x}_i)^\top
\]
We then simply find the eigenvectors of this \(2r \times 2r\) matrix. These eigenvectors represent the “novel modes.” We can then score every image in our dataset based on how well it aligns with these eigenvectors to see which images belong to which cluster.
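Putting the pieces together, here is a compact sketch of the pipeline (hypothetical helper names, reusing the `rff` map from the previous snippet; real usage would feed DINOv2 embeddings rather than raw arrays):

```python
import numpy as np

def rff(X, omegas):
    """2r-dimensional random Fourier features (cosines then sines)."""
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

def finc_novel_modes(X, Y, r=2000, sigma=8.0, rho=1.0, k=5, seed=0):
    """Top-k novel-mode directions of test X vs. reference Y, plus scores."""
    rng = np.random.default_rng(seed)
    omegas = rng.normal(scale=1.0 / sigma, size=(r, X.shape[1]))

    phi_X, phi_Y = rff(X, omegas), rff(Y, omegas)    # (n, 2r) and (m, 2r)

    # 2r x 2r covariance matrices: cost independent of the sample sizes.
    C_X = phi_X.T @ phi_X / len(X)
    C_Y = phi_Y.T @ phi_Y / len(Y)

    # Positive eigenvalues of C_X - rho * C_Y mark modes overrepresented in X.
    eigvals, eigvecs = np.linalg.eigh(C_X - rho * C_Y)  # ascending order
    top_vecs = eigvecs[:, ::-1][:, :k]                  # k largest eigenvalues

    scores = phi_X @ top_vecs     # (n, k): alignment of each test sample
    return eigvals[::-1][:k], top_vecs, scores
```

Sorting the test images by each column of `scores` and displaying the top few reproduces the kind of per-mode image grids shown in the paper’s figures.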
5. Theoretical Guarantees
Does this approximation actually work? The authors provide a theoretical bound proving that the error introduced by Random Fourier Features decreases as we increase \(r\).
Roughly speaking, to get an error within \(\epsilon\), the number of features \(r\) needs to grow logarithmically with the sample size \(n\). This is a massive improvement over the polynomial growth required by previous methods. It means FINC is statistically consistent and computationally efficient.
Experiments & Results
The researchers validated FINC on controlled datasets and massive real-world image datasets.
Sanity Check: Colored MNIST
To prove the method works, they created a “Test” dataset of handwritten digits where 50% of the digits were colored. The “Reference” dataset contained only grayscale digits. A good differential clustering algorithm should immediately identify the colored digits as the “novelty.”
As shown in Figure 1 of the paper, FINC successfully identified the top novel modes as the colored digits (e.g., red 1s, green 7s), distinguishing them from the grayscale background distribution.
Scalability vs. Baselines
The most impressive result is the computational speedup. The authors compared FINC against Rarity, FLD, and KEN scores on ImageNet.
Looking at Table 1, as the sample size grows to 50k, 100k, and 250k:
- Baselines (Rarity, FLD, KEN): they either crash with out-of-memory errors (reported as "-") or exceed 24 hours of computation.
- FINC: It scales linearly. For 250k samples, it finishes in roughly 3000 seconds on a CPU and just 74 seconds on a GPU. This capability unlocks the analysis of entire training sets.
Unveiling Generative Model Biases
The authors applied FINC to state-of-the-art generative models (BigGAN, LDM, DiT, etc.) trained on ImageNet. By setting the reference distribution as the actual ImageNet dataset, they could ask: “What does the model generate too much of?”
In Figure 3 of the paper, we see specific modes collapsed or overrepresented by different models:
- BigGAN tends to over-generate mushrooms (fungi).
- LDM (Latent Diffusion) has a strong bias toward generating Koalas.
- DiT-XL over-represents specific flower patterns.
This insight is invaluable for debugging models. An FID score of 10.5 wouldn’t tell you “Stop generating so many Koalas,” but FINC does.
Novelty Detection
Conversely, we can treat the generative model as the reference and the real dataset as the test to see what the model is missing (underrepresented modes), or compare two different generative models against each other.
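With the hypothetical `finc_novel_modes` helper sketched above, these variants are just a matter of swapping arguments (here `gen_feats`, `real_feats`, `model_a_feats`, and `model_b_feats` stand in for embedding arrays):

```python
# What does the model over-generate? Test = generated, reference = real.
_, _, over_scores = finc_novel_modes(gen_feats, real_feats)

# What does the model miss? Swap the roles of test and reference.
_, _, missing_scores = finc_novel_modes(real_feats, gen_feats)

# Two generators can be compared against each other the same way.
_, _, ab_scores = finc_novel_modes(model_a_feats, model_b_feats)
```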
For example, consider AFHQ (Animal Faces) as the test distribution and ImageNet-Dogs as the reference.
FINC correctly identifies that the “novelty” in AFHQ compared to ImageNet Dogs is… Cats and Wildlife. It separates these distinct visual clusters cleanly.
Detecting Memorization
A critical issue in Generative AI is memorization—when a model simply copies training data rather than learning to generalize. FINC can cluster samples that have high memorization scores.

Figure 11 of the paper shows clusters of generated images that are highly memorized. For example, BigGAN has memorized “Hamburgers” and “Yellow Fields.” The bottom half of that figure shows the closest training samples, confirming that the model is essentially regurgitating specific training examples.
Conclusion and Implications
The “Unveiling Differences in Generative Models” paper presents a significant step forward in how we evaluate AI. By moving beyond scalar metrics like FID and solving the differential clustering problem with Random Fourier Features, FINC offers a tool that is both interpretable and scalable.
Key Takeaways:
- Specifics over Aggregates: We can now identify exactly which types of images a model over-produces or under-produces.
- Scalability: FINC turns an intractable \(O(n^3)\) problem into a manageable linear-time operation, making it feasible for modern, massive datasets.
- Versatility: The method works for finding novel modes, detecting mode collapse, identifying memorization, and even debugging text-to-image alignment issues.
As generative models continue to grow in size and capability, tools like FINC will be essential. They allow engineers and researchers to move from asking “Is this model good?” to the more important question: “What does this model actually know?”