In the era of Generative AI, we have grown accustomed to a simple truth known as “Scaling Laws”: if you want a better model, you need to train it on more data, with more parameters, for a longer time. This recipe has driven the explosive success of Large Language Models (LLMs) and diffusion models alike.
But recently, a new frontier has opened up in the LLM world. Researchers have discovered that you don’t always need a bigger model to get a smarter answer; sometimes, you just need to let the model “think” longer during inference. Techniques like Chain-of-Thought reasoning or tree search allow models to scale their performance after training is complete, simply by spending more compute to generate the response.
This brings us to a fascinating question: Can we apply this same logic to Diffusion Models?
Currently, the standard way to generate an image is to run a diffusion model for a fixed number of “denoising steps.” We know that increasing these steps helps, but only up to a point. After a few dozen steps, the quality plateaus. In a recent CVPR paper, researchers from NYU, MIT, and Google propose a new framework to break through this ceiling. They treat image generation not just as a denoising process, but as a search problem.
In this post, we will break down their framework for “Scaling Inference Time Compute,” explore how searching for the “perfect noise” can yield better images than simply running a larger model, and analyze the trade-offs involved.
The Problem: The Inference Plateau
Diffusion models work by reversing a noise process. They start with a random Gaussian noise vector and progressively remove that noise to reveal a clean image. This process is governed by differential equations—specifically Ordinary Differential Equations (ODEs) or Stochastic Differential Equations (SDEs).
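The specific equation from the paper is not reproduced here, but a standard way to write the reverse-time dynamics (following the general diffusion-ODE literature, not necessarily the paper’s exact parameterization) is the probability-flow ODE:

$$
\frac{dx_t}{dt} = f(x_t, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_{x_t} \log p_t(x_t),
$$

where \(f\) and \(g\) define the forward noising process, and the score \(\nabla_{x_t} \log p_t(x_t)\) is approximated by the trained denoising network. Solving this from \(t = T\) (pure noise) down to \(t = 0\) yields a clean image.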

The variable \(t\) represents time (or, equivalently, the noise level). As the solver steps through time, it removes noise from the image. Naturally, one might assume that to get a better image, we should just increase the number of steps (i.e., the resolution of the ODE/SDE solver).
However, empirical evidence shows a harsh reality: returns diminish quickly. Once you reach a sufficient number of steps (often between 50 and 100), adding more computation barely changes the output quality. The model has converged. If we want to scale inference compute effectively, we need a strategy that goes beyond just “denoising more.”
The authors of this paper propose that the key lies in the initial noise. In diffusion models, the random noise you start with determines the final image. It turns out that not all noise is created equal—some random seeds naturally lead to high-quality, aesthetically pleasing images, while others lead to artifacts or boring compositions.
If we can spend our compute budget searching for those “golden” noise vectors rather than just denoising the first one we find, we can unlock a new trajectory of scaling.

As shown in Figure 1 above, the standard method (dashed line) flatlines. In contrast, the proposed “Search” method (solid line) continues to improve image quality (measured by FID and Aesthetic scores) as we invest more compute, measured in NFEs (Number of Function Evaluations).
The Framework: Verifiers and Algorithms
To turn image generation into a search problem, we need to formalize two things:
- Verifiers: How do we know if an image is “good”?
- Algorithms: How do we find the next candidate to test?
The researchers structure their design space along these two axes.
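To make these two axes concrete, here is a minimal sketch of the interfaces involved. The names (`Verifier`, `SearchAlgorithm`) and signatures are my own illustration, not the paper’s code:

```python
from typing import Callable, Protocol

import torch


class Verifier(Protocol):
    """Scores a generated image, optionally against its prompt. Higher is better."""

    def __call__(self, image: torch.Tensor, prompt: str | None = None) -> float: ...


class SearchAlgorithm(Protocol):
    """Uses a verifier and the diffusion sampler to propose/refine noise candidates."""

    def __call__(
        self,
        verifier: Verifier,
        sample_fn: Callable[[torch.Tensor], torch.Tensor],  # initial noise -> image
        candidates: torch.Tensor,  # batch of candidate noise vectors
    ) -> torch.Tensor: ...
```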
1. Verifiers
A verifier is simply a function that takes a generated image (and optionally the text prompt) and outputs a score.

The paper explores three types of verifiers:
- Oracle Verifiers: These are used for academic benchmarking (e.g., maximizing the exact metrics used for evaluation, like FID or Inception Score). They represent the “upper bound” of what search can achieve.
- Supervised Verifiers: These use pre-trained models to judge quality. For example, using a CLIP model to judge how well an image matches a prompt, or an Aesthetic Predictor to judge visual beauty (a code sketch of this type of verifier follows the list).
- Self-Supervised Verifiers: Interestingly, the authors found that sometimes you don’t need external labels. You can measure internal consistency—for example, how similar the model’s prediction at a high noise level is to the final clean image.
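As an illustration of a supervised verifier, here is a rough sketch of a CLIP-based scorer built on Hugging Face `transformers`. This is my own minimal example, not the paper’s implementation; aesthetic or self-supervised verifiers would expose the same interface:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


class CLIPVerifier:
    """Hypothetical supervised verifier: scores prompt-image alignment with CLIP."""

    def __init__(self, name: str = "openai/clip-vit-large-patch14"):
        self.model = CLIPModel.from_pretrained(name).eval()
        self.processor = CLIPProcessor.from_pretrained(name)

    @torch.no_grad()
    def __call__(self, image: Image.Image, prompt: str) -> float:
        inputs = self.processor(text=[prompt], images=image, return_tensors="pt")
        out = self.model(**inputs)
        # logits_per_image is the (temperature-scaled) image-text similarity.
        return out.logits_per_image.item()
```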
2. Algorithms
Once we have a way to score images, we need a strategy to find the high-scoring ones. The paper defines algorithms as functions that take the verifier, the model, and a set of candidates, and output a new set of refined candidates.

The authors propose three distinct search strategies, visualized below; a code sketch of the first two follows the list:

- Random Search (Left): This is the simplest approach, often called “Best-of-N.” You sample \(N\) different random noise vectors, generate images for all of them, score them, and pick the winner. It explores the global space well but doesn’t refine specific images.
- Zero-Order Search (Center): This is a local refinement strategy. You start with a noise vector, then sample \(N\) candidates in its neighborhood (slightly perturbed versions). You pick the best one and repeat the process. It’s like climbing a hill in the noise landscape without needing to calculate expensive gradients.
- Search over Paths (Right): This is the most complex. Instead of just searching for the initial noise, the algorithm branches out at intermediate steps of the denoising process. It allows the model to correct course mid-generation.
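Here is a rough sketch of the first two strategies, assuming a `sample_fn` that runs the full denoising process (noise in, image out) and a `verifier` that returns a scalar score. The perturbation and renormalization details are my own simplification, not the paper’s exact procedure:

```python
import torch


@torch.no_grad()
def random_search(verifier, sample_fn, n_candidates, shape, device="cuda"):
    """Best-of-N: draw N independent noise vectors, keep the highest-scoring image."""
    best_score, best_image = -float("inf"), None
    for _ in range(n_candidates):
        noise = torch.randn(shape, device=device)
        image = sample_fn(noise)
        score = verifier(image)
        if score > best_score:
            best_score, best_image = score, image
    return best_image, best_score


@torch.no_grad()
def zero_order_search(verifier, sample_fn, shape, n_iters=10, n_neighbors=4,
                      step=0.1, device="cuda"):
    """Hill-climb in noise space: perturb the current pivot, keep the best neighbor."""
    pivot = torch.randn(shape, device=device)
    best_score = verifier(sample_fn(pivot))
    for _ in range(n_iters):
        for _ in range(n_neighbors):
            candidate = pivot + step * torch.randn(shape, device=device)
            # Rescale so the candidate keeps roughly the norm of a Gaussian sample.
            candidate = candidate * (pivot.norm() / candidate.norm())
            score = verifier(sample_fn(candidate))
            if score > best_score:
                best_score, pivot = score, candidate
    return sample_fn(pivot), best_score
```

Search over Paths would additionally branch at intermediate noise levels inside `sample_fn`, which requires access to the sampler’s internal state, so it is omitted from this sketch.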
Experimental Results
Does this actually work? The authors conducted extensive experiments on ImageNet (class-conditional generation) and text-to-image benchmarks.
1. Random Search is surprisingly effective
On ImageNet, using simple Random Search with “Oracle” verifiers (optimizing directly for FID or Inception Score) yields massive improvements.

In Figure 3, notice the steep improvement curves. By generating more candidates and picking the best, the FID (lower is better) drops significantly, and the Inception Score (IS, higher is better) skyrockets. This confirms that the latent space of diffusion models is packed with higher-quality samples that we usually miss just by taking the first random sample.
2. The Danger of “Verifier Hacking”
When moving to real-world scenarios (where we don’t have an Oracle), the choice of verifier becomes critical. The researchers tested using CLIP and DINO (computer vision models) as verifiers.

The results in Figure 4 (left/top in the image above) show that these verifiers do improve metrics, but there’s a catch. If a verifier is not perfectly aligned with human perception, the search algorithm might “hack” it—finding images that score high on the metric but look weird or lack diversity. This is similar to “reward hacking” in Reinforcement Learning.
Interestingly, Figure 5 (right/bottom) highlights the potential of Self-Supervised Verifiers. By simply measuring the feature similarity between the model’s early prediction and the final result, they achieved strong scaling performance without needing any external conditioning information.
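A self-supervised verifier of this kind could look roughly like the sketch below. I’m assuming a generic `feature_extractor` (e.g., a DINO or CLIP image encoder) and access to the sampler’s intermediate clean-image prediction; this is a simplified stand-in for the paper’s exact formulation:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def self_supervised_score(feature_extractor, x0_early, x_final):
    """Cosine similarity between features of the model's early x0-prediction and the
    final image. Higher similarity is treated as evidence of a 'stable' sample."""
    f_early = feature_extractor(x0_early).flatten(1)
    f_final = feature_extractor(x_final).flatten(1)
    return F.cosine_similarity(f_early, f_final, dim=-1).mean().item()
```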
3. Comparing Algorithms
Is it worth using complex iterative algorithms over simple Random Search?

Figure 6 suggests that while local search methods (Zero-Order and Search over Paths) are effective, they behave differently from Random Search. Random Search is essentially a “shotgun” approach that is great for diversity. Zero-Order search (ZO) is efficient at polishing a specific sample (finding a local maximum), which can be better when you want to refine a specific concept rather than explore new ones.
4. Visualizing the Difference
The most compelling argument for search comes from looking at the images themselves.

Figure 7 is crucial for understanding the qualitative difference:
- Top Row (Increasing Denoising Steps): The image becomes cleaner and sharper, but the fundamental composition (a lighthouse) remains static.
- Bottom Rows (Increasing Search): The model explores different lighting, compositions, and styles. Look at the “Hourglass” or the “Teddy Bear.” The search process finds samples with dramatic lighting or more interesting accessory details (like the headphones on the bear) that a single random sample missed.
Scaling in Text-to-Image Generation
The authors applied this framework to a state-of-the-art text-to-image model, FLUX.1-dev. They used a “Verifier Ensemble”—combining Aesthetic scores, CLIP scores, and ImageReward—to robustly judge image quality.
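The exact combination rule isn’t spelled out in this post, but one plausible way to combine heterogeneous verifiers is to z-normalize each metric across the candidate batch before averaging, so that no single score scale dominates (the equal weighting here is my own assumption):

```python
import torch


@torch.no_grad()
def ensemble_score(images, prompt, verifiers, weights=None):
    """Combine several verifiers (e.g., aesthetic, CLIP score, ImageReward) into one
    ranking over a batch of candidate images."""
    weights = weights or [1.0 / len(verifiers)] * len(verifiers)
    total = torch.zeros(len(images))
    for w, verifier in zip(weights, verifiers):
        raw = torch.tensor([verifier(img, prompt) for img in images])
        z = (raw - raw.mean()) / (raw.std() + 1e-8)  # normalize this metric's scale
        total += w * z
    return total  # the winner is images[int(total.argmax())]
```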

Figure 8 shows the relative improvement. Using the Verifier Ensemble (the rightmost group) yielded consistent improvements across all metrics, including an independent LLM-based evaluation. This highlights that for complex text-to-image tasks, relying on a single metric (like just Aesthetic score) can lead to trade-offs, whereas an ensemble provides a balanced scaling path.
Furthermore, this search approach is compatible with other alignment techniques. As shown in Table 2, applying search to a model that has already been fine-tuned with Direct Preference Optimization (DPO) yields even further gains.

Small Models with Search vs. Large Models
Perhaps the most practical finding for students and practitioners is the efficiency trade-off. Can a small model with search beat a large model without it?

Figure 10 provides a resounding “Yes.” Look at the green line (SiT-L with search) crossing the orange line (SiT-XL without search).
- Key Takeaway: With a fixed compute budget, it is often better to use a smaller, faster model and spend the extra compute on searching for good samples, rather than spending that compute on a single pass of a massive model.
This is further detailed in the per-iteration analysis:

Figure 9 shows that there is a “sweet spot” for how many denoising steps to use during the search (NFEs/iter). You don’t need to run the full expensive generation for every candidate you test. You can search cheaply and then generate the final winner with high quality.
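As a rough back-of-the-envelope illustration (these numbers are mine, not the paper’s): if each candidate is screened with a cheap, low-step pass and only the winner is re-generated at full quality, the total budget is roughly

$$
\text{NFE}_{\text{total}} \approx N \cdot s_{\text{search}} + s_{\text{final}},
$$

so screening \(N = 16\) candidates at \(s_{\text{search}} = 10\) steps and re-generating the winner at \(s_{\text{final}} = 50\) steps costs about 210 NFEs, comparable to just four full 50-step generations.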
Conclusion
This research effectively establishes an “inference-time scaling law” for diffusion models. Just as LLMs benefit from “thinking time,” diffusion models benefit from “searching time.”
The implications are significant:
- Flexibility: We can trade compute for quality dynamically. If you need a quick draft, run once. If you need a masterpiece, run a search for 10 minutes.
- Model Design: We might stop obsessing over training the absolute largest model possible and focus on training models that are efficient searchers.
- The Verifier Bottleneck: The limiting factor is no longer the generative model, but the verifier. As we develop better automated ways to judge image quality (that align with human preference), this search-based approach will only get more powerful.
For students interested in this field, the “Search Framework” opens up a massive design space. What are better algorithms than Random Search? Can we train specific “Search Verifiers”? The shift from pure generation to Generation-via-Search is just beginning.