Introduction
Have you ever watched a Large Language Model (LLM) generate a response and noticed a sudden, inexplicable shift in behavior? One moment it is solving a coding problem, and the next—in the blink of an eye—it is hallucinating or browsing for irrelevant images.
Consider a recent demo where an AI agent, tasked with coding, abruptly switched to Googling pictures of Yellowstone National Park. Or consider how “jailbreaking” attacks often succeed by manipulating just the first few tokens of a response, bypassing safety filters entirely. These aren’t random glitches. They are manifestations of a phenomenon known as critical windows.
A critical window is a narrow interval during the generation process where the model commits to a specific feature or outcome. Before this window, the output is undecided; after it, the path is set.
While this phenomenon has been observed empirically, explaining why it happens has been difficult. Previous theoretical attempts were largely restricted to diffusion models and relied on heavy machinery from statistical physics or on restrictive assumptions.
In a new paper titled “Blink of an eye: a simple theory for feature localization in generative models,” researchers Marvin Li, Aayush Karan, and Sitan Chen provide a breakthrough. They offer a simple, rigorous, and unifying theory that explains critical windows across both diffusion models and autoregressive LLMs.

In this post, we will deconstruct their theory, explain the mathematics of feature localization, and explore how this single concept connects image generation, mathematical reasoning, and AI safety.
Background: The Forward-Reverse Experiment
To understand critical windows, we first need a way to measure when a “feature” (like the subject of a story or the class of an image) is actually decided by the model.
The researchers utilize a framework called Stochastic Localization Samplers. This is a fancy term for a broad category of generative models that includes:
- Diffusion Models: Generate data by gradually removing Gaussian noise.
- Autoregressive Models (LLMs): Generate data by adding one token at a time.
In both cases, the generation process starts with high uncertainty (noise or an empty sequence) and ends with a specific output (a clear image or a complete sentence).
The Experiment
To locate the exact moment a feature emerges, the researchers use a Forward-Reverse Experiment.
Imagine you have a generated image of an orange cat.
- Forward Process: You gradually add noise to this image (or mask tokens in a text) up to a certain point \(t\).
- Reverse Process: You ask the model to regenerate (denoise or complete) the image from that noisy state.
If you add only a little noise (early in the forward process), the model will likely regenerate the same orange cat. The feature is “locked in.” However, if you add too much noise, the model might regenerate a brown cat, or even a dog. The information defining “orange cat” has been lost.
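To make the procedure concrete, here is a minimal sketch in code (not the authors' implementation). The helpers `forward_noise`, `reverse_sampler`, and `has_feature` are hypothetical stand-ins for the forward process, the trained generative model, and a feature classifier such as “is an orange cat”:

```python
import numpy as np

def forward_reverse(x0, t, forward_noise, reverse_sampler):
    """Noise a finished sample up to time t, then let the model regenerate it."""
    xt = forward_noise(x0, t)        # e.g., add Gaussian noise (diffusion) or mask a suffix (LLM)
    return reverse_sampler(xt, t)    # denoise / complete starting from the partially noised state

def feature_consistency(x0, times, forward_noise, reverse_sampler, has_feature, n_trials=32):
    """Estimate P(regenerated sample keeps x0's feature) at each noise level t."""
    curve = []
    for t in times:
        keeps = [has_feature(forward_reverse(x0, t, forward_noise, reverse_sampler))
                 for _ in range(n_trials)]
        curve.append(np.mean(keeps))
    return np.array(curve)
```

Sweeping `t` and watching where this consistency curve falls from near 1 toward the base rate pinpoints the interval in which a given feature was decided.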

As shown in Figure 3, there is a “sweet spot”—or critical window—where the model remembers it should draw a cat but has forgotten it should be orange. This transition reveals exactly when the model decided on the color.
The Core Method: A Theory of Feature Localization
The primary contribution of this paper is a rigorous mathematical bound that predicts exactly when these windows occur. The authors frame this as a problem of distinguishing between sub-populations of a distribution.
Let’s stick with the cat analogy:
- \(\Theta\): The set of all possible images.
- \(S_{target}\): The subset of images that are cats (orange or brown).
- \(S_{init}\): The smaller subset of images that are specifically orange cats.
We want to know at what time \(t\) the model transitions from sampling from the broad group (\(S_{target}\)) to the specific group (\(S_{init}\)).
Defining the Boundaries
The researchers define two critical time points, \(T_{start}\) and \(T_{end}\), based on the Total Variation (TV) distance. The TV distance is a measure of how distinguishable two probability distributions are.
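For reference, the TV distance is a standard quantity (this definition is not specific to the paper): it equals 0 when two distributions are identical and 1 when they never overlap,
\[
\mathrm{TV}(P, Q) \;=\; \sup_{A} \big|P(A) - Q(A)\big| \;=\; \tfrac{1}{2}\sum_{x}\big|P(x) - Q(x)\big|,
\]
with the sum replaced by an integral for continuous distributions.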
- \(T_{start}\): The latest time at which the broader group (\(S_{target}\), the cats) is still distinguishable from everything else (\(\Theta \setminus S_{target}\), e.g., the dogs).
- \(T_{end}\): The earliest time at which the specific group (\(S_{init}\), the orange cats) becomes indistinguishable from the broader group (\(S_{target}\), the cats).
Mathematically, both boundaries are defined by thresholding TV distances between the noised sub-populations.
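A schematic rendering of those definitions, using small tolerance constants \(\varepsilon_{start}\) and \(\varepsilon_{end}\) as stand-ins for the paper's exact thresholds, looks like this:
\[
T_{start} \;=\; \sup\Big\{\, t \in \mathbf{I} \;:\; \mathrm{TV}\big(p_{t \mid S_{target}},\; p_{t \mid \Theta \setminus S_{target}}\big) \;\ge\; 1 - \varepsilon_{start} \Big\},
\]
\[
T_{end} \;=\; \inf\Big\{\, t \in \mathbf{I} \;:\; \mathrm{TV}\big(p_{t \mid S_{init}},\; p_{t \mid S_{target}}\big) \;\le\; \varepsilon_{end} \Big\},
\]
where \(p_{t \mid S}\) is the law of the noised sample at time \(t\), conditioned on the clean sample coming from the sub-population \(S\).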

Here, \(\mathbf{I}\) is the set of time indices (noise levels in diffusion, token positions in LLMs).
- \(T_{start}\) captures when the model has committed to “Cat” but not “Orange Cat.”
- \(T_{end}\) captures when the “Orange” detail is lost to noise.
The Master Theorem
The paper’s main result, Theorem 2, proves that a critical window must exist between these two bounds. It states that if you run the forward-reverse experiment within the window \([T_{end}, T_{start}]\), the resulting distribution will look like the broader target population (\(S_{target}\)), but not necessarily the specific initial population (\(S_{init}\)).
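The exact statement is in the paper; a schematic version of the bound, in the notation above and with constants and mixture-weight factors suppressed, reads roughly as follows, where \(q_t^{S_{init}}\) denotes the output distribution of the forward-reverse experiment started from \(S_{init}\) and noised up to level \(t\):
\[
\mathrm{TV}\big(q_t^{S_{init}},\; p_{\mid S_{target}}\big)
\;\lesssim\;
\mathrm{TV}\big(p_{t \mid S_{init}},\; p_{t \mid S_{target}}\big)
\;+\;
\Big(1 - \mathrm{TV}\big(p_{t \mid S_{target}},\; p_{t \mid \Theta \setminus S_{target}}\big)\Big).
\]
The first term is small once \(t \ge T_{end}\) (the “orange” detail has already washed out), and the second is small as long as \(t \le T_{start}\) (cats are still clearly separated from dogs), so inside \([T_{end}, T_{start}]\) the regenerated sample is forced to look like a generic member of \(S_{target}\).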

This inequality is powerful because:
- It is Dimension-Independent: Unlike previous theories derived for high-dimensional diffusion, this bound does not degrade as the complexity (dimension) of the data increases.
- It is Universal: It applies to any stochastic localization sampler, meaning it works for both the continuous math of diffusion and the discrete math of text generation.
Visualizing the Transition
If we plot the probability of the model retaining a specific feature (like “is a cat”) against the noise level (or time), we see a sharp transition.

In Figure 2, the curve drops almost entirely within the critical window and is nearly flat outside it. Before the window (\(T_{before}\)), the model generates “Cats and dogs” (uncommitted). After the window (\(T_{after}\)), it generates “Cats” (committed). The steepness of this transition indicates how suddenly the decision is made.
Instantiating the Theory
The authors apply their theorem to various specific models to show it holds up mathematically.
Diffusion Models
For a mixture of Gaussian distributions (the standard theoretical testbed for diffusion), the authors derive explicit bounds. If the data consists of two distinct clusters (e.g., “Cat” vs. “Dog”), the critical window is determined by how well separated the clusters are, e.g., the distance between their means (or, in the discrete case, the Hamming distance between them).
The paper derives specific expressions for these boundaries, showing that they come down to the signal-to-noise ratio of the forward process.
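As a back-of-the-envelope version (my simplification, assuming identity covariances and the standard Ornstein-Uhlenbeck forward process \(x_t = e^{-t}x_0 + \sqrt{1 - e^{-2t}}\,z\) with \(z \sim \mathcal{N}(0, I)\)), two clusters with means \(\mu_a\) and \(\mu_b\) remain distinguishable only while their rescaled separation exceeds the noise scale:
\[
e^{-t}\,\|\mu_a - \mu_b\| \;\gtrsim\; \sqrt{1 - e^{-2t}}
\quad\Longleftrightarrow\quad
t \;\lesssim\; \tfrac{1}{2}\log\!\big(1 + \|\mu_a - \mu_b\|^2\big).
\]
The controlling quantity is exactly the signal-to-noise ratio \(e^{-t}\|\mu_a - \mu_b\| / \sqrt{1 - e^{-2t}}\): better-separated clusters stay distinguishable up to higher noise levels, so the decision between them is made earlier in the reverse (generation) process.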

Autoregression (LLMs) & The “Random Walk”
The authors model mathematical problem solving in LLMs as a “random walk.” Imagine the model is taking steps on a number line. Reaching \(+A\) is a correct answer; reaching \(-A\) is incorrect.
- Strong Mode: The model takes a step toward the correct answer with probability \(0.5 + \delta\).
- Weak Mode: The model takes a step toward the correct answer with probability \(0.5 - \delta\).
The theory predicts that the “decision” to be in the Strong or Weak mode happens in a window of size \(\Theta(1/\delta^2)\). Crucially, this width is independent of the total length of the generation. This explains why a 1000-token response might have its quality determined by just a handful of tokens.
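Here is a minimal simulation of this toy model (my own, not the paper's code). We sample one long trajectory from the strong mode and track the Bayes-posterior probability of “strong vs. weak” given only the first \(t\) steps; the posterior saturates after roughly \(1/\delta^2\) steps, no matter how long the full walk is:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_walk(p_up, n_steps):
    """A +/-1 random walk in which each step is +1 with probability p_up."""
    return np.where(rng.random(n_steps) < p_up, 1, -1)

def posterior_strong(steps, delta, prior=0.5):
    """P(strong mode | first t steps), computed for every prefix length t."""
    p_strong, p_weak = 0.5 + delta, 0.5 - delta
    ups = np.cumsum(steps == 1)
    downs = np.arange(1, len(steps) + 1) - ups
    log_odds = (np.log(prior / (1 - prior))
                + ups * np.log(p_strong / p_weak)
                + downs * np.log(p_weak / p_strong))
    return 1.0 / (1.0 + np.exp(-log_odds))

delta, n_steps = 0.05, 50_000
walk = sample_walk(0.5 + delta, n_steps)        # one trajectory from the strong mode
posterior = posterior_strong(walk, delta)
commit_step = int(np.argmax(posterior > 0.95))  # first step where the mode is essentially decided
print(f"1/delta^2 = {1 / delta**2:.0f} steps; commitment around step {commit_step}")
```

Increasing `n_steps` leaves `commit_step` essentially unchanged, which is the sense in which the window width \(\Theta(1/\delta^2)\) is independent of the total generation length.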
Hierarchies in Data
Real data isn’t just binary (Cat vs. Dog). It is hierarchical (Animal \(\to\) Mammal \(\to\) Cat \(\to\) Tabby). The authors extend their theory to Mixture Trees.
As the generation proceeds, the model traverses this tree from root to leaf. Each split in the tree corresponds to a critical window.
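Concretely, a root-to-leaf path gives a chain of nested sub-populations, and the theorem can be applied to each consecutive pair in the chain (my notation, reusing the animal example):
\[
S_3 \;\subseteq\; S_2 \;\subseteq\; S_1 \;\subseteq\; \Theta,
\qquad
S_1 = \{\text{mammals}\},\quad
S_2 = \{\text{cats}\},\quad
S_3 = \{\text{tabbies}\}.
\]
Each child-parent pair, with \(S_{init}\) the child and \(S_{target}\) the parent, gets its own window \([T_{end}, T_{start}]\): coarse splits near the root survive more noise and are resolved earlier in generation than fine-grained splits near the leaves, producing a staircase of critical windows rather than a single one.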

In Figure 4, we see this empirically. The authors asked LLAMA-3 to complete sentences like “The (Pirate/Ninja) jumped…” The plots show distinct jumps in consistency. Each jump represents the model committing to a specific branch of the hierarchy (e.g., committing to “Pirate” over “Ninja”).
Experiments: Critical Windows in the Wild
The theory is elegant, but does it predict real-world behavior? The authors conducted extensive experiments using state-of-the-art LLMs (LLAMA-3, Phi-3, Qwen-2.5) on complex reasoning tasks.
Reasoning “Chain of Thought”
When an LLM solves a math problem, does it gradually converge on the answer, or does it have an “aha!” moment?
The experiments suggest the latter. By masking the end of a Chain of Thought (CoT) and regenerating, the authors found distinct steps where the probability of getting the correct answer jumps significantly.

In Figure 6, notice the sharp jump around the 40% mark. This corresponds to the exact step in the reasoning process where the model writes down the correct formula. Once that formula is generated, the final answer is effectively locked in.
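A sketch of this measurement (hypothetical helpers; the authors' exact prompting and scoring setup may differ): keep the first fraction \(t\) of the chain-of-thought tokens, resample the remainder several times, and record how often the final answer still comes out correct.

```python
def answer_consistency(generate_fn, prompt, cot_tokens, check_answer, fractions, n_samples=16):
    """generate_fn: text -> completion (any LLM API); check_answer: completion -> bool.

    For each truncation fraction t, keep the first t of the chain of thought,
    resample the rest n_samples times, and record the share of correct final answers.
    """
    curve = {}
    for t in fractions:
        prefix = "".join(cot_tokens[: int(t * len(cot_tokens))])
        hits = sum(check_answer(generate_fn(prompt + prefix)) for _ in range(n_samples))
        curve[t] = hits / n_samples
    return curve  # a sharp jump between adjacent fractions marks a critical reasoning step
```

A jump like the one around the 40% mark in Figure 6 shows up as a large difference between neighboring entries of `curve`.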
Critical Windows and Failure
Perhaps the most practical insight from this paper is the correlation between critical windows and errors. The researchers found that generations containing sharp critical windows were significantly less accurate than those without.

As shown in Figure 8 (and Figures 5, 9, and 10 in the full paper), the orange bars (generations with critical windows) consistently show lower accuracy than the blue bars. This suggests that when a model struggles or “hesitates” (manifesting as a sharp branching point), it is more likely to hallucinate or err. Conversely, robust knowledge retrieval tends to happen more smoothly.
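One simple way to operationalize “this generation contains a sharp critical window” (a heuristic of my own for illustration; the paper uses its own criterion) is to flag consistency curves with a large single-step jump and compare accuracy across the two groups:

```python
import numpy as np

def has_critical_window(consistency_curve, jump_threshold=0.5):
    """Flag curves that make one large jump between adjacent truncation points."""
    return np.abs(np.diff(np.asarray(consistency_curve, dtype=float))).max() > jump_threshold

def accuracy_by_group(curves, is_correct, jump_threshold=0.5):
    """Compare answer accuracy for generations with vs. without a sharp critical window."""
    flagged = np.array([has_critical_window(c, jump_threshold) for c in curves])
    correct = np.asarray(is_correct, dtype=float)
    return {
        "with_window": correct[flagged].mean() if flagged.any() else float("nan"),
        "without_window": correct[~flagged].mean() if (~flagged).any() else float("nan"),
    }
```

If the paper's finding holds under this heuristic, the `with_window` group should show noticeably lower accuracy.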
Jailbreaks and Safety
The theory also explains “jailbreaks”—adversarial attacks that bypass safety filters. A jailbreak creates a critical window at the very beginning of the generation.

The left plot in Figure 7 shows a “Prefill attack.” By forcing the model to start its response with a specific affirmative phrase (“Sure, here is how to…”), the attacker pushes the model through a critical window in which it transitions from the “Refusal” distribution to the “Compliance” distribution. The theory explains why these attacks are so effective: once the model has been pushed through that narrow window, the continuation is effectively sampled from the “Compliance” sub-population, and, conditioned on the generated prefix, the safe “Refusal” behavior is statistically out of reach.
Conclusion
The paper “Blink of an eye” provides a much-needed theoretical backbone for understanding the unpredictable nature of generative AI. By framing generation as a process of stochastic localization, the authors show that critical windows are not bugs, but mathematical necessities when localizing from a broad distribution to a specific one.
Key Takeaways:
- Universality: The phenomenon applies to both Diffusion images and LLM text.
- Predictability: We can mathematically bound when these decisions occur (\(T_{start}\) vs \(T_{end}\)).
- Interpretability: Identifying these windows allows us to pinpoint exactly where a model learns a concept, makes a reasoning leap, or succumbs to a jailbreak.
For practitioners, this implies that safety and alignment interventions should not be applied uniformly. Instead, they should target these specific, narrow windows where the model’s “mind” is actually being made up. The future of interpretable AI may lie in watching for these blinks.