Large Language Models (LLMs) like GPT‑4 are incredibly powerful—but they remain opaque. We can observe their inputs and outputs, yet the intricate internal computations within their hidden layers—the model’s “thoughts”—are largely a mystery. How can we begin to understand the concepts a model has learned, from simple ideas like the color blue to abstract notions like legal reasoning or risk assessment?
This challenge sits at the heart of mechanistic interpretability. A promising tool for this task is the Sparse Autoencoder (SAE). SAEs act as dictionaries for a model’s internal language: each activation is decomposed into a small number of active, interpretable features drawn from a much larger learned dictionary. A high‑quality dictionary can reveal meaningful internal features, letting us not only understand a model’s reasoning but potentially steer its behavior.
However, training SAEs has historically been difficult. They often fail to learn, suffer from “dead” latents that never activate, and become unstable at large scales. OpenAI’s paper, “Scaling and evaluating sparse autoencoders,” addresses these problems directly. The authors present a robust training recipe that allows SAEs to scale to unprecedented size—including a 16‑million‑latent autoencoder trained on GPT‑4 activations. They also derive clean scaling laws for SAEs and propose new ways to measure how good the discovered features actually are.
In this article, we’ll unpack these innovations: how the training recipe works, what the scaling laws reveal, and why evaluating feature quality is becoming central to interpretability research.
Background: Why Sparse Autoencoders Matter
An autoencoder consists of two parts:
- an encoder, which compresses an input vector \(x \in \mathbb{R}^d\) into latent variables \(z\), and
- a decoder, which reconstructs the original input from those latents, producing \(\hat{x}\).
For interpretability, autoencoders are used not to compress data but to find a high‑dimensional dictionary of features—an overcomplete representation where the latent dimension \(n\) is larger than the input dimension \(d\). Since only a small subset of these features should activate for any one input, practitioners enforce sparsity.
The classic approach adds an \(L_1\) penalty to the reconstruction loss:
\[ L = \|x - \hat{x}\|_2^2 + \lambda \|z\|_1 \]

This encourages most activations to be near zero, yielding sparse representations.

A typical ReLU autoencoder with reconstruction and sparsity objectives. Balancing these two terms is delicate and often unstable.
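To ground the setup above, here is a minimal PyTorch sketch of such a ReLU autoencoder trained with the reconstruction‑plus‑\(L_1\) objective. The class name, initialization scale, and value of \(\lambda\) are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class L1SparseAutoencoder(nn.Module):
    """Baseline ReLU autoencoder with an L1 sparsity penalty (illustrative sketch)."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.b_pre = nn.Parameter(torch.zeros(d_model))               # pre-encoder bias
        self.W_enc = nn.Parameter(torch.randn(d_model, n_latents) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_latents, d_model) * 0.01)

    def forward(self, x):
        z = torch.relu((x - self.b_pre) @ self.W_enc)                 # latent code
        x_hat = z @ self.W_dec + self.b_pre                           # reconstruction
        return x_hat, z

def l1_sae_loss(x, x_hat, z, lam=1e-3):
    # Reconstruction error plus the L1 term that indirectly encourages sparsity.
    return ((x - x_hat) ** 2).sum(-1).mean() + lam * z.abs().sum(-1).mean()
```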
While workable, this method has two major flaws:
- Hyperparameter fragility. The sparsity weight \(\lambda\) must be tuned carefully; setting it too high ruins reconstruction, too low ruins sparsity.
- Dead latents. Many latent variables stop activating altogether. At large scale, as many as 90% of them may become unusable.
These problems have kept previous SAEs small and unreliable—too weak to capture the full diversity of features inside large LLMs.
A Better Training Recipe for SAEs
OpenAI’s work introduces a training approach that overcomes both issues and scales cleanly to millions of latents. Two ideas drive this success: direct sparsity control and latent regeneration.
1. Direct Sparsity Control with TopK
Instead of enforcing sparsity indirectly through an \(L_1\) penalty, the authors adopt a k‑sparse activation function known as TopK:
\[ z = \text{TopK}(W_{\text{enc}}(x - b_{\text{pre}})) \]

For each input vector, the encoder computes a set of pre‑activations and retains only the \(k\) largest values, zeroing all others. This guarantees exact sparsity and eliminates the need to tune \(\lambda\).
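A minimal sketch of this encoder in PyTorch, assuming tensors for \(W_{\text{enc}}\) and \(b_{\text{pre}}\); the function name and shapes are illustrative.

```python
import torch

def topk_encode(x, W_enc, b_pre, k: int):
    """Keep only the k largest pre-activations per example; zero the rest (sketch)."""
    pre = (x - b_pre) @ W_enc                     # (batch, n_latents) pre-activations
    vals, idx = pre.topk(k, dim=-1)               # k largest values per row
    z = torch.zeros_like(pre)
    z.scatter_(-1, idx, vals)                     # exact k-sparse latent code
    return z
```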
Advantages:
- Explicit control. Sparsity is set directly by \(k\), not by a hard‑to‑tune coefficient.
- No activation shrinkage. Unlike \(L_1\) penalties, TopK does not reduce the magnitude of non‑zero activations.
- Better reconstruction. Empirically, it consistently delivers lower MSE for a given sparsity level and scales favorably with model size.

TopK improves reconstruction quality across sparsity levels and becomes increasingly advantageous in larger autoencoders.
2. Reviving Dead Latents
Dead latents waste capacity and compute. The OpenAI team found two synergistic ways to prevent them:
- Tied initialization: The encoder weights start as the transpose of the decoder weights, placing the autoencoder at a stable initialization that encourages activation early in training.
- Auxiliary loss (AuxK): Once a latent is deemed “dead” (does not activate over many tokens), the model reassigns it to reconstruct the residual error \(e = x - \hat{x}\). This auxiliary objective gives otherwise inactive latents a fresh source of gradient information.
Together, these tricks nearly eliminate dead latents—even in the largest systems.

Combining tied initialization and AuxK keeps almost all latents alive throughout training.
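The sketch below shows one way such an auxiliary term could be implemented: track how long each latent has gone without firing, then let the largest pre‑activations among the "dead" latents reconstruct the residual with the shared decoder. The threshold, the value of `k_aux`, and the tensor layout are assumptions rather than the paper's exact settings.

```python
import torch

def auxk_loss(x, x_hat, pre_acts, W_dec, steps_since_fired,
              k_aux=256, dead_after=10_000_000):
    """Auxiliary loss: let 'dead' latents model the residual error (sketch).

    `steps_since_fired` counts tokens since each latent last activated;
    the threshold and k_aux are illustrative placeholders.
    """
    residual = x - x_hat                                  # what the live latents missed
    dead_mask = steps_since_fired > dead_after            # (n_latents,) bool
    if dead_mask.sum() == 0:
        return torch.zeros((), device=x.device)
    # Take the k_aux largest pre-activations among dead latents only.
    masked = pre_acts.masked_fill(~dead_mask, float("-inf"))
    k = min(k_aux, int(dead_mask.sum()))
    vals, idx = masked.topk(k, dim=-1)
    z_aux = torch.zeros_like(pre_acts).scatter_(-1, idx, torch.relu(vals))
    residual_hat = z_aux @ W_dec                          # decode with the shared decoder
    return ((residual - residual_hat) ** 2).sum(-1).mean()
```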
Scaling Laws for Sparse Autoencoders
With a stable training foundation, the authors could explore how SAEs scale—analogous to scaling laws for LLMs. Their results show strikingly clean power‑law relationships between reconstruction loss and factors such as compute, latent count, and sparsity.

Left: reconstruction loss improves predictably with compute. Right: joint scaling law showing combined effects of n and k.
They fit the following relationship for converged loss:
\[ L(n,k) = \exp(\alpha + \beta_k \log k + \beta_n \log n + \gamma \log k \log n) + \exp(\zeta + \eta \log k) \]

Both increasing the dictionary size \(n\) and allowing more active latents \(k\) improve reconstruction, and the negative interaction term \(\gamma\) indicates that the two reinforce each other: the larger the dictionary, the more benefit comes from a larger \(k\).
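As a sketch of how such a joint law can be fit in practice, the snippet below runs scipy's `curve_fit` on a synthetic grid of \((n, k, \text{loss})\) points. The coefficients and data here are placeholders; real values would come from converged training runs.

```python
import numpy as np
from scipy.optimize import curve_fit

def joint_scaling_law(nk, alpha, beta_k, beta_n, gamma, zeta, eta):
    # L(n, k) = exp(alpha + beta_k*log k + beta_n*log n + gamma*log k*log n) + exp(zeta + eta*log k)
    n, k = nk
    ln, lk = np.log(n), np.log(k)
    return np.exp(alpha + beta_k * lk + beta_n * ln + gamma * lk * ln) + np.exp(zeta + eta * lk)

# Hypothetical grid of (n, k) settings with synthetic losses; real data comes from training.
n_vals = np.array([2**15, 2**17, 2**19, 2**21], dtype=float)
k_vals = np.array([8, 32, 128, 512], dtype=float)
N, K = np.meshgrid(n_vals, k_vals)
fake_loss = joint_scaling_law((N.ravel(), K.ravel()), 2.0, -0.2, -0.3, 0.01, -1.0, -0.5)

params, _ = curve_fit(joint_scaling_law, (N.ravel(), K.ravel()), fake_loss,
                      p0=[1.0, -0.1, -0.1, 0.0, -1.0, -0.1], maxfev=20000)
print("fitted (alpha, beta_k, beta_n, gamma, zeta, eta):", params)
```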
Scaling with Model Size
As one might expect, the difficulty of reconstruction grows with the complexity of the underlying LLM. Larger models require proportionally larger SAEs to achieve the same reconstruction quality.

Bigger subject models require bigger autoencoders—mirroring scaling behavior in LLM pretraining.
Evaluating Latent Quality Beyond Reconstruction
Reconstruction loss alone does not measure how useful or interpretable features are. The paper introduces four metrics targeting qualitative aspects of feature quality and interpretability.
1. Downstream Loss
A high‑quality SAE should reconstruct the internal activations essential for language modeling. To test this, the researchers substitute the reconstructed activations back into GPT‑4 and measure its increase in next‑token prediction loss.

TopK preserves downstream performance best—its reconstructions retain task‑relevant information.
Even with reconstructions from their largest SAE (16M latents) patched in, GPT‑4’s language‑modeling loss corresponds to that of a model trained with roughly 10% of its pretraining compute, showing that the reconstructions preserve a substantial amount of task‑relevant information while leaving clear room for improvement.
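A rough sketch of this evaluation, using GPT‑2 from the `transformers` library as a stand‑in for GPT‑4 and a hypothetical `sae` object with `encode`/`decode` methods: a forward hook swaps one block's residual‑stream output for its SAE reconstruction, and the increase in next‑token loss is recorded.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Stand-in setup: GPT-2 instead of GPT-4; `sae` is a hypothetical object
# exposing encode()/decode() for residual-stream activations.
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

@torch.no_grad()
def downstream_loss_increase(text, sae, layer_idx=6):
    """Swap one block's output for its SAE reconstruction and measure
    how much the next-token prediction loss rises (illustrative sketch)."""
    inputs = tok(text, return_tensors="pt")

    def patch(module, inp, out):
        hidden = out[0]                              # (batch, seq, d_model) residual stream
        recon = sae.decode(sae.encode(hidden))       # hypothetical SAE interface
        return (recon,) + out[1:]

    base = model(**inputs, labels=inputs["input_ids"]).loss
    handle = model.transformer.h[layer_idx].register_forward_hook(patch)
    try:
        patched = model(**inputs, labels=inputs["input_ids"]).loss
    finally:
        handle.remove()
    return (patched - base).item()
```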
2. Recovering Known Features with Linear Probes
To see if the autoencoder discovers recognizable concepts, the team created 61 binary tasks (e.g., topic classification, language ID) and trained logistic probes on individual latents.

More total latents produce better probe scores and sparser causal effects.
Larger autoencoders recover hypothesized features more easily, confirming that scale improves interpretability.
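Conceptually, each probe is just a one‑dimensional logistic regression on a single latent's activations. A small scikit‑learn sketch, with the array names and AUC scoring choice as illustrative assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def probe_single_latent(latent_acts, labels):
    """Fit a 1-D logistic probe on one latent's activations for a binary task (sketch).

    latent_acts: array of shape (n_examples,) with one SAE latent's activations.
    labels: array of shape (n_examples,) with binary task labels (e.g., topic, language ID).
    """
    X = latent_acts.reshape(-1, 1)
    probe = LogisticRegression().fit(X, labels)
    return roc_auc_score(labels, probe.predict_proba(X)[:, 1])

# Hypothetical usage: score every latent and keep the best one per task.
# best_latent = max(range(n_latents), key=lambda i: probe_single_latent(acts[:, i], labels))
```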
3. Explainability: Precision and Recall
Interpretability can be misleading—an explanation may appear convincing but lack precision. To combat this, they use Neuron‑to‑Graph (N2G), an automated system that generates token‑pattern explanations for each feature, then measures precision and recall quantitatively.

Good features have explanations that are specific and reliable, not just broad correlations.
Across models, larger and sparser SAEs yield more precise, higher‑recall explanations.
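At its core, this evaluation treats an explanation as a binary predictor of when a latent fires. The helper below sketches that scoring; N2G's actual procedure for simulating activations is more involved, so treat this as a simplified stand‑in.

```python
def explanation_precision_recall(predicted_active, actually_active):
    """Score an explanation as a binary predictor of latent activation (sketch).

    predicted_active: set of token positions the explanation claims activate the latent.
    actually_active: set of positions where the latent truly fires above a threshold.
    """
    true_pos = len(predicted_active & actually_active)
    precision = true_pos / max(len(predicted_active), 1)   # how often the explanation is right
    recall = true_pos / max(len(actually_active), 1)       # how much activation it accounts for
    return precision, recall
```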
4. Sparsity of Ablation Effects
Finally, each latent is ablated in turn to measure how sparsely its removal changes the model’s output logits. Intuitively, disentangled, monosemantic features should have highly localized effects.
Results show that increasing the total latent count \(n\) and moderate increases in \(k\) produce sparser and more interpretable causal effects, until a dense regime near \(k \approx d_{\text{model}}\) causes features to blend and lose clarity.
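One simple way to quantify how localized an ablation's effect is compares the \(L_1\) and \(L_2\) norms of the resulting logit differences; the sketch below uses the \((L_1/L_2)^2\) ratio as an effective count of affected tokens. Treat this exact metric as an assumption rather than the paper's precise definition.

```python
import torch

def ablation_effect_sparsity(logits_base, logits_ablated):
    """Quantify how localized a latent's causal effect on output logits is (sketch)."""
    diff = (logits_ablated - logits_base).flatten()
    l1 = diff.abs().sum()
    l2 = diff.norm()
    return ((l1 / (l2 + 1e-9)) ** 2).item()   # smaller value => sparser, more localized effect
```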
Understanding Why TopK Works
Beyond empirical gains, the paper explains why TopK outperforms traditional sparse activations.
Preventing Activation Shrinkage
\(L_1\) penalties bias activations downward, reducing reconstruction quality. To test this, the authors “refine” activations after training by re‑optimizing their magnitudes while freezing the decoder.

ReLU activations grow after refinement—evidence of prior shrinkage. TopK avoids this problem entirely.
TopK activations show no such bias, confirming that removing the \(L_1\) term yields more faithful latent magnitudes and better overall performance.
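The refinement procedure can be sketched as a small optimization: freeze the decoder and the set of active latents, and re‑fit only the activation magnitudes against the reconstruction error. The step count and learning rate below are placeholders.

```python
import torch

def refine_magnitudes(x, z_init, W_dec, b_pre, steps=200, lr=1e-2):
    """Re-optimize the magnitudes of already-active latents with the decoder frozen (sketch).

    If the refined values grow relative to z_init, the original activations were
    shrunk (as with an L1 penalty); TopK activations should stay roughly unchanged.
    """
    W_dec, b_pre = W_dec.detach(), b_pre.detach()    # treat the decoder as frozen
    active = (z_init != 0).float()                   # keep the same support
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x_hat = (z * active) @ W_dec + b_pre
        loss = ((x - x_hat) ** 2).sum(-1).mean()
        loss.backward()
        opt.step()
    return (z * active).detach()
```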
Progressive and Flexible Codes
Standard TopK can “overfit” to its training sparsity level \(k\). To allow flexible decoding at different sparsity levels, the authors design Multi‑TopK, training with combined losses at multiple values of \(k\). The result is a progressive code that reconstructs well across sparsity ranges, retaining fidelity whether a fixed or adaptive number of latents is active.

Multi‑TopK yields a progressive code—robust reconstruction over diverse sparsity levels.
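A compact sketch of the Multi‑TopK idea: compute the reconstruction loss at several values of \(k\) from the same pre‑activations and sum the results with fixed weights. The particular \(k\) values and weights here are placeholders rather than the paper's settings.

```python
import torch

def multi_topk_loss(x, pre_acts, W_dec, b_pre, ks=(32, 128), weights=(1.0, 0.125)):
    """Sum reconstruction losses at several sparsity levels (Multi-TopK sketch)."""
    total = 0.0
    for k, w in zip(ks, weights):
        vals, idx = pre_acts.topk(k, dim=-1)
        z = torch.zeros_like(pre_acts).scatter_(-1, idx, vals)   # k-sparse code
        x_hat = z @ W_dec + b_pre
        total = total + w * ((x - x_hat) ** 2).sum(-1).mean()
    return total
```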
Implications and Future Directions
This research establishes a powerful foundation for scaling mechanistic interpretability.
Key takeaways:
- Scalable and stable training. TopK activation, tied initialization, and AuxK create reliable SAEs with minimal dead latents.
- Predictable scaling behavior. Reconstruction loss follows precise power laws across compute, latent count, and sparsity.
- Improved interpretability at scale. Larger SAEs not only reconstruct better but also yield features that align with known concepts, explain cleanly, and exert sparse, intelligible effects on model outputs.
By showing that SAEs can scale successfully to millions of latents on frontier models like GPT‑4, the OpenAI team has opened a path toward comprehensive concept dictionaries—catalogs that may one day detail everything a large model knows.
The work also highlights future challenges: improving explanation algorithms, optimizing training efficiency, and exploring mixtures‑of‑experts or adaptive sparsity mechanisms. Yet the progress made here represents a substantial leap toward deconstructing the black box and genuinely understanding how advanced AI systems think.