Introduction: The Sparsity–Fidelity Dilemma

Mechanistic interpretability researchers have long sought to uncover how large language models (LMs) like Gemma or GPT-4 organize their internal representations. A powerful tool in this endeavor is the Sparse Autoencoder (SAE), a model that decomposes dense activation vectors into simpler building blocks called features. Imagine an LM’s activation as its “thought,” represented by thousands of numbers. An SAE breaks this complexity down into interpretable components such as 70% “grammar”, 40% “computer code”, and 10% “formal writing”.

This sparse decomposition helps researchers trace how information flows through the model, understand causal subcircuits, and even steer behavior. Yet, every SAE faces a persistent conflict:

  1. Sparsity: To be interpretable, only a few features should be active per activation.
  2. Fidelity: To be useful, the reconstruction must closely match the original activation.

Increasing sparsity inevitably sacrifices fidelity, and improving fidelity usually increases feature usage. This trade-off forms a Pareto frontier—a curve of optimal balance that researchers strive to push outward.

A recent paper titled “Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders” introduces a simple but powerful method to do exactly that. The authors propose JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level. The breakthrough comes from a new way to train what was considered an untrainable activation function, enabling direct optimization for sparsity while maintaining high fidelity.

This article explains how JumpReLU SAEs work, why this innovation matters, and what it tells us about the future of feature-level interpretability in AI.


Background: A Quick Tour of Sparse Autoencoders

A standard autoencoder consists of two parts—an encoder that compresses input data into a lower-dimensional representation, and a decoder that reconstructs the original data.

In contrast, a sparse autoencoder expands rather than compresses. It learns an overcomplete dictionary of feature directions, where the encoder activates only a few entries to represent any given input.

Formally, for a language model activation \( \mathbf{x} \in \mathbb{R}^n \):

  1. Encoder: Transforms activations into features \( \mathbf{f}(\mathbf{x}) = \sigma(\mathbf{W}_{\text{enc}} \mathbf{x} + \mathbf{b}_{\text{enc}}) \)
  2. Decoder: Reconstructs the activation \( \hat{\mathbf{x}} = \mathbf{W}_{\text{dec}} \mathbf{f} + \mathbf{b}_{\text{dec}} \)

“The encoder converts LM activations into feature magnitudes, and the decoder reconstructs the activation using learned dictionary vectors.”

Here, \( \mathbf{W}_{\text{enc}} \) and \( \mathbf{W}_{\text{dec}} \) are weight matrices, \( \mathbf{b}_{\text{enc}} \) and \( \mathbf{b}_{\text{dec}} \) are biases, and \( \sigma \) is a nonlinear activation—traditionally the ReLU. To train SAEs, researchers minimize a loss with two parts:

\( \mathcal{L}(\mathbf{x}) = \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 + \lambda\, S(\mathbf{f}(\mathbf{x})) \)

“The overall SAE loss balances reconstruction fidelity (the L2 error) against a sparsity penalty \( S \) on the feature activations, traditionally the L1 norm.”

The coefficient \( \lambda \) tunes how severely the model is penalized for being dense.
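To make this concrete, here is a minimal PyTorch sketch of the encoder, decoder, and L1-penalized loss described above; the dictionary size, initialization scale, and value of \( \lambda \) are illustrative choices, not the paper’s configuration.

```python
import torch
import torch.nn.functional as F

def encode(x, W_enc, b_enc):
    """Encoder: dense LM activation -> (mostly zero) feature magnitudes."""
    return F.relu(x @ W_enc + b_enc)

def decode(f, W_dec, b_dec):
    """Decoder: feature magnitudes -> reconstructed activation."""
    return f @ W_dec + b_dec

def sae_loss(x, f, x_hat, lam=1e-3):
    """L2 reconstruction error plus a lambda-weighted L1 sparsity penalty."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + lam * sparsity

# Illustrative shapes: 1024-dim activations, 8192 dictionary features.
n, m = 1024, 8192
W_enc, b_enc = torch.randn(n, m) * 0.01, torch.zeros(m)
W_dec, b_dec = torch.randn(m, n) * 0.01, torch.zeros(n)

x = torch.randn(4, n)                       # a small batch of LM activations
f = encode(x, W_enc, b_enc)
loss = sae_loss(x, f, decode(f, W_dec, b_dec))
```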


The Problem with Vanilla ReLU SAEs

ReLU-based SAEs historically use an L1 penalty for sparsity. But both the activation and the penalty cause unwanted side effects.

Consider the toy example below.

A toy model illustrating the problems with ReLU and the benefits of JumpReLU.

“False positives and magnitude shrinkage: ReLU keeps small positive activations that should be zero and punishes large activations, reducing reconstruction fidelity.”

When a feature’s encoder pre-activation should be inactive—but is slightly positive—the ReLU allows it through. Lowering the encoder bias can suppress these “false positives,” but it also shrinks real activations. The L1 penalty further encourages smaller values, systematically underestimating their true magnitude and degrading reconstructions.
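A quick numeric illustration of both failure modes, with made-up pre-activation values:

```python
import torch
import torch.nn.functional as F

# A feature that should be OFF but whose pre-activation is slightly positive:
noise = torch.tensor(0.15)
print(F.relu(noise))          # tensor(0.1500) -> false positive: the noise leaks through

# Lowering the encoder bias by 0.2 suppresses the noise...
print(F.relu(noise - 0.2))    # tensor(0.) -> noise removed

# ...but the same shift also shrinks a genuinely active feature:
signal = torch.tensor(3.0)
print(F.relu(signal - 0.2))   # tensor(2.8000) -> true magnitude underestimated
```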

Recent variants like Gated SAEs and TopK SAEs introduced extra thresholding mechanisms to control activation more precisely. JumpReLU SAEs take this idea further: they directly learn a threshold for every feature and optimize the true level of sparsity, not a proxy.


The Core Method: Taking the Leap with JumpReLU

The JumpReLU SAE combines two central ideas—a new activation function and a new training approach using direct L0 sparsity optimization.

1. The JumpReLU Activation Function

Rather than using ReLU, each feature now uses a JumpReLU activation defined as:

\( \text{JumpReLU}_\theta(z) = z \cdot H(z - \theta) \)

“JumpReLU replaces ReLU with a gated identity: zero below the per-feature threshold θ, the identity at or above it.”

Here \( H \) is the Heaviside step function (0 for negative arguments, 1 for positive) and \( \theta > 0 \) is a learned threshold, one per feature. Pre-activations below the threshold are zeroed out, while pre-activations at or above it pass through unchanged. This preserves genuine activations while cleanly removing noise, directly addressing both false positives and shrinkage.

The full SAE forward pass becomes:

\( \mathbf{f}(\mathbf{x}) = \text{JumpReLU}_{\boldsymbol{\theta}}(\mathbf{W}_{\text{enc}} \mathbf{x} + \mathbf{b}_{\text{enc}}), \qquad \hat{\mathbf{x}} = \mathbf{W}_{\text{dec}} \mathbf{f}(\mathbf{x}) + \mathbf{b}_{\text{dec}} \)

“JumpReLU SAEs extend standard SAEs with a vector of per-feature thresholds \( \boldsymbol{\theta} \) that controls which features activate.”
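A minimal sketch of this forward pass. The sizes are again illustrative, and the log-space parameterization of the thresholds (to keep them positive) is a convenience for the sketch rather than a claim about the paper’s exact setup.

```python
import torch
import torch.nn as nn

def jumprelu(z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """JumpReLU: zero below the per-feature threshold, identity at or above it."""
    return z * (z >= theta).to(z.dtype)

class JumpReLUSAE(nn.Module):
    """Sketch of a JumpReLU SAE forward pass (sizes are illustrative)."""

    def __init__(self, n_inputs: int = 1024, n_features: int = 8192):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_inputs, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, n_inputs) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(n_inputs))
        # One threshold per feature, stored in log space so it stays positive.
        self.log_theta = nn.Parameter(torch.full((n_features,), -2.0))

    def forward(self, x: torch.Tensor):
        theta = self.log_theta.exp()
        # NOTE: this plain thresholding gives the thresholds no gradient at all;
        # training relies on the straight-through trick described below.
        f = jumprelu(x @ self.W_enc + self.b_enc, theta)   # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec                # reconstruction
        return f, x_hat
```

Note that, implemented this way, the comparison against \( \boldsymbol{\theta} \) gives the thresholds no gradient, which is exactly the training dilemma discussed next.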

2. The L0 Loss and the Training Dilemma

Replacing the L1 penalty with a direct L0 sparsity term is conceptually ideal—the model should minimize how many features activate, not how strongly.

\( \mathcal{L}(\mathbf{x}) = \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 + \lambda\, \|\mathbf{f}(\mathbf{x})\|_0 \)

“JumpReLU SAEs combine an L2 reconstruction loss with an exact L0 sparsity term that counts how many features are active.”

However, the challenge is clear: both the JumpReLU output and the L0 penalty are piecewise constant functions of the threshold \( \theta \). For any individual input, nudging \( \theta \) slightly almost never changes the output, so the gradient with respect to \( \theta \) is zero almost everywhere, breaking ordinary gradient-based training.
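The dilemma is easy to reproduce with autograd, using made-up numbers:

```python
import torch

z = torch.tensor([0.3, 1.7, -0.4], requires_grad=True)   # pre-activations (made up)
theta = torch.full((3,), 0.5, requires_grad=True)          # per-feature thresholds

# Naive JumpReLU: the mask (z > theta) is piecewise constant in theta, so the
# threshold never enters the autograd graph through a differentiable path.
f = z * (z > theta).float()
f.sum().backward()

print(z.grad)       # tensor([0., 1., 0.]) -> gradient flows only where z is above theta
print(theta.grad)   # None -> gradient descent has nothing to update theta with
```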

How do we train parameters with no gradient? Enter a key insight.

3. The Solution: Gradients of the Expected Loss

While the instantaneous loss is flat almost everywhere, the expected loss over the data distribution is differentiable. Its analytical gradient is:

The analytical derivative of the expected loss with respect to each threshold \( \theta_i \).

“The gradient of the expected loss depends on how much the reconstruction error and the sparsity penalty would change for pre-activations near the threshold, weighted by how often pre-activations land there.”

In words: each threshold is adjusted according to how it affects average reconstruction quality and sparsity. If the features near a threshold contribute substantially to reconstruction, the gradient pushes that threshold down so they stay active; if they contribute little, the sparsity term pushes it up.

To estimate this gradient from mini-batches, the authors turn to straight-through estimators (STEs).

4. Straight-Through Estimators (STEs)

STEs approximate the unusable derivative of a non-differentiable function with a usable surrogate: here, the true derivative of the jump with respect to the threshold (a Dirac delta spike) is replaced by a pseudo-derivative that is nonzero only in a small window around the discontinuity.

Pseudo-derivatives of the JumpReLU and of the Heaviside step function with respect to the threshold, nonzero only in a small window (of bandwidth ε) around the threshold.

“For pre-activations near the threshold, these pseudo-derivatives provide a gradient signal, effectively enabling backpropagation through the jump.”

With these pseudo-derivatives in place, training proceeds by ordinary backpropagation. Remarkably, the authors show that, averaged over a batch, the resulting pseudo-gradients form a kernel density estimate (KDE) of the true gradient of the expected loss, giving the technique a principled statistical footing.
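Below is a sketch of how such pseudo-derivatives can be wired into autograd, using a rectangle kernel of bandwidth ε as the smoothing window; the exact pseudo-derivative expressions are my reading of this recipe and are meant as an illustration rather than a faithful reproduction of the paper’s equations.

```python
import torch

class JumpReLUSTE(torch.autograd.Function):
    """JumpReLU with a straight-through pseudo-derivative for the threshold."""

    @staticmethod
    def forward(ctx, z, theta, eps):
        ctx.save_for_backward(z, theta)
        ctx.eps = eps
        return z * (z >= theta).to(z.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        window = ((z - theta).abs() < ctx.eps / 2).to(z.dtype) / ctx.eps  # rectangle kernel
        grad_z = grad_out * (z >= theta).to(z.dtype)        # exact derivative almost everywhere
        grad_theta = grad_out * (-theta) * window           # pseudo-derivative (assumed form)
        while grad_theta.dim() > theta.dim():               # theta is per-feature:
            grad_theta = grad_theta.sum(0)                  # sum over the batch dimension
        return grad_z, grad_theta, None

class HeavisideSTE(torch.autograd.Function):
    """Step function for the L0 term, with a pseudo-derivative for the threshold."""

    @staticmethod
    def forward(ctx, z, theta, eps):
        ctx.save_for_backward(z, theta)
        ctx.eps = eps
        return (z >= theta).to(z.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        window = ((z - theta).abs() < ctx.eps / 2).to(z.dtype) / ctx.eps
        grad_theta = grad_out * (-window)                    # pseudo-derivative (assumed form)
        while grad_theta.dim() > theta.dim():
            grad_theta = grad_theta.sum(0)
        # No gradient is passed to the pre-activations through the step in this sketch.
        return torch.zeros_like(z), grad_theta, None

def jumprelu_l0_loss(x, x_hat, pre_acts, theta, lam, eps=1e-3):
    """L2 reconstruction error plus lambda times the STE-backed L0 feature count."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    l0 = HeavisideSTE.apply(pre_acts, theta, eps).sum(dim=-1).mean()
    return recon + lam * l0
```

In training, the feature activations would likewise be computed with JumpReLUSTE.apply, so the same ε-window supplies the threshold gradients for both the reconstruction and the sparsity terms.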

In sum, this training recipe makes a non-differentiable objective trainable, enabling direct optimization of L0 sparsity.


Experiments and Results: Putting JumpReLU to the Test

The authors evaluated JumpReLU against Gated and TopK SAEs on activations from several sites and layers of Gemma 2 9B, including the residual stream, attention outputs, and MLP outputs.

The Sparsity–Fidelity Trade-off

Plots comparing Delta LM Loss vs. L0 for JumpReLU, Gated, and TopK SAEs on the residual stream.

“Across Gemma 2 9B layers, JumpReLU achieves equal or better fidelity than TopK and Gated SAEs for any given sparsity.”

The metric, Delta LM Loss, measures how much the LM’s cross-entropy loss increases when the SAE’s reconstructions are spliced in place of the true activations. Lower is better. In these plots, the JumpReLU curve consistently sits at or below its competitors, indicating more faithful reconstructions at equal sparsity. The results hold across residual-stream, MLP, and attention activations.
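As a rough illustration of how such a metric can be computed, the sketch below splices the SAE reconstruction into the model with a forward hook and compares the resulting loss with the original. The names `model`, `layer`, `sae`, and `loss_fn` are placeholders for whatever LM, submodule, autoencoder, and loss you are evaluating, and the hook assumes the submodule returns a plain activation tensor.

```python
import torch

@torch.no_grad()
def delta_lm_loss(model, layer, sae, tokens, loss_fn):
    """Increase in LM loss when `layer`'s activations are replaced by SAE reconstructions.

    Assumes `model(tokens)` returns logits, `layer` outputs a plain tensor, and
    `sae(x)` returns (features, reconstruction). All names here are placeholders.
    """
    clean_loss = loss_fn(model(tokens), tokens)

    def splice(module, inputs, output):
        _, x_hat = sae(output)
        return x_hat                      # later layers now see the reconstruction

    handle = layer.register_forward_hook(splice)
    try:
        spliced_loss = loss_fn(model(tokens), tokens)
    finally:
        handle.remove()

    return (spliced_loss - clean_loss).item()
```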


Learned Feature Dynamics

High-Frequency Features – Some architectures produce “always-on” features that activate on many tokens and are thus harder to interpret.

Plots showing the proportion of high-frequency features for each SAE type.

“JumpReLU and TopK SAEs display slightly more very frequent features than Gated SAEs—but these form less than 0.06% of the total dictionary.”

JumpReLU and TopK share similar high-frequency behavior, though most features remain rare and sparse. Crucially, JumpReLU avoids the “feature death” problem, needing no resampling during training.


Interpretability Studies

Manual Study: Human raters examined feature activations and explanations to judge monosemanticity—whether each feature represented a single coherent concept.

Bar chart of human rater scores for feature interpretability.

“Human raters found JumpReLU features just as interpretable as Gated or TopK features.”

All three SAE types were rated similarly, confirming that JumpReLU’s gains come without sacrificing interpretability.

Automated Study: Using Gemini Flash, researchers generated textual explanations for each feature and tested whether predicted activations matched the true ones.

Violin plots of Pearson correlation between simulated and ground truth activations.

“In automated evaluation, JumpReLU features correlate strongly with LM-simulated activations, competitive with or better than Gated and TopK SAEs.”

JumpReLU achieves meaningful alignment between textual descriptions and real activation patterns—validating its semantic clarity.
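The correlation score itself is simple to compute; here is a minimal sketch with made-up values standing in for the ground-truth and LM-simulated activations.

```python
import torch

true_acts = torch.tensor([0.0, 2.1, 0.0, 0.4, 3.3])   # ground-truth feature activations (made up)
simulated = torch.tensor([0.1, 1.8, 0.0, 0.6, 2.9])   # activations predicted from the explanation

pearson = torch.corrcoef(torch.stack([true_acts, simulated]))[0, 1]
print(pearson)   # close to 1.0 -> the explanation tracks the feature well
```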


Conclusion: Why JumpReLU Matters

The JumpReLU Sparse Autoencoder steps beyond prior SAE designs, solving a long-standing issue in sparse training. By combining a thresholded linear activation with principled straight‑through gradient estimation, it delivers the following advantages:

  • State‑of‑the‑art fidelity: Reconstructions that match or beat Gated and TopK SAEs at any given sparsity level.
  • Efficiency: Uses simple elementwise operations—no expensive sorts or auxiliary losses—allowing faster training.
  • Interpretability: Retains human and automated clarity comparable to top-performing alternatives.
  • Theoretical foundation: Links STE training to the true gradient of expected loss via kernel density estimation.

The idea extends beyond JumpReLU itself. Training non‑differentiable models using gradients of expected loss could unlock new architectures—those that directly optimize discontinuous objectives like L0 sparsity or custom gating strategies.

As the community continues to decode how large models represent abstractions, techniques like JumpReLU bring us closer to faithful, efficient, and truly interpretable feature discovery—a real leap forward.