Large Language Models (LLMs) like GPT‑4 can write poetry, debug code, and explain complex scientific concepts. But if you ask how they do it, the answer is often a shrug. These models are famously “black boxes”—massive networks filled with billions of parameters, where intricate computations occur beyond human comprehension.
The field of mechanistic interpretability seeks to change that. Its goal is to reverse‑engineer LLMs—not just observing what they output, but understanding how they compute those outputs internally. Central to this effort is the idea of a circuit: a sparse subgraph of neurons and weights responsible for a specific behavior, such as identifying an indirect object or recognizing an induction pattern.
Yet circuit analysis faces a major challenge when traversing the Multi‑Layer Perceptron (MLP) sublayers within Transformer models. Neurons in MLPs are often polysemantic—one neuron might activate for both “Polish surnames” and “chess moves.” Tracing a single interpretable computation through this tangled web becomes nearly impossible.
A promising workaround has been to shift focus from individual neurons to features—directions in a model’s activation space discovered using tools like Sparse Autoencoders (SAEs). These provide meaningful abstractions but introduce a new problem: while one can identify interpretable features before and after an MLP layer, it’s hard to describe the general rule that connects them. The connection changes depending on the input.
This is the puzzle tackled by Transcoders Find Interpretable LLM Feature Circuits. The researchers introduce transcoders, a novel architecture that preserves interpretability while unlocking a powerful new way to analyze how features connect through MLPs. Transcoders allow analysts to separate two kinds of information: the timeless, input‑invariant structure of the model and the context‑specific, input‑dependent activations within it.
Understanding the Transformer and the MLP Problem
Transformers process text through a stack of layers, each containing two main sublayers: an attention sublayer and an MLP sublayer.
- The attention sublayer lets each token “look at” other tokens, gathering context.
- The MLP sublayer operates independently on each token’s hidden representation.
Every layer adds its outputs to the residual stream—a shared information pathway flowing through the model. Attention and MLP results are added to this stream rather than replacing it, meaning each token’s hidden state accumulates all prior computations.
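To make this additive picture concrete, here is a minimal sketch of a GPT‑2‑style layer's forward pass. It is a simplification for illustration only (some models, such as Pythia, apply the attention and MLP sublayers in parallel rather than in sequence), but both variants add their results to the residual stream:

```python
import torch
import torch.nn as nn

class TransformerLayerSketch(nn.Module):
    """Illustration of how a layer writes into the residual stream."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, d_mlp: int = 3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model)
        )

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # The attention result is ADDED to the residual stream, not substituted.
        normed = self.ln1(resid)
        attn_out, _ = self.attn(normed, normed, normed)
        resid = resid + attn_out
        # The MLP acts on each token independently and also adds its result.
        resid = resid + self.mlp(self.ln2(resid))
        return resid
```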

Figure 1: A Transformer layer with parallel Attention and MLP paths. SAEs reconstruct activations, while transcoders imitate the MLP’s input‑output behavior.
This additive structure aids training but complicates reverse engineering. The hidden state at any point is a superposition of all preceding computations. In the MLP sublayer, polysemantic neurons generate further entanglement; a single concept is scattered across thousands of neurons.
Sparse Autoencoders (SAEs)
SAEs mitigate polysemanticity by representing activations as a sparse linear combination of “feature” vectors. Each feature corresponds to an interpretable concept like English surnames or times of day. Instead of analyzing thousands of neurons, researchers can analyze hundreds of meaningful features.
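As a rough sketch (not the exact architecture or hyperparameters used in the paper), an SAE encodes an activation vector into a sparse feature code and tries to reconstruct the same vector from it:

```python
import torch
import torch.nn as nn

class SparseAutoencoderSketch(nn.Module):
    """Toy SAE: reconstruct an activation vector from a sparse feature code."""

    def __init__(self, d_model: int = 768, n_features: int = 24576):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, act: torch.Tensor):
        z = torch.relu(self.encoder(act))   # sparse feature activations
        recon = self.decoder(z)             # reconstruction of the same activation
        return recon, z

def sae_loss(act, recon, z, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages few active features.
    return ((act - recon) ** 2).sum(-1).mean() + l1_coeff * z.abs().sum(-1).mean()
```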
But SAEs have a shortcoming—a lack of input invariance. A connection found between two SAE features (for example, a “Polish surname” feature and a “citation” feature) may hold for one input but not universally. Due to nonlinearities in the intervening MLP, attributions fluctuate from input to input. We cannot reliably describe a general rule that applies across all data.
Transcoders address this gap.
How Transcoders Work
A transcoder is structurally similar to an SAE: a wide, one‑hidden‑layer ReLU MLP. The difference lies in the objective. While SAEs reconstruct their input, a transcoder is trained to imitate the forward computation of the original MLP sublayer itself—mapping from the MLP’s input to its output.
Mathematically:
\[ \mathbf{z}_{TC}(\mathbf{x}) = \mathrm{ReLU}(W_{enc}\mathbf{x} + b_{enc}) \]
\[ \mathrm{TC}(\mathbf{x}) = W_{dec}\,\mathbf{z}_{TC}(\mathbf{x}) + b_{dec} \]
Here, \( \mathbf{x} \) is the MLP's input. \( W_{enc} \) encodes it into a sparse activation vector \( \mathbf{z}_{TC} \), and \( W_{dec} \) decodes those activations to reconstruct the MLP's output.
The transcoder’s training loss combines two terms:
\[ \mathcal{L}_{TC}(\mathbf{x}) = \underbrace{\|\mathrm{MLP}(\mathbf{x}) - \mathrm{TC}(\mathbf{x})\|_2^2}_{\text{faithfulness loss}} + \underbrace{\lambda_1 \|\mathbf{z}_{TC}(\mathbf{x})\|_1}_{\text{sparsity penalty}} \]
The faithfulness term ensures the transcoder closely matches the original MLP's output; the sparsity term keeps activations interpretable.
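Translating the equations above into code, a minimal transcoder sketch might look as follows; the hidden width, penalty coefficient, and training details are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn

class TranscoderSketch(nn.Module):
    """Wide one-hidden-layer ReLU network trained to imitate an MLP sublayer."""

    def __init__(self, d_model: int = 768, n_features: int = 24576):
        super().__init__()
        self.W_enc = nn.Linear(d_model, n_features)   # W_enc, b_enc
        self.W_dec = nn.Linear(n_features, d_model)   # W_dec, b_dec

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.W_enc(x))   # sparse feature activations z_TC(x)
        out = self.W_dec(z)             # approximation of MLP(x)
        return out, z

def transcoder_loss(mlp_out, tc_out, z, l1_coeff=1e-3):
    faithfulness = ((mlp_out - tc_out) ** 2).sum(-1).mean()  # ||MLP(x) - TC(x)||^2
    sparsity = z.abs().sum(-1).mean()                        # ||z_TC(x)||_1
    return faithfulness + l1_coeff * sparsity
```

The key difference from the SAE sketch earlier is the target: the loss compares the transcoder's output to the original MLP sublayer's output on the same input, rather than to the input itself.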
Separating Input‑Dependent and Input‑Invariant Behavior
Once trained, transcoders reveal a striking property: their attributions for how one feature influences another can be factored into input‑dependent and input‑invariant parts.
For transcoder feature \(i\) in layer \(l\) influencing feature \(i'\) in layer \(l'\):
\[ \underbrace{z_{TC}^{(l,i)}(\mathbf{x}_{mid}^{(l,t)})}_{\text{input-dependent}} \; \underbrace{\left(\mathbf{f}_{dec}^{(l,i)} \cdot \mathbf{f}_{enc}^{(l',i')}\right)}_{\text{input-invariant}} \]
- The first term depends on the current input: how strongly feature \(i\) is active.
- The second term is purely structural—a constant dot product between decoder and encoder vectors—revealing how features connect in general across all inputs.
This clean separation enables input‑independent circuit analysis: identifying stable wiring between concepts inside the MLP.
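Assuming two transcoders trained as sketched above, the factored attribution can be read off from their weights and a single forward pass. The helper below is a simplified illustration that considers only the direct residual-stream path; the paper's full method also accounts for attention paths and LayerNorm:

```python
import torch

def feature_attribution(tc_early, tc_late, x_mid, i, i_prime):
    """Contribution of early-layer feature i to later-layer feature i' at one token.

    tc_early, tc_late : TranscoderSketch instances for layers l and l'
    x_mid             : the MLP input at layer l for the token of interest
    """
    # Input-dependent part: how strongly feature i fires on this input.
    z = torch.relu(tc_early.W_enc(x_mid))
    activation_i = z[i]

    # Input-invariant part: a fixed dot product between the decoder vector of
    # feature i and the encoder vector of feature i'.
    f_dec = tc_early.W_dec.weight[:, i]      # decoder column for feature i
    f_enc = tc_late.W_enc.weight[i_prime]    # encoder row for feature i'
    structural = f_dec @ f_enc

    return activation_i * structural, structural
```

The second return value is the input-invariant "wiring" weight; because it does not depend on the data at all, it can be inspected once and reused across every prompt.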
Discovering Circuits Automatically
Using the transcoder’s attribution formula, the authors introduce a recursive algorithm to find complete feature circuits across the model.

Figure 2: Progressive pruning to extract the most influential feature paths across layers and tokens.
High‑level steps (a code sketch follows the list):
- Begin with a target feature (e.g., a “citation semicolon” feature).
- Compute contributions from earlier‑layer features to this feature’s activation.
- Retain only the top‑\(k\) contributors.
- For each retained feature, repeat the process backward through layers.
- Merge all paths into a single graph that represents an interpretable circuit.
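The sketch below captures the shape of that recursive, greedy search. It abstracts over attention paths, LayerNorm, and error terms that the paper's full algorithm handles, and `contributions_to` is a hypothetical helper standing in for the attribution computation shown earlier:

```python
def find_circuit(target, k=5, max_depth=3, depth=0, graph=None):
    """Greedy backward search for the most influential feature paths.

    target : (layer, feature_index, token_position) identifying a transcoder feature
    graph  : accumulates edges (source -> target, attribution) across recursive calls
    """
    if graph is None:
        graph = {}
    if depth == max_depth or target[0] == 0:
        return graph

    # Hypothetical helper: attribution of every earlier-layer feature to `target`,
    # using the input-dependent * input-invariant factorization described above.
    scores = contributions_to(target)          # {source_feature: attribution}

    # Keep only the top-k contributors and recurse on each of them.
    top = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
    for source, attribution in top:
        graph[(source, target)] = attribution
        find_circuit(source, k=k, max_depth=max_depth, depth=depth + 1, graph=graph)

    return graph
```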
This automated circuit discovery extends beyond previous SAE or gradient‑based approaches: it builds a sparse, hierarchical map of how computations flow through the network.
Are Transcoders as Effective as SAEs?
A new interpretability tool is only valuable if it maintains or improves fidelity and clarity relative to existing ones. The authors systematically compared transcoders and SAEs across interpretability, sparsity, and faithfulness.
Interpretability Study
To test human interpretability, they ran a blind evaluation on features from the Pythia‑410M model. Fifty random transcoder features and fifty SAE features were mixed and examined without revealing their origin. Evaluators judged whether each feature corresponded to a coherent concept based on top‑activating text examples.

Table 1: Transcoder features were slightly more interpretable and fewer were uninterpretable compared to SAE features.
Transcoders produced a small but consistent improvement—more clearly interpretable features and fewer opaque ones—suggesting that imitating MLP functions preserves semantic clarity.
Sparsity–Faithfulness Trade‑Off
The team then evaluated how well each method approximates the model’s original performance. They compared sparsity (average number of active features, \(L_0\)) versus the model’s next‑token loss when replacing MLP layers with their learned approximations.

Figure 3: Transcoders (orange) match or outperform SAEs (blue) on fidelity at comparable sparsity across model scales.
Across GPT‑2 Small, Pythia‑410M, and Pythia‑1.4B, transcoders were at least as faithful as SAEs at equivalent sparsity levels, and often slightly better. The advantage widens as models scale, indicating that transcoders generalize effectively to larger architectures.
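Conceptually, this evaluation swaps the learned approximation in for the MLP during a forward pass and records both the next-token loss and the average number of active features. The hook-based mechanics below are a simplified sketch under the assumption that `model` returns logits directly, not the authors' actual harness:

```python
import torch

@torch.no_grad()
def evaluate_substitution(model, transcoder, layer_mlp, tokens):
    """Replace one MLP sublayer's output with the transcoder's and measure fidelity.

    layer_mlp : the MLP module whose output we intercept (model-specific)
    tokens    : a batch of token ids, shape (batch, seq)
    """
    l0_counts = []

    def swap_mlp_output(module, inputs, output):
        tc_out, z = transcoder(inputs[0])                  # imitate MLP(x)
        l0_counts.append((z > 0).float().sum(-1).mean())   # average active features
        return tc_out                                      # overwrite the MLP output

    handle = layer_mlp.register_forward_hook(swap_mlp_output)
    try:
        logits = model(tokens)
        loss = torch.nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            tokens[:, 1:].reshape(-1),
        )
    finally:
        handle.remove()

    return loss.item(), torch.stack(l0_counts).mean().item()
```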
Case Studies: Transcoders in Action
With a validated toolset, the authors applied transcoders to interpret previously mysterious circuits. Two highlights reveal their capabilities.
1. Blind Case Study – The Citation Feature
The team conducted blind case studies: identifying what a feature represents without ever looking at actual text, relying solely on circuit and de‑embedding analysis.
They targeted feature tc8[355] in GPT‑2 Small. The circuit‑finding algorithm identified lower‑layer contributors linked to:
- Semicolon tokens (“;”)
- Years (“1973”, “1967”)
- Surnames (“kowski”, “chenko”, “Burnett”)
- Parentheses (“(”)
Combining these clues, they hypothesized the pattern (Surname Year;)—a semicolon within parenthetical citations.
Subsequent inspection of activating examples confirmed it precisely:
instances such as (Poeck, 1969; Rinn, 1984) and (Robinson et al., 1984; Starkstein et al., 1988) triggered the feature strongly.
This validated transcoders as tools for feature discovery even without textual cues.
2. The “Greater‑Than” Circuit in GPT‑2 Small
The classic “greater‑than” problem asks how GPT‑2 predicts that in “The war lasted from 1737 to 17…”, the next token must form a number greater than 37. Earlier studies identified MLP layer 10 as crucial; the transcoder method revisited it.
Using circuit attribution and Direct Logit Attribution (DLA), the authors pinpointed specific transcoder features responsible for comparing numbers.

Figure 5: Each feature activates on a range of two‑digit year tokens and boosts logits for subsequent years.
These features amplified the probabilities of numbers greater than the one they activated on—for example, firing on “45” increased likelihoods for “46”, “47”, etc.—revealing a structured, interpretable mechanism underlying numeric comparison.
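As a rough illustration of this kind of direct logit attribution, one can project a transcoder feature's decoder vector through the unembedding matrix to see which year tokens it boosts. The tensor names `W_U` and `year_token_ids` are assumptions about the setup, not the paper's code:

```python
import torch

def feature_logit_effect(transcoder, W_U, year_token_ids, feature_idx, top_n=5):
    """Which output tokens does one transcoder feature push up when it fires?

    W_U            : unembedding matrix, shape (d_model, vocab_size)
    year_token_ids : token ids for the two-digit year candidates ("00".."99")
    """
    f_dec = transcoder.W_dec.weight[:, feature_idx]   # feature's output direction
    logit_effect = f_dec @ W_U                        # contribution to every logit
    year_effects = logit_effect[year_token_ids]       # restrict to year tokens

    boosted = torch.topk(year_effects, top_n).indices
    return boosted  # indices into year_token_ids of the most-boosted years
```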
The researchers then compared circuits composed of transcoder features versus raw neurons:

Figure 4: Transcoder circuits recover similar performance using far fewer components than raw neuron circuits.
With only 10–20 transcoder features, the circuit recovered most of the model's original "greater‑than" behavior, using far fewer components than the 40–60 raw neurons needed for comparable performance. This demonstrates transcoders' advantage in sparsity and conceptual clarity.
Conclusion
The paper Transcoders Find Interpretable LLM Feature Circuits marks a major advance in mechanistic interpretability. By training sparse, wide MLPs to imitate original sublayers, transcoders bridge a long‑standing gap between feature discovery and circuit understanding.
Key takeaways:
- They disentangle input‑invariant structure (model wiring) from input‑dependent activation (context effects).
- They maintain or exceed SAE‑level interpretability and faithfulness.
- They enable automatic circuit discovery across both MLPs and attention heads.
Through examples like the citation and greater‑than circuits, transcoders reveal how intricate language model computations can reduce to small, understandable feature graphs.
Transcoders are not the final word—scaling to larger models and refining their approximations remain open challenges—but they provide a powerful new lens for viewing the inner workings of LLMs. As research progresses, this approach may transform interpretability from an art of intuition into a science of transparent algorithms.