Slimming Down Transformers: Revealing Hidden Circuits with Edge Pruning

Large Language Models (LLMs) like GPT-4 and Llama are remarkably powerful—but equally mysterious. We can use them to write essays, generate code, and solve puzzles, yet we rarely know how they reach their conclusions. This “black box” nature makes it difficult to build safer and more reliable AI systems.

Mechanistic interpretability aims to open that box. Instead of treating models as opaque, it studies how internal components—like attention heads and MLPs in Transformers—work together to perform computations. A central idea in this field is the concept of circuits: sparse, focused subgraphs of a model that capture specific behaviors.

Imagine isolating the one tiny circuit in your car responsible for turning on the windshield wipers. If we can identify the corresponding subgraph inside a Transformer responsible for linguistic tasks—like resolving pronouns—we can study, debug, and even improve it.

Finding these circuits, however, is not easy. Earlier methods required manual examination, which doesn’t scale. Automated tools such as ACDC and EAP have advanced the field, but they are either too slow for large models or too approximate to be reliable.

A recent paper from researchers at Princeton University, “Finding Transformer Circuits with Edge Pruning,” proposes a more elegant and scalable solution. Their approach reframes circuit discovery as an optimization problem using methods borrowed from model pruning. The resulting algorithm, called Edge Pruning, finds lean, high-quality circuits efficiently—and, for the first time, scales to a 13-billion-parameter model.


Understanding Circuits and the Challenge of Finding Them

Transformers process information through a sequence of layers—alternating between attention and MLP modules—all connected by a residual stream. At each layer \(i\), the model updates its internal state via

\[ h_{i+1} = h_i + f_i(h_i), \]

where \(h_i\) is the residual stream and \(f_i\) is the operation performed by that layer.
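As a rough illustration (with a linear layer standing in for each attention or MLP block, purely an assumption for the toy example), the same residual update can be written as a short loop:

```python
# Toy residual-stream update h_{i+1} = h_i + f_i(h_i); each Linear layer is a
# stand-in for an attention or MLP block (purely illustrative).
import torch

d_model, n_layers = 8, 4
h = torch.randn(1, d_model)                      # residual stream state
blocks = [torch.nn.Linear(d_model, d_model) for _ in range(n_layers)]

for f in blocks:
    h = h + f(h)                                 # each block writes its output back into the stream
print(h.shape)                                   # torch.Size([1, 8])
```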

We can view these layers and their interactions as a computational graph. Each component (attention head or MLP block) corresponds to a node, and the connections between them correspond to edges. A circuit is a subset of these edges—a sparse version of the full model graph—that still performs a specific function.

A standard Transformer adds outputs directly to the residual stream. Edge Pruning disentangles this stream to allow learnable masks that identify key edges.

Figure 1: The central concept of Edge Pruning. Instead of a dense Transformer (a), learnable masks are optimized along model edges (b), yielding a sparse yet faithful circuit (c).

To determine which edges matter, interpretability researchers use a causal technique called interchange ablation (also known as activation patching). The process involves:

  1. Running a clean input (e.g., “Mary gave the ball to John.”) through the model, saving all activations.
  2. Running a corrupted input with small changes (e.g., “Amy gave the ball to David.”), which should yield a different output.
  3. Patching activations from the corrupted input into specific edges of the clean input run.
  4. Measuring how much the model’s output changes—for example, if the prediction flips from John to David, that edge is important.
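To make the mechanics concrete, here is a minimal sketch of interchange ablation using HuggingFace forward hooks on GPT-2. It is not the paper's code: the layer index is an arbitrary choice, the prompts are trimmed versions of the example above, and for simplicity it patches the entire attention-block output at the final token position rather than a single edge.

```python
# Minimal activation-patching sketch on GPT-2 with forward hooks.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tokenizer("Mary gave the ball to", return_tensors="pt")
corrupt = tokenizer("Amy gave the ball to", return_tensors="pt")
layer = 5  # hypothetical choice of attention block to patch

# 1) Run the corrupted input and cache the attention block's output.
cache = {}
def save_hook(module, inputs, output):
    cache["attn"] = output[0].detach()

handle = model.transformer.h[layer].attn.register_forward_hook(save_hook)
with torch.no_grad():
    model(**corrupt)
handle.remove()

# 2) Re-run the clean input, overwriting that activation at the last position.
def patch_hook(module, inputs, output):
    patched = output[0].clone()
    patched[:, -1] = cache["attn"][:, -1]
    return (patched,) + output[1:]

handle = model.transformer.h[layer].attn.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**clean).logits
handle.remove()

# 3) Compare next-token predictions with and without the patch.
with torch.no_grad():
    clean_logits = model(**clean).logits
print(tokenizer.decode(clean_logits[0, -1].argmax().item()),
      tokenizer.decode(patched_logits[0, -1].argmax().item()))
```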

The ultimate goal is to find a sparse subgraph \( \mathcal{C} \) where the circuit’s output distribution \( p_{\mathcal{C}}(y | x, \tilde{x}) \) closely matches the full model’s \( p_{\mathcal{G}}(y | x) \). The objective is to minimize their divergence while maintaining sparsity:

\[ \arg\min_{\mathcal{C}} \mathbb{E}_{(x, \tilde{x})}\left[D\left(p_{\mathcal{G}}(y | x) \parallel p_{\mathcal{C}}(y | x, \tilde{x})\right)\right], \quad \text{subject to } 1 - \frac{|\mathcal{C}|}{|\mathcal{G}|} \ge c. \]

Limitations of Previous Methods

ACDC performs a greedy search, ablating edges one at a time and discarding those whose removal barely changes the output. This exhaustive process is accurate but prohibitively slow, especially for billion-parameter models.

EAP (Edge Attribution Patching) uses a linear, gradient-based approximation to estimate edge importance. It’s faster but neglects edge dependencies, making it less accurate on complex tasks.

Edge Pruning solves these issues by converting the discrete search into a continuous optimization, combining the accuracy of direct testing with the efficiency of gradient-based learning.


The Edge Pruning Method

Edge Pruning introduces trainable masks over each edge in the Transformer’s computational graph. Rather than removing entire neurons, it selectively prunes connections between components.

For every edge \( j \to i \), a continuous mask \( z_{ji} \in [0,1] \) determines whether the signal from node \( j \) influences node \( i \). During training:

  • \( z_{ji} = 1 \): the edge is active, using clean activation \( y_j \).
  • \( z_{ji} = 0 \): the edge is pruned, replaced by a corrupted activation \( \tilde{y}_j \).
  • Intermediate values interpolate between the two.
\[ y_i = f_i\!\left(z_{0i}y_0 + (1 - z_{0i})\tilde{y}_0 + \sum_{j=1}^{i-1}\left[z_{ji}y_j + (1 - z_{ji})\tilde{y}_j\right]\right). \]

This architecture allows Edge Pruning to perform gradient descent on all masks simultaneously, learning which edges are necessary to reproduce model behavior.
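As a rough sketch (not the authors' implementation, and with assumed tensor shapes), the mixing for a single node might look like this; a single backward pass delivers gradients to every edge mask:

```python
# Illustrative edge-masked input for one node: z[j] interpolates between the
# clean activation y_j and its corrupted counterpart.
import torch

def edge_masked_input(clean_acts, corrupt_acts, z):
    mixed = torch.zeros_like(clean_acts[0])
    for j, (y, y_corrupt) in enumerate(zip(clean_acts, corrupt_acts)):
        mixed = mixed + z[j] * y + (1 - z[j]) * y_corrupt
    return mixed                                  # this sum is what f_i receives

# Three upstream nodes with assumed [batch, seq, d_model] shapes; all edge
# masks receive gradients from one backward pass.
clean = [torch.randn(1, 4, 8) for _ in range(3)]
corrupt = [torch.randn(1, 4, 8) for _ in range(3)]
z = torch.nn.Parameter(torch.full((3,), 0.9))
edge_masked_input(clean, corrupt, z).sum().backward()
print(z.grad)
```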

The Disentangled Residual Stream

A challenge arises: each component now requires a unique combination of previous activations based on its masks. To enable this flexibility, the model’s residual stream is disentangled—instead of maintaining a single vector, it stores a list of all prior activations. When a new layer begins, it dynamically aggregates the relevant inputs weighted by its edge masks. This increases memory requirements but is crucial for accurate, fine-grained pruning.
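A minimal sketch of that data structure, assuming toy shapes and showing only the clean path, might look like the following; keeping the whole list (rather than one running sum) is where the extra memory goes.

```python
# Sketch of a disentangled residual stream: keep every component's output in a
# list so each later node can weight its upstream sources with its own masks.
import torch

batch, seq, d_model, n_nodes = 1, 4, 8, 5
stream = [torch.randn(batch, seq, d_model)]        # node 0: token embeddings

for i in range(1, n_nodes):
    z_i = torch.sigmoid(torch.randn(len(stream)))  # this node's edge masks
    agg = sum(z * y for z, y in zip(z_i, stream))   # node-specific aggregation
    stream.append(torch.tanh(agg))                  # stand-in for f_i's output

print(len(stream), stream[0].shape)                 # the list grows with depth
```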

Optimizing Sparse Circuits

To encourage sparsity, the paper applies \( L_0 \) regularization—a technique that penalizes nonzero masks. Because \( L_0 \) is nondifferentiable, the authors use a hard concrete distribution, which allows masks to be trained continuously while pushing their values toward binary extremes:

\[ \begin{array}{c} u \sim \text{Uniform}(\epsilon, 1-\epsilon), \quad s = \sigma\!\left(\frac{1}{\beta}\log\frac{u}{1-u} + \log\alpha\right), \\ \tilde{s} = s(r - l) + l, \quad z = \min(1, \max(0,\tilde{s})). \end{array} \]
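A small sketch of sampling a mask from this distribution is shown below, following the formulas above; the values of \( \beta \), \( l \), and \( r \) are typical stretch settings rather than necessarily the paper's exact hyperparameters.

```python
# Hard-concrete mask sampling: log_alpha is the trainable per-edge parameter.
import torch

def hard_concrete(log_alpha, beta=0.83, l=-0.1, r=1.1, eps=1e-6):
    u = torch.empty_like(log_alpha).uniform_(eps, 1 - eps)
    s = torch.sigmoid(torch.log(u / (1 - u)) / beta + log_alpha)
    s_stretched = s * (r - l) + l                  # stretch to the interval (l, r)
    return torch.clamp(s_stretched, 0.0, 1.0)      # hard-clip back to [0, 1]

log_alpha = torch.nn.Parameter(torch.zeros(10))    # one parameter per edge
z = hard_concrete(log_alpha)
z.sum().backward()                                 # gradients still reach log_alpha
print(z)
```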

To reach a desired sparsity target \( t \), a Lagrangian term adjusts training pressure:

\[ \mathcal{L}_s = \lambda_1 (t - s) + \lambda_2 (t - s)^2. \]

The final loss combines this regularization with faithfulness loss (KL divergence from the full model). After training, continuous masks are thresholded to produce binary connections—defining the final sparse circuit.
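Putting the pieces together, a simplified version of the training objective might look like this; the helper name, the fixed λ values, and the way expected sparsity is supplied are assumptions for illustration, not the paper's exact setup.

```python
# Combined objective: KL faithfulness to the full model plus the Lagrangian
# sparsity penalty.
import torch
import torch.nn.functional as F

def edge_pruning_loss(circuit_logits, full_logits, expected_sparsity,
                      target_sparsity, lambda1=1.0, lambda2=1.0):
    kl = F.kl_div(F.log_softmax(circuit_logits, dim=-1),
                  F.log_softmax(full_logits, dim=-1),
                  log_target=True, reduction="batchmean")   # D(p_G || p_C)
    gap = target_sparsity - expected_sparsity
    return kl + lambda1 * gap + lambda2 * gap ** 2

loss = edge_pruning_loss(torch.randn(4, 50257), torch.randn(4, 50257),
                         expected_sparsity=torch.tensor(0.95),
                         target_sparsity=0.98)
# After training, each mask is thresholded (e.g. z > 0.5) to a binary edge.
```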


Putting Edge Pruning to the Test

The authors evaluated Edge Pruning on four standard tasks for circuit discovery using GPT-2 Small and compared it to ACDC and EAP.

Measuring Faithfulness

Faithfulness quantifies how closely the circuit mirrors the full model’s behavior, measured by KL divergence—the smaller, the better.

KL divergence vs. edge sparsity for ACDC, EAP, and Edge Pruning.

Figure 2: On complex tasks like Indirect Object Identification (IOI) and Greater Than (GT), Edge Pruning (green) achieves markedly lower KL divergence—meaning higher faithfulness—than ACDC (blue) and EAP (orange).

On simple tasks such as gendered pronoun resolution, Edge Pruning is competitive. On more intricate tasks like IOI and Greater Than (GT), it dramatically outperforms other approaches, maintaining high faithfulness even at heavy sparsity.

Circuit Performance

Having faithful circuits is valuable—but do they still perform the task correctly?

Performance comparison across methods. Higher is better.

Figure 3: Edge Pruning yields circuits that maintain or exceed task performance while being substantially sparser.

Performance data confirms Edge Pruning’s advantage. On IOI, it matched the performance of ACDC’s circuit (96.8% sparsity) while reaching 98.8% sparsity, i.e., using 2.65× fewer edges, which yields a cleaner and more interpretable circuit.

Scaling with Data

To test scalability, the authors expanded the IOI dataset from hundreds to 100,000 examples. Runtime and faithfulness measurements (Table 1) reveal that Edge Pruning not only handles large datasets but thrives at scale.

| Method | Sparsity (%) ↑ | KL ↓ (200 ex.) | Time (s) ↓ (200 ex.) | KL ↓ (400 ex.) | Time (s) ↓ (400 ex.) | KL ↓ (100K ex.) | Time (s) ↓ (100K ex.) |
|---|---|---|---|---|---|---|---|
| ACDC | 96.6 ± 0.1 | 0.92 | 18,783 | 0.88 | 40,759 | — | — |
| EAP | 96.6 ± 0.1 | 3.47 | 21 | 3.66 | 43 | 3.78 | 12,260 |
| Edge Pruning | 96.6 ± 0.1 | 0.25 | 2,756 | 0.22 | 2,931 | 0.20 | 3,042 |

Table 1: Edge Pruning scales efficiently to 100K examples—faster and far more faithful than prior methods.

ACDC improves marginally with extra data but is slow. EAP remains speedy yet inaccurate. Edge Pruning offers both efficiency and quality, proving suitable for large-scale interpretability problems.

Recovering Ground-Truth Circuits

The authors further validated Edge Pruning against Tracr, a framework that compiles human-written algorithms into tiny Transformer models. These “compiled Transformers” have known ground-truth circuits—perfect test cases.

Ground-truth circuits recovered by Edge Pruning.

Figure 4: Edge Pruning successfully reconstructed ground-truth circuits for two Tracr-compiled programs: proportional counting and list reversal.

Edge Pruning recovered both circuits flawlessly, confirming that its optimization procedure reliably discovers genuine underlying mechanisms.


Scaling to 13 Billion Parameters: CodeLlama Case Study

Most interpretability tools struggle with models beyond a few hundred million parameters. To demonstrate practical scalability, the researchers applied Edge Pruning to CodeLlama-13B, over 100× larger than GPT-2.

Task Setup

The study examined how CodeLlama evaluates Boolean Expressions (e.g., the prompt “((not False) and True) or False is”, which the model should complete with “True”). The authors compared two prompting styles:

  • Instruction-Prompted (IP): model given a direct instruction.
  • Few-Shot (FS): model given examples.

Edge Pruning was applied separately to each setting to identify the circuits responsible for reasoning.

| Circuit | Num. edges ↓ | Accuracy, IP (%) ↑ | Accuracy, FS (%) ↑ | Exact Match, IP (%) ↑ | Exact Match, FS (%) ↑ |
|---|---|---|---|---|---|
| Full Model | 3,872,820 | 82.00 | 89.25 | 100.00 | 100.00 |
| Instruction Prompt (IP) | 1,041 | 79.25 | 74.50 | 90.00 | 79.00 |
| Few-Shot (FS) | 1,464 | 75.75 | 87.25 | 84.50 | 91.25 |
| IP ∩ FS | 653 | 72.50 | 68.25 | 79.75 | 72.50 |

Table 2: Edge Pruning isolates circuits with over 99.96% sparsity that closely match full-model performance.

Key Findings

  1. Extreme Sparsity: Circuits retained <0.04% of the model’s edges yet performed within a few percent of the original model.
  2. Shared Mechanisms: The two circuits overlap substantially, with 62.7% of the instruction-prompted circuit’s edges also appearing in the few-shot circuit, indicating similar underlying reasoning for few-shot and instruction-driven behavior.
  3. Cross-Setting Robustness: The few-shot circuit performs well even under instruction prompting, suggesting a unified internal mechanism.

This case study represents a breakthrough: scalable circuit discovery in a 13-billion-parameter model, revealing that the same neural “wiring” likely governs instruction-prompted and few-shot behavior in large models.


Implications and Outlook

Edge Pruning marks a major milestone in mechanistic interpretability:

  • Accuracy: It identifies more faithful circuits than earlier methods.
  • Efficiency: Its runtime grows only slightly as the training data scales from hundreds to 100K examples, and it remains tractable for large models.
  • Validity: It exactly recovers known ground-truth circuits in Tracr-compiled models.
  • Scalability: It operates at unprecedented model sizes, revealing interpretable structure in 13B-parameter Transformers.

The method’s main limitation lies in memory overhead—disentangling the residual stream demands extra GPU resources. Moreover, even highly sparse circuits in massive models can still contain thousands of edges, leaving full manual interpretation challenging. Still, these are trade-offs worth making: we gain unprecedented visibility into inner model mechanics.

As models grow ever larger, tools like Edge Pruning will be essential for helping researchers see how information flows and decisions form inside neural networks. The ability to find and examine circuits—at any scale—moves us closer to a future where we not only use AI effectively but understand it deeply.