Modern AI models—especially large language models like GPT-4—are astonishingly capable. They can write code, summarize research papers, and explain complex ideas. Yet, despite these feats, we don’t really know how they do it. Inside each model lies a labyrinth of billions of parameters forming a network so intricate that its logic is essentially opaque. This has earned AI systems the label of “black boxes.”
Peering inside that box and uncovering the algorithms that drive model behavior is the mission of mechanistic interpretability. Researchers in this field aim to reverse-engineer neural networks—to translate the learned patterns and pathways of artificial neurons into understandable algorithms. But until recently, this process depended heavily on manual experimentation and intuition, with experts spending months tracing which internal components powered specific behaviors.
This manual detective work has proved effective but slow. As AI models grow larger and more sophisticated, it has become clear that we need a scalable way to make interpretability systematic. That’s where automation comes in.
The paper “Towards Automated Circuit Discovery for Mechanistic Interpretability” takes an ambitious step forward. It systematizes the workflow researchers use to find circuits—the internal computational pathways responsible for particular behaviors—and introduces an algorithm that performs one of the most time-consuming parts automatically. The method, called ACDC (Automatic Circuit DisCovery), can identify the interconnected components inside a neural network responsible for specific behaviors.
This blog explores how ACDC works, why it matters, and what its success means for the future of transparent AI.
Unpacking the Mechanistic Interpretability Workflow
Before automation can help, we must first understand the manual workflow. The authors identify three recurring steps that many successful interpretability projects follow when reverse-engineering circuits inside transformer models.
Step 1: Pin Down a Distinct Behavior
The first step is to isolate what you want to understand. A neural network can do thousands of things, but interpretability studies focus on one measurable behavior at a time. Consider GPT‑2 Small’s ability to solve the Greater-Than task—a deceptively simple capability. When prompted with:
“The war lasted from 1517 to 15”
GPT‑2 Small predicts a completion like “18” or “19,” correctly picking a year greater than 17. To analyze this behavior, researchers define:
- A Behavior: A specific phenomenon to study (e.g., indirect-object identification, induction patterns, or year comparison).
- A Dataset: A collection of prompts that repeatedly elicits that behavior, such as many year-pair sentences for Greater‑Than.
- A Metric: A quantitative performance measure for the behavior—such as the probability difference between numbers higher and lower than 17.
By combining these elements, researchers can reliably trigger and study the model’s inner process for a single behavior.
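To make this concrete, here is a minimal sketch of such a metric in Python, assuming a Hugging Face style GPT‑2 tokenizer and a logits tensor for a single prompt. The function name, the year range, and the handling of the boundary year are illustrative choices, not the paper's exact code.

```python
import torch

def greater_than_metric(logits: torch.Tensor, tokenizer, start_year: int = 17) -> float:
    """Probability-difference metric for the Greater-Than task (illustrative sketch).

    `logits` has shape (seq_len, vocab_size) for one prompt such as
    "The war lasted from 1517 to 15". We compare the probability mass the model
    puts on two-digit completions above `start_year` with the mass it puts on
    the remaining two-digit completions.
    """
    probs = torch.softmax(logits[-1], dim=-1)  # next-token distribution at the final position

    # Token ids for "00".."99"; this assumes the tokenizer encodes most
    # two-digit numbers as a single token, as GPT-2's BPE generally does.
    year_ids = [tokenizer.encode(f"{yy:02d}")[0] for yy in range(100)]

    above = sum(probs[year_ids[yy]].item() for yy in range(start_year + 1, 100))
    at_or_below = sum(probs[year_ids[yy]].item() for yy in range(start_year + 1))
    return above - at_or_below
```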

Table of benchmark tasks used by the authors to evaluate automated circuit recovery methods.
Step 2: Represent the Model as a Graph
Transformers aren’t just layers—they form a rich computational graph. In this graph, each node represents a component such as an attention head or MLP, and edges represent how information flows between them. Because of the Transformer’s residual stream, components can interact across both adjacent and distant layers, meaning earlier features can directly influence later computations.
The level of detail in this graph, known as granularity, can vary. At a coarse level we might treat each attention head as a node; at a finer level, we might split each head into its individual query, key, and value vectors. The choice depends on the desired interpretive resolution—higher detail provides deeper insights but increases computational cost.
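The sketch below builds a toy edge list at head-level granularity, where every attention head and MLP can read the residual-stream contributions of the embedding and of all earlier components. The naming scheme and graph construction are illustrative simplifications, not the paper's exact formalization.

```python
def build_component_graph(n_layers: int, n_heads: int) -> list[tuple[str, str]]:
    """Toy head-level computational graph for a transformer (illustrative).

    Each attention head reads what the embedding and every earlier head/MLP
    wrote to the residual stream; each MLP additionally sees its own layer's
    heads; the output reads everything. This is how the residual stream lets
    distant components interact directly.
    """
    edges = []
    earlier = ["embed"]  # components already visible in the residual stream
    for layer in range(n_layers):
        heads = [f"a{layer}.h{h}" for h in range(n_heads)]
        for head in heads:                       # heads read all earlier components
            edges += [(src, head) for src in earlier]
        earlier += heads
        mlp = f"mlp{layer}"                      # the MLP also sees this layer's heads
        edges += [(src, mlp) for src in earlier]
        earlier.append(mlp)
    edges += [(src, "output") for src in earlier]
    return edges

# e.g. a GPT-2 Small-sized graph (12 layers x 12 heads) already has tens of thousands of edges
print(len(build_component_graph(12, 12)))
```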
Step 3: Isolate the Circuit Using Activation Patching
Once the behavior, dataset, and graph are defined, the key question is: which nodes and edges actually matter? To answer this, researchers use activation patching, also called interchange interventions.
The process unfolds like this:
- Pair Inputs. Choose a clean input (one that triggers the behavior) and a corrupted input (one that doesn’t).
- Capture Activations. Run the model on both inputs and record the internal activations—the numerical signals traveling along every edge.
- Patch the Model. Run the model on the clean input again, but overwrite one node or edge’s activation with the corresponding activation from the corrupted input.
- Measure Performance. Check how the metric changes. If performance drops sharply, that component was critical; if it barely changes, the component likely doesn’t contribute.
By repeating this procedure from the output layer backward, researchers iteratively carve away unnecessary edges until only the essential subgraph—the circuit—remains.
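A minimal sketch of this patch-and-measure loop, using PyTorch forward hooks, is shown below. Patching a whole module's output (rather than a single edge) and the `metric` callback are simplifications for illustration, not the exact procedure used in the paper.

```python
import torch

def activation_patch(model, clean_batch, corrupt_batch, module, metric):
    """Overwrite one component's activation with its value on a corrupted input.

    `module` is any submodule of `model` that returns a single tensor;
    `metric` maps model outputs to a scalar task score.
    Returns (clean_score, patched_score).
    """
    cached = {}

    # 1. Record the component's activation on the corrupted input.
    def save_hook(mod, inputs, output):
        cached["act"] = output.detach()

    handle = module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(corrupt_batch)
    handle.remove()

    # 2. Baseline: run the clean input untouched.
    with torch.no_grad():
        clean_score = metric(model(clean_batch))

    # 3. Patched run: replace the component's output with the corrupted activation.
    def patch_hook(mod, inputs, output):
        return cached["act"]

    handle = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_score = metric(model(clean_batch))
    handle.remove()

    # A large drop from clean_score to patched_score marks the component as important.
    return clean_score, patched_score
```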
This method is rigorous but excruciatingly time‑consuming. ACDC was designed to automate this exact step.
ACDC: Automating the Detective Work
ACDC translates the manual patch‑and‑test procedure into an algorithmic process that systematically prunes the computational graph. Starting at the output, it tests every connection to see if removing it significantly affects the model’s performance on the task.

Figure 2: A conceptual overview of the ACDC algorithm. It iteratively prunes unimportant connections until only the task-relevant circuit remains.
How It Works
- Initialize the full graph: Start with all nodes and edges, denoted as \( H = G \).
- Reverse order: Iterate from outputs back to inputs.
- Test each edge: Temporarily remove one edge and perform a patched forward pass as in activation patching.
- Measure impact: Compute the change in KL divergence, which measures how much the patched output distribution differs from the original model’s output.
- Apply pruning rule: \[ D_{KL}(G || H_{\text{new}}) - D_{KL}(G || H) < \tau \] If this difference is smaller than a chosen threshold \( \tau \), the edge is considered unimportant and permanently removed.
- Repeat recursively: Continue testing all edges until the remaining graph is the minimal circuit that retains the behavior.
The result: a dramatically reduced subgraph that preserves performance while exposing the structure of the algorithm the model implements.
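In pseudocode, the loop described above looks roughly like the sketch below. The helper `run_with_edges_removed`, which would perform the patched forward pass with a given set of edges corrupted, and the exact node ordering are assumptions for illustration; the authors' actual implementation lives in their open-source repository.

```python
import math

def kl_divergence(p_logprobs, q_logprobs):
    """KL(P || Q) between two log-probability distributions over the vocabulary."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(p_logprobs, q_logprobs))

def acdc(nodes_reverse_topological, incoming_edges, run_with_edges_removed,
         full_model_logprobs, tau):
    """Sketch of the ACDC pruning loop.

    `run_with_edges_removed(removed)` is assumed to return the model's output
    log-probs when the given set of edges is patched out.
    """
    removed = set()
    current_kl = kl_divergence(full_model_logprobs, run_with_edges_removed(removed))

    for node in nodes_reverse_topological:           # outputs first, inputs last
        for edge in incoming_edges[node]:
            candidate = removed | {edge}
            new_kl = kl_divergence(full_model_logprobs,
                                   run_with_edges_removed(candidate))
            if new_kl - current_kl < tau:            # edge barely matters: prune it
                removed = candidate
                current_kl = new_kl
    return removed                                    # everything not removed is the circuit
```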
Putting ACDC to the Test
To validate whether ACDC really identifies genuine circuits, the authors ask two questions:
- Q1: Does ACDC find the components that genuinely underlie the behavior being studied?
- Q2: Does it avoid including components unrelated to that behavior?
They answer these through two complementary experiments.
Experiment 1: Rediscovering Known Circuits
ACDC was tested on five behaviors where previous research had already mapped out circuits manually—such as Indirect Object Identification (IOI), Greater‑Than, and Docstring Completion. Each edge in the model was treated as either “in the circuit” or “out of the circuit.” By varying the threshold \( \tau \), the researchers created ROC curves plotting true‑positive vs. false‑positive rates.
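Concretely, treating circuit recovery as binary classification over edges, the true-positive and false-positive rates behind these ROC curves can be computed as in the sketch below (the edge sets are placeholders, not data from the paper).

```python
def tpr_fpr(predicted_edges: set, true_circuit: set, all_edges: set) -> tuple[float, float]:
    """Edge-level true/false positive rates against a manually mapped circuit."""
    negatives = all_edges - true_circuit
    tp = len(predicted_edges & true_circuit)   # circuit edges the method kept
    fp = len(predicted_edges & negatives)      # non-circuit edges it kept anyway
    return tp / len(true_circuit), fp / len(negatives)

# Sweeping the ACDC threshold tau yields one (fpr, tpr) point per run,
# which together trace out the ROC curve for a task.
```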
Performance across tasks was compared with two other methods:
- Subnetwork Probing (SP) – a gradient‑based masking technique
- Head Importance Score for Pruning (HISP) – ranking and pruning heads by gradient magnitude

Figure 3: ROC curves comparing ACDC, SP, and HISP. Curves nearest the top-left indicate better recovery of true circuit edges.
ACDC proved highly competitive—and in several cases, superior. For both IOI and Greater‑Than tasks, ACDC achieved the best area-under-curve scores. Most notably, for IOI, it precisely recovered all five component types previously discovered manually.

Figure 1: The power of automated discovery. Left: GPT‑2 Small’s full computational graph. Right: the sparse IOI circuit recovered by ACDC—all components match the manually identified ones.
Experiment 2: Evaluating New Circuits Without Ground Truth
For tasks lacking a pre-mapped circuit, we need intrinsic measures of circuit quality. A good discovered circuit should:
- Maintain fidelity: It should produce nearly the same outputs as the full model (low KL divergence).
- Be concise: It should contain as few edges as possible (high sparsity).
These criteria were tested on the induction task, where the model predicts repeating sequence patterns (e.g., “A B … A → B”). ACDC, SP, and HISP were each run, and their recovered circuits were plotted by number of edges versus KL divergence from the full model.
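For reference, an induction-style prompt can be generated with a few lines of Python; this toy generator over an arbitrary token vocabulary is an illustration, not the paper's dataset.

```python
import random

def make_induction_prompt(vocab: list[str], length: int = 20, seed: int = 0) -> list[str]:
    """A random token sequence repeated once: the second half is predictable
    only via the 'A B ... A -> B' induction rule."""
    rng = random.Random(seed)
    first_half = [rng.choice(vocab) for _ in range(length // 2)]
    return first_half + first_half
```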

Figure 4: Evaluating recovered circuits on the induction task. Points in the lower-left indicate small, high-performance circuits. ACDC (red) forms the Pareto frontier, outperforming alternatives.
Above 20 edges, all points on the Pareto frontier come from ACDC. This means that for any given sparsity level, ACDC found the circuit with the best fidelity—a strong sign that it captures the model’s true computational structure efficiently.
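The Pareto frontier in such a plot can be computed directly from (edge count, KL divergence) pairs; the sketch below uses hypothetical data points purely for illustration.

```python
def pareto_frontier(points):
    """Return the (num_edges, kl_divergence) points not dominated by any other.

    A circuit dominates another if it is at least as sparse and at least as
    faithful, and strictly better on one of the two.
    """
    frontier = []
    for edges, kl in sorted(points):              # sort by edge count, then KL
        if not frontier or kl < frontier[-1][1]:  # strictly more faithful than all sparser circuits
            frontier.append((edges, kl))
    return frontier

# Hypothetical circuits from different methods/thresholds: (num_edges, KL divergence)
candidates = [(12, 0.90), (25, 0.35), (40, 0.20), (40, 0.50), (80, 0.18)]
print(pareto_frontier(candidates))  # -> [(12, 0.9), (25, 0.35), (40, 0.2), (80, 0.18)]
```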
Strengths, Limitations, and the Road Ahead
The results are compelling: ACDC can faithfully rediscover circuits previously found by hand, and it consistently produces efficient, high‑fidelity subgraphs. This is a step toward scaling mechanistic interpretability to modern large models.
However, automation comes with caveats. The authors note that:
- ACDC’s performance can vary depending on hyperparameters and choice of metric.
- It struggles to consistently recover negative components, i.e., parts of a circuit that push against the correct prediction rather than supporting it.
- Some tasks benefit from using zero activations instead of corrupted activations, suggesting the technique’s behavior depends on subtle design choices.
Despite these nuances, the overall takeaway is positive: ACDC works. It transforms a process that once required extensive human trial and error into an algorithm that can systematically explore and extract circuits.
Why This Matters
Mechanistic interpretability is about understanding why models behave the way they do. Automating this process will make it possible to explore larger models and more complex behaviors that were previously infeasible to study manually. This could help researchers:
- Diagnose and debug misaligned behaviors in language models
- Identify emergent patterns and algorithmic primitives inside networks
- Design architectures with more interpretable inner workings
- Build safer and more controllable AI systems
The authors have open‑sourced their code at github.com/ArthurConmy/Automatic-Circuit-Discovery, inviting researchers everywhere to test, extend, and refine ACDC.
Conclusion
The paper “Towards Automated Circuit Discovery for Mechanistic Interpretability” marks a pivotal advance in AI transparency. By codifying the established interpretability workflow and automating its most labor-intensive step, ACDC takes us closer to understanding neural networks not just as mysterious black boxes but as discoverable systems implementing comprehensible algorithms.
Human insight will always remain central—people must interpret what the recovered circuits mean. But now, algorithms can help with the heavy lifting, scanning complex models and revealing the subnetworks that make them tick.
As AI continues to evolve, this synergy between manual understanding and automated discovery will likely define the next frontier of interpretability research: faster, deeper, and ultimately, clearer views into the minds of machines.