Large Language Models (LLMs) like GPT-4 or Llama are often described as “black boxes.” We know they work—they can write poetry, debug code, and tell you the capital of France—but we don’t fully understand how they store that information. When an LLM “knows” that a cafe is a type of restaurant, is that fact stored in a specific cluster of neurons? Or is it smeared across the entire network like jam on toast?
If we could answer this question, the implications would be massive. We could surgically correct factual errors without expensive retraining. We could remove toxic knowledge or dangerous capabilities while leaving the rest of the model’s intelligence intact.
This is the core problem addressed in the paper “Discovering Knowledge-Critical Subnetworks in Pretrained Language Models” by researchers at EPFL. They investigate whether specific pieces of relational knowledge are encoded by sparse subnetworks—tiny, specific subsets of the model’s parameters.
In this deep dive, we will walk through their methodology for finding these knowledge-critical subnetworks, the math behind their differentiable masking technique, and what happens to a model when you surgically remove a specific fact.
The Hypothesis: Knowledge is Localized (Sort of)
The central hypothesis of this paper is that relational knowledge (facts like “A car is a vehicle”) isn’t just random noise in the weights. The authors propose that for any given set of knowledge, there exists a knowledge-critical subnetwork.
This is a specific, sparse computational subgraph within the model. If you remove it, the model should lose the specific knowledge it encodes, but—and this is crucial—it should function perfectly fine otherwise.

As illustrated in Figure 1, the goal is a targeted “lobotomy.”
- Target Knowledge (TARGETKG): The specific facts we want to suppress (e.g., “A cafe is a type of restaurant”).
- Control Knowledge (CONTROLKG): Related facts we want to keep (e.g., “A car is a vehicle”).
- Language Modeling (CONTROLLM): General ability to speak English (e.g., grammar, syntax).
If the researchers can find a set of weights (marked with red Xs in the figure) that, when removed, destroys the TARGETKG pathway but leaves the CONTROLKG and CONTROLLM pathways intact, they have found a knowledge-critical subnetwork.
The Background: Knowledge Graphs and Triplets
Before we look at the neural network surgery, we need to understand how the researchers define “knowledge.” They rely on Knowledge Graphs (KGs). A KG represents facts as triplets: (Head, Relation, Tail).

For example, (Paris, IsA, City). To feed this into a language model, these triplets are “verbalized” into natural language sentences like “Paris is a city.” The model is then tested on its ability to predict the tail token (e.g., “city”) given the context “Paris is a…”.
The researchers used datasets from WordNet and ConceptNet to create these target knowledge sets.
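To make this setup concrete, here is a minimal sketch of scoring one verbalized triplet with GPT-2. It assumes the Hugging Face transformers library, and the prompt template is an illustrative stand-in rather than the paper's exact preprocessing.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Hypothetical triplet from a WordNet/ConceptNet-style knowledge graph.
head, relation, tail = ("cafe", "IsA", "restaurant")

# Verbalize the triplet; this template is illustrative, not the paper's exact prompt.
templates = {"IsA": "A {} is a type of"}
prompt = templates[relation].format(head)           # "A cafe is a type of"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Ask the model for its next-token distribution and check the probability of the tail.
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)

tail_id = tokenizer.encode(" " + tail)[0]           # first BPE token of the tail word
print(f"P('{tail}' | '{prompt}') = {probs[tail_id].item():.4f}")
```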
The Core Method: Differentiable Masking
How do you find a needle in a haystack when the haystack has 1.5 billion straws (parameters)? You can’t manually test every weight. You need an automated way to search the network.
The authors use a technique called Differentiable Masking.
1. The Mask
For a pre-trained model with parameters \(\theta\), the researchers want to learn a binary mask \(m\) consisting of 0s and 1s.
- If \(m_i = 1\), the weight is kept.
- If \(m_i = 0\), the weight is removed (pruned).
The pruned network behaves as \(f(x, m \odot \theta)\), where \(\odot\) is element-wise multiplication.
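As a rough picture of what \(f(x, m \odot \theta)\) means in code, here is a minimal PyTorch sketch. The choice of which parameter tensors receive a mask is simplified for illustration; the paper restricts the search to specific GPT-2 weight matrices.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# One mask tensor per masked weight matrix; all-ones means "keep everything".
# For illustration we mask every 2-D parameter, which is broader than the paper does.
masks = {
    name: torch.ones_like(param)
    for name, param in model.named_parameters()
    if param.dim() == 2
}

def apply_masks(model, masks):
    """Realize f(x, m ⊙ θ) by zeroing the pruned weights in place."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

# Prune one arbitrary connection (set its mask entry to 0), then apply the masks.
first = next(iter(masks))
masks[first][0, 0] = 0.0
apply_masks(model, masks)
```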
2. Making it Learnable (The Gumbel-Softmax Trick)
The problem is that you can’t use standard backpropagation (gradient descent) on binary values (0 or 1). A step function isn’t differentiable. To solve this, the researchers assign a real-valued score \(l_i\) to every parameter and use the Gumbel-Softmax distribution to simulate a binary choice while allowing gradients to flow.
They calculate a continuous score \(s_i\) (probability of being 1):
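One common way to write this relaxation, which may differ slightly from the paper's exact parameterization, is the binary Gumbel-Softmax (binary-concrete) sample:

\[
s_i = \sigma\!\left(\frac{l_i + \log u_i - \log(1 - u_i)}{\tau}\right), \qquad u_i \sim \mathrm{Uniform}(0, 1),
\]

where \(\sigma\) is the sigmoid function and \(\tau\) is a temperature controlling how sharply \(s_i\) approaches 0 or 1.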

Then, to actually apply it as a hard mask during the forward pass, they use a “straight-through estimator.” This rounds the score to 0 or 1 for the calculation but keeps the gradients of the continuous score for the backward pass.
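In symbols, a generic straight-through step looks like this (a standard form rather than the paper's verbatim notation):

\[
m_i = \mathbb{1}\!\left[s_i > 0.5\right] \;\;\text{in the forward pass}, \qquad \frac{\partial m_i}{\partial l_i} \approx \frac{\partial s_i}{\partial l_i} \;\;\text{in the backward pass}.
\]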

This clever mathematical trick allows the system to “learn” which weights to turn off using standard training loops.
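Here is a minimal PyTorch sketch of that trick. The logistic noise, the 0.5 threshold, and the temperature are illustrative choices standing in for the paper's exact hyperparameters.

```python
import torch

def sample_hard_mask(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Sample a {0, 1} mask whose gradients still flow back to `logits`."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)           # logistic (Gumbel-difference) noise
    soft = torch.sigmoid((logits + noise) / tau)     # continuous score s_i in (0, 1)
    hard = (soft > 0.5).float()                      # binary mask used in the forward pass
    # Straight-through estimator: the value is `hard`, the gradient is taken through `soft`.
    return hard + soft - soft.detach()

# Toy usage: learnable scores for a 4x4 weight matrix.
scores = torch.zeros(4, 4, requires_grad=True)
theta = torch.randn(4, 4)
mask = sample_hard_mask(scores)
(mask * theta).sum().backward()                      # gradients reach `scores`
print(mask)
print(scores.grad)
```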
The Three Pillars of the Objective Function
To find the correct subnetwork, the researchers train the mask using a loss function composed of three distinct goals. This is the heart of the paper’s contribution.
Objective 1: Suppression (Kill the Target Knowledge)
The primary goal is to make the model “forget” the target facts. The researchers achieve this by pushing the pruned model’s predictions on the target knowledge away from the original model’s confident answers. Concretely, they minimize the KL divergence between the pruned model’s output distribution and a uniform distribution over the vocabulary.
In simple terms: When asked “A cafe is a…”, the model shouldn’t say “restaurant.” It should be as confused as if it were guessing randomly from the dictionary.
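Schematically, with \(p_{m \odot \theta}(\cdot \mid x)\) the pruned model’s next-token distribution for a verbalized target triplet \(x\) and \(\mathcal{U}\) the uniform distribution over the vocabulary, the suppression term can be written as follows (the KL direction here is a plausible choice, not necessarily the paper’s):

\[
\mathcal{L}_{\text{suppress}} = \mathbb{E}_{x \sim \text{TARGETKG}}\left[ D_{\mathrm{KL}}\!\left( \mathcal{U} \,\|\, p_{m \odot \theta}(\cdot \mid x) \right) \right].
\]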

Objective 2: Maintenance (Don’t Break the Brain)
If we only optimized for suppression, the mask might just turn off every weight in the model. A lobotomized model knows nothing, which technically solves the suppression task!
To prevent this, they add a maintenance loss. The pruned model must match the behavior of the original pre-trained model as closely as possible on Control Knowledge and on general Language Modeling data.
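Written out, with \(p_{\theta}\) the original model and \(p_{m \odot \theta}\) the pruned one, this gives one KL term per control set \(\mathcal{C}\) (a paraphrase of the objective rather than the paper’s exact notation):

\[
\mathcal{L}_{\mathcal{C}} = \mathbb{E}_{x \sim \mathcal{C}}\left[ D_{\mathrm{KL}}\!\left( p_{\theta}(\cdot \mid x) \,\|\, p_{m \odot \theta}(\cdot \mid x) \right) \right], \qquad \mathcal{C} \in \{\text{CONTROLKG}, \text{CONTROLLM}\}.
\]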

This forces the mask to be surgical. It can cut the “cafe” connection, but it cannot touch the “grammar” or “car” connections.
Objective 3: Sparsity (Keep it Small)
Finally, they want the smallest possible subnetwork. They add a regularization term that penalizes the number of weights kept.
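A typical form of such a penalty is simply the fraction of weights kept (or its continuous surrogate using the scores \(s_i\)); the exact regularizer in the paper may differ:

\[
\mathcal{L}_{\text{sparsity}} = \frac{1}{|\theta|} \sum_{i} m_i.
\]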

The Final Formula
Combining these, the final loss function balances these competing interests using lambda (\(\lambda\)) weights:
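The formula below is a schematic paraphrase of that combination rather than the paper’s verbatim notation:

\[
\mathcal{L} = \mathcal{L}_{\text{suppress}} + \lambda_{\text{KG}}\, \mathcal{L}_{\text{CONTROLKG}} + \lambda_{\text{LM}}\, \mathcal{L}_{\text{CONTROLLM}} + \lambda_{\text{sparse}}\, \mathcal{L}_{\text{sparsity}}.
\]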

Experimental Results: Did it work?
The researchers tested this method on GPT-2 models of various sizes (Small, Medium, Large, XL). They compared two approaches:
- Weight Masking: Pruning individual connections (weights).
- Neuron Masking: Pruning entire neurons (all weights connecting to a specific neuron).
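To make the difference in granularity concrete, here is a toy sketch; the shapes and indices are arbitrary and not taken from the paper.

```python
import torch

weight = torch.randn(768, 768)              # one projection matrix in a transformer block

# Weight masking: an independent keep/prune decision for every single connection.
weight_mask = torch.ones_like(weight)
weight_mask[12, 345] = 0.0                  # prune exactly one connection
pruned_by_weight = weight * weight_mask

# Neuron masking: one decision per neuron, wiping out all of its incoming connections.
neuron_mask = torch.ones(weight.shape[0], 1)
neuron_mask[12] = 0.0                       # prune every connection into neuron 12
pruned_by_neuron = weight * neuron_mask     # broadcasts across the whole row
```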
Result 1: Weight Masking is King
The results were compelling. As shown in Table 2, the Weight Masking method discovered extremely sparse subnetworks, keeping roughly 98.6% of the model intact (meaning the targeted “knowledge” was carried by only about 1.4% of the weights).

Look at the TARGETKG \(\Delta\)PPL column. This measures how much “worse” (higher perplexity) the model became at the target knowledge. A score of 590.9 means the model became completely clueless about the target facts.
Crucially, look at CONTROLKG \(\Delta\)PPL. It is near zero (-0.2). This means the model’s ability to understand other facts remained virtually unchanged.
Neuron Masking (masking whole neurons) was much less effective. It struggled to separate the target knowledge from the control knowledge, suggesting that individual neurons are “polysemantic”—they hold multiple different concepts at once. You can’t kill a neuron to remove a fact without killing other facts too. Individual weights, however, offer the fine-grained control needed.
Result 2: It Scales
The method wasn’t just a fluke on small models. The researchers tested it on GPT-2 Large and XL and found consistent results.

In Table 16, we see that even for GPT-2 XL (1.5 billion parameters), the method successfully spiked the perplexity on target knowledge (high \(\Delta\)PPL) while keeping control knowledge stable.
Anatomy of a Memory: Where is the Knowledge?
So, where were these “knowledge-critical” weights hiding? The authors analyzed the structure of the discovered subnetworks.
They found that the critical weights were not randomly distributed. They clustered significantly in the Attention Heads, particularly in the first and last layers that were masked.

Figure 3 shows a heatmap of mask density. Brighter colors mean more weights were removed (i.e., were critical for the knowledge). You can see distinct bands at Layer 7 and Layer 12, specifically in the Attention Output (Attn-Out) and the Key/Query/Value projections (Attn-Wq, Wk, Wv).
This aligns with previous mechanistic interpretability research suggesting that early layers process low-level features and later layers consolidate specific outputs, with attention heads acting as the routing mechanism for information.
Deep Dive into Attention Heads
When zooming in on specific attention heads (Figure 5), the localization becomes even more apparent.

Notice the bright yellow squares? For the “representation” knowledge graph (middle row), Head 10 in Layer 7 is lit up. This suggests that this specific attention head plays a massive, disproportionate role in processing that specific type of relational knowledge.
Is the Subnetwork Real or Just a Hack?
A common critique in this field is “spurious correlations.” Did the mask actually find the knowledge circuitry, or did it just find a “hack” to break the model’s output for those specific sentences?
To test this, the authors performed a sensitivity analysis (Figure 2). They took the remaining model (with the knowledge removed) and started randomly pruning more weights.

If the subnetwork was just a fragile “hack,” you might expect the knowledge to suddenly reappear or the model to collapse if you messed with it further. Instead, the suppression remained robust. The TargetKG perplexity (dark lines in the first column) stayed high, while Control perplexity (second column) only degraded slowly, matching the random baseline. This suggests the surgery was indeed structural and precise.
Uniqueness of Subnetworks
Another fascinating finding: If you run the experiment three times with different random seeds, do you find the same weights?
Surprisingly, no.

As shown in Figure 8, the overlap (intersection) between seeds is tiny—often less than 4%.
This implies that knowledge in LLMs is highly redundant. There isn’t just one circuit for “A cafe is a restaurant.” There are likely many parallel pathways. The masking algorithm finds one sufficient set of weights to break the knowledge, but different seeds find different pathways. To fully erase a concept, you might need to target the union of these subnetworks.
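That “union” idea is simple to express with masks. Here is a toy sketch; the 1.4% pruning rate only mirrors the sparsity discussed above and the masks are random, not real discovered subnetworks.

```python
import torch

torch.manual_seed(0)

# Three hypothetical masks found with different seeds (1 = keep, 0 = prune),
# each pruning roughly 1.4% of the weights.
seed_masks = [(torch.rand(768, 768) > 0.014).float() for _ in range(3)]

# To erase the union of the pruned sets, keep a weight only if every seed kept it.
union_mask = seed_masks[0] * seed_masks[1] * seed_masks[2]

print(f"Pruned by one seed:  {1 - seed_masks[0].mean().item():.3%}")
print(f"Pruned by the union: {1 - union_mask.mean().item():.3%}")
```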
Consequences: Can the Model Relearn?
If we surgically remove the knowledge that CommonsenseQA requires, can the model figure it out anyway via context or finetuning?
The researchers tested this on the CommonsenseQA dataset. They removed subnetworks critical for specific ConceptNet relations and then tried to finetune the model on the task.

The results in Table 6 are stark.
- Full Model: ~37-48% accuracy.
- Weight Mask Removed: Significant drop in accuracy (-6.8% to -14.4% depending on tuning method) on questions requiring the suppressed knowledge (“Filtered” column).
Even with finetuning (LoRA or Head Tuning), the model struggled to recover the lost knowledge. It seems that once these critical weights are gone, the model loses the underlying “scaffolding” required to process or relearn those specific relationships efficiently.
Conclusion: The Path Forward
This paper provides a compelling step forward in “Mechanistic Interpretability.” It moves us away from treating LLMs as mysterious black boxes and toward viewing them as collections of discoverable, functional circuits.
Key Takeaways:
- Knowledge is Sparse: We can delete specific facts by removing <2% of the model’s weights.
- Weights > Neurons: Individual weights are the atomic unit of knowledge storage; neurons are too broad and polysemantic.
- Redundancy: There are multiple independent circuits encoding the same knowledge (evidenced by the low overlap between seeds).
- Transfer: Removing these weights genuinely impairs the model’s ability to use that knowledge in downstream reasoning tasks.
For students and researchers, this opens up exciting avenues. Could we build “unlearning” algorithms to make AI safer? Could we update old knowledge (e.g., “The Prime Minister is…”) without retraining the whole model? By identifying these knowledge-critical subnetworks, we are one step closer to truly understanding the ghost in the machine.