Inside the Black Box: Mapping Knowledge Neurons in LLMs

Large Language Models (LLMs) like GPT-4 and Llama have demonstrated a remarkable ability to store and recall factual knowledge. When you ask an LLM, “What is the capital of France?”, it effortlessly retrieves “Paris.” But where exactly does this information live? Is “Paris” stored in a specific cluster of neurons? And if so, how does the model know when to activate them?

Understanding the mechanisms of knowledge storage is the “Holy Grail” of mechanistic interpretability. If we can pinpoint the exact neurons responsible for specific facts, we could theoretically edit out hallucinations or update outdated information without expensive retraining.

However, finding these neurons is like searching for a needle in a digital haystack. Current attribution techniques, such as integrated gradients or causal tracing, are often too computationally expensive to apply to the millions of neurons in modern LLMs.

In the paper “Neuron-Level Knowledge Attribution in Large Language Models,” researchers Zeping Yu and Sophia Ananiadou propose a novel, efficient framework for solving this problem. They introduce a static method for pinpointing “value neurons” (where knowledge is stored) and “query neurons” (what triggers that knowledge). Their work offers a granular map of information flow within Transformer models.

Background: The Anatomy of a Transformer

To understand how the researchers track knowledge, we must first understand the fundamental units of a Transformer model. An LLM processes an input sentence \(X = [t_1, t_2, ..., t_T]\) through a series of layers. Each layer consists primarily of two sub-modules: Multi-Head Self-Attention (MHSA) and a Feed-Forward Network (FFN).

The flow of information through a layer can be described by the residual connection equation:

\[
h_i^l = h_i^{l-1} + A_i^l + F_i^l
\]

Here, \(h_i^{l-1}\) is the input from the previous layer, \(A_i^l\) is the output of the attention mechanism, and \(F_i^l\) is the output of the FFN.

Finally, at the very end of the model (layer \(L\), position \(T\)), the output vector is projected into the vocabulary space to predict the next token:

\[
p(w \mid X) = \mathrm{softmax}\!\left(E\, h_T^L\right)
\]

where \(E\) is the unembedding matrix whose rows \(e_w\) are the output embeddings of the vocabulary tokens.
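To make this bookkeeping concrete, here is a minimal NumPy sketch of the residual stream and the final projection. The shapes and random stand-in weights are purely illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, n_layers = 16, 50, 4

# Toy stand-ins for the model's components (random, for illustration only).
unembed = rng.normal(size=(vocab_size, d_model))                 # E: one row e_w per vocabulary token
attn_out = [rng.normal(size=d_model) for _ in range(n_layers)]   # A_T^l at the last position
ffn_out = [rng.normal(size=d_model) for _ in range(n_layers)]    # F_T^l at the last position

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Residual stream: each layer *adds* its attention and FFN outputs to the running vector.
h = rng.normal(size=d_model)              # h^0: embedding of the last input token
for l in range(n_layers):
    h = h + attn_out[l] + ffn_out[l]      # h^l = h^{l-1} + A^l + F^l

# Final prediction: project the last residual state into vocabulary space.
probs = softmax(unembed @ h)              # p(w | X) = softmax(E h_T^L)
print("predicted token id:", probs.argmax())
```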

Defining the “Neuron”

In biological brains, a neuron is a cell. In Transformers, the definition is more mathematical.

1. FFN Neurons: Researchers often view the Feed-Forward Network as a Key-Value memory system. The output of an FFN layer is a weighted sum of specific vectors:

\[
F_i^l = \sum_{k} m_{i,k}^l \, fc2_k^l
\]

In this context, a neuron is defined as the subvalue vector \(fc2_k^l\) (a column in the second linear layer). Its activation, or “coefficient score” \(m_{i,k}^l\), is determined by how well the input matches a corresponding “subkey” \(fc1_k^l\) (a row in the first linear layer):

\[
m_{i,k}^l = \sigma\!\left( fc1_k^l \cdot \left( h_i^{l-1} + A_i^l \right) \right)
\]

where \(\sigma\) is the FFN's activation function and \(h_i^{l-1} + A_i^l\) is the FFN's input (the residual stream after attention).
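A short sketch with toy weights, verifying that the standard FFN forward pass is exactly this coefficient-weighted sum over fc2 columns. The activation function (GELU) and shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ffn = 16, 64

W_fc1 = rng.normal(size=(d_ffn, d_model))   # rows fc1_k are the subkeys
W_fc2 = rng.normal(size=(d_model, d_ffn))   # columns fc2_k are the subvalues ("value neurons")
x = rng.normal(size=d_model)                # FFN input: h^{l-1} + A^l at one position

gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

m = gelu(W_fc1 @ x)                         # coefficient scores m_k = sigma(fc1_k . x)
ffn_out = W_fc2 @ m                         # standard FFN forward pass

# The same output, rebuilt neuron by neuron: sum_k m_k * fc2_k
neuron_sum = sum(m[k] * W_fc2[:, k] for k in range(d_ffn))
print(np.allclose(ffn_out, neuron_sum))     # True: the FFN output is a weighted sum of its neurons
```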

2. Attention Neurons: Similarly, the attention mechanism calculates a weighted sum of outputs from different heads:

\[
A_i^l = \sum_{h=1}^{H} \sum_{j=1}^{T} \alpha_{i,j}^{l,h}\, W_o^{l,h} W_v^{l,h}\, h_j^{l-1}
\]

where \(\alpha_{i,j}^{l,h}\) is the attention weight that position \(i\) assigns to position \(j\) in head \(h\).

The researchers extend the neuron definition to attention heads. They regard the columns of the output matrix \(W^o\) as “attention subvalues” (neurons) and the rows of the value matrix \(W^v\) as subkeys.
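Under the same toy assumptions, a single head's output can be rebuilt as a coefficient-weighted sum over the output-projection vectors; whether those are "rows" or "columns" of \(W^o\) depends only on the storage convention. The sketch below stores \(W_o\) as head-dimension by model-dimension:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head, seq_len = 16, 8, 5

W_v = rng.normal(size=(d_model, d_head))    # value projection; its input directions act as subkeys
W_o = rng.normal(size=(d_head, d_model))    # each row is an "attention subvalue" (neuron) here
X = rng.normal(size=(seq_len, d_model))     # residual-stream inputs h_j^{l-1}
alpha = rng.random(size=seq_len)
alpha /= alpha.sum()                        # attention weights for the last position

V = X @ W_v                                 # value vectors per position
weighted_v = alpha @ V                      # attention-weighted value vector, shape (d_head,)
head_out = weighted_v @ W_o                 # standard head output

# Rebuilt as a sum over attention neurons: coefficient * subvalue
neuron_sum = sum(weighted_v[k] * W_o[k, :] for k in range(d_head))
print(np.allclose(head_out, neuron_sum))    # True
```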

Core Method: Tracing the Flow of Knowledge

The central challenge is determining which of these millions of neurons actually contributed to a specific prediction (e.g., predicting “Paris”).

The Problem with Existing Metrics

A common intuition is to look for neurons with high activation scores. However, the researchers argue that looking at activation alone is misleading. A neuron might have a high activation score but contribute generic information that doesn’t help distinguish “Paris” from “London.”

To understand this, we have to look at how a neuron vector \(v\) affects the probability distribution when added to the residual stream \(x\). The researchers analyze the “before-softmax” (bs) values:

\[
bs(x) = E\,x = \left[\, e_1 \cdot x,\; e_2 \cdot x,\; \dots,\; e_{|V|} \cdot x \,\right]
\]

where \(|V|\) is the vocabulary size.

The probability of a specific word \(w\) is calculated using the softmax function over these values:

\[
p(w \mid x) = \frac{\exp\!\left(e_w \cdot x\right)}{\sum_{u \in V} \exp\!\left(e_u \cdot x\right)}
\]

When a neuron \(v\) adds its information to the stream (\(x + v\)), the probability shifts:

\[
p(w \mid x + v) = \frac{\exp\!\left(e_w \cdot (x + v)\right)}{\sum_{u \in V} \exp\!\left(e_u \cdot (x + v)\right)}
\]

Crucially, the change in the before-softmax values is linear:

\[
bs(x + v) = bs(x) + bs(v), \quad \text{since } e_w \cdot (x + v) = e_w \cdot x + e_w \cdot v \text{ for every token } w.
\]

However, the resulting probability change is non-linear. The researchers illustrate this with a hypothetical example (Table 1) showing that the impact of a neuron depends heavily on the existing state of the residual stream (\(x\)). A neuron that simply adds a constant value to all tokens might not change the ranking at all. Conversely, a neuron with lower raw values might drastically shift the probability if it targets the specific correct token while suppressing others.
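A quick numeric check of that argument (the logits below are made up, not taken from any model):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

bs_x = np.array([2.0, 1.5, 0.5])                      # before-softmax values for ["Paris", "London", "Rome"]
print(softmax(bs_x))                                   # baseline distribution

# Neuron 1: adds a large constant to every token -> probabilities are unchanged.
print(softmax(bs_x + np.array([5.0, 5.0, 5.0])))

# Neuron 2: smaller raw values, but targeted: boosts "Paris", suppresses the rest.
print(softmax(bs_x + np.array([1.0, -1.0, -1.0])))
```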

The Solution: Log Probability Increase

To capture this complexity without running expensive gradients, the researchers propose a new metric: Log Probability Increase.

They define the importance (\(Imp\)) of a neuron (or layer vector) \(v^l\) as the difference in log probability of the correct token \(w\) when the vector is included versus when it is not:

\[
Imp\!\left(v^l\right) = \log p\!\left(w \mid v^l + h^{l-1}\right) - \log p\!\left(w \mid h^{l-1}\right)
\]

This metric accounts for both the neuron’s contribution and the context of the residual stream (\(h^{l-1}\)). It asks: “How much ‘surprise’ does this neuron remove regarding the correct answer?”
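A minimal sketch of this scoring, reusing the toy FFN from earlier. Each candidate value neuron's contribution is its coefficient times its fc2 column, and `h_prev` stands for the residual stream before that contribution is added; all names, shapes, and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_ffn, vocab_size = 16, 64, 50

unembed = rng.normal(size=(vocab_size, d_model))    # E, rows e_w
h_prev = rng.normal(size=d_model)                   # residual stream before the neuron writes into it
W_fc2 = rng.normal(size=(d_model, d_ffn))           # columns fc2_k (value neurons)
m = rng.normal(size=d_ffn)                          # coefficient scores (made-up values)
w = 7                                               # id of the correct token

def log_prob(w, x):
    """log p(w | x): project x into vocabulary space and take the log-softmax entry for w."""
    logits = unembed @ x
    peak = logits.max()
    return logits[w] - (peak + np.log(np.exp(logits - peak).sum()))

def importance(v):
    """Log Probability Increase of adding vector v on top of the existing residual stream."""
    return log_prob(w, h_prev + v) - log_prob(w, h_prev)

# Score each FFN neuron by the importance of its weighted subvalue m_k * fc2_k.
scores = np.array([importance(m[k] * W_fc2[:, k]) for k in range(d_ffn)])
top_value_neurons = scores.argsort()[::-1][:10]
print(top_value_neurons)
```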

Identifying Query Neurons

Finding the neurons that store the value (the “Value Neurons”) is only half the battle. We also need to know what activated them. The researchers introduce a method to find “Query Neurons.”

Since the activation of a value neuron depends on the inner product between the input and its subkey (the coefficient-score equation above), the researchers calculate the inner product between the subkey of each significant value neuron and the neurons in the previous layers. If a neuron in a previous layer has a high inner product with the value neuron's subkey, it is identified as a Query Neuron.
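A rough sketch of that matching step under the same toy shapes (`prev_m`, `prev_W_fc2`, and the top-10 cutoff are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, d_ffn = 16, 64

# Deep layer: the subkey fc1_k of an already-identified value neuron.
value_neuron_subkey = rng.normal(size=d_model)

# Earlier layer: each candidate neuron writes m_k * fc2_k into the residual stream.
prev_W_fc2 = rng.normal(size=(d_model, d_ffn))
prev_m = rng.normal(size=d_ffn)
prev_contributions = prev_m * prev_W_fc2            # column k holds m_k * fc2_k

# Query score: inner product between each candidate's contribution and the value neuron's subkey.
query_scores = value_neuron_subkey @ prev_contributions
top_query_neurons = query_scores.argsort()[::-1][:10]
print(top_query_neurons)
```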

Together, query and value neurons form a complete circuit view:

  1. Query Neurons (usually in shallower layers) extract features from the input.
  2. They activate Value Neurons (usually in deeper layers).
  3. Value Neurons write the factual information into the residual stream.

The diagram below illustrates this architecture, showing the flow from FFN query neurons and Attention neurons up to the final FFN value neurons.

Diagram illustrating Query Neurons, Attention Neurons, and Value Neurons in a Transformer architecture.

Experiments and Results

To validate their method, the researchers tested it on the TriviaQA dataset using two models: GPT2-large and Llama-7B. They extracted sentences where the models correctly predicted answers related to six categories: language, capital, country, color, number, and month.

Comparison with Other Methods

They compared their “Log Probability Increase” method against seven other static attribution methods (such as raw probability, norm, coefficient score, etc.). The evaluation involved identifying the top 10 most important FFN neurons according to each method, “turning them off” (setting parameters to zero), and measuring the drop in the model’s performance.

The Logic: If the method truly finds the important neurons, turning them off should break the model’s ability to answer the question.
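As a hedged illustration of this knockout protocol, the sketch below zeroes a few FFN value neurons in the Hugging Face GPT-2 implementation, where `transformer.h[l].mlp.c_proj.weight` stores the fc2 subvalues row-wise. The prompt and the (layer, neuron) pairs are placeholders, not neurons identified in the paper:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

prompt = "The capital of France is"
answer_id = tok.encode(" Paris")[0]
inputs = tok(prompt, return_tensors="pt")

def answer_prob():
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return torch.softmax(logits, dim=-1)[answer_id].item()

print("before knockout:", answer_prob())

# "Turn off" a handful of FFN value neurons by zeroing their fc2 rows.
top_neurons = [(30, 1024), (32, 512), (34, 2048)]   # placeholder (layer, neuron) pairs
with torch.no_grad():
    for layer, k in top_neurons:
        model.transformer.h[layer].mlp.c_proj.weight[k, :] = 0.0

print("after knockout:", answer_prob())
```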

The results were decisive:

Table showing that the proposed method causes the largest drop in MRR and Probability compared to other methods.

As shown in Table 2 (above), the proposed method (row ‘a’) caused the most significant drop in performance. For Llama-7B, the probability of the correct token plummeted from 55.1% to 9.2% by removing just 10 neurons. This vastly outperformed methods like simple coefficient scores (row ’e’) or vector norms (row ’d’).

Where are the Neurons Located?

Using this validated method, the researchers mapped the distribution of important neurons across the models’ layers.

Graph showing neuron distribution across layers in Llama-7B.

Figure 2 reveals a striking trend: Important value neurons are overwhelmingly concentrated in the deep layers (layers 20-32 in Llama-7B).

Interestingly, the graph compares “Log Prob Increase” (Red) vs. “Prob Increase” (Green). The “Prob Increase” method biases heavily toward the very last layers. The researchers explain this using the theoretical curves below:

Curves showing the difference between log probability increase and raw probability increase.

Figure 3 shows that as the model gets closer to the solution (the segment index increases), the raw probability (right) shoots up exponentially only at the very end. Log probability (left), however, grows more linearly. This makes “Log Probability Increase” a more sensitive tool for detecting important contributions in the medium-to-deep layers, not just the final layer.
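A tiny numeric illustration of why the two curves diverge: hold the competing vocabulary mass fixed and let the correct token's before-softmax value rise steadily (all numbers are made up):

```python
import numpy as np

# Toy contrast between p and log p as the correct token's before-softmax value rises.
C = np.exp(10.0)                       # fixed "rest of the vocabulary" mass (illustrative)
bs_w = np.linspace(0, 12, 7)           # correct token's logit growing across segments

p = np.exp(bs_w) / (np.exp(bs_w) + C)
print(np.round(p, 4))                  # near zero for most of the range, then jumps at the end
print(np.round(np.log(p), 2))          # climbs roughly linearly across the whole range
```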

Analyzing Knowledge Storage

The researchers broke down their analysis by knowledge type (e.g., Capitals vs. Colors) to see if different types of facts are stored differently.

Layer-Level Attribution

The heatmap below visualizes the importance of different layers for GPT2. Darker colors indicate higher importance.

Heatmap of top 10 important value layers in GPT2.

Figure 4 (and the table included within) confirms that knowledge is stored in deep layers. Moreover, it shows semantic clustering. Notice how “Language,” “Capital,” and “Country” (semantic-heavy tasks) light up similar layers (e.g., layers 26 and 30). In contrast, “Number” and “Color” rely on different sets of layers. This suggests the model organizes knowledge physically by semantic category.

The same pattern holds true for Llama-7B, as seen in Figure 5:

Heatmap of top 10 important value layers in Llama.

Neuron-Level Sparsity

One of the most profound findings is the sparsity of knowledge. The researchers found that while a model has billions of parameters, the specific knowledge for a query is handled by a tiny fraction of them.

Table showing the importance of top neurons in Attention and FFN layers.

Table 5 shows the cumulative importance score. The “Top 200” neurons (a microscopic fraction of the total) account for an importance score almost equal to that of “All” neurons combined. This confirms that knowledge is not diffusely spread across the entire network but is localized in specific, retrievable points.

The “Query” Layer Analysis

Finally, the researchers looked at which layers act as the “Query” signal—the layers that activate the deep value neurons.

Heatmap of top 10 important query layers in GPT2.

Figure 6 (GPT2) and Figure 7 (Llama, below) show that medium-deep attention layers play a massive role in querying.

Heatmap of top 10 important query layers in Llama.

This creates a clear picture of the information processing pipeline:

  1. Shallow/Medium Layers: Process syntax and context (Query Neurons).
  2. Deep Layers: Retrieve specific factual associations (Value Neurons).

When the researchers analyzed the “Query” neurons specifically, they found that unlike Value neurons (which often map directly to output words like “Paris”), Query neurons are less interpretable. They seem to function as abstract triggers rather than holding the content themselves.

Conclusion and Implications

This research provides a static, computationally efficient method to open the “black box” of Large Language Models. By differentiating between Value Neurons (the storage) and Query Neurons (the key), and using Log Probability Increase as a metric, the authors were able to outperform existing attribution methods significantly.

The key takeaways are:

  1. Deep Storage: Factual knowledge is predominantly stored in the deep layers of the network.
  2. Semantic Locality: Similar types of knowledge (e.g., geography) are stored in overlapping regions.
  3. Sparsity: A tiny number of neurons (often fewer than 300) are responsible for specific predictions. Intervening on these few neurons can completely alter the model’s output.
  4. Distinct Roles: Medium layers act as queries that unlock the facts stored in deep layers.

Why does this matter? This level of granularity is the foundation for Model Editing. If we can reliably locate the 300 neurons responsible for a specific hallucination or a biased association, we can surgically alter them without retraining the entire model. This moves us one step closer to safer, more interpretable, and more reliable Artificial Intelligence.