The interior of a Large Language Model (LLM) is often described as a “black box.” We know what goes in (a prompt) and we know what comes out (a coherent continuation), but the billions of calculations in between remain largely opaque.
For students and researchers in Natural Language Processing (NLP), this opacity is a problem. If we don’t know how a model works, we can’t fully trust it, fix it when it hallucinates, or prevent it from exhibiting bias.
In recent years, the field of Mechanistic Interpretability has emerged to crack this code. We have successfully studied Attention mechanisms (how models move information between words) and Multi-Layer Perceptrons (MLPs) (how models store and process knowledge) in isolation. But these two components don’t work alone; they are woven together in the Transformer architecture.
This post deep-dives into a fascinating research paper, “Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions,” which shines a light on exactly how these two components talk to each other. The researchers uncover a specific mechanism where Attention heads “look up” context and signal specific MLP neurons to fire, effectively guiding the model’s next word prediction.
The Problem: Divided We Fall
To understand the paper’s contribution, we first need to look at the standard view of the Transformer. A Transformer layer typically consists of two main sub-layers:
- Attention Mechanism: This allows the model to look at other words in the sentence to gather context. It effectively says, “To understand the word ‘bank’ here, I need to look at the word ‘river’ earlier in the sentence.”
- MLP (Feed-Forward Network): This processes the information gathered. Recent research suggests MLPs act as Key-Value Memories, storing facts and linguistic patterns.
Historically, interpretability research has treated these independently. We analyze attention maps to see “what the model is looking at,” or we analyze MLP neurons to see “what concept activates this neuron.”
The Missing Link: The researchers argue that this separation is artificial. In reality, an Attention head might detect a specific pattern (context) and pass that information directly to an MLP neuron to execute a prediction.

Figure 1 illustrates the core hypothesis. Notice the flow:
- Context Detection: Attention heads (the colorful trapezoids) scan the input tokens (\(x_0...x_k\)) looking for specific contexts (e.g., “if context A”).
- Handoff: The output of the attention head is added to the residual stream.
- Activation: This specific signal activates a “Next-Token Neuron” in the MLP layer.
- Prediction: That neuron promotes a specific word (token) in the output vocabulary.
This suggests a collaborative “circuit” where the Attention head acts as the detector (“I see a context implying a comparison!”) and the MLP neuron acts as the executor (“I will predict the word ’than’!”).
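To make the handoff concrete, here is a minimal sketch of a pre-LayerNorm decoder block in PyTorch (the class, dimensions, and omission of the causal mask are simplifications for illustration, not the paper's code). The attention output is added to the residual stream, and the MLP then reads that same stream.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal pre-LayerNorm Transformer block (illustrative; causal mask omitted)."""
    def __init__(self, d_model: int, n_heads: int, d_mlp: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp),   # rows of this weight are the neurons' *input* directions
            nn.GELU(),
            nn.Linear(d_mlp, d_model),   # columns of this weight are the neurons' *output* directions
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. Attention "looks up" context and writes its result into the residual stream.
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        x = x + attn_out
        # 2. The MLP reads the updated stream: a next-token neuron can fire
        #    precisely because of what attention just wrote.
        x = x + self.mlp(self.ln2(x))
        return x

block = DecoderBlock(d_model=64, n_heads=4, d_mlp=256)
out = block(torch.randn(1, 10, 64))   # (batch, sequence, d_model)
```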
Background: The Building Blocks
Before dissecting the methodology, let’s establish a few foundational concepts used in the paper.
1. Next-Token Neurons
Not all neurons in an LLM are easy to interpret. However, researchers have identified a specific class called Next-Token Neurons. These are MLP neurons whose output weights align almost perfectly with the embedding of a specific word in the vocabulary.
When a “Next-Token Neuron” fires, it directly increases the probability of that specific word being generated next. For example, there might be a neuron dedicated solely to increasing the probability of the word “basketball.”
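A toy example makes the "direct increase" visible. The numbers below are random, `basketball_id` is an arbitrary index, and nothing here comes from a real model; the point is only that when a neuron whose output weight aligns with one token's embedding fires, that token's probability rises.

```python
import torch

torch.manual_seed(0)
d_model, vocab_size = 64, 1000
E = torch.randn(vocab_size, d_model)                  # toy token (un)embedding matrix
basketball_id = 42                                    # arbitrary stand-in for the "basketball" token
w_out = E[basketball_id] / E[basketball_id].norm()    # neuron output direction ~ that token's embedding

residual = torch.randn(d_model)                       # residual stream before the MLP writes
for activation in (0.0, 5.0):                         # neuron silent vs. neuron firing
    logits = E @ (residual + activation * w_out)      # unembed the (possibly updated) stream
    probs = torch.softmax(logits, dim=-1)
    print(f"activation={activation}: P(basketball token) = {probs[basketball_id]:.4f}")
```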
2. The Interpretability Pipeline
The authors utilize a modern interpretability approach that leverages stronger models (like GPT-4) to explain weaker models (like GPT-2). This is sometimes called “Auto-Interpretability.” The logic is:
- Find a mysterious component in GPT-2.
- Show its behavior to GPT-4.
- Ask GPT-4: “What is this component doing?”
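In practice, "showing its behavior" means packaging prompts on which the component is active and prompts on which it is not into a single query for the explainer model. The template below is a sketch of that idea; the wording is mine, not the paper's exact prompt.

```python
def build_explanation_prompt(active_examples, inactive_examples):
    """Assemble a query asking a stronger LLM to explain a component's behavior (illustrative template)."""
    lines = [
        "The component under study is ACTIVE on some snippets and INACTIVE on others.",
        "",
        "ACTIVE examples:",
        *[f"- {ex}" for ex in active_examples],
        "",
        "INACTIVE examples:",
        *[f"- {ex}" for ex in inactive_examples],
        "",
        "In one sentence, describe the pattern that makes the component active.",
    ]
    return "\n".join(lines)

prompt = build_explanation_prompt(
    active_examples=["as early as the 1990s", "as far back as 2003"],
    inactive_examples=["such as apples and pears", "so long as it rains"],
)
print(prompt)   # this string would then be sent to GPT-4
```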
The Methodology: A 5-Step Deep Dive
The core of this paper is a rigorous, five-step pipeline designed to find and verify these Attention-MLP interactions. Let’s walk through it.

Figure 2 provides the roadmap. We will break down each step below.
Step 1: Identify Next-Token Neurons
First, we need to find the “executors”—the neurons in the MLP layers that are trying to predict specific words. The researchers look at the last few layers of the model (specifically the last 5 layers of GPT-2 Large).
They define a congruence score (\(s_i\)) for each neuron to measure how strongly it maps to a specific token in the vocabulary.
\[ s_i = \max_{t} \ \langle \mathbf{w}_{\mathrm{out}}^i, \mathbf{e}^t \rangle \]
In this equation:
- \(\mathbf{w}_{\mathrm{out}}^i\) is the output weight of neuron \(i\).
- \(\mathbf{e}^t\) is the embedding vector of token \(t\).
- The dot product \(\langle \cdot, \cdot \rangle\) measures similarity.
If a neuron has a massive score for the word “apple,” it is effectively an “apple-predicting neuron.”

As shown in Figure 3, these high-scoring neurons are overwhelmingly found in the final layers of the model (layers 30-35 in GPT-2 Large). This makes sense; the earlier layers process abstract concepts, while the final layers must prepare the specific words for output.
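A rough sketch of this scan with Hugging Face transformers (the thresholds, any normalization, and the exact neuron-selection criteria the authors use may differ): compute the congruence of every neuron in the last five layers against GPT-2's tied token embeddings and keep the top scorers.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2-large")
E = model.transformer.wte.weight                    # (vocab, d_model), tied with the unembedding
n_layers = model.config.n_layer                     # 36 for GPT-2 Large

candidates = []
with torch.no_grad():
    for layer in range(n_layers - 5, n_layers):     # only scan the last 5 layers
        W_out = model.transformer.h[layer].mlp.c_proj.weight   # (d_mlp, d_model): one row per neuron
        cong = W_out @ E.T                          # (d_mlp, vocab): <w_out_i, e_t>; chunk this for bigger models
        s, best_tok = cong.max(dim=-1)              # s_i = max_t congruence, plus the winning token
        for neuron in torch.topk(s, k=3).indices.tolist():
            candidates.append((layer, neuron, tok.decode([best_tok[neuron].item()]), s[neuron].item()))

# Highest-congruence neurons and the single token each one promotes
for layer, neuron, token, score in sorted(candidates, key=lambda c: -c[-1])[:5]:
    print(f"layer {layer:2d}, neuron {neuron:5d} -> {token!r}  (s = {score:.2f})")
```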
Step 2: Find Max-Activating Prompts
Once we have a target neuron (e.g., a neuron that predicts “go”), we need to know what triggers it. The researchers run thousands of prompts from “The Pile” (a massive dataset) through the model and pick the top 20 prompts that make this neuron fire the hardest.
However, prompts can be long and noisy. To isolate the exact trigger, the researchers perform Prompt Truncation.

Figure 4 demonstrates this beautifully. The original prompt is a long news article about a baseball stadium. The truncation process removes tokens from the beginning of the text, keeping the shortest suffix that still retains at least 80% of the neuron’s original activation.
- Original: Long article … “baseball stadiums can come and…”
- Truncated: “while baseball stadiums can come and…”
This leaves us with a concise trigger: the phrase “come and,” which strongly implies the next word “go.”
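A sketch of the truncation loop on GPT-2 (the layer and neuron indices are hypothetical placeholders). Finding the top-20 prompts is just ranking this same activation over a corpus such as The Pile; here a forward pre-hook on the MLP's output projection exposes the neuron's post-GELU activation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2-large")
LAYER, NEURON = 33, 1234                      # hypothetical target neuron, not from the paper

def neuron_activation(text: str) -> float:
    """Post-GELU activation of one MLP neuron at the final token position."""
    captured = {}
    # The input to mlp.c_proj is the post-activation hidden state, shape (batch, seq, d_mlp).
    hook = model.transformer.h[LAYER].mlp.c_proj.register_forward_pre_hook(
        lambda mod, inp: captured.__setitem__("a", inp[0][0, -1, NEURON].item())
    )
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    hook.remove()
    return captured["a"]

def truncate_prompt(text: str, threshold: float = 0.8) -> str:
    """Drop tokens from the left while the neuron keeps >= threshold of its original activation."""
    ids = tok(text)["input_ids"]
    original = neuron_activation(text)
    kept = text
    for start in range(1, len(ids)):
        candidate = tok.decode(ids[start:])
        if neuron_activation(candidate) < threshold * original:
            break
        kept = candidate
    return kept

print(truncate_prompt("The old park is gone, but while baseball stadiums can come and"))
```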
Step 3: Attribute to Attention Heads
Now for the critical link. We have the neuron (the destination) and the prompt (the trigger). Which Attention head (the source) sent the signal?
The researchers calculate a Head Attribution Score. This measures how much a specific attention head contributed to the activation of that neuron.
\[ a_{(i,k),(j,l)} = \langle h_{i,k}, e_{j,l} \rangle \]
Intuitively, we are looking for a high dot product between the attention head’s output (\(h_{i,k}\)) and the neuron’s input weights (\(e_{j,l}\)). If the score is high, it means the attention head is “shouting” in the exact direction that the neuron is “listening.”
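The sketch below illustrates the attribution idea on GPT-2 (the indices are placeholders, and the LayerNorm between the residual stream and the MLP is ignored for simplicity): split the attention layer's pre-projection output into per-head pieces, map each piece through its slice of the output projection to get that head's write into the residual stream, then dot it with the neuron's input weight.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2-large")
LAYER, NEURON = 33, 1234                                           # hypothetical target neuron
n_heads = model.config.n_head
d_head = model.config.n_embd // n_heads

# The neuron's *input* weight: column NEURON of W_in (c_fc maps d_model -> d_mlp).
w_in = model.transformer.h[LAYER].mlp.c_fc.weight[:, NEURON]       # (d_model,)

def head_attributions(text: str, attn_layer: int) -> torch.Tensor:
    """Dot product of each head's residual-stream write (last token) with the neuron's input weight."""
    captured = {}
    attn = model.transformer.h[attn_layer].attn
    # The input to attn.c_proj is the concatenation of per-head outputs, shape (batch, seq, d_model).
    hook = attn.c_proj.register_forward_pre_hook(
        lambda mod, inp: captured.__setitem__("z", inp[0][0, -1])
    )
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    hook.remove()
    z = captured["z"].view(n_heads, d_head)                        # per-head pre-projection outputs
    W_proj = attn.c_proj.weight.view(n_heads, d_head, -1)          # per-head slice of the output projection
    per_head_writes = torch.einsum("hd,hdm->hm", z, W_proj)        # each head's vector added to the stream
    return per_head_writes @ w_in                                  # one attribution score per head

scores = head_attributions("while baseball stadiums can come and", attn_layer=LAYER)
print(scores.topk(3))   # heads pointing most strongly in the direction the neuron "listens" to
```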
Step 4 & 5: Automating Explanation with GPT-4
We now have a list of prompts where a specific Attention Head activates a specific Next-Token Neuron. But why?
This is where GPT-4 comes in as an analyst. The researchers provide GPT-4 with two lists:
- Active Examples: Prompts where the head fired strongly.
- Inactive Examples: Prompts where the head did not fire.
GPT-4 is asked to generate a natural language explanation of the pattern.

Figure 7 (above) offers a detailed walkthrough of this specific part of the process using an “as” neuron.
- Observations: The head fires on “as early as” and “as far back as”.
- GPT-4 Explanation: “Head 519 is active when the prompt contains a phrase that refers to an approximate range.”
- Validation: The researchers then test this explanation. They give GPT-4 new prompts and ask, “Based on your explanation, will the head fire here?” If GPT-4 guesses correctly, the explanation is deemed accurate.
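A minimal sketch of this validation step. Here `ask_llm` stands in for whatever chat-completion call you use to query GPT-4; the lambda at the bottom is a placeholder stub, not a real model.

```python
def validate_explanation(explanation, test_prompts, true_labels, ask_llm):
    """Ask the explainer model whether the head should fire on each held-out prompt."""
    predictions = []
    for prompt in test_prompts:
        question = (
            f"Explanation of the attention head: {explanation}\n"
            f"Prompt: {prompt}\n"
            "Based only on the explanation, will the head be active here? Answer YES or NO."
        )
        reply = ask_llm(question)                              # e.g. a GPT-4 chat-completion call
        predictions.append(reply.strip().upper().startswith("YES"))
    return list(zip(test_prompts, predictions, true_labels))

results = validate_explanation(
    explanation="Active when the prompt contains a phrase referring to an approximate range.",
    test_prompts=["as early as the 1920s", "fruits such as apples"],
    true_labels=[True, False],
    ask_llm=lambda q: "YES",                                   # placeholder: wire in a real GPT-4 call here
)
print(results)
```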
Results: Do Attention Heads Understand Context?
The results confirm that this interaction is real and interpretable. The researchers found that many attention heads have highly specific jobs: they recognize a linguistic context and signal the MLP to predict the appropriate word.
Qualitative Success: Meaningful Patterns
The automated explanations revealed fascinating linguistic roles for attention heads.

In Table 1, we see a head connected to the “as” neuron. It doesn’t just fire every time “as” appears.
- Active: “as early as”, “as high as” (Comparisons/Ranges).
- Inactive: “such as”, “so long as”.
This head essentially acts as a “Comparative Grammar Detector.”
Table 2 provides even more examples.

Look at the “number” token. The model has different heads for different contexts of the same word:
- Head (31, 364, 519): Activates for “number” in the context of ranking (e.g., “number one”).
- Head (31, 364, 548): Activates for “number” in the context of identification (e.g., “phone number”).
This proves that the Attention heads are performing Context Disambiguation. They look at the sentence, realize which definition of “number” is relevant, and activate the “number” neuron with that specific context in mind.
Quantitative Success: The Explanation Score
To measure how well GPT-4 could explain these heads, the authors defined the Head Explanation Score (\(\mathcal{E}\)).
\[ \mathcal{E} = \frac{TP + TN}{TP + TN + FP + FN} \]
This score is essentially the accuracy of GPT-4’s predictions on new data (True Positives and True Negatives). A score of 0.5 is random guessing; 1.0 is perfect understanding.
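In code, the score is simply the fraction of held-out prompts on which the explanation-based prediction matches the head's real behavior; the numbers below are toy values chosen only to illustrate the arithmetic.

```python
def head_explanation_score(predictions, true_labels):
    """Fraction of test prompts where the prediction matches reality (true positives plus true negatives)."""
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)

# 7 correct "active" calls + 8 correct "inactive" calls out of 20 test prompts -> 0.75
print(head_explanation_score([True] * 10 + [False] * 10,
                             [True] * 7 + [False] * 3 + [True] * 2 + [False] * 8))
```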

Figure 5 shows the distribution of these scores across different models (GPT-2 and Pythia).
- The blue bars represent the real attention heads. Notice the “rightward skew,” meaning many heads have high scores (0.6, 0.7, 0.8).
- The red bars (in graph d) represent random neurons. These center around 0.5.
The statistical difference (shown by the p-values) confirms that the relationship between these Attention heads and Next-Token neurons is not random noise—it is a structured, explainable mechanism.
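For intuition, here is how such a p-value could be computed; the score lists below are made-up placeholders, and the choice of a Mann-Whitney U test is an assumption for illustration, not necessarily the authors' exact test.

```python
from scipy.stats import mannwhitneyu

real_head_scores = [0.80, 0.75, 0.70, 0.65, 0.85, 0.60, 0.72, 0.78]        # placeholder values
random_baseline_scores = [0.50, 0.48, 0.55, 0.45, 0.52, 0.49, 0.51, 0.47]  # placeholder values

stat, p_value = mannwhitneyu(real_head_scores, random_baseline_scores, alternative="greater")
print(f"U = {stat:.1f}, p = {p_value:.4f}")   # a small p suggests the skew toward high scores is not chance
```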
Verification by Ablation
Finally, science demands verification. If these attention heads are truly responsible for predicting these tokens, then “switching them off” (ablation) should hurt the model’s performance.
The researchers took the prompts where a head was active and manually zeroed out that head’s output during the forward pass.
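A sketch of zero-ablation on GPT-2 (layer, head, and prompt are illustrative placeholders): a forward pre-hook on the attention output projection wipes one head's slice of the pre-projection activations, and we compare the target token's probability with and without that intervention.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2-large")
LAYER, HEAD = 33, 7                                   # illustrative layer/head, not from the paper
d_head = model.config.n_embd // model.config.n_head

def next_token_prob(text: str, target: str, ablate: bool) -> float:
    """Probability of `target` as the next token, optionally zeroing one head's output."""
    handles = []
    if ablate:
        def zero_head(mod, inp):
            z = inp[0].clone()
            z[..., HEAD * d_head:(HEAD + 1) * d_head] = 0.0   # wipe this head's slice
            return (z,)
        handles.append(model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(zero_head))
    with torch.no_grad():
        logits = model(**tok(text, return_tensors="pt")).logits[0, -1]
    for h in handles:
        h.remove()
    target_id = tok.encode(target)[0]
    return torch.softmax(logits, dim=-1)[target_id].item()

text, target = "while baseball stadiums can come and", " go"
p_clean = next_token_prob(text, target, ablate=False)
p_ablated = next_token_prob(text, target, ablate=True)
print(f"P({target!r}) clean = {p_clean:.4f}, ablated = {p_ablated:.4f}, change = {p_ablated - p_clean:+.4f}")
```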

Figure 6 shows the result.
- Active (Orange box): When the head was supposed to be active, ablating it caused a significant drop in the probability of the correct token (values below zero).
- Inactive (Blue box): When the head wasn’t active anyway, ablating it did almost nothing (values near zero).
This confirms the causal link: The Attention head is a necessary component for the correct firing of the Next-Token neuron.
Conclusion and Implications
This research paper provides a crucial piece of the puzzle in understanding Large Language Models. It moves us beyond studying components in isolation and highlights the interactions that drive intelligence.
Key Takeaways:
- The Circuit Exists: There is a clear, repeatable mechanism where Attention heads detect context and “pass the baton” to MLP neurons to predict the next word.
- Context Specialization: The same word (like “number”) is triggered by different attention heads depending on the semantic context (ranking vs. identification).
- Automated Interpretability Works: Using GPT-4 to explain the inner workings of smaller models is a viable, scalable strategy.
Why This Matters
For students entering the field, this paper demonstrates that LLMs are not inscrutable magic. They are composed of discoverable circuits. Understanding these circuits is the first step toward “debugging” AI: fixing biases, removing harmful knowledge, or simply making models more efficient.
The next time you see an LLM predict the perfect word to finish a sentence, remember: it wasn’t just a random guess. A specific Attention head likely spotted the context miles away and signaled a specific neuron to get ready to fire.
Algorithm Outline used in the paper (summarizing the five steps described above):
- Identify next-token neurons in the final MLP layers via the congruence score.
- Collect each neuron’s max-activating prompts and truncate them to the minimal trigger context.
- Attribute the neuron’s activation to the attention heads that contributed most.
- Ask GPT-4 to explain, from active and inactive examples, when each head fires.
- Validate the explanation on held-out prompts and confirm the causal link by ablation.