The inner workings of Large Language Models (LLMs) often feel like a black box. We feed a prompt into one end, and a coherent response magically appears at the other. We know the architecture—Transformers, attention heads, feed-forward networks—but understanding exactly how a specific input token influences a specific output prediction remains one of the hardest challenges in AI research.
Traditionally, researchers have tried to reverse-engineer these models using “circuits”—subgraphs of the model responsible for specific tasks. However, finding these circuits is usually a manual, labor-intensive process that requires human intuition to design specific test cases.
In this post, we are diving deep into a paper that proposes a scalable, automated alternative: Information Flow Routes. By viewing the model as a graph and tracing the flow of information from the top down, the researchers provide a method that is not only 100 times faster than current techniques but also capable of explaining any prediction without human pre-configuration. We will explore how this method works, the mathematics behind it, and the fascinating insights it reveals about models like Llama 2.
The Problem with Patching
Before understanding the solution, we must understand the status quo. The dominant method for interpretability in recent years has been activation patching (also known as causal mediation analysis).
Imagine you want to know if a specific neuron is important for predicting the word “Paris” after the prompt “The capital of France is…”. In activation patching, you would:
- Run the model with the original prompt.
- Run the model again with a “corrupted” prompt (e.g., “The capital of Italy is…”).
- Swap (patch) the activation of that specific neuron from the corrupted run into the original run.
- See if the model fails to predict “Paris”.
If the prediction breaks, that neuron is part of the circuit (a minimal code sketch of this procedure follows the list below). While effective, this approach has significant downsides:
- It is slow: You have to run two forward passes and perform interventions for potentially every component in the network.
- It requires human design: You must invent “contrastive templates” (France vs. Italy) to isolate the specific behavior you want to study.
- It is fragile: The results can change drastically depending on which contrastive template you choose.
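To make the procedure concrete, here is a minimal sketch of activation patching using PyTorch forward hooks. The `model`, `component`, and token tensors are hypothetical stand-ins rather than any particular library's API; the point is only the clean-run / corrupted-run / patch pattern described above.

```python
import torch

def run_with_cache(model, tokens, component):
    """Forward pass that records the output activation of one component."""
    cache = {}
    def hook(module, inputs, output):
        cache["act"] = output.detach()
    handle = component.register_forward_hook(hook)
    logits = model(tokens)
    handle.remove()
    return logits, cache["act"]

def run_with_patch(model, tokens, component, patched_act):
    """Forward pass that overwrites the component's output with a cached activation."""
    def hook(module, inputs, output):
        return patched_act  # returning a value from a forward hook replaces the output
    handle = component.register_forward_hook(hook)
    logits = model(tokens)
    handle.remove()
    return logits

# Usage pattern (hypothetical objects):
#   clean_logits, _  = run_with_cache(model, clean_tokens, component)
#   _, corrupt_act   = run_with_cache(model, corrupt_tokens, component)
#   patched_logits   = run_with_patch(model, clean_tokens, component, corrupt_act)
# If the probability of "Paris" drops in patched_logits, the component matters for this prediction.
```

Note that this has to be repeated for every component and every contrastive template of interest, which is exactly where the cost and fragility come from.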
The researchers behind “Information Flow Routes” argue that we need a method that discovers the relevant circuit automatically, for any input, in a single pass.
The Concept: The Model as a Graph
To automate interpretability, we first need to visualize the Transformer not as a stack of layers, but as a directed graph.
In this graph:
- Nodes represent token representations at different stages (e.g., a token after an attention layer, or after a Feed-Forward Network layer).
- Edges represent the operations inside the model (computation) that move information from one node to another.

As shown in Figure 2, the graph can get complicated. At every layer, information flows from the previous layer’s residual stream into Attention blocks and Feed-Forward (FFN) blocks.
- Green lines represent attention mechanisms, where information moves between different token positions (e.g., the word “Mary” attending to the word “John”).
- Purple lines represent the internal processing of a token within a Feed-Forward Network.
- Gray lines represent the residual connection—the “highway” that allows information to bypass layers unchanged.
During a standard forward pass, all these edges are active. However, for a specific prediction (like predicting the next word in a sentence), only a tiny fraction of these computations actually matter. This small, relevant subset is what the authors call the Information Flow Route.
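As a mental model (not the paper's implementation), the graph can be represented with simple node and edge records; the field names below are illustrative choices:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    layer: int        # which Transformer layer this representation lives in
    position: int     # which token position it belongs to
    kind: str         # "residual", "attn", or "ffn"

@dataclass
class Edge:
    src: Node
    dst: Node
    kind: str             # "attention", "ffn", or "residual"
    weight: float = 0.0   # importance score, filled in during route extraction

# Example: attention in layer 3 moving information from position 1 ("John")
# into the representation at position 7.
edge = Edge(src=Node(2, 1, "residual"), dst=Node(3, 7, "attn"), kind="attention")
```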
The Core Method: Extracting Routes via Attribution
The goal is to extract the important subgraph from the massive full graph. Instead of “patching” (intervening and breaking things), this paper uses attribution. They trace the signal backwards from the final prediction to identify which components contributed to the result.
This approach is roughly 100 times faster than patching because it only requires a single forward pass and a calculation of contributions, rather than thousands of experimental interventions.
Step 1: Defining Edge Importance
How do we mathematically decide if an edge is “important”? The authors adopt a method called ALTI (Aggregation of Layer-Wise Token-to-Token Interactions).
The intuition is geometric. In a Transformer, a node’s value is usually a sum of vectors (thanks to the residual connections). If we have a resulting vector \(y\) composed of several input vectors \(z\), the importance of a specific input \(z_j\) is determined by how “close” it is to the final sum \(y\).
The paper defines the proximity of a contribution \(z_j\) to the result \(y\) using the following equation:
\[ \operatorname{proximity}(z_j, y) = \max\bigl(-\lVert z_j - y \rVert_1 + \lVert y \rVert_1,\ 0\bigr). \]
Essentially, this measures the distance between the contribution and the total sum. If the distance is small, the contribution \(z_j\) is very similar to the result \(y\), meaning it dominates the information content.
This proximity is then normalized to get a percentage-based importance score:
\[ \operatorname{importance}(z_j, y) = \frac{\operatorname{proximity}(z_j, y)}{\sum_k \operatorname{proximity}(z_k, y)}. \]
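A small NumPy sketch of these two formulas (the variable names are mine, not the paper's):

```python
import numpy as np

def proximity(z_j, y):
    """max(||y||_1 - ||z_j - y||_1, 0): large when the contribution z_j is close to the sum y."""
    return max(np.abs(y).sum() - np.abs(z_j - y).sum(), 0.0)

def importance(contributions, y):
    """Normalize proximities over all contributions z_k so the scores sum to 1."""
    prox = np.array([proximity(z_k, y) for z_k in contributions])
    total = prox.sum()
    return prox / total if total > 0 else np.zeros_like(prox)

# Toy example: three vectors summing to y; the dominant one gets the highest score.
contributions = [np.array([1.0, 0.0]), np.array([0.1, 0.1]), np.array([0.0, 0.2])]
y = sum(contributions)
print(importance(contributions, y))  # roughly [0.71, 0.14, 0.14]
```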
Step 2: Decomposing Attention
Calculating importance for a Feed-Forward Network (FFN) is straightforward because it processes tokens independently. However, Attention blocks are complex: they mix information from all previous tokens. To build a precise graph, the researchers decompose the attention block into smaller “sub-edges.”
Typically, an attention head’s output is a sum of weighted values from all previous tokens. The authors break this down so we can see the exact contribution of one specific token passing through one specific head.

As visualized in Figure 4, the update from an attention head is decomposed into per-input terms.
- \(\alpha_{pos,j}^h\) represents the attention weight (how much focus is placed on token \(j\)).
- \(f^h(x_j)\) represents the value transformation of that token.
By multiplying these, we get an independent channel (a “sub-edge”) for every pair of tokens. This allows the algorithm to determine, for example, that “Head 5 in Layer 3 moved information specifically from ‘Mary’ to ‘John’.”
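Schematically, a head's update at position \(pos\) is \(\sum_j \alpha^h_{pos,j}\, f^h(x_j)\), where \(f^h\) applies the head's value and output projections. Below is a rough NumPy sketch of that decomposition; the weight matrices, shapes, and function names are generic assumptions rather than the authors' code:

```python
import numpy as np

def head_sub_edges(alpha, X, W_V, W_O):
    """Split one attention head's output into per-source-token terms.
    alpha: (seq, seq) attention weights; X: (seq, d_model) layer inputs;
    W_V: (d_model, d_head) value projection; W_O: (d_head, d_model) output projection."""
    f = X @ W_V @ W_O                               # f^h(x_j) for every source token j: (seq, d_model)
    sub_edges = alpha[:, :, None] * f[None, :, :]   # sub_edges[pos, j] = alpha[pos, j] * f^h(x_j)
    head_output = sub_edges.sum(axis=1)             # summing over j recovers the usual head output
    return sub_edges, head_output
```

Each `sub_edges[pos, j]` term is one candidate edge whose importance can be scored with the formulas from Step 1.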
Step 3: The Top-Down Extraction Algorithm
Once we can calculate the importance of every edge, constructing the “Information Flow Route” becomes a search problem. The authors propose a greedy, top-down algorithm.

Here is how the algorithm (shown in Figure 3) works in plain English:
- Start at the end: Begin with the final node (the representation used to predict the next token).
- Look down: Examine all immediate incoming edges (connections to the previous layer, attention heads, FFNs).
- Filter: Calculate the importance of each edge using the equations above. If an edge’s importance is below a certain threshold (\(\tau\)), discard it.
- Expand: For every edge that “survived” the filter, add the node it came from to the list of active nodes.
- Repeat: Move to those new nodes and repeat the process until you reach the input embeddings.
The result is a sparse subgraph that shows exactly how the model computed the output.
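In code terms, this is a breadth-first walk backwards through the graph. The sketch below assumes a hypothetical `incoming_edges(node)` helper that yields `(source, importance)` pairs computed with the score defined earlier; it is an illustration of the procedure above, not the authors' implementation.

```python
from collections import deque

def extract_route(incoming_edges, final_node, tau=0.01):
    """Greedy top-down extraction of an Information Flow Route (sketch)."""
    route = []                      # kept edges: (source, target, importance)
    active = {final_node}
    frontier = deque([final_node])
    while frontier:
        node = frontier.popleft()
        for source, weight in incoming_edges(node):
            if weight < tau:        # filter: drop edges below the importance threshold
                continue
            route.append((source, node, weight))
            if source not in active:    # expand: newly reached nodes are traced further down
                active.add(source)
                frontier.append(source)
    return route                    # a sparse subgraph reaching back to the input embeddings
```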
Validation: Does it Work?
To prove this method works, the researchers tested it on tasks that have already been heavily studied using the slower manual methods. One famous example is the Indirect Object Identification (IOI) task.
The task:
- Sentence: “When Mary and John went to the store, John gave a drink to…”
- Target prediction: “Mary”
Previous research (using activation patching) identified specific “Name Mover Heads” and “S-Inhibition Heads” responsible for this logic.
When the authors ran their automatic Information Flow extraction on this task, they generated the following activation maps:

Let’s interpret Figure 7:
- Left Plot (Original Task): This shows the components the algorithm found important for predicting “Mary”. The teal dots (“Name mover heads”) and green dots (“Induction heads”) align perfectly with previous manual discoveries.
- Right Plot (Difference): This shows the difference between the main task and a contrastive baseline. It highlights the task-specific machinery.
Crucially, the Information Flow method found extra components that patching missed, such as “Previous token heads” (yellow dots). Patching missed them because they are generically useful for all predictions, so they cancelled out in the contrastive test. However, for the model to actually work, those heads are vital. This suggests that Information Flow Routes provide a more complete picture of the model’s computation.
Scaling Up: Insights from Llama 2
Because this method is fast and automated, the researchers applied it to Llama 2 (7B), a model far too large for easy manual circuit analysis. They analyzed thousands of sentences to uncover general principles of how the model “thinks.”
1. The Grammar of Information Flow (POS Clustering)
One of the most striking findings is that the model’s internal routing depends heavily on the Part of Speech (POS) of the input tokens.
The researchers took the importance vectors of the model components and visualized them using t-SNE (a dimensionality reduction technique).
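For readers who want to reproduce this kind of plot, a minimal recipe with scikit-learn might look like the following; the importance vectors and POS labels here are random placeholders standing in for the data the authors collected.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in data: per-token importance vectors (n_tokens x n_components) and POS labels.
rng = np.random.default_rng(0)
importance_vectors = rng.random((300, 64))
pos_tags = rng.choice(["NOUN", "VERB", "DET", "ADP"], size=300)

embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(importance_vectors)

for tag in np.unique(pos_tags):
    mask = pos_tags == tag
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=5, label=tag)
plt.legend()
plt.title("t-SNE of per-token component importances (placeholder data)")
plt.show()
```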

Look at Figure 8(a). The distinct clusters correspond to different parts of speech.
- Function words (determiners, conjunctions, prepositions) form tight, distinct clusters (Black, Red, Orange dots). This suggests the model processes grammatical glue words using very specific, consistent pathways.
- Content words (nouns, verbs) are more scattered and mixed. Processing a noun requires context-specific reasoning that varies widely depending on the word’s meaning, not just its grammatical role.
Figure 8(c) is also fascinating. It separates tokens based on whether they are the first subword of a word or a later subword. The model clearly distinguishes between starting a new word and continuing one, dedicating specific machinery to merging subwords into whole words.
2. Universal Head Functions
The analysis identified two types of attention heads that appear to be universally important across almost all predictions.

In Figure 9, we see the activation frequency of heads across layers:
- Previous Token Heads (Yellow): These heads simply move information from the immediately preceding token to the current one. They are overwhelmingly important. This makes sense; language is sequential, and the immediate context is usually the most critical.
- Subword Merging Heads (Red): These heads are active in the lower layers (early in the network). Their job is to aggregate information from split subwords (e.g., “to”, “ken”, “i”, “zation”) into a unified representation.
This confirms that LLMs perform a “cleanup” and “aggregation” phase in early layers before moving on to higher-level reasoning.
3. Domain Specialization
Do LLMs use the same “brain” for coding as they do for writing poetry? The study suggests the answer is no.
The researchers compared the importance of heads across different datasets: standard text (C4), Code, Multilingual text, and Arithmetic.

Figure 11 reveals distinct specialization:
- Code (Red circles): Specific heads light up for code that are dormant for general text.
- Non-English (Blue circles): Processing foreign languages engages a different set of heads.
- Arithmetic (Yellow/Cyan): Math tasks trigger yet another distinct subset of components.
This implies that Llama 2 is not a monolithic generic processor. It is more like a Swiss Army knife, activating different specialized modules depending on the domain of the text.
4. The “Period as BOS” Anomaly
A quirky, unexpected finding in Llama 2 is the behavior of the first period in a paragraph.
In many examples, the researchers found that information flowed heavily from the very first period (.) of the text to the end of the sentence. The model seems to treat the first period as a pseudo beginning-of-sequence (BOS) token.

In Figure 10 (left), you can see the attention map. Even though the sentence continues, the model keeps attending back to that first period. This might be a “sink” where the model stores baseline information or resets its state. This is the kind of idiosyncratic behavior that is hard to hypothesize in advance but jumps out immediately when using automated route discovery.
Conclusion
The “Information Flow Routes” paper represents a significant step forward in the field of mechanistic interpretability. By moving away from manual, hypothesis-driven patching toward automatic, data-driven attribution, the researchers have given us a flashlight to illuminate the black box of LLMs at scale.
The method confirms that while models use specialized circuits for specific tasks (like IOI or arithmetic), they also rely on massive, general-purpose highways for subword merging and sequential processing. It shows us that LLMs are modular, domain-sensitive, and occasionally reliant on strange artifacts like “load-bearing periods.”
For students and practitioners, this work opens the door to debugging models more effectively. Instead of guessing why a model failed, we might soon simply “trace the route” and point to the exact broken link in the chain.