Introduction

We have become accustomed to the “magic” of Large Language Models (LLMs). You type a prompt—whether it’s a request to translate a sentence, summarize a paragraph, or analyze the sentiment of a review—and the model obliges. But beneath the surface, what is actually happening inside the neural network?

While we know how to train these models—feeding them massive datasets and refining them with instruction tuning—we often treat the resulting model as a “black box.” We know the input and the output, but the internal mechanics of where and when the model decides to perform a specific task remain largely a mystery. Does the model know it is performing a translation in the very first layer? Or does that realization happen right before the final output?

A fascinating paper titled “Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models” by researchers from the University of Edinburgh and Nvidia attempts to map this internal geography. By analyzing over 60 different Natural Language Processing (NLP) tasks, the authors provide a functional map of an LLM’s layers, revealing exactly where general knowledge transforms into task-specific action.

Background: The Anatomy of Instruction Tuning

To understand this research, we must first distinguish between a pre-trained LLM and an instruction-tuned LLM.

  1. Pre-trained LLMs (like the base Llama 2 model) are trained on vast amounts of text to simply predict the next token. They understand language patterns but aren’t necessarily good at following specific commands.
  2. Instruction-tuned LLMs (like Llama 2-SFT) undergo a second phase of training. They are fed pairs of instructions and answers across many different tasks (e.g., “Translate this to French: [Input]” -> “[Output]”). This teaches the model to act as a multi-task assistant.
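
To make the second point concrete, here is a rough sketch of what a single instruction-tuning example might look like in code. The field names and texts are illustrative, not the exact Flan schema:

```python
# A minimal sketch of one instruction-tuning (SFT) example.
# Field names are illustrative; the actual Flan 2021 format differs.
sft_example = {
    "instruction": "Translate this to French:",
    "input": "The weather is nice today.",
    "output": "Il fait beau aujourd'hui.",
    "task": "translation_en_fr",
}

# During SFT, the model is shown the concatenated prompt and trained to
# produce the output tokens, across thousands of such pairs per task.
prompt = f"{sft_example['instruction']} {sft_example['input']}"
target = sft_example["output"]
```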

The central question of this paper is: How does instruction tuning change the internal representations of the model?

Does it change the entire brain of the model? Or does it only affect specific regions? To find out, the researchers employed a clever comparative technique.

The Methodology: The “Jack of All Trades” vs. The “Master of One”

Standard methods for analyzing LLMs often involve “probing,” where a small classifier is trained to see if a specific layer contains certain information. However, probing is flawed because it relies on performance metrics that vary wildly between tasks (e.g., you can’t easily compare a BLEU score for translation against an F1 score for classification).

Instead, the authors used a method called MOSSA (Model-Oriented Sub-population and Spectral Analysis) combined with CKA (Centered Kernel Alignment).

Here is the intuition behind their approach: Imagine you have an instruction-tuned model (let’s call it the Experimental Model). It is a “Jack of all trades”—it knows how to do 60+ tasks. Now, imagine you train a separate copy of Llama 2 only on one specific task, like Sentiment Analysis. This acts as a Control Model—a “Master of One.”

If the representations (the numerical patterns) in Layer 15 of the “Jack of all trades” look identical to Layer 15 of the “Master of One,” we can conclude that Layer 15 contains the specific knowledge required for that task.
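
Conceptually, the comparison looks something like the sketch below, written with the Hugging Face Transformers API. The two checkpoint paths are placeholders for the multi-task and single-task fine-tunes; they are not released model ids:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint paths: the multi-task SFT model ("Jack of all
# trades") and a single-task control model ("Master of one"). These are
# assumed local fine-tunes, not released model ids.
EXPERIMENTAL = "path/to/llama2-sft-multitask"
CONTROL = "path/to/llama2-sft-sentiment-only"

tokenizer = AutoTokenizer.from_pretrained(EXPERIMENTAL)
exp_model = AutoModelForCausalLM.from_pretrained(EXPERIMENTAL)
ctl_model = AutoModelForCausalLM.from_pretrained(CONTROL)

prompt = "Classify the sentiment of this review: The food was wonderful."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    exp_out = exp_model(**inputs, output_hidden_states=True)
    ctl_out = ctl_model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: (embeddings, layer 1, ..., layer 32).
# Take the last-token vector at layer 15 from each model; repeating this
# over many prompts builds the activation matrices that CKA compares.
layer = 15
exp_vec = exp_out.hidden_states[layer][0, -1]   # Experimental model
ctl_vec = ctl_out.hidden_states[layer][0, -1]   # Control model
```

The next section shows how to turn stacks of these per-layer vectors into a single similarity score.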

Measuring Similarity with CKA

To mathematically measure how similar two layers are, the researchers used Centered Kernel Alignment (CKA). CKA allows us to compare the hidden states of two different neural networks, even if they aren’t perfectly aligned, by looking at the geometric structure of their activations.

The formula for CKA is calculated as follows:

\[
\mathrm{CKA}(\mathbf{Y}_t, \mathbf{Z}_t) \;=\; \frac{\mathrm{HSIC}(\mathbf{Y}_t, \mathbf{Z}_t)}{\sqrt{\mathrm{HSIC}(\mathbf{Y}_t, \mathbf{Y}_t)\,\mathrm{HSIC}(\mathbf{Z}_t, \mathbf{Z}_t)}}
\]

In this equation:

  • \(\mathbf{Y}_t\) and \(\mathbf{Z}_t\) represent the matrices of activation patterns for the Experimental and Control models respectively.
  • HSIC (Hilbert-Schmidt Independence Criterion) measures the statistical dependence between these distributions.

Essentially, a CKA score of 1.0 means the layers are identical in what they represent, while 0.0 means they are completely unrelated.
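
With a linear kernel, CKA boils down to a few lines of NumPy. Here is a minimal sketch, assuming the two activation matrices were collected over the same inputs; this is the common linear-kernel simplification, not necessarily the paper's exact kernel choice:

```python
import numpy as np

def linear_cka(Y: np.ndarray, Z: np.ndarray) -> float:
    """Linear-kernel CKA between two activation matrices.

    Y: (n_examples, d1) activations from the Experimental model's layer.
    Z: (n_examples, d2) activations from the Control model's layer.
    Rows must correspond to the same inputs in both matrices.
    """
    # Center each feature so HSIC reduces to Frobenius norms.
    Y = Y - Y.mean(axis=0, keepdims=True)
    Z = Z - Z.mean(axis=0, keepdims=True)

    hsic_yz = np.linalg.norm(Z.T @ Y, ord="fro") ** 2
    hsic_yy = np.linalg.norm(Y.T @ Y, ord="fro")
    hsic_zz = np.linalg.norm(Z.T @ Z, ord="fro")
    return hsic_yz / (hsic_yy * hsic_zz)

# Sanity check with synthetic data: identical representations score 1.0,
# unrelated random representations score much lower.
rng = np.random.default_rng(0)
A = rng.normal(size=(2048, 128))
B = rng.normal(size=(2048, 128))
print(linear_cka(A, A))   # 1.0
print(linear_cka(A, B))   # much lower
```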

The Core Discovery: A Three-Stage Rocket

The most significant contribution of this paper is the mapping of LLM layers into three distinct functional groups. By observing where the multi-task model (Llama 2-SFT) aligns with the single-task specialists, the authors discovered that instruction-tuned models process information in a specific sequence.

As illustrated in the diagram below, the processing flow is not uniform. It evolves through stages.

A diagram of the Llama 2 architecture illustrating the three functional sections: Shared Layers (1-9), Transition Layers (10-15), and Refinement Layers (16-32).

1. The Shared Layers (Layers 1–9)

In the bottom third of the network, the model engages in general processing. The representations here are highly similar across different tasks. Whether the model is about to translate a poem or solve a math problem, the first 9 layers look largely the same. This suggests that these layers handle fundamental linguistic features—grammar, syntax, and basic world knowledge—that are “shared” regardless of the intent.

2. The Transition Layers (Layers 10–15)

This is where the magic happens. The researchers identified a critical zone in the middle of the network where the model pivots from general understanding to task-specific execution. In these layers, the representations shift dramatically to align with the “Master of One” control models. This is the “switchboard” of the LLM, where the instruction “Translate this” is actually processed and routed into a specific mode of operation.

3. The Refinement Layers (Layers 16–32)

Once the task is identified and the correct “mode” is engaged, the remaining layers are used to refine the output. These layers continue to hold high similarity to the task-specific control models, polishing the tokens to ensure the answer is correct.
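
Put in code form, the grouping reported for Llama 2's 32 decoder layers (ranges taken from the figure above, counting layers from 1) is simply:

```python
# Layer ranges for Llama 2 (32 layers) as grouped in the paper's figure.
# Layers are counted from 1, matching the section headings above.
STAGES = {
    "shared": range(1, 10),        # layers 1-9: general linguistic processing
    "transition": range(10, 16),   # layers 10-15: task identification and routing
    "refinement": range(16, 33),   # layers 16-32: task-specific polishing
}

def stage_of(layer_idx: int) -> str:
    """Return the functional stage a given Llama 2 layer belongs to."""
    for name, layers in STAGES.items():
        if layer_idx in layers:
            return name
    raise ValueError(f"Expected a layer index between 1 and 32, got {layer_idx}")
```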

Experiments and Results

To validate this three-stage theory, the authors conducted extensive experiments using the Flan 2021 dataset, which covers over 60 NLP tasks. They compared the pre-trained Llama 2 base model against the instruction-tuned Llama 2-SFT.

Pre-trained vs. Instruction-Tuned

The difference between the raw base model and the tuned model is striking. Look at the CKA similarity distributions below:

Box plots comparing CKA similarity distributions across layers for Llama 2 (blue) and Llama 2-SFT (orange).

In the Llama 2 (Blue) boxplots, we see a sharp decline in similarity as we go deeper into the layers. This means the base model “forgets” or loses the specific structure of the task as it processes deeper, which explains why base models often ramble or fail to follow specific instructions.

In contrast, the Llama 2-SFT (Orange) model maintains high similarity to the single-task specialists throughout the middle and deep layers. It effectively “locks on” to the task in the Transition Layers and stays locked on.

Divergence by Task Type

Not all tasks are created equal. Some tasks, like Sentiment Analysis, are relatively simple—they rely mostly on surface-level semantic understanding. Others, like Translation or Summarization, require complex structural manipulation of the text.

The study broke down performance by task clusters:

Line charts showing CKA similarity trends for specific tasks like Coreference resolution, Sentiment analysis, and Translation.

  • Sentiment Analysis: Notice how both the Base Model (Blue) and the SFT Model (Orange) show similarly high CKA scores. This implies that pre-trained models already “know” sentiment quite well without much tuning.
  • Translation & Struct-to-Text: Here, the divergence is massive. The Base Model (Blue) has very low similarity to the specialist, indicating that it struggles with these tasks out of the box. The SFT Model (Orange), however, shoots up in similarity during the Transition Layers (around layers 10–15) and stays high. This suggests that instruction tuning matters most for complex, generative tasks.

Visualizing the “Transition”

To visually prove that the Transition Layers are where tasks are distinguished, the researchers used t-SNE, a technique for visualizing high-dimensional data in 2D. They plotted the activation patterns for different tasks.
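
A rough sketch of this kind of visualization with scikit-learn and matplotlib; the activation matrix and task labels are assumed to have been collected as in the earlier extraction snippet:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_task_clusters(layer_activations: np.ndarray,
                       task_labels: np.ndarray,
                       layer: int) -> None:
    """Project one layer's hidden states to 2D and color points by task.

    layer_activations: (n_prompts, hidden_size) last-token activations.
    task_labels: (n_prompts,) integer task ids used for coloring.
    """
    coords = TSNE(n_components=2, perplexity=30,
                  random_state=0).fit_transform(layer_activations)
    plt.scatter(coords[:, 0], coords[:, 1], c=task_labels, cmap="tab10", s=8)
    plt.title(f"Layer {layer} activations, colored by task")
    plt.show()
```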

t-SNE visualizations showing how task clusters become distinct in the middle layers of the instruction-tuned model.

In the top row (Base Llama 2), the dots (representing different tasks) are somewhat jumbled, even in deeper layers.

In the bottom row (Instruction-Tuned Llama 2-SFT), look at Layer 15 and Layer 20. You can see clear, distinct islands of colors forming. This visualizes the moment the model separates “Translation” from “Reading Comprehension.” The Transition Layers act as a prism, splitting the white light of general language into the constituent colors of specific tasks.

Complexity and Dimensionality

If the Transition Layers are doing the heavy lifting of switching tasks, they should be mathematically more complex. The researchers analyzed the dimensionality required to explain the variance in these layers.
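
One standard way to operationalize this idea (a sketch of the general approach with scikit-learn, not necessarily the paper's exact estimator) is to count how many principal components are needed to explain most of a layer's variance:

```python
import numpy as np
from sklearn.decomposition import PCA

def effective_dims(activations: np.ndarray, variance_threshold: float = 0.99) -> int:
    """Number of principal components needed to reach the variance threshold.

    activations: (n_examples, hidden_size) matrix for a single layer.
    """
    pca = PCA().fit(activations)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Index of the first component at which cumulative variance crosses
    # the threshold, converted to a component count.
    return int(np.searchsorted(cumulative, variance_threshold) + 1)
```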

Graph showing the number of dimensions required to explain representational variance, peaking in the middle layers for the SFT model.

As shown in the graph (Orange line), the complexity skyrockets in the SFT model exactly during the Transition Layers (10-15), reaching a plateau in the Refinement Layers. The model is activating a massive number of features to manage the specific requirements of the task.

Furthermore, they found a correlation with Readability.

Heatmaps showing correlations between CKA similarity and text difficulty metrics like Flesch-Kincaid and Coleman-Liau.

The Transition Layers (10-15) show a strong positive correlation with reading difficulty (Flesch-Kincaid). This suggests that when the input text is more complex, the model relies even more heavily on these task-specific transition layers to decode the instruction and formulate a plan.
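
As a sketch of how such a correlation could be computed, assuming the textstat package for Flesch-Kincaid scores and a per-prompt similarity value (a simplification, since the paper's CKA is computed over populations of examples rather than per example):

```python
import numpy as np
import textstat                    # third-party readability-metrics package
from scipy.stats import pearsonr

def readability_correlation(prompts, similarity_scores):
    """Correlate input difficulty with layer similarity scores.

    prompts: list of input texts.
    similarity_scores: one similarity value per prompt, in the same order.
    """
    difficulty = np.array([textstat.flesch_kincaid_grade(p) for p in prompts])
    r, p_value = pearsonr(difficulty, np.asarray(similarity_scores))
    return r, p_value
```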

Implications for Unseen Tasks

A common criticism of Deep Learning is that models might just be memorizing their training data. Does this “Shared -> Transition -> Refinement” structure hold up for tasks the model has never seen before?

The researchers tested the model on 7 held-out tasks that were not part of the training set (e.g., Math questions, Linguistic acceptability).

Graph showing CKA similarity on unseen tasks. The SFT model eventually overtakes the base model in deeper layers.

The results (Orange line) show that for unseen tasks, the SFT model actually starts with lower similarity in the Shared Layers compared to the base model. This is counter-intuitive but positive: it means the SFT model has learned a more generalized, flexible representation at the bottom.

However, as it moves into the Transition and Refinement layers, the SFT model overtakes the Base model. Even for tasks it hasn’t practiced, the instruction-tuned architecture is better at routing and refining the information than the raw base model.

Conclusion

This research provides a pivotal step toward demystifying Large Language Models. By peeling back the layers of Llama 2, the authors have shown that “thinking” in an LLM is a structured, three-step process:

  1. Shared Layers: Gather general linguistic context.
  2. Transition Layers: Interpret the instruction and switch to the specific task mode.
  3. Refinement Layers: Execute the task and polish the output.

This has massive implications for the future of AI development. If we know that the “Transition Layers” are where the critical adaptation happens, we might be able to develop more efficient fine-tuning methods (Parameter-Efficient Fine-Tuning or PEFT) that only target these specific layers, saving massive amounts of compute. It also suggests that for model compression, we might be able to prune the Shared or Refinement layers more aggressively than the Transition layers.
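
For instance, the Hugging Face peft library already lets LoRA be restricted to a subset of layers. A sketch of what targeting only the Transition Layers of Llama 2 might look like; the hyperparameters are illustrative, not tuned values:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Apply LoRA adapters only to the attention projections of the Transition
# Layers. Note that Hugging Face decoder layers are 0-indexed, so adjust
# the range if you count layers from 1 as the paper does.
config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=list(range(10, 16)),  # illustrative: layers 10-15
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
```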

Instruction tuning doesn’t just teach a model new facts; it fundamentally rewires the middle of the network to become a versatile switchboard, capable of routing any prompt to the correct skill set.