Deconstructing In-Context Learning: The Two-Tower Mechanism Hidden Inside LLMs

Large Language Models (LLMs) like GPT-4 and Llama have displayed a fascinating emergent ability known as In-Context Learning (ICL). This is the phenomenon where you provide a model with a few examples (demonstrations) in the prompt—like “English: Cat, French: Chat”—and the model instantly learns the pattern to complete a new example, all without any parameter updates or retraining.

While we use ICL every day, the underlying mechanism remains somewhat of a “black box.” How exactly does the model move information from the demonstration examples to the final prediction? Does it actually “learn” the task, or is it just relying on pre-existing knowledge?

In a fascinating paper titled “How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning,” researchers Zeping Yu and Sophia Ananiadou from the University of Manchester peel back the layers of the Transformer architecture. They propose a compelling hypothesis: ICL operates through a specific set of attention heads that function like a Two-Tower metric learning system.

In this deep dive, we will walk through their methodology, their discovery of “In-Context Heads,” and how this new understanding explains puzzling behaviors like majority label bias and recency bias.

Introduction: The Mystery of the Mechanism

To understand how a model learns in-context, we first need to isolate the learning process from the model’s prior knowledge. If you ask a model to classify movie reviews as “Positive” or “Negative,” it might just rely on the fact that it knows the word “excellent” is positive. It’s hard to tell if the model is looking at your examples or just using its training data.

To solve this, the researchers focused on Task Learning (TL) using semantically unrelated labels. Instead of “Positive/Negative,” they forced the model to map inputs to arbitrary labels like “foo” and “bar”.

For example:

  • Demonstration 1: “The economy is booming” : bar
  • Demonstration 2: “Stocks are crashing” : bar
  • Demonstration 3: “The team won the game” : foo
  • Query: “The player scored a goal” : [prediction]

If the model predicts “foo,” it isn’t using prior knowledge (because “foo” implies nothing about sports). It must be looking at the context. This setup allowed the authors to mechanistically trace exactly how the model moves information from the demonstration context to the final output.
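For concreteness, here is a minimal way to assemble such a prompt. The “text : label” template is an illustrative assumption; the paper may format its demonstrations slightly differently.

```python
# Build a task-learning prompt with semantically unrelated labels.
demonstrations = [
    ("The economy is booming", "bar"),
    ("Stocks are crashing", "bar"),
    ("The team won the game", "foo"),
]
query = "The player scored a goal"

prompt = "\n".join(f"{text} : {label}" for text, label in demonstrations)
prompt += f"\n{query} :"
print(prompt)  # a model that has learned the pattern should complete this with "foo"
```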

Finding the “In-Context Heads”

Transformers are composed of many layers, and each layer has multiple “attention heads.” A standard 7B parameter model might have 32 layers with 32 heads each—over a thousand heads in total. Do they all contribute equally to In-Context Learning?

The researchers utilized causal tracing and intervention methods—essentially turning specific heads off—to see which ones actually impacted the model’s ability to perform the “foo/bar” task.
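To make “turning a head off” concrete, here is a minimal sketch of one common way to do it with a HuggingFace Llama-style model: a forward pre-hook on the attention output projection blanks the slice belonging to the chosen heads. The layer and head indices are placeholders, not the heads identified in the paper.

```python
# Sketch: ablate selected attention heads by zeroing their slice of the
# concatenated head outputs before o_proj mixes them back together.
import torch

def make_head_ablation_hook(head_indices, head_dim):
    def hook(module, inputs):
        hidden = inputs[0].clone()   # (batch, seq, num_heads * head_dim)
        for h in head_indices:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,) + inputs[1:]
    return hook

# Illustrative usage on an already-loaded model (indices are hypothetical):
# head_dim = model.config.hidden_size // model.config.num_attention_heads
# handle = model.model.layers[20].self_attn.o_proj.register_forward_pre_hook(
#     make_head_ablation_hook([3, 17], head_dim))
# ... run the "foo/bar" evaluation, then: handle.remove()
```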

The 1% Discovery

The results were striking. They found that ICL performance is not distributed across the whole brain of the model. Instead, it relies on a tiny subset of “In-Context Heads”—roughly 1% of all heads (about 12 heads in the models tested).

When they intervened in just these 12 heads:

  • The accuracy dropped from 87.6% to 24.4%.
  • The rest of the heads were largely irrelevant for this specific mechanism.

They further categorized these heads into two groups:

  1. Fooheads: Heads that, when active, specifically increase the probability of the label “foo.”
  2. Barheads: Heads that specifically increase the probability of the label “bar.”

This localization allows us to stop looking at the model as a giant, confusing monolith and focus entirely on what is happening inside these specific heads.

The Core Hypothesis: A Two-Tower System

To understand the authors’ main contribution, we need to look at the anatomy of a single attention head. In a Transformer, an attention head is often described by four matrices: Query (Q), Key (K), Value (V), and Output (O).

The standard explanation is that the Query asks for information, the Key defines what information is available, and the Value is the actual content being passed along.
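For readers who prefer code, here is a deliberately simplified single-head version of that computation (no causal mask, random matrices standing in for the learned weights):

```python
# One attention head in the QKVO framing: X holds the hidden states of a sequence,
# and W_q, W_k, W_v, W_o are the head's four matrices (random, for illustration only).
import torch

seq_len, d_model, d_head = 8, 32, 16
X = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))
W_o = torch.randn(d_head, d_model)

Q, K, V = X @ W_q, X @ W_k, X @ W_v                    # queries, keys, values
attn = torch.softmax(Q @ K.T / d_head ** 0.5, dim=-1)  # who attends to whom
head_output = attn @ V @ W_o                           # content written back to each position
```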

The authors analyzed the mathematical behavior of these matrices within the identified In-Context Heads and proposed a unified hypothesis, illustrated below:

Figure 1: Hypothesis of ICL mechanism. (a) Shallow layers merge features into label positions and last position. In in-context heads, (b) value-output matrix VO extracts label information. (c) Query matrix Q and (d) key matrix K compute the (e) similarity scores between last position and each demonstration, deciding how much label information is transferred into the last token.

Let’s break down this diagram (Figure 1) step-by-step.

1. Shallow Layers: Feature Merging

Before we even get to the In-Context Heads (which are usually in deeper layers), the model’s shallow layers perform a crucial preprocessing step (part (a) of Figure 1).

The model aggregates information. The vector at the position of a label (e.g., the word “bar” in the prompt) gathers semantic information from its corresponding sentence. Simultaneously, the vector at the last position (where the model needs to make a prediction) gathers information about the current input text.

2. The Value-Output (VO) Matrices: The “What”

In the deeper In-Context Heads, the Value and Output matrices (VO) act as information extractors.

The authors analyzed the vectors produced by these matrices and projected them into the vocabulary space (Equation 1): in logit-lens style, the value-output vector is multiplied by the model’s unembedding matrix and the resulting logits are ranked by size.

This projection essentially asks: “If we translate this vector back into English words, what does it say?”
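A rough sketch of that readout in code, assuming access to a loaded model; `unembed` stands for the model’s unembedding (LM-head) weight matrix, and the variable names are illustrative:

```python
# Project a head's value-output vector into vocabulary space and read off
# the tokens it most strongly promotes (a logit-lens-style readout).
import torch

def top_tokens(vec, unembed, tokenizer, k=5):
    logits = unembed @ vec                     # (vocab_size,)
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode([i.item()]), p.item())
            for i, p in zip(top.indices, top.values)]

# Illustrative usage:
# unembed = model.lm_head.weight.detach()      # (vocab_size, hidden_size)
# print(top_tokens(vo_vector_at_label_position, unembed, tokenizer))
```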

Table 1 (below) shows the results of this projection. Look at the value rows. The vectors at the “bar” positions (2-value, 5-value) are shouting “BAR, Bars, Baron.” The vectors at “foo” positions (8-value, 11-value) are shouting “foo, Foo.”

Table 1: Top tokens at label positions and last position.

Key Insight: The Value-Output matrices are “dumb” pipes. Their job is simply to hold the label information (“foo” or “bar”). If attention is paid to a label position, the VO transformation at that position ensures that the concept of “foo” or “bar” is copied to the final prediction.

3. The Query-Key (QK) Matrices: The “Two Towers”

If the VO matrices provide the content, the Query and Key matrices decide where that content flows. This is where the paper offers a novel “Two-Tower” interpretation.

In machine learning, a “Two-Tower” model is often used for recommendation systems. One tower processes the User, the other processes the Item, and you calculate the similarity (dot product) between them to see if they match.

The authors suggest In-Context Learning works the same way:

  • Tower 1 (Query): Represents the Last Position (the new input sentence we want to classify).
  • Tower 2 (Key): Represents the Label Positions in the demonstrations (which contain the semantic features of the example sentences).

The attention mechanism calculates the similarity between the Query (Input) and the Keys (Demonstrations).

  • If the new input sentence is semantically similar to Demonstration #1, the similarity score is high.
  • The model “attends” to Demonstration #1.
  • The value-output vector at Demonstration #1’s label position (which holds the label “foo”) is passed along.
  • “Foo” flows into the final prediction.

This implies that ICL is essentially performing Metric Learning inside the attention heads: it computes a similarity metric between your input and the examples provided, as the toy sketch below illustrates.
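In the sketch, random vectors stand in for the query at the last position, the keys at three label positions, and the value-output vectors carrying the label content; the shapes and values are purely illustrative.

```python
# Two-tower view of an in-context head: similarity between the query (new input)
# and each demonstration's key gates how much of that demonstration's
# value-output vector ("foo"/"bar" content) reaches the prediction.
import torch

d_head = 64
q_last = torch.randn(d_head)                  # tower 1: query at the last position
k_labels = torch.randn(3, d_head)             # tower 2: keys at the 3 label positions
vo_labels = torch.randn(3, d_head)            # value-output vectors holding label info

scores = k_labels @ q_last / d_head ** 0.5    # similarity: input vs. each demonstration
weights = torch.softmax(scores, dim=-1)       # attention: which demonstration matches best
head_out = weights @ vo_labels                # label information moved to the last token
```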

Experimental Evidence: The Shift

To prove this, the researchers flipped the labels. They took a prompt where the correct answer was “foo” and swapped the labels so the answer became “bar.”

If their hypothesis was correct, the “content” (VO) shouldn’t change much, but the “flow” (Attention) should shift dramatically.

Table 4: Logit minus of weighted value-output vectors at “foo”/“bar” positions (fp,bp) in fooheads/barheads (fh,bh) in Llama (first block) and GPT-J (second block).

The data supported this. As shown in Table 4, the prediction shift is driven by a massive reduction in attention scores at “foo” positions and an increase at “bar” positions. The machinery doesn’t “re-read” the text; it simply re-weights the similarity, allowing a different label tower to dominate.

Explaining the Biases of ICL

One of the strongest validations of a new theory is its ability to explain previously confusing phenomena. In-Context Learning is known to suffer from Majority Label Bias (preferring labels that appear frequently) and Recency Bias (preferring labels that appear at the end of the prompt).

The “Two-Tower” hypothesis offers clean explanations for both.

1. Majority Label Bias

Why does the model prefer the majority label?

Because the final output is a sum of attention scores. If you have 10 examples of “foo” and 1 example of “bar,” there are 10 “Key Towers” representing “foo.” Even if the similarity matches are mediocre, the sum of 10 mediocre scores often outweighs the score of the single “bar” example.
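A toy calculation makes the point; the similarity scores below are made up, but the arithmetic is exactly the summation the authors describe.

```python
# Ten mediocre matches for "foo" versus one strong match for "bar":
# after softmax, the summed attention on "foo" positions still dominates.
import torch

scores_foo = torch.full((10,), 0.3)            # 10 "foo" demonstrations, mediocre similarity
scores_bar = torch.tensor([0.9])               # 1 "bar" demonstration, strong similarity
weights = torch.softmax(torch.cat([scores_foo, scores_bar]), dim=-1)

print("total attention on foo:", weights[:10].sum().item())  # ≈ 0.85
print("total attention on bar:", weights[10].item())         # ≈ 0.15
```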

The authors verified this by creating imbalanced datasets.

Figure 2: Attention scores on foo positions in fooheads and bar positions in barheads, on original dataset and imbalanced dataset in Llama (left) and GPT-J (right).

In Figure 2, we see that when “foo” demonstrations are removed (Imbalanced dataset), the total attention weight on “foo” positions drops significantly (compare the blue and orange bars). The model isn’t “biased” in a psychological sense; it’s simply a summation machine accumulating similarity scores.

2. Recency Bias

Why does the model prefer examples at the end of the prompt?

The authors hypothesize this is due to Positional Embeddings. In a Transformer, every token carries information about where it sits in the sequence (1st, 2nd, 100th); in Llama and GPT-J this positional signal is injected into the queries and keys via rotary embeddings.

  • The “Query” is always at the very end (e.g., position 100).
  • The “Keys” for recent examples are at positions 90, 80…
  • The “Keys” for early examples are at positions 10, 20…

Because position numbers are closer, the mathematical similarity between the Query (Pos 100) and a recent Key (Pos 90) is artificially inflated compared to a distant Key (Pos 10). The “position” becomes a feature that the Two-Tower model inadvertently matches on.
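As a very rough abstraction of that argument (Llama and GPT-J actually encode position with rotary embeddings applied to queries and keys, not a scalar feature, so this only illustrates the intuition):

```python
# Toy model of "position leaking into the match": two keys with identical content
# differ only in a hypothetical position-derived feature, so the key closer to the
# query's position always gets the higher similarity score.
import torch

def with_position(content, pos):
    return torch.cat([content, torch.tensor([pos / 100.0])])

content = torch.randn(16)                   # identical semantic content in both demonstrations
q = with_position(torch.randn(16), 100.0)   # query sits at the end of the prompt
k_recent = with_position(content, 90.0)     # late demonstration
k_early = with_position(content, 10.0)      # early demonstration

print((q @ k_recent).item() > (q @ k_early).item())  # True: recency inflates the score
```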

The authors tested this by reversing the order of demonstrations (Figure 3).

Figure 3: Attention scores on foo positions in fooheads and bar positions in barheads, on original dataset and reverse dataset in Llama (left) and GPT-J (right).

When the order is reversed, the attention weights shift significantly, confirming that physical location in the prompt dictates attention strength.

This is further visualized in Figures 4 and 5 below, which show how attention varies across different dataset configurations (Original, Imbalanced, and Recency/Reverse). You can see the distinct shifts in attention distribution based on how the prompt is structured.

Figure 4: Attention scores on “foo”/“bar” positions in original, imbalanced, and recency datasets in Llama.

Figure 5: Attention scores on “foo”/“bar” positions in original, imbalanced, and recency datasets in GPT-J.

Engineering Solutions: Reducing the Bias

Armed with this mechanistic understanding, the authors didn’t just stop at explanation—they proposed fixes.

To fix Majority Label Bias: Since the bias is caused by a lower sum of attention weights for the minority class, the authors proposed mathematically boosting the attention score of the minority positions in the In-Context Heads. They introduced a multiplier based on the ratio of demonstration counts.

To fix Recency Bias: Since the bias is caused by positional embeddings inflating similarity, they proposed stripping the positional information from the attention calculation specifically within the In-Context Heads.
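A minimal sketch of the first fix, assuming access to the post-softmax attention weights of an in-context head. The exact multiplier and where it is applied are assumptions for illustration, not the paper’s implementation; the recency fix is analogous in spirit, computing these heads’ scores without the positional component.

```python
# Rebalance attention inside an in-context head: scale up the weights at
# minority-label positions by the majority/minority demonstration ratio,
# then renormalize so the weights still sum to one.
import torch

def rebalance_attention(weights, minority_positions, majority_count, minority_count):
    weights = weights.clone()
    weights[minority_positions] *= majority_count / minority_count
    return weights / weights.sum()

# Illustrative usage: 10 "foo" demos vs. 1 "bar" demo, whose weight sits at index 10.
weights = torch.tensor([0.08] * 10 + [0.20])
print(rebalance_attention(weights, [10], majority_count=10, minority_count=1))
```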

The results of these interventions were positive:

Table 7: Accuracy change before/after applying our method in Llama (first block) and GPT-J (second block).

As shown in Table 7, applying the method to fix majority bias reduced the accuracy fluctuation significantly (accuracy change dropped by ~22%). Similarly, removing positional influence reduced recency bias by ~17%.

Conclusion

This paper provides a significant step forward in mechanistic interpretability. It moves us away from viewing In-Context Learning as magic and toward viewing it as a structured algorithm running on silicon.

The Key Takeaways:

  1. Specialization: Only a tiny fraction of heads (~1%) are responsible for In-Context Learning.
  2. Role Separation: Within these heads, the Value-Output matrices carry the label (“foo”), while the Query/Key matrices decide which label applies.
  3. Metric Learning: The Query/Key interaction functions as a Two-Tower model, calculating the similarity between the current input and previous examples.
  4. Bias is Mechanical: Biases like recency and majority preference are predictable mathematical artifacts of summation and positional encoding.

By understanding these “In-Context Heads” as two distinct towers for metric learning, we not only understand our models better but can actively engineer them to be more robust, fair, and accurate.