Introduction: The “Coffee in Box Z” Problem
Imagine you are given a logic puzzle:
“The coffee is in Box Z, the stone is in Box M, the map is in Box H. What does Box Z contain?”
For a human, this is trivial. You scan the sentence, find “Box Z,” look at what is associated with it (“coffee”), and give the answer. In cognitive science and linguistics, this process is known as binding. You are binding an entity (Box Z) to an attribute (coffee).
For a language model (LM) like Llama 2 or GPT-4, this process is surprisingly complex. The model processes tokens sequentially. By the time it reaches the question at the end, it must have stored the information “Box Z = coffee” somewhere in its high-dimensional internal state and differentiated it from “Box M = stone.” If the model mixes these up, it hallucinates.
How do LMs physically organize this information? Do they have a specific “file drawer” for the first item and another for the second?
A fascinating research paper, “Representational Analysis of Binding in Language Models,” dives deep into the neural activations of LMs to answer this question. The researchers discovered that LMs solve this problem using geometry. They found a low-rank subspace, a particular direction in the mathematical universe of the model, that encodes the Order of entities. By physically manipulating this subspace, the researchers could reach into the model’s “brain” and force it to swap objects, showing that they had found the neural circuit responsible for tracking order.
In this blog post, we will unpack how they found this “Ordering ID” subspace, visualize what it looks like, and see how they hacked the model’s internal state to control its reasoning.
Background: The Mystery of In-Context Binding
To understand the breakthrough, we first need to understand the difficulty of Entity Tracking.
When an LM reads a sequence like:
- Coffee (Attribute 1) -> Box Z (Entity 1)
- Stone (Attribute 2) -> Box M (Entity 2)
- Map (Attribute 3) -> Box H (Entity 3)
It faces the Binding Problem. It must represent “Coffee” and “Box Z” together, but keep them separate from “Stone.” Previous theories, such as the Binding ID mechanism proposed by Feng and Steinhardt (2023), suggested that LMs assign abstract “tags” (Binding IDs) to pairs to keep them sorted. Think of it like taking a number ticket at a deli: “Coffee” and “Box Z” both hold ticket #1.
However, the question remained: Where is this ticket physically located in the model’s numbers?
The authors of this paper propose that the key is the Ordering ID (OI). This is the sequential index of the entity (e.g., Box Z is the 0th entity, Box M is the 1st, Box H is the 2nd). If we can find where the model stores the number “0”, “1”, or “2”, we have found the mechanism for binding.
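To make the distinction concrete, here is a toy illustration (mine, not the paper’s) of how Ordering IDs attach to the entities in our running example:

```python
# Toy illustration of Ordering IDs (OIs) for the running example.
# The OI is the entity's index in the sequence of bindings, independent
# of which box letter or which attribute it happens to carry.
bindings = [
    ("coffee", "Box Z"),  # OI = 0
    ("stone",  "Box M"),  # OI = 1
    ("map",    "Box H"),  # OI = 2
]

ordering_ids = {entity: oi for oi, (attribute, entity) in enumerate(bindings)}
print(ordering_ids)  # {'Box Z': 0, 'Box M': 1, 'Box H': 2}
```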
Core Method: Hunting for the “OI Subspace”
The researchers hypothesized that within the massive, multi-dimensional cloud of numbers that make up a model’s activation (its internal thought process), there is a specific “direction” that represents Order.
To find it, they used Principal Component Analysis (PCA).
What is PCA in this context?
Imagine the model’s activation for the word “Box Z” is a vector with 4,096 numbers. Some of those numbers encode that it is a noun; some encode which letter labels the box. But if the hypothesis is correct, some combination of those numbers represents “I am the first item in this list.”
If you take the activations for hundreds of different entities at different positions and apply PCA, you are essentially asking the math to “find the direction where these points differ the most.” If the order of the items is a dominant feature, PCA will reveal it as a primary direction (or component).
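As a rough sketch of this kind of analysis (the paper’s exact extraction pipeline differs in detail, and the file names here are placeholders), suppose we have already collected hidden states for entity tokens across many prompts, along with each entity’s position in its list:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed data: hidden states for entity tokens at one layer of Llama2-7B.
# hidden_states has shape (num_entities, hidden_dim); order_labels[i] is the
# i-th entity's position in its prompt (0 for the first box, 1 for the second, ...).
hidden_states = np.load("entity_activations_layer8.npy")  # placeholder path
order_labels = np.load("entity_order_labels.npy")         # placeholder path

# PCA asks: along which directions do these activations vary the most?
pca = PCA(n_components=2)
projections = pca.fit_transform(hidden_states)

# If order is a dominant feature, the leading components should separate
# entities by their Ordering ID rather than by their surface form.
for oi in np.unique(order_labels):
    mean_proj = projections[order_labels == oi].mean(axis=0)
    print(f"OI={oi}: mean projection onto (PC1, PC2) = {mean_proj}")
```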
Visualizing the Hidden Geometry
The researchers applied this method to the Llama2-7B model. They extracted internal states from different layers of the network while it processed lists of objects.
Figure 1: A visualization of the Ordering ID (OI) subspace across different layers of Llama2-7B. Each dot is an entity. The colors represent their position in the text (Blue = 1st, Red = 2nd, etc.).
Look closely at the image above (Figure 1).
- Early Layers (0-7): The colors are jumbled. The model hasn’t “sorted” the information yet.
- Middle Layers (8-15): Look at Layer 8. Suddenly, a beautiful structure emerges. The dots organize themselves into distinct clusters or lines based on their color (order). The blue dots (first item) are separated from the red dots (second item).
- Late Layers: The structure becomes complex again as the model prepares for output.
This visualization confirms that middle layers are where the “feature engineering” happens. The model actively constructs a representation of order to keep track of the entities.
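A minimal sketch of how a layer-by-layer plot like Figure 1 could be produced, assuming per-layer entity activations and order labels like those in the previous snippet (illustrative code, not the authors’):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_oi_structure(layer_activations, order_labels):
    """Project each layer's entity activations onto its top two principal
    components and color the points by Ordering ID.
    layer_activations: list of (num_entities, hidden_dim) arrays, one per layer."""
    fig, axes = plt.subplots(1, len(layer_activations),
                             figsize=(3 * len(layer_activations), 3))
    for layer_idx, (ax, acts) in enumerate(zip(axes, layer_activations)):
        proj = PCA(n_components=2).fit_transform(acts)
        ax.scatter(proj[:, 0], proj[:, 1], c=order_labels, cmap="viridis", s=10)
        ax.set_title(f"Layer {layer_idx}")
    plt.tight_layout()
    plt.show()
```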
Experiments: Hacking the Model’s Brain
Finding a pattern is one thing; proving it does something is another. Correlation does not equal causation. Just because the dots line up by color doesn’t mean the model uses that line to solve the puzzle.
To prove causality, the researchers performed Interventional Experiments. They essentially performed “brain surgery” on the model’s activations.
The Logic of Intervention
If the “Order” is encoded as a direction in space (let’s call it the OI Vector), then mathematically adding more of that vector to an entity should make the model think the entity appears later in the list.
Here is the setup (a code sketch of the intervention follows below):
- Input: “The coffee is in Box Z…”
- Target: The model thinks “Box Z” is at Index 0.
- Hack: We take the activation for “Box Z” and add a vector pointing in the “Ordering Direction.”
- Hypothesis: The model should now believe “Box Z” is actually at Index 1 (where Box M was) and answer that it contains “stone” instead of “coffee.”
Figure 2: The intervention process. By extracting the OI subspace (via PCA) and adding it back into the model (Patching), the researchers aim to shift the model’s output.
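Concretely, this kind of activation patching can be sketched with a forward hook, assuming a HuggingFace-style LlamaForCausalLM; the layer index, the token position of “Box Z,” and the precomputed `oi_direction` vector are illustrative assumptions rather than the authors’ exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

# Assumed inputs: a unit vector spanning the OI direction at the chosen layer,
# the layer to patch, the token position of "Box Z", and a step size.
oi_direction = torch.load("oi_direction_layer8.pt")  # shape: (hidden_dim,)
layer_to_patch, entity_pos, step = 8, 5, 3.0

def add_oi_vector(module, inputs, output):
    # Decoder layers return a tuple; the first element holds the hidden states
    # with shape (batch, seq_len, hidden_dim). Push the entity token along
    # the Ordering ID direction and hand the modified output back.
    hidden = output[0]
    hidden[:, entity_pos, :] += step * oi_direction.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_to_patch].register_forward_hook(add_oi_vector)

prompt = ("The coffee is in Box Z, the stone is in Box M, "
          "the map is in Box H. Box Z contains the")
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits
handle.remove()

print(tokenizer.decode([logits.argmax().item()]))  # ideally " stone" after the push
```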
Did it work?
The results were striking. By “sliding” the activation along the OI direction, they could systematically force the model to output the attributes of the 2nd, 3rd, or 4th items, even though the text prompt hadn’t changed.
Let’s look at the quantitative data:
Figure 3: This graph shows how the model’s confidence changes as we “push” along the Ordering ID direction.
How to read this graph:
- The X-axis is the “Step” (how much we pushed along the OI direction).
- The Y-axis is the “Logit Difference” (a measure of how likely the model is to pick a specific word).
- The Lines: Each colored line represents a different Binding ID (BI). The yellow line (at the bottom) is the original answer (Coffee).
- The Result: As we increase the step (move right on the X-axis), the original answer (yellow) drops. The next item (green line/BI_1) shoots up. Push further, and the third item (BI_2) rises.
This is essentially a radio dial for the model’s attention. By turning the “Order Knob,” the researchers could tune the model to focus on the first item, then the second, then the third.
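For concreteness, the “Logit Difference” on the Y-axis can be computed directly from the model’s next-token logits. A small sketch, meant to be used with the `logits` and `tokenizer` from the patching sketch above (the single-token assumption is mine):

```python
def logit_difference(logits, tokenizer, word_a=" coffee", word_b=" stone"):
    """Difference between the next-token logits of two candidate answers.
    Assumes each word maps to a single token after the leading space; real
    tokenizations may split words and require summing over sub-tokens."""
    id_a = tokenizer(word_a, add_special_tokens=False).input_ids[0]
    id_b = tokenizer(word_b, add_special_tokens=False).input_ids[0]
    return (logits[id_a] - logits[id_b]).item()

# Sweep `step` in the hook above, record logit_difference(logits, tokenizer)
# at each step, and plot the values to reproduce curves with the shape of Figure 3.
```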
Visualizing the “Flip”
We can also look at the probability distribution of the answers.
Figure 4: As the intervention step increases (X-axis), the dominant color in the bar shifts, indicating the model is swapping its answer to the next item in the sequence.
In Figure 4, at Step 0 (no intervention), the bar is mostly yellow (the 0th item). As we move to Step 1, the bar becomes green (the 1st item). At Step 2, it shifts to the next color. This confirms a causal link: The subspace identified by PCA is indeed the mechanism the model uses to track entity order.
Is it just Position IDs?
A skeptic might ask: “Wait, Transformers already have Position IDs (information about which token is 1st, 2nd, 3rd). Are you sure you didn’t just find that?”
This is a crucial distinction. Ordering ID (OI) is about the semantic order of the entities (1st entity, 2nd entity), regardless of how many words are in between them.
To test this, the researchers created a “Filler Word” dataset. They stuffed meaningless text between the entities, like:
“The coffee is… you know… in Box Z, and then the stone is… actually… in Box M.”
This changes the absolute token position (Position ID) but keeps the entity order (Ordering ID) the same.
Figure 5: Spearman’s rank correlation. The blue bars show correlation with Order (OI), while the red bars show correlation with Position (PI).
The results in Figure 5 are definitive. The Principal Component (PC1) correlates almost perfectly with Order (OI, blue bar) and has almost zero correlation with Position (PI, red bar).
This shows the model is doing something smarter than counting tokens: it ignores the fluff (filler words) and maintains a dedicated internal counter for the objects that matter.
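A sketch of how such a check could be run, assuming per-entity projections onto the first principal component together with their Ordering IDs and absolute token positions (the array names and file paths are placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

# Assumed arrays gathered from the filler-word dataset:
#   pc1_projection[i] -- entity i's activation projected onto PC1
#   ordering_ids[i]   -- entity i's semantic order in its prompt (0, 1, 2, ...)
#   position_ids[i]   -- entity i's absolute token position in the prompt
pc1_projection = np.load("pc1_projection.npy")  # placeholder paths
ordering_ids = np.load("ordering_ids.npy")
position_ids = np.load("position_ids.npy")

rho_oi, _ = spearmanr(pc1_projection, ordering_ids)
rho_pi, _ = spearmanr(pc1_projection, position_ids)

# If the subspace tracks semantic order rather than raw token position,
# |rho_oi| should be close to 1 while |rho_pi| stays near 0.
print(f"Spearman rho with Ordering ID: {rho_oi:.2f}")
print(f"Spearman rho with Position ID: {rho_pi:.2f}")
```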
Does this apply to other models?
Is this just a quirk of Llama 2? The researchers tested their method on other families of models, including Llama 3 and Float-7B (a code fine-tuned model).
Figure 6: The same geometric structure appears in Llama 3 and Float-7B. Note the clear separation of colors in the middle layers.
As seen above, the “emergence of order” in the middle layers is a consistent phenomenon across different modern LLMs. Whether it’s Llama 3 or a code-specialized model, they all seem to “learn” to organize data geometrically in this specific way to solve binding tasks. Interestingly, the code fine-tuned model (Float-7B) showed even sharper sensitivity to this subspace, perhaps because coding requires extremely precise variable tracking.
Conclusion
The “Binding Problem” has long been a theoretical puzzle in neural networks. How do you tie “Attribute A” to “Entity A” without getting it mixed up with “Entity B”?
This research provides a concrete, physical answer. LMs create an Ordering Subspace—a low-dimensional geometric structure within their middle layers. They use this subspace to tag entities with their sequential order.
The implications of this are significant for AI interpretability:
- Transparency: We can now literally “see” the model sorting a list by looking at a PCA plot of Layer 8.
- Control: We can intervene. If a model is confused about which box holds the map, we theoretically know which “knob” to turn in its activation space to fix the reference.
- Universality: This mechanism appears to be a fundamental property of how Transformers learn to reason about sequences.
By mapping the geometry of thought, we step closer to understanding not just what LMs output, but how they actually think.