If you have been following the explosion of Large Language Models (LLMs) like GPT and Llama, you are likely familiar with the “Forward Pass.” It is the process where the model takes a prompt, processes it through layers of math, and spits out a prediction. We have gotten quite good at analyzing this phase. Tools like the “Logit Lens” allow us to peek into the middle of the model and see what it is “thinking” at layer 12 vs. layer 24.
But there is a darker, less understood side of Deep Learning: the Backward Pass.
This is how models learn. During training or fine-tuning, the model calculates “gradients”—instructions on how to change its internal numbers to get a better answer. Until now, these gradients have been treated mostly as giant, uninterpretable matrices of noise.
In a fascinating new paper, Backward Lens: Projecting Language Model Gradients into the Vocabulary Space, researchers from Technion and Tel Aviv University have cracked this black box open. They discovered that you can translate these abstract mathematical updates into plain English.
In this post, we will explore how they did it, what “learning” actually looks like inside a Transformer, and how this knowledge leads to a clever trick for editing AI memories without doing any training at all.
The Mystery of the Backward Pass
To understand why this paper is important, we first need to visualize how we usually interpret these models.
In the forward pass, we project hidden states into the “vocabulary space.” If a vector at Layer 10 points towards the concept of “King,” we know the model is thinking about royalty.
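To make this concrete, here is a minimal Logit-Lens-style sketch using GPT-2 via Hugging Face transformers. The layer index and the `top_tokens` helper are illustrative choices, not part of the paper.

```python
# Minimal Logit-Lens-style sketch: project an intermediate hidden state through
# the model's unembedding matrix to see which vocabulary tokens it points toward.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_tokens(hidden_vec, k=5):
    """Map one hidden-state vector into the vocabulary space and return the top-k tokens."""
    hidden_vec = model.transformer.ln_f(hidden_vec)   # final layer norm, as in the forward pass
    logits = model.lm_head(hidden_vec)                # unembedding projection
    return tokenizer.convert_ids_to_tokens(logits.topk(k).indices.tolist())

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Hidden state of the last token at an intermediate layer (layer 8 of 12 in GPT-2 small).
h = out.hidden_states[8][0, -1]
print(top_tokens(h))   # a glimpse of what the model is "thinking" mid-way through
```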
But in the backward pass, we are dealing with gradients. These are matrices that tell the weights how much to increase or decrease to minimize error. If you look at a raw gradient matrix, it looks like static on a TV screen. It is high-dimensional and unintuitive.
The researchers asked a simple question: Can we apply the “Lens” technique to gradients? Can we project the instructions for updating the model into the vocabulary space to see what the model is trying to learn?

As shown in Figure 1, the answer is yes. When the model is corrected (e.g., told that “Lionel Messi plays for” should result in “Paris” instead of “Barcelona”), the gradients do not just move numbers randomly. They actively “imprint” the context (“team”) and “shift” the prediction toward the new target (“Paris”).
The Anatomy of a Gradient
To understand how the “Backward Lens” works, we have to get a little technical about what a gradient actually is.
In a Transformer, the heavy lifting is done by Multi-Layer Perceptrons (MLPs). An MLP layer usually consists of two large matrices, let’s call them \(FF_1\) (the first layer) and \(FF_2\) (the second layer).
When we calculate the gradient for one of these matrices (let’s say \(W\)) given a sequence of inputs, we use the chain rule of calculus. The paper highlights a crucial mathematical identity: the gradient matrix is actually the product of the input from the forward pass (\(x\)) and the error signal from the backward pass (often called the Vector-Jacobian Product, or VJP, denoted as \(\delta\)).
\[ \frac{\partial L}{\partial W} = x^{\top} \cdot \delta \]

The Low-Rank Discovery
One of the paper’s key theoretical contributions is proving that these massive gradient matrices are low-rank.
If you have a prompt with 10 tokens (words), the gradient matrix might be size \(4096 \times 16384\). That is huge. However, because the gradient is just a sum of the per-token updates, the matrix is built from only 10 pairs of vectors: one rank-one outer product per token, so its rank is at most 10.

Figure 2 illustrates this beautifully. Instead of trying to analyze the giant “Gradient” block on the right, we can analyze the components that build it: the forward inputs (\(x\)) and the backward errors (\(\delta\)).
This decomposition is the secret sauce. It means we don’t need to analyze millions of parameters. We just need to look at the spanning set—the specific \(x\) and \(\delta\) vectors associated with the tokens in our prompt.
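If you want to convince yourself of both claims, here is a tiny toy check, using a plain linear layer as a stand-in for one MLP matrix; the dimensions and the loss are arbitrary choices.

```python
# Toy verification that dL/dW = x^T · delta, and that the gradient is a sum of
# one rank-1 outer product per token (so its rank is at most the prompt length).
import torch

torch.manual_seed(0)
seq_len, d_in, d_out = 10, 64, 256                # a 10-token "prompt"

W = torch.randn(d_in, d_out, requires_grad=True)  # stand-in for FF1 or FF2
x = torch.randn(seq_len, d_in)                    # forward-pass inputs, one row per token
target = torch.randn(seq_len, d_out)

out = x @ W
loss = ((out - target) ** 2).mean()

# delta: the error signal (VJP) at the layer's output.
delta = torch.autograd.grad(loss, out, retain_graph=True)[0]
grad_W = torch.autograd.grad(loss, W)[0]          # what autograd computes for W

# 1) The full gradient is exactly the outer-product form x^T · delta.
print(torch.allclose(grad_W, x.T @ delta, atol=1e-6))            # True

# 2) Equivalently, it is the sum of seq_len rank-1 updates, one per token.
rank_one_sum = sum(torch.outer(x[t], delta[t]) for t in range(seq_len))
print(torch.allclose(grad_W, rank_one_sum, atol=1e-5))           # True
print(torch.linalg.matrix_rank(grad_W).item())                   # at most 10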
The Mechanism: Imprint and Shift
So, what happens when we project these \(x\) and \(\delta\) vectors into the vocabulary space? The researchers discovered a two-phase mechanism for how Transformers store knowledge, which they call “Imprint and Shift.”
This mechanism behaves differently for the two matrices in the MLP layer:
\(FF_1\) (The “Key”): The Imprint. The gradient for the first matrix is largely determined by the input text (\(x\)). It tries to “imprint” the current context into the model’s memory. If you are training on a sentence about Messi, \(FF_1\) updates to recognize the input pattern associated with Messi.
\(FF_2\) (The “Value”): The Shift. The gradient for the second matrix is driven by the backward error (\(\delta\)). This vector effectively contains the embedding of the target word. The update subtracts the old prediction and adds the new target.

Figure 3 visualizes this flow.
- The Forward Pass (top) processes the input.
- The Backward Pass (right) brings the correction (the green “Paris” box).
- The gradient for \(FF_2\) takes that “Paris” concept and pushes the weights to produce it.
- The gradient for \(FF_1\) effectively “stamps” the input pattern so the model knows when to produce “Paris” in the future.
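Here is a toy rendering of that two-matrix picture, with a mean-squared-error loss standing in for the real cross-entropy objective. The names FF1, FF2, delta1, delta2 mirror the text above; everything else (sizes, random data) is illustrative.

```python
# Toy "imprint and shift" sketch for a single MLP block out = GELU(x @ FF1) @ FF2.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_hidden = 64, 256
FF1 = torch.randn(d_model, d_hidden, requires_grad=True)
FF2 = torch.randn(d_hidden, d_model, requires_grad=True)

x = torch.randn(1, d_model)        # input for the last token ("for")
target = torch.randn(1, d_model)   # direction of the new target ("Paris")

h_pre = x @ FF1
h = F.gelu(h_pre)
out = h @ FF2
loss = ((out - target) ** 2).mean()

delta1 = torch.autograd.grad(loss, h_pre, retain_graph=True)[0]  # error at FF1's output
delta2 = torch.autograd.grad(loss, out, retain_graph=True)[0]    # error at FF2's output
grad_FF1 = torch.autograd.grad(loss, FF1, retain_graph=True)[0]
grad_FF2 = torch.autograd.grad(loss, FF2)[0]

# Imprint: every column of grad_FF1 is a multiple of the input x, so the update
# stamps the current context into FF1's "keys".
print(torch.allclose(grad_FF1, x.T @ delta1, atol=1e-6))

# Shift: every row of grad_FF2 is a multiple of delta2, proportional to (out - target);
# gradient descent subtracts it, pushing FF2's "values" toward the target direction.
print(torch.allclose(grad_FF2, h.T @ delta2, atol=1e-6))
```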
Seeing It in Action: The “Paris” Experiment
Let’s look at real data. The researchers took a GPT-2 model that outputs “Barcelona” for the prompt “Lionel Messi plays for.” They then performed a single training step to force it to output “Paris.”
Using the Backward Lens, they visualized what the gradients were actually encoding.

In Figure 11 (from the paper’s appendix), we see a heat map of the projected gradients:
- Columns represent the tokens in the prompt (“Lionel”, “Messi”, “plays”, “for”).
- Rows represent the layers of the model (from 0 to 46).
- Text inside the cells represents the word in the vocabulary that the gradient vector is most similar to.
Look at the right side of the figure (for \(FF_2\)). As we get to the higher layers (top rows), the gradients for the token “for” are overwhelmingly pointing to “Paris”.
The Backward Lens reveals that the gradient is essentially screaming: “Take the weights associated with the word ‘for’ and shove them towards the concept of ‘Paris’!”
This confirms that backpropagation isn’t doing abstract, incomprehensible math. It is performing semantic arithmetic: Current Weights + (Paris Embedding) - (Barcelona Embedding).
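You can approximate this kind of reading yourself. The sketch below is a rough approximation, not the paper’s exact procedure: it forces GPT-2 toward “Paris” with one corrective loss, captures the error signal (VJP) entering one layer’s \(FF_2\), and projects it through the unembedding matrix. The layer index and hook details are our own choices.

```python
# Sketch of a Backward-Lens-style projection: capture the VJP (delta) entering one
# MLP's second matrix during a single corrective step, then read it in vocab space.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 10                      # an arbitrary mid/late layer of GPT-2 small
captured = {}

def grab_delta(module, grad_input, grad_output):
    # grad_output[0] = dL/d(output of FF2), shape (batch, seq, d_model)
    captured["delta"] = grad_output[0].detach()

hook = model.transformer.h[LAYER].mlp.c_proj.register_full_backward_hook(grab_delta)

prompt = tokenizer("Lionel Messi plays for", return_tensors="pt")
target_id = tokenizer.encode(" Paris")[0]

logits = model(**prompt).logits
# Loss that demands "Paris" as the next token after "for".
loss = torch.nn.functional.cross_entropy(logits[0, -1:], torch.tensor([target_id]))
loss.backward()
hook.remove()

# Project the last token's delta into the vocabulary space. We negate it because
# the weight update moves against the gradient.
delta_last = -captured["delta"][0, -1]
scores = model.lm_head(delta_last)
print(tokenizer.convert_ids_to_tokens(scores.topk(5).indices.tolist()))
```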
Where does the editing happen?
The researchers also analyzed where the model decides to store this new information. Does it update every layer equally?

Figure 16 shows the “intensity” (norm) of the updates across different layers (Y-axis) and token positions (X-axis).
- The Last Subject Token (“Messi”) sees a massive spike in updates in the early layers.
- The Last Token of the prompt (“for”) sees a spike in the middle/late layers.
This suggests a clear division of labor: the model processes the subject early on and determines the relation/output later.
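Because each token’s contribution to the \(FF_2\) gradient is a rank-one outer product \(x_t \delta_t^{\top}\), its Frobenius norm is just \(\lVert x_t \rVert \cdot \lVert \delta_t \rVert\), so this kind of map is cheap to reproduce. Here is a rough, self-contained sketch; the hooks and printing format are our own choices.

```python
# Sketch of the "where does the editing happen" analysis: for every layer's FF2,
# measure the size of each token's rank-1 update, ||x_t|| * ||delta_t||.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

xs, deltas = {}, {}

def make_hooks(i):
    def fwd(module, inputs, output):
        xs[i] = inputs[0].detach()            # x: FF2's input, shape (1, seq, d_ff)
    def bwd(module, grad_input, grad_output):
        deltas[i] = grad_output[0].detach()   # delta: error at FF2's output, (1, seq, d_model)
    return fwd, bwd

handles = []
for i, block in enumerate(model.transformer.h):
    fwd, bwd = make_hooks(i)
    handles.append(block.mlp.c_proj.register_forward_hook(fwd))
    handles.append(block.mlp.c_proj.register_full_backward_hook(bwd))

prompt = tokenizer("Lionel Messi plays for", return_tensors="pt")
target_id = tokenizer.encode(" Paris")[0]
logits = model(**prompt).logits
loss = torch.nn.functional.cross_entropy(logits[0, -1:], torch.tensor([target_id]))
loss.backward()
for h in handles:
    h.remove()

tokens = tokenizer.convert_ids_to_tokens(prompt["input_ids"][0].tolist())
print("per-layer, per-token update norms (||x_t|| * ||delta_t||):")
for i in range(len(model.transformer.h)):
    norms = (xs[i][0].norm(dim=-1) * deltas[i][0].norm(dim=-1)).tolist()
    print(f"layer {i:2d}:", {tok: round(n, 3) for tok, n in zip(tokens, norms)})
```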
Application: Editing Without Backprop
Here is the “aha!” moment.
If we know that the backward pass for \(FF_2\) essentially just adds the embedding of the target word (“Paris”) to the weights, why do we need to run the backward pass at all?
Calculating gradients is computationally expensive. It requires storing activations, computing derivatives, and moving massive matrices.
The researchers proposed a new method called “Forward Pass Shift.”
- Run the forward pass to get the input vector \(x\) for the last token.
- Look up the embedding for the target word (“Paris”) in the model’s own vocabulary matrix. Let’s call this \(d\).
- Manually update the weight matrix \(FF_2\) by adding the outer product of \(x\) and \(d\).
That’s it. No derivatives. No calculus. Just one forward pass and one matrix addition.
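A minimal sketch of that recipe on GPT-2 might look like the following; the edited layer, the normalization of \(x\), and the scaling factor are illustrative assumptions rather than the paper’s tuned settings.

```python
# Sketch of a Forward-Pass-Shift-style edit: one forward pass, one rank-1 weight
# update built from FF2's input x and the target word's embedding d. No backprop.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER, SCALE = 10, 5.0          # illustrative choices; real editing methods tune these
captured = {}

def grab_x(module, inputs, output):
    captured["x"] = inputs[0].detach()[0, -1]    # FF2's input for the last token

hook = model.transformer.h[LAYER].mlp.c_proj.register_forward_hook(grab_x)
prompt = tokenizer("Lionel Messi plays for", return_tensors="pt")
with torch.no_grad():
    model(**prompt)                              # step 1: a single forward pass
hook.remove()

target_id = tokenizer.encode(" Paris")[0]
d = model.transformer.wte.weight[target_id]      # step 2: target embedding from the vocab matrix

with torch.no_grad():                            # step 3: rank-1 edit of FF2
    x = captured["x"]
    model.transformer.h[LAYER].mlp.c_proj.weight += SCALE * torch.outer(x / x.norm() ** 2, d)

with torch.no_grad():
    new_logits = model(**prompt).logits[0, -1]
print(tokenizer.decode(new_logits.argmax().item()))   # ideally now " Paris"
```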
Does it work?
They compared this “Forward Pass Shift” against state-of-the-art model editing methods like ROME and MEMIT, which involve complex optimizations.

Table 1 shows the results.
- Efficacy: The Forward Pass Shift achieved 99.4% success in making the model say the target word. This ties with ROME and beats MEND and MEMIT.
- Fluency (N-gram): The model’s ability to generate coherent text remained high (622.45), comparable to the original model.
While it struggled slightly with “Paraphrase” (generalizing the edit to rewording of the prompt), the fact that a simple heuristic derived from interpretability research could match heavy-duty optimization algorithms is stunning. It suggests we might be finding “shortcuts” in how to teach AI.
Conclusion
The “Backward Lens” paper bridges a massive gap in our intuition about Large Language Models. It transforms the backward pass from a mathematical abstraction into a linguistic narrative.
We learned that:
- Gradients are Low-Rank: They are composed of simple input and error vectors.
- Imprint and Shift: Learning is a process of imprinting inputs (\(FF_1\)) and shifting outputs (\(FF_2\)) toward targets.
- Semantic Gradients: When projected, gradients literally look like the words they are trying to teach the model.
- Shortcut Learning: We can mimic backpropagation manually by just injecting target embeddings, saving massive computational resources.
As we strive to make AI safer and more aligned, understanding how models learn is just as important as understanding what they know. This work provides a powerful new lens to watch that learning happen in real-time.