The race for larger context windows in Large Language Models (LLMs) is one of the most exciting developments in AI. We have moved rapidly from models that could barely remember a few paragraphs to systems like GPT-4 and Gemini 1.5 that can process entire novels, codebases, or legal contracts in a single prompt.

However, this capability comes with a massive computational cost, and the bottleneck is often memory: specifically, the Key-Value (KV) cache.

As an LLM generates text, it must store the “Key” and “Value” representations of every previous token to avoid re-computing them. As the context length grows, this cache balloons in size, consuming gigabytes of high-bandwidth memory (HBM). This limits the batch size, slows down generation, and makes deploying long-context models prohibitively expensive for many users.
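To make the scale concrete, here is a back-of-the-envelope sketch assuming a Llama-2-7B-like configuration (32 layers, hidden size 4,096, fp16 weights and cache); the exact numbers shift with the architecture, precision, and any grouped-query attention, so treat them as illustrative.

```python
# Rough KV cache footprint for a Llama-2-7B-style model (illustrative numbers).
layers = 32          # transformer layers
hidden = 4096        # 32 heads x 128 dims per head
bytes_per_value = 2  # fp16

# Both Keys and Values are cached, hence the leading factor of 2.
bytes_per_token = 2 * layers * hidden * bytes_per_value   # 524,288 B ~ 0.5 MiB

for context in (4_096, 32_768, 128_000):
    print(f"{context:>7} tokens -> {context * bytes_per_token / 2**30:5.1f} GiB per sequence")
# 4,096 tokens ~ 2 GiB, 32,768 ~ 16 GiB, 128,000 ~ 62.5 GiB, before any batching.
```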

To solve this, researchers have looked for ways to “compress” this cache—to throw away information we don’t need. But how do you know what to throw away without breaking the model?

In this post, we are diving deep into a fascinating paper titled “A Simple and Effective \(L_{2}\) Norm-Based Strategy for KV Cache Compression.” The authors propose a counter-intuitive but remarkably effective solution: we can identify the most important tokens simply by looking at the magnitude (\(L_2\) norm) of their Key embeddings.

The punchline? The “quieter” the token (lower norm), the more important it is.

The Problem: The KV Cache Bottleneck

Before we get into the solution, let’s establish the context. In decoder-only Transformers (like Llama, GPT, etc.), the attention mechanism computes the relationship between the current token being generated (the Query) and all previous tokens (the Keys and Values).

To generate the next token \(x_{n+1}\), the model needs to attend to \(x_1\) through \(x_n\). Rather than passing \(x_1, \dots, x_n\) through the deep neural network every single time, we calculate their Key and Value vectors once and store them in the KV Cache.
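Here is a minimal sketch of what that caching looks like for a single attention head (PyTorch, simplified to ignore multi-head reshaping and positional encodings; the function and variable names are mine, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def decode_step(x_new, W_q, W_k, W_v, kv_cache):
    """One decoding step: project only the newest token and reuse cached K/V."""
    q = x_new @ W_q                                            # (1, d)
    kv_cache["k"] = torch.cat([kv_cache["k"], x_new @ W_k])    # (n+1, d)
    kv_cache["v"] = torch.cat([kv_cache["v"], x_new @ W_v])    # (n+1, d)

    # Attend over every cached token; nothing earlier is re-projected.
    scores = (q @ kv_cache["k"].T) / kv_cache["k"].shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ kv_cache["v"], kv_cache

d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
for _ in range(5):                                             # generate five tokens
    out, cache = decode_step(torch.randn(1, d), W_q, W_k, W_v, cache)
print(cache["k"].shape)                                        # torch.Size([5, 64])
```

The cache grows by one Key and one Value per generated token, per layer, per head, and that growth is exactly the memory that compression tries to claw back.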

Why Compression is Hard

The standard approach to compressing this cache is eviction. We want to keep the “important” tokens and evict the “unimportant” ones. But what defines importance?

Intuitively, a token is important if the model pays a lot of attention to it. Many existing compression methods (like H2O or Scissorhands) look at the attention scores. If a token typically receives low attention scores, it is evicted.

However, there is a catch. Modern inference engines use FlashAttention, an algorithm that speeds up attention calculation by avoiding the materialization of the full attention matrix in high-bandwidth memory. If your compression algorithm requires you to inspect the attention scores to decide what to delete, you effectively break the optimization that FlashAttention provides. You are forced to compute scores you were trying to avoid.

The authors of this paper ask a critical question: Can we estimate the importance of a KV pair without computing the attention score?

The Surprising Discovery: Norms vs. Attention

The researchers analyzed the internal representations of models like Llama-2-7b. They looked specifically at the Key Embeddings stored in the cache.

They found a strong, consistent correlation between the \(L_2\) norm (the Euclidean magnitude) of a Key vector and the attention score it eventually receives.

Figure 2: Five heads at layer 9 of Llama2-7b. Attention score (top) and L2 norm (bottom) are highly correlated.

Look closely at Figure 2 above. This visualization compares the attention scores (top row) and the \(L_2\) norms (bottom row) for five different attention heads in Layer 9 of Llama-2.

  • Top Row (Attention): The bright spots indicate tokens that the model is paying close attention to.
  • Bottom Row (\(L_2\) Norm): The dark spots indicate tokens that have a low vector magnitude.

Notice the pattern? The tokens with the highest attention (brightest top) have the lowest \(L_2\) norm (darkest bottom).

This is somewhat counter-intuitive. In many areas of deep learning, we associate larger activation magnitudes with “stronger” signals. Here, the opposite is true. The “quietest” vectors are the ones acting as attention sinks—anchors that the model focuses on heavily.
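If you want to check this relationship yourself, a rough sketch of the measurement could look like the following (the tensor shapes and names are my assumptions, not the paper's code; in practice you would capture the keys and attention weights from a model such as Llama-2-7b):

```python
import torch

def norm_attention_correlation(keys, attn_weights):
    """
    keys:         (seq_len, head_dim) cached Key vectors of one head
    attn_weights: (seq_len,)          average attention mass each position receives
    Returns the Pearson correlation between per-token key L2 norms and attention;
    the paper's observation corresponds to a strongly negative value.
    """
    key_norms = keys.norm(dim=-1)
    return torch.corrcoef(torch.stack([key_norms, attn_weights]))[0, 1].item()

# Toy usage with random tensors, just to show the plumbing.
keys = torch.randn(512, 128)
attn = torch.softmax(torch.randn(512), dim=0)
print(norm_attention_correlation(keys, attn))
```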

Quantifying the Correlation

To prove this wasn’t just a visual coincidence, the authors defined a metric called ALR (Attention Loss Reference).

First, they define the Attention Loss (\(\mathcal{L}\)) caused by compressing the cache. If you drop \(m\) tokens, the loss is the sum of the attention scores those dropped tokens would have received if they were kept.

Equation 1: Attention Loss definition

Here, \(a_{l,h,p}\) is the attention score that the \(p\)-th token receives at layer \(l\), head \(h\).

Next, they calculate the difference (\(\mathcal{Y}\)) between the loss caused by their method (dropping high norms) and the loss caused by an “ideal” oracle that cheats by knowing the exact attention scores.

Equation 2: ALR Difference Calculation

Finally, they sum this over different compression amounts to get the ALR score for each head in the model.

Equation 3: Summing the ALR

A low ALR value means the \(L_2\) norm method is very close to the ideal method. A high value means the correlation is weak.
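Putting the three pieces into symbols, a reconstruction consistent with the verbal definitions above (the notation is mine; \(D_m\) denotes the set of \(m\) dropped tokens at layer \(l\), head \(h\)):

\[
\mathcal{L}_{l,h}(m) = \sum_{p \in D_m} a_{l,h,p},
\qquad
\mathcal{Y}_{l,h}(m) = \mathcal{L}^{\mathrm{norm}}_{l,h}(m) - \mathcal{L}^{\mathrm{oracle}}_{l,h}(m),
\qquad
\mathrm{ALR}_{l,h} = \sum_{m} \mathcal{Y}_{l,h}(m),
\]

where \(\mathcal{L}^{\mathrm{norm}}\) evicts the highest-norm keys and \(\mathcal{L}^{\mathrm{oracle}}\) evicts the keys with the lowest true attention scores.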

Figure 1: ALR values across heads and layers in Llama2-7b.

Figure 1 shows this ALR score for every head in every layer of Llama-2-7b.

  • Purple areas: Low ALR (High correlation). This means \(L_2\) norm is a great predictor of importance.
  • Red/Orange areas: High ALR (Low correlation).

Key Takeaway from the Heatmap:

  1. Most layers (the vast sea of purple) show a very strong correlation.
  2. Layers 0 and 1 (the very bottom) and some heads in the middle (Layer ~10-15) show lower correlation. This suggests that for most of the network, we can safely compress based on \(L_2\) norm, but we might want to be careful with the very first few layers.

The Proposed Method: Keep Low Norm

Based on this discovery, the strategy is incredibly simple. It requires no training, no fine-tuning, and no calculation of attention matrices.

The Algorithm:

  1. When the KV cache reaches a size limit (budget), look at the Key vectors currently in the cache.
  2. Calculate the \(L_2\) norm of each Key vector.
  3. Keep the \(k\) tokens with the lowest norms, where \(k\) is the cache budget.
  4. Evict the tokens with the highest norms.

This heuristic allows the compression to happen entirely based on the static properties of the Key embeddings. It is fully compatible with FlashAttention because it doesn’t need the intermediate attention probability matrix.
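As a concrete illustration, here is a minimal per-head sketch of that eviction step (the tensor layout and names are assumptions on my part, not the authors' released implementation):

```python
import torch

def compress_kv_cache(keys, values, budget):
    """
    keys, values: (seq_len, head_dim) cached tensors for one attention head
    budget:       maximum number of tokens to keep
    Keeps the `budget` tokens whose Key vectors have the lowest L2 norm.
    """
    if keys.shape[0] <= budget:
        return keys, values
    key_norms = keys.norm(dim=-1)                                  # (seq_len,)
    keep = torch.topk(key_norms, k=budget, largest=False).indices
    keep = keep.sort().values                                      # keep original token order
    return keys[keep], values[keep]

# Usage: cap a 4,096-token cache at a 2,000-token budget.
keys, values = torch.randn(4096, 128), torch.randn(4096, 128)
keys, values = compress_kv_cache(keys, values, budget=2000)
print(keys.shape)   # torch.Size([2000, 128])
```

Sorting the surviving indices keeps the remaining tokens in their original order, which matters if the serving stack assumes position-sorted cache entries.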

Why does this work? The “Sink Token” Hypothesis

Why would a smaller vector attract more attention? The authors hypothesize that this relates to the phenomenon of Attention Sinks, a concept explored in previous research (like StreamingLLM).

Models often dump massive amounts of attention onto specific tokens (such as the beginning-of-sequence <s> token or punctuation), effectively using them as a “no-op” or resting place when no other token is relevant.

The authors dug deeper by analyzing the specific dimensions of the embeddings.

Figure 25: Key projections showing peaked activations

As shown in Figure 25 (and Figure 6 in the paper), the “important” tokens (like the BOS token) often have sparse activations. Their embeddings are mostly near zero, but have massive spikes in specific dimensions.

When the authors tried “zeroing out” these specific peaked dimensions, the attention maps changed drastically. When they zeroed out random dimensions, nothing happened. This suggests that low-norm vectors aren’t “weak”; they are highly specialized vectors that align perfectly with specific query directions, triggering massive attention scores despite their low overall magnitude.
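A toy version of that ablation, just to make the mechanism concrete (the dimensions, spike count, and scaling here are invented for illustration and are not the paper's setup):

```python
import torch

def attention_logits(queries, key):
    """Unnormalized attention each query assigns to a single key vector."""
    return (queries @ key) / key.shape[-1] ** 0.5

torch.manual_seed(0)
d = 128
queries = torch.randn(64, d)

# A "sink-like" key: tiny almost everywhere, with a few huge coordinates.
key = 0.01 * torch.randn(d)
key[:4] = 20.0

peaked = key.clone(); peaked[key.abs().topk(4).indices] = 0.0     # zero the spikes
random_ = key.clone(); random_[torch.randperm(d - 4)[:4] + 4] = 0.0  # zero 4 non-spiked dims

for name, k in [("original", key), ("peaked dims zeroed", peaked),
                ("random dims zeroed", random_)]:
    print(f"{name:>20}: mean |logit| = {attention_logits(queries, k).abs().mean().item():.3f}")
# Zeroing the spiked dimensions collapses the logits; zeroing random ones barely moves them.
```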

Experimental Results

The theory is sound, but does it work in practice? The authors tested the method on Language Modeling (perplexity) and rigorous Long-Context tasks.

1. Language Modeling (Perplexity)

They tested Llama-2, Llama-3, and Gemma on the Wikipedia dataset. They capped the KV cache at 2,000 tokens (even as the input grew much longer) and compared different eviction strategies.

Figure 3: Perplexity comparison across models

In Figure 3, the results are stark:

  • Red Line (No Compression): The baseline.
  • Green Line (Keep High Norm): This is the opposite of the proposed method. Perplexity explodes (higher is worse). This proves that high-norm tokens are indeed the least important.
  • Blue Line (Keep Low Norm): This is the proposed method. It tracks the baseline performance almost perfectly, even when compressing the cache significantly.

2. Long-Context Pressure Tests

Perplexity is a rough metric. The real test of a long-context model is retrieving specific information buried deep in the text.

Needle-in-a-Haystack: The model is given a very long text (the haystack) with a specific random fact (the needle) hidden somewhere inside. It must answer a question about that fact.

Figure 4: Needle-in-a-haystack and Passkey Retrieval scores

In Figure 4(a), we see the accuracy on Needle-in-a-haystack for Llama-2-7b-80k:

  • Keep Low Norm (Blue): Maintains nearly 100% accuracy even at 50% compression ratios. It significantly outperforms random eviction (Green).
  • Keep High Norm (Dark Blue): Fails almost immediately.

Passkey Retrieval: A similar task where the model must retrieve a specific numeric passkey. In Figure 4(b), the method maintains 100% accuracy even when compressing the cache by 90%. This is a massive reduction in memory usage with zero loss in utility for this specific task.

We can look at a more detailed breakdown of the Passkey task below:

Figure 11: Detailed Passkey Retrieval Results

Figure 11(b) shows the passkey retrieval for Llama-2-7b-longlora. The accuracy remains perfect (flat line at 1.0) until the compression ratio becomes extremely aggressive (dropping >80% of tokens).

3. Comparison with FastGen

FastGen is a state-of-the-art compression method, but it relies on attention profiling (calculating attention scores) to decide what to keep. This makes it incompatible with FlashAttention.

The authors compared their simple \(L_2\) norm strategy against FastGen (configured to run without full attention scores for fairness).

Figure 5: Comparison with FastGen

Figure 5 shows that on Llama-3-8b, the \(L_2\) norm strategy (Light Blue) yields lower perplexity than FastGen (Navy Blue), while being computationally simpler and easier to integrate into modern inference stacks.

Ablation Study: Should We Skip Layers?

Recall the heatmap in Figure 1 that showed Layers 0 and 1 had a low correlation between Norm and Attention? The authors investigated whether we should skip compression for those specific layers.

Figure 10: Skipping compression at different layers

The results in Figure 10 are nuanced.

  • Graphs (c) and (d): Skipping the first two layers (0 and 1) maintains accuracy slightly better than compressing every layer indiscriminately, especially when the compression is aggressive (a max KV cache of 1,000 tokens).
  • However, for moderate compression, the difference is negligible.
  • Takeaway: A safe default is to compress all layers, but for maximum stability, keeping the first two layers uncompressed is a cheap optimization; a sketch of how that could look follows below.
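If you adopt that optimization, one way to wire it up is a thin per-layer wrapper around the compress_kv_cache sketch from earlier (the skip_layers argument is my own naming, not an option from the paper):

```python
def compress_all_layers(cache_per_layer, budget, skip_layers=(0, 1)):
    """
    cache_per_layer: list of (keys, values) tuples, one entry per transformer layer.
    Leaves layers in `skip_layers` untouched, since their key norms correlate
    less well with attention (Figure 1), and compresses every other layer.
    """
    return [
        (keys, values) if layer in skip_layers
        else compress_kv_cache(keys, values, budget)
        for layer, (keys, values) in enumerate(cache_per_layer)
    ]
```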

Conclusion and Implications

The paper “A Simple and Effective \(L_{2}\) Norm-Based Strategy for KV Cache Compression” provides a refreshing insight in a field often dominated by increasing complexity.

Instead of training new modules or computing expensive attention matrices during inference, we can rely on a fundamental geometric property of the embedding space: Magnitude correlates with redundancy.

Key Takeaways:

  1. Low Norm = High Importance: Key embeddings with small \(L_2\) norms are the ones the model attends to most.
  2. FlashAttention Compatibility: Because this method only looks at Key vectors (which are static once computed), it never needs the attention matrix and therefore retains the speed benefits of FlashAttention.
  3. Massive Savings: The experiments show we can often reduce the KV cache by 50% to 90% with minimal performance degradation on retrieval tasks.

This strategy democratizes long-context inference. By significantly lowering the VRAM requirements, it allows longer contexts to fit on consumer hardware and increases the throughput of serving systems in production. It turns out, sometimes the best way to manage memory is just to check who is being the quietest.


This blog post summarizes the research by Devoto et al. If you are interested in the deeper mathematical proofs or additional visualizations, I encourage you to read the full paper.