The landscape of Large Language Models (LLMs) is currently dominated by Transformers. However, as anyone who has tried to feed a textbook into a standard chatbot knows, Transformers have a weakness: the “quadratic bottleneck.” As the length of the input text increases, the computational cost explodes. This has led to a surge of interest in State Space Models (SSMs), such as Mamba. SSMs promise a “sub-quadratic” alternative, theoretically allowing models to process massive sequences efficiently.
But there is a catch. While SSMs are fast, they often suffer from a form of “amnesia.” Because they rely on a fixed-size hidden state to compress history, they tend to forget early information as the context grows longer than what they saw during training.
How do we fix this without expensive retraining?
Enter LAMB (Long-context extension driven by Attention-guided token filtering in MamBa). In a recent paper, researchers from Georgia Tech and Intel Labs proposed this novel, training-free framework. LAMB intelligently selects which tokens to keep in memory and which to discard, allowing SSMs to handle significantly longer contexts with higher accuracy.

As illustrated in Figure 1, the difference is stark. While vanilla Mamba models (red line) crash to near-zero accuracy as context length increases, LAMB (blue line) maintains robust performance, rivaling and even outperforming previous enhancement techniques.
In this deep dive, we will unpack the mathematics of SSM decay, explore the “hidden attention” mechanism within Mamba, and detail exactly how LAMB filters tokens to preserve long-range dependencies.
The Context: Why SSMs Forget
To understand LAMB, we first need to understand the mechanics of the models it improves. Mamba and similar SSMs process text sequentially. Unlike a Transformer, which looks at all tokens simultaneously (global attention), an SSM looks at one token at a time and updates a running “hidden state.”
The core update rule for a Mamba block looks like this:
\[
h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t
\]
Here, \(h_t\) is the hidden state at time \(t\). The matrix \(\bar{A}\) dictates how much of the previous state (\(h_{t-1}\)) is retained, and \(\bar{B}\) dictates how much of the new input (\(x_t\)) is added.
The crucial term here is \(\bar{A}\). In Mamba, it is obtained by exponentiating a negative matrix (scaled by the step size \(\Delta_t\)), so its entries lie strictly between 0 and 1. This makes \(\bar{A}\) a decay factor: with every new token processed, the information from previous steps is multiplied by it again. Over long sequences, this repeated multiplication causes early signals to vanish exponentially. This is why a standard Mamba model struggles to answer questions about the beginning of a long book: the mathematical signal has simply eroded.
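To see the erosion concretely, here is a tiny numpy sketch of a scalar linear recurrence with a decay factor below 1. The numbers are illustrative only, not taken from the paper.

```python
import numpy as np

# Toy illustration of exponential forgetting in a linear recurrence
# h_t = a_bar * h_{t-1} + b_bar * x_t with a scalar decay a_bar < 1.
# All numbers are illustrative, not taken from the paper.
a_bar, b_bar = 0.98, 1.0
seq_len = 2000

x = np.zeros(seq_len)
x[0] = 5.0                     # an "important" signal at the very start of the sequence

h = 0.0
for t in range(seq_len):
    h = a_bar * h + b_bar * x[t]

# Because the recurrence is linear, x[0]'s share of the final state is
# b_bar * x[0] * a_bar**(seq_len - 1): it has decayed by roughly 17 orders of magnitude.
print(b_bar * x[0] * a_bar ** (seq_len - 1))   # ~1.4e-17
print(h)                                       # the same value: nothing else was ever added
```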
Previous Attempts: The Limits of LongMamba
Researchers have attempted to solve this. A notable predecessor, LongMamba, introduced the idea of Token Filtering. The logic is simple: if the memory is limited, we shouldn’t save every token. We should only update the hidden state for “important” tokens.
LongMamba decides importance based on the magnitude of the update term \(\Delta_t\). If \(\Delta_t\) is small, the model assumes the token isn’t changing the state much, so it discards it. While this helps, it is a heuristic: it assumes that the magnitude of the update is a reliable proxy for the importance of the information. As we will see, this assumption is often flawed.
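As a concrete illustration, the heuristic boils down to a Top-K selection over \(|\Delta_t|\). The sketch below is a toy reconstruction, not LongMamba’s actual code.

```python
import numpy as np

def filter_by_delta(delta, k):
    """LongMamba-style heuristic (illustrative sketch, not the original code):
    keep the k tokens with the largest update magnitude |Delta_t| and skip the rest."""
    keep = np.argsort(np.abs(delta))[-k:]      # indices of the k largest updates
    mask = np.zeros(len(delta), dtype=bool)
    mask[keep] = True
    return mask                                # True = update the state, False = skip

# Toy usage on 16 per-token update magnitudes.
delta = np.random.rand(16)
print(filter_by_delta(delta, k=4).sum())       # 4 tokens retained
```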
The Insight: Uncovering Hidden Attention
The creators of LAMB took a step back to ask: What actually constitutes an important token in an SSM?
To answer this, they utilized a method to extract an “attention map” from Mamba. Even though Mamba doesn’t use explicit attention heads like a Transformer, the relationship between input \(x\) and output \(y\) can be unrolled mathematically.
The output at any step \(t\) is calculated as:
\[
y_t = C_t\, h_t
\]
By expanding the recurrent terms, we can express the output as a sum of all previous inputs weighted by their contributions. This reveals an implicit “attention score” \(\alpha_{i,j}\):
\[
y_i = \sum_{j=1}^{i} \alpha_{i,j}\, x_j,
\qquad
\alpha_{i,j} = C_i \left( \prod_{k=j+1}^{i} \bar{A}_k \right) \bar{B}_j
\]
This equation allows us to visualize which past tokens the model is actually “looking at” when generating a new token.
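To make the unrolling concrete, here is a small numpy sketch for a single channel (scalar parameters and toy shapes are assumptions, not the paper’s implementation) that builds the implicit attention matrix and verifies it against the recurrence.

```python
import numpy as np

def implicit_attention(A_bar, B_bar, C):
    """Unroll a single-channel SSM recurrence into an implicit attention matrix:
    alpha[i, j] = C[i] * (product of A_bar over steps j+1..i) * B_bar[j],
    so that y[i] = sum_j alpha[i, j] * x[j]."""
    T = len(A_bar)
    alpha = np.zeros((T, T))
    for i in range(T):
        decay = 1.0
        for j in range(i, -1, -1):       # walk backwards from token i toward token 0
            alpha[i, j] = C[i] * decay * B_bar[j]
            decay *= A_bar[j]            # extend the decay product toward older tokens
    return alpha

# Sanity check: the unrolled form reproduces the recurrence exactly.
T = 6
A_bar = np.random.uniform(0.8, 0.99, T)
B_bar = np.random.randn(T)
C = np.random.randn(T)
x = np.random.randn(T)

h, y_recurrent = 0.0, []
for t in range(T):
    h = A_bar[t] * h + B_bar[t] * x[t]
    y_recurrent.append(C[t] * h)

y_unrolled = implicit_attention(A_bar, B_bar, C) @ x
print(np.allclose(y_recurrent, y_unrolled))   # True
```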
Analyzing the Attention Maps
When the researchers visualized these attention maps, they discovered two critical insights that form the foundation of LAMB.
1. Sparsity is Real: As shown in Figure 2(a) below, the attention map is not uniform. It exhibits a vertical stripe pattern. This means a very small subset of tokens (specific columns) dominates the attention. The model cares deeply about a few key words and largely ignores the rest.
2. Attention > \(\Delta_t\): The researchers tested different filtering strategies against an “Oracle” (the theoretical upper bound). They found that filtering tokens based on their Attention Score (how much the model attends to them) was vastly superior to filtering based on \(\Delta_t\) (the LongMamba method).

In Figure 2(b), look at the solid blue line (Attention-Guided). It tracks closely with the dashed red line (Oracle). This suggests that if we can estimate each token’s attention score ahead of time, we can keep only the highest-scoring tokens and safely discard the rest without hurting performance.
The Core Method: LAMB
The challenge is that we need to know a token’s importance before we generate the full sequence. We need a way to estimate the attention score of a token relative to the future generation.
LAMB (Long-context extension driven by Attention-guided token filtering) introduces a pipeline to do exactly this. However, using raw attention scores directly from the standard Mamba equations presents two specific problems: Bias and Noise.
Step 1: Debiased Attention
In standard Mamba attention, the decay factor \(\bar{A}\) accumulates over time. This creates a strong Recency Bias. Tokens that appeared recently have a high score simply because they haven’t decayed yet, not necessarily because they are semantically important. Conversely, critical information from the start of the prompt appears to have a tiny score because it has decayed over thousands of steps.
To fix this, LAMB introduces Debiased Attention. The researchers replace the cumulative decay factor responsible for the bias with a constant factor.
\[
\alpha^{D}_{i,j} = C_i \,\bar{A}_{\mathrm{const}}\, \bar{B}_j
\]
By removing the time-dependent decay from the measurement, the metric \(\alpha^D\) reveals the intrinsic importance of the token, putting early tokens on an equal footing with recent ones.
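Conceptually, the debiased score is just the attention formula with the cumulative decay product swapped for a constant. A minimal single-channel sketch of that idea follows; the constant’s value and the shapes are assumptions, not the authors’ exact setup.

```python
import numpy as np

def debiased_attention(B_bar, C, const=1.0):
    """Debiasing sketch for a single channel: drop the cumulative decay product
    and use a constant in its place, so a token's score no longer depends on
    how long ago it appeared.
    alpha_d[i, j] = C[i] * const * B_bar[j]  (causal / lower-triangular part)"""
    alpha_d = const * np.outer(C, B_bar)
    return np.tril(alpha_d)

# Toy usage: early and late tokens now compete on an equal footing.
T = 8
alpha_d = debiased_attention(np.random.randn(T), np.random.randn(T))
print(alpha_d.shape)   # (8, 8)
```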
Step 2: Contrastive Attention
The second problem is noise. As seen in the visualizations, attention maps can be fuzzy. To make the filtering decision robust, we need to clearly distinguish between “signal” and “noise.”
LAMB applies a Contrastive mechanism. It subtracts a portion of the maximum attention score from the current score and applies a ReLU (rectified linear unit) function.
\[
\alpha^{C}_{i,j} = \mathrm{ReLU}\!\left( \alpha^{D}_{i,j} - \gamma \max_{k} \alpha^{D}_{i,k} \right)
\]
Here, \(\gamma\) is a hyperparameter (usually around 0.9). This operation suppresses minor fluctuations: any score that falls below \(\gamma\) times the row’s peak is zeroed out. The result acts like a hard threshold, leaving only the distinct “spikes” of high importance.
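In code, the contrastive step is a per-row subtract-and-ReLU; a minimal sketch, assuming the debiased scores from the previous step:

```python
import numpy as np

def contrastive_attention(alpha_d, gamma=0.9):
    """Contrastive step sketch: subtract gamma times each row's peak score and
    clip at zero, so only scores close to the peak survive.
    alpha_c[i, j] = ReLU(alpha_d[i, j] - gamma * max_k alpha_d[i, k])"""
    row_max = alpha_d.max(axis=-1, keepdims=True)
    return np.maximum(alpha_d - gamma * row_max, 0.0)

# Toy usage: one row of debiased scores with two dominant spikes.
row = np.array([[0.05, 0.9, 0.1, 0.85, 0.02]])
print(contrastive_attention(row, gamma=0.9))   # only the 0.9 and 0.85 entries survive
```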
The visual impact of these transformations is illustrated in Figure 3. Notice how the “Contrastive Attention” (third row) is much cleaner and sharper than the original noisy attention.

Step 3: Aggregation and Pooling
SSMs function using multiple “channels” in their hidden states. LongMamba identified that some channels are “Global” (long-term memory) and others are “Local.” LAMB focuses its filtering efforts on these Global Channels.
To get a single importance score for a token \(t\), LAMB sums the contrastive attention scores across all global channels and across the “observation window” (the last few tokens the model saw).
\[
s_j = \sum_{c \in \mathcal{G}} \sum_{i \in \mathcal{W}} \alpha^{C,(c)}_{i,j}
\]
where \(\mathcal{G}\) denotes the set of global channels and \(\mathcal{W}\) the observation window.
Finally, there is one last clever trick. Language is rarely about a single isolated word; it is about phrases and local context. If we just pick single spikes, we might lose the surrounding context that gives a word meaning.
To address this, LAMB applies Mean Pooling to the importance scores.
\[
\tilde{s}_j = \frac{1}{2w+1} \sum_{k=j-w}^{j+w} s_k
\]
with \(w\) the pooling radius.
This smooths the selection, ensuring that if a token is selected, its immediate neighbors are likely to be selected as well, preserving local semantic integrity.
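A compact sketch of the aggregation and pooling, assuming contrastive attention maps of shape (channels, T, T) and illustrative choices for the observation window and pooling radius:

```python
import numpy as np

def token_importance(alpha_c, global_channels, window, pool_radius=1):
    """Aggregate contrastive attention maps into one importance score per token.

    alpha_c:          contrastive attention, shape (channels, T, T)
    global_channels:  indices of the "global" (long-term memory) channels
    window:           number of most recent query positions (observation window)
    Shapes, the window size and the pooling radius are illustrative assumptions.
    """
    chans = np.asarray(global_channels)
    # Sum over the selected global channels and the last `window` query rows.
    scores = alpha_c[chans][:, -window:, :].sum(axis=(0, 1))          # shape (T,)

    # Mean-pool so that neighbours of an important token are also likely to be kept.
    kernel = np.ones(2 * pool_radius + 1)
    return np.convolve(scores, kernel, mode="same") / kernel.size

# Toy usage: 4 channels, 128 tokens, channels 0 and 2 treated as "global".
alpha_c = np.random.rand(4, 128, 128)
scores = token_importance(alpha_c, global_channels=[0, 2], window=16)
print(scores.shape)   # (128,)
```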
The Complete Pipeline
The full inference workflow with LAMB is as follows:
- Process the Prompt: The model processes the input text.
- Calculate Metrics: For every token, it calculates the Debiased, Contrastive Attention score relative to the most recent “observation window.”
- Select Top-K: It identifies the top \(K\) tokens with the highest aggregated importance scores.
- Filter & Update:
  - For the selected Top-K tokens, the model updates the hidden state normally.
  - For the unselected tokens, the model sets \(\Delta_t = 0\). This forces the hidden state to remain unchanged (\(h_t = h_{t-1}\)), effectively “skipping” the token’s impact on the memory state while preserving the memory of previous important tokens.
This is entirely training-free. You can take a pre-trained Mamba model and apply this logic during inference to instantly boost its long-context capabilities.
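Putting the final step into code: a toy sketch of selecting the Top-K tokens and zeroing \(\Delta_t\) for the rest during prefill. Names, shapes, and default values here are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

def apply_lamb_filter(delta, scores, top_k):
    """Prefill-time filtering sketch: keep the top_k highest-scoring tokens and
    zero Delta_t for everyone else. With Delta_t = 0 the discretised A_bar becomes
    the identity and B_bar becomes 0, so the hidden state passes through unchanged
    (h_t = h_{t-1})."""
    keep = np.argsort(scores)[-top_k:]          # indices of the most important tokens
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return np.where(mask, delta, 0.0), mask

# Toy usage: 1,024 prompt tokens, keep the 256 most important ones.
scores = np.random.rand(1024)                   # pooled importance scores
delta = np.random.rand(1024)                    # per-token update terms
filtered_delta, kept = apply_lamb_filter(delta, scores, top_k=256)
print(int(kept.sum()))                          # 256 tokens still update the hidden state
```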
Experiments and Results
The researchers evaluated LAMB on two rigorous benchmarks for long-context understanding: HELMET and RULER. They tested it on pure SSMs (Mamba2) and hybrid models (Zamba2).
Performance on RULER
RULER is a synthetic benchmark designed to test precise retrieval over long distances. The results in Table 2 are compelling.

Look at the Avg. column on the far right.
- Vanilla Mamba2-1.3B scores a negligible 0.27%, effectively failing the task outright at 16k context length.
- LongMamba improves this to 10.82%.
- LAMB jumps to 33.96%.
This represents a roughly 3x improvement over the previous state-of-the-art method. The improvement is particularly strong in “multiquery” and “variable tracking” (vt) tasks, which require maintaining specific pieces of information over long durations without distraction.
Performance on HELMET
HELMET covers diverse tasks closer to real-world applications, such as long-document QA and summarization.

The trend holds. Across 8k, 16k, and 32k context lengths, LAMB consistently outperforms both the vanilla model and LongMamba. For example, on the 16k context task for Zamba2, LAMB achieves 12.35 compared to LongMamba’s 11.35 and Vanilla’s 6.76.
Why the Components Matter (Ablation)
Is all the complexity of debiasing and pooling really necessary? The ablation study suggests yes.

- No Denoising, No Pooling: The accuracy is a mere 3.40%.
- Pooling only: Accuracy jumps to 27.22%. This shows that preserving local context around key tokens is vital.
- Denoising + Pooling (Full LAMB): Reaches the peak of 33.96%. The debiasing and contrastive steps add the final layer of precision needed for top performance.
Latency Costs
One might worry that calculating these attention matrices would slow the model down. However, because LAMB operates during the “prefill” phase (processing the prompt) and uses efficient operations, the overhead is minimal.

As shown in Table 4, for very long sequences (192k tokens), the overhead is only about 5.78%. Crucially, there is zero overhead during the generation stage (when the chatbot is actually writing the response), because the filtering has already happened.
Conclusion
The LAMB framework represents a significant maturation in our understanding of State Space Models. It moves us away from heuristic guesses about token importance and towards a principled, attention-based metric.
By debiasing the attention signal and applying contrastive filtering, LAMB allows Mamba models to “see” clearly through the noise of long sequences. It successfully identifies the needle in the haystack—the critical tokens that must be remembered—and discards the rest.
The implications are exciting:
- Efficiency: We can run long-context tasks on hardware that previously couldn’t handle the memory requirements.
- Usability: Because it is training-free, LAMB can be applied immediately to existing Mamba deployments.
- Understanding: This work deepens our theoretical understanding of how SSMs internally represent importance, bridging the gap between the transparency of Transformer attention and the efficiency of Recurrent Neural Networks.
As we strive for LLMs that can read entire libraries of text, techniques like LAMB will be essential in ensuring those models don’t just read, but actually remember.