Introduction
The landscape of sequence modeling is shifting. For years, the Transformer architecture has reigned supreme, driving the revolution in Large Language Models (LLMs). However, a new contender has emerged: State Space Models (SSMs), most notably the Mamba architecture.
Mamba has generated significant excitement because it solves the Transformer’s biggest bottleneck: the quadratic computational cost of attention. Mamba scales linearly with sequence length, making it a potential “Transformer killer” for processing massive contexts. However, scaling Mamba to billions of parameters still presents a massive computational challenge. To deploy these models in real-world applications, we need to make them more efficient.
In the world of Transformers, a popular technique for efficiency is Token Reduction—removing or merging redundant tokens (like “the” or “um”) so the model processes less data. It seems like a logical step: take the state-of-the-art token reduction methods from Transformers and apply them to Mamba.
But there is a catch. As the researchers behind “Rethinking Token Reduction for State Space Models” discovered, directly applying these Transformer-based techniques to Mamba results in a catastrophic drop in performance.
In this post, we will explore why standard optimization techniques break SSMs, analyze the underlying mechanics of Mamba that cause this sensitivity, and detail the researchers’ novel solution: a Unified Token Reduction (UTR) strategy that successfully makes Mamba lighter and faster without sacrificing its intelligence.
Background: Understanding the Challenge
Before diving into the solution, we need to understand the two main components at play: the State Space Model architecture and the concept of Token Reduction.
The State Space Model (SSM)
Unlike Transformers, which look at all tokens simultaneously (global attention), SSMs process data sequentially, much like Recurrent Neural Networks (RNNs). They map an input sequence \(x(t)\) to an output sequence \(y(t)\) through a hidden state \(h(t)\).
The fundamental continuous system is described by the following equations:

\[
h'(t) = \mathbf{A}\,h(t) + \mathbf{B}\,x(t), \qquad y(t) = \mathbf{C}\,h(t)
\]
Here, \(\mathbf{A}\), \(\mathbf{B}\), and \(\mathbf{C}\) are matrices that define how the state evolves and how it projects to the output. To make this computable on digital hardware, Mamba discretizes this system. It introduces a timescale parameter \(\Delta\) to transform continuous parameters into discrete ones (\(\bar{\mathbf{A}}, \bar{\mathbf{B}}\)). This results in a recurrence formula:

\[
h_t = \bar{\mathbf{A}}\,h_{t-1} + \bar{\mathbf{B}}\,x_t, \qquad y_t = \mathbf{C}\,h_t
\]
This recurrence allows the model to “remember” history in the compressed state \(h_t\). Crucially, Mamba can also be trained in parallel using a global convolution, avoiding the slow sequential processing of traditional RNNs during training:

\[
\bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}},\; \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\; \dots,\; \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}\right), \qquad y = x * \bar{\mathbf{K}}
\]
This dual nature—recurrent inference and parallel training—is Mamba’s superpower. However, the recurrent nature creates a dependency chain. Every state depends on the previous state.
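To make the recurrence concrete, here is a minimal NumPy sketch of a single-input, single-output discrete SSM. The zero-order-hold discretization and the sequential scan follow the standard formulation summarized above; the matrix sizes and values are illustrative, not taken from any Mamba implementation (which, unlike this sketch, also makes \(\mathbf{B}\), \(\mathbf{C}\), and \(\Delta\) input-dependent).

```python
import numpy as np
from scipy.linalg import expm

def discretize(A, B, delta):
    """Zero-order-hold discretization: (A, B, delta) -> (A_bar, B_bar)."""
    N = A.shape[0]
    A_bar = expm(delta * A)                                     # exp(delta * A)
    B_bar = np.linalg.solve(delta * A, A_bar - np.eye(N)) @ (delta * B)
    return A_bar, B_bar

def ssm_recurrence(x, A, B, C, delta):
    """Sequential scan: h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t."""
    A_bar, B_bar = discretize(A, B, delta)
    h = np.zeros((A.shape[0], 1))
    ys = []
    for x_t in x:                      # every step depends on the previous state
        h = A_bar @ h + B_bar * x_t
        ys.append((C @ h).item())
    return np.array(ys)

# Toy example: N = 4 state dimensions, L = 8 input tokens.
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 1))
C = rng.standard_normal((1, 4))
x = rng.standard_normal(8)
print(ssm_recurrence(x, A, B, C, delta=0.1))
```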
The Concept of Token Reduction
In Natural Language Processing (NLP) and Vision, not all tokens are created equal. In a picture of a dog, the pixels representing the background wall are less important than the pixels representing the dog’s eyes. In a sentence, the core subject is more vital than a filler word.
Token reduction methods generally fall into two categories:
- Pruning: Identifying “useless” tokens and simply deleting them (e.g., EViT).
- Merging: Identifying similar tokens and combining them into a single representation (e.g., PuMer or ToMe).
These methods work wonders on Transformers. But when applied to Mamba, they fail.
Why Transformer Methods Fail on Mamba
The researchers began their study by applying state-of-the-art Transformer reduction methods—EViT (Pruning) and PuMer (Merging)—to a Mamba-2.8B model. The results were stark.

As shown in Figure 1 above, applying EViT (red bar) caused a 20% drop in accuracy. Applying PuMer (green bar) resulted in a 26% drop.
The Analysis
Why is Mamba so fragile compared to Transformers? The researchers identify two main culprits:
1. Unrecoverable Information Loss (Pruning Failure): In a Transformer, if you prune a token, the self-attention mechanism in the next layer can still gather context from the remaining tokens. In Mamba, the computation is sequential (the recurrence above). The hidden state \(h_t\) is an accumulation of history, so pruning a token \(x_t\) does not just discard that word; it breaks the chain of state evolution. This introduces an information gap that gets amplified as the sequence progresses (the toy example after this list makes the effect concrete).
2. Neglect of Importance (Merging Failure): Existing merging strategies (like ToMe or PuMer) act based on similarity. They partition tokens into two groups and merge them strictly based on how much they look alike. They do not check if a token is important before merging it. In SSMs, merging a highly important token into a less important one dilutes the signal required for the sequential update, corrupting the hidden state.
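To see point 1 in action, consider a toy scalar recurrence, a stand-in for the SSM update above rather than the paper's model: deleting a single input leaves the earlier states untouched but shifts every state that comes after it, exactly the kind of gap that an attention layer could compensate for but a sequential scan cannot.

```python
import numpy as np

def scan(x, a=0.9, b=1.0):
    """Toy scalar SSM: h_t = a * h_{t-1} + b * x_t."""
    h, states = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
x = rng.standard_normal(10)

full = scan(x)
pruned = scan(np.delete(x, 4))                 # prune the 5th token

print(np.abs(full[:4] - pruned[:4]).max())     # 0.0 -- the prefix is unaffected
print(np.abs(full[5:] - pruned[4:]).max())     # > 0 -- every later state has drifted
```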
The Solution: Unified Token Reduction (UTR)
To fix this, the researchers proposed a new framework tailored to SSMs. Their method rests on two pillars: an importance metric suited to SSM hidden states, and a hybrid prune-merge reduction strategy.
Step 1: Rethinking Token Importance
We cannot randomly prune or merge tokens. We must know which ones carry the “signal.” The researchers analyzed the hidden states \(y\) output by the SSM layer.

They found that a specific metric was the best predictor of importance: the average of clipped feature values. Unlike Transformers, where attention maps provide a ready-made importance score, Mamba’s hidden states are high-dimensional feature vectors. Averaging the positive activations (negative values clipped to zero) across the feature dimension \(D'\) yields the best “importance score” \(S\) for token \(i\):

\[
S_i = \frac{1}{D'} \sum_{j=1}^{D'} \max\!\left(0,\; y_{i,j}\right)
\]
By using \(\max(0, \dots)\), the metric focuses on the presence of strong features rather than the magnitude of negative suppressions.
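As a sketch, assuming the SSM outputs of one layer are stored as an \(L \times D'\) matrix \(y\) (one row per token), the score reduces to a clipped mean over the feature dimension; the shapes and names here are illustrative, not the paper's code.

```python
import numpy as np

def importance_scores(y):
    """Score each token by the mean of its positive (clipped) activations.

    y: array of shape (L, D') -- per-token SSM outputs for one layer.
    Returns an array of shape (L,): higher score = more important token.
    """
    return np.maximum(y, 0.0).mean(axis=-1)

y = np.random.default_rng(0).standard_normal((6, 8))   # 6 tokens, D' = 8
S = importance_scores(y)
ranking = np.argsort(S)[::-1]                          # tokens ranked by importance
print(S, ranking)
```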
Step 2: The UTRC Workflow
With the importance score calculated, the Unified Token Reduction by Classification (UTRC) method proceeds in a structured pipeline.

As illustrated in Figure 2, the process involves three distinct stages (a code sketch of the full pipeline follows the list):
- Importance Classification: The tokens are ranked by their score \(S\) and split into two sets:
  - Set \(M_A\) (Less Important): candidates to be removed or merged.
  - Set \(M_B\) (More Important): the “anchors” that must be preserved.
- Connection: The system doesn’t just delete Set \(M_A\); it tries to save its information. For every “unimportant” token \(a_i\) in Set \(M_A\), the algorithm finds its most similar “important” counterpart \(f_i\) in Set \(M_B\) using cosine similarity:
  \[
  f_i = \arg\max_{b_j \in M_B} \frac{a_i \cdot b_j}{\lVert a_i \rVert \, \lVert b_j \rVert}
  \]
- Unified Reduction: This is where the method shines. Instead of strictly pruning or strictly merging, it uses a hybrid approach controlled by a similarity threshold:
  - Merging: If an unimportant token \(a_i\) is sufficiently similar to its matched counterpart \(f_i\), their features are averaged (merged). The history is preserved in the fused token.
  - Pruning: If an unimportant token isn’t similar enough to any important token, it is pruned.
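Below is the sketch referenced above: one possible NumPy rendering of the classify-connect-reduce pipeline for a single layer's token matrix. The keep ratio, similarity threshold, and simple averaging rule are illustrative assumptions based on the description in this post, not the authors' released code.

```python
import numpy as np

def utrc_reduce(x, scores, keep_ratio=0.8, sim_threshold=0.5):
    """Hybrid token reduction for one layer: classify -> connect -> merge or prune.

    x:      (L, D) token features
    scores: (L,)  per-token importance scores (e.g. the clipped mean above)
    """
    L = x.shape[0]
    n_keep = int(L * keep_ratio)

    order = np.argsort(scores)[::-1]              # most important first
    keep_idx, drop_idx = order[:n_keep], order[n_keep:]
    anchors = x[keep_idx].copy()                  # set M_B: preserved anchors
    candidates = x[drop_idx]                      # set M_A: to merge or prune

    # Connection: cosine similarity of every candidate to every anchor.
    a = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    b = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    sim = a @ b.T                                 # (|M_A|, |M_B|)
    best = sim.argmax(axis=1)                     # index of f_i for each a_i

    # Unified reduction: merge sufficiently similar candidates, prune the rest.
    for i, j in enumerate(best):
        if sim[i, j] >= sim_threshold:
            anchors[j] = 0.5 * (anchors[j] + candidates[i])   # merge (average)
        # else: prune -- the candidate token is simply dropped

    return anchors[np.argsort(keep_idx)]          # surviving tokens, original order


# Toy usage: 10 tokens with 16-dim features, keep 8 of them.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 16))
scores = np.maximum(tokens, 0.0).mean(axis=-1)
print(utrc_reduce(tokens, scores).shape)          # (8, 16)
```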
Step 3: Fine-Grained Design Choices
The researchers discovered that treating every part of the network the same is suboptimal. They introduced a decoupled strategy for the Hidden States and the Residual Connections; a short sketch after the list below makes the split concrete.
- Hidden States: A hybrid of pruning and merging is used (e.g., prune 50% of the candidates, merge the other 50%). This balances removing noise while keeping essential context.
- Residual Connections: Mostly merging is used. Residual connections are crucial for gradient flow and carrying information from previous layers. Pruning here is dangerous. By merging tokens in the residual path, the model preserves the signal integrity even as the sequence length shrinks.
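The ratio in the first bullet can be made explicit with a tiny helper that decides, per reduction candidate, whether it is merged into its anchor or pruned outright; the residual path then simply uses \(q = 1.0\) (merge-only). The similarity values below are illustrative, not from the paper.

```python
import numpy as np

def merge_or_prune(best_sims, q):
    """Return a boolean mask over reduction candidates: True = merge into its
    anchor, False = prune. q is the fraction of candidates that get merged
    (q=1.0 reproduces merge-only, q=0.0 prune-only)."""
    n_merge = int(round(q * len(best_sims)))
    merge_idx = np.argsort(best_sims)[::-1][:n_merge]   # merge the most similar ones
    mask = np.zeros(len(best_sims), dtype=bool)
    mask[merge_idx] = True
    return mask

best_sims = np.array([0.9, 0.2, 0.7, 0.4])   # each candidate's best-anchor similarity
print(merge_or_prune(best_sims, q=0.5))      # hidden states: [ True False  True False]
print(merge_or_prune(best_sims, q=1.0))      # residual path: all True (merge-only)
```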
Experimental Results
Does this tailored approach work? The results suggest a resounding yes.
Accuracy Recovery
The researchers tested their method on Mamba-2 (1.3B and 2.7B) and original Mamba models across six standard benchmarks (like LAMBADA, HellaSwag, and PIQA).
Comparing the proposed method against the standard EViT and PuMer baselines:
- On Mamba-2-1.3B, the proposed method achieved an average accuracy of 54.6% under 20% FLOPs reduction, compared to just 44.2% for EViT.
- On Mamba-2-2.7B, the gap was even wider. Under 30% reduction, the proposed method maintained 54.7% accuracy, while existing methods dropped to ~41%.
In many cases, the method managed to reduce computational load significantly while keeping perplexity (PPL) low, meaning the model didn’t get “confused” by the missing tokens.
Efficiency Gains
The primary goal of token reduction is speed and memory savings. The method delivers on both fronts.
GPU Peak Memory: Large sequences consume massive amounts of VRAM. By reducing tokens hierarchically across layers, the method significantly lowers peak memory usage.

As shown in Figure 3, for Mamba-2.8B, the method reduces peak memory by up to 40% (at 30% FLOPs reduction). This is a game-changer for deploying these models on consumer-grade hardware or edge devices.
Throughput (Speed): Lowering the token count also translates directly to faster generation speeds.

Figure 4 demonstrates that Mamba-2.8B sees a 1.29x speedup in generation throughput. This acceleration makes the model far more responsive for real-time applications like chatbots or code generation.
Validating Design Choices
The researchers performed ablation studies to prove their specific design choices were necessary.
Importance Metric: They compared their clipped-activation metric against standard \(L1\) and \(L2\) norms. The clipped version (the score \(S\) defined above) consistently yielded lower perplexity and higher accuracy (Table 3 in the paper), showing that positive activation strength is a better proxy for importance in SSMs.
Hybrid Strategy: They tested “Prune-Only,” “Merge-Only,” and various hybrid ratios (\(q\)).
- Merge-Only worked best for residual connections (preserving information flow).
- Hybrid (50/50) worked best for hidden states (balancing noise removal with context retention).

As seen in Table 5, the combination (\(q=0.5\) for Hidden, Merge-only for Residual) achieved the highest accuracy (54.7%), validating the decoupled approach.
Conclusion
The transition from Transformers to State Space Models like Mamba represents a shift toward more efficient long-sequence modeling. However, efficiency techniques cannot be simply copy-pasted between architectures.
The failure of standard token reduction on Mamba highlights the unique sensitivity of SSMs: you cannot break the sequential chain without consequences.
By introducing a Unified Token Reduction method that respects token importance and employs a hybrid prune-merge strategy, the researchers have provided a blueprint for the future of efficient SSMs. Their method proves that we can have the best of both worlds: the long-context capabilities of Mamba and the lightweight efficiency of reduced token counts.
As SSMs continue to mature, techniques like this will be essential for moving these models out of research labs and into practical, real-world deployments.