Introduction: The Unbearable Slowness of Attention
Transformer-based models like BERT and GPT have revolutionized Natural Language Processing (NLP), achieving state-of-the-art results on everything from sentiment analysis to text generation. They can write code, summarize articles, and even hold surprisingly coherent conversations. But this incredible power comes at a steep price: computational cost.
The secret sauce of these models is the attention mechanism, a clever technique that allows them to weigh the importance of different words in a sentence. The problem? Attention has a quadratic complexity, meaning its computational cost grows with the square of the input sentence length. Processing a 100-word sentence is one thing, but processing a 1000-word document is 100 times more expensive.
This inefficiency is a major bottleneck. On a high-end TITAN Xp GPU, generating a 30-token sentence with GPT-2 can take 370 milliseconds — two orders of magnitude slower than classifying an image with a modern CNN. On a low-power device like a Raspberry Pi, this balloons to a painful 43 seconds. This performance barrier prevents these powerful models from being deployed widely on mobile phones and other edge devices.
What if we could make these models more… spartan? What if we could trim the fat, focusing only on the essential computations without sacrificing accuracy? This is the core idea behind SpAtten, a groundbreaking algorithm–architecture co-design from researchers at MIT. SpAtten introduces a suite of techniques to dramatically accelerate attention by exploiting the inherent redundancy in human language. It intelligently prunes away unimportant words and parts of the model on the fly, slashing both computation and memory access.
Figure 1: Cascade token and head pruning progressively removes redundancy in tokens and heads across layers without affecting accuracy.
By the third layer of a BERT model, SpAtten might only process the core concepts — “film perfect” — having pruned away 88% of the original work and still producing the correct sentiment classification. In this article, we’ll dive into how SpAtten achieves these efficiency gains through three key innovations: cascade token pruning, cascade head pruning, and progressive quantization.
Background: Why is Attention So Slow?
To understand SpAtten’s solution, we first need to understand the bottleneck in attention-based models.
The Anatomy of a Transformer
NLP models generally fall into two categories:
- Discriminative Models (e.g., BERT): Designed to understand and classify text, producing a classification (like “positive” sentiment) or regression score. This is known as the summarization stage, where the model distills the input into a final prediction.
- Generative Models (e.g., GPT-2): Designed to produce new text. They start with a summarization stage on the input prompt, then enter a generation stage that produces one token at a time.
Figure 3: BERT uses only the summarization stage, while GPT-2 uses both summarization and generation stages.
Both types are stacks of identical blocks, each containing attention and feed-forward layers. Attention takes three inputs: Query (Q), Key (K), and Value (V). In simplified terms, Q represents what a word is “looking for,” K represents what a word “offers,” and V is the content. Attention scores are computed as \( Q \times K^T \), normalized via Softmax into probabilities, and used to weight and sum the Values (\(\text{attention\_prob} \times V\)). Every word can thus attend to every other word.
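To make the data flow concrete, here is a minimal single-head attention sketch in NumPy. The variable names and the \( \sqrt{d} \) scaling are standard textbook choices for illustration, not details taken from SpAtten itself:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention (illustrative sketch)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                # attention scores, shape (n, n)
    scores -= scores.max(axis=-1, keepdims=True) # subtract row max for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)   # Softmax: each row sums to 1
    return probs @ V, probs                      # weighted sum of Values, plus attention_prob

n, d = 6, 8                                      # 6 tokens, 8-dimensional vectors
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, attention_prob = attention(Q, K, V)         # every token attends to every other token
```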
The Motivation for a New Architecture
Where does the cost come from? Let’s look at the profile of GPT-2.
Figure 2: Attention accounts for over half of GPT-2’s runtime; most latency comes from data movement rather than arithmetic.
The left chart shows attention taking more than half the total time, even though it’s a small fraction of FLOPs. The right chart reveals that 73% of attention latency is from data movement — splitting heads, reshaping, transposing, etc. This is a memory-bound problem. Simply speeding up computation won’t help much; we need to reduce memory traffic. This is SpAtten’s guiding principle.
The Core Method: SpAtten’s Three-Pronged Attack
SpAtten combines three complementary algorithmic optimizations:
1. Cascade Token Pruning
Natural language is full of redundancy. Consider “It is a very fun movie, I think.” The sentiment is “fun movie”; the rest is fluff.
SpAtten calculates a cumulative token importance score from the attention probabilities: if many tokens attend to a given token, it’s deemed important.
Figure 4: Cascade pruning uses cumulative importance scores and top-k selection to remove tokens and heads progressively.
These token importance scores are accumulated across heads and layers (Figure 5 shows such scores). Tokens with consistently low scores, like “a” or “the,” are pruned.
Figure 5: Tokens with low cumulative importance scores are removed.
Once pruned, a token’s Q, K, and V vectors are removed from all subsequent layers. This cascade effect shrinks both attention and FFN workloads as the network goes deeper.
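Below is a minimal NumPy sketch of this idea, assuming (as described above) that a token's importance is the attention probability it receives, summed over heads and accumulated across layers; the `keep_ratio` and helper names are illustrative, not the paper's exact implementation:

```python
import numpy as np

def accumulate_importance(importance, attention_probs):
    # attention_probs: (heads, n, n); column j holds how much every query attends to token j
    return importance + attention_probs.sum(axis=(0, 1))

def prune_tokens(token_ids, importance, keep_ratio=0.5):
    k = max(1, int(len(token_ids) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])   # top-k tokens, original order preserved
    return token_ids[keep], importance[keep]

n, heads = 8, 4
rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(n), size=(heads, n))        # each row is a valid distribution
importance = accumulate_importance(np.zeros(n), probs)
tokens, importance = prune_tokens(np.arange(n), importance)
# Pruned tokens' Q, K, and V vectors are never fetched or computed in later layers.
```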
2. Cascade Head Pruning
Multi-head attention lets models attend to different relationship types. Many heads, however, are redundant.
SpAtten scores heads by the magnitude of their output vectors. Heads producing large-magnitude outputs exert more influence and are kept; low-impact heads are pruned in a cascade, shrinking feature dimensionality and complementing token pruning.
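A rough sketch of head scoring, with the caveat that the mean absolute output magnitude and the keep ratio used here are illustrative assumptions rather than the accelerator's exact accumulation:

```python
import numpy as np

def prune_heads(head_outputs, keep_ratio=0.75):
    # head_outputs: (heads, n, d) per-head attention outputs for one layer
    magnitude = np.abs(head_outputs).mean(axis=(1, 2))   # one influence score per head
    k = max(1, int(len(magnitude) * keep_ratio))
    keep = np.sort(np.argsort(magnitude)[-k:])           # keep the k most influential heads
    return head_outputs[keep], keep                      # dropped heads stay pruned in later layers

heads, n, d = 12, 8, 64
outputs = np.random.default_rng(2).standard_normal((heads, n, d))
kept_outputs, kept_ids = prune_heads(outputs)
```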
3. Progressive Quantization
To cut memory access even further, SpAtten quantizes values to lower numerical precision, and it does so adaptively.
Figure 7: Flat distributions amplify quantization error; spiky distributions suppress it.
How much a quantization error \( \Delta s \) in the attention scores perturbs the resulting probabilities depends on the shape of the distribution. Differentiating the Softmax gives:
\[ \frac{\partial p_i}{\partial s_j} = \begin{cases} p_i (1 - p_i), & i = j \\ -\,p_i p_j, & i \neq j \end{cases} \]
If \(p_i\) is close to 1 or 0 (a spiky distribution), both terms are small and the error stays small; a flat distribution with evenly spread probabilities yields a much larger error.
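For a two-token example, a spiky distribution \( p = (0.99, 0.01) \) gives a diagonal derivative of \( p_i (1 - p_i) \approx 0.0099 \), while a flat \( p = (0.5, 0.5) \) gives \( 0.25 \), so the same score perturbation \( \Delta s \) produces roughly 25× more probability error in the flat case.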
SpAtten’s progressive quantization flow:
Figure 6: MSBs fetched first; LSBs fetched only if probability distribution is flat.
- Fetch MSBs First (e.g., top 4 bits).
- Check Probability Shape — if max probability < threshold → distribution is flat.
- Fetch LSBs and recompute if flat; skip if spiky.
This minimizes average memory traffic while keeping precision when needed.
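A minimal sketch of the decision logic, assuming a simple MSB/full-precision split and an illustrative `flat_threshold`; the function names are hypothetical, not from the paper's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def progressive_attention_probs(scores_msb, scores_full, flat_threshold=0.1):
    probs = softmax(scores_msb)                    # first pass: MSB-only scores
    flat = probs.max(axis=-1) < flat_threshold     # flat rows are sensitive to quantization error
    if flat.any():
        probs[flat] = softmax(scores_full[flat])   # second pass: fetch LSBs and recompute
    return probs

rng = np.random.default_rng(5)
full = rng.standard_normal((4, 8))
msb = np.round(full * 2) / 2                       # crude stand-in for keeping only the MSBs
probs = progressive_attention_probs(msb, full)
```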
The SpAtten Hardware Architecture
To capitalize on these algorithms, SpAtten is paired with a specialized accelerator.
Figure 8: SpAtten’s fully pipelined architecture integrates pruning and quantization logic.
Pipeline steps (a software sketch follows the list):
- Top-k Selection: Find most important Ks via cumulative token scores.
- Data Fetch: Retrieve K vectors from DRAM.
- Q×K Multiply: Parallel multiplier array computes scores.
- Softmax & Quantization Check: Decide LSB fetch.
- Local Value Pruning: Top-k engine picks important Vs.
- Attention×V Multiply: Output attention results.
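Here is a compressed software sketch of this dataflow for a single query; the helper names, keep ratios, and the use of sorting in place of the hardware top-k engine are all illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def spatten_attention_step(q, K, V, token_importance, keep_ratio=0.5):
    # 1) Top-k selection: pick the most important tokens before fetching their keys.
    k = max(1, int(len(token_importance) * keep_ratio))
    keep = np.sort(np.argsort(token_importance)[-k:])
    K, V = K[keep], V[keep]                     # 2) fetch only the surviving K (and later V) vectors
    scores = K @ q                              # 3) Q×K multiply
    probs = softmax(scores)                     # 4) Softmax (progressive-quantization check omitted)
    top = np.sort(np.argsort(probs)[-max(1, k // 2):])
    return probs[top] @ V[top]                  # 5) local value pruning + 6) attention×V multiply

n, d = 8, 4
rng = np.random.default_rng(4)
q = rng.standard_normal(d)
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
out = spatten_attention_step(q, K, V, token_importance=rng.random(n))
```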
The High-Parallelism Top-k Engine
Standard sorting for top-k is too slow. SpAtten uses a quick-select-based hardware engine with average \(O(n)\) complexity.
Figure 9: Quick-select partitions input into two FIFOs; repeated pivoting isolates top-k threshold quickly.
This engine avoids becoming a bottleneck, enabling real-time pruning.
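For intuition, here is a software sketch of the quick-select idea: find the k-th largest importance score by repeatedly partitioning around a pivot (the two partitions play the role of the engine's two FIFOs), then keep everything at or above that threshold. The helper name and data are illustrative:

```python
import numpy as np

def kth_largest(scores, k):
    """Quick-select: k-th largest value in average O(n) time (illustrative sketch)."""
    scores = list(scores)
    while True:
        pivot = scores[len(scores) // 2]
        larger = [s for s in scores if s > pivot]    # one FIFO in hardware
        smaller = [s for s in scores if s < pivot]   # the other FIFO
        num_equal = len(scores) - len(larger) - len(smaller)
        if k <= len(larger):
            scores = larger                          # k-th largest lies in the larger partition
        elif k <= len(larger) + num_equal:
            return pivot                             # the pivot is the k-th largest
        else:
            k -= len(larger) + num_equal
            scores = smaller

importance = np.random.default_rng(3).random(16)
threshold = kth_largest(importance, k=4)
keep_mask = importance >= threshold                  # tokens/values at or above the threshold survive
```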
Specialized Compute Modules
The Query-Key multiplication module has 512 multipliers and a configurable adder tree for varying vector sizes.
Figure 11: Reconfigurable adder tree supports multiple query/key dimensions at full throughput.
The Softmax–quantization module is tailored for SpAtten’s progressive strategy.
Figure 12: Softmax calculation feeds into a threshold-based check for LSB fetch.
Experiments and Results: Putting SpAtten to the Test
SpAtten was evaluated on 30 benchmarks, covering BERT and GPT-2 tasks, against CPUs, GPUs, and other accelerators.
Massive Speedups Over General-Purpose Hardware
SpAtten reduces DRAM access by 10× without accuracy loss.
Figure 14: Speedups of 162× over TITAN Xp, 347× over Xeon, and >1000× on mobile platforms; energy savings up to 4000×.
Beating Other Accelerators
Against A³ and MNNFast, SpAtten wins on speed, energy efficiency, and capability.
| Feature | MNNFast | A³ | SpAtten |
|---|---|---|---|
| Cascade Head Pruning | ❌ | ❌ | ✅ |
| Cascade Token Pruning | ❌ | ❌ | ✅ |
| Progressive Quantization | ❌ | ❌ | ✅ |
| Reduce DRAM Access | ❌ | ❌ | ✅ |
| Accelerates | BERT only | BERT only | BERT & GPT-2 |
| Throughput (GOP/s) | 120 | 221 | 360 |
| Energy Efficiency (GOP/J) | 120 | 269 | 382 |
Why is SpAtten So Fast? Breakdown
Figure 20: Specialized datapath: +22.1×; cascade pruning: +3.4×; top-k engine: +3×; progressive quantization: +2.8× → total >200×.
What Gets Pruned? Interpreting SpAtten
Pruned tokens in classification tasks are often grammatical filler (“it”, “is”), keeping only semantically strong words. For similarity tasks, key nouns/verbs are retained. In GPT-2 generation, pruning removes irrelevant context to focus on relevant source words.
Figure 23: Tokens like “published” and “researcher” stay important, illustrating stable semantic focus.
Conclusion and Implications
SpAtten demonstrates the power of algorithm–hardware co-design for accelerating NLP:
- Exploit Sparsity: Leverage redundancy in language to prune tokens and heads.
- Cascade Decisions: Maximize savings by propagating pruning across all subsequent layers.
- Adaptivity: Match precision to input difficulty via progressive quantization.
- Specialized Hardware: Custom modules, especially the top-k engine, are key to realizing gains.
By taking a spartan approach — frugal in computation and memory — SpAtten unlocks modern NLP models for practical use across diverse devices, from datacenter GPUs to smartphones.