Transformers are everywhere—powering tools from ChatGPT to code completion assistants—but they have a well-known Achilles’ heel: the self-attention mechanism. As you feed a Transformer longer sequences of text, the computation required for attention grows quadratically. Doubling the sequence length means quadrupling the work. This computational bottleneck makes training on very long documents, high-resolution images, or extensive codebases both difficult and expensive.
Researchers have long suspected that much of this work is wasted. In practice, a token only needs to closely attend to a small subset of other tokens. This insight fueled research into sparse attention—methods that skip unnecessary computations. While some approaches rely on fixed patterns, others attempt dynamic, data-dependent strategies.
Theoretically, dynamic sparsity methods are attractive. In practice, though, they often trail the speed of simply computing the full, dense attention matrix, especially since FlashAttention arrived. FlashAttention is an I/O-aware, GPU-optimized implementation of attention that minimizes slow memory operations, making vanilla attention blazing fast. However, it is limited to the dense, regular structure of standard causal attention: dynamic sparse patterns, with their irregularity, break FlashAttention's assumptions and erase its performance edge.
This is where the paper Faster Causal Attention Over Large Sequences Through Sparse Flash Attention comes in. The researchers extend FlashAttention to handle irregular dynamic sparsity without losing speed. Their method—Sparse Causal Flash Attention (SCFA)—brings the theoretical promise of sparse attention into the realm of high-performance GPU kernels. Using SCFA, they trained language models on sequences of 8k and 16k tokens 2.0× and 3.3× faster, respectively, with no loss in model quality.
Let’s explore how they did it.
The Bottleneck and Past Attempts
The Quadratic Problem
Self-attention calculates a score between every pair of tokens in a sequence. With length \(T\), this becomes a \(T \times T\) matrix of scores—an operation that scales with \(T^2\).
For short sequences, that’s manageable. But at 16,000 tokens, the attention matrix has over 250 million entries. As sequence length grows, attention dominates the model’s runtime.
Figure 9: Quadratic computational cost of self-attention dominates for longer sequences.
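To make the scaling concrete, here is a minimal dense causal attention sketch in PyTorch (single head, illustrative sizes; not the paper's code). The (T, T) score matrix is exactly the object that grows quadratically.

```python
import torch

def naive_causal_attention(q, k, v):
    """q, k, v: (T, d) tensors for a single attention head."""
    T, d = q.shape
    scores = q @ k.T / d**0.5                              # (T, T): the quadratic term
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))     # causal mask: hide future tokens
    return torch.softmax(scores, dim=-1) @ v               # (T, d)

q = k = v = torch.randn(4096, 64)
out = naive_causal_attention(q, k, v)   # ~16.7M scores here; at T = 16k it would be ~256M
```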
The Rise of FlashAttention
Attention wasn’t bottlenecked only by FLOPs—it was also limited by memory access. GPUs have fast but small SRAM, while storing full attention matrices requires large, slower High Bandwidth Memory (HBM). Naive implementations waste most of their time moving data between these memories.
FlashAttention (Dao et al., 2022) sidestepped this by computing attention in small tiles that fit in SRAM, minimizing slow memory operations. This reorganization preserved the exact attention math but delivered >5× speedups.
Causal attention for autoregressive models—where a token can only see previous tokens—fits neatly into a lower-triangular mask. FlashAttention leverages this predictable block structure to stay fast.
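As a mental model (not the actual fused GPU kernel), FlashAttention's idea can be sketched in plain PyTorch: loop over key tiles for each query tile, keep running softmax statistics, and never materialize the full (T, T) score matrix. The block size and shapes here are illustrative.

```python
import torch

def tiled_causal_attention(q, k, v, block=128):
    """Numerically equivalent to dense causal attention, computed tile by tile."""
    T, d = q.shape
    out = torch.zeros_like(q)
    for i in range(0, T, block):                           # loop over query tiles
        qi = q[i:i + block]
        m = torch.full((qi.shape[0],), float("-inf"))      # running row max
        l = torch.zeros(qi.shape[0])                       # running softmax denominator
        acc = torch.zeros_like(qi)                         # running weighted sum of V
        for j in range(0, i + block, block):               # only past/diagonal key tiles
            kj, vj = k[j:j + block], v[j:j + block]
            s = qi @ kj.T / d**0.5                         # one (block, block) tile of scores
            q_pos = torch.arange(i, i + qi.shape[0]).unsqueeze(1)
            k_pos = torch.arange(j, j + kj.shape[0]).unsqueeze(0)
            s = s.masked_fill(k_pos > q_pos, float("-inf"))  # causal mask inside the diagonal tile
            m_new = torch.maximum(m, s.max(dim=-1).values)
            p = torch.exp(s - m_new.unsqueeze(1))
            scale = torch.exp(m - m_new)                   # rescale old statistics to the new max
            l = l * scale + p.sum(dim=-1)
            acc = acc * scale.unsqueeze(1) + p @ vj
            m = m_new
        out[i:i + block] = acc / l.unsqueeze(1)
    return out
```

Note how the causal structure shows up as the inner loop simply stopping at the diagonal tile; that predictable cutoff is exactly the regularity that dynamic sparsity disrupts.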
Why Dynamic Sparsity Breaks It
Dynamic sparsity removes unnecessary token interactions in real time, via:
- Query/Key Dropping (QK-Sparse): Dynamically prune queries and keys judged unimportant.
- Hashing (Hash-Sparse): Use techniques like Locality-Sensitive Hashing (LSH) to group similar tokens into “buckets” and compute attention only within these groups, as in the Reformer (a minimal bucketing sketch follows this list).
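For intuition, here is a hedged sketch of Reformer-style LSH bucketing via random projections; the exact hashing scheme and the bucket count are illustrative assumptions, not the paper's implementation.

```python
import torch

def lsh_bucket_ids(x, n_buckets=16, seed=0):
    """x: (T, d) queries or keys; returns one bucket ID per token."""
    torch.manual_seed(seed)                       # queries and keys must share the projection
    proj = torch.randn(x.shape[-1], n_buckets // 2)
    rotated = x @ proj                            # (T, n_buckets // 2)
    # argmax over [h, -h] assigns each token to one of n_buckets angular regions,
    # so nearby vectors tend to land in the same bucket
    return torch.cat([rotated, -rotated], dim=-1).argmax(dim=-1)

k_hash = lsh_bucket_ids(torch.randn(1024, 64))
# attention scores are then computed only between tokens that share a bucket ID
```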
Dropping tokens, or reordering them by hash bucket, scrambles the neat causal order. Even in the compacted tensors, each token must still respect its original position for causality to hold. The resulting irregular masks break FlashAttention's efficiency.
The SCFA approach: preserve FlashAttention’s memory efficiency while accommodating this irregularity.
The SCFA Solution
Sparse Causal Flash Attention builds upon FlashAttention’s tiling but adds awareness of original token indices. Two specialized kernels—one for QK-sparse and one for Hash-sparse attention—allow SCFA to prune entire tiles and apply perfect causal masks even in irregular layouts.
1. QK-Sparse: Efficient Token Dropping
Dropping tokens creates smaller, compacted query/key/value tensors.
Figure 1: How QK-sparse (top) and Hash-sparse (bottom) attention modify the attention matrix.
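A minimal sketch of that compaction step, assuming a boolean keep-mask is already available per token (how tokens are scored for dropping is a separate design choice and not shown here):

```python
import torch

T, d = 1024, 64
q, k, v = torch.randn(3, T, d)
keep_q = torch.rand(T) > 0.3                   # illustrative: keep roughly 70% of queries
keep_k = torch.rand(T) > 0.3                   # and roughly 70% of keys

q_idx = keep_q.nonzero(as_tuple=True)[0]       # original positions of surviving queries
k_idx = keep_k.nonzero(as_tuple=True)[0]       # original positions of surviving keys
q_c, k_c, v_c = q[q_idx], k[k_idx], v[k_idx]   # compacted tensors handed to the kernel
```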
SCFA takes two extra vectors:

- q_idx: the original positions of the queries.
- k_idx: the original positions of the keys.

For each tile \(\mathcal{T}_{i,j}\) (query block \(Q_i\) vs. key block \(K_j\)), the kernel applies two rules (see the sketch after this list):

- Block Pruning: Skip tiles where max(q_idx_i) < min(k_idx_j), i.e., all queries precede all keys, so every key in the tile lies in the future.
- Element Masking: Within the surviving tiles, mask future tokens using q_idx and k_idx.
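Here is a hedged, plain-PyTorch sketch of those two rules combined with FlashAttention-style running softmax statistics. The real SCFA implementation is a fused GPU kernel; the function name, block size, and the assumption that the earliest kept key precedes the earliest kept query are mine.

```python
import torch

def qk_sparse_causal_attention(q_c, k_c, v_c, q_idx, k_idx, block=128):
    """q_c, k_c, v_c: compacted tensors; q_idx, k_idx: original positions (sorted ascending)."""
    assert k_idx[0] <= q_idx[0], "assumes every query still sees at least one earlier-or-equal key"
    Tq, d = q_c.shape
    out = torch.zeros_like(q_c)
    for i in range(0, Tq, block):                          # loop over query tiles
        qi, qpos = q_c[i:i + block], q_idx[i:i + block]
        m = torch.full((qi.shape[0],), float("-inf"))      # running row max
        l = torch.zeros(qi.shape[0])                       # running softmax denominator
        acc = torch.zeros_like(qi)                         # running weighted sum of V
        for j in range(0, k_c.shape[0], block):            # loop over key tiles
            kpos = k_idx[j:j + block]
            if qpos.max() < kpos.min():                    # block pruning: the whole tile is "future"
                continue
            kj, vj = k_c[j:j + block], v_c[j:j + block]
            s = qi @ kj.T / d**0.5
            s = s.masked_fill(kpos.unsqueeze(0) > qpos.unsqueeze(1), float("-inf"))  # element masking
            m_new = torch.maximum(m, s.max(dim=-1).values)
            p = torch.exp(s - m_new.unsqueeze(1))
            scale = torch.exp(m - m_new)
            l = l * scale + p.sum(dim=-1)
            acc = acc * scale.unsqueeze(1) + p @ vj
            m = m_new
        out[i:i + block] = acc / l.unsqueeze(1)
    return out
```

Because q_idx and k_idx stay sorted after dropping, whole tiles beyond the causal frontier can be skipped before any scores are computed.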
Figure 2: Tile computation patterns for standard, QK-sparse, and Hash-sparse SCFA.
2. Hash-Sparse: Exact Bucket Attention
In LSH-based attention, queries and keys are sorted into buckets by hash ID, producing a block-diagonal-like structure.
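Concretely, the reordering step can be pictured as a sort by bucket ID that also records where each token came from (a hedged sketch; the bucket IDs here are random placeholders rather than real LSH hashes):

```python
import torch

T, d = 1024, 64
q, k, v = torch.randn(3, T, d)
q_hash, k_hash = torch.randint(0, 16, (2, T))               # placeholder bucket IDs per token

q_order, k_order = torch.argsort(q_hash), torch.argsort(k_hash)
q_s, q_idx, q_hash_s = q[q_order], q_order, q_hash[q_order]  # sorted queries + their metadata
k_s, k_idx, k_hash_s = k[k_order], k_order, k_hash[k_order]  # sorted keys + their metadata
v_s = v[k_order]
# same-bucket tokens are now contiguous, so attention only needs to touch
# tiles whose query and key bucket ranges overlap
```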
SCFA uses:

- q_idx, k_idx: original positions.
- q_hash, k_hash: bucket IDs.
It prunes and masks via:
- Bucket Pruning: Only compute tiles whose query and key blocks share at least one bucket ID.
- Causality Pruning: Skip bucket-blocks violating causal order.
- Element-level Masking: Within tiles, mask by both causality and bucket match.
This yields exact bucket attention with GPU-friendly efficiency.
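A hedged sketch of those three checks in plain PyTorch (not the paper's fused kernel); it assumes tokens have already been sorted by bucket as above, and the helper names are mine.

```python
import torch

def tile_is_needed(q_hash_blk, k_hash_blk, q_idx_blk, k_idx_blk):
    """Decide whether one (query tile, key tile) pair must be computed at all."""
    # bucket pruning: with sorted hashes, disjoint bucket ranges cannot share a bucket
    if q_hash_blk.max() < k_hash_blk.min() or k_hash_blk.max() < q_hash_blk.min():
        return False
    # causality pruning: every key in the tile lies in the future of every query
    if q_idx_blk.max() < k_idx_blk.min():
        return False
    return True

def tile_mask(q_hash_blk, k_hash_blk, q_idx_blk, k_idx_blk):
    """Element-level mask inside a surviving tile: keep same-bucket, non-future pairs."""
    same_bucket = q_hash_blk.unsqueeze(1) == k_hash_blk.unsqueeze(0)
    causal = k_idx_blk.unsqueeze(0) <= q_idx_blk.unsqueeze(1)
    return same_bucket & causal
```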
Experiments & Results
Hash-Sparse in Benchmarks
They compared:
- Naive hash-sparse (compute full matrix, then mask).
- FlashAttention.
- SCFA Hash-sparse.
Figure 3: SCFA Hash-sparse runtime improvements over baselines.
Naive masking is slow; SCFA's cost of sorting tokens into buckets is quickly offset by the quadratic savings in attention compute. The gains grow with sequence length and with the number of buckets.
Against Reformer's LSH attention, SCFA is both faster and exact: it covers 100% of bucket collisions (same-bucket query-key pairs), whereas Reformer's approximation covers a declining fraction.
Figure 4: SCFA Hash-sparse speed and accuracy vs. Reformer.
Hash-Sparse in LM Training
On OpenWebText2, 122M-parameter Transformers trained with SCFA Hash-sparse attention (H-LM) matched or improved on the perplexity of FlashAttention baselines (F-LM), while reaching target perplexity far sooner:
Figure 6: H-LM speedups without quality loss.
- 8k tokens: 2.0× faster
- 16k tokens: 3.3× faster
QK-Sparse in Benchmarks
Figure 7: QK-sparse runtime performance.
Naive token dropping only helps at extreme sparsity (>70% dropped). SCFA achieves speedup even at modest sparsity (20–30%).
QK-Sparse in LM Training
Models dropping 30% of their queries and keys (D-LM) matched baseline perplexity with a 1.9× speedup; more aggressive dropping yielded larger gains but some quality loss.
Figure 8: Fast and competitive LM training with QK-sparse SCFA.
Conclusion
Sparse Causal Flash Attention is a robust engineering advance, marrying dynamic sparsity’s theoretical efficiency with FlashAttention’s practical speed. By extending the I/O-aware tiling to irregular masks, SCFA unlocks performance for QK-dropping, hash-based, and potentially many other dynamic sparse attention strategies.
While SCFA doesn’t alter the worst-case \(O(T^2)\) complexity, it slashes average-case compute costs, making long-context Transformers far more efficient. It’s a foundational tool for next-generation models—open-source and ready to power more adaptive, resource-efficient attention mechanisms.