Large Language Models (LLMs) are rapidly expanding their horizons, now capable of processing context windows of a million tokens or more. This unlocks incredible applications — from understanding entire code repositories, to answering nuanced questions about lengthy legal documents, to reasoning across sprawling datasets.
But with great context comes great computational cost.
Consider feeding a 1M-token prompt to a state-of-the-art LLM. Even on a powerful Nvidia A100 GPU, you might have to wait 30 minutes before the model produces the first output token. This initial delay occurs during the pre-filling stage — the process of ingesting the prompt, computing attention over every token, and setting up the key-value (KV) cache for subsequent decoding. The main culprit? The Transformer’s self-attention, whose computation scales quadratically with input length.
Now imagine cutting that wait time from 30 minutes down to just 3 — without sacrificing accuracy or retraining the model. That’s exactly what a recent paper from Microsoft researchers delivers with MInference: a dynamic sparse attention technique that leverages hidden structure in attention patterns to achieve up to 10× speedup in pre-filling.
Figure 1: MInference achieves up to a 10× speedup for 1M-token contexts on a single A100 GPU (b) while matching or exceeding full attention accuracy on retrieval-intensive tasks like Needle In A Haystack (a).
In this article, we’ll break down the challenges of long-context inference, the key insights behind MInference, and the impressive experimental results that make it a breakthrough for long-context LLMs.
The Problem: The Long Wait for the First Token
LLM inference involves two stages:
- Pre-filling – Process the entire prompt in parallel to compute the KV cache.
- Decoding – Autoregressively generate tokens using the cached keys and values.
For short prompts, decoding dominates runtime. But for million-token contexts, pre-filling becomes the bottleneck.
Self-attention requires computing an \(N \times N\) matrix of pairwise token interactions, yielding \(O(N^2)\) cost. At \(N=1{,}000{,}000\), this is intractable without optimization.
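To get a feel for how fast that \(O(N^2)\) term grows, here is a quick back-of-envelope calculation in Python. The layer, head, and head-dimension counts are illustrative assumptions (roughly LLaMA-scale), not figures from the paper:

```python
# Back-of-envelope: attention FLOPs during pre-filling.
# Per head, the two N x N matrix products (Q @ K^T and scores @ V) each cost
# roughly 2 * N^2 * d_head floating-point operations.

def attention_flops(n_tokens, n_layers=32, n_heads=32, d_head=128):
    per_head = 2 * (2 * n_tokens ** 2 * d_head)
    return n_layers * n_heads * per_head

base = attention_flops(8_000)
for n in (8_000, 128_000, 1_000_000):
    f = attention_flops(n)
    print(f"{n:>9,} tokens: {f:.2e} attention FLOPs ({f / base:,.0f}x the 8k prompt)")
```

Going from an 8k prompt to a 1M prompt multiplies the attention work by roughly 15,000×, while every linear-cost component grows only 125×.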
Figure 2: (a) Breakdown shows attention dominates pre-filling latency. (b) For a 128k context, retaining only the top 4,096 columns (~3%) preserves 96.4% of attention scores — evidence of sparsity. (c) Reusing these column indices for another context drops recall to 83.7%, underscoring that sparsity is dynamic.
Key Insight: Attention is Sparse and Dynamic
Analysis of attention matrices in long-context prompts confirms two properties:
- Sparse: Each token meaningfully attends to only a fraction of other tokens.
- Dynamic: Sparse patterns shift dramatically depending on the prompt.
In a 128k context, the top 4k columns recall almost all attention mass. But reusing those indices for a different prompt fails. Sparsity can’t be exploited with a fixed mask — it must be predicted on the fly, efficiently.
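You can reproduce the flavor of this measurement on a toy scale. The sketch below plants a few "important" keys at random in synthetic attention matrices (a stand-in for whatever a real prompt makes salient); the 96.4% and 83.7% figures above come from the paper's analysis of real model attention, not from this toy:

```python
import torch

torch.manual_seed(0)
n, d, k_top = 2048, 64, 64     # toy sizes; the paper's analysis uses 128k contexts

def toy_attention(important_keys):
    """Random Q/K, with a handful of keys boosted so they dominate attention."""
    q, k = torch.randn(n, d), torch.randn(n, d)
    k[important_keys] *= 5.0   # stand-in for genuinely salient tokens in a real prompt
    return torch.softmax(q @ k.T / d ** 0.5, dim=-1)

def column_recall(attn, cols):
    """Share of total attention mass that lands in the chosen key columns."""
    return (attn[:, cols].sum() / attn.sum()).item()

hot1, hot2 = torch.randperm(n)[:k_top], torch.randperm(n)[:k_top]
attn1, attn2 = toy_attention(hot1), toy_attention(hot2)

cols = attn1.sum(dim=0).topk(k_top).indices   # prompt 1's strongest key columns
print(f"prompt 1, its own top-{k_top} columns: {column_recall(attn1, cols):.1%}")
print(f"prompt 2, reusing prompt 1's columns:  {column_recall(attn2, cols):.1%}")
```

The first number is high and the second collapses, mirroring the sparse-but-dynamic behavior the paper observes.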
The Three Structural Patterns of Sparsity
The MInference team discovered that dynamic sparsity isn’t random — it organizes into a small set of recurring geometric patterns in attention matrices.
Figure 3: (a) Attention matrix visualizations reveal three general patterns. (b) Distances to nearest non-zero neighbors confirm spatial clustering. (c) These patterns deliver higher recall per FLOP on GPUs compared to generic Top-K sparsity.
The three patterns:
A-shape
Static structure: strong attention on early tokens (global context) and the recent window (local context). Effective for foundational + local cues.

Vertical-Slash (VS)
Dynamic structure: vertical lines (specific important tokens anywhere in the sequence) combined with diagonal “slash” lines (periodic relative positions). Both vary with prompt content.

Block-Sparse
Highly dynamic yet clustered: important tokens appear in contiguous blocks. Despite scattered positions, spatial clustering makes block-based computation efficient.
Identifying these high-level geometries transforms the problem from “find important tokens among a million” to “locate the lines or blocks for this head and prompt.”
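To make the geometry concrete, here is one way the three mask shapes could be materialized as boolean matrices. This is purely illustrative: the real kernels never build an \(N \times N\) mask, and the `sink`, `window`, `verticals`, `slashes`, and `active_blocks` parameters are placeholders for whatever the offline and online steps select:

```python
import torch

def a_shape_mask(n, sink=64, window=512):
    """Early 'sink' tokens plus a local causal window."""
    i = torch.arange(n)
    rel = i[:, None] - i[None, :]          # query position minus key position
    return (rel >= 0) & ((i[None, :] < sink) | (rel < window))

def vertical_slash_mask(n, verticals, slashes):
    """Selected key columns (verticals) plus selected diagonal offsets (slashes)."""
    i = torch.arange(n)
    rel = i[:, None] - i[None, :]
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, verticals] = True
    for offset in slashes:
        mask |= rel == offset
    return mask & (rel >= 0)               # keep it causal

def block_sparse_mask(n, block, active_blocks):
    """Selected (query-block, key-block) pairs."""
    i = torch.arange(n)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for qb, kb in active_blocks:
        mask[qb * block:(qb + 1) * block, kb * block:(kb + 1) * block] = True
    return mask & (i[:, None] >= i[None, :])

n = 4096
for name, m in [
    ("A-shape", a_shape_mask(n)),
    ("Vertical-Slash", vertical_slash_mask(n, verticals=[0, 1000, 2000], slashes=[0, 128])),
    ("Block-Sparse", block_sparse_mask(n, 64, [(10, 3), (20, 20), (40, 7)])),
]:
    print(f"{name}: {m.float().mean().item():.3%} of positions kept")
```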
How MInference Works
MInference achieves acceleration through a three-stage pipeline:
Figure 4: The three sparse attention methods used in MInference — arranged from static (A-shape) to increasingly dynamic (VS, Block-Sparse).
1. Offline: Kernel-Aware Pattern Assignment
Each attention head is analyzed once, offline. A search across all three patterns and configurations determines which yields the highest recall of full-attention results per unit GPU cost.
Importantly, this search is kernel-aware: candidates are scored by the FLOPs the GPU kernels actually execute, not by theoretical sparsity alone.
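In spirit, the per-head search looks something like the sketch below. It scores two toy candidate masks by recall divided by a simple entry-count proxy for cost; the actual method scores against the cost of its real GPU kernels and searches a richer configuration space:

```python
import torch

torch.manual_seed(0)
n, d = 1024, 64
q, k = torch.randn(n, d), torch.randn(n, d)
attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)   # full attention on calibration input

def recall(mask):
    """Fraction of full-attention mass the sparse mask retains."""
    return (attn * mask).sum().item() / attn.sum().item()

def cost(mask):
    """Proxy cost: number of computed entries. MInference instead uses the cost of the
    real GPU kernel implementing each pattern, which is what 'kernel-aware' refers to."""
    return mask.sum().item()

i = torch.arange(n)
causal = i[:, None] >= i[None, :]

# Candidate 1: A-shape with a 32-token sink and a 128-token local window.
a_shape = causal & ((i[None, :] < 32) | (i[:, None] - i[None, :] < 128))

# Candidate 2: vertical lines on the 128 strongest key columns (plus the diagonal
# so no query row ends up empty).
vertical = torch.zeros(n, n, dtype=torch.bool)
vertical[:, attn.sum(dim=0).topk(128).indices] = True
vertical = (vertical | torch.eye(n, dtype=torch.bool)) & causal

candidates = {"A-shape": a_shape, "Vertical": vertical}
best = max(candidates, key=lambda name: recall(candidates[name]) / cost(candidates[name]))
print({name: round(recall(m), 3) for name, m in candidates.items()}, "->", best)
```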
2. Online: Dynamic Index Building
At inference time, MInference performs ultra-light approximations to locate the dynamic parts of VS and Block-Sparse heads (a simplified sketch follows this list):
- Vertical-Slash: Multiply only the last 64 query vectors against all keys. This partial attention map identifies top-\(k\) vertical and slash indices cheaply.
- Block-Sparse: Mean-pool Q and K into blocks of 64, then compute a small block-level attention matrix to select top-\(k\) blocks.
- A-shape: No approximation needed — fixed window.
Approximation overhead is only 5–20% of total computation.
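A simplified PyTorch version of the two online estimators is sketched below. The `last_q = 64` and block size 64 constants follow the description above; the function names, top-k budgets, and other details are illustrative assumptions rather than the paper's code:

```python
import torch

def vertical_slash_indices(q, k, last_q=64, top_v=128, top_s=32):
    """Estimate strong key columns (verticals) and strong relative offsets (slashes)
    from the attention of only the last `last_q` queries against all keys."""
    n, d = q.shape
    qpos = torch.arange(n - last_q, n).unsqueeze(1)        # absolute query positions
    kpos = torch.arange(n).unsqueeze(0)
    offsets = qpos - kpos                                  # query position minus key position
    logits = q[-last_q:] @ k.T / d ** 0.5
    probs = torch.softmax(logits.masked_fill(offsets < 0, float("-inf")), dim=-1)

    verticals = probs.sum(dim=0).topk(top_v).indices       # heaviest key columns
    slash_mass = torch.zeros(n).scatter_add_(              # aggregate mass per offset
        0, offsets.clamp(min=0).reshape(-1), probs.reshape(-1))
    slashes = slash_mass.topk(top_s).indices               # heaviest diagonals
    return verticals, slashes

def block_sparse_blocks(q, k, block=64, top_blocks=256):
    """Mean-pool Q and K into blocks of `block` tokens, then keep the highest-scoring
    (query-block, key-block) pairs."""
    n, d = q.shape
    qb = q.reshape(n // block, block, d).mean(dim=1)
    kb = k.reshape(n // block, block, d).mean(dim=1)
    block_attn = torch.softmax(qb @ kb.T / d ** 0.5, dim=-1)
    flat = block_attn.flatten().topk(top_blocks).indices
    q_blocks = torch.div(flat, kb.shape[0], rounding_mode="floor")
    return torch.stack((q_blocks, flat % kb.shape[0]), dim=1)

n, d = 8192, 64
q, k = torch.randn(n, d), torch.randn(n, d)
verticals, slashes = vertical_slash_indices(q, k)
blocks = block_sparse_blocks(q, k)
print(verticals.shape, slashes.shape, blocks.shape)
```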
3. Sparse Attention via Custom Kernels
Sparse masks are passed to custom Triton- and FlashAttention-based kernels optimized to skip irrelevant attention regions — computing only the selected lines or blocks.
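The speed comes from kernels that never materialize the full attention matrix, but their semantics are easy to state in dense PyTorch: softmax restricted to the selected positions. The reference below is useful only for checking correctness on small inputs (the mask would come from builders like those sketched earlier):

```python
import torch

def sparse_attention_reference(q, k, v, mask):
    """Dense reference for the sparse kernels' semantics: softmax over only the
    positions where `mask` is True. The real Triton/FlashAttention-style kernels
    produce the same output while skipping the masked regions entirely."""
    d = q.shape[-1]
    scores = (q @ k.T / d ** 0.5).masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Sanity check: with a plain causal mask this reduces to ordinary causal attention.
n, d = 256, 32
q, k, v = (torch.randn(n, d) for _ in range(3))
i = torch.arange(n)
causal = i[:, None] >= i[None, :]
print(sparse_attention_reference(q, k, v, causal).shape)   # torch.Size([256, 32])
```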
Experimental Results
Accuracy Maintained — or Improved
Across benchmarks, MInference matches or slightly outperforms full attention.
InfiniteBench (avg. 214k tokens):
MInference leads all methods, retaining retrieval accuracy where StreamingLLM variants collapse.
Table 2: MInference matches/exceeds LLaMA-3 full attention accuracy, far outperforming fixed sparse baselines.
RULER:
MInference extends effective context windows for models like LLaMA-3-8B and GLM-4-9B — outperforming full attention at longer lengths.
Table 3: In RULER, MInference preserves long-context QA and multi-hop performance beyond 32k–64k tokens.
Needle In A Haystack:
Unlike fixed-window sparse methods, MInference retrieves needles anywhere in a million-token haystack.
Figure 6: StreamingLLM fails when the needle is outside its static window. MInference retains coverage across the entire context.
Ablation Studies: The Power of Three
Using only one pattern type or static indices reduces performance drastically, especially on dynamic retrieval.
Table 4: Removing patterns or using static masks hurts accuracy, proving the need for the combined dynamic approach.
Speedup: 10× at 1M Tokens
Latency gains scale with context size:
- 100k tokens: 1.8× faster
- 500k tokens: 6.8× faster
- 1M tokens: 10× faster — 30 minutes down to 3 minutes.
Since the kernels are written in Triton, the method ports easily to other GPUs such as the H100 or MI300X.
Conclusion: Breaking the Bottleneck in Long-Context LLMs
Self-attention’s quadratic cost has long been the Achilles’ heel of long-context LLMs.
MInference offers an elegant fix for the pre-filling stage:
- Structural insight: Dynamic sparsity organizes into A-shape, Vertical-Slash, Block-Sparse.
- Efficient design: Offline pattern assignment + lightweight online approximation.
- Massive gains: Up to 10× pre-filling speedup with no loss in accuracy.
By dramatically cutting pre-filling latency — with zero retraining — MInference makes million-token interactions not only feasible, but fast. As LLMs push into multi-million-token territory, techniques like this will be crucial for unlocking their real-world potential.