Large Language Models (LLMs) are rapidly expanding their horizons, now capable of processing context windows of a million tokens or more. This unlocks incredible applications — from understanding entire code repositories, to answering nuanced questions about lengthy legal documents, to reasoning across sprawling datasets.

But with great context comes great computational cost.

Consider feeding a 1M-token prompt to a state-of-the-art LLM. Even on a powerful Nvidia A100 GPU, you might have to wait 30 minutes before the model produces the first output token. This initial delay occurs during the pre-filling stage — the process of ingesting the prompt, computing attention over every token, and setting up the key-value (KV) cache for subsequent decoding. The main culprit? The Transformer’s self-attention, whose computation scales quadratically with input length.

Now imagine cutting that wait time from 30 minutes down to just 3 — without sacrificing accuracy or retraining the model. That’s exactly what a recent paper from Microsoft researchers delivers with MInference: a dynamic sparse attention technique that leverages hidden structure in attention patterns to achieve up to 10× speedup in pre-filling.

A chart showing MInference’s 10x latency speedup over FlashAttention-2 and a heatmap demonstrating its high accuracy on the Needle In A Haystack test.

Figure 1: MInference achieves up to a 10× speedup for 1M-token contexts on a single A100 GPU (b) while matching or exceeding full attention accuracy on retrieval-intensive tasks like Needle In A Haystack (a).

In this article, we’ll break down the challenges of long-context inference, the key insights behind MInference, and the impressive experimental results that make it a breakthrough for long-context LLMs.


The Problem: The Long Wait for the First Token

LLM inference involves two stages:

  1. Pre-filling – Process the entire prompt in parallel to compute the KV cache.
  2. Decoding – Autoregressively generate tokens using the cached keys and values.

For short prompts, decoding dominates runtime. But for million-token contexts, pre-filling becomes the bottleneck.

Self-attention requires computing an \(N \times N\) matrix of pairwise token interactions, yielding \(O(N^2)\) cost. At \(N=1{,}000{,}000\), this is intractable without optimization.
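
To make that cost concrete, here is a minimal PyTorch sketch of pre-filling for a single attention head (illustrative shapes and names, not any production code path). The \(N \times N\) score matrix on the marked line is precisely the term that becomes intractable at million-token lengths; decoding then reuses the returned KV cache.

```python
import torch

def prefill_full_attention(q, k, v):
    """Naive pre-filling for one attention head.

    q, k, v: [N, d] tensors covering the whole prompt.
    Materializes the full N x N score matrix, the O(N^2) cost
    that dominates long-context pre-filling.
    """
    n, d = q.shape
    scores = q @ k.T / d**0.5                       # [N, N]  <-- quadratic in N
    causal = torch.tril(torch.ones(n, n)).bool()
    scores = scores.masked_fill(~causal, float("-inf"))
    out = torch.softmax(scores, dim=-1) @ v         # [N, d]
    return out, (k, v)                              # (k, v) is reused during decoding

# At N = 4,096 the score matrix already holds ~17M entries;
# at N = 1,000,000 it would hold 10^12.
q = k = v = torch.randn(4_096, 64)
out, kv_cache = prefill_full_attention(q, k, v)
```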

A line chart showing that the latency of the Attention component skyrockets with context window size, while the FFN component remains relatively flat.

Figure 2: (a) Breakdown shows attention dominates pre-filling latency. (b) For a 128k context, retaining only the top 4,096 columns (~3%) preserves 96.4% of attention scores — evidence of sparsity. (c) Reusing these column indices for another context drops recall to 83.7%, underscoring that sparsity is dynamic.


Key Insight: Attention is Sparse and Dynamic

Analysis of attention matrices across long-context prompts confirms two properties:

  • Sparse: Each token meaningfully attends to only a fraction of other tokens.
  • Dynamic: Sparse patterns shift dramatically depending on the prompt.

In a 128k context, the top 4k columns recall almost all attention mass. But reusing those indices for a different prompt fails. Sparsity can’t be exploited with a fixed mask — it must be predicted on the fly, efficiently.
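
As a back-of-the-envelope illustration (a synthetic toy, not the paper's measurement protocol), the sketch below checks how much attention mass the top-\(k\) key columns of a given attention matrix recall, and why those indices cannot be reused across prompts.

```python
import torch

def topk_column_recall(attn, k):
    """Fraction of total attention mass captured by the k key columns
    with the largest summed attention over all queries."""
    col_mass = attn.sum(dim=0)                   # mass received by each key column
    top_idx = col_mass.topk(k).indices           # dynamic: depends on this prompt
    return (col_mass[top_idx].sum() / col_mass.sum()).item(), top_idx

# Synthetic example: a handful of key positions dominate, as observed in practice.
scores = torch.randn(1024, 1024)
scores[:, torch.randperm(1024)[:32]] += 6.0      # boost 32 "important" columns
attn = torch.softmax(scores, dim=-1)
recall, idx = topk_column_recall(attn, k=32)
print(f"top-32 columns recall {recall:.1%} of attention mass")
# Reusing `idx` on a different prompt's attention matrix gives much lower recall,
# which is why a fixed sparse mask is not enough.
```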


The Three Structural Patterns of Sparsity

The MInference team discovered that dynamic sparsity isn’t random — it organizes into a small set of recurring geometric patterns in attention matrices.

Visualizations of the three core attention patterns identified by MInference: A-shape, Vertical-Slash, and Block-Sparse.

Figure 3: (a) Attention matrix visualizations reveal three general patterns. (b) Distances to nearest non-zero neighbors confirm spatial clustering. (c) These patterns deliver higher recall per FLOP on GPUs compared to generic Top-K sparsity.

The three patterns:

  1. A-shape
    Static structure: strong attention on early tokens (global context) and the recent window (local context). Effective for capturing both foundational and local cues.

  2. Vertical-Slash (VS)
    Dynamic structure: vertical lines (specific important tokens anywhere in the sequence) combined with diagonal “slash” lines (periodic relative positions). Both vary with prompt content.

  3. Block-Sparse
    Highly dynamic yet clustered: important tokens appear in contiguous blocks. Despite scattered positions, spatial clustering makes block-based computation efficient.

Identifying these high-level geometries transforms the problem from “find important tokens among a million” to “locate the lines or blocks for this head and prompt.”
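
To make the geometries concrete, here is a hedged sketch (assumed parameter names and sizes, not the released MInference code) of boolean attention masks for the three patterns:

```python
import torch

def a_shape_mask(n, first=64, local=512):
    """Attend to the first `first` tokens plus a recent local window."""
    i = torch.arange(n).unsqueeze(1)        # query index
    j = torch.arange(n).unsqueeze(0)        # key index
    causal = j <= i
    return causal & ((j < first) | (i - j < local))

def vertical_slash_mask(n, vertical_idx, slash_offsets):
    """Vertical lines at chosen key positions plus diagonal 'slash' lines
    at chosen relative offsets (both prompt-dependent)."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    vertical = torch.zeros(n, dtype=torch.bool)
    vertical[vertical_idx] = True
    slash = torch.zeros(n, n, dtype=torch.bool)
    for off in slash_offsets:               # off = i - j
        slash |= (i - j) == off
    return (j <= i) & (vertical.unsqueeze(0) | slash)

def block_sparse_mask(n, block, block_pairs):
    """Dense computation only inside selected (query-block, key-block) pairs."""
    mask = torch.zeros(n, n, dtype=torch.bool)
    for qb, kb in block_pairs:
        mask[qb*block:(qb+1)*block, kb*block:(kb+1)*block] = True
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return mask & (j <= i)

# Example: a Vertical-Slash mask with three vertical lines and three diagonals.
mask = vertical_slash_mask(4096, vertical_idx=[0, 17, 905], slash_offsets=[0, 1, 128])
```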


How MInference Works

MInference achieves acceleration through a three-stage pipeline:

A schematic showing the three sparse methods in MInference, from the more static A-shape to the dynamic Vertical-Slash and Block-Sparse patterns.

Figure 4: The three sparse attention methods used in MInference — arranged from static (A-shape) to increasingly dynamic (VS, Block-Sparse).

1. Offline: Kernel-Aware Pattern Assignment

Each attention head is analyzed once, offline. A search across all three patterns and configurations determines which yields the highest recall of full-attention results per unit GPU cost.
Importantly, this search is kernel-aware: candidates are scored using measured GPU-kernel runtime rather than theoretical FLOP counts alone.
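
In spirit, the offline step is a small per-head search. The sketch below is a hedged illustration in which `recall_fn` and `cost_fn` are assumed callables: the first replays calibration prompts and measures how well a candidate sparse configuration recovers the head's full-attention output, and the second times the corresponding GPU kernel, which is what makes the search kernel-aware.

```python
def assign_pattern(head, candidates, calib_prompts, recall_fn, cost_fn):
    """Pick, for one attention head, the (pattern, config) pair that gives
    the best recall of full attention per unit of measured kernel time.

    candidates: e.g. [("a_shape", cfg), ("vertical_slash", cfg), ("block_sparse", cfg)]
    The result is computed once offline, cached, and reused for every future prompt.
    """
    best, best_score = None, float("-inf")
    for pattern, config in candidates:
        recall = recall_fn(head, pattern, config, calib_prompts)  # accuracy proxy
        cost = cost_fn(pattern, config)                           # measured on-device
        score = recall / cost
        if score > best_score:
            best, best_score = (pattern, config), score
    return best
```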

2. Online: Dynamic Index Building

At inference, MInference performs ultra-light approximations to locate the dynamic parts of VS and Block-Sparse heads:

  • Vertical-Slash: Multiply only the last 64 query vectors against all keys. This partial attention map identifies top-\(k\) vertical and slash indices cheaply.
  • Block-Sparse: Mean-pool Q and K into blocks of 64, then compute a small block-level attention matrix to select top-\(k\) blocks.
  • A-shape: No approximation needed — fixed window.

Approximation overhead is only 5–20% of total computation.
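
The following PyTorch sketch illustrates the two approximations for a single head (shapes, pooling size, and top-\(k\) values are assumptions for illustration; the real implementation fuses these estimates into the sparse kernels):

```python
import torch

def vertical_slash_indices(q, k, last_q=64, top_vertical=1000, top_slash=200):
    """Estimate vertical and slash indices from a partial attention map
    built with only the last `last_q` queries against all keys (n > last_q)."""
    n, d = k.shape
    qi = torch.arange(n - last_q, n).unsqueeze(1)          # absolute query positions
    kj = torch.arange(n).unsqueeze(0)                      # key positions
    scores = q[-last_q:] @ k.T / d**0.5                    # [last_q, n], cheap
    part = torch.softmax(scores.masked_fill(kj > qi, float("-inf")), dim=-1)
    vertical = part.sum(dim=0).topk(min(top_vertical, n)).indices
    # Score each diagonal (relative offset i - j) by summing the partial map along it.
    slash_score = torch.zeros(n).index_add_(0, (qi - kj).clamp(min=0).flatten(),
                                            part.flatten())
    slash = slash_score.topk(min(top_slash, n)).indices    # offsets of the top slashes
    return vertical, slash

def block_sparse_blocks(q, k, block=64, top_k=100):
    """Mean-pool Q and K into blocks, then pick the top-k
    (query-block, key-block) pairs from a tiny block-level attention map.
    Assumes the sequence length is a multiple of `block`."""
    d = q.shape[-1]
    qb = q.view(-1, block, d).mean(dim=1)                  # [n/block, d]
    kb = k.view(-1, block, d).mean(dim=1)
    block_attn = torch.softmax(qb @ kb.T / d**0.5, dim=-1)
    nb = kb.shape[0]
    flat = block_attn.flatten().topk(min(top_k, nb * nb)).indices
    return torch.stack((flat // nb, flat % nb), dim=1)     # (query-block, key-block)
```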

3. Sparse Attention via Custom Kernels

Sparse masks are passed to custom Triton- and FlashAttention-based kernels optimized to skip irrelevant attention regions — computing only the selected lines or blocks.
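
The production kernels are written in Triton on top of FlashAttention-style tiling. As a purely pedagogical stand-in, the dense reference below shows what they compute; a real kernel skips the masked regions entirely instead of materializing the full score matrix.

```python
import torch

def masked_attention(q, k, v, sparse_mask):
    """Reference (non-accelerated) sparse attention for one head.

    sparse_mask: [N, N] boolean mask produced by the head's assigned pattern
    (A-shape, Vertical-Slash, or Block-Sparse), already causal. Assumes every
    query attends to at least one key (e.g. the diagonal) so softmax is defined.
    """
    d = q.shape[-1]
    scores = q @ k.T / d**0.5
    scores = scores.masked_fill(~sparse_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```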


Experimental Results

Accuracy Maintained — or Improved

Across benchmarks, MInference matches or slightly outperforms full attention.

InfiniteBench (avg. 214k tokens):
MInference leads all methods, retaining retrieval accuracy where StreamingLLM variants collapse.

Table showing MInference’s performance on the InfiniteBench benchmark compared to baselines.

Table 2: MInference matches or exceeds full-attention accuracy with LLaMA-3, far outperforming fixed sparse baselines.

RULER:
MInference extends effective context windows for models like LLaMA-3-8B and GLM-4-9B — outperforming full attention at longer lengths.

Table showing MInference’s performance on the RULER benchmark.

Table 3: In RULER, MInference preserves long-context QA and multi-hop performance beyond 32k–64k tokens.

Needle In A Haystack:
Unlike fixed-window sparse methods, MInference retrieves needles anywhere in a million-token haystack.

A heatmap showing StreamingLLM failing the Needle in a Haystack test when the needle is placed outside its attention window.

Figure 6: StreamingLLM fails when the needle is outside its static window. MInference retains coverage across the entire context.


Ablation Studies: The Power of Three

Using only one pattern type or static indices reduces performance drastically, especially on dynamic retrieval.

Table from the ablation study showing that removing any of the three patterns hurts performance.

Table 4: Removing patterns or using static masks hurts accuracy, proving the need for the combined dynamic approach.


Speedup: 10× at 1M Tokens

Latency gains scale with context size:

  • 100k tokens: 1.8× faster
  • 500k tokens: 6.8× faster
  • 1M tokens: 10× faster — 30 minutes down to 3 minutes.

Because the kernels are written in Triton, the method ports readily to other GPUs such as the H100 or MI300X.


Conclusion: Breaking the Bottleneck in Long-Context LLMs

Self-attention’s quadratic cost has long been the Achilles’ heel of long-context LLMs.
MInference offers an elegant fix for the pre-filling stage:

  • Structural insight: Dynamic sparsity organizes into A-shape, Vertical-Slash, Block-Sparse.
  • Efficient design: Offline pattern assignment + lightweight online approximation.
  • Massive gains: Up to 10× speedup while preserving full-attention accuracy.

By dramatically cutting pre-filling latency — with zero retraining — MInference makes million-token interactions not only feasible, but fast. As LLMs push into multi-million-token territory, techniques like this will be crucial for unlocking their real-world potential.