The Transformer architecture is the powerhouse behind today’s AI revolution, but it has one stubborn bottleneck: the attention mechanism. As we push for larger models that can process entire books, massive codebases, or hours of video, the quadratic complexity of attention becomes a major computational obstacle. Simply put, the longer the input, the more the attention mechanism struggles—and cost skyrockets.
This scaling issue has sparked intense innovation in making attention faster and more efficient. A few years back, FlashAttention appeared as a breakthrough: by cleverly managing memory I/O on GPUs, it delivered exact attention at high speed without resorting to approximations. Its successor, FlashAttention-2, improved parallelism and load balancing—but even then, on cutting-edge NVIDIA H100 GPUs, it achieved only ~35% of the hardware’s theoretical maximum throughput.
Enter FlashAttention-3. Developed by researchers at Colfax Research, Meta, NVIDIA, Princeton, and Together AI, this new iteration rethinks the algorithm from the ground up to harness the NVIDIA Hopper GPU architecture. The result? A 1.5–2× speedup over its predecessor, near-peak GPU utilization, and accurate computation using fast low-precision FP8.
In this article, we’ll walk through the three game-changing ideas behind FlashAttention-3:
- Producer–Consumer Asynchrony: Warp-specialized software pipelining that overlaps data movement with computation.
- Overlapping GEMMs and Softmax: Hiding the latency of slow operations like `exp()` under high-throughput matrix multiplies.
- Hardware-Accelerated Low-Precision: Making FP8 both fast and accurate through smart quantization and data layout tricks.
Background: How Attention Works and What Modern GPUs Offer
Before diving into FlashAttention-3’s innovations, let’s revisit the mechanics of multi-head attention and the GPU features this work exploits.
Multi-Head Attention 101
An attention head takes three input matrices:
- Q (Query)
- K (Key)
- V (Value)
For a sequence length \( N \) and head dimension \( d \):
Score Calculation:
\[ \mathbf{S} = \alpha \mathbf{Q} \mathbf{K}^\mathsf{T} \]
where \(\alpha = 1/\sqrt{d}\).
Softmax:
\[ \mathbf{P} = \operatorname{softmax}(\mathbf{S}) \]
Value Aggregation:
\[ \mathbf{O} = \mathbf{P} \mathbf{V} \]
Figure: Forward pass formulas for standard self-attention.
During training, the backward pass computes gradients for Q, K, and V using intermediate values from the forward pass.
Figure: Backward pass formulas for self-attention.
A straightforward GPU implementation computes these steps sequentially, storing intermediate results S and P in slow global memory (HBM). This is exactly what the original FlashAttention avoided—fusing operations into a single kernel that keeps data in fast on-chip memory.
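As a baseline for everything that follows, here is a minimal NumPy sketch of that unfused computation (not a GPU kernel; sizes and names are arbitrary illustration values). It materializes the full \( N \times N \) matrices S and P, which is exactly the intermediate traffic FlashAttention removes.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Unfused attention: the N x N matrices S and P are fully materialized.
    In a kernel-per-step GPU implementation they would round-trip through HBM."""
    N, d = Q.shape
    alpha = 1.0 / np.sqrt(d)
    S = alpha * Q @ K.T                            # scores, shape (N, N)
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable row-wise softmax
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                   # output, shape (N, d)

# Example: one head, sequence length 1024, head dimension 128.
rng = np.random.default_rng(0)
N, d = 1024, 128
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(naive_attention(Q, K, V).shape)  # (1024, 128)
```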
NVIDIA Hopper GPU Capabilities
FlashAttention-3 is optimized for NVIDIA’s Hopper architecture (H100 GPU), which introduces key features:
Memory hierarchy:
- Global memory (HBM): large but slow.
- L2 cache: sits between HBM and the Streaming Multiprocessors (SMs).
- Shared memory (SMEM): fast on-chip storage inside each SM.
- Registers (RMEM): ultra-fast private storage for individual threads.
Table: NVIDIA H100 thread-memory hierarchy.
Asynchronous execution:
Hopper has specialized units:
- Tensor Cores with the `WGMMA` (Warpgroup MMA) instruction for large, asynchronous matrix multiplications.
- The Tensor Memory Accelerator (TMA) for asynchronous data transfer between HBM and SMEM.
Both can run independently of the main CUDA cores, enabling sophisticated overlaps between computation and data movement.
Warp specialization:
Inside a thread block, warps (groups of 32 threads) can be assigned roles. “Producer” warps issue TMA loads; “Consumer” warps perform WGMMA computations. This role separation helps hide memory latency and improve scheduling.
Low-precision FP8:
Hopper doubles Tensor Core throughput with FP8, but imposes strict operand layouts and requires careful quantization to keep accuracy.
FlashAttention-2 didn’t fully exploit these hardware advances—FlashAttention-3 does.
FlashAttention-3’s Three Breakthroughs
1. Producer–Consumer Asynchrony with Pingpong Scheduling
FlashAttention-3 organizes warps into:
- Producers: Load K and V tiles from HBM to a circular SMEM buffer using TMA.
- Consumers: Process Q, K, and V using WGMMA for GEMMs and CUDA cores for softmax.
While consumer warps compute \(\mathbf{Q} \mathbf{K}^\mathsf{T}\) for block \( j \), producer warps prefetch K and V for \( j+1 \). This overlapping hides load latency behind computation.
The team further improves utilization with pingpong scheduling: softmax from one warpgroup runs concurrently with GEMMs from another, keeping the Tensor Cores busy even during slower softmax operations.
Figure: Pingpong scheduling—softmax latency hidden under another group’s GEMMs.
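To make the role split concrete, here is a toy Python sketch (emphatically not the CUDA kernel): a producer thread stands in for the TMA loads, a bounded queue stands in for the circular SMEM buffer, and the consumer runs the standard streaming (online-softmax) accumulation over K/V tiles. Function names, the tile size, and the buffer depth are illustrative, and Python threads only mirror the structure of the asynchrony, not its performance.

```python
import threading
import queue
import numpy as np

def producer_consumer_attention(Q, K, V, tile=128, buffer_slots=2):
    """Toy producer/consumer split: the producer stands in for TMA loads into a
    circular SMEM buffer; the consumer stands in for warpgroups doing WGMMA +
    softmax. On the GPU both sides run truly asynchronously."""
    N, d = Q.shape
    alpha = 1.0 / np.sqrt(d)
    buf = queue.Queue(maxsize=buffer_slots)      # bounded, like a circular buffer

    def producer():
        for j in range(0, N, tile):              # prefetch K/V tiles in order
            buf.put((K[j:j + tile].copy(), V[j:j + tile].copy()))
        buf.put(None)                            # signal "no more tiles"

    threading.Thread(target=producer, daemon=True).start()

    # Consumer: streaming (online-softmax) accumulation over the K/V tiles.
    O = np.zeros_like(Q)
    row_max = np.full(N, -np.inf, dtype=Q.dtype)
    row_sum = np.zeros(N, dtype=Q.dtype)
    while (item := buf.get()) is not None:
        Kj, Vj = item
        S = alpha * Q @ Kj.T                     # GEMM 1
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)        # rescale the running accumulators
        P = np.exp(S - new_max[:, None])         # softmax numerator for this tile
        O = O * scale[:, None] + P @ Vj          # GEMM 2, accumulated
        row_sum = row_sum * scale + P.sum(axis=1)
        row_max = new_max
    return O / row_sum[:, None]
```

For a single Q block this produces the same output as the unfused reference above, up to floating-point rounding.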
2. Intra-Warpgroup GEMM–Softmax Overlap
Even within a warpgroup, standard execution leaves Tensor Cores idle during softmax. FlashAttention-3 pipelines work across iterations:
During iteration \( j \), with \(\mathbf{P}_{\text{cur}}\) (the softmaxed scores for block \( j \)) already in registers:
- Stage 1 (next iteration): Issue GEMM 1 for \( j+1 \): \(\mathbf{S}_{\text{next}} = \mathbf{Q}_i \mathbf{K}_{j+1}^\mathsf{T}\).
- Stage 2 (current iteration): Issue GEMM 2 for \( j \): \(\mathbf{O}_i \leftarrow \mathbf{O}_i + \mathbf{P}_{\text{cur}} \mathbf{V}_j\).
- Once GEMM 1 completes, compute the softmax of \(\mathbf{S}_{\text{next}}\) on the CUDA cores while GEMM 2 is still running on the Tensor Cores; the result becomes \(\mathbf{P}_{\text{cur}}\) for iteration \( j+1 \).
Figure: 2-stage pipeline—softmax for one step overlaps with GEMMs from two iterations.
This raises utilization but demands more registers to hold intermediate states—a trade-off between pipeline depth and tile size.
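The reordered loop can be written out explicitly. The NumPy sketch below is numerically equivalent to the streaming loop shown earlier but restructured into the 2-stage order; everything here executes sequentially, and the comments mark which statements the real kernel issues as asynchronous WGMMAs. Names, the prologue/steady-state split, and the tile size are illustrative.

```python
import numpy as np

def two_stage_pipeline(Q, K, V, tile=128):
    """Sequential NumPy rendering of the 2-stage GEMM-softmax loop for one Q block.
    The numerics are identical to plain streaming attention; only the order of
    statements reflects the pipelined schedule."""
    N, d = Q.shape
    alpha = 1.0 / np.sqrt(d)
    Ks = [K[j:j + tile] for j in range(0, N, tile)]
    Vs = [V[j:j + tile] for j in range(0, N, tile)]

    O = np.zeros_like(Q)
    row_max = np.full(N, -np.inf, dtype=Q.dtype)
    row_sum = np.zeros(N, dtype=Q.dtype)

    def softmax_step(S):
        # Online softmax: unnormalized probabilities for this tile plus the
        # factor that rescales the running accumulators O and row_sum.
        nonlocal row_max, row_sum
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * scale + P.sum(axis=1)
        row_max = new_max
        return P, scale

    # Prologue: GEMM 1 and softmax for tile 0.
    P_cur, scale = softmax_step(alpha * Q @ Ks[0].T)

    for j in range(len(Ks)):
        if j + 1 < len(Ks):
            S_next = alpha * Q @ Ks[j + 1].T       # Stage 1: GEMM 1 for tile j+1 (async WGMMA)
        O = O * scale[:, None] + P_cur @ Vs[j]     # Stage 2: GEMM 2 for tile j (async WGMMA)
        if j + 1 < len(Ks):
            # On Hopper this softmax runs on CUDA cores while GEMM 2 still
            # occupies the Tensor Cores; its result feeds the next iteration.
            P_cur, scale = softmax_step(S_next)

    return O / row_sum[:, None]
```

Keeping both \(\mathbf{S}_{\text{next}}\) and \(\mathbf{P}_{\text{cur}}\) live at the same time is exactly the extra register pressure the trade-off above refers to.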
3. FP8 Done Right: Fast and Accurate
Efficiency: Handling Layout Constraints
FP8 Tensor Cores require V in “k-major” layout for GEMM 2, but inputs are usually “mn-major.” FlashAttention-3 performs an in-kernel transpose during tile load: producer warps use `LDSM`/`STSM` instructions to load and store SMEM in a transposed arrangement, avoiding costly global-memory operations.
A second layout mismatch exists between an FP8 WGMMA’s FP32 accumulator output and the FP8 operand format for the next WGMMA.
Figure: FP32 accumulator register layout after FP8 WGMMA.
Figure: Required FP8 operand layout for the next WGMMA.
Byte-permute instructions “swizzle” accumulator data to match operand layout, enabling back-to-back FP8 GEMMs.
Accuracy: Block Quantization & Incoherent Processing
FP8’s limited precision makes quantization error—especially from “outlier” activations—problematic.
FlashAttention-3 uses:
- Block Quantization: One scale per processed tile of Q, K, V, adapting locally to value ranges.
- Incoherent Processing: Multiplying Q and K by a random orthogonal matrix (e.g., a fast Hadamard transform) before quantization spreads outliers across dimensions, so each block needs a much smaller scale and quantizes with far less error; both ideas are sketched below.
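Both ideas are easy to sanity-check in NumPy. The sketch below uses the FP8 e4m3 maximum of 448 and a Sylvester-style random Hadamard construction as illustrative stand-ins (not the FlashAttention-3 kernel code): it verifies that the rotation leaves \(\mathbf{Q}\mathbf{K}^\mathsf{T}\) unchanged and shows how it shrinks the outlier-dominated block scale.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite e4m3 value; real kernels divide by the
                      # scale, cast to FP8, and keep the scale to undo it later

def block_scales(x, block=64):
    """Block quantization: one scale per block of rows instead of one per tensor,
    so an outlier only inflates the scale of its own block."""
    return np.array([np.abs(x[i:i + block]).max() / FP8_E4M3_MAX
                     for i in range(0, x.shape[0], block)])

def random_hadamard(d, rng):
    """Random orthogonal M = H_d * diag(+-1) / sqrt(d), d a power of two.
    Since M @ M.T = I, (Q @ M) @ (K @ M).T equals Q @ K.T exactly."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])      # Sylvester construction
    return (H * rng.choice([-1.0, 1.0], size=d)) / np.sqrt(d)

rng = np.random.default_rng(0)
N, d = 256, 128
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
Q[7, 3] = 100.0                              # one outlier activation

M = random_hadamard(d, rng)
Qr, Kr = Q @ M, K @ M                        # incoherent processing

print(np.allclose(Qr @ Kr.T, Q @ K.T))       # True: the rotation cancels inside the score GEMM
print(np.abs(Q).max() / np.sqrt((Q ** 2).mean()),
      np.abs(Qr).max() / np.sqrt((Qr ** 2).mean()))   # max/RMS ratio drops sharply after rotation
print(block_scales(Q).round(4))              # only the outlier's block needs a huge scale
print(block_scales(Qr).round(4))             # after rotation the scales are far less extreme
```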
Experimental Results
Speed
Benchmarks on H100 GPUs show:
- FP16: 1.5–2× faster forward pass than FlashAttention-2; up to 740 TFLOPs/s (~75% of H100 peak); backward pass 1.5–1.75× faster.
- Beats original FlashAttention, Triton, and even cuDNN at long sequence lengths.
Figure: FP16 forward pass speed—FlashAttention-3 leads across lengths.
Figure: FP16 backward pass speed—significant gains over baselines.
FP8: Forward pass approaches 1.2 PFLOPs/s, competitive with cuDNN and faster at long sequences.
Figure: FP8 forward pass speed—near-peak throughput.
Ablation Study
Disabling pipelining or warp specialization slows performance:
Table: Both pipelining and warp-specialization are critical for top speed.
Accuracy
Outlier-heavy synthetic data is generated from:
\[ \mathcal{N}(0,1) + \mathcal{N}(0,100) \cdot \mathrm{Bernoulli}(0.001) \]
Equation: Mixed Gaussian + rare large values to simulate LLM outliers.
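For reference, data following this recipe can be generated directly (reading \(\mathcal{N}(0,100)\) as variance 100, i.e., standard deviation 10); the shapes and seed below are arbitrary illustration values, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def outlier_tensor(shape, rng):
    """Entries ~ N(0,1), plus an extra N(0,100) term (std 10) on a random
    0.1% of positions, mimicking outlier features in LLM activations."""
    base = rng.standard_normal(shape)
    spikes = 10.0 * rng.standard_normal(shape)   # N(0, 100): variance 100
    mask = rng.random(shape) < 0.001             # Bernoulli(0.001)
    return base + spikes * mask

N, d = 2048, 64
Q, K, V = (outlier_tensor((N, d), rng) for _ in range(3))
# One would then compare low-precision attention outputs on (Q, K, V)
# against a high-precision reference to measure numerical error.
```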
Results:
- FP16: FlashAttention-3 matches FlashAttention-2; both are 1.7× more accurate than standard attention (thanks to FP32 softmax).
- FP8: FlashAttention-3 is 2.6× more accurate than per-tensor FP8 baseline.
Table: Numerical error—FlashAttention-3 delivers high FP8 accuracy.
Conclusion: Lessons from FlashAttention-3
FlashAttention-3 exemplifies hardware-aware algorithm design:
- Asynchrony drives utilization: Actively managing producer–consumer roles and pipelined execution maximizes Tensor Core work.
- FP8 can be both fast and precise: Block quantization and incoherent processing tame outliers without losing speed gains.
- Even mature kernels can improve: A 1.5–2× speedup for attention means faster training, quicker inference, and room for larger contexts.
By marrying deep knowledge of GPU microarchitecture with algorithmic redesign, FlashAttention-3 pushes attention—already highly optimized—even closer to the limits of modern hardware. This advance will benefit countless Transformer models and inspire future hardware–software co-design. The path forward for AI is not just bigger models—it’s smarter ways to run them.