The Transformer architecture is the powerhouse behind today’s AI revolution, but it has one stubborn bottleneck: the attention mechanism. As we push for larger models that can process entire books, massive codebases, or hours of video, the quadratic complexity of attention becomes a major computational obstacle. Simply put, the longer the input, the more the attention mechanism struggles—and cost skyrockets.

This scaling issue has sparked intense innovation in making attention faster and more efficient. A few years back, FlashAttention appeared as a breakthrough: by cleverly managing memory I/O on GPUs, it delivered exact attention at high speed without resorting to approximations. Its successor, FlashAttention-2, improved parallelism and load balancing—but even then, on cutting-edge NVIDIA H100 GPUs, it achieved only ~35% of the hardware’s theoretical maximum throughput.

Enter FlashAttention-3. Developed by researchers at Colfax Research, Meta, NVIDIA, Princeton, and Together AI, this new iteration rethinks the algorithm from the ground up to harness the NVIDIA Hopper GPU architecture. The result? A 1.5–2× speedup over its predecessor, near-peak GPU utilization, and accurate computation in fast, low-precision FP8.

In this article, we’ll walk through the three game-changing ideas behind FlashAttention-3:

  1. Producer–Consumer Asynchrony: Warp-specialized software pipelining that overlaps data movement with computation.
  2. Overlapping GEMMs and Softmax: Hiding the latency of slow operations like exp() under high-throughput matrix multiplies.
  3. Hardware-Accelerated Low-Precision: Making FP8 both fast and accurate through smart quantization and data layout tricks.

Background: How Attention Works and What Modern GPUs Offer

Before diving into FlashAttention-3’s innovations, let’s revisit the mechanics of multi-head attention and the GPU features this work exploits.

Multi-Head Attention 101

An attention head takes three input matrices:

  • Q (Query)
  • K (Key)
  • V (Value)

For a sequence length \( N \) and head dimension \( d \):

  1. Score Calculation:

    \[ \mathbf{S} = \alpha \mathbf{Q} \mathbf{K}^\mathsf{T} \]

    where \(\alpha = 1/\sqrt{d}\).

  2. Softmax:

    \[ \mathbf{P} = \operatorname{softmax}(\mathbf{S}) \]
  3. Value Aggregation:

    \[ \mathbf{O} = \mathbf{P} \mathbf{V} \]

Figure: Forward pass equations for standard self-attention.
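For concreteness, the three steps map to a few lines of numpy. This is only an illustrative, unfused reference (the function name, shapes, and random inputs are made up here), and it materializes the full \( N \times N \) matrices S and P:

```python
import numpy as np

def standard_attention(Q, K, V):
    """Unfused reference attention: materializes the full N x N matrices S and P."""
    d = Q.shape[-1]
    S = (Q @ K.T) / np.sqrt(d)                      # score calculation, alpha = 1/sqrt(d)
    P = np.exp(S - S.max(axis=-1, keepdims=True))   # numerically stable softmax
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                    # value aggregation

# Toy example: one head, sequence length 1024, head dimension 64.
rng = np.random.default_rng(0)
N, d = 1024, 64
Q, K, V = (rng.standard_normal((N, d)).astype(np.float32) for _ in range(3))
print(standard_attention(Q, K, V).shape)            # (1024, 64)
```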

During training, the backward pass computes gradients for Q, K, and V using intermediate values from the forward pass.

Figure: Backward pass equations for standard self-attention.

A straightforward GPU implementation computes these steps sequentially, storing intermediate results S and P in slow global memory (HBM). This is exactly what the original FlashAttention avoided—fusing operations into a single kernel that keeps data in fast on-chip memory.
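Here is a minimal numpy sketch of that fused, tiled idea for a single block of queries (a sketch only, not the CUDA kernel): K and V are streamed tile by tile and combined with an online, running-max softmax, so the full N × N score and probability matrices never exist. The block size, shapes, and random test data are illustrative; the real kernel does this per thread block in shared memory and registers.

```python
import numpy as np

def attention_one_query_tile(Q_i, K, V, block=128):
    """FlashAttention-style processing of one query tile Q_i: K and V are streamed
    in tiles and combined with an online (running-max) softmax."""
    Br, d = Q_i.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((Br, d))                 # output accumulator
    m = np.full(Br, -np.inf)              # running row-wise max of the scores
    l = np.zeros(Br)                      # running softmax denominator

    for j in range(0, K.shape[0], block):
        K_j, V_j = K[j:j + block], V[j:j + block]
        S = scale * (Q_i @ K_j.T)                     # GEMM 1 for this K/V tile
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)                     # rescales earlier contributions
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ V_j              # GEMM 2 for this K/V tile
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
N, d, Br = 1024, 64, 128
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
O_tile = attention_one_query_tile(Q[:Br], K, V)       # matches the unfused reference
```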

NVIDIA Hopper GPU Capabilities

FlashAttention-3 is optimized for NVIDIA’s Hopper architecture (H100 GPU), which introduces key features:

Memory hierarchy:
Hopper exposes several levels of memory, from large and slow to small and fast:

  • Global memory (HBM): high capacity but relatively slow.
  • L2 cache: sits between HBM and the streaming multiprocessors (SMs).
  • Shared memory (SMEM): fast on-chip storage inside each SM.
  • Registers (RMEM): ultra-fast private storage for individual threads.

Table: NVIDIA H100 thread–memory hierarchy, with capacity and bandwidth at each level from chip to thread.

Asynchronous execution:
Hopper has specialized units:

  • Tensor Cores with the WGMMA (Warpgroup MMA) instruction for large, asynchronous matrix multiplications.
  • Tensor Memory Accelerator (TMA) for asynchronous data transfer between HBM and SMEM.

Both can run independently of the main CUDA cores, enabling sophisticated overlaps between computation and data movement.

Warp specialization:
Inside a thread block, warps (groups of 32 threads) can be assigned roles. “Producer” warps issue TMA loads; “Consumer” warps perform WGMMA computations. This role separation helps hide memory latency and improve scheduling.

Low-precision FP8:
Hopper doubles Tensor Core throughput with FP8, but imposes strict operand layouts and requires careful quantization to keep accuracy.

FlashAttention-2 didn’t fully exploit these hardware advances—FlashAttention-3 does.


FlashAttention-3’s Three Breakthroughs

1. Producer–Consumer Asynchrony with Pingpong Scheduling

FlashAttention-3 organizes warps into:

  • Producers: Load K and V tiles from HBM to a circular SMEM buffer using TMA.
  • Consumers: Process Q, K, and V using WGMMA for GEMMs and CUDA cores for softmax.

While consumer warps compute \(\mathbf{Q} \mathbf{K}^\mathsf{T}\) for block \( j \), producer warps prefetch K and V for \( j+1 \). This overlapping hides load latency behind computation.
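As a rough CPU-side analogy (emphatically not the actual CUDA kernel), the sketch below uses a Python producer thread and a bounded queue standing in for the TMA loads and the circular SMEM buffer, while a consumer function plays the role of the WGMMA-issuing warpgroup. Tile sizes and the toy computation are invented for illustration, and the softmax normalization is omitted.

```python
import threading
import queue
import numpy as np

# Toy sizes; real tile shapes are dictated by the kernel's SMEM budget.
NUM_TILES, TILE, D = 8, 128, 64
rng = np.random.default_rng(0)
Q_i = rng.standard_normal((TILE, D))
K = rng.standard_normal((NUM_TILES * TILE, D))
V = rng.standard_normal((NUM_TILES * TILE, D))

buf = queue.Queue(maxsize=2)   # stands in for the circular SMEM buffer

def producer():
    """Prefetches K/V tiles; blocks when the buffer is full (like waiting on a free SMEM slot)."""
    for j in range(NUM_TILES):
        buf.put((K[j * TILE:(j + 1) * TILE], V[j * TILE:(j + 1) * TILE]))
    buf.put(None)              # completion signal

def consumer():
    """Consumes tiles as they arrive and does the attention-shaped GEMMs
    (normalization omitted; the buffering structure is the point here)."""
    acc = np.zeros((TILE, D))
    while (item := buf.get()) is not None:
        K_j, V_j = item
        S = (Q_i @ K_j.T) / np.sqrt(D)                       # "GEMM 1"
        P = np.exp(S - S.max(axis=1, keepdims=True))         # exponentials only
        acc += P @ V_j                                        # "GEMM 2"
    return acc

t = threading.Thread(target=producer)
t.start()
out = consumer()
t.join()
print(out.shape)   # (128, 64)
```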

The team further improves utilization with pingpong scheduling: softmax from one warpgroup runs concurrently with GEMMs from another, keeping the Tensor Cores busy even during slower softmax operations.

Figure: Pingpong scheduling between two warpgroups: the softmax of one group runs concurrently with the GEMMs of the other, keeping the Tensor Cores busy.


2. Intra-Warpgroup GEMM–Softmax Overlap

Even within a warpgroup, standard execution leaves Tensor Cores idle during softmax. FlashAttention-3 pipelines work across iterations:

During iteration \( j \):

  • Stage 1 (next block): Issue GEMM 1 for block \( j+1 \): \(\mathbf{S}_{\text{next}} = \mathbf{Q}_i \mathbf{K}_{j+1}^\mathsf{T}\).
  • Stage 2 (current block): Issue GEMM 2 for block \( j \): \(\mathbf{O}_i \leftarrow \mathbf{O}_i + \mathbf{P}_{\text{cur}} \mathbf{V}_j\).
  • Once GEMM 1 completes, perform the softmax on \(\mathbf{S}_{\text{next}}\) on the CUDA cores while GEMM 2 is still running on the Tensor Cores.

Figure: 2-stage WGMMA–softmax pipeline: the softmax for one step overlaps with the WGMMAs of adjacent steps.

This raises utilization but demands more registers to hold intermediate states—a trade-off between pipeline depth and tile size.
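To make the reordering concrete, here is a sequential numpy rendering of that loop structure, building on the tiled sketch from the background section. Only the ordering is illustrated: on the GPU both GEMMs are asynchronous WGMMAs and the softmax genuinely overlaps them, whereas this code simply shows which block each operation touches. Names, block sizes, and inputs are illustrative.

```python
import numpy as np

def pipelined_query_tile(Q_i, K_blocks, V_blocks):
    """Same math as the earlier tiled sketch, with the loop reordered 2-stage style:
    GEMM 1 for block j+1 is issued before GEMM 2 for block j, and the softmax of
    block j+1 sits between them. Sequential here; asynchronous on the GPU."""
    Br, d = Q_i.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((Br, d))

    # Prologue: GEMM 1 and softmax for block 0.
    S = scale * (Q_i @ K_blocks[0].T)
    m = S.max(axis=1)
    P_cur = np.exp(S - m[:, None])
    l = P_cur.sum(axis=1)

    for j in range(len(K_blocks)):
        if j + 1 < len(K_blocks):
            S_next = scale * (Q_i @ K_blocks[j + 1].T)   # GEMM 1 for block j+1 (async WGMMA on the GPU)
        O = O + P_cur @ V_blocks[j]                      # GEMM 2 for block j (async WGMMA on the GPU)
        if j + 1 < len(K_blocks):
            # Softmax of block j+1: on the GPU this runs on the CUDA cores
            # while GEMM 2 above is still executing on the Tensor Cores.
            m_new = np.maximum(m, S_next.max(axis=1))
            alpha = np.exp(m - m_new)
            P_cur = np.exp(S_next - m_new[:, None])
            O *= alpha[:, None]                          # rescale what has been accumulated so far
            l = l * alpha + P_cur.sum(axis=1)
            m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
N, d, B = 1024, 64, 128
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
K_blocks = [K[j:j + B] for j in range(0, N, B)]
V_blocks = [V[j:j + B] for j in range(0, N, B)]
O = pipelined_query_tile(Q[:B], K_blocks, V_blocks)      # same result as the earlier sketch
```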


3. FP8 Done Right: Fast and Accurate

Efficiency: Handling Layout Constraints

FP8 Tensor Cores require V in “k-major” layout for GEMM 2, but inputs are usually “mn-major.” FlashAttention-3 performs an in-kernel transpose during tile load. Producer warps use LDSM/STSM instructions to load and store SMEM in a transposed arrangement, avoiding costly global operations.
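In this context, "k-major" means the GEMM's contraction dimension is the contiguous one in memory; for GEMM 2 (P times V) that is the sequence axis of the V tile, while V normally arrives contiguous along the head dimension. A toy numpy illustration of the two layouts follows (the real transpose happens tile by tile in SMEM, not via numpy copies):

```python
import numpy as np

# A V tile of shape (sequence_block, head_dim). For GEMM 2 (P @ V), the
# contraction ("k") dimension is the sequence axis of V.
V_tile = np.arange(12, dtype=np.float32).reshape(4, 3)   # 4 sequence positions, head_dim 3

print(V_tile.strides)                  # row-major: head_dim is contiguous ("mn-major" for this GEMM)
V_kmajor = np.asfortranarray(V_tile)   # column-major copy: the sequence/"k" axis is contiguous
print(V_kmajor.strides)                # same values, now in the layout the FP8 GEMM expects
```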

A second layout mismatch exists between an FP8 WGMMA’s FP32 accumulator output and the FP8 operand format for the next WGMMA.

Figure: FP32 accumulator register layout after an FP8 WGMMA.

Figure: Required register layout of the FP8 operand A for the next WGMMA.

Byte-permute instructions “swizzle” accumulator data to match operand layout, enabling back-to-back FP8 GEMMs.

Accuracy: Block Quantization & Incoherent Processing

FP8’s limited precision makes quantization error—especially from “outlier” activations—problematic.

FlashAttention-3 uses:

  1. Block Quantization: One scale per processed tile of Q, K, V, adapting locally to value ranges.
  2. Incoherent Processing: Multiplying Q and K by a random orthogonal matrix (e.g., a fast Hadamard transform) before quantization leaves \(\mathbf{Q}\mathbf{K}^\mathsf{T}\) unchanged (the rotations cancel) while spreading each outlier's magnitude across all dimensions, which shrinks the extreme values the quantizer has to cover.
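A small numpy sketch of both ideas (tensor shape, tile size, and the random-sign Hadamard construction are illustrative, not the kernel's code): it shows how per-block scales adapt to local value ranges and how a random orthogonal rotation shrinks the largest magnitudes that an FP8 scale must accommodate.

```python
import numpy as np

def random_hadamard(d, rng):
    """Orthogonal 'fast Hadamard'-style matrix with random column signs (d = power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return (H / np.sqrt(d)) * rng.choice([-1.0, 1.0], size=d)

rng = np.random.default_rng(0)
N, d, block, FP8_MAX = 4096, 64, 128, 448.0          # 448 = largest finite FP8 E4M3 value
# Outlier-heavy activations: N(0,1) entries plus rare large N(0,100) values.
X = rng.standard_normal((N, d)) + 10.0 * rng.standard_normal((N, d)) * (rng.random((N, d)) < 1e-3)

# Per-tensor vs. block quantization: one scale for everything vs. one scale per tile of rows.
tensor_scale = np.abs(X).max() / FP8_MAX
block_scales = np.array([np.abs(B).max() / FP8_MAX for B in np.split(X, N // block)])
print(tensor_scale, block_scales.min(), block_scales.max())   # per-tile scales adapt to local ranges

# Incoherent processing: rotating by a random orthogonal matrix spreads each outlier's
# magnitude over all d dimensions; rotating both Q and K leaves Q @ K.T unchanged.
M = random_hadamard(d, rng)
print(np.abs(X).max(), np.abs(X @ M).max())   # the largest magnitude shrinks noticeably
```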

Experimental Results

Speed

Benchmarks on H100 GPUs show:

  • FP16: 1.5–2× faster forward pass than FlashAttention-2; up to 740 TFLOPs/s (~75% of H100 peak); backward pass 1.5–1.75× faster.
  • Outperforms standard attention, FlashAttention-2, Triton, and even cuDNN, especially at long sequence lengths.

Figure: FP16 forward pass speed versus standard attention, FlashAttention-2, Triton, and cuDNN across sequence lengths and head dimensions; FlashAttention-3 is consistently fastest.

Figure: FP16 backward pass speed; FlashAttention-3 shows significant gains over all baselines.

FP8: Forward pass approaches 1.2 PFLOPs/s, competitive with cuDNN and faster at long sequences.

Figure: FP8 forward pass speed; FlashAttention-3 is competitive with cuDNN and pulls ahead at longer sequence lengths.


Ablation Study

Disabling pipelining or warp specialization slows performance:

Table: Ablation results: full FlashAttention-3 reaches 661 TFLOPs/s, versus 582 TFLOPs/s without 2-stage pipelining and 570 TFLOPs/s without warp specialization; both techniques are critical for top speed.


Accuracy

Outlier-heavy synthetic data is generated from:

\[ \mathcal{N}(0,1) + \mathcal{N}(0,100) \cdot \mathrm{Bernoulli}(0.001) \]

The equation for the mixed Gaussian distribution used to generate test data with outliers.

Equation: Mixed Gaussian + rare large values to simulate LLM outliers.
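In numpy terms (shape and seed arbitrary), that distribution can be sampled as follows; note that \(\mathcal{N}(0,100)\) has standard deviation 10:

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (8192, 64)
is_outlier = rng.random(shape) < 1e-3                                              # Bernoulli(0.001) mask
X = rng.standard_normal(shape) + 10.0 * rng.standard_normal(shape) * is_outlier    # N(0,1) + N(0,100)·mask
print(is_outlier.mean(), np.abs(X).max())    # ~0.1% outliers, with magnitudes in the tens
```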

Results:

  • FP16: FlashAttention-3 matches FlashAttention-2; both are 1.7× more accurate than standard attention (thanks to FP32 softmax).
  • FP8: FlashAttention-3 is 2.6× more accurate than per-tensor FP8 baseline.

Table: Numerical error (RMSE): FlashAttention-3 FP16 matches FlashAttention-2, and FlashAttention-3 FP8 is far more accurate than the per-tensor FP8 baseline.


Conclusion: Lessons from FlashAttention-3

FlashAttention-3 exemplifies hardware-aware algorithm design:

  • Asynchrony drives utilization: Actively managing producer–consumer roles and pipelined execution maximizes Tensor Core work.
  • FP8 can be both fast and precise: Block quantization and incoherent processing tame outliers without losing speed gains.
  • Even mature kernels can improve: A 1.5–2× speedup for attention means faster training, quicker inference, and room for larger contexts.

By marrying deep knowledge of GPU microarchitecture with algorithmic redesign, FlashAttention-3 pushes attention—already highly optimized—even closer to the limits of modern hardware. This advance will benefit countless Transformer models and inspire future hardware–software co-design. The path forward for AI is not just bigger models—it’s smarter ways to run them.