![Cover image](https://deep-paper.org/en/papers/2025-10/2407.08608/images/cover.png)
# Unpacking FlashAttention-3: How Asynchrony and FP8 Supercharge Transformers
The Transformer architecture is the powerhouse behind today’s AI revolution, but it has one stubborn bottleneck: the attention mechanism. As we push for larger models that can process entire books, massive codebases, or hours of video, the quadratic complexity of attention becomes a major computational obstacle. Simply put, doubling the input length roughly quadruples the cost of attention, so long contexts get expensive fast. This scaling problem has sparked intense innovation in making attention faster and more efficient.

A few years back, FlashAttention appeared as a breakthrough: by carefully managing memory I/O on GPUs, it delivered exact attention at high speed without resorting to approximations. Its successor, FlashAttention-2, improved parallelism and load balancing, but even so, on cutting-edge NVIDIA H100 GPUs it achieved only ~35% of the hardware’s theoretical maximum throughput. ...
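To make the quadratic cost concrete, here is a minimal PyTorch sketch of *standard* attention (not the FlashAttention kernel, and not code from the paper). The key point is the score matrix of shape `(seq_len, seq_len)` that it materializes: its size grows quadratically with sequence length, and avoiding writing that matrix to GPU memory is exactly what the FlashAttention family is about.

```python
import torch

def naive_attention(q, k, v):
    """Textbook scaled dot-product attention (illustrative only).

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    scale = q.shape[-1] ** -0.5
    # The score matrix has shape (batch, heads, seq_len, seq_len):
    # memory and compute grow with seq_len squared.
    scores = (q @ k.transpose(-2, -1)) * scale
    probs = scores.softmax(dim=-1)
    return probs @ v

# Rough back-of-the-envelope: at seq_len = 64K, a single fp16 score
# matrix per head is 64K * 64K * 2 bytes ≈ 8 GiB, far larger than the
# on-chip SRAM a GPU streaming multiprocessor can hold.
```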