![FlashAttention-2 paper cover](https://deep-paper.org/en/paper/2307.08691/images/cover.png)
FlashAttention-2: Even Faster, Even More Efficient Attention for Transformers
If you’ve been following the world of large language models, you know that one of the biggest goals is expanding the context window. We want models that can read entire books, analyze lengthy codebases, or process high-resolution images. The main obstacle? The attention mechanism at the heart of the Transformer architecture. Its computational and memory costs grow quadratically with the sequence length, making long contexts prohibitively expensive.

A breakthrough paper in 2022, FlashAttention, tackled this problem head-on. By cleverly reordering the attention computation to be more aware of the GPU’s memory hierarchy, it achieved linear memory usage and a 2–4× speedup over standard implementations—all without any approximation. It was a game-changer and has been widely adopted. ...
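To make the quadratic-memory point concrete, here is a minimal sketch of *standard* attention in PyTorch (not the FlashAttention kernel; the function name `naive_attention` and the shapes are illustrative assumptions). It materializes the full N×N score matrix, which is exactly the buffer that FlashAttention avoids writing to GPU memory by computing attention in tiles.

```python
import torch

def naive_attention(q, k, v):
    """Standard attention sketch: materializes the full (N x N) score matrix.

    q, k, v: (batch, heads, N, d) tensors. The `scores` buffer grows as
    O(N^2) per head, which is what makes long contexts expensive.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5  # (batch, heads, N, N)
    probs = torch.softmax(scores, dim=-1)      # another (N x N) buffer
    return probs @ v                           # (batch, heads, N, d)

# Rough scale: at N = 32768 tokens, a single head's fp16 score matrix is
# 32768 * 32768 * 2 bytes ~= 2 GiB -- before multiplying by batch and heads.
q = k = v = torch.randn(1, 1, 1024, 64)
out = naive_attention(q, k, v)
```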