Making Transformers Fly: A Deep Dive into Linear Attention
Since their introduction in the landmark 2017 paper Attention Is All You Need, Transformers have taken the world of AI by storm. Models like BERT, GPT-3, and DALL·E have revolutionized natural language processing, computer vision, and beyond. They are the engines behind the generative AI boom—capable of writing code, creating art, and holding surprisingly coherent conversations.
But these powerful models have a costly secret: a computational bottleneck that has, until recently, put a hard limit on how much information they can handle at once. The core of the Transformer, the self-attention mechanism, has a computational and memory complexity of \(O(N^2)\), where \(N\) is the length of the input sequence. This means that if you double the length of the text or the number of pixels in an image you’re processing, the cost doesn’t just double—it quadruples. For very long sequences—like high-resolution images, lengthy documents, or audio clips—this quadratic scaling becomes prohibitively expensive.
This limitation has spurred a wave of research into more “efficient” Transformers. How can we preserve the power of global attention without the crippling quadratic cost?
A fascinating 2020 paper, Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, proposes an elegant and powerful answer. The researchers introduce the Linear Transformer, a model that reduces the complexity of self-attention from \(O(N^2)\) to a much more manageable \(O(N)\). This shift leads to staggering speedups—up to 4000× faster in some cases—and it reveals a deep connection between Transformers and Recurrent Neural Networks (RNNs), two architectures often seen as fundamentally different approaches to sequence modeling.
In this article, we’ll unpack the key ideas behind the Linear Transformer. We’ll explore how a clever mathematical trick tames the quadratic beast, how this change leads to lightning-fast generative models, and why, in a very real sense, Transformers can be viewed as RNNs.
The Problem: Understanding the \(O(N^2)\) Bottleneck
Before diving into the solution, let’s revisit how standard self-attention works and pinpoint where the bottleneck lies.
A Transformer consists of stacked layers. Each layer has two main parts:
- Self-attention, which lets elements of a sequence interact.
- A feedforward network, which processes each element independently.
In the self-attention step, each element in the sequence produces three vectors:
- Query (Q) — represents “what I’m looking for.”
- Key (K) — represents “what I contain.”
- Value (V) — holds the actual information.
To determine how much attention the i-th element pays to the j-th element, the model computes the dot product of \(Q_i\) and \(K_j\). These dot products across the entire sequence form the attention matrix; after a row-wise softmax, each element aggregates information from all the others as a weighted average of their values.
Figure: In standard self-attention, computing \(QK^T\) creates a large \(N \times N\) matrix—this is what causes the quadratic growth in time and memory.
Here lies the problem: for a sequence of length \(N\), we must compute and store every pairwise interaction, resulting in an \(N \times N\) matrix. The cost—both memory and time—scales with \(O(N^2)\). For long texts or images, this matrix quickly becomes enormous and unmanageable.
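To make the bottleneck concrete, here is a minimal NumPy sketch of standard single-head attention (illustrative only, not the paper's code); the \(N \times N\) score matrix it materializes is exactly the object that grows quadratically.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention (single head, no mask).

    Q, K, V have shape (N, D). The (N, N) score matrix built below is
    the source of the O(N^2) time and memory cost.
    """
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)                  # (N, N): quadratic in N
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (N, D)

N, D = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, D)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)  # (1024, 64)
# At N = 65_536 the score matrix alone would hold ~4.3 billion floats.
```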
The Core Idea: Linearizing Attention with Kernels
The authors take a fresh perspective: attention is essentially a weighted average, with weights determined by a similarity function between queries and keys. Standard attention uses the exponential of a scaled dot product:
\[ \text{sim}(Q_i, K_j) = \exp(Q_i^T K_j / \sqrt{D}) \]
Figure: Generalized attention formulation in which the similarity score can be replaced with any valid kernel function.
The key insight is to replace this exponential similarity with one that can be expressed as a dot product between feature maps (kernels):
\[ \text{sim}(Q_i, K_j) = \phi(Q_i)^T \phi(K_j) \]
where \(\phi(\cdot)\) is a feature mapping function.
Substituting this kernel representation into the attention equation yields:
\[ V_i' = \frac{\sum_j \phi(Q_i)^T \phi(K_j)\, V_j}{\sum_j \phi(Q_i)^T \phi(K_j)} \]
Figure: Linearized attention representation using kernel feature maps for queries and keys.
Now comes the breakthrough: by exploiting the associativity of matrix multiplication, we can reorganize the computation. Instead of computing every query–key similarity explicitly, we precompute sums over all keys and values:
\[ V_i' = \frac{\phi(Q_i)^T \sum_j \phi(K_j)V_j^T}{\phi(Q_i)^T \sum_j \phi(K_j)} \]
Figure: A reformulation that moves summations outside the per-query calculation, enabling efficient computation.
This means the time-consuming parts—\(\sum_j \phi(K_j)V_j^T\) and \(\sum_j \phi(K_j)\)—can be computed once and reused for all queries. The model no longer needs to construct the full attention matrix.
Mathematically, the computation changes from \((\phi(Q)\phi(K)^T)V\) to \(\phi(Q)(\phi(K)^T V)\).
Figure: The associative property allows the model to deal with smaller matrices, turning quadratic time into linear time.
The result: attention computation now scales linearly with sequence length, reducing both time and memory requirements to \(O(N)\).
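A quick numerical sanity check of the reordering, with a random non-negative stand-in for \(\phi\) and the normalization omitted for brevity (a sketch, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 512, 64
phi_Q = rng.random((N, D))   # stands in for phi(Q); any non-negative map works
phi_K = rng.random((N, D))   # stands in for phi(K)
V = rng.standard_normal((N, D))

quadratic = (phi_Q @ phi_K.T) @ V   # builds an (N, N) matrix: O(N^2 * D)
linear    = phi_Q @ (phi_K.T @ V)   # builds a  (D, D) matrix: O(N * D^2)

print(np.allclose(quadratic, linear))  # True, up to floating-point error
```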
Of course, this depends on an appropriate choice of feature map. The feature map corresponding to the exact exponential (softmax) kernel is infinite-dimensional, but a practical finite-dimensional alternative works beautifully. The authors use:
Figure: The feature map \(\phi(x) = \text{elu}(x) + 1\) guarantees positive similarity and stable gradients.
This simple activation ensures positive similarity scores and steady gradients, giving linear attention nearly identical performance to standard softmax attention—with dramatically better efficiency.
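Putting the pieces together, a minimal sketch of non-causal linear attention with this feature map might look as follows; the function names and the small `eps` added to the denominator for numerical safety are mine, not the paper's.

```python
import numpy as np

def elu_feature_map(x):
    """phi(x) = elu(x) + 1: strictly positive, and close to exp(x) near zero."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention: no (N, N) matrix is ever formed."""
    phi_Q, phi_K = elu_feature_map(Q), elu_feature_map(K)
    KV = phi_K.T @ V             # (D, D): sum_j phi(K_j) V_j^T
    Z = phi_K.sum(axis=0)        # (D,)  : sum_j phi(K_j)
    numer = phi_Q @ KV           # (N, D)
    denom = phi_Q @ Z + eps      # (N,)
    return numer / denom[:, None]
```

Both summaries (`KV` and `Z`) are independent of the query, which is what lets them be computed once and reused for every position.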
The “Aha!” Moment: Transformers as RNNs
The benefits become even more dramatic for autoregressive tasks—like generating text or images one piece at a time. Here, the model’s output from one step becomes the input for the next.
To prevent “cheating,” we use causal masking, ensuring each element attends only to itself and earlier positions:
Figure: Causal masking ensures attention respects sequence order (\(j \le i\)).
Applying the same linearization trick, we get an equation that involves cumulative sums up to position \(i\):
\[ V_i' = \frac{\phi(Q_i)^T \sum_{j=1}^{i} \phi(K_j) V_j^T}{\phi(Q_i)^T \sum_{j=1}^{i} \phi(K_j)} \]
Figure: Linearized causal attention, enabling efficient autoregressive inference.
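As a rough sketch (not the authors' optimized CUDA kernel), these cumulative sums can be evaluated for all positions at once with a prefix sum. This naive version stores an intermediate \((N, D, D)\) tensor for clarity, which the paper's implementation avoids; the helper names and `eps` are mine.

```python
import numpy as np

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Causal linear attention via prefix sums over positions j <= i."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    phi_Q, phi_K = phi(Q), phi(K)
    # Outer products phi(K_j) V_j^T for every position, then running sums.
    S_cum = np.cumsum(phi_K[:, :, None] * V[:, None, :], axis=0)  # (N, D, D)
    Z_cum = np.cumsum(phi_K, axis=0)                              # (N, D)
    numer = np.einsum('nd,nde->ne', phi_Q, S_cum)                 # (N, D)
    denom = np.einsum('nd,nd->n', phi_Q, Z_cum) + eps             # (N,)
    return numer / denom[:, None]
```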
We then define two running totals:
- \(S_i = \sum_{j=1}^{i} \phi(K_j) V_j^T\)
- \(Z_i = \sum_{j=1}^{i} \phi(K_j)\)

Figure: The recurrently updated states \(S_i\) and \(Z_i\) that summarize all past information.
Crucially, these can be computed recursively:
- \(S_i = S_{i-1} + \phi(K_i)V_i^T\)
- \(Z_i = Z_{i-1} + \phi(K_i)\)
This recursion reveals something profound: the transformer layer behaves like a Recurrent Neural Network (RNN). The pair \((S_i, Z_i)\) serves as hidden states that evolve over time.
At each step, the model updates these states and computes its output:
\[ V_i' = \frac{\phi(Q_i)^T S_i}{\phi(Q_i)^T Z_i} \]
Figure: The Transformer layer reformulated as a recurrent model with evolving hidden states.
This equivalence means a Transformer can combine the best characteristics of both architectures:
- Parallel training — during training, the sums can be computed efficiently in parallel across timesteps.
- Constant-time inference — during generation, the model updates the compact states \((S_i, Z_i)\) sequentially with constant cost per step.
Unlike a standard Transformer, where the cost of producing the 1000th token grows with the 999 tokens that precede it, the Linear Transformer maintains a steady generation speed no matter how long the sequence grows.
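Here is an illustrative per-step inference loop under the same assumptions (the `LinearAttentionState` class and helper names are hypothetical, for exposition only): the state has a fixed size of \(D \times D + D\), so each generation step costs the same regardless of how much context has accumulated.

```python
import numpy as np

def phi(x):
    """elu(x) + 1 feature map, applied elementwise."""
    return np.where(x > 0, x + 1.0, np.exp(x))

class LinearAttentionState:
    """Constant-size recurrent state for autoregressive generation."""
    def __init__(self, dim, eps=1e-6):
        self.S = np.zeros((dim, dim))  # accumulates phi(K_j) V_j^T
        self.Z = np.zeros(dim)         # accumulates phi(K_j)
        self.eps = eps

    def step(self, q, k, v):
        fq, fk = phi(q), phi(k)
        self.S += np.outer(fk, v)      # S_i = S_{i-1} + phi(K_i) V_i^T
        self.Z += fk                   # Z_i = Z_{i-1} + phi(K_i)
        return (fq @ self.S) / (fq @ self.Z + self.eps)

# Per-step cost is O(D^2) no matter how many tokens precede the current one.
rng = np.random.default_rng(0)
state = LinearAttentionState(dim=64)
for _ in range(1000):
    q, k, v = rng.standard_normal((3, 64))
    out = state.step(q, k, v)
print(out.shape)  # (64,)
```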
Experiments: Putting Linear Attention to the Test
Theory is elegant—practice is decisive. The authors evaluate their model across several dimensions: runtime, memory usage, convergence behavior, and performance on real-world tasks.
Performance and Memory Efficiency
Figure 1: Time and GPU memory requirements grow quadratically for softmax but linearly for the new kernel-based attention.
As expected, standard softmax attention explodes in cost as sequence length increases. Linear attention remains efficient, even for sequences reaching 65,536 tokens, achieving linear scaling that matches theoretical predictions.
To verify stability, the authors tested on a synthetic sequence duplication task. Linear attention converges smoothly and performs as reliably as full softmax attention, outperforming other efficient alternatives such as the Reformer.
Figure 2: The Linear Transformer reaches the same final performance as softmax while maintaining stable convergence.
Autoregressive Image Generation – Up to 4000× Faster
When applied to autoregressive image generation (predicting pixels one at a time), Linear Transformers shine.
| Method | Bits/dim | Images/sec (speedup) |
|---|---|---|
| Softmax | 0.621 | 0.45 (1×) |
| LSH-1 | 0.745 | 0.68 (1.5×) |
| LSH-4 | 0.676 | 0.27 (0.6×) |
| Linear (ours) | 0.644 | 142.8 (317×) |
Table 1: MNIST generation results—same quality, hundreds of times faster throughput.
For MNIST, the Linear Transformer matches the softmax model’s image quality while generating images 317× faster. For CIFAR-10, with longer pixel sequences, the speedup skyrockets to over 4,400×.
Figure 3: Linear attention achieves similar image quality to softmax but thousands of times faster inference.
The generated samples confirm that efficiency doesn’t compromise quality.
Figure 4: MNIST samples and completions from the Linear Transformer. Output quality matches softmax while scaling efficiently.
Speech Recognition
To explore broader applicability, the authors tested on automatic speech recognition using the WSJ dataset. The results again highlight the advantages of linear attention.
| Method | Validation PER | Time/epoch (s) |
|---|---|---|
| Bi-LSTM | 10.94 | 1047 |
| Softmax | 5.12 | 2711 |
| LSH-4 | 9.33 | 2250 |
| Linear (ours) | 8.08 | 824 |
Table 2: Speech recognition results—Linear Transformer trains faster than all baselines and delivers competitive accuracy.
Even for non-autoregressive, frame-level tasks, the linear formulation cuts training time dramatically without large losses in accuracy.
Conclusion: The Future of Fast Attention
The Linear Transformer is far more than another optimization—it’s a reconceptualization of self-attention itself. By replacing softmax with kernel-based feature maps and leveraging the associative property of matrix multiplication, attention transforms from quadratic to linear in both time and memory.
Key Takeaways
- Linear Complexity: Linear Transformers compute attention with \(O(N)\) efficiency, opening the door to processing extremely long sequences.
- Massive Speedups: In autoregressive tasks, inference becomes constant-time and constant-memory per step—yielding speedups of thousands of times.
- Unified Framework: The discovery that Transformers can be expressed as RNNs bridges two historically separate model families, suggesting new hybrid architectures.
This research makes Transformers capable of handling tasks once deemed infeasible—long-form text synthesis, high-resolution video, full-length audio tracks, and more. It redefines how we think about both attention and recurrence, unlocking architectures that not only learn better but also run faster.
By making Transformers fly, Linear Attention doesn’t just make them efficient—it expands what’s possible with modern AI.