Introduction
The “Transformer” architecture has become synonymous with the recent explosion in Artificial Intelligence. From ChatGPT to LLaMA, the mechanism of Softmax Attention drives the ability of these models to understand and generate human language. However, this power comes with a significant cost: Quadratic Complexity.
In simple terms, as the length of the text (context) doubles, the computational cost of a standard Transformer quadruples. This \(O(N^2)\) complexity creates a massive bottleneck when trying to process books, codebases, or long conversation histories.
Enter Linear Complexity Models. These are architectures designed to process text with \(O(N)\) complexity—meaning if you double the text length, the cost only doubles. They promise the holy grail of AI: the performance of a Transformer with the efficiency of a Recurrent Neural Network (RNN).
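To see the difference in concrete numbers, here is a minimal back-of-the-envelope sketch in Python; the costs are unitless and only the growth rates matter.

```python
# Illustrative only: growth rates of O(N^2) vs O(N) processing cost.
def quadratic_cost(n: int) -> int:
    return n * n  # every token interacts with every other token

def linear_cost(n: int) -> int:
    return n      # cost grows in lockstep with sequence length

for n in (1_000, 2_000, 4_000, 8_000):
    print(f"N={n:>5}: quadratic={quadratic_cost(n):>12,}  linear={linear_cost(n):>6,}")
# Each doubling of N multiplies the quadratic cost by 4 but the linear cost by only 2.
```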
But there has been a lingering question: Do these linear models scale? We know how to predict the performance of a standard Transformer as we make it bigger (thanks to Scaling Laws), but we haven’t had that certainty for linear alternatives.
In this post, we dive into a crucial research paper that establishes the Scaling Laws for Linear Complexity Language Models. The researchers trained models ranging from 70 million to 7 billion parameters on 300 billion tokens to find out if linear models can truly compete with the giants.
The Contenders: Quadratic vs. Linear
To understand the comparison, we must first introduce the architectures evaluated in this study. The researchers pitted a standard LLaMA baseline against three distinct linear architectures.
1. The Baseline: LLaMA (Softmax Attention)
This represents the current standard. It uses the traditional attention mechanism where every word attends to every other word. It is powerful but computationally heavy.
2. The Challengers (Linear Complexity)
The study examines three “efficient” architectures that modify how attention or memory works to achieve linear speed:
- TNL (TransNormerLLM): A linear attention model with “data-independent decay.” It handles memory using Lightning Attention with a sliding-window approach.
- HGRN2 (Hierarchically Gated RNN): A modern RNN that uses “data-dependent decay.” It features a state expansion mechanism to increase the size of its recurrent memory without exploding parameter counts.
- cosFormer2: A linear attention model without decay. It uses a cosine-based reweighting mechanism to highlight important information.
The fundamental difference lies in how they compute the relationships between tokens. As shown in the table below, LLaMA’s FLOPs (floating-point operations) contain a term that is quadratic in the sequence length \(n\), whereas the dominant terms for the linear models grow only linearly in \(n\).
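The paper’s exact FLOPs formulas live in that table, but a rough sketch of the standard dominant terms (roughly \(n^2 d\) for softmax attention versus \(n d^2\) for linear attention; an approximation, not the paper’s accounting) shows why the crossover happens once the context gets long:

```python
# Rough, standard approximations of per-layer attention FLOPs
# (illustrative constants, not the paper's exact formulas).
def softmax_attention_flops(n: int, d: int) -> int:
    # QK^T scores plus attention-weighted values: two (n x n x d) matmuls.
    return 2 * n * n * d

def linear_attention_flops(n: int, d: int) -> int:
    # Running (k v^T) state of size d x d, updated and queried once per token.
    return 2 * n * d * d

n, d = 32_768, 4_096  # long context, 7B-scale model dimension
print(f"softmax: {softmax_attention_flops(n, d):.3e} FLOPs")
print(f"linear : {linear_attention_flops(n, d):.3e} FLOPs")
# Once n exceeds d, the quadratic term dominates and linear attention pulls ahead.
```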

The Core Discovery: Scaling Laws Exist
The primary contribution of this paper is proving that linear complexity models follow the same power-law scaling trends as Transformers.
Scaling laws allow researchers to predict a model’s loss (error rate) based on the compute budget used to train it. The researchers followed the “Chinchilla” methodology (Hoffmann et al., 2022), training dozens of models to find the optimal trade-off between model size and dataset size.
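Concretely, fitting such a law is just a linear regression in log-log space. The sketch below uses made-up (compute, loss) pairs, not the paper’s measurements, purely to show the mechanics of fitting and extrapolating:

```python
import numpy as np

# Hypothetical (compute, loss) measurements -- NOT the paper's data.
compute = np.array([1e1, 1e2, 1e3, 1e4])   # compute budget (e.g. PFLOPs-days)
loss    = np.array([3.2, 2.7, 2.3, 1.95])  # training loss

# Fit  loss = beta * compute^(-alpha)  by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha, beta = -slope, np.exp(intercept)
print(f"fitted law: loss ≈ {beta:.2f} * C^(-{alpha:.3f})")

# Extrapolate to a 10x larger budget.
C_new = 1e5
print(f"predicted loss at C={C_new:.0e}: {beta * C_new ** (-alpha):.2f}")
```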
Visualizing the Scaling
The figure below is the most critical visualization in the paper. It plots the Training Loss against the Compute Budget (PFLOPs-days).
- Left Column (Loss vs. Compute): All four architectures (LLaMA and the three linear models) show a straight line on a log-log plot. This confirms that as you add more compute, the models get better at a predictable rate.
- Center Column (Optimal Model Size): This shows how big your model should be for a given budget.
- Right Column (Optimal Tokens): This shows how much data you should use.

The Verdict on Efficiency
Remarkably, the study found that linear complexity models exhibit similar scaling capabilities to conventional Transformers. In fact, under the same compute budget, linear models often achieve lower training loss.
The table below summarizes the mathematical power laws derived from the experiments. The coefficients (\(\alpha\) and \(\beta\)) for the linear models are highly competitive with LLaMA.
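For reference, one common way to parameterize such laws (the paper’s exact parameterization may differ slightly) is

\[
L(C) = \beta\, C^{-\alpha}, \qquad N_{\mathrm{opt}}(C) \propto C^{a}, \qquad D_{\mathrm{opt}}(C) \propto C^{b},
\]

where \(L\) is the training loss, \(C\) the compute budget, and \(N_{\mathrm{opt}}\), \(D_{\mathrm{opt}}\) the compute-optimal parameter and token counts.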

Beyond Loss: Downstream Performance
Low training loss is great, but does the model actually understand language? To test this, the researchers evaluated the models on downstream tasks, including Common Sense Reasoning (CSR) and Validation Perplexity.
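As a quick refresher, validation perplexity is the exponentiated average negative log-likelihood on held-out text, so lower is better:

\[
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_{\theta}\!\left(x_i \mid x_{<i}\right)\right)
\]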
Proficiency and Knowledge
The results were surprising. In many cases, linear models outperformed LLaMA when normalized for compute.
- Perplexity: Linear models like HGRN2 and cosFormer2 achieved lower perplexity (better prediction) on datasets like WikiText-2.
- Reasoning: On benchmarks like HellaSwag and PIQA, linear models consistently demonstrated superior reasoning capabilities at the 7B parameter scale.
The figure below tracks performance across model sizes. You can see the linear models (TNL, HGRN2, cosFormer2) trending upwards in accuracy and downwards in perplexity, often hugging or crossing the LLaMA lines.

A detailed breakdown of the scores at the 7 billion parameter mark shows that HGRN2 and TNL are particularly strong contenders in general linguistic tasks.

The Achilles’ Heel: Retrieval Tasks
If linear models are faster and just as smart, why aren’t we using them everywhere? The study uncovered a significant limitation: Retrieval.
The researchers tested the models using the “Needle in a Haystack” (NIAH) benchmark. In this task, a specific piece of information (the needle) is hidden inside a long text (the haystack), and the model must retrieve it.
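A minimal Python sketch of how such a prompt is constructed (the filler sentence, needle, and question are our own placeholders, not the paper’s exact setup):

```python
# Build a toy "needle in a haystack" prompt: hide one fact at a chosen depth
# inside filler text, then ask the model to retrieve it.
def build_niah_prompt(num_filler_words: int, depth: float) -> str:
    filler = ("The sky was clear and the grass was green. " * (num_filler_words // 9)).split()
    needle = "The secret passcode is 7421."
    insert_at = int(len(filler) * depth)  # depth in [0, 1]: how deep the needle is buried
    haystack = " ".join(filler[:insert_at] + [needle] + filler[insert_at:])
    return haystack + "\n\nQuestion: What is the secret passcode?\nAnswer:"

prompt = build_niah_prompt(num_filler_words=8_000, depth=0.25)
print(prompt[:120], "...")
# Sweeping context length and needle depth produces the red/green heatmaps shown later in the post.
```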
The “Recall” Problem
While LLaMA (Softmax Attention) excels at looking back at specific tokens regardless of how far back they are, linear models struggle. Because linear models compress context into a fixed-size recurrent state, information can get “diluted” over time.
This equation shows the Softmax attention mechanism. Notice how it computes interactions between Queries (\(Q\)) and Keys (\(K\)) explicitly:
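\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
\]

Here \(d\) is the key dimension. Every query is compared against every key, which is exactly where the \(O(N^2)\) cost comes from, but it also means any past token can be looked up directly.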

In contrast, linear recurrence maintains a running state: instead of re-scanning the history, it folds each new token into a fixed-size memory:
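A generic template for this recurrence (the decay term \(\lambda_t\) is what differs between TNL, HGRN2, and cosFormer2, so this is the common skeleton rather than any single model’s exact update):

\[
S_t = \lambda_t\, S_{t-1} + k_t\, v_t^{\top}, \qquad o_t = S_t^{\top} q_t
\]

The fixed-size state \(S_t\) is all the model keeps of its history: old tokens are never revisited, only decayed and overwritten, which is why precise recall degrades.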

Visual Proof: The Heatmaps
The difference is stark when visualized. Below is a heatmap for LLaMA 7B in “Standard Mode” (retrieval + comprehension). Green indicates success (score 10), and red indicates failure. LLaMA is mostly green, successfully retrieving information across various depths and lengths.

Now, look at HGRN2 7B (a linear model) on the same task. While it performs decently in some areas (the green patches), there are significant red zones where it fails to retrieve the information, particularly as the task gets more complex.

Similarly, TNL 7B shows a struggle to maintain consistent retrieval across the full context window compared to the Transformer baseline.

The researchers concluded that while linear models are excellent at general language modeling and reasoning, they lack the “Going Through a Book” (GTB) capability—the ability to re-scan exact previous inputs—which Transformers possess inherently.
Architecture Nuances: Shape Matters
The paper highlights another interesting divergence from traditional scaling laws: Aspect Ratio Sensitivity.
For Transformers like LLaMA, the exact shape of the model (how deep vs. how wide it is) usually doesn’t matter much, provided the total parameter count is the same. However, linear models proved to be much more sensitive to this.
As shown in the table below, pushing the hidden dimension too high (and consequently reducing the number of layers) caused a collapse in retrieval performance for linear models like cosFormer2, while LLaMA remained relatively stable.
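A rough way to see what “same parameters, different shape” means: the sketch below uses the common \(\approx 12 \cdot L \cdot d^2\) approximation for non-embedding transformer parameters (an approximation, not the paper’s accounting) to show how halving the depth while widening the model keeps the count roughly constant.

```python
# Approximate non-embedding parameter count of a transformer-style block stack:
# each layer has ~4*d^2 (attention/mixer) + ~8*d^2 (FFN) parameters => ~12*d^2.
def approx_params(num_layers: int, hidden_dim: int) -> int:
    return 12 * num_layers * hidden_dim ** 2

deep_narrow  = approx_params(num_layers=32, hidden_dim=4096)  # 7B-like shape
shallow_wide = approx_params(num_layers=16, hidden_dim=5793)  # ~same budget, half the depth
print(f"deep/narrow : {deep_narrow / 1e9:.2f}B params")
print(f"shallow/wide: {shallow_wide / 1e9:.2f}B params")
# Same budget, very different aspect ratio -- the paper finds linear models'
# retrieval degrades in the shallow/wide regime while LLaMA stays stable.
```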

Conclusion and Implications
This research provides a solid foundation for the future of efficient LLMs. The key takeaways are:
- Predictability: We can now confidently scale linear models using established power laws.
- Efficiency Wins: For general language generation and reasoning, linear models offer a faster, more efficient alternative to Transformers without sacrificing quality.
- The Retrieval Gap: The primary hurdle remaining is precise information retrieval from long contexts. Linear models compress history, while Transformers preserve it exactly.
The implications are significant. For applications like chatbots, code generation, and creative writing where “gist” and reasoning are more important than photographic memory, linear models (like HGRN2 or TNL) are ready for the big leagues. However, for tasks requiring precise citation from massive documents, the traditional Transformer reigns supreme—for now.