Introduction

The “Transformer” architecture has become synonymous with the recent explosion in Artificial Intelligence. From ChatGPT to LLaMA, the mechanism of Softmax Attention drives the ability of these models to understand and generate human language. However, this power comes with a significant cost: Quadratic Complexity.

In simple terms, when the length of the text (context) doubles, the cost of a standard Transformer’s attention computation roughly quadruples. This \(O(N^2)\) complexity creates a massive bottleneck when trying to process books, codebases, or long conversation histories.
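To see the difference concretely, here is a toy cost comparison (a minimal sketch with made-up unit costs; only the growth rates matter):

```python
# Toy cost model: the constants are placeholders; only the N^2 vs. N growth matters.
def quadratic_cost(n: int) -> int:
    return n * n      # softmax attention: every token interacts with every other token

def linear_cost(n: int) -> int:
    return n          # linear-complexity models: cost grows in step with length

for n in [1_000, 2_000, 4_000, 8_000]:
    print(f"N={n:>5}  quadratic={quadratic_cost(n):>12,}  linear={linear_cost(n):>6,}")
# Each doubling of N multiplies the quadratic column by 4 but the linear column only by 2.
```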

Enter Linear Complexity Models. These are architectures designed to process text with \(O(N)\) complexity—meaning if you double the text length, the cost only doubles. They promise the holy grail of AI: the performance of a Transformer with the efficiency of a Recurrent Neural Network (RNN).

But there has been a lingering question: Do these linear models scale? We know how to predict the performance of a standard Transformer as we make it bigger (thanks to Scaling Laws), but we haven’t had that certainty for linear alternatives.

In this post, we dive into a crucial research paper that establishes the Scaling Laws for Linear Complexity Language Models. The researchers trained models ranging from 70 million to 7 billion parameters on 300 billion tokens to find out if linear models can truly compete with the giants.

The Contenders: Quadratic vs. Linear

To understand the comparison, we must first introduce the architectures evaluated in this study. The researchers pitted a standard LLaMA baseline against three distinct linear architectures.

1. The Baseline: LLaMA (Softmax Attention)

This represents the current standard. It uses the traditional attention mechanism where every word attends to every other word. It is powerful but computationally heavy.

2. The Challengers (Linear Complexity)

The study examines three “efficient” architectures that modify how attention or memory works to achieve linear speed:

  • TNL (TransNormerLLM): A linear attention model with “data-independent decay.” It handles memory with a sliding-window approach built on Lightning Attention.
  • HGRN2 (Hierarchically Gated RNN): A modern RNN with “data-dependent decay.” It features a state-expansion mechanism that enlarges its recurrent memory without exploding the parameter count.
  • cosFormer2: A linear attention model with no decay; instead, it uses a cosine-based reweighting mechanism to highlight important information. (A minimal sketch contrasting these three decay styles follows this list.)
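Here is that sketch: a minimal, single-head toy of the recurrent update each family uses, in NumPy, with invented dimensions and none of the architectures’ actual gating, normalization, or feature maps.

```python
import numpy as np

d_k, d_v = 4, 4                          # toy head dimensions (made up)
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal(d_k) for _ in range(3))
S = np.zeros((d_k, d_v))                 # fixed-size recurrent state ("memory")

def step(S, q, k, v, decay):
    """One generic linear-attention recurrence step."""
    S = decay * S + np.outer(k, v)       # fold the new key/value pair into the state
    return S, q @ S                      # read the updated state with the query

# TNL-style: data-INdependent decay -- a fixed value, independent of the input.
S_tnl, _ = step(S, q, k, v, decay=0.9)

# HGRN2-style: data-DEPENDENT decay -- derived from the current input (toy sigmoid gate).
gate = 1.0 / (1.0 + np.exp(-k))
S_hgrn2, _ = step(S, q, k, v, decay=gate[:, None])

# cosFormer2-style: no decay -- the state accumulates (its cosine reweighting is omitted here).
S_cos, _ = step(S, q, k, v, decay=1.0)
```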

The fundamental difference lies in how they compute the relationships between tokens. As shown in the table below, LLaMA’s FLOPs (floating-point operations) contain a term that is quadratic in the sequence length \(n\), whereas the dominant terms for the linear models grow only linearly in \(n\).

Table of Model Parameters and FLOPs comparing LLaMA against the linear models.

The Core Discovery: Scaling Laws Exist

The primary contribution of this paper is proving that linear complexity models follow the same power-law scaling trends as Transformers.

Scaling laws allow researchers to predict a model’s loss (roughly, how wrong its next-token predictions are) from the compute budget used to train it. The researchers followed the “Chinchilla” methodology (Hoffmann et al., 2022), training dozens of models to find the optimal trade-off between model size and dataset size.
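In generic form, these Chinchilla-style fits look roughly like the following, with \(N\) the parameter count, \(D\) the number of training tokens, and \(C\) the compute budget (the paper’s exact parameterization and fitted constants are what the tables later in this post summarize):

\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad N_{\mathrm{opt}}(C) \propto C^{a}, \quad D_{\mathrm{opt}}(C) \propto C^{b} \]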

Visualizing the Scaling

The figure below is the most critical visualization in the paper. It plots the Training Loss against the Compute Budget (PFLOPs-days).

  • Left Column (Loss vs. Compute): All four architectures (LLaMA and the three linear models) fall on a straight line on a log-log plot. This confirms that as you add more compute, the models improve at a predictable rate (see the fitting sketch after this list).
  • Center Column (Optimal Model Size): This shows how big your model should be for a given budget.
  • Right Column (Optimal Tokens): This shows how much data you should use.
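To see why a straight line on log-log axes is so useful, note that fitting a power law reduces to a linear regression in log space. A toy example with invented numbers (not the paper’s measurements):

```python
import numpy as np

# Invented (compute, loss) pairs standing in for a sweep of training runs.
compute = np.array([1e2, 1e3, 1e4, 1e5])   # e.g. PFLOPs-days
loss    = np.array([3.2, 2.7, 2.3, 1.95])

# Power law  L = k * C^(-alpha)  <=>  log L = log k - alpha * log C,
# so an ordinary least-squares line in log space recovers the exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
alpha, k = -slope, np.exp(intercept)

# Extrapolate to a budget larger than any run in the (fake) sweep.
print(f"alpha = {alpha:.3f}, predicted loss at C = 1e6: {k * 1e6 ** -alpha:.2f}")
```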

Training Curve Fitting for Four Architectures showing Loss, Model Size, and Tokens vs Compute.

The Verdict on Efficiency

Remarkably, the study found that linear complexity models exhibit similar scaling capabilities to conventional Transformers. In fact, under the same compute budget, linear models often achieve lower training loss.

The table below summarizes the mathematical power laws derived from the experiments. The coefficients (\(\alpha\) and \(\beta\)) for the linear models are highly competitive with LLaMA.

Summary of Scaling Laws illustrating the relationship between loss, parameters, and corpus size.

Beyond Loss: Downstream Performance

Low training loss is great, but does the model actually understand language? To test this, the researchers evaluated the models on downstream tasks, including Common Sense Reasoning (CSR) and Validation Perplexity.

Proficiency and Knowledge

The results were surprising. In many cases, linear models outperformed LLaMA when normalized for compute.

  • Perplexity: Linear models like HGRN2 and cosFormer2 achieved lower perplexity (better prediction; defined just after this list) on datasets like WikiText-2.
  • Reasoning: On benchmarks like HellaSwag and PIQA, linear models consistently demonstrated superior reasoning capabilities at the 7B parameter scale.
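For reference, perplexity is the exponential of the average negative log-likelihood per token over a text of length \(T\), so lower values mean the model assigns higher probability to the actual text:

\[ \mathrm{PPL} = \exp\!\left( -\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t}) \right) \]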

The figure below tracks performance across model sizes. You can see the linear models (TNL, HGRN2, cosFormer2) trending upwards in accuracy and downwards in perplexity, often hugging or crossing the LLaMA lines.

Comparative performance across distinct benchmarks like CSR and Perplexity.

A detailed breakdown of the scores at the 7 billion parameter mark shows that HGRN2 and TNL are particularly strong contenders in general linguistic tasks.

Benchmark of Downstream Tasks including CSR, Perplexity, and Retrieval.

The Achilles’ Heel: Retrieval Tasks

If linear models are faster and just as smart, why aren’t we using them everywhere? The study uncovered a significant limitation: Retrieval.

The researchers tested the models using the “Needle in a Haystack” (NIAH) benchmark. In this task, a specific piece of information (the needle) is hidden inside a long text (the haystack), and the model must retrieve it.
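A minimal sketch of how such a prompt can be assembled (the filler text, needle, and question here are placeholders, not the benchmark’s actual content):

```python
def build_niah_prompt(context_len_chars: int, depth: float) -> str:
    """Hide a 'needle' sentence at a relative depth inside filler text."""
    needle = "The secret passphrase is 'blue-falcon-42'."               # made-up needle
    filler = ("The sky was clear and the market was quiet. " * 20_000)  # made-up haystack
    haystack = filler[:context_len_chars]
    cut = int(len(haystack) * depth)             # 0.0 = start of context, 1.0 = end
    return (haystack[:cut] + " " + needle + " " + haystack[cut:]
            + "\n\nQuestion: What is the secret passphrase?")

# Sweep context length and needle depth, as in the heatmaps shown below.
prompts = [build_niah_prompt(n, d) for n in (2_000, 8_000, 32_000) for d in (0.0, 0.5, 1.0)]
```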

The “Recall” Problem

While LLaMA (Softmax Attention) excels at looking back at specific tokens regardless of how far back they are, linear models struggle. Because linear models compress context into a fixed-size recurrent state, information can get “diluted” over time.

This equation shows the Softmax attention mechanism. Notice how it computes interactions between Queries (\(Q\)) and Keys (\(K\)) explicitly:

Equation showing Softmax attention calculation.
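Written out in its standard form (the paper’s exact notation may differ in details such as scaling and masking), with \(d\) the head dimension:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \]

The \(QK^{\top}\) product is the \(n \times n\) matrix of pairwise token interactions, which is exactly where the quadratic cost comes from.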

In contrast, linear recurrence works by updating a running state. It doesn’t re-scan the history; it just updates the memory:

Equation showing linear recurrence calculation.
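A generic form of such a recurrence (the exact parameterization, gating, and normalization differ across TNL, HGRN2, and cosFormer2) is:

\[ S_t = \lambda_t \, S_{t-1} + k_t v_t^{\top}, \qquad o_t^{\top} = q_t^{\top} S_t \]

Here \(S_t\) is the fixed-size state; the decay \(\lambda_t\) is a fixed constant for a data-independent scheme like TNL, computed from the input for a data-dependent scheme like HGRN2, and simply \(1\) when there is no decay, as in cosFormer2. Whatever cannot be kept in \(S_t\) is lost, which is where the “dilution” comes from.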

Visual Proof: The Heatmaps

The difference is stark when visualized. Below is a heatmap for LLaMA 7B in “Standard Mode” (retrieval + comprehension). Green indicates success (score 10), and red indicates failure. LLaMA is mostly green, successfully retrieving information across various depths and lengths.

Heatmap of Needle In A HayStack for LLaMA 7B showing mostly green successful retrieval.

Now, look at HGRN2 7B (a linear model) on the same task. While it performs decently in some areas (the green patches), there are significant red zones where it fails to retrieve the information, particularly as the task gets more complex.

Heatmap of Needle In A HayStack for HGRN2 7B showing mixed results with red failure zones.

Similarly, TNL 7B shows a struggle to maintain consistent retrieval across the full context window compared to the Transformer baseline.

Heatmap of Needle In A HayStack for TNL 7B showing struggling performance.

The researchers concluded that while linear models are excellent at general language modeling and reasoning, they lack the “Going Through a Book” (GTB) capability—the ability to re-scan exact previous inputs—which Transformers possess inherently.

Architecture Nuances: Shape Matters

The paper highlights another interesting divergence from traditional scaling laws: Aspect Ratio Sensitivity.

For Transformers like LLaMA, the exact shape of the model (how deep vs. how wide it is) usually doesn’t matter much, provided the total parameter count is the same. However, linear models proved to be much more sensitive to this.
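As a rough illustration of what “same size, different shape” means, here is a toy parameter count using the common approximation of about \(12 \cdot d^2\) parameters per Transformer block (embeddings ignored; the configurations are made up, not the paper’s):

```python
def approx_params(num_layers: int, hidden_dim: int) -> int:
    """Very rough count: ~12 * d^2 per layer (attention + MLP), embeddings ignored."""
    return 12 * num_layers * hidden_dim ** 2

deep_narrow  = approx_params(num_layers=32, hidden_dim=2048)   # aspect ratio 2048/32 = 64
shallow_wide = approx_params(num_layers=8,  hidden_dim=4096)   # aspect ratio 4096/8 = 512

# Same parameter budget (~1.6B each), very different shape.
print(f"{deep_narrow:,} vs {shallow_wide:,}")
```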

As shown in the table below, pushing the hidden dimension too high (and consequently reducing the number of layers) caused a collapse in retrieval performance for linear models like cosFormer2, while LLaMA remained relatively stable.

Benchmark of Aspect Ratio and Model Capacity showing sensitivity of linear models.

Conclusion and Implications

This research provides a solid foundation for the future of efficient LLMs. The key takeaways are:

  1. Predictability: We can now confidently scale linear models using established power laws.
  2. Efficiency Wins: For general language generation and reasoning, linear models offer a faster, more efficient alternative to Transformers without sacrificing quality.
  3. The Retrieval Gap: The primary hurdle remaining is precise information retrieval from long contexts. Linear models compress history, while Transformers preserve it exactly.

The implications are significant. For applications like chatbots, code generation, and creative writing where “gist” and reasoning are more important than photographic memory, linear models (like HGRN2 or TNL) are ready for the big leagues. However, for tasks requiring precise citation from massive documents, the traditional Transformer reigns supreme—for now.