For the last decade, progress in deep learning has been a tale of scale: larger models, longer contexts, and wider feature spaces. That scaling has unlocked impressive capabilities, but it’s running into a practical limit—compute cost. At the heart of many state-of-the-art models are operations whose time and memory cost grow quadratically: attention over sequence length \(N\) and dense MLPs over feature dimension \(d\). Doubling the sequence length or the model width can quadruple compute. Quadratic growth quickly becomes prohibitive as we push to longer contexts and wider networks.

MONARCH MIXER (M2), a recent architecture from researchers at Stanford and the University at Buffalo (SUNY), asks a simple but bold question: can we design a single, hardware-friendly primitive that mixes information both across sequence and across feature dimensions while scaling sub-quadratically in both axes? The answer: yes. The building block is the Monarch matrix, a structured, GEMM-friendly object that generalizes transforms like the FFT and can be evaluated with block matrix multiplications. Stacking Monarch-based mixers into a model—M2—yields an architecture that is sub-quadratic, efficient on modern GPUs, and surprisingly competitive with Transformers across language and vision benchmarks.

This post unpacks the intuition, the math, and the systems thinking behind M2. I’ll focus on three things you need to know:

  • What Monarch matrices are and why they are both expressive and fast on hardware.
  • How the M2 layer mixes sequence and feature information with sub-quadratic cost.
  • The key theoretical trick used to make M2 causal (autoregressive) while preserving sub-quadratic scaling.

Along the way we’ll look at empirical evidence showing that M2 matches or exceeds Transformer baselines in several tasks while using fewer parameters or delivering much higher throughput at long contexts.

Figure 1: Monarch matrices are constructed as products of block-diagonal factors interleaved with permutations. M2 applies Monarch-based mixing first along the sequence axis and then along the feature axis, using only matrix multiplies, reshapes, transposes, and pointwise operations (GEMM-friendly primitives).

Why focus on structured matrices? Because the performance of a neural primitive on accelerators depends on two orthogonal things: asymptotic algorithmic complexity and how well it maps to fast hardware primitives (GEMMs / tensor cores). Attention and dense MLPs are powerful but suffer from quadratic complexity in the dimension they mix. FFT-based long convolutions are asymptotically attractive (\(O(N\log N)\)) but in practice are memory-bound and barely use the GPU’s FLOP capacity. Monarch matrices aim to hit a sweet spot: sub-quadratic asymptotics while being implementable with high-FLOP GEMMs.

A quick reminder about how primitives behave on GPUs

  • Compute-bound ops: dominated by arithmetic. Dense GEMMs fall here and can achieve very high FLOP utilization on modern tensor cores.
  • Memory-bound ops: dominated by data movement (loads/stores) and suffer when arithmetic intensity is low. Attention and FFTs are often memory-bound in practice.

The design target for M2: sub-quadratic asymptotics + high arithmetic intensity (so GEMMs can be used efficiently).
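
As a rough back-of-the-envelope illustration (assuming each operand is read from and written to memory exactly once), compare the arithmetic intensity of a dense \(N\times N\) GEMM with an elementwise product of two length-\(N\) vectors:

\[ \underbrace{\frac{2N^{3}\ \text{FLOPs}}{3N^{2}\ \text{words moved}}}_{\text{GEMM}} \;\approx\; \tfrac{2}{3}N \qquad\text{vs.}\qquad \underbrace{\frac{N\ \text{FLOPs}}{3N\ \text{words moved}}}_{\text{elementwise}} \;=\; \tfrac{1}{3}. \]

The GEMM’s intensity grows with \(N\), so tensor cores stay fed; the elementwise product (and, roughly speaking, each FFT stage) moves about one word per arithmetic operation and ends up waiting on memory.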

A comparative snapshot of typical mixers (cost and utilization):

Figure 2: Example FLOP cost vs. utilization for common mixer layers (measured at large input sizes). MLPs are compute-bound and highly efficient but scale quadratically in dimension; FFT-based convolutions have great asymptotics but low utilization; M2 offers sub-quadratic scaling with much higher GEMM utilization.

What are Monarch matrices?

At a high level, an order-\(p\) Monarch matrix \(\mathbf{M} \in \mathbb{R}^{N\times N}\) is expressed as a product of permutations and block-diagonal matrices:

\[ \mathbf{M} = \left(\prod_{i=1}^{p} \mathbf{P}_i \mathbf{B}_i\right)\mathbf{P}_0. \]

Each \(\mathbf{B}_i\) is block-diagonal: it splits the input into many independent, small blocks and applies a small dense matrix to each. The permutations \(\mathbf{P}_i\) rearrange elements between block multiplications. If you pick the block size appropriately (e.g., \(b = \sqrt[p]{N}\), so each block-diagonal factor holds \(N/b\) blocks of size \(b\times b\)), the total cost of evaluating \(\mathbf{M}\mathbf{x}\) becomes sub-quadratic in \(N\); for \(p=2\) and \(b=\sqrt{N}\), the cost is \(O(N^{3/2})\). Larger \(p\) lets you push toward \(O(N\log N)\)-like behavior while still maintaining GEMM-friendly structure.
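
Concretely, with block size \(b=N^{1/p}\) each block-diagonal factor consists of \(N/b\) dense \(b\times b\) blocks, so (ignoring the permutations) one Monarch matrix-vector product costs about

\[ p \cdot \underbrace{\tfrac{N}{b}}_{\text{blocks per factor}} \cdot \underbrace{b^{2}}_{\text{work per block}} \;=\; p\,N\,b \;=\; p\,N^{\frac{p+1}{p}} \quad \text{multiply-adds, e.g. } 2N^{3/2} \text{ for } p=2. \]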

Why is this hardware-friendly? Because although \(\mathbf{M}\) is a dense \(N\times N\) transform in effect, you never materialize the full dense matrix. You perform many small dense multiplications (the block GEMMs) and permutations—operations that map well to highly-optimized BLAS libraries and tensor cores. That keeps arithmetic intensity high and reduces the penalty from memory-bound permutations as input size grows.
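
To make the reshape-transpose-block-GEMM recipe concrete, here is a minimal PyTorch sketch (not the paper’s implementation) of an order-2 Monarch matrix-vector product; the function and tensor names are illustrative, and the exact placement of the permutations varies across parameterizations.

```python
import torch

def monarch_matvec(x, blocks1, blocks2):
    """Apply an order-2 Monarch matrix to a vector x of length N = b * b.

    blocks1, blocks2: [b, b, b] tensors holding the b dense b-by-b blocks of
    the two block-diagonal factors. Permutations are realized as reshapes and
    transposes, so the whole op is two batched GEMMs plus data movement; the
    dense N-by-N matrix is never materialized.
    """
    b = blocks1.shape[0]
    X = x.reshape(b, b)                         # split x into b chunks of size b
    X = X.transpose(0, 1)                       # permutation: interleave chunks
    X = torch.einsum('kij,kj->ki', blocks1, X)  # block-diagonal factor 1 (batched GEMV)
    X = X.transpose(0, 1)                       # permutation
    X = torch.einsum('kij,kj->ki', blocks2, X)  # block-diagonal factor 2 (batched GEMV)
    return X.reshape(-1)

# Roughly 2 * b * b^2 = 2 * N^(3/2) multiply-adds instead of N^2 for a dense matvec.
b = 32
x = torch.randn(b * b)
y = monarch_matvec(x, torch.randn(b, b, b), torch.randn(b, b, b))
print(y.shape)  # torch.Size([1024])
```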

The M2 layer: a simple two-stage mixer

M2 adopts the mixer view: each layer mixes along the sequence axis and then along the model (feature) axis. Let \(\mathbf{X}\in\mathbb{R}^{N\times d}\) be an input (sequence length \(N\), model dimension \(d\)). A basic M2 layer (order-2 description) looks like:

Sequence mixing (generalized gated convolution):

\[ \tilde{\mathbf{X}} = \mathbf{M}_2\big(\mathbf{K}_1 \odot (\mathbf{M}_1 \mathbf{X})\big), \]

Dimension mixing (Monarch-based MLP replacement):

\[ \mathbf{Y}^\top = \mathbf{M}_4\;\sigma\big(\mathbf{M}_3 \tilde{\mathbf{X}}^\top\big). \]

Here \(\mathbf{M}_1, \mathbf{M}_2\in\mathbb{R}^{N\times N}\) and \(\mathbf{M}_3,\mathbf{M}_4\in\mathbb{R}^{d\times d}\) are Monarch matrices (order chosen by design). \(\mathbf{K}_1\) is a kernel tensor, \(\odot\) denotes elementwise multiplication, and \(\sigma\) is a pointwise nonlinearity (e.g., ReLU or GeLU). If you set \(\mathbf{M}_1\) to the DFT and \(\mathbf{M}_2\) to the inverse DFT, the first equation is exactly a (long) convolution computed via frequency-domain filtering; replacing DFTs by learned Monarch factors generalizes this idea while keeping GEMM-friendly structure.

The implementation is pleasantly compact: an M2 layer requires only reshapes, transposes, block GEMMs, and elementwise operations. Because the heavy lifting is block-GEMM, the implementation attains high FLOP utilization on modern GPUs.
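
Under the simplifying assumption that \(N\) and \(d\) are perfect squares, the two stages above can be sketched in a few lines of PyTorch. This is an illustrative, non-causal sketch; the helper `monarch_apply` and the toy shapes are not the paper’s reference code.

```python
import torch

def monarch_apply(blocks1, blocks2, X):
    """Apply an order-2 Monarch matrix (block GEMM / transpose / block GEMM)
    to every column of X, where X has shape [n, m] and n = b * b."""
    b, m = blocks1.shape[0], X.shape[1]
    Z = X.reshape(b, b, m).transpose(0, 1)        # permutation
    Z = torch.einsum('kij,kjm->kim', blocks1, Z)  # block-diagonal factor 1
    Z = Z.transpose(0, 1)                         # permutation
    Z = torch.einsum('kij,kjm->kim', blocks2, Z)  # block-diagonal factor 2
    return Z.reshape(b * b, m)

def m2_layer(X, M1, M2, M3, M4, K1):
    """One M2 layer. X: [N, d]; M1/M2 mix the sequence axis, M3/M4 the feature axis.

    M1, M2: pairs of [bn, bn, bn] block tensors with N = bn * bn.
    M3, M4: pairs of [bd, bd, bd] block tensors with d = bd * bd.
    K1:     [N, d] kernel for the elementwise product.
    """
    # Sequence mixing: generalized (gated) convolution along the length axis.
    X_tilde = monarch_apply(*M2, K1 * monarch_apply(*M1, X))
    # Dimension mixing: Monarch-based MLP replacement along the feature axis.
    Y_t = monarch_apply(*M4, torch.relu(monarch_apply(*M3, X_tilde.t())))
    return Y_t.t()

# Toy sizes: N = 64 (bn = 8), d = 16 (bd = 4).
bn, bd = 8, 4
rand_monarch = lambda b: (torch.randn(b, b, b), torch.randn(b, b, b))
X = torch.randn(bn * bn, bd * bd)
Y = m2_layer(X, rand_monarch(bn), rand_monarch(bn), rand_monarch(bd), rand_monarch(bd),
             torch.randn(bn * bn, bd * bd))
print(Y.shape)  # torch.Size([64, 16])
```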

Real performance — when does sub-quadratic win?

Monarch’s asymptotic promises matter most for large \(N\). The authors implemented a straightforward M2 operator (using cuBLAS for block GEMMs and custom permutations) and measured FLOP utilization and running time against dense matmuls. On large inputs the permutation overhead is amortized away and the sub-quadratic FLOP count dominates, so the M2 operator outperforms dense multiplication by orders of magnitude:

Figure 3: M2 FLOP cost and measured FLOP utilization vs. dense matmul at different input sizes on NVIDIA A100 and an RTX 4090. For small \(N\) the permutation overhead hurts, but for large \(N\) the sub-quadratic scaling wins and M2 runs much faster.

A few takeaways:

  • For short sequences, existing attention and dense MLP implementations (heavily optimized) still win because the fixed cost of permutations matters more.
  • For long sequences (thousands to tens of thousands of tokens), M2’s \(O(N^{3/2})\) (or better, depending on \(p\)) complexity outperforms dense operations by large margins. On GPUs with larger on-chip caches (the RTX 4090 in their tests), optimized M2 implementations exceed 40% FLOP utilization on very large inputs.

Making M2 causal: the polynomial perspective

A big technical hurdle arises when you want to use M2 for autoregressive (causal) modeling: masking the future can turn a sub-quadratic primitive into a quadratic bottleneck if handled naively. The authors’ elegant solution is to reinterpret Monarch multiplication as multivariate polynomial evaluation and interpolation. That provides a clean algebraic handle on causality.

Key ideas (intuitive sketch)

  • Order-2 Monarch multiplication \(\mathbf{M}\mathbf{u}\) can be seen as evaluating a bivariate polynomial \(u(X,Y)\) on a \(\sqrt{N}\times\sqrt{N}\) grid of points (e.g., roots of unity). The Monarch factors define the basis polynomials.
  • The M2 convolution-like operation \[ \mathbf{f} = \mathbf{M}_0^{-1}\big((\mathbf{M}_1\mathbf{k}) \odot (\mathbf{M}_2\mathbf{u})\big) \] corresponds to forming polynomials \(k(X,Y)\), \(u(X,Y)\), multiplying them to get \(h(X,Y)=k(X,Y)u(X,Y)\) modulo the evaluation polynomial, and interpolating back to coefficient space.
  • The trouble is that multiplication modulo \(X^b-1, Y^b-1\) causes circular wrap-around, so an output coefficient can depend on future input coefficients. To prevent that, the authors constrain the minimum and maximum degrees of the basis polynomials so that the product of two basis polynomials \(q_j(Z)q_{j'}(Z)\) only contributes to basis elements with index \(a\ge j+j'\). This yields a triangular dependency pattern, which is exactly causality (the univariate analog is sketched in the check below).
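
The degree bookkeeping is easiest to see in one variable. The short NumPy check below (an illustration of the wrap-around issue, not the paper’s bivariate construction) shows that ordinary polynomial multiplication is automatically causal, since \(h_i=\sum_{j\le i}k_{i-j}u_j\), whereas multiplication modulo \(X^N-1\) (an unpadded length-\(N\) FFT) lets a future input leak into earlier outputs:

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
u, k = rng.standard_normal(N), rng.standard_normal(N)

def poly_mult_fft(kernel, signal, length):
    """Multiply the polynomials k(X) and u(X) using an FFT of the given length."""
    return np.fft.ifft(np.fft.fft(kernel, length) * np.fft.fft(signal, length)).real

h_linear   = poly_mult_fft(k, u, 2 * N)[:N]  # zero-padded: no wrap-around (causal)
h_circular = poly_mult_fft(k, u, N)          # modulo X^N - 1: circular wrap-around

# Perturb the last ("future") input and see which outputs change.
u2 = u.copy()
u2[-1] += 1.0
d_linear   = np.abs(poly_mult_fft(k, u2, 2 * N)[:N] - h_linear)
d_circular = np.abs(poly_mult_fft(k, u2, N) - h_circular)
print(d_linear[:-1].max())    # effectively 0 (FFT round-off): outputs 0..N-2 unaffected
print(d_circular[:-1].max())  # O(1): early outputs now depend on the future input
```

The causal M2 parameterization plays the same game in two variables, choosing degree ranges for the Monarch basis polynomials so that no such wrap-around can occur.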

Visually:

Figure 4: Interpreting Monarch multiplication as polynomial evaluation/interpolation enables a parameterization that enforces causality: degree constraints guarantee products only affect later coefficients.

The end result is a constructive theorem: with a specific assignment of minimum and maximum polynomial degrees to the Monarch factors (and after embedding the length-\(N\) input sequence into a somewhat larger vector in a controlled way), the M2 convolution is provably causal and can be computed using the same block-GEMM primitive, i.e., sub-quadratically.

Experiments: can M2 compete?

The paper evaluates M2 in three settings where Transformers excel: non-causal masked language modeling (BERT), image classification (ViT), and causal language modeling (GPT-like pretraining). The results are compelling.

  1. Non-causal language modeling — M2-BERT

The authors build M2-BERT by replacing the attention blocks with a bidirectional gated convolution implemented via Monarch primitives and replacing the dense MLP matrices with small block-diagonal Monarch matrices. They train BERT-equivalent models (parameter-reduced and parameter-matched variants) on standard corpora and evaluate on GLUE.

  • A compact M2-BERT-base (80M parameters) matches BERT-base (110M) on average GLUE, a ~27% parameter reduction.
  • When matched on parameters, an M2-BERT-base variant outperforms BERT-base by +1.3 GLUE points.
  • For larger models, M2-BERT-large matches or slightly exceeds BERT-large quality with fewer parameters.

Architecture sketch:

Figure 5: M2-BERT replaces attention with a gated Monarch-based long convolution in the sequence mixer and replaces the MLP’s dense matrices with block-diagonal Monarch matrices in the dimension mixer.

Beyond quality, the throughput wins at long contexts are dramatic. On an A100 GPU, the M2-BERT-base model achieves higher tokens/ms than HuggingFace BERT and FlashAttention BERT as sequence length grows; at sequence length 4096 it reaches up to a 9.1× speedup over a standard HuggingFace BERT implementation and ~3× over FlashAttention in large contexts. On CPU inference, M2-BERT becomes faster than BERT as sequence length grows (advantages appear from ~1K tokens upwards), with up to ~6.5× faster inference at very long contexts on the tested machine.

Average GLUE and throughput visual:

Figure 6: M2-BERT matches or exceeds BERT performance with fewer parameters. The two numbers reflect a parameter tradeoff: a smaller M2 can match a larger BERT, and a parameter-matched M2 can exceed BERT.

Figure 7: Tokens/ms as a function of context length. M2-BERT maintains high throughput as sequence length grows, demonstrating the benefits of sub-quadratic scaling.

  2. Vision — M2-ViT

To check modality generality, the authors adapt M2 to Vision Transformers. By replacing Hyena’s long convolutions (which use FFTs) and ViT MLPs with Monarch-based equivalents, they build M2-ViT. The headline result: M2-ViT achieves slightly higher top-1 accuracy on ImageNet-1k than ViT-base while using roughly half the parameters in the comparable configuration. M2-ViT also outperforms HyenaViT-b and several Monarch-augmented ViT variants, showing these ideas translate outside language.

ImageNet result snapshot:

Figure 8: M2-ViT obtains competitive or better ImageNet performance while reducing parameter count (compared to the ViT baseline).

  3. Causal language modeling — M2-GPT

This is the most striking proof-of-concept. The authors construct M2-GPT, a completely attention-free, MLP-free autoregressive model that uses the causal polynomial parameterization of Monarch matrices. They pretrain on The Pile and evaluate perplexity. Results (selected):

  • At ~360M parameters, M2-GPT matches or slightly improves on the perplexity of both a Transformer baseline and Hyena (a state-of-the-art attention-free model) across the pretraining regimes tested.
  • At smaller scales, M2-GPT shows consistent gains over Hyena and matches Transformers.

Perplexity comparison:

Figure 9: M2-GPT perplexity vs. token budget. M2-GPT reaches perplexities competitive with Transformers and Hyena models, supporting the claim that an attention- and MLP-free model can match GPT-style pretraining performance.

Why these results matter

  • Sub-quadratic primitives can be practical: M2 shows you can get the asymptotic benefits of structured transforms while preserving high hardware utilization, so the theory translates to real speedups at large scales.
  • One primitive for both axes: M2 replaces both attention (sequence mixing) and dense MLPs (feature mixing) with the same structured building block. That simplifies reasoning about scaling and optimization.
  • Causality without a quadratic bottleneck: the polynomial view is a conceptually clean way to parameterize causal operations while avoiding the naive \(O(N^2)\) masking cost.
  • Versatility across modalities: M2 works for language and vision and can be combined with convolutions and gating ideas used in recent attention-free models.

Caveats and future work

  • Implementation maturity: the prototypes in the paper show promising utilization, but further systems work (kernel fusion, better permutation kernels, fused block-GEMMs) could substantially improve short-sequence performance and lower overheads.
  • Memory and constant factors: Monarch’s sub-quadratic asymptotics require careful block sizing and may involve memory layout tradeoffs. The best block size depends on hardware cache behavior.
  • Wider applicability: while experiments span key tasks, it remains to be seen how M2 performs across the full diversity of Transformer applications (multimodal models, reinforcement learning, generation at extreme scales).
  • Causality padding: the causal parameterization sometimes requires embedding the input into a slightly larger space. For \(p>2\) the current constructions can imply a larger embedding factor; reducing that blow-up is an interesting open problem.

Conclusion

MONARCH MIXER takes a refreshing, system-aware approach to architectural design: start from a structured linear-algebraic primitive that is both expressive and implementable as high-throughput GEMMs, and then build the model around that primitive. By grounding the architecture in block-GEMMs and using an algebraic (polynomial) lens to enforce causality, the authors show that sub-quadratic scaling need not be purely theoretical—it can lead to practical models that match or exceed Transformer baselines on quality while delivering large throughput wins for long contexts.

If you’re interested in alternatives to attention, or in designing primitives tuned to modern accelerator stacks, M2 is an important new point in the design space: it demonstrates that with the right structure and the right systems thinking, we can scale both sequence length and model dimension more cheaply than before.

Further reading and code pointers

  • The full paper contains proofs, PyTorch listings for compact M2 implementations, and additional experiments (speech, CIFAR, roofline analysis) that dig deeper into the tradeoffs and implementation details. If you want to explore or reproduce the experiments, start with the repository linked from the paper.

Acknowledgment: this post summarizes and explains the ideas of the paper “MONARCH MIXER: A Simple Sub-Quadratic GEMM-Based Architecture” (Fu et al., 2023). For full technical details, proofs, and experimental setups, refer to the original manuscript.