Introduction
In the current landscape of Deep Learning, we are witnessing a massive tug-of-war between two fundamental properties: parallelization and expressivity.
On one side, we have Transformers and State-Space Models (SSMs) like Mamba. These architectures dominate because they are highly parallelizable during training: you can feed them a sequence of text, and they process all tokens simultaneously on GPUs. However, there is a catch. Theoretically, these models belong to a restricted complexity class (specifically \(TC^0\)) and cannot fully solve inherently sequential problems, such as tracking the state of a finite state machine (FSM) or computing the parity of a long bit string. They suffer from a “depth” limit.
On the other side, we have classical Recurrent Neural Networks (RNNs). RNNs process data sequentially, updating a hidden state step-by-step. This makes them incredibly expressive for state-tracking problems—they can theoretically model any algorithm that an FSM can. But, they are painfully slow to train because you cannot parallelize the sequential dependency (step \(t\) waits for step \(t-1\)).
This leads to a fundamental question: Must we sacrifice the ability to reason sequentially to gain training speed?
A new research paper, Implicit Language Models are RNNs: Balancing Parallelization and Expressivity, suggests the answer is no. The researchers propose Implicit SSMs. By iterating a transformation until it converges to a “fixed point,” these models behave like parallelizable Transformers during training but act like infinitely deep, non-linear RNNs during inference.
As shown in the figure below, the results are striking. While standard models like Mamba2 fail to generalize to longer sequences or solve hard logic puzzles (\(S_5\)), Implicit Mamba2 maintains accuracy regardless of sequence length.

In this post, we will tear down this paper to understand how Implicit Models work, the mathematics behind their “infinite” depth, and how they were scaled to train Large Language Models (LLMs) with 1.3 billion parameters.
Background: The Illusion of State
To understand why Implicit Models are necessary, we first need to understand the limitations of current architectures.
The Limitation of Explicit Models
Most modern language models are “explicit.” This means they consist of a fixed stack of layers (e.g., 32 layers in Llama-2-7B). When a token passes through the network, it undergoes a finite, predetermined number of non-linear transformations.
Recent theoretical work has shown that this finite depth creates a ceiling on computational power. Specifically, Transformers and SSMs struggle with state tracking. Imagine a problem where you need to track the location of an object as it is moved between boxes A, B, and C over a sequence of 100 moves. If the logic requires updating a state based strictly on the previous state, standard parallel models often fail or require a number of layers that scales with the input length.
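To make this concrete, here is a tiny illustrative script (a toy illustration, not from the paper) for the box-tracking task: the answer at every step depends on the entire prefix of moves, so it can only be computed by carrying a state forward.

```python
import random

# Toy box-tracking task (illustrative, not from the paper): an object starts in
# box "A" and a sequence of swaps is applied; its location at step t depends on step t-1.
BOXES = ["A", "B", "C"]

def random_move():
    """One move swaps the contents of two distinct boxes."""
    return tuple(random.sample(BOXES, 2))

def track(moves, start="A"):
    """Carry the location forward one move at a time; there is no way to skip ahead."""
    location, states = start, []
    for a, b in moves:
        if location == a:
            location = b
        elif location == b:
            location = a
        states.append(location)
    return states

moves = [random_move() for _ in range(100)]
print(track(moves)[-1])   # the answer after 100 moves requires composing every move in order
```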
Classical RNNs don’t have this problem because their state evolves sequentially. However, as noted, we stopped using them because they don’t scale on modern hardware.
Deep Equilibrium Models (DEQs)
The researchers drew inspiration from Deep Equilibrium Models (DEQs). Unlike a standard network where \(y = f_3(f_2(f_1(x)))\), a DEQ defines its output implicitly. It asks: “What is the vector \(z^*\) such that if I run it through the network again, it doesn’t change?”
Mathematically, we look for the fixed point \(z^*\) of a function \(F_\theta\):
\[
z^{*} = F_\theta\!\left(z^{*},\, x\right)
\]
where \(x\) is the input to the layer.
Finding this \(z^*\) usually involves iterating the function over and over (self-iteration) until the values stabilize. This implies that the “effective depth” of the network adapts to the difficulty of the input. Easy inputs converge fast; hard inputs take longer.
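In code, the idea is just repeated application until the iterates stop moving. The sketch below uses a generic contractive toy layer (not the paper's architecture) to show how the number of iterations, i.e., the effective depth, adapts to the input.

```python
import numpy as np

def fixed_point(f, x, z0, tol=1e-6, max_iter=200):
    """Iterate z <- f(z, x) until successive iterates stop changing; return z* and the step count."""
    z = z0
    for step in range(1, max_iter + 1):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iter

# A toy contractive "layer" f(z, x) = tanh(W z + U x); small weights keep the map a contraction.
rng = np.random.default_rng(0)
W = 0.3 * rng.standard_normal((8, 8)) / np.sqrt(8)
U = rng.standard_normal((8, 8)) / np.sqrt(8)
f = lambda z, x: np.tanh(W @ z + U @ x)

x = rng.standard_normal(8)
z_star, steps = fixed_point(f, x, np.zeros(8))
print(steps, np.linalg.norm(f(z_star, x) - z_star))   # the "effective depth" adapts to the input
```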
The Core Method: Implicit SSMs
The researchers propose merging the architecture of State-Space Models (specifically Mamba2) with the infinite-depth philosophy of DEQs.
The Architecture
In a standard SSM, a hidden state \(h_t\) is updated using a linear recurrence: \(h_t = \Lambda h_{t-1} + u_t\). This linearity is what allows SSMs to be parallelized (using algorithms like parallel scan). But it is also what limits their expressivity.
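For intuition, here is a toy single-channel version of this recurrence (a sketch, not Mamba2's actual kernel). The second function rewrites it using the associative combine \((\lambda_1, u_1) \circ (\lambda_2, u_2) = (\lambda_1\lambda_2,\ \lambda_2 u_1 + u_2)\); because the operator is associative, a parallel prefix scan can evaluate all \(T\) states in \(O(\log T)\) parallel steps.

```python
import numpy as np

def ssm_sequential(lam, u, h0=0.0):
    """The literal recurrence h_t = lam_t * h_{t-1} + u_t (one channel for clarity)."""
    h, out = h0, []
    for l, v in zip(lam, u):
        h = l * h + v
        out.append(h)
    return np.array(out)

def ssm_scan(lam, u):
    """The same recurrence written with the associative combine a parallel scan exploits:
       (l1, u1) o (l2, u2) = (l1 * l2, l2 * u1 + u2)."""
    out, acc = [], (1.0, 0.0)                    # (1, 0) is the identity element
    for l, v in zip(lam, u):                     # a real kernel would tree-reduce these pairs
        acc = (acc[0] * l, l * acc[1] + v)
        out.append(acc[1])
    return np.array(out)

rng = np.random.default_rng(0)
lam, u = rng.uniform(0.5, 0.99, 16), rng.standard_normal(16)
print(np.allclose(ssm_sequential(lam, u), ssm_scan(lam, u)))   # True: the two forms agree
```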
The Implicit SSM modifies this by introducing a “depth” variable \(s\). We process the sequence over time \(t\), but at each timestep, we also iterate “vertically” through depth \(s\) until convergence.
The update rule becomes a fixed-point iteration of the form:

\[
h_t^{(s)} = \Lambda\!\left(z_t^{(s-1)}\right) h_{t-1}^{(s)} + u_t\!\left(z_t^{(s-1)}\right), \qquad
z_t^{(s)} = g\!\left(h_t^{(s)}\right)
\]
Here:
- \(t\) is the time step (sequence position).
- \(s\) is the iteration step (depth).
- \(h_t^{(s)}\) is the hidden state.
- \(z_t^{(s)}\) is the layer output (or “thought” vector), produced from the hidden state by the output map \(g\).
Notice that the transition matrices \(\Lambda\) and inputs \(u\) now depend on the previous iteration’s output \(z_t^{(s-1)}\). This couples the state evolution to the depth iteration.
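To make the control flow concrete, here is a toy sketch of the coupled iteration in its simultaneous form. The parameterizations of \(\Lambda(\cdot)\), \(u_t(\cdot)\), and the output map \(g(\cdot)\) are invented for illustration; they are not the paper's Mamba2-based design.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                                  # toy state / output width
Wl, Wu, Wg = [0.3 * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]

Lam  = lambda z: 0.9 / (1.0 + np.exp(-(Wl @ z)))       # state-dependent diagonal decay in (0, 0.9)
u_fn = lambda z, x: np.tanh(Wu @ z) + x                # state-dependent input term for token x_t
g    = lambda h: np.tanh(Wg @ h)                       # output map producing the "thought" vector

def implicit_ssm(xs, tol=1e-8, max_iters=100):
    """Simultaneous form: one scan over time per depth iteration s, repeated until convergence."""
    T = len(xs)
    z = np.zeros((T, d))
    for s in range(1, max_iters + 1):                  # depth iterations
        h = np.zeros((T, d))
        h_prev = np.zeros(d)
        for t in range(T):                             # scan over time (parallelizable, as above)
            h[t] = Lam(z[t]) * h_prev + u_fn(z[t], xs[t])
            h_prev = h[t]
        z_new = np.array([g(h[t]) for t in range(T)])
        if np.max(np.abs(z_new - z)) < tol:            # fixed point reached
            return h, z_new, s
        z = z_new
    return h, z, max_iters

xs = rng.standard_normal((8, d))                       # a toy "sequence" of 8 token embeddings
h_star, z_star, depth = implicit_ssm(xs)
print(depth)                                           # effective depth chosen by convergence
```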
Two Modes of Operation: The “Duality”
One of the paper’s most significant contributions is defining two distinct ways to compute these fixed points. This duality allows the model to be efficient in different contexts.
1. Simultaneous Mode (Best for Training): In this mode, we iterate the entire sequence at once. We update all tokens (\(t=1\) to \(T\)) for iteration \(s=1\), then all tokens for \(s=2\), and so on. Because the underlying SSM core is parallelizable, each iteration \(s\) is fast. This allows the model to be trained efficiently on GPUs.
2. Sequential Mode (Best for Inference): In this mode, we solve the fixed point for token \(t=1\) completely (looping \(s\) until convergence), then pass the final state to \(t=2\), solve for \(t=2\), and so on. This behaves exactly like an RNN.
The image below visualizes this beautiful duality. The “Simultaneous” mode (A) allows trajectories to interact during convergence, while the “Sequential” mode (B) processes one token at a time.

The graph in Panel C confirms that both modes produce nearly identical results (a negligible perplexity difference), showing that they are functionally equivalent in practice.
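A scalar toy model makes this equivalence easy to check numerically. The sketch below uses made-up maps \(\mathrm{lam}\) and \(u\) (an illustrative construction, not the paper's parameterization); both schedules solve the same fixed-point equations, so they converge to the same answer.

```python
import numpy as np

# Toy scalar implicit recurrence (illustrative maps, not the paper's parameterization):
#   h_t = lam(z_t) * h_{t-1} + u(z_t, x_t),   z_t = tanh(h_t)
lam = lambda z: 0.4 + 0.05 * np.tanh(z)      # state-dependent decay, stays close to 0.4
u   = lambda z, x: 0.2 * np.tanh(z) + x      # state-dependent input term

def simultaneous(xs, tol=1e-10, max_iter=1000):
    """Training-style schedule: sweep the whole sequence once per depth iteration."""
    z = np.zeros_like(xs)
    for _ in range(max_iter):
        h, hs = 0.0, []
        for zt, xt in zip(z, xs):
            h = lam(zt) * h + u(zt, xt)
            hs.append(h)
        z_new = np.tanh(np.array(hs))
        if np.max(np.abs(z_new - z)) < tol:
            break
        z = z_new
    return z_new

def sequential(xs, tol=1e-10, max_iter=1000):
    """Inference-style schedule: solve each token's fixed point before moving on (RNN-like)."""
    h_prev, zs = 0.0, []
    for xt in xs:
        z = 0.0
        for _ in range(max_iter):
            h = lam(z) * h_prev + u(z, xt)
            z_new = np.tanh(h)
            if abs(z_new - z) < tol:
                break
            z = z_new
        h_prev, z = h, z_new
        zs.append(z)
    return np.array(zs)

xs = np.random.default_rng(0).standard_normal(16)
print(np.max(np.abs(simultaneous(xs) - sequential(xs))))   # ~0: both schedules reach the same fixed point
```

The choice between the two is purely a scheduling decision: simultaneous mode trades a few extra sweeps over the sequence for full parallel hardware utilization, while sequential mode keeps only a single token's state in memory at inference time.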
Theoretical Proof: It Is an RNN
Why go through this trouble? The researchers provide a theorem proving that at convergence (when \(s \to \infty\)), the linear limitations of the SSM vanish.
In the limit, the fixed point variables \(h^*\) and \(z^*\) satisfy:
\[
h_t^{*} = \Lambda\!\left(z_t^{*}\right) h_{t-1}^{*} + u_t\!\left(z_t^{*}\right), \qquad
z_t^{*} = g\!\left(h_t^{*}\right)
\]
The researchers derived the Jacobian (the rate of change) of the state \(h_t^*\) with respect to the previous state \(h_{t-1}^*\). In a standard SSM, this Jacobian is just the diagonal matrix \(\Lambda\): the transition is linear in the state. In an Implicit SSM, differentiating the fixed-point equations above gives a Jacobian of the form:

\[
\frac{\partial h_t^{*}}{\partial h_{t-1}^{*}}
= \Lambda\!\left(z_t^{*}\right)
\;+\;
\frac{\partial\!\left(\Lambda(z)\, h_{t-1}^{*} + u_t(z)\right)}{\partial z}\Bigg|_{z = z_t^{*}}
\,\frac{\partial z_t^{*}}{\partial h_{t-1}^{*}}
\]
This equation shows that the transition is non-linear and non-diagonal. The state evolution depends on complex interactions between the hidden state and the input. In layman’s terms: The Implicit SSM has theoretically transformed itself into a non-linear RNN, gaining the computational power to solve complex state-tracking problems that baffle standard Transformers.
We can visualize this non-linearity. The heatmap below compares the gradients derived via automatic differentiation (Autograd) vs. the theoretical formula above. The presence of off-diagonal elements (the colorful patterns) confirms the model is learning complex state dependencies.

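This check is easy to reproduce on a toy implicit cell. The sketch below (made-up maps and tiny dimensions, not the paper's setup) solves the per-token fixed point and asks autograd for \(\partial h_t^{*} / \partial h_{t-1}^{*}\); the result has non-zero off-diagonal entries, unlike the diagonal Jacobian of an explicit SSM.

```python
import torch

torch.manual_seed(0)
d = 4
Wl, Wu, Wg = [0.3 * torch.randn(d, d) / d**0.5 for _ in range(3)]

def step(h_prev, x, n_iters=100):
    """Solve the per-token fixed point  h = Lam(z) * h_prev + u(z) + x,  z = g(h)."""
    z = torch.zeros(d)
    for _ in range(n_iters):
        lam = 0.9 * torch.sigmoid(Wl @ z)        # state-dependent diagonal decay
        h = lam * h_prev + torch.tanh(Wu @ z) + x
        z = torch.tanh(Wg @ h)                   # the converged "thought" vector z*
    return h

h_prev, x = torch.randn(d), torch.randn(d)
J = torch.autograd.functional.jacobian(lambda hp: step(hp, x), h_prev)
print(J)   # non-diagonal: the effective state transition is a non-linear RNN step
```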
Training with Phantom Gradients
Training a model with “infinite” depth sounds memory-intensive. If you unroll a loop 100 times, you normally need to store the activations of all 100 steps to perform backpropagation, which would exhaust the memory of even the largest GPUs.
To solve this, the authors use Phantom Gradients. The technique builds on the Implicit Function Theorem: you do not need to backpropagate through the path the solver took to reach the fixed point; the gradient can be computed (or cheaply approximated) at the fixed point itself.

As shown above, the forward pass iterates until convergence (the loop on the left), but the backward pass (right) only considers a small, fixed number of steps (\(k\)) at the solution. This decouples memory usage from the number of iterations, allowing the model to “think” for as long as needed without consuming extra memory.
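In code, the recipe looks roughly like this (a generic \(k\)-step phantom-gradient sketch, not the authors' implementation): run the solver under no_grad, then record only the last \(k\) applications of the layer for backpropagation.

```python
import torch

torch.manual_seed(0)
d = 16
W = torch.nn.Parameter(0.3 * torch.randn(d, d) / d**0.5)
U = torch.nn.Parameter(torch.randn(d, d) / d**0.5)

def layer(z, x):
    """One application of the implicit layer (a generic contractive toy block)."""
    return torch.tanh(z @ W.T + x @ U.T)

def implicit_forward(x, tol=1e-5, max_iter=200, k=4):
    """Find the fixed point without building a graph, then record only k extra steps for backprop."""
    z = torch.zeros_like(x)
    with torch.no_grad():                        # memory does not grow with the solver's depth
        for _ in range(max_iter):
            z_next = layer(z, x)
            if (z_next - z).abs().max() < tol:
                break
            z = z_next
    for _ in range(k):                           # "phantom" steps: gradients flow only through these
        z = layer(z, x)
    return z

x = torch.randn(8, d)
loss = implicit_forward(x).pow(2).mean()
loss.backward()                                  # approximate (k-step) gradients land in W.grad, U.grad
print(W.grad.norm())
```

Memory now scales with the small constant \(k\) rather than with the number of solver iterations.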
Experiments and Results
The theory is sound, but does it work? The authors tested the model on synthetic logic puzzles and large-scale language modeling.
1. The \(S_5\) Word Problem
This is a benchmark specifically designed to break Transformers. It involves computing the composition of permutations from the symmetric group \(S_5\). It requires strict, non-commutative state tracking.
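One way to picture the task (a hypothetical data generator, not the paper's exact pipeline): each input token is a permutation of five elements, and the target at position \(t\) is the composition of all permutations seen so far.

```python
import random
from itertools import permutations

PERMS = list(permutations(range(5)))             # the 120 elements of S_5

def compose(p, q):
    """(p ∘ q)(i) = p(q(i)); composition in S_5 is non-commutative, so order matters."""
    return tuple(p[q[i]] for i in range(5))

def make_example(length):
    """Inputs: random permutations. Label at step t: the composition of tokens 1..t."""
    tokens = [random.choice(PERMS) for _ in range(length)]
    state, labels = tuple(range(5)), []          # start from the identity permutation
    for p in tokens:
        state = compose(p, state)                # fold the newest move into the running state
        labels.append(PERMS.index(state))
    return tokens, labels

tokens, labels = make_example(8)
print(labels)   # predicting the last label requires tracking the full group state
```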
Standard Mamba2 models fail this task as sequences get longer. They simply run out of “layers” to track the state changes.
The Implicit Mamba2, however, excels. The figure below (Left panel) shows high accuracy on Out-Of-Distribution (OOD) data. The Middle panel is particularly interesting: it shows that you only need a small cap on iterations during training (e.g., 8 iterations) to learn the general algorithm, which then generalizes to harder problems at test time.

The Right panel compares the implicit approach against “unrolled” Mamba (where you just stack layers). Implicit Mamba (Red) converges much faster and more reliably than unrolled versions.
2. Large Scale Language Modeling (1.3B Parameters)
The researchers didn’t stop at toy problems. They trained Implicit Mamba2 and Implicit Llama models up to 1.3 Billion parameters on 207B tokens of the Pile dataset. This is the largest implicit model trained to date.
Length Extrapolation: One of the most desirable features of a language model is the ability to train on short sequences and perform well on long ones. Standard models usually see their perplexity (a measure of how poorly they predict the next token) explode when processing sequences longer than their training window.
Figure 4 below demonstrates the Length Extrapolation capabilities. The implicit models (Red and Dark Red) maintain stable, low perplexity even as the token position extends far beyond the training context (the shaded gray area). The standard Mamba2 (Light Blue) degrades significantly.

Downstream Reasoning: On common sense reasoning tasks (like LAMBADA, HellaSwag, and ARC), the Implicit models generally outperformed their explicit baselines.
Interestingly, the authors also utilized the CatbAbI dataset—a benchmark for reasoning over long stories (e.g., “Where was Fred before the kitchen?”).

Panel (c) in Figure 13 shows that as story length increases, the accuracy of Implicit Mamba2 (Red) stays perfect, while explicit Mamba2 (Blue) collapses. Panel (d) reveals the cost: the implicit model automatically increases its number of iterations (“Test Steps”) to handle the increased difficulty. This is Adaptive Compute in action—the model thinks longer when the problem gets harder.
Conclusion & Implications
The paper Implicit Language Models are RNNs bridges a major divide in deep learning. For years, we believed we had to choose between the training efficiency of Transformers/SSMs and the state-tracking power of RNNs. This work shows that by framing the model layer as a fixed-point iteration, we can have both.
Key Takeaways:
- Implicit SSMs are RNNs: Through self-iteration, linear SSMs gain the non-linear state transitions of RNNs.
- Duality: You can train in “Simultaneous Mode” (Parallel/Fast) and run inference in “Sequential Mode” (Low Memory/RNN-like).
- Adaptive Compute: The model naturally uses more compute iterations for harder or longer sequences, solving the length generalization problem that plagues Transformers.
- Scalability: This isn’t just theory; it works at the 1.3B parameter scale with phantom gradients ensuring memory efficiency.
This approach hints at a future where Large Language Models aren’t just static predictors but dynamic systems that can “ponder” a difficult prompt until they converge on a coherent answer, all while retaining the efficiency required for massive pre-training.