If you have been following the recent developments in sequence modeling, you have likely heard of Mamba and State Space Models (SSMs). These architectures have emerged as powerful alternatives to Transformers, promising to solve the dreaded quadratic computational cost that plagues standard Attention mechanisms.
However, as we shift from Transformers to SSMs, we are discovering a friction point: our existing toolbox doesn’t always work. Specifically, the techniques we use to fine-tune Large Language Models (LLMs) efficiently—known as Parameter-Efficient Fine-Tuning (PEFT)—often fail when applied to Mamba.
In this post, we will dive into a recent research paper titled “State-offset Tuning: State-based Parameter-Efficient Fine-Tuning for State Space Models.” We will explore why popular methods like Prompt Tuning break down in SSMs and look at a novel, architecture-specific solution that outperforms existing techniques: State-offset Tuning.
The Problem: When Transformer Tools Don’t Fit SSMs
To understand the innovation of this paper, we first need to understand the problem.
In the world of Transformers, full fine-tuning (updating all parameters of a model) is prohibitively expensive for large models. This led to the rise of PEFT methods. Two of the most popular families of PEFT are:
- Parameter-based methods: Like LoRA (Low-Rank Adaptation), which injects trainable low-rank matrices into the model.
- Prompt-based methods: Like Prompt Tuning and Prefix-Tuning. These methods work by prepending “virtual tokens” (learnable vectors) to the input sequence.
Prompt-based methods work exceptionally well for Transformers. Because the Attention mechanism allows any token to attend to any other token, a prompt at the beginning of a sequence can influence the generation of a token thousands of steps later.
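For a concrete picture, here is a minimal sketch of Prompt Tuning for a Transformer in PyTorch (the `frozen_transformer` argument is a stand-in for any backbone that accepts input embeddings, not a specific library API):

```python
import torch
import torch.nn as nn

class PromptTuning(nn.Module):
    """Minimal Prompt Tuning sketch: only the virtual-token embeddings are trained."""
    def __init__(self, frozen_transformer: nn.Module, embed_dim: int, num_virtual_tokens: int = 16):
        super().__init__()
        self.model = frozen_transformer
        for p in self.model.parameters():   # freeze the backbone
            p.requires_grad = False
        # The only trainable parameters: a handful of "virtual token" embeddings.
        self.soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.shape[0]
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the learnable prompt; attention lets every later token see it.
        return self.model(torch.cat([prompt, input_embeds], dim=1))
```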
But SSMs are different.
During inference, SSMs are Recurrent Neural Networks (RNNs) at heart. They process data sequentially, updating a hidden state \(h_t\) at each step, and they have no global “attention map.” This leads to a phenomenon the researchers call forgetting: if you place a soft prompt at the beginning of a sequence in an SSM, its influence diminishes exponentially as the sequence progresses. The model effectively “forgets” the prompt instructions by the time it reaches the end of a long input.
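To see this decay numerically, here is a tiny toy example (a made-up scalar recurrence, not Mamba’s actual parameters) tracking how much a perturbation folded into the initial state still contributes after \(t\) steps:

```python
# Toy 1-D SSM: h_t = a * h_{t-1} + b * x_t, with |a| < 1 (a stable system).
# A soft prompt that only touches the *initial* state contributes a**t * h0
# to the state at step t -- an exponentially shrinking signal.
a, h0_prompt = 0.9, 1.0
for t in [1, 10, 100, 1000]:
    print(f"step {t:5d}: prompt contribution = {a**t * h0_prompt:.2e}")
# step     1: prompt contribution = 9.00e-01
# step    10: prompt contribution = 3.49e-01
# step   100: prompt contribution = 2.66e-05
# step  1000: prompt contribution = 1.75e-46
```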
This paper proposes a shift in perspective. Instead of trying to force prompt-based methods onto SSMs, the authors introduce a new family of techniques designed for the architecture: State-based PEFT.
The Core Concept: State-offset Tuning
The researchers introduce a method called State-offset Tuning. The intuition is simple yet profound: if the model tends to forget information introduced at the start, we should re-inject that adaptation signal at every single timestep.
The Architecture of Mamba and the Fix
Let’s look at how a standard Mamba (SSM) block works and how State-offset Tuning modifies it.
In a standard SSM, the hidden state \(h_t\) is updated based on the previous state \(h_{t-1}\) and the current input \(x_t\). The output \(y_t\) is then projected from this state.

As shown in Figure 1, State-offset Tuning keeps the massive pre-trained weights frozen (shown in blue). Instead, it introduces a small, trainable parameter vector, denoted as \(h'\) (shown in red).
Crucially, this offset \(h'\) is added to the calculation at every timestep.
- Standard Operation: The model computes the hidden state \(h_t\) based on standard SSM dynamics.
- The Intervention: The method adds the learnable offset \(h'\) to the state.
- The Result: The modified state is used to compute the output.
By injecting \(h'\) at every step, the method guarantees that the adaptation signal remains constant and doesn’t fade away, regardless of how long the sequence is.
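Here is a minimal sketch of that idea in code (a simplified, per-channel diagonal recurrence written from the equations in the next section, not the actual Mamba kernel; only the offset vector \(h'\) is trainable):

```python
import torch
import torch.nn as nn

class StateOffsetSSM(nn.Module):
    """Sketch of State-offset Tuning on a simplified (diagonal) SSM recurrence.

    Everything except `h_offset` is assumed to come from the frozen,
    pre-trained model; only the offset vector h' is trained.
    """
    def __init__(self, state_dim: int):
        super().__init__()
        # The small trainable vector h' -- the only new parameters.
        self.h_offset = nn.Parameter(torch.zeros(state_dim))

    def forward(self, A_bar, B_bar, C, x):
        # A_bar, B_bar, C: (seq_len, state_dim) discretized, input-dependent params (frozen)
        # x:               (seq_len,) input sequence for one channel
        seq_len, state_dim = A_bar.shape
        h = torch.zeros(state_dim)
        outputs = []
        for t in range(seq_len):
            h = A_bar[t] * h + B_bar[t] * x[t]          # standard state update
            y_t = (C[t] * (h + self.h_offset)).sum()    # offset injected at EVERY step
            outputs.append(y_t)
        return torch.stack(outputs)
```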
The Mathematical Foundation
To appreciate why this works, let’s briefly look at the equations governing SSMs. A discretized SSM typically follows this form:
\[h_t = \overline{A}_t\, h_{t-1} + \overline{B}_t\, x_t, \qquad y_t = C_t\, h_t\]
Here, \(\overline{A}\) governs how the state evolves (dynamics), and \(\overline{B}\) controls how the input influences the state.
Previous attempts to adapt SSMs used Initial State Tuning. This method optimized the starting state \(h_0\). While better than standard Prefix-Tuning, it still suffered from decay: the influence of the initial state is multiplied by \(\overline{A}\) at every step. Since \(\overline{A}\) usually stabilizes the system (its eigenvalues have magnitude less than 1), the effect of \(h_0\) vanishes over time.
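Unrolling the recurrence (a standard manipulation of linear recurrences, not unique to this paper) makes the decay explicit:
\[h_t = \left(\prod_{i=1}^{t} \overline{A}_i\right) h_0 + \sum_{k=1}^{t} \left(\prod_{i=k+1}^{t} \overline{A}_i\right) \overline{B}_k\, x_k, \qquad y_t = C_t\, h_t\]
Every factor \(\overline{A}_i\) shrinks the state, so the leading term, the only place the tuned \(h_0\) appears, fades exponentially with \(t\).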
State-offset Tuning changes the equation effectively to this:
\[\widehat{y}_t = C_t(h_t + h')\]
or, in a variation called State-offset Tuning (y):
\[\widehat{y}_t = y_t + y'\]
The table below summarizes the difference between Initial State Tuning (which decays) and State-offset Tuning (which is constant).
| Method | Extra term in \(\widehat{y}_t\) | Behavior over the sequence |
| --- | --- | --- |
| Initial State Tuning | \(C_t \left(\prod_{i=1}^{t} \overline{A}_i\right) h_0\) | Decays as \(t\) grows |
| State-offset Tuning | \(C_t\, h'\) (or \(y'\)) | Constant at every step |
Notice the top row of the table. Initial State Tuning involves the term \(\prod \overline{A}_i\), which represents the cumulative product of the state transition matrices. This is the mathematical culprit behind the “forgetting” problem. State-offset Tuning removes this dependency entirely.
Comparing PEFT Families
The researchers classify PEFT methods for SSMs into three buckets:
- Parameter-based: Modifying weights (e.g., LoRA).
- Prompt-based: Modifying inputs (e.g., Prefix-Tuning).
- State-based: Modifying internal states (The authors’ proposal).
The image below provides a fantastic visual comparison of how these methods interact with the S6 block (the core component of Mamba).

At the bottom of Figure 2, you see Prefix-Tuning. It prepends information before the sequence starts. This relies on the model carrying that information forward through the recurrent bottleneck.
At the top, you see State-offset Tuning. It operates inside the recurrence. It doesn’t ask the model to “remember” the prompt; it manually inserts the prompt’s intent into the brain of the model at every tick of the clock.
The “Iterative Suffix” Connection
The authors provide an interesting theoretical insight. They prove that State-offset Tuning is mathematically equivalent to a concept they call Iterative Suffix-Tuning.
Imagine that instead of putting a prompt at the start (Prefix), you put a prompt token at the end of the sequence (Suffix). In a recurrent model, the last token has the most influence. Now, imagine you re-inserted that suffix token at every single step of the sequence. This would force the model to pay attention to it constantly.
The paper demonstrates that learning a state offset \(h'\) is effectively the same as learning a virtual suffix token \(x_{suffix}\) that is re-introduced at every timestep. This theoretical link solidifies why State-offset Tuning is the “correct” way to do prompt-like adaptation in recurrent systems.
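As a rough sketch of why the two views coincide (a simplification, not the paper’s exact derivation; \(\overline{A}_{\text{sfx}}\) and \(\overline{B}_{\text{sfx}}\) denote the discretized parameters the model would produce for the suffix token, and we assume this extra step barely perturbs the carried state, i.e. \(\overline{A}_{\text{sfx}} \approx I\)):
\[\widehat{y}_t = C_t\left(\overline{A}_{\text{sfx}}\, h_t + \overline{B}_{\text{sfx}}\, x_{suffix}\right) \approx C_t\left(h_t + \underbrace{\overline{B}_{\text{sfx}}\, x_{suffix}}_{h'}\right)\]
Learning a suffix token and learning a state offset \(h'\) are then two parameterizations of the same constant correction.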
Experiments and Results
Does this theory hold up in practice? The authors tested State-offset Tuning on Mamba (130M to 2.8B parameters) and Mamba-2 across several datasets, including:
- Spider: A complex Text-to-SQL task (requires logic and syntax).
- SAMSum: Dialogue summarization.
- GLUE: A general language understanding benchmark.
They compared their method against full fine-tuning, LoRA, BitFit, Prompt Tuning, and Prefix-Tuning.
Performance Analysis
The results are summarized in Table 3 below.

Key Takeaways from the Data:
- Prompt Methods Fail: Look at the rows for “Prompt Tuning” and “Prefix-Tuning.” On difficult tasks like Spider, their performance is abysmal (e.g., Prompt Tuning gets 43.6 vs. Full Fine-tuning’s 66.2). This confirms the “forgetting” hypothesis.
- State-offset Tuning (h) Wins: The proposed method (second from bottom) achieves 57.4 on Spider, significantly beating LoRA (56.3) and dominating the prompt-based methods. It consistently ranks as the best or second-best method after full fine-tuning.
- Efficiency: The method “State-offset Tuning (y)” (bottom row) is particularly impressive. It adapts only the output projection bias. It uses only 0.01% of the parameters (compared to LoRA’s 0.46%) but still achieves highly competitive results, often beating LoRA on easier datasets like SAMSum.
Computational Overhead
One of the main selling points of LoRA is efficiency. However, LoRA introduces extra matrix multiplications. If you don’t merge the LoRA weights back into the main model (which is common when serving multiple users with different adapters), inference becomes slower.
State-offset Tuning is simply an element-wise addition of a vector. It is extremely cheap computationally.
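To make that concrete, here is a back-of-the-envelope comparison with hypothetical dimensions (these sizes and counts are illustrative, not the paper’s measurements):

```python
# Back-of-the-envelope per-token cost with hypothetical sizes (not the paper's numbers).
d_model, rank = 2048, 8

# Unmerged LoRA on one d_model -> d_model projection:
#   extra compute per token = x @ A (d_model x rank) followed by @ B (rank x d_model)
lora_madds = d_model * rank + rank * d_model      # = 32,768 multiply-adds

# State-offset Tuning (y): adding one learned bias vector to the block's output.
offset_adds = d_model                             # = 2,048 plain additions

print(f"LoRA (unmerged):  ~{lora_madds:,} multiply-adds per token per adapted layer")
print(f"State-offset (y):  {offset_adds:,} additions per token per adapted layer")
```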

As Table 8 shows, the FLOP (Floating Point Operations) overhead for State-offset Tuning is negligible—less than 0.03%. In contrast, LoRA introduces over 30x more computational overhead during inference if weights aren’t merged.
Conclusion and Implications
This paper highlights a critical lesson in deep learning: architecture matters. As we move beyond the Transformer monopoly and explore efficient architectures like Mamba and other State Space Models, we cannot simply copy-paste the techniques of the past.
Prompt Tuning, a staple of the Transformer era, relies on the specific “all-to-all” connectivity of Attention. When applied to the recurrent nature of SSMs, it fails because of state decay.
State-offset Tuning offers a robust solution by respecting the mechanics of the SSM. By injecting the adaptation parameters directly into the state transition at every timestep, it ensures:
- Consistency: The adaptation signal doesn’t fade.
- Efficiency: It requires fewer parameters and less compute than LoRA.
- Performance: It achieves results comparable to full fine-tuning on complex reasoning tasks.
For students and practitioners working with Mamba, this suggests that manipulating the internal state—rather than the input sequence—is the future of efficient adaptation.