For the last few years, the recipe for success in Artificial Intelligence has been deceptively simple: take a Transformer architecture, feed it massive amounts of data, and watch it learn. This “Attention is All You Need” paradigm has dominated not just Natural Language Processing (NLP), but also Computer Vision.
However, Transformers have a well-known Achilles’ heel: computational complexity. As the length of the input sequence grows, the computational cost increases quadratically (\(O(N^2)\)). This is a massive headache for Vision-Language Models (VLMs), where high-resolution images result in thousands of “visual tokens,” creating long sequences that slow down training and inference.
Enter Mamba, a new contender based on Structured State Space Models (SSMs). Mamba promises the holy grail: performance comparable to Transformers but with linear scaling (\(O(N)\)).
But can Mamba actually handle the complexity of multimodal data? Can we simply swap out the Transformer in a VLM for Mamba and expect it to “see”?
In this post, we are deep-diving into a fascinating research paper, “Shaking Up VLMs,” which conducts a rigorous, controlled face-off between Mamba and Transformers. The results are surprising, revealing a distinct split in “personality” between these two architectures—one is a master of summary, while the other is a master of detail.
The Problem: The Quadratic Bottleneck
To understand why this research matters, we first need to look at how modern VLMs work. Typically, a VLM consists of three parts:
- A Vision Encoder: Turns an image into a grid of feature patches (tokens).
- A Connector: Projects these visual tokens into the language model’s embedding space.
- A Large Language Model (LLM): Processes both the visual tokens and text tokens to generate an answer.
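To make this three-part pipeline concrete, here is a minimal PyTorch-style sketch of a VLM forward pass. The class and argument names are illustrative placeholders, not code from any particular model:

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Schematic three-part VLM: vision encoder -> connector -> LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a ViT that returns patch features
        self.connector = nn.Sequential(             # projects visual tokens into LLM space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                              # Transformer or Mamba backbone

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patch_feats = self.vision_encoder(image)     # (batch, n_patches, vision_dim)
        visual_tokens = self.connector(patch_feats)  # (batch, n_patches, llm_dim)
        # The LLM sees the visual tokens followed by the embedded text prompt
        sequence = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(sequence)
```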
Most modern VLMs (like LLaVA or GPT-4V) use a Transformer as the LLM backbone. Because Transformers use a mechanism called Self-Attention, every token looks at every other token: double the length of the token sequence and you quadruple the attention memory and compute. Since higher-resolution images mean more visual tokens, this limits how much “detail” a model can afford to see.
Mamba, on the other hand, is a Recurrent model. It processes tokens one by one, updating a fixed-size hidden state. It doesn’t need to look back at the entire history at every step; it just carries the relevant information forward. This makes it incredibly efficient.
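A quick back-of-the-envelope comparison makes the scaling difference tangible. The token counts below are arbitrary examples, not numbers from the paper:

```python
def attention_pairs(n_tokens: int) -> int:
    """Self-attention compares every token with every other token: O(N^2)."""
    return n_tokens * n_tokens

def recurrent_steps(n_tokens: int) -> int:
    """A recurrent scan touches each token exactly once: O(N)."""
    return n_tokens

for n in (576, 1152, 2304):  # e.g. visual token counts at increasing resolutions
    print(f"{n:>5} tokens -> {attention_pairs(n):>10,} attention pairs, "
          f"{recurrent_steps(n):>5,} recurrent steps")
# Doubling the sequence length quadruples the attention term
# but only doubles the recurrent one.
```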
The Methodology: A Fair Fight
Comparing architectures is notoriously difficult. If Model A beats Model B, is it because of the architecture, or did Model A just see better training data?
To solve this, the researchers set up a perfectly controlled experiment. They built two distinct VLM architectures:
- Pythia-VL: Uses a standard Transformer (Pythia) backbone.
- Mamba-VL: Uses a Mamba backbone.
Crucially, everything else was identical. They used the same vision encoder, the same connector, the exact same training data (a mix of 6.2 million image-text pairs), and the same training order. This isolation ensures that any difference in performance is due to the architecture itself.
The Architecture of Mamba-VL
Building a VLM out of Mamba isn’t as simple as plug-and-play. Mamba processes data as a 1D stream, but images are inherently 2D spatial structures.

As shown in Figure 1, the researchers utilized the EVA-02 vision encoder to create visual embeddings. These are passed through a simple MLP connector.
Because Mamba doesn’t natively understand 2D grids or “position” in the same way Transformers do (which use positional embeddings), the researchers had to be clever. They introduced special separator tokens:
- `##`: Marks the start and end of an image.
- `&&`: Marks the end of a “row” of patches in the flattened image sequence.
This structure helps the sequential Mamba model understand the spatial layout of the image.
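Here is a hedged sketch of what that flattening might look like in code. The separator strings follow the description above; the paper’s actual tokenization details may differ:

```python
IMG_SEP = "##"   # marks the start and end of an image
ROW_SEP = "&&"   # marks the end of each row of patches

def flatten_image_grid(patch_grid):
    """Flatten a 2D grid of patch tokens into the 1D stream a sequential
    model like Mamba consumes, inserting separators to encode the layout."""
    sequence = [IMG_SEP]
    for row in patch_grid:
        sequence.extend(row)
        sequence.append(ROW_SEP)   # signals that a new spatial row starts next
    sequence.append(IMG_SEP)
    return sequence

# Example: a 2x3 grid of patch tokens
grid = [["p00", "p01", "p02"],
        ["p10", "p11", "p12"]]
print(flatten_image_grid(grid))
# ['##', 'p00', 'p01', 'p02', '&&', 'p10', 'p11', 'p12', '&&', '##']
```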
Under the Hood: The State Space Equation
Why is Mamba efficient? It essentially functions as a sophisticated Recurrent Neural Network (RNN). In simple terms, it maps a sequence \(x(t)\) to an output \(y(t)\) through a hidden state \(h(t)\).

In a Transformer, “history” is preserved by keeping all previous tokens in memory (the KV cache). In an SSM like Mamba, history is compressed into the hidden state \(h_t\).

The defining feature of Mamba is that the parameters \(B\), \(C\), and the discretization step \(\Delta\) are input-dependent. This is what makes the model “selective”: it can choose to remember relevant information and forget irrelevant noise at every time step.
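Written out, the recurrence the text is describing looks roughly like this (the standard selective-SSM formulation, paraphrased rather than copied from the paper). The continuous-time system

\[
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)
\]

is discretized with step size \(\Delta\), giving a recurrence of the form

\[
h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t, \qquad y_t = C_t\,h_t,
\]

where \(\bar{A}_t\) and \(\bar{B}_t\) depend on the current input through \(\Delta_t\) and \(B_t\). Because \(B\), \(C\), and \(\Delta\) are functions of the token being processed, the model can modulate, at every step, how much of the new input enters the state and how much of the old state is retained.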
The Experiments: Summary vs. Retrieval
The researchers evaluated the models on a wide spectrum of tasks. To understand the results, it helps to categorize these tasks into two buckets (as visualized in Figure 2):
- Coarse-Grained Tasks: These require a holistic understanding of the image (e.g., Image Captioning, Visual Question Answering). The model needs to summarize the “gist” of the scene.
- Fine-Grained Tasks: These require pinpointing specific details or locations (e.g., Visual Grounding, finding coordinates of an object).

Result 1: Mamba is a Great Narrator
When it came to tasks that required reasoning, summarizing, or answering general questions, Mamba-VL actually outperformed the Transformer.

Looking at Table 1, across different model sizes (790M, 1.4B, 2.8B), Mamba consistently edged out Pythia in:
- Image Captioning: Describing the scene.
- Visual Question Answering (VQA): Answering “What is the dog holding?”
- Reading Comprehension: Reading text inside images.
This suggests that Mamba’s “selective” state is excellent at compressing visual information into a semantic summary. It captures the “vibes” and the narrative of the image better than a Transformer of the same size.
Result 2: The Grounding Gap
However, the story changes dramatically when we look at Visual Grounding—tasks where the model must output the specific bounding box coordinates of an object (e.g., “Where is the blue flask?”).
Here, Transformers consistently won, and the gap widened as the models got larger. While Mamba could describe the flask, it struggled to point to it precisely.
Even when the researchers tried to help the models by increasing the image resolution (which usually improves performance), the gap remained.

As shown in Figure 3 (middle graph), increasing resolution to \(480 \times 480\) boosted Pythia’s grounding performance significantly, while Mamba saw much smaller gains.
Why Does Mamba Struggle with Grounding?
The researchers didn’t just stop at reporting the scores; they investigated why this divergence happens. They proposed two hypotheses.
Hypothesis 1: The “Task-Agnostic” Problem
Mamba processes data sequentially: [Image Tokens] -> [Text Instruction] -> [Response].
Because it updates its hidden state step-by-step, by the time it reaches the text instruction (e.g., “Find the red cup”), it has already processed and compressed the image.
If the model didn’t know what to look for while processing the image, it might have discarded the spatial location of the “red cup” as irrelevant noise. Transformers don’t have this issue because they can look back at the image tokens after reading the instruction using attention.
To test this, the researchers flipped the input order: [Text Instruction] -> [Image Tokens]. This is called Task-Aware Encoding.
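In code, the change is nothing more than reordering how the prompt sequence is assembled (a schematic sketch; the helper name is ours, not the paper’s):

```python
def build_prompt(image_tokens, instruction_tokens, task_aware=False):
    """Assemble the multimodal prompt fed to the backbone.

    task_aware=False: [image] -> [instruction]   (standard order)
    task_aware=True:  [instruction] -> [image]   (task-aware encoding: the
        recurrent state already "knows" the question while it compresses
        the image)
    """
    if task_aware:
        return instruction_tokens + image_tokens
    return image_tokens + instruction_tokens
```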

The Result (Figure 4): It helped Mamba a little bit (average +1.53% on RefCOCO), but it wasn’t a magic fix. Pythia still performed better. The strict ordering of data wasn’t the only culprit.
Hypothesis 2: The “Retrieval” Problem
This leads to the core insight of the paper: Visual Grounding is actually a retrieval task.
To provide coordinates for an object, the model effectively needs to “copy” specific patch information from the input sequence to the output. Transformers are naturally good at this—Attention acts like a dictionary lookup. You query “blue cup,” and the Attention mechanism retrieves the exact vector from the image sequence.
Mamba, however, has to compress everything into a fixed-size state.
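The contrast fits in a few lines of NumPy: dot-product attention can place essentially all of its weight on one key and copy the matching value out of the context, which is exactly the “lookup” behavior grounding needs. This is a toy illustration, not an experiment from the paper:

```python
import numpy as np

def attention_retrieve(query, keys, values):
    """Dot-product attention: a soft dictionary lookup over the whole context."""
    scores = keys @ query                        # similarity of the query to every key
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                      # ~ the value of the best-matching key

rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 64))                # 100 context tokens
values = rng.normal(size=(100, 64))
query = 5.0 * keys[42]                           # query strongly matches token 42
retrieved = attention_retrieve(query, keys, values)
print(np.allclose(retrieved, values[42], atol=1e-2))  # True: a near-exact copy
```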
To prove this, the authors designed a Synthetic Grounding Task. They created a sequence of random unique tokens and asked the model to identify the position of a specific query token. It’s a “needle in a haystack” test for neural networks.
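A minimal version of that probe can be generated in a few lines (our reconstruction from the description above, not the paper’s exact setup):

```python
import random

def make_retrieval_example(seq_len=64, vocab_size=1000, seed=None):
    """Needle-in-a-haystack probe: given a sequence of unique random tokens
    and a query token, the model must output the query's position."""
    rng = random.Random(seed)
    tokens = rng.sample(range(vocab_size), k=seq_len)   # unique random tokens
    target_pos = rng.randrange(seq_len)
    query = tokens[target_pos]
    # Input: the sequence followed by the query; label: the position to retrieve
    return {"sequence": tokens, "query": query, "label": target_pos}

print(make_retrieval_example(seq_len=8, seed=0))
```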

The results were damning for Mamba.

As Figure 6 shows, Pythia (Transformer) learns to solve this retrieval task almost instantly (the red lines shoot up to 100% accuracy immediately). Mamba (the blue lines) struggles significantly, taking much longer to learn, and sometimes failing to converge when the sequence length increases.
This confirms a fundamental limitation: State Space Models struggle to perform precise “copy-paste” retrieval from their context history. They are great at compressing summaries, but bad at keeping a perfect archive of every specific detail.
Heatmaps and Learning Patterns
The researchers went even deeper, visualizing where in the sequence the models were successfully retrieving information.

Figure 7 reveals the learning dynamics:
- Pythia (Top): The map is uniformly bright. It learns to retrieve information from anywhere in the sequence equally well, very early in training.
- Mamba (Bottom): It struggles. It first learns to retrieve items at the very end of the sequence (recent memory). Then it learns the beginning. It struggles most with the middle.
This “middle-muddle” is a classic symptom of recurrent models trying to manage a compressed memory state.
Conclusion: Feature or Bug?
The paper titled “Shaking Up VLMs” concludes with a nuanced take on the future of multimodal AI.
- Mamba is a contender: For tasks requiring high-level reasoning, captioning, and chat, Mamba is not just a viable alternative to Transformers—it’s often better and more efficient.
- The Retrieval Bottleneck: For tasks requiring precise spatial grounding or “pointing” at specific history, the compression mechanism of SSMs is a handicap compared to the “photographic memory” of Transformers.
The Implications
This research suggests that the future might not be “Transformer vs. Mamba,” but rather a hybrid of both. We might see architectures that use Mamba layers for efficient, high-level reasoning and massive context processing, interspersed with Attention layers to handle retrieval and grounding.
For students and researchers, this highlights a critical lesson in deep learning: Architecture is destiny. The inductive biases we build into our models—whether the global attention of Transformers or the sequential compression of SSMs—fundamentally dictate what they can and cannot perceive.