Introduction

Imagine trying to track a friend in a crowded video. Sometimes you know what they look like (a visual reference), and sometimes you only know a description, like “the person wearing a red hat.” Now, imagine the video is long. Your friend changes pose, walks behind a tree, or takes off the hat. To keep tracking them effectively, you need memory. You need to remember their history to predict where they are now.

In the field of Computer Vision, this task is known as Vision-Language Tracking (VLT). For years, Transformer-based models have been the gold standard here because of their powerful attention mechanisms. However, Transformers have an “Achilles’ heel”: they struggle to efficiently model long-term temporal information. As a video gets longer, the cost of attending over the full history grows quadratically with sequence length (\(O(N^2)\)).

Most current trackers work around this by taking “snapshots” of the past and updating the reference template at discrete intervals. It works, but it is clunky and breaks the temporal flow between updates.

Enter MambaVLT, a new approach proposed by researchers from the Harbin Institute of Technology and Peng Cheng Laboratory. By leveraging the State Space Model (SSM) architecture known as “Mamba,” this paper proposes a tracker that evolves continuously over time. It mimics how humans track objects—not by taking discrete snapshots, but by maintaining a fluid, evolving understanding of the target.

In this post, we will deconstruct MambaVLT. We will explore how it uses state space models to achieve linear complexity (\(O(N)\)) and how it fuses visual and linguistic data to track objects more robustly than state-of-the-art Transformers.

Figure 1. Illustration of two ways for capturing temporal context information. (a) Vision-language tracker with discrete context prompt. (b) Our MambaVLT with continuous time-evolving state space for temporal information transmission.

As shown in Figure 1, distinct from previous methods that rely on discrete context updates (a), MambaVLT (b) maintains a continuous state space memory. This allows the model to “evolve” its understanding of the target frame-by-frame without restarting the context process.

Background: The Shift to State Space Models

Before understanding MambaVLT, we need to understand the engine powering it: the State Space Model (SSM).

While Transformers use Self-Attention to look at every token in a sequence at once, SSMs are more like Recurrent Neural Networks (RNNs). They process data sequentially but are mathematically designed to be much more efficient and stable.

The Core Math of SSMs

An SSM views a system (like a video tracker) as a continuous process mapping an input sequence \(x(t)\) to an output sequence \(y(t)\) through a hidden state \(h(t)\). The hidden state is the “memory” of the system.

The continuous evolution is described by this differential equation:

\[ h'(t) = \mathbf{A}\,h(t) + \mathbf{B}\,x(t), \qquad y(t) = \mathbf{C}\,h(t) \]

Here:

  • \(h(t)\) is the hidden state (the memory).
  • \(x(t)\) is the input (current video frame/features).
  • \(\mathbf{A}\), \(\mathbf{B}\), and \(\mathbf{C}\) are matrices that define how the state updates and translates to output.

However, computers can’t process continuous time; they operate in discrete steps (frames). To make this work for deep learning, the authors use a discretization parameter \(\Delta\) (delta) to transform the continuous matrices into discrete ones:

\[ \bar{\mathbf{A}} = \exp(\Delta\mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\big(\exp(\Delta\mathbf{A}) - \mathbf{I}\big)\cdot\Delta\mathbf{B} \]

The sequence can then be processed step by step with the discrete recurrence:

\[ h_i = \bar{\mathbf{A}}\,h_{i-1} + \bar{\mathbf{B}}\,x_i, \qquad y_i = \mathbf{C}\,h_i \]
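
If you prefer code to calculus, here is a minimal NumPy/SciPy sketch of a plain (non-selective) SSM: it discretizes \((\mathbf{A}, \mathbf{B})\) with zero-order hold and then runs the recurrence sequentially, carrying the hidden state from step to step. The shapes and values are toy-sized and purely illustrative.

```python
# A minimal NumPy/SciPy sketch of a non-selective SSM: zero-order-hold (ZOH)
# discretization of (A, B), followed by the discrete recurrence
# h_i = A_bar h_{i-1} + B_bar x_i,  y_i = C h_i.
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-order hold: A_bar = exp(dA), B_bar = (dA)^{-1} (exp(dA) - I) dB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(x, A, B, C, delta, h0=None):
    """Run the discrete recurrence over a sequence x of shape (L, d_in)."""
    A_bar, B_bar = zoh_discretize(A, B, delta)
    h = np.zeros(A.shape[0]) if h0 is None else h0
    ys = []
    for x_i in x:                      # sequential scan: O(L), not O(L^2)
        h = A_bar @ h + B_bar @ x_i    # state update (the "memory")
        ys.append(C @ h)               # readout
    return np.stack(ys), h             # outputs and final state

# Toy usage: 16-dim state, 4-dim input, 2-dim output, sequence of length 100.
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 1.5, 16))   # stable (negative real part)
B = rng.standard_normal((16, 4))
C = rng.standard_normal((2, 16))
x = rng.standard_normal((100, 4))
y, h_final = ssm_scan(x, A, B, C, delta=0.1)
print(y.shape, h_final.shape)             # (100, 2) (16,)
```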

Enter Mamba: The Selective SSM

Standard SSMs are static; the matrices \(\mathbf{A}\) and \(\mathbf{B}\) don’t change based on the input. This makes them fast but not very smart at handling dynamic context. Mamba introduces “Selectivity.” It allows the model to change the parameters \(\mathbf{B}\), \(\mathbf{C}\), and \(\Delta\) based on the current input \(x_i\).

\[ \mathbf{B}_i = S_B(x_i), \qquad \mathbf{C}_i = S_C(x_i), \qquad \Delta_i = \mathrm{softplus}\big(S_\Delta(x_i)\big) \]

\[ h_i = \bar{\mathbf{A}}_i\,h_{i-1} + \bar{\mathbf{B}}_i\,x_i, \qquad y_i = \mathbf{C}_i\,h_i \]

where \(S_B\), \(S_C\), and \(S_\Delta\) are learned projections of the current input, and \(\bar{\mathbf{A}}_i\), \(\bar{\mathbf{B}}_i\) are obtained by discretizing with the input-dependent \(\Delta_i\).

This selectivity allows Mamba to filter out irrelevant information (noise) and remember relevant information (the target) over long sequences with linear complexity. MambaVLT exploits this property to memorize the history of a target object throughout a video.
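
To make selectivity concrete, here is a small NumPy sketch in which \(\mathbf{B}_i\), \(\mathbf{C}_i\), and \(\Delta_i\) are computed from the current token. The projection matrices (`W_B`, `W_C`, `w_delta`) and the per-channel state layout are illustrative stand-ins, not the actual Mamba implementation.

```python
# Mamba-style selectivity for a stack of feature channels: B, C and Delta are
# functions of the current token, so the scan decides per token what to write
# into the state and what to read out of it.
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, a, W_B, W_C, w_delta, h0=None):
    """
    x:       (L, d) token features (d channels)
    a:       (n,)   diagonal of the shared state matrix A (negative for stability)
    W_B, W_C:(n, d) project the token into input-dependent B_i and C_i
    w_delta: (d,)   projects the token into a scalar step size Delta_i
    Returns outputs (L, d) and the final state (d, n).
    """
    L, d = x.shape
    n = a.shape[0]
    h = np.zeros((d, n)) if h0 is None else h0      # one small state per channel
    ys = np.zeros((L, d))
    for i, x_i in enumerate(x):
        delta_i = softplus(w_delta @ x_i)           # how much this token is "let in"
        B_i = W_B @ x_i                             # (n,) input-dependent write vector
        C_i = W_C @ x_i                             # (n,) input-dependent read vector
        A_bar = np.exp(delta_i * a)                 # ZOH for a diagonal A
        B_bar = delta_i * B_i                       # common simplification of ZOH for B
        h = A_bar * h + np.outer(x_i, B_bar)        # update each channel's state
        ys[i] = h @ C_i                             # read out with the selective C
    return ys, h

# Toy usage
rng = np.random.default_rng(0)
L, d, n = 50, 8, 16
ys, h = selective_scan(rng.standard_normal((L, d)),
                       -rng.uniform(0.5, 1.5, n),
                       rng.standard_normal((n, d)) * 0.1,
                       rng.standard_normal((n, d)) * 0.1,
                       rng.standard_normal(d) * 0.1)
print(ys.shape, h.shape)   # (50, 8) (8, 16)
```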

The MambaVLT Architecture

The goal of MambaVLT is to track a target defined by a text description (“cat colored in brown…”) and a visual template (the first frame bounding box).

The overall framework is elegant. It extracts features from both the text and the images (template and search region), fuses them into a unified sequence, and then processes them through a Time-Evolving Multimodal Fusion Module.

Figure 2. Overview of the MambaVLT architecture showing feature extraction, fusion, and localization.

As illustrated in Figure 2, the architecture has three key stages:

  1. Feature Extraction: A Mamba-based text encoder processes the language description, while a VMamba-based visual encoder processes the image frames.
  2. Time-Evolving Fusion: This is the core innovation. It fuses the modalities while updating the “memory” of the target.
  3. Localization: A prediction head finds the target in the search region.
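
Before zooming in, here is a toy PyTorch skeleton of how these three stages could be wired together. The encoders and the fusion module are deliberately simple stand-ins (an embedding table, a patch convolution, and a stateful GRU), not the paper's Mamba/VMamba blocks; the point is only the data flow: text, template, and search tokens become one fused sequence, and a recurrent state is carried from frame to frame.

```python
# A high-level, heavily simplified sketch of the three-stage pipeline.
import torch
import torch.nn as nn

class ToyMambaVLT(nn.Module):
    def __init__(self, dim=256, vocab=1000, patch=16):
        super().__init__()
        self.text_enc = nn.Embedding(vocab, dim)                # stand-in for the Mamba text encoder
        self.vis_enc = nn.Conv2d(3, dim, patch, stride=patch)   # stand-in for the VMamba visual encoder
        self.fusion = nn.GRU(dim, dim, batch_first=True)        # stand-in for the time-evolving fusion (stateful)
        self.box_head = nn.Linear(dim, 4)                       # predicts (cx, cy, w, h)

    def forward(self, text_ids, template, search, state=None):
        t = self.text_enc(text_ids)                             # (B, L_t, dim)
        z = self.vis_enc(template).flatten(2).transpose(1, 2)   # (B, L_z, dim)
        s = self.vis_enc(search).flatten(2).transpose(1, 2)     # (B, L_s, dim)
        seq = torch.cat([t, z, s], dim=1)                       # unified multimodal sequence
        fused, state = self.fusion(seq, state)                  # state evolves across frames
        box = self.box_head(fused[:, -s.size(1):].mean(dim=1)).sigmoid()
        return box, state

# One "video": the same state is threaded through consecutive frames.
model = ToyMambaVLT()
text = torch.randint(0, 1000, (1, 6))
template = torch.rand(1, 3, 128, 128)
state = None
for _ in range(3):                       # three consecutive search frames
    search = torch.rand(1, 3, 256, 256)
    box, state = model(text, template, search, state)
print(box.shape)                         # torch.Size([1, 4])
```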

Let’s zoom in on the core innovation: the Time-Evolving Multimodal Fusion (TEMF) module. This module contains two critical blocks: the Hybrid Multimodal State Space (HMSS) and the Selective Locality Enhancement (SLE).

1. Hybrid Multimodal State Space (HMSS) Block

The HMSS block is responsible for long-term memory and mixing the vision and language features.

In a Transformer, you might concatenate text and image tokens and use self-attention, where every token attends to every other token at once. In Mamba, the order of the sequence matters because the scan is causal and recurrent: it reads the sequence in order, so earlier tokens shape the hidden state that later tokens see. This creates a challenge: should we put the text first or the image first?

The authors realized that scan order determines guidance. If you scan text before the image, the text features “guide” the image update. If you scan the visual template before the search region, the template guides the search.

To get the best of both worlds, the HMSS block uses a Modality-Guided Bidirectional Scan.

Figure 3. Overall pipeline of the Hybrid Multimodal State Space Block and Selective Locality Enhancement Block.

As shown in the left side of Figure 3, the model splits into two paths:

  1. Text-First Scan (\(\alpha\)): The scan order prioritizes language features to guide the visual features.
  2. Template-First Scan (\(\beta\)): The scan order prioritizes the visual template to guide the search region.

The outputs of these two scans are combined to form the final feature representation.
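
In code, the idea boils down to scanning the same tokens in two different orders and averaging the readouts for the search-region positions. The token ordering and the scan function below are simplified stand-ins; any causal state-space scan (for example, the `selective_scan` sketch above) could be plugged in.

```python
# Sketch of a modality-guided bidirectional scan: one pass puts language first,
# the other puts the visual template first, and the search-region outputs are averaged.
import numpy as np

def hmss_bidirectional(text, template, search, ssm_scan, params, h_alpha0=None, h_beta0=None):
    """text/template/search: (L_*, d) token features; returns fused search features."""
    # alpha path: text first, so the language state is built before visual tokens arrive
    seq_alpha = np.concatenate([text, template, search], axis=0)
    y_alpha, h_alpha = ssm_scan(seq_alpha, *params, h0=h_alpha0)
    # beta path: template first, so the visual reference guides the search region
    seq_beta = np.concatenate([template, text, search], axis=0)
    y_beta, h_beta = ssm_scan(seq_beta, *params, h0=h_beta0)
    # keep only the search-region positions and average the two readouts
    n_s = search.shape[0]
    fused_search = 0.5 * (y_alpha[-n_s:] + y_beta[-n_s:])
    return fused_search, (h_alpha, h_beta)

# Toy usage with a trivial decaying-average "scan" standing in for a real SSM:
def toy_scan(x, decay=0.9, h0=None):
    h = np.zeros(x.shape[1]) if h0 is None else h0
    ys = []
    for x_i in x:
        h = decay * h + (1 - decay) * x_i
        ys.append(h.copy())
    return np.stack(ys), h

rng = np.random.default_rng(0)
text, tmpl, search = (rng.standard_normal((6, 32)),
                      rng.standard_normal((64, 32)),
                      rng.standard_normal((256, 32)))
fused, states = hmss_bidirectional(text, tmpl, search, toy_scan, params=())
print(fused.shape)   # (256, 32)
```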

The Time-Evolving Mechanism

Crucially, the hidden state \(h\) is not reset at every frame. The final state of the previous frame becomes the initial state of the current frame. This allows the model to carry temporal information across the entire video.

The initialization of the state space for the current time step \(t\) is a blend of a learnable state and the previous frame’s memory:

Equation for initial state space derivation.

The mathematical formulation for the bidirectional scan and update is:

Equations for modality-guided bidirectional scan and output generation.

Here, \(\mathbf{h}^\alpha\) and \(\mathbf{h}^\beta\) represent the hidden states from the text-first and template-first scans, respectively. By averaging their outputs (via \(\mathbf{C}\)), the model achieves a robust fusion of multimodal data.
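
Here is a tiny sketch of that carry-over, assuming a simple convex blend between a learnable prior state and the previous frame's final state. The single gate `g` is my own simplification of the paper's initialization, shown only to make the mechanism concrete.

```python
# Sketch of the time-evolving mechanism: the state is never reset between frames;
# each frame starts from a blend of a learnable state and the previous frame's memory.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TimeEvolvingState:
    def __init__(self, state_shape, gate_logit=0.0, seed=0):
        rng = np.random.default_rng(seed)
        self.h_learn = 0.01 * rng.standard_normal(state_shape)  # learnable "prior" state
        self.gate_logit = gate_logit                             # learnable blend gate (assumed form)
        self.h_prev = None                                       # memory from the last frame

    def init_state(self):
        """Initial state for the current frame: blend of prior and previous memory."""
        if self.h_prev is None:
            return self.h_learn.copy()
        g = sigmoid(self.gate_logit)
        return g * self.h_prev + (1.0 - g) * self.h_learn

    def update(self, h_final):
        """Store this frame's final state for the next frame."""
        self.h_prev = h_final

# Per-frame loop (pseudo-usage): start each frame's scan from the evolved state.
# mem = TimeEvolvingState(state_shape=(8, 16))
# for frame_tokens in video:
#     y, h_final = selective_scan(frame_tokens, a, W_B, W_C, w_delta, h0=mem.init_state())
#     mem.update(h_final)
```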

2. Selective Locality Enhancement (SLE) Block

While the HMSS block handles long-term global history, the tracker also needs to focus intensely on the current frame to pinpoint the object. The SLE block (shown on the right in Figure 3) handles this.

Standard linear attention mechanisms often fail to capture complex local nuances. The SLE block introduces a Global Selective Map (\(A_l\)). This map is generated by a convolution operation on the HMSS output, extracting spatial importance weights.

This map is added to the input features before performing a local linear attention scan. This enhances specific regions of the image that are currently relevant, while keeping computational costs low (linear complexity).

Equations for Selective Locality Enhancement block.
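
Here is one way such a block could look in PyTorch, assuming a 3×3 convolution for the selective map and non-overlapping windowed attention as the cheap local attention; both are illustrative choices rather than the paper's exact configuration.

```python
# Sketch of the Selective Locality Enhancement idea: a conv over the HMSS output
# produces a spatial importance map, which is added to the features before a
# windowed (hence cheap) attention refines local detail.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySLE(nn.Module):
    def __init__(self, dim=256, window=4):
        super().__init__()
        self.map_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)  # global selective map A_l
        self.qkv = nn.Linear(dim, 3 * dim)
        self.window = window

    def forward(self, feat):               # feat: (B, dim, H, W) HMSS output on the search region
        B, C, H, W = feat.shape
        a_l = self.map_conv(feat)           # spatial importance weights
        x = feat + a_l                      # enhance currently relevant regions
        # attention only inside non-overlapping w x w windows: cost linear in H*W
        w = self.window
        x = x.reshape(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)  # (B, nH, nW, w, w, C)
        x = x.reshape(-1, w * w, C)         # each window is a tiny token sequence
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
        x = attn @ v
        x = x.reshape(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return x

# Toy usage
out = ToySLE()(torch.rand(1, 256, 16, 16))
print(out.shape)   # torch.Size([1, 256, 16, 16])
```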

3. Modality-Selection Module

Sometimes, visual tracking is unreliable (e.g., the object is occluded). Other times, the language description is ambiguous (“the dark car” when there are three dark cars). A good tracker knows which source to trust.

MambaVLT includes a Modality-Selection Module. It calculates “invariant clues” (\(I_l\) for language and \(I_z\) for vision) and uses a Mamba-based selector to weigh them.

Figure 4. Overview of modality-selection module showing how language and vision clues are aggregated and weighted.

As seen in Figure 4, the module generates weights \(w_l\) and \(w_z\). These weights determine how much the search region features should be refined by the text versus the visual template.
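
A hedged sketch of this selection step, with mean-pooled summaries standing in for the invariant clues and a tiny MLP plus softmax standing in for the Mamba-based selector:

```python
# Sketch of modality selection: summarize language and template branches into
# "clue" vectors, score them jointly, and weight how much each reference refines
# the search features.
import torch
import torch.nn as nn

class ToyModalitySelector(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 2))

    def forward(self, text_feat, template_feat, search_feat):
        # invariant clues: global summaries of each reference modality
        i_l = text_feat.mean(dim=1)                        # (B, dim)
        i_z = template_feat.mean(dim=1)                    # (B, dim)
        w_l, w_z = self.scorer(torch.cat([i_l, i_z], -1)).softmax(-1).unbind(-1)  # (B,), (B,)
        # refine the search features by whichever reference is currently more reliable
        refine = w_l[:, None, None] * i_l[:, None, :] + w_z[:, None, None] * i_z[:, None, :]
        return search_feat + refine                        # (B, L_s, dim)

# Toy usage: 6 text tokens, 64 template tokens, 256 search tokens
sel = ToyModalitySelector()
out = sel(torch.rand(2, 6, 256), torch.rand(2, 64, 256), torch.rand(2, 256, 256))
print(out.shape)   # torch.Size([2, 256, 256])
```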

The effectiveness of this selection is visualized in the heatmaps below. Notice how the “After” column focuses much more tightly on the specific target described by the text or box, filtering out distractions.

Figure 6. Visualization of the similarity between reference token and search region before and after the modality-selection module.

Training Objectives

To train this architecture, the authors use a combination of losses. A key component is the Contrastive Loss, applied both within the same video (intra-video) and across different videos (inter-video). This forces the model to learn that the “cat” in frame 1 is the same entity as the “cat” in frame 100, but different from a “cat” in a completely different video.

The token-wise similarity for contrastive learning is calculated as:

Equation for token-wise similarity calculation.

And the contrastive loss function is:

Equation for contrastive loss calculation.

The total loss combines bounding box regression (\(\mathcal{L}_{bbox}\)), target scores, and contrastive losses:

Equation for total training objective.
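
As a concrete (and generic) stand-in for these equations, here is an InfoNCE-style contrastive loss in PyTorch: matching reference/target pairs from the same video are pulled together, while pairs from different videos in the batch are pushed apart. The cosine similarity, temperature, and symmetric cross-entropy are standard choices, not necessarily the paper's exact formulation.

```python
# Generic InfoNCE-style contrastive loss over reference/target embedding pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(ref_tokens, pos_tokens, temperature=0.07):
    """
    ref_tokens: (N, dim) one reference embedding per video in the batch
    pos_tokens: (N, dim) matching target embeddings (same index = same video)
    """
    ref = F.normalize(ref_tokens, dim=-1)
    pos = F.normalize(pos_tokens, dim=-1)
    logits = ref @ pos.t() / temperature          # pairwise cosine similarities
    labels = torch.arange(ref.size(0))            # diagonal pairs are the positives
    # symmetric cross-entropy: pull matching pairs together, push others apart
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Toy usage: 8 videos in a batch, 256-dim embeddings
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```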

Experiments and Results

The researchers tested MambaVLT on four major benchmarks: TNL2K, LaSOT, OTB99, and MGIT. They compared it against state-of-the-art Transformer trackers like TransT, OSTrack, and UVLTrack.

Quantitative Performance

MambaVLT achieved impressive results. In the “Tracking by Language & Bounding Box” setting (NL&BBOX), it set new records on TNL2K and OTB99.

Table 1. Comparison of our method with state-of-the-art approaches on the TNL2K, LaSOT, and OTB99 datasets.

It also showed strong performance on the MGIT dataset, outperforming previous methods significantly in precision.

Table 2. Comparison of our method with the latest approaches on the MGIT dataset.

The “Semi-Reference-Free” (SRF) Test

The most interesting experiment is the Semi-Reference-Free (SRF) tracking.

In standard tracking, the model usually gets to “peek” at the initial reference box or text constantly. To prove that MambaVLT actually memorizes the target via its state space, the researchers cut off access to the reference data after the first frame. The tracker had to rely solely on its internal memory evolution to keep tracking.

Figure 5. Qualitative comparison of NL&BBOX tracking task on two challenging sequences using SRF setting.

The green line in Figure 5 (MambaVLT) tracks the target closely. Remarkably, the blue line (MambaVLT in SRF mode) also tracks the target successfully, often outperforming the standard UVLTrack (purple line).

Furthermore, look at the tracking reliability curves in Figure B:

Figure B. Effectiveness analysis of the time-evolving state space memory in BBOX and NL tasks under semi-reference-free setting.

Even without constant reference input (the blue line), MambaVLT maintains high Intersection over Union (IoU) scores, validating that the Time-Evolving State Space is effectively acting as a long-term memory.

Efficiency

Finally, does Mamba live up to its promise of efficiency?

Figure A. Computational complexity comparison with different search region image scales.

Figure A plots the FLOPs (computational cost) against the image size. As the search region grows, Transformer-based methods (like UVLTrack) see their costs skyrocket (quadratic growth), eventually hitting “Out Of Memory” (OOM). MambaVLT (green line) scales almost linearly, remaining efficient even at large resolutions.

Ablation Studies

The authors also broke down the model to see which parts mattered most.

Table 3. Analysis of different components in MambaVLT.

As shown in Table 3, removing the Time-Evolving Hybrid State Space (THSS) caused the biggest drop in performance, confirming that the continuous memory mechanism is the backbone of the system. Adding the Modality-Selection (MS) and SLE blocks provided further incremental gains.

Conclusion

MambaVLT represents a significant step forward in visual tracking. It addresses the inherent limitations of Transformers—specifically the difficulty in modeling long temporal sequences and the high computational cost—by adopting the Mamba State Space Model.

By treating tracking as a continuous evolution of state rather than a series of discrete updates, MambaVLT achieves:

  1. Linear Complexity: It processes long videos efficiently.
  2. Robust Memory: It retains target identity even when reference inputs are removed (proven by the SRF test).
  3. Adaptive Fusion: It intelligently weighs vision and language based on the current context.

For students and researchers in Computer Vision, this paper is a strong signal that SSMs are not just a curiosity—they are a viable, high-performance alternative to Transformers for temporal modeling tasks.

Figure C. Visualized results of the MambaVLT and the UVLTrack method on six challenging sequences with drastic changes.