[Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models 🔗](https://arxiv.org/abs/2208.09399)

Beyond the Gaps: A Deep Dive into SSSD for Time Series Imputation and Forecasting

Introduction: The Problem of Missing Time

Imagine you’re a doctor monitoring a patient’s heart with an ECG, but the sensor glitches and you lose a few critical seconds of data. Or perhaps you’re a financial analyst tracking stock prices and your data feed suddenly has gaps. Missing data is not just inconvenient—it’s a pervasive issue in real-world applications. It can derail machine learning models, introduce bias, and lead to flawed conclusions. ...

2022-08
[VideoMamba: State Space Model for Efficient Video Understanding 🔗](https://arxiv.org/abs/2403.06977)

Beyond Transformers: How VideoMamba Unlocks Efficient Long-Video Understanding

The world of video is exploding. From bite-sized clips on social media to full-length feature films, we are generating and consuming more video content than ever before. For AI, truly understanding this content is a monumental task. A single video can contain mountains of spatiotemporal information—ranging from subtle gestures to complex, multi-minute narratives. The core challenge for modern video understanding models comes down to two conflicting needs:

- Efficiency — Video data is massive and often highly redundant. Models must process it quickly without exhausting computational resources.
- Global Context — Videos aren’t just isolated frames. Understanding them requires capturing dependencies that can span hundreds or thousands of frames.

The Historical Trade-Off

For years, two families of models have dominated: ...

2024-03
[Hungry Hungry Hippos: Towards Language Modeling with State Space Models 🔗](https://arxiv.org/abs/2212.14052)

Hungry Hippos on the Pile: A New Challenger to the Transformer Throne

For the past several years, the Transformer architecture has been the undisputed champion of language modeling. From GPT-3 to PaLM, massive Transformer models have redefined the state of the art. But this power comes at a cost: the attention mechanism—at the heart of the Transformer—scales quadratically with sequence length. Processing a sequence twice as long takes four times the computation and memory. This makes working with very long documents, codebases, or audio files a significant challenge. ...
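To make that scaling concrete, here is a back-of-the-envelope sketch in Python. The head dimension of 64 and the focus on the score matrix alone are illustrative assumptions, not figures from the paper.

```python
# Rough FLOP count for just the L x L attention score matrix (QK^T).
# d_head = 64 is an assumed, typical head dimension.
def attention_score_flops(seq_len: int, d_head: int = 64) -> int:
    # One d_head-dimensional dot product per entry of the L x L matrix.
    return seq_len * seq_len * d_head

baseline = attention_score_flops(1024)
for length in (1024, 2048, 4096):
    print(length, attention_score_flops(length) / baseline)
# Prints 1.0, 4.0, 16.0: doubling the sequence length quadruples the cost.
```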

2022-12
[LocalMamba: Visual State Space Model with Windowed Selective Scan 🔗](https://arxiv.org/abs/2403.09338)

Beyond Transformers: How LocalMamba Unlocks the Power of State Space Models for Vision

For years, computer vision has been dominated by two architectural titans: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). CNNs excel at capturing local features through sliding convolutional filters, while ViTs leverage self-attention to model global relationships across an entire image. Now, a new contender has emerged from the world of sequence modeling: the State Space Model (SSM), and in particular its modern, high-performing variant, Mamba. Mamba has shown remarkable prowess in handling long 1D sequences such as text and genomics, offering linear-time complexity and impressive performance. Naturally, researchers sought to bring its advantages to vision tasks. However, initial attempts such as Vision Mamba (Vim) and VMamba, while promising, have not decisively surpassed CNNs and ViTs. This raises a critical question: ...

2024-03
[Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers 🔗](https://arxiv.org/abs/2110.13985)

The Swiss Army Knife of Sequence Models: A Deep Dive into Linear State-Space Layers

Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformers have revolutionized the way we process sequential data such as text, audio, and time series. Each paradigm is powerful, but each comes with its own limitations:

- RNNs are efficient at inference but train slowly on long sequences and suffer from vanishing gradients.
- CNNs train in parallel and are fast, but they struggle beyond their fixed receptive field and have costly inference.
- Transformers can capture global context but scale quadratically in memory and computation with sequence length.

What if we could unify the strengths of these approaches? Imagine a model with: ...
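The excerpt cuts off there, but the unification it hints at rests on a simple fact: one set of state-space matrices can be run either as an RNN-style recurrence or as a convolution with a precomputed kernel. The toy sketch below illustrates that equivalence under simplifying assumptions (scalar input, tiny random matrices, no discretization step); it is not the paper's implementation.

```python
import numpy as np

# One linear SSM, two computation modes: recurrent (cheap inference)
# and convolutional (parallel training). Dimensions here are toy values.
d_state, seq_len = 4, 8
rng = np.random.default_rng(1)
A = rng.normal(size=(d_state, d_state)) * 0.3   # state transition
B = rng.normal(size=(d_state, 1))               # input projection
C = rng.normal(size=(1, d_state))               # output projection
u = rng.normal(size=seq_len)                    # scalar input sequence

# View 1: recurrence h_t = A h_{t-1} + B u_t, y_t = C h_t
h = np.zeros((d_state, 1))
y_rec = []
for t in range(seq_len):
    h = A @ h + B * u[t]
    y_rec.append((C @ h).item())

# View 2: convolution with kernel K_k = C A^k B
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(seq_len)])
y_conv = [sum(K[k] * u[t - k] for k in range(t + 1)) for t in range(seq_len)]

assert np.allclose(y_rec, y_conv)  # same outputs from both views
```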

2021-10
[On the Parameterization and Initialization of Diagonal State Space Models 🔗](https://arxiv.org/abs/2206.11893)

S4, But Simpler: How Diagonal State Space Models (S4D) Match Performance with Less Complexity

Introduction: The Quest for Efficient Sequence Models

Modeling long sequences of data—whether audio waveforms, medical signals, text, or flattened images—is a fundamental challenge in machine learning. For years, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) were the standard tools. More recently, Transformers have risen to prominence with remarkable results. But all of these models face trade-offs, particularly when sequences get very long. Enter State Space Models (SSMs). A recent architecture called S4 (Structured State Space for Sequences) emerged as a powerful contender, outperforming previous approaches for tasks requiring long-range memory. Built on a solid mathematical foundation from classical control theory, S4 efficiently models continuous signals with a special state matrix called the HiPPO matrix—a mathematical design aimed at remembering information over long periods. ...
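For intuition about why the "diagonal" in S4D buys simplicity, the sketch below shows that with a diagonal state matrix the SSM's convolution kernel reduces to a weighted sum of geometric sequences, computable via a Vandermonde-style product instead of repeated dense matrix powers. The parameter values, and the omission of S4D's actual discretization and initialization, are simplifying assumptions.

```python
import numpy as np

# With diagonal A, the kernel K_k = sum_n C_n * a_n**k * B_n.
d_state, kernel_len = 8, 32
rng = np.random.default_rng(2)
a = np.exp(-rng.uniform(0.1, 1.0, d_state))   # diagonal of A, stable (|a_n| < 1)
B = rng.normal(size=d_state) * 0.1
C = rng.normal(size=d_state) * 0.1

# Vandermonde matrix V[n, k] = a_n ** k, then K = (C * B) @ V.
V = a[:, None] ** np.arange(kernel_len)[None, :]
K = (C * B) @ V

# Slow check with the full (diagonal) matrix: identical kernel.
A_full = np.diag(a)
K_slow = np.array([C @ np.linalg.matrix_power(A_full, k) @ B for k in range(kernel_len)])
assert np.allclose(K, K_slow)
```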

2022-06
[Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model 🔗](https://arxiv.org/abs/2401.09417)

Vision Mamba: A New Challenger to Transformers for Computer Vision?

For the past few years, Vision Transformers (ViTs) have dominated computer vision. By treating images as sequences of patches and applying self-attention, these models have set new benchmarks in image classification, object detection, and semantic segmentation. However, this power comes at a steep computational cost. The self-attention mechanism at the core of Transformers suffers from quadratic complexity. In plain terms, if you double the number of image patches (for example, by increasing resolution), the computation and memory demands don’t just double—they quadruple. This makes high-resolution image processing slow, memory-hungry, and often impractical without specialized hardware or cumbersome architectural workarounds. ...
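A quick numeric illustration of that quadratic growth, assuming a ViT-style 16x16 patch size (the specific numbers are for illustration only):

```python
# Patch counts and relative attention cost for square images.
def num_patches(resolution: int, patch_size: int = 16) -> int:
    return (resolution // patch_size) ** 2

def relative_attention_cost(resolution: int, base: int = 224) -> float:
    # Self-attention cost grows with the square of the number of patches.
    return (num_patches(resolution) / num_patches(base)) ** 2

for res in (224, 448):
    print(res, num_patches(res), relative_attention_cost(res))
# 224 -> 196 patches (1x cost); 448 -> 784 patches, roughly 16x the cost.
```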

2024-01
[Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality 🔗](https://arxiv.org/abs/2405.21060)

Mamba‑2 Explained: The Duality Connecting State‑Space Models and Attention

Transformers dominate many sequence-modeling tasks, but their core self-attention scales quadratically with context length. That design choice makes very long contexts expensive in compute and memory. At the same time, structured state-space models (SSMs) — exemplified by S4 and Mamba — offer linear scaling in sequence length and constant state for autoregressive generation. The two model families have matured along largely separate paths: different mathematics, different optimizations, and different engineering tradeoffs. ...
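The "constant state" claim is easiest to see in code. Below is a minimal diagonal-SSM recurrence (toy dimensions, random parameters; not the paper's SSD algorithm): the hidden state has a fixed size, so per-token compute and memory during generation do not grow with how much context has already been processed, unlike an attention KV cache.

```python
import numpy as np

d_state, d_model, seq_len = 16, 4, 100
rng = np.random.default_rng(0)
A = np.exp(-rng.uniform(0.1, 1.0, d_state))    # stable diagonal transition
B = rng.normal(size=(d_state, d_model)) * 0.1  # input projection
C = rng.normal(size=(d_model, d_state)) * 0.1  # output projection

h = np.zeros(d_state)                          # fixed-size state, never grows
outputs = []
for t in range(seq_len):
    x_t = rng.normal(size=d_model)             # features of the current token
    h = A * h + B @ x_t                        # O(d_state * d_model) per step
    outputs.append(C @ h)                      # output for this step
```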

2024-05
[VMamba: Visual State Space Model 🔗](https://arxiv.org/abs/2401.10166)

VMamba: A New Challenger to CNNs and Transformers in Computer Vision

For the past decade, computer vision has been dominated by two architectural titans: Convolutional Neural Networks (CNNs) and, more recently, Vision Transformers (ViTs). CNNs are celebrated for their efficiency and strong inductive biases toward local patterns, while ViTs, powered by the self-attention mechanism, excel at capturing global relationships in images. However, this power comes at a cost — the self-attention mechanism has quadratic complexity (\(O(N^2)\)) with respect to the number of image patches, making ViTs computationally expensive and slow, especially for high-resolution images common in tasks like object detection and segmentation. ...

2024-01

From Atoms to Applications: Unpacking a Full-Featured 2D Flash Memory Chip

Introduction: The Nanoscale Revolution Waiting to Happen

For over a decade, two-dimensional (2D) materials like graphene and molybdenum disulfide (MoS₂) have been the superstars of materials science. Thinner than a single strand of human DNA, these atomic-scale sheets possess extraordinary electronic properties that promise to revolutionize computing — from ultra-fast transistors to hyper-efficient memory. They represent a potential path to continue the incredible progress of Moore’s Law, pushing beyond the physical limits of silicon. ...