Introduction

Imagine looking down a long stretch of asphalt on a hot summer day. The air shimmers, causing the scenery to wobble, blur, and distort. This phenomenon, known as atmospheric turbulence, is a chaotic combination of blurring and geometric warping caused by temperature variations affecting the refractive index of air. While this “heat haze” might look artistic to the naked eye, it is a nightmare for long-range imaging systems used in surveillance, remote sensing, and astronomy.

Restoring these images is incredibly difficult. Unlike simple motion blur, turbulence varies spatially (different parts of the image distort differently) and temporally (it changes every millisecond). Traditional methods using Convolutional Neural Networks (CNNs) struggle because their receptive fields are too small to capture the large-scale distortions. More recent Transformer-based approaches can capture global context but suffer from quadratic computational complexity, making them too slow and memory-hungry for high-definition video.

In this post, we dive deep into MambaTM, a novel framework presented by researchers from Purdue University. This paper proposes a two-pronged solution: a new deep learning architecture based on Selective State Space Models (Mamba) for linear computational complexity, and a Latent Phase Distortion (LPD) representation that effectively “learns” the physics of turbulence to guide the restoration process.

The Challenge: Why Turbulence is Hard to Fix

To understand why MambaTM is necessary, we must first understand the limitations of current technology. Atmospheric turbulence introduces two distinct types of degradation:

  1. Pixel Displacement (Tilt): The image warps and wobbles.
  2. Blur: High-frequency details are lost.

Mathematically, this is often modeled using Zernike polynomials—a sequence of mathematical functions used to describe wavefront aberrations. A simulator takes a clean image and a set of Zernike coefficients (random numbers describing the distortion) to generate a turbulent image.

\[ I \stackrel{\mathrm{def}}{=} g(J; \mathbf{a}) = \sum_{k=1}^{100} \psi_k \circledast \big( \boldsymbol{\beta}_k \cdot \mathcal{W}(J; \mathcal{T}) \big) + n, \]

Equation 1: The mathematical model for turbulence degradation.

As shown in the equation above, the degraded image \(I\) is a function of the clean image \(J\) and the turbulence parameters \(\mathbf{a}\). The challenge is the inverse problem: given \(I\), finding \(J\).
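To make Equation 1 concrete, here is a small PyTorch sketch of the forward model: warp the clean frame with a tilt field, blur it with a weighted basis of kernels, and add noise. The function name, the tiny basis size, and the Gaussian noise level are illustrative assumptions, not the paper's actual simulator.

```python
import torch
import torch.nn.functional as F

def degrade(clean, flow, psi, beta, noise_std=0.01):
    """Toy version of Eq. 1: warp, apply a weighted blur basis, add noise.

    clean: (1, 1, H, W) clean image J
    flow:  (1, H, W, 2) per-pixel displacement field (the tilt T)
    psi:   (K, 1, k, k) blur basis kernels psi_k
    beta:  (K, 1, H, W) per-pixel mixing weights beta_k
    """
    _, _, H, W = clean.shape
    # Build a sampling grid for the warp operator W(J; T).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0) + flow
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1   # normalise to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    warped = F.grid_sample(clean, grid, align_corners=True)

    # Sum_k psi_k * (beta_k . warped): spatially varying blur as a basis expansion.
    degraded = torch.zeros_like(clean)
    pad = psi.shape[-1] // 2
    for k in range(psi.shape[0]):
        weighted = beta[k:k + 1] * warped
        degraded += F.conv2d(weighted, psi[k:k + 1], padding=pad)

    return degraded + noise_std * torch.randn_like(degraded)
```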

Existing deep learning methods try to reverse this process, but they face a “trilemma” of speed, memory, and performance. Recurrent Neural Networks (RNNs) are fast but unstable. Transformers offer high quality, but their quadratic complexity makes them increasingly slow and memory-hungry as video length grows. This sets the stage for MambaTM, which aims to achieve the high quality of Transformers with the speed of simpler networks.

Innovation 1: Learning Latent Phase Distortion (LPD)

One of the most significant contributions of this paper is not just the restoration network, but how the researchers model the physics of the problem.

The Problem with Zernike Coefficients

Standard physics-based simulators rely on Zernike coefficients to represent phase distortion. A naive approach would be to train a network to estimate these coefficients from a blurry image and then use them to reverse the damage.

However, the researchers identify a critical flaw: Ill-posedness. Different combinations of Zernike coefficients can result in nearly identical visual degradation. If a neural network tries to predict the exact coefficients, it struggles to converge because there isn’t a unique “correct” answer for a given blurry patch. Furthermore, simulating turbulence using these coefficients involves large-kernel convolutions, which are computationally expensive and slow down training.

The Solution: Latent Phase Distortion (LPD)

Instead of predicting the Zernike coefficients directly, the authors propose learning a compressed, “latent” representation of the distortion. They achieve this using a Variational Autoencoder (VAE).

Figure 2: The learning scheme for LPD and ReBlurNet.

As illustrated in Figure 2, the process involves two steps:

  1. Zernike Encoder (The VAE): The network takes the physics-based parameters (Zernike coefficients and kernel size) and compresses them into a latent map (LPD), defined by a mean \(\mu\) and variance \(\sigma^2\). This forces the network to learn a distribution of distortions rather than hard numbers.
  2. ReBlurNet: A decoder network takes the clean image and this new LPD map to reconstruct the turbulent image.

This approach transforms the ill-posed problem into a well-posed one. The LPD map captures the effect of the turbulence rather than the ambiguous coefficients. Crucially, this LPD simulator is 50x faster than standard physics simulators and is fully differentiable, allowing it to be integrated directly into the restoration network’s training loop.
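A minimal sketch of this two-step scheme is shown below, assuming plain convolutional stacks; the module names, channel counts, and latent size are placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ZernikeEncoder(nn.Module):
    """Compresses per-pixel Zernike coefficients into a latent phase distortion (LPD) map."""
    def __init__(self, n_coeffs=100, latent_dim=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(n_coeffs, 64, 3, padding=1), nn.GELU(),
            nn.Conv2d(64, 2 * latent_dim, 3, padding=1),  # predicts mean and log-variance maps
        )

    def forward(self, zernike):
        mu, logvar = self.backbone(zernike).chunk(2, dim=1)
        # Reparameterisation trick: sample the LPD from N(mu, sigma^2).
        lpd = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return lpd, mu, logvar

class ReBlurNet(nn.Module):
    """Decoder that re-applies the degradation described by the LPD to a clean frame."""
    def __init__(self, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + latent_dim, 64, 3, padding=1), nn.GELU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, clean, lpd):
        return self.net(torch.cat([clean, lpd], dim=1))
```

Because the LPD is sampled with the reparameterization trick, gradients pass through it, which is what allows the simulator to sit inside the restoration network's training loop.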

To ensure the learned representation follows a tractable distribution (Gaussian), the training includes a KL-divergence loss:

\[ \mathcal{L}_{KL} = -\frac{0.5}{H \times W} \sum_{i,j} \left( \log(\pmb{\sigma}_{i,j}^{2}) + 1 - \pmb{\mu}_{i,j}^{2} - \pmb{\sigma}_{i,j}^{2} \right) \]

Equation 2: The KL Divergence loss ensures the latent space is well-behaved.
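Assuming the encoder outputs a mean map and a log-variance map, Equation 2 reduces to a one-line function (a sketch using the standard Gaussian KL form):

```python
import torch

def kl_loss(mu, logvar):
    # Average over pixels of -0.5 * (log sigma^2 + 1 - mu^2 - sigma^2), per Equation 2.
    return -0.5 * (logvar + 1 - mu.pow(2) - logvar.exp()).mean()
```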

Innovation 2: MambaTM Architecture

With the physics of degradation effectively modeled via LPD, the researchers tackled the restoration architecture. Video restoration requires analyzing long sequences of frames to distinguish between moving objects and the jittery movement of turbulence.

Why Mamba?

The paper adopts the Selective State Space Model (SSM), popularized by the Mamba architecture. Unlike Transformers, which calculate attention between every pair of pixels (quadratic complexity \(O(N^2)\)), SSMs process data sequentially with a recurrent state (linear complexity \(O(N)\)).

The core mechanism of an SSM is described by these discretized evolution equations:

\[ \pmb{h}_t = \bar{\pmb{A}} \pmb{h}_{t-1} + \bar{\pmb{B}} \pmb{x}_t, \quad \pmb{y}_t = \pmb{C} \pmb{h}_t + \pmb{D} \pmb{x}_t \]

Equation 7: The discretized State Space Model equations.

Here, the hidden state \(h_t\) evolves over time based on the input \(x_t\). The “Selective” part of Mamba means the parameters \(A, B,\) and \(C\) are not static; they change based on the input, allowing the model to selectively remember or forget information. This is crucial for turbulence, where the model needs to “remember” the static background while “forgetting” the random jitter.
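To make Equation 7 concrete, here is a deliberately naive reference loop with input-dependent \(B\), \(C\), and step size, which is what “selective” means in practice. This is a readability sketch only: real Mamba layers replace the Python loop with a parallel, hardware-aware scan, and the dimensions here are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Minimal (non-parallel) selective state space recurrence from Eq. 7."""
    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_model, d_state))  # continuous-time A (negative for stability)
        self.D = nn.Parameter(torch.ones(d_model))
        self.to_B = nn.Linear(d_model, d_state)   # B, C, and the step size all depend
        self.to_C = nn.Linear(d_model, d_state)   # on the input -> "selective"
        self.to_dt = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, length, d_model)
        b, L, d = x.shape
        h = x.new_zeros(b, d, self.A.shape[1])
        ys = []
        for t in range(L):
            xt = x[:, t]                                            # (b, d)
            dt = F.softplus(self.to_dt(xt))                         # per-channel step size
            A_bar = torch.exp(dt.unsqueeze(-1) * self.A)            # discretised A
            B_bar = dt.unsqueeze(-1) * self.to_B(xt).unsqueeze(1)   # discretised B
            h = A_bar * h + B_bar * xt.unsqueeze(-1)                # h_t = A_bar h_{t-1} + B_bar x_t
            y = (h * self.to_C(xt).unsqueeze(1)).sum(-1) + self.D * xt  # y_t = C h_t + D x_t
            ys.append(y)
        return torch.stack(ys, dim=1)
```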

The Network Structure

The MambaTM network is a multi-scale architecture designed to process video frames efficiently.

Figure 1: The MambaTM Network Architecture.

Figure 1 details the complete pipeline:

  1. Multi-Scale Encoder: The video is processed at different resolutions to capture both fine details and broad structures.
  2. Mamba Groups: The core processing units.
  3. LPD Guidance: This is the bridge between the two innovations. The network estimates the LPD (the distortion map) from the input video. This estimated LPD is then injected back into the Mamba blocks to “guide” the restoration (one generic way to do this is sketched after the list).
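The paper injects the estimated LPD into the Mamba blocks; one generic way to condition features on such a map is channel-wise scale-and-shift modulation, sketched below purely as an illustration of guidance, not necessarily the authors' exact mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LPDGuidedBlock(nn.Module):
    """Generic guidance sketch: modulate a block's features with a map derived
    from the estimated LPD (FiLM-style conditioning; illustrative only)."""
    def __init__(self, channels, lpd_dim=8):
        super().__init__()
        self.to_scale_shift = nn.Conv2d(lpd_dim, 2 * channels, 1)

    def forward(self, features, lpd):
        # Resize the LPD map to the feature resolution of this scale.
        lpd = F.interpolate(lpd, size=features.shape[-2:], mode="bilinear", align_corners=False)
        scale, shift = self.to_scale_shift(lpd).chunk(2, dim=1)
        return features * (1 + scale) + shift
```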

Tackling 3D Video with 1D Scans

Mamba is inherently a 1D sequence model (like processing text). Video is 3D (Time \(\times\) Height \(\times\) Width). To bridge this gap, the authors use three distinct scanning mechanisms to flatten the video data without losing spatio-temporal relationships (a toy sketch of these flattenings follows the list):

  • Space-First Scan (SFMB): Scans mostly along spatial dimensions, preserving local image features.
  • Time-First Scan (TFMB): Scans along the temporal axis, critical for analyzing how turbulence changes over time.
  • Local Hilbert Scan (LHMB): Uses a Hilbert curve (a space-filling curve) to order pixels. This is a clever addition because standard raster scans (row by row) separate vertical neighbors; a Hilbert scan keeps spatially adjacent pixels close together when flattened to 1D.
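Below is a toy illustration of these three flattenings, assuming a (batch, channels, time, height, width) tensor and a precomputed Hilbert index; the function names are illustrative, not taken from the paper's code.

```python
import torch

def space_first_flatten(video):
    """(B, C, T, H, W) -> (B, C, T*H*W): walk each frame's pixels before moving to the next frame."""
    B, C, T, H, W = video.shape
    return video.reshape(B, C, T * H * W)

def time_first_flatten(video):
    """(B, C, T, H, W) -> (B, C, H*W*T): walk each pixel through time before moving to the next pixel."""
    B, C, T, H, W = video.shape
    return video.permute(0, 1, 3, 4, 2).reshape(B, C, H * W * T)

def hilbert_flatten(video, hilbert_index):
    """Reorder each frame's pixels along a precomputed Hilbert-curve index (a LongTensor
    of length H*W) so that 1D neighbours tend to stay 2D neighbours."""
    B, C, T, H, W = video.shape
    frames = video.reshape(B, C, T, H * W)[..., hilbert_index]  # permute pixels within each frame
    return frames.reshape(B, C, T * H * W)
```

The value of the Hilbert ordering shows up in the last function: neighbours in the flattened 1D sequence remain neighbours in the 2D frame far more often than with a plain raster reshape.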

Table 6 in the paper (shown below) highlights the ablation study, proving that combining these scan orders yields the best performance compared to using any single one in isolation.

Table 6: Ablation study showing the effectiveness of combining different scan orders and LPD guidance.

Joint Training and Loss Functions

The training strategy is a joint optimization problem. The network tries to do two things simultaneously:

  1. Restoration: Produce a clean image \(\hat{J}\).
  2. Re-degradation: Estimate the LPD and use it to re-create the turbulent image from the clean estimate.

This “re-degradation” loop acts as a self-supervised consistency check. If the network understands the physics correctly, it should be able to simulate the turbulence it just removed.

The total loss function combines restoration loss (Pixel-wise + Perceptual) and the re-degradation loss:

\[ \mathcal{L} = \mathcal{L}_{restore} + \alpha \mathcal{L}_{returb} \]

Equation 6: The total loss function combining restoration and re-degradation objectives.

The restoration loss ensures the output looks like the ground truth:

\[ \mathcal{L}_{restore}(\hat{\pmb{J}}, \pmb{J}) = \mathcal{L}_{c}(\hat{\pmb{J}}, \pmb{J}) + \alpha_{p} \mathcal{L}_{p}(\hat{\pmb{J}}, \pmb{J}) \]

Equation 4: Restoration loss utilizing Charbonnier and Perceptual loss.

While the re-degradation loss ensures the estimated physics parameters are accurate:

\[ \mathcal{L}_{returb} = \mathcal{L}_{c}(\hat{I}_{tilt}, I_{tilt}) + \mathcal{L}_{c}(\hat{I}_{turb}, I) + \alpha_{k} \mathcal{L}_{KL} \]

Equation 5: Re-degradation loss.
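Putting Equations 2, 4, 5, and 6 together, the full objective can be sketched as below. The Charbonnier form is standard, but the loss weights are illustrative defaults, and "perceptual" stands in for whatever feature-space distance (such as a VGG loss) is used.

```python
import torch

def charbonnier(x, y, eps=1e-3):
    """L_c: a smooth L1 penalty, robust to outliers."""
    return torch.sqrt((x - y) ** 2 + eps ** 2).mean()

def total_loss(J_hat, J, I_tilt_hat, I_tilt, I_turb_hat, I,
               mu, logvar, perceptual, alpha=1.0, alpha_p=0.1, alpha_k=1e-4):
    """Sketch of Eqs. 4-6; the weights alpha, alpha_p, alpha_k are placeholders, not the paper's values."""
    restore = charbonnier(J_hat, J) + alpha_p * perceptual(J_hat, J)                        # Eq. 4
    kl = -0.5 * (logvar + 1 - mu.pow(2) - logvar.exp()).mean()                              # Eq. 2
    returb = charbonnier(I_tilt_hat, I_tilt) + charbonnier(I_turb_hat, I) + alpha_k * kl    # Eq. 5
    return restore + alpha * returb                                                         # Eq. 6
```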

Experiments and Results

The researchers compared MambaTM against state-of-the-art methods on both synthetic and real-world datasets. The results demonstrate superiority in three key areas: reconstruction quality, speed, and generalization.

Quantitative Performance

In turbulence mitigation, standard metrics like PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity) are used.

Table 3: Performance on dynamic scene datasets showing MambaTM’s speed advantage.

As seen in Table 3, MambaTM achieves higher PSNR scores than competing methods like TMT and DATUM. However, the most striking column is FPS (Frames Per Second). MambaTM runs at 55.4 FPS, roughly 1.7x the speed of DATUM (32.7 FPS) and more than 30x faster than Transformer-based approaches like TMT (1.50 FPS). This makes MambaTM the first viable candidate for real-time high-resolution turbulence mitigation.

We can also look at the computational cost more closely:

Table 9: Computational cost comparison.

Table 9 confirms that MambaTM has significantly lower latency (0.030s) compared to other methods, validating the efficiency of the linear-complexity State Space Models.

Qualitative Results: Text and Objects

Numbers are useful, but visual results tell the real story. In text recognition tasks (a common benchmark for turbulence removal), MambaTM recovers legibility where other models fail.

Figure 3: Qualitative comparison on text restoration.

In Figure 3, notice the bottom row. The input is barely readable. While methods like TMT and DATUM improve it, MambaTM produces the sharpest, most distinct characters with the fewest artifacts (like color noise or ringing).

The method also shines in dynamic scenes involving moving objects, which are notoriously difficult because the network must distinguish between object motion and turbulence “motion.”

Figure 9: Comparison on real-world dynamic scenes.

In Figure 9, comparing the crop (H) of the blue car, MambaTM (d) restores the sharp edges of the vehicle and the pedestrian significantly better than DATUM (c), which leaves residual blur.

Understanding the Learned LPD

To verify that the LPD is actually learning meaningful physics, the authors visualized the latent maps.

Figure 7: Visualization of the LPD maps compared to Zernike coefficients.

In Figure 7, we see the Zernike-based simulation on top and the LPD-based simulation on the bottom. The visual similarity is striking. The heatmaps (d, e, f) show that the LPD captures the spatial intensity of the turbulence (where the blur is strongest) effectively, validating the use of the VAE to compress degradation physics.

Conclusion and Implications

The MambaTM paper represents a significant step forward in computational imaging. It successfully marries deep learning with physical principles in a way that is both accurate and efficient.

Key Takeaways:

  1. Efficiency matters: By using Selective State Space Models (Mamba), the authors achieved linear complexity, unlocking real-time performance for video restoration.
  2. Physics-aware learning: The Latent Phase Distortion (LPD) representation proves that we don’t need to force neural networks to predict exact physics parameters (like Zernike coefficients). Learning a latent representation that behaves like the physics is often more stable and effective.
  3. Global Context: Through novel scanning mechanisms (Space, Time, Hilbert), the model effectively processes 3D video data as 1D sequences without losing the vital spatiotemporal connections.

This work opens the door for real-time applications in long-range surveillance, autonomous navigation in hot environments, and potentially other inverse problems where the physical degradation model is complex or ill-posed. By moving away from heavy Transformers and towards efficient State Space Models, MambaTM sets a new baseline for high-performance video restoration.