Introduction: The Problem of Missing Time
Imagine you’re a doctor monitoring a patient’s heart with an ECG, but the sensor glitches and you lose a few critical seconds of data. Or perhaps you’re a financial analyst tracking stock prices and your data feed suddenly has gaps. Missing data is not just inconvenient—it’s a pervasive issue in real-world applications. It can derail machine learning models, introduce bias, and lead to flawed conclusions.
For time series data—where temporal continuity and ordering are crucial—these gaps are particularly damaging. Most machine learning algorithms can’t tolerate missing values, so the usual remedy is imputation: filling in missing entries with plausible estimates. But what makes a good estimate? A simple average might wash out important peaks, while naive interpolation could miss underlying trends entirely. Poor imputations can corrupt downstream analysis.
The paper “Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models” introduces a new model, SSSD (Structured State Space Diffusion), designed to meet this challenge. It unites two powerful modern deep learning technologies:
- Diffusion Models — State-of-the-art generative models that excel at creating realistic data by reversing a gradual noise-adding process.
- Structured State Space Models (S4) — An efficient architecture for capturing long-range dependencies in sequences, often outperforming RNNs and Transformers.
By fusing these, the authors produce a model that achieves state-of-the-art results across diverse benchmarks, and that can excel even in the hardest scenarios—such as imputing large contiguous blocks of missing data—where prior methods have failed completely.
This article explains how SSSD works, from its foundations through its architecture and training strategy, and showcases experimental results demonstrating why it represents a major step forward in time series modeling.
Background: The Building Blocks of SSSD
Before diving into the architecture, it’s essential to understand the concepts underpinning SSSD: the types of missingness, the principles behind diffusion models, and the strengths of state space models.
Scenarios of Missingness
Not all missing data patterns pose the same difficulty. The paper focuses on four scenarios, illustrated below.
Figure 1: Example missingness scenarios. Blue regions are known data; grey regions denote missing points to be imputed. Light/dark green bands represent prediction intervals from multiple imputations; orange is a sample imputation.
- Random Missing (RM): Individual points are missing at random across the series—typically the easiest case, as nearby values can guide estimates.
- Random Block Missing (RBM): Contiguous blocks of missing data, varying by channel.
- Blackout Missing (BM): One contiguous block missing across all channels—a severe challenge, with no parallel channel data to leverage.
- Time Series Forecasting (TF): A special case of BM where the missing block lies at the end of the sequence—the task is to predict future points.
SSSD targets all of these scenarios, with particular strength in the BM and TF cases.
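To make the four scenarios concrete, here is a minimal sketch of how such masks might be generated for a (channels × length) series. The function and parameter names are illustrative, not drawn from the paper's code:

```python
import numpy as np

def make_mask(scenario, n_channels, length, missing_ratio=0.2, rng=None):
    """Return a binary mask (1 = observed, 0 = missing) for one series.

    Illustrative sketch of the four missingness scenarios; the names
    and block placement are assumptions, not the paper's code.
    """
    rng = rng or np.random.default_rng()
    mask = np.ones((n_channels, length), dtype=np.float32)
    n_missing = int(missing_ratio * length)

    if scenario == "RM":       # random points, independently per channel
        for c in range(n_channels):
            idx = rng.choice(length, n_missing, replace=False)
            mask[c, idx] = 0.0
    elif scenario == "RBM":    # one contiguous block per channel, position varies
        for c in range(n_channels):
            start = rng.integers(0, length - n_missing + 1)
            mask[c, start:start + n_missing] = 0.0
    elif scenario == "BM":     # the same contiguous block across all channels
        start = rng.integers(0, length - n_missing + 1)
        mask[:, start:start + n_missing] = 0.0
    elif scenario == "TF":     # blackout at the end = forecasting
        mask[:, length - n_missing:] = 0.0
    return mask
```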
Diffusion Models: Generating Data by Denoising
Diffusion models generate data by learning to gradually remove noise from a signal. Conceptually:
- Start with a clean signal \(x_0\).
- Forward process: Incrementally add Gaussian noise over \(T\) steps until the signal becomes pure noise \(x_T\). Formally, each step is \( q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I) \), where \(\beta_t\) is a fixed noise schedule.
- Backward process: Train a model to reverse this, step by step, transforming \(x_T\) back to \(x_0\): \( p_{\theta}(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \sigma_t^2 I) \).
A key simplification is to have the network \(\epsilon_{\theta}(x_t, t)\) predict the noise \(\epsilon\) that was mixed into the noisy input, trained with a mean squared error loss:
\[ L = \min_{\theta} \mathbb{E}_{x_0 \sim \mathcal{D},\, \epsilon \sim \mathcal{N}(0, I),\, t \sim \mathcal{U}(1,T)} \|\epsilon - \epsilon_{\theta} (\sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\, \epsilon, t)\|_2^2, \]
where \(\alpha_t = \prod_{i=1}^{t} (1-\beta_i)\) is the cumulative product of the noise schedule.

For imputation, conditional diffusion is used: the network additionally receives the known data and a mask marking the missing points as conditioning signals, guiding the denoising so that the gaps are filled consistently with the observed context.
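The objective above translates almost directly into a training step. Below is a minimal PyTorch sketch; `model` stands for any noise-prediction network \(\epsilon_{\theta}\) and `alpha_bar` for a precomputed tensor of the cumulative products \(\alpha_t\); both are assumptions for illustration.

```python
import torch

def diffusion_training_step(model, x0, alpha_bar):
    """One DDPM-style training step: predict the added noise (sketch).

    x0:        clean batch, shape (B, C, L)
    alpha_bar: cumulative products alpha_t, shape (T,)
    """
    B = x0.shape[0]
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)   # sample a step uniformly (0-indexed)
    eps = torch.randn_like(x0)                        # eps ~ N(0, I)
    a = alpha_bar[t].view(B, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps        # noised input
    loss = ((eps - model(x_t, t)) ** 2).mean()        # MSE on the noise
    return loss
```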
Structured State Space Models (S4): Mastering Long Sequences
SSMs map an input sequence \(u(t)\) to an output \(y(t)\) via a latent state vector \(x(t)\):
\[ x'(t) = A x(t) + B u(t), \quad y(t) = C x(t) + D u(t). \]

With a HiPPO-based initialization of \(A\), S4 achieves strong memory over long contexts. Once discretized, these equations can be unrolled into a convolution that is efficiently parallelizable with FFTs, combining RNN-like temporal modeling with CNN-like efficiency. This makes S4 well suited to long-sequence tasks like time series imputation within a diffusion framework.
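The convolutional view fits in a few lines of code. The toy sketch below discretizes \((A, B, C)\) with a bilinear transform, materializes the impulse-response kernel naively, and convolves via FFT. Real S4 never forms the powers of \(A\) explicitly; it exploits structure in \(A\) to compute the kernel efficiently, and all names here are illustrative.

```python
import numpy as np

def ssm_conv(A, B, C, u, dt=1.0):
    """Toy state space model applied as a convolution (sketch).

    A: (N, N), B: (N, 1), C: (1, N), u: (L,).
    The skip term D u(t) is omitted for brevity.
    """
    N = A.shape[0]
    L = len(u)
    I = np.eye(N)
    # Bilinear discretization of (A, B)
    Ad = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
    Bd = np.linalg.solve(I - dt / 2 * A, dt * B)
    # Impulse-response kernel: K[k] = C @ Ad^k @ Bd
    K = np.zeros(L)
    x = Bd
    for k in range(L):
        K[k] = (C @ x).item()
        x = Ad @ x
    # Causal convolution via FFT (zero-padded to avoid wraparound)
    n = 2 * L
    y = np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(u, n), n)[:L]
    return y
```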
The Core Method: Inside the SSSD Model
SSSD embeds S4 layers within a conditional diffusion framework, enhancing the denoising network’s ability to capture long-term dependencies.
The SSSDS4 Architecture
Based on the DiffWave audio model, SSSDS4 swaps DiffWave’s dilated convolutional layers for S4 layers.
Figure 2: SSSDS4 architecture. Pink blocks are S4 layers integrated into residual diffusion blocks.
Flow of information (a code sketch follows this list):
- Inputs:
- Noisy sample \(\bar{x}\) at timestep \(t\).
- Conditioning \(C\): known values (with zeros in gaps) concatenated with the binary mask.
- Timestep embedding (via fully connected layers) indicating the noise level.
- Residual Blocks with S4:
- A first S4 layer processes the input after the diffusion-step embedding is added, modeling long-range temporal structure in the noisy signal.
- A second S4 layer is applied after the conditioning is merged in, aligning the generated signal with the known values and the mask.
- Output: Noise prediction \(\epsilon_{\theta}\), used for loss computation or sampling.
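A hedged sketch of one such residual block in PyTorch is shown below. It follows the flow just described: an S4 layer after the timestep embedding is added, a second S4 layer after the conditioning is merged, and DiffWave-style gated activations with residual and skip outputs. The real `S4Layer` comes from the authors' repository; the placeholder here only keeps the sketch self-contained, and all other names are illustrative.

```python
import torch
import torch.nn as nn

class S4Layer(nn.Module):
    """Stand-in for a real S4 layer (see the authors' repo);
    a placeholder so the sketch is self-contained."""
    def __init__(self, features, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(features)
    def forward(self, x):                                # x: (B, C, L)
        return self.norm(x.transpose(1, 2)).transpose(1, 2)

class SSSDResidualBlock(nn.Module):
    """Sketch of one SSSD-style residual block (names illustrative)."""
    def __init__(self, res_channels, cond_channels, seq_len, emb_dim=512):
        super().__init__()
        self.t_proj = nn.Linear(emb_dim, res_channels)   # timestep embedding -> channels
        self.mid = nn.Conv1d(res_channels, 2 * res_channels, 1)
        self.s4_1 = S4Layer(2 * res_channels, seq_len)   # first S4: temporal structure
        self.cond = nn.Conv1d(cond_channels, 2 * res_channels, 1)
        self.s4_2 = S4Layer(2 * res_channels, seq_len)   # second S4: after conditioning
        self.out = nn.Conv1d(res_channels, 2 * res_channels, 1)

    def forward(self, x, t_emb, c):
        # x: (B, res, L)   t_emb: (B, emb_dim)   c: (B, cond, L)
        h = x + self.t_proj(t_emb).unsqueeze(-1)         # broadcast embedding over time
        h = self.s4_1(self.mid(h))                       # S4 over the noisy signal
        h = self.s4_2(h + self.cond(c))                  # S4 after merging conditioning
        gate, filt = h.chunk(2, dim=1)
        h = torch.sigmoid(gate) * torch.tanh(filt)       # gated activation (DiffWave-style)
        res, skip = self.out(h).chunk(2, dim=1)
        return (x + res) / 2 ** 0.5, skip                # residual and skip outputs
```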
Training Strategy: Focus on the Unknown
Two noise application strategies were tested:
- \(D_0\): Noise the entire signal; the loss combines reconstruction of the known parts and imputation of the unknown ones.
- \(D_1\): Noise only the unknown regions; the known parts remain clean and enter the model only as conditioning.
\(D_1\) lets the model focus solely on imputing the missing data, and experiments show it consistently outperforms \(D_0\). The sketch below contrasts the two strategies.
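The difference comes down to a single masking step (mask convention: 1 = observed, 0 = missing, as in the earlier snippet; everything else is illustrative):

```python
import torch

def noise_input(x0, eps, alpha_bar_t, mask, strategy="D1"):
    """Apply forward-process noise under the D0 or D1 strategy (sketch).

    alpha_bar_t: scalar tensor alpha_t for the sampled diffusion step
    mask:        1 where the value is observed, 0 where it must be imputed
    """
    x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps
    if strategy == "D1":
        # D1: noise only the unknown regions; known values stay clean
        x_t = mask * x0 + (1 - mask) * x_t
    return x_t  # D0: the whole signal stays noised

# Under D1 the loss is computed only on the unknown positions, e.g.:
# loss = (((eps - model(x_t, t, cond)) * (1 - mask)) ** 2).sum() / (1 - mask).sum()
```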
Experiments and Results: Putting SSSD to the Test
SSSD was benchmarked for imputation and forecasting against strong baselines across diverse datasets.
ECG Imputation: Qualitative Leap
On the PTB-XL ECG dataset, SSSDS4 beat baselines decisively—especially in BM scenarios.
Table 1: PTB-XL imputation (MAE/RMSE). SSSDS4 yields markedly better scores in RBM and BM.
Visual comparisons reveal the magnitude of improvement.
Figure 3: BM imputation for a healthy ECG lead. CSDI output is unrealistic; SSSDS4 closely matches ground truth.
CSDI fails entirely in the BM setting, producing nonsensical output, whereas SSSDS4 recreates realistic waveforms, including the timing and amplitude of the vital QRS complex.
Pushing the Limits: High Sparsity & High Dimensionality
On the MuJoCo benchmark (with up to 90% of values missing), SSSDS4 excelled under extreme sparsity, cutting MSE by more than 50% versus the best baseline at 90% missingness.
Table 2: MuJoCo imputation MSE. SSSDS4 dominates the hardest setting.
On the high-dimensional Electricity dataset (370 channels), using a channel-splitting strategy, SSSDS4 reduced errors by more than 50% relative to top baselines such as SAITS.
Table 3: Electricity RM imputation. Outstanding gains at 10% and 30% missingness.
Forecasting Performance
Since forecasting is simply the BM scenario with the missing block at the end of the sequence, SSSDS4 applies to it directly, as the sketch below shows.
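Concretely, producing a forecast amounts to building a TF-style mask and calling the same imputation sampler. In the hypothetical helper below, `impute` stands in for the reverse-diffusion sampling routine:

```python
import numpy as np

def forecast(model, history, horizon, impute):
    """Forecast by treating the future as a blackout block (sketch).

    history: observed series, shape (channels, L_obs)
    horizon: number of future steps to predict
    impute:  reverse-diffusion sampler, assumed signature
             impute(model, x, mask) -> completed series
    """
    C, L_obs = history.shape
    L = L_obs + horizon
    x = np.zeros((C, L), dtype=history.dtype)
    x[:, :L_obs] = history                      # known past
    mask = np.zeros((C, L), dtype=np.float32)
    mask[:, :L_obs] = 1.0                       # the future block is "missing"
    return impute(model, x, mask)[:, L_obs:]    # return only the forecast
```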
On PTB-XL ECG TF:
Figure 4: Forecasting on ECG. SSSD variants produce tighter uncertainty and match signal trends better.
On Solar, SSSDS4 cut MSE by 27% compared to the strongest baseline.
Table 4: Solar forecasting MSE. SSSDS4 outperforms specialized baselines.
On long-horizon ETTm1, SSSDS4 rivaled or beat specialized models like Informer and Autoformer across multiple forecast lengths.
Table 5: ETTm1 forecasting (MAE/MSE). SSSDS4 shows robust long-horizon ability.
Conclusion and Implications
By merging the generative strengths of diffusion models with the long-sequence mastery of S4 layers, SSSDS4 delivers:
- State-of-the-art performance: Consistently exceeds top models across datasets and missingness types.
- Mastery of blackout scenarios: Produces realistic outputs where prior methods fail.
- Architectural advantage: S4 layers capture essential long-term dependencies.
- Training efficiency: The \(D_1\) noise-focus strategy significantly boosts results.
SSSD offers a robust, general-purpose framework for sequential data modeling, ready for deployment in healthcare, finance, climate science, and other domains where data integrity is critical. It doesn't just fill gaps; it reconstructs the full picture with fidelity.
The authors have released the code at https://github.com/AI4HealthUOL/SSSD, inviting further exploration and application of this promising approach.