Video restoration is a classic computer vision problem with a modern twist. We all have footage—whether it’s old family home movies, low-quality streams, or AI-generated clips—that suffers from blur, noise, or low resolution. The goal of Generic Video Restoration (VR) is to take these low-quality (LQ) inputs and reconstruct high-quality (HQ) outputs, recovering details that seem lost to time or compression.
Recently, diffusion models have revolutionized this field. By treating restoration as a generative task, they can hallucinate realistic textures that traditional methods blur out. However, this power comes at a steep price: computational cost.
Most diffusion-based video models are trapped by the resolution they were trained on. To handle high-resolution or long videos, they rely on “patch-based sampling”—chopping the video into overlapping tiles, processing them individually, and stitching them back together. This is incredibly slow. For example, existing state-of-the-art models like VEnhancer can take over 6 minutes to process just 1 second of video at HD resolution.
Enter SeedVR.
In a new paper titled “SeedVR: Seeding Infinity in Diffusion Transformer Toward Generic Video Restoration,” researchers propose a novel architecture that breaks these barriers. SeedVR is a Diffusion Transformer (DiT) designed to handle videos of arbitrary length and resolution without the need for slow, overlapping patches.

As shown in Figure 1, SeedVR achieves a “sweet spot”: it scores highest on perceptual quality metrics (DOVER) while maintaining inference speeds faster than models with significantly fewer parameters. In this post, we will deconstruct how SeedVR achieves this by redesigning the attention mechanism and the video autoencoder.
The Core Problem: Why is Video Restoration So Slow?
To understand SeedVR’s contribution, we first need to understand the bottleneck in current diffusion models.
Most modern generative models use an architecture called the U-Net or, more recently, the Diffusion Transformer (DiT). These architectures rely heavily on Self-Attention, a mechanism that allows the model to look at every pixel in a video frame to understand the context.
The problem is that standard “Full Attention” has quadratic complexity in the number of tokens. Double the height and width of a frame and you get four times as many tokens, so the attention cost grows roughly sixteen-fold. Because of this, models are trained on fixed, small crops (e.g., \(256 \times 256\) or \(512 \times 512\)).
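To make that scaling concrete, here is a back-of-the-envelope sketch (the patch size and resolutions are illustrative assumptions, not figures from the paper):

```python
# Rough self-attention cost scaling with resolution (illustrative numbers only).
def attention_pairs(height, width, patch=16):
    """Number of token pairs full attention must score for one frame."""
    tokens = (height // patch) * (width // patch)
    return tokens * tokens  # every token attends to every other token

base = attention_pairs(512, 512)    # a typical training-crop resolution
hd = attention_pairs(1024, 1024)    # height and width doubled

print(hd / base)  # ~16x more work: 4x the tokens, squared by full attention
```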
When you try to use these models on a real-world \(1080p\) video, you cannot feed the whole frame in at once—you run out of GPU memory. The industry standard solution is Tiled Sampling:
- Cut the large video into small overlapping cubes (patches).
- Process each cube separately.
- Blend them together.
To avoid visible seams between cubes, you need a large overlap (often 50%). This means you are processing the same pixels multiple times, drastically slowing down inference.
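Here is a minimal sketch of what tiled sampling looks like in code, assuming a simple spatial tiling with 50% overlap and uniform blending (the tile size, stride, and blending scheme are illustrative, not the exact settings of VEnhancer or any specific model):

```python
import numpy as np

def tiled_restore(frame, restore_fn, tile=256, overlap=128):
    """Process a large frame in overlapping tiles and blend the results.

    `restore_fn` stands in for a diffusion model that only accepts
    tile-sized inputs.
    """
    h, w, c = frame.shape
    out = np.zeros_like(frame, dtype=np.float64)
    weight = np.zeros((h, w, 1), dtype=np.float64)
    stride = tile - overlap

    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            y1, x1 = min(y + tile, h), min(x + tile, w)
            patch = restore_fn(frame[y:y1, x:x1])  # run the model on one tile
            out[y:y1, x:x1] += patch               # accumulate overlapping outputs
            weight[y:y1, x:x1] += 1.0              # count how often each pixel was covered

    return out / weight                            # average where tiles overlap
```

With 50% overlap in both dimensions, interior pixels are restored four times, which is exactly the redundancy SeedVR is designed to eliminate.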
The Solution: SeedVR Architecture
The researchers behind SeedVR took a different approach. Instead of using full attention (which limits resolution) or tiled sampling (which limits speed), they redesigned the fundamental building blocks of the network.
The architecture consists of two main innovations:
- Swin-MMDiT: A transformer block using “Shifted Window” attention to handle arbitrary resolutions.
- Causal Video VAE: A specialized autoencoder that compresses video more efficiently than previous methods.
Let’s break these down.
1. Swin-MMDiT: Handling Infinity with Windows
The backbone of SeedVR is the MMDiT (Multi-Modal Diffusion Transformer), an architecture popularized by Stable Diffusion 3. It processes visual data and text prompts (captions) simultaneously.
However, standard MMDiT uses full attention. SeedVR replaces this with Window Attention.
The Window Concept
Instead of calculating attention across the entire video frame (which is expensive), the model divides the video features into non-overlapping windows and computes attention only within each window. This changes the complexity from quadratic in the total number of tokens to linear (attention is still quadratic, but only inside each fixed-size window), keeping the cost manageable at any resolution.
But there is a catch: if windows never interact, the model can’t see the “big picture,” leading to blocky artifacts.
The “Shifted” Mechanism
To fix this, SeedVR adopts a strategy inspired by the Swin Transformer: Shifted Windows.
- Layer N: The image is divided into regular grid windows.
- Layer N+1: The grid is shifted by half the window size.
This shifting ensures that a pixel at the edge of a window in one layer becomes the center of a window in the next, allowing information to propagate across the entire image over multiple layers.
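A minimal 2D sketch of this alternation, assuming a toy feature map and window size (SeedVR's actual blocks also span the time axis and use much larger windows):

```python
import numpy as np

def window_partition(feat, win=4):
    """Split an (H, W, C) feature map into non-overlapping win x win windows."""
    h, w, c = feat.shape
    feat = feat.reshape(h // win, win, w // win, win, c)
    return feat.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, c)

def shifted(feat, win=4):
    """Cyclically shift by half a window so the next layer's windows
    straddle the previous layer's window borders."""
    return np.roll(feat, shift=(-win // 2, -win // 2), axis=(0, 1))

feat = np.random.randn(16, 16, 8)              # toy feature map
windows_even = window_partition(feat)           # layer N: regular grid
windows_odd = window_partition(shifted(feat))   # layer N+1: shifted grid
print(windows_even.shape)                       # (16, 4, 4, 8): attention runs per window
```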

Seeding Infinity: Large Windows and RoPE
As illustrated in Figure 2, SeedVR makes two critical adjustments to the standard Swin design to adapt it for generative video:
- Massive Windows: While standard image classifiers use small windows (e.g., \(8 \times 8\)), SeedVR uses a massive window size of \(64 \times 64\) in the latent space. This provides a huge receptive field, allowing the model to generate coherent textures and structures without needing global attention.
- 3D Rotary Positional Embeddings (RoPE): When you process a video of arbitrary size, the windows at the boundaries won’t always be perfect \(64 \times 64\) squares (e.g., the edge of a \(1080p\) frame). Standard learnable position embeddings fail here because they expect fixed sizes. SeedVR uses RoPE, which encodes position mathematically based on rotation. This allows the model to handle “variable-sized windows” at the edges naturally, enabling the processing of any resolution without padding or cropping.
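To see why rotary embeddings side-step the fixed-size problem, here is a minimal 1D RoPE sketch (SeedVR uses a 3D variant over time, height, and width; the base frequency below is the common default from the RoPE literature, not necessarily the paper's setting):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate feature pairs by angles proportional to absolute position.

    Because position enters only through these rotations, the same function
    works for any sequence length or window shape -- there is no learned
    table that would break on variable-sized edge windows.
    """
    dim = x.shape[-1]
    freqs = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    angles = positions[:, None] * freqs[None, :]            # (N, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(10, 64)                  # 10 tokens, 64-dim queries
q_rot = rope(q, positions=np.arange(10))     # works identically for 7, 10, or 1000 tokens
```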
2. Causal Video VAE: Compressing Time and Space
Before the video even reaches the Swin-MMDiT, it must be compressed. Diffusion models operate in a “latent space”—a compressed representation of the video—to save memory.
Existing methods typically take a Variational Autoencoder (VAE) trained on images (like the Stable Diffusion VAE) and “inflate” it to handle video. This is inefficient because it doesn’t compress the time dimension effectively.
SeedVR introduces a Causal Video VAE trained from scratch.

Why “Causal”?
“Causal” means the model only looks at past frames to generate the current one, never the future. This allows the model to process videos of infinite length by handling them chunk by chunk, without needing to know when the video ends.
Temporal Compression
As shown in Figure 3, this VAE compresses the video spatially by 8x (standard for images) but also temporally by 4x. This means a sequence of 4 frames is compressed into a single latent representation.
By reducing the amount of data the main Diffusion Transformer needs to process by a factor of 4, SeedVR achieves significant speed gains.
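A quick shape calculation shows what these compression factors buy (the latent channel count below is an illustrative assumption, not a figure from the paper):

```python
def latent_shape(frames, height, width, t_down=4, s_down=8, latent_channels=16):
    """Rough latent shape after the VAE: 4x temporal and 8x spatial compression.

    latent_channels is illustrative, not the paper's exact figure.
    """
    t = -(-frames // t_down)              # ceiling division over time
    h, w = height // s_down, width // s_down
    return t, h, w, latent_channels

print(latent_shape(21, 720, 1280))        # a 21-frame 720p clip -> (6, 90, 160, 16)
```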
The performance of this new VAE is substantial. Looking at the comparison below (Table 2), the SeedVR VAE achieves the lowest rFVD (reconstruction Fréchet Video Distance), a metric measuring how closely the reconstructed video matches the original.

Large-Scale Training Strategies
Building the architecture is only half the battle. Training a 2.48-billion-parameter model for video restoration requires sophisticated data strategies.
Mixed Image and Video Training
Video is both more expensive to train on and scarcer than image data. SeedVR is therefore trained on a massive mixed dataset of:
- 10 Million Images: Providing diversity in textures and objects.
- 5 Million Videos: Providing motion dynamics.
Because the Swin-MMDiT uses window attention, it can seamlessly switch between processing static images (treating them as 1-frame videos) and actual video clips during training.
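A sketch of the bookkeeping this implies, assuming channels-last arrays (purely illustrative):

```python
import numpy as np

def to_video_batch(sample):
    """Treat a single image as a 1-frame video so the same windowed
    transformer consumes both kinds of training data."""
    if sample.ndim == 3:              # (H, W, C) image
        return sample[None, ...]      # -> (1, H, W, C) "video"
    return sample                     # already (T, H, W, C)

image = np.random.randn(256, 256, 3)
clip = np.random.randn(9, 256, 256, 3)
print(to_video_batch(image).shape, to_video_batch(clip).shape)  # (1, 256, 256, 3) (9, 256, 256, 3)
```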
Precomputing Latents
Encoding high-resolution videos into latent space takes time. If done on the fly during training, the GPU spends much of its time compressing video rather than learning the diffusion task. The researchers instead pre-computed the latent representations and text embeddings for their dataset, reporting a 4x speedup in training.
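A sketch of the offline pass this describes; `vae_encode` and `text_encode` are hypothetical placeholders for whatever encoders the pipeline actually uses:

```python
import os
import numpy as np

def precompute(dataset, vae_encode, text_encode, out_dir="latents"):
    """One-time offline pass: cache VAE latents and caption embeddings so the
    diffusion training loop never touches raw pixels or the text encoder."""
    os.makedirs(out_dir, exist_ok=True)
    for i, (video, caption) in enumerate(dataset):
        np.savez(
            os.path.join(out_dir, f"{i:08d}.npz"),
            latent=vae_encode(video),    # compressed video representation
            text=text_encode(caption),   # prompt embedding
        )
```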
Progressive Training
Instead of training on full-resolution videos immediately, the model is trained progressively:
- Start with short, low-res clips (\(5 \text{ frames} \times 256 \times 256\)).
- Move to medium clips (\(9 \text{ frames} \times 512 \times 512\)).
- Finish with long, high-res clips (\(21 \text{ frames} \times 768 \times 768\)).
This curriculum allows the model to learn basic concepts quickly before tackling fine details.
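Written as a schedule, the curriculum might look like the sketch below (only the clip shapes come from the paper; the helper and everything else is illustrative):

```python
# Progressive training schedule: each stage uses longer, higher-resolution clips.
CURRICULUM = [
    {"frames": 5,  "height": 256, "width": 256},   # stage 1: cheap, fast iterations
    {"frames": 9,  "height": 512, "width": 512},   # stage 2: medium clips
    {"frames": 21, "height": 768, "width": 768},   # stage 3: long, high-res clips
]

def crop_for_stage(stage, video):
    """Trim a (T, H, W, C) training clip to the current stage's shape (sketch only)."""
    s = CURRICULUM[stage]
    return video[: s["frames"], : s["height"], : s["width"]]
```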
Experimental Results
How does SeedVR perform against the competition? The researchers tested the model on multiple benchmarks, including synthetic datasets (where degradation is artificially added) and real-world datasets (videos with natural low quality).
Quantitative Analysis
The table below summarizes the results. SeedVR achieves the best DOVER and LPIPS scores across almost all datasets.
- LPIPS (Lower is better): Measures perceptual similarity. Low scores mean the restoration looks like the ground truth to the human eye.
- DOVER (Higher is better): A metric specifically designed to evaluate the aesthetic quality of videos.

It is worth noting that for metrics like PSNR (Peak Signal-to-Noise Ratio), SeedVR is competitive but not always the winner. This is typical for generative models; PSNR favors blurry averages over sharp, hallucinated details. Since the goal of SeedVR is to generate realistic textures, perceptual metrics like DOVER are more relevant.
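For reference, PSNR is plain pixel arithmetic, which is exactly why it rewards blur: averaging candidate textures lowers the mean squared error even when the result looks soft. A minimal implementation (LPIPS and DOVER, by contrast, rely on learned networks and have no one-line formula):

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer pixel values."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```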
Visual Quality
The numbers look good, but the visual results are just as compelling. In the comparison below, observe the building windows (top row) and the panda’s nose (third row).

- Bicubic/ResShift: Often leave the image blurry or pixelated.
- SeedVR: Hallucinates plausible, sharp details. The fur on the panda and the geometric lines of the building are restored with high fidelity.
Efficiency
The most impressive stat might be the efficiency gained from the Window Attention mechanism.
The researchers analyzed training efficiency with different window sizes (Table 3). Using a larger window (\(64 \times 64\)) is actually faster than a small window (\(8 \times 8\)).

Why? Because in MMDiT, text embeddings interact with the visual window. If you have many small windows, you have to duplicate the text interaction many times. A larger window means fewer windows total, leading to less overhead and faster processing (20.29 seconds per iteration vs. 345.78 seconds).
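The overhead argument is easy to check with a token count (the latent resolution and caption length below are illustrative assumptions):

```python
# Why bigger windows mean less text overhead in an MMDiT block (toy numbers).
latent_h, latent_w = 128, 256        # illustrative latent resolution
text_tokens = 77                     # illustrative caption length

for win in (8, 64):
    n_windows = (latent_h // win) * (latent_w // win)
    duplicated_text = n_windows * text_tokens   # text joins attention in every window
    print(f"{win}x{win} windows: {n_windows:4d} windows, {duplicated_text:6d} text tokens processed")
```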
Conclusion
SeedVR represents a significant step forward in generative video restoration. By abandoning the constraints of full attention and adopting Shifted Window Attention combined with a highly efficient Causal Video VAE, it solves the “impossible triangle” of video restoration: it achieves high resolution, arbitrary length, and fast inference simultaneously.
For students and researchers, SeedVR illustrates a vital lesson in deep learning system design: simply scaling up an existing architecture (like a standard U-Net) often hits a wall. Sometimes, you need to redesign the mechanism of how the model consumes data—in this case, switching from global to windowed attention—to unlock the next level of performance.
The result is a model that “seeds infinity,” theoretically capable of restoring a video of any length, bringing new life to old media.