The world of AI video generation is evolving at lightning speed. Models like OpenAI’s Sora, Google’s Veo, and others are producing clips with breathtaking realism, often blurring the line between synthetic and real content. Yet, for all their power, most of these state-of-the-art systems share a frustrating limitation: they can only create short videos—typically capped at 5 to 10 seconds.

Why is that? The very architecture that makes them so powerful—the Diffusion Transformer (DiT)—is also their Achilles’ heel. Generating a video all at once is computationally monumental, and the cost climbs steeply with video length (the self-attention at the heart of a DiT scales quadratically with the number of tokens). It’s akin to trying to write an entire novel in one thought: theoretically possible, but wildly impractical.

A clever workaround is to generate videos autoregressively—producing one chunk at a time, based on the chunks that came before. This is much more scalable. But it introduces an insidious problem: error accumulation. Small imperfections in one chunk propagate to the next, compounding until the video devolves into flickering, over-exposure, or outright frozen scenes.
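
To make this concrete, here is a toy, purely illustrative simulation (not from the paper) of how a small per-chunk bias compounds when each chunk is generated from the one before it:

```python
import random

random.seed(0)
brightness = 0.5            # a stand-in "chunk statistic" such as mean brightness
bias, noise_scale = 0.02, 0.01

for chunk in range(20):     # twenty chunks, i.e. one "long" rollout
    # Each chunk is generated from the previous one, so its small error
    # (bias + noise) is baked into the conditioning for the next chunk.
    brightness = brightness + bias + random.gauss(0.0, noise_scale)
    if chunk % 5 == 4:
        print(f"chunk {chunk + 1:2d}: brightness ≈ {brightness:.2f}")

# The value drifts steadily away from 0.5: over-exposure in miniature.
```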

This is where the new paper “Self-Forcing++: Towards Minute-Scale High-Quality Video Generation” comes in. The authors propose a simple yet profoundly effective method to tame this error cascade. By teaching a video model to fix its own mistakes, they enable it to generate high-quality, coherent videos lasting not seconds, but minutes.

Let’s unpack how they do it.

Figure 1: Self-Forcing++ can produce stunningly long videos, like this four-minute sequence of a plane over snowy mountains (shown as stills). The accompanying charts show its superior consistency and motion dynamics compared to other methods.


The Challenge: Bridging the Training–Inference Gap

To appreciate the leap made by Self-Forcing++, you need to understand the fundamental mismatch it solves—between how autoregressive video models are trained and how they’re used at inference.

Most modern systems rely on a teacher–student distillation process. A massive and powerful “teacher” model, which generates all frames at once, is used to train a smaller, faster “student” model that produces videos autoregressively.

But this teacher model is itself limited—it was trained on short clips (typically 5 seconds long) and can only offer high-quality guidance for sequences of that length.

That leads to two problems:

  1. Temporal mismatch: The student only practices on 5-second clips, yet at inference it’s asked to generate much longer videos—30, 60, 120 seconds—situations it’s never seen during training.
  2. Supervision mismatch: During training, the student has perfect guidance from the teacher for every frame. At inference, there’s no teacher—so the first small error can cascade into severe artifacts. Prior methods like Self-Forcing improved short-form generation but were still bottlenecked by the teacher’s 5-second horizon, leading to static or darkened videos for longer sequences.

The crucial insight of Self-Forcing++:
What if we could force the student to make mistakes during training—and then use the teacher to show it how to recover?


The Core Method: Learning From Your Own Mistakes

Self-Forcing++ introduces a training loop explicitly designed to address the train–test gap. Instead of only training on the teacher’s perfect 5-second clips, it has the student generate long, flawed videos and then uses the teacher to correct pieces of them.

The training process, illustrated below, has several key steps.

Figure 2: In Self-Forcing++, the student generates a long “self-rollout,” from which short windows are sampled for teacher correction via backward noise initialization, extended DMD alignment, and a rolling KV cache. This teaches the student how to recover from error accumulation across extended sequences.

Step 1: Long Rollouts

The student is tasked with generating long videos—say, 100 seconds. These will inevitably degrade over time: motion freezes, colors drift, or structures warp. Rather than avoid these flawed outputs, the researchers embrace them—they form ideal training examples of what can go wrong.
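
As a rough sketch of what such a rollout loop looks like, here is a minimal chunk-by-chunk generator in PyTorch. The `TinyStudent` module and latent shapes are hypothetical stand-ins for illustration, not the paper’s actual architecture:

```python
import torch

# Illustrative stand-in for the causal student generator: it maps noise plus
# the previous chunk's latent to the next clean latent chunk.
class TinyStudent(torch.nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = torch.nn.Linear(2 * dim, dim)

    def forward(self, noise: torch.Tensor, prev_latent: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([noise, prev_latent], dim=-1))


@torch.no_grad()
def long_rollout(student: TinyStudent, num_chunks: int, dim: int = 64) -> torch.Tensor:
    """Generate a long latent video chunk-by-chunk, conditioning each chunk
    on the student's *own* previous output rather than on ground truth."""
    chunks = [torch.zeros(1, dim)]                  # dummy conditioning for chunk 0
    for _ in range(num_chunks):
        noise = torch.randn(1, dim)
        chunks.append(student(noise, chunks[-1]))   # errors can now compound
    return torch.stack(chunks[1:], dim=1)           # (batch, num_chunks, dim)


rollout = long_rollout(TinyStudent(), num_chunks=40)   # e.g. many chunks' worth of video
print(rollout.shape)                                    # torch.Size([1, 40, 64])
```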

Step 2: Backward Noise Initialization

The teacher is a diffusion model, so it can only offer corrections by denoising a noisy input. That means a segment of the flawed video must first be turned back into a noisy one. But simply adding random noise would break its temporal coherence.

Instead, Self-Forcing++ re-injects noise backwards into clean frames using the original diffusion noise schedule:

\[ x_t = (1 - \sigma_t)x_0 + \sigma_t \epsilon_t, \quad \text{where } x_0 = x_{t-1} - \sigma_{t-1} \hat{\epsilon}_{\theta}(x_{t-1}, t-1) \]

Where:

  • \(x_0\) is the clean frame from the student rollout,
  • \(\epsilon_t\) is Gaussian noise,
  • \(\sigma_t\) controls the noise level at step \(t\),
  • \(\hat{\epsilon}_{\theta}\) is the model’s own noise prediction from the previous denoising step.

This method preserves the segment’s temporal grounding while making it noisy enough for teacher correction.
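
A minimal sketch of this re-noising step, assuming a flow-matching-style schedule with \(\sigma_t \in [0, 1]\); the tensor shapes below are illustrative placeholders:

```python
import torch

def backward_noise_init(x0: torch.Tensor, sigma_t: float) -> torch.Tensor:
    """Re-noise a clean latent frame x0 to noise level sigma_t along the
    original schedule: x_t = (1 - sigma_t) * x0 + sigma_t * eps."""
    eps = torch.randn_like(x0)
    return (1.0 - sigma_t) * x0 + sigma_t * eps

# x0 would be the clean frame recovered from the student's own rollout;
# here it is just a dummy latent of shape (batch, channels, height, width).
x0 = torch.randn(1, 16, 32, 32)
x_t = backward_noise_init(x0, sigma_t=0.7)
print(x_t.shape)   # same shape as x0, but pushed back to noise level 0.7
```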

Step 3: Extended Distribution Matching Distillation (DMD)

From the long rollout of \(N\) frames, the algorithm samples a short window of \(K\) frames (roughly 5 seconds’ worth). Both the student and the teacher then denoise this window.

The teacher, an expert in short-horizon denoising, produces a high-quality correction. The student’s output is compared to the teacher’s using a KL divergence loss. This sliding-window training repeats with windows at different positions, teaching the student how to recover at any point in a long video.

Mathematically:

\[ \nabla_{\theta} \mathcal{L}_{\text{extended}} \approx - \mathbb{E}_{t} \mathbb{E}_{i \sim \text{Unif}(1,\dots,N-K+1)} \left[ \int \left( s^{T} - s_{\theta}^{S} \right) \frac{dG_{\theta}(z_{i})}{d\theta} \, dz_{i} \right] \]

Here \(s^{T}\) and \(s_{\theta}^{S}\) denote the teacher and student score functions, \(G_{\theta}(z_{i})\) is the student-generated window starting at position \(i\), and the expectations average over noise levels \(t\) and uniformly sampled window positions.
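
The sketch below illustrates the window sampling and the score-difference gradient with toy stand-ins for the teacher and student score functions; the real method uses the frozen teacher and an auxiliary learned score model, which are omitted here:

```python
import torch
import torch.nn.functional as F

def extended_dmd_step(rollout: torch.Tensor, teacher_score, fake_score,
                      window: int, sigma_t: float) -> torch.Tensor:
    """Sample a K-frame window from the long student rollout and build a
    surrogate loss whose gradient matches the score difference above."""
    n = rollout.shape[1]
    i = torch.randint(0, n - window + 1, (1,)).item()   # i ~ Unif(1, ..., N-K+1)
    x0 = rollout[:, i:i + window]                        # clean window from the rollout
    x_t = (1.0 - sigma_t) * x0 + sigma_t * torch.randn_like(x0)  # backward noise init

    with torch.no_grad():
        grad = fake_score(x_t) - teacher_score(x_t)      # ≈ -(s^T - s^S)
    # MSE against a shifted, detached target: its gradient w.r.t. x0 is exactly
    # `grad`, so backprop reaches the student through the rollout it produced.
    return 0.5 * F.mse_loss(x0, (x0 - grad).detach(), reduction="sum")

# Toy stand-ins so the sketch runs end to end (not real score models).
teacher = lambda x: torch.zeros_like(x)
fake = lambda x: 0.1 * x
rollout = torch.randn(1, 40, 64, requires_grad=True)
loss = extended_dmd_step(rollout, teacher, fake, window=5, sigma_t=0.7)
loss.backward()
print(loss.item(), rollout.grad.abs().sum().item())
```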

Step 4: Rolling KV Cache

Transformers use a “KV cache” to store past frame representations, avoiding recomputation. For long videos, this cache should roll—drop the oldest frames as new ones arrive.

Prior methods mistakenly used a fixed cache in training and a rolling cache when generating, creating a mismatch and visual artifacts. Self-Forcing++ uses a rolling KV cache consistently in both, eliminating this problem.
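
Here is a minimal sketch of the rolling-cache idea as a data structure, assuming per-frame key/value tensors and a fixed window of `max_frames`; it illustrates the eviction behavior rather than the paper’s actual attention implementation:

```python
from collections import deque
import torch

class RollingKVCache:
    """Keep keys/values only for the most recent `max_frames` frames, so the
    attention context stays bounded no matter how long the video grows."""
    def __init__(self, max_frames: int):
        self.keys = deque(maxlen=max_frames)    # oldest entries are evicted
        self.values = deque(maxlen=max_frames)  # automatically once full

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)

    def context(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Stack the cached frames into (frames, tokens, dim) tensors for attention.
        return torch.stack(list(self.keys)), torch.stack(list(self.values))

cache = RollingKVCache(max_frames=3)
for frame in range(5):                          # 5 new frames, cache keeps the last 3
    cache.append(torch.randn(16, 64), torch.randn(16, 64))
k, v = cache.context()
print(k.shape)                                  # torch.Size([3, 16, 64])
```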


Rethinking Long-Video Evaluation

Self-Forcing++ also addresses flaws in how long videos are evaluated.

Figure 3: VBench’s image and aesthetic quality scores can rate degraded or over-exposed frames higher than well-exposed ones, making it unreliable for evaluating long videos.

The common benchmark, VBench, relies on older scoring models that are biased toward over-exposed imagery. The authors therefore introduce Visual Stability, a metric that uses the multimodal model Gemini-2.5-Pro to detect over-exposure and degradation and scores each video from 0 to 100.
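
The paper’s exact judging setup isn’t reproduced here, but the aggregation pattern might look roughly like the sketch below: sample frames, ask a multimodal judge whether each shows over-exposure or degradation, and convert the verdicts into a 0–100 score. `judge_frame` is a hypothetical placeholder, not a real Gemini API call.

```python
from typing import Callable, Sequence

def visual_stability(frames: Sequence, judge_frame: Callable[[object], bool],
                     sample_every: int = 30) -> float:
    """Score a video 0-100 as the share of sampled frames that the judge
    considers free of over-exposure or degradation. `judge_frame` stands in
    for a multimodal-model call (e.g. Gemini-2.5-Pro); it is NOT a real API."""
    sampled = list(frames)[::sample_every]
    clean = sum(1 for f in sampled if judge_frame(f))
    return 100.0 * clean / max(len(sampled), 1)

# Toy usage: dummy frames whose brightness drifts upward, with a dummy judge
# that flags anything too bright as degraded.
dummy_frames = [{"mean_brightness": 0.4 + 0.01 * i} for i in range(300)]
score = visual_stability(dummy_frames, judge_frame=lambda f: f["mean_brightness"] < 0.9)
print(f"Visual Stability ≈ {score:.0f}/100")
```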


Results: Outperforming the Competition

When tested on extended video generation (50s, 75s, 100s), Self-Forcing++ beats all baselines by a wide margin.

Table 1: On 5-second and 50-second generation, Self-Forcing++ (“Ours”) leads the baselines, with especially large gains in Dynamic Degree (motion persistence) and Visual Stability on the 50-second videos.

Table 2: On 75-second and 100-second videos, baselines collapse or degrade, while Self-Forcing++ maintains strong motion and visual stability as the length increases.

Qualitative Comparison

Figure 4: In a 100-second reef scene, baselines such as CausVid and Self-Forcing suffer over-exposure and detail loss, while Self-Forcing++ (“Ours”) preserves vivid colors and structure throughout.


The Power of Scaling

One especially promising result: scaling training compute directly extends video length capability—without needing long-video datasets.

Figure 6: Increasing the training budget improves stability. At a 25× budget, the model produces a coherent 255-second (4 min 15 sec) video, 50× longer than the baseline.

With the standard (1×) budget, videos break quickly. At 8× and 20×, motion coherence and detail persist. At 25×, the model generates a stable, high-fidelity elephant clip lasting 255 seconds—nearly the maximum length the base model’s position embeddings can support.


Conclusion: Towards True Long-Form AI Video

Self-Forcing++ elegantly closes the training–inference gap for autoregressive video models. By making the student face and fix its own errors mid-rollout, guided by a short-horizon teacher, it achieves stable minute-scale generation without long-video training data.

Key takeaways:

  1. Recovery training matters: Models must learn to correct accumulated errors for long-term stability.
  2. Short-video teachers can train long-video students: Teacher guidance on sampled degraded windows is enough to extend generation horizons dramatically.
  3. Scaling works: More compute yields longer, better videos without new datasets.

This is a pivotal step toward AI that can create not just fleeting clips, but entire, coherent scenes and narratives. Though challenges like long-term memory and training efficiency remain, Self-Forcing++ lays a strong foundation for the next generation of long-form video synthesis.