Video generation is advancing at a breathtaking pace. We’ve gone from blurry, short clips to stunning, high-definition videos created from simple text prompts. Many modern models can animate a single still image, a task known as image-to-video (I2V), breathing motion into static content. But what if you want more control? What if you want to define not only how a video begins but also key moments in the middle and the exact ending you imagine?
Imagine being a director, not just an audience member. Instead of starting with one frame and letting the model do the rest, you might want to specify that a certain character appears at the center of the frame at the 10-second mark and that the video ends with a dramatic sunset 30 seconds later. Unfortunately, most existing models can’t handle this level of customization. They are built for rigid, isolated tasks: start from this image, fill in this missing region, or extend this clip forward in time. These are like specialized tools in a workshop—each powerful for one job, but limited in scope. What we really want is a universal paintbrush for video: a way to add elements anywhere and anytime on a spatio-temporal canvas.
That is exactly the challenge the researchers of VideoCanvas set out to solve. Their work introduces a unified framework that treats video generation as painting on a canvas extending through both space and time. With VideoCanvas, you can specify arbitrary image patches at any location and timestamp, and the model will synthesize a complete, coherent video around them.
Figure 1: VideoCanvas unifies diverse tasks—from patch-based synthesis to inpainting, outpainting, and even creative scene transitions—under one framework.
As shown above, this single concept unifies an entire suite of tasks: first-frame-to-video generation, interpolation, inpainting, outpainting, and seamless transitions between entirely different scenes. Even more impressively, the model performs all these tasks in a zero-shot manner, without specialized retraining for each one. Let’s explore how they achieved this breakthrough.
The Core Challenge: Temporal Ambiguity in Video Models
Most cutting-edge video diffusion systems rely on latent representations rather than raw pixels for efficiency. They compress videos using Variational Autoencoders (VAEs) before a diffusion transformer (DiT) performs the generation.
For images, this is simple—each image corresponds to one latent code. But videos are more complex. To reduce computation, video VAEs commonly use a causal temporal encoder that compresses several consecutive frames into one latent token. For example, frames 0–3 might all map to a single latent.
This compression is efficient but introduces a major obstacle: temporal ambiguity. If multiple frames share the same latent, how can the model modify something in just one of those frames? The mapping between pixel frames and latent representations becomes blurred, making frame-specific control nearly impossible.
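To make the ambiguity concrete, here is a minimal Python sketch of the frame-to-latent mapping, assuming the stride-4 grouping described above (real causal VAEs may group frames slightly differently, e.g., by treating the first frame specially):

```python
# Minimal sketch of causal temporal compression with stride N = 4
# (assumed grouping: frames 0-3 -> latent 0, frames 4-7 -> latent 1, ...).
N = 4  # temporal stride of the video VAE

def latent_index(frame_idx: int, stride: int = N) -> int:
    """Map a pixel-frame index to the latent token it is compressed into."""
    return frame_idx // stride

# Frames 40, 41, 42, and 43 all collapse into the same latent token,
# so a condition placed at frame 41 cannot be told apart from one at frame 40.
print([latent_index(f) for f in (40, 41, 42, 43)])  # -> [10, 10, 10, 10]
```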
Figure 2: Temporal ambiguity arises because multiple frames collapse into one latent during causal VAE encoding. Existing paradigms fail to address this fine-grained alignment problem.
Existing conditioning paradigms attempt to solve this but fall short:
- Latent Replacement overwrites entire latents with encoded frames. It can work for the first frame but collapses motion and breaks temporal coherence when extended elsewhere.
- Channel Concatenation adds condition features as extra input channels, but this requires retraining the VAE to handle zero-padded frames, an expensive and risky process.
- Adapter-based or cross-attention injection introduces heavy auxiliary modules, which hurts scalability.
VideoCanvas solves these problems with a clever hybrid strategy built on In-Context Conditioning (ICC).
The VideoCanvas Method: A Hybrid Strategy for Space and Time
VideoCanvas builds on ICC—a paradigm that elegantly integrates conditions directly into the model’s token sequence. Instead of introducing specialized modules or new parameters, ICC treats both conditional and noisy latents as parts of a unified token sequence. Through self-attention, the model learns to interpret the conditional tokens as context while generating the missing video regions.
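To see what this means in code, here is a rough, hypothetical sketch (not the paper's architecture) in which condition tokens and noisy tokens share one sequence and one standard self-attention pass, and only the noisy positions are kept for denoising:

```python
import torch
import torch.nn as nn

# Illustrative sketch of In-Context Conditioning (ICC): condition tokens and
# noisy tokens share one sequence and one self-attention pass. Shapes and the
# single attention module are hypothetical stand-ins, not the actual backbone.
dim, heads = 64, 4
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

cond_tokens = torch.randn(1, 8, dim)    # clean latents from conditioning patches
noisy_tokens = torch.randn(1, 32, dim)  # noisy latents of the video to generate

# One unified sequence; self-attention lets noisy tokens "read" the conditions.
z = torch.cat([cond_tokens, noisy_tokens], dim=1)
out, _ = attn(z, z, z)

# Only the noisy part of the sequence is denoised / supervised.
denoised_prediction = out[:, cond_tokens.shape[1]:]
print(denoised_prediction.shape)  # torch.Size([1, 32, 64])
```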
However, ICC alone doesn’t fix the temporal ambiguity caused by causal VAEs. VideoCanvas therefore introduces a hybrid conditioning strategy that decouples spatial and temporal control:
1. Spatial Control via Zero-Padding
For spatial alignment, VideoCanvas creates a zero-padded full-frame canvas where the user’s patch is placed precisely within the desired region. Zero-filled pixels indicate unconditioned areas, allowing the VAE to seamlessly encode the patch in context.
Critically, the authors discovered that VAEs are robust to spatial padding—unlike temporal padding, which breaks coherence. This insight allows full spatial control without retraining.
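A minimal sketch of this spatial step, with made-up frame and patch sizes (the actual resolution and the frozen VAE's API are backbone-specific):

```python
import torch

# Place a user-provided patch onto a zero-padded full-frame canvas before
# VAE encoding. Shapes and the (y, x) placement are illustrative.
H, W = 480, 832                      # full frame size (assumed)
patch = torch.rand(3, 128, 128)      # user-provided RGB patch
y, x = 200, 350                      # desired top-left placement

canvas = torch.zeros(3, H, W)        # zeros mark unconditioned pixels
canvas[:, y:y + patch.shape[1], x:x + patch.shape[2]] = patch

# The zero-padded canvas is then encoded by the frozen VAE as-is, e.g.:
# cond_latent = vae.encode(canvas.unsqueeze(0))  # pseudo-call; depends on the VAE API
```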
2. Temporal Control via RoPE Interpolation
To assign conditions to specific timestamps without confusing the causal VAE, VideoCanvas encodes each frame independently—treating it as a standalone image. This decouples the condition from the compressed video sequence.
Then comes the key innovation: Temporal RoPE Interpolation. Rotary Position Embeddings (RoPE) encode token order in transformers through position-dependent rotations. By interpolating these positions to continuous values, VideoCanvas assigns a fractional temporal index to each conditional frame.
Formally:
\[ \text{pos}_t(z_{\text{cond},i}) = \frac{t_i}{N} \]
where \(t_i\) is the target pixel-frame index and \(N\) is the temporal stride of the VAE. If \(N = 4\) and the condition lies at frame 41, the fractional position becomes \(10.25\), providing a smooth, unambiguous temporal signal for the model.
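The sketch below computes such a fractional position and plugs it into a standard RoPE angle calculation; the standard formulation is assumed here, and the backbone's exact RoPE variant may differ:

```python
import numpy as np

def temporal_rope_position(frame_idx: int, stride: int = 4) -> float:
    """Fractional temporal position pos = t_i / N for a conditional frame."""
    return frame_idx / stride

def rope_angles(pos: float, dim: int = 8, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE angles at a (possibly fractional) position."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos * inv_freq  # applied as cos/sin rotations to query/key pairs

pos = temporal_rope_position(41, stride=4)
print(pos)                       # 10.25 -- lands between latent positions 10 and 11
print(np.cos(rope_angles(pos)))  # rotation components at the fractional position
```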
The Full Pipeline
Figure 3: The VideoCanvas pipeline. Conditional patches are zero-padded and encoded individually. Fractional RoPE positions align them precisely along the timeline for unified generation.
In summary, the VideoCanvas pipeline works as follows:
1. Prepare Conditions: Create full-sized, zero-padded frames for each condition patch.
2. Encode Independently: Use the frozen VAE (in image mode) to obtain clean conditional latent tokens.
3. Construct Unified Sequence: Concatenate the conditional and noisy latent tokens:
\[ \boldsymbol{z} = \operatorname{Concat}(\{z_{\text{cond},i}\}_{i=1}^{M},\, z_{\text{source}}) \]
4. Align Temporally: Assign integer positional indices to the video latents and fractional RoPE positions to the conditional tokens.
5. Denoise: The diffusion transformer treats the conditional tokens as context, denoising only the unconditioned regions to complete the video.
This unified approach enables precise spatial placement and frame-accurate temporal conditioning—without retraining the VAE or adding new parameters.
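Putting the pieces together, here is a schematic sketch of how the unified sequence and its positions could be assembled; `encode_image` and `denoise` are hypothetical stand-ins for the frozen VAE and the diffusion transformer, and all shapes are illustrative:

```python
import torch

stride, latent_dim, num_latents = 4, 64, 16   # assumed values

def encode_image(canvas: torch.Tensor) -> torch.Tensor:
    """Stand-in for encoding one zero-padded canvas as a single-image latent."""
    return torch.randn(1, latent_dim)          # (1 latent frame, channels)

def denoise(z, positions, num_cond):
    """Stand-in for the DiT: would denoise only the tokens after the conditions."""
    return z[num_cond:]

# Steps 1-2: zero-padded condition canvases at pixel-frames 0 and 41, encoded independently.
conditions = [(0, torch.zeros(3, 480, 832)), (41, torch.zeros(3, 480, 832))]
cond_latents = [encode_image(c) for _, c in conditions]
cond_positions = [t / stride for t, _ in conditions]          # 0.0 and 10.25

# Steps 3-4: unified sequence of clean condition tokens + noisy video tokens,
# with fractional RoPE positions for conditions and integers for the video.
noisy_latents = torch.randn(num_latents, latent_dim)
z = torch.cat(cond_latents + [noisy_latents], dim=0)
positions = torch.tensor(cond_positions + list(range(num_latents)), dtype=torch.float)

# Step 5: denoising completes the video while treating condition tokens as context.
video_latents = denoise(z, positions, num_cond=len(cond_latents))
print(z.shape, positions[:4])
```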
Experiments: Putting VideoCanvas to the Test
To evaluate this new paradigm, the authors created VideoCanvasBench, the first benchmark explicitly designed for arbitrary spatio-temporal video completion. It includes tasks for generating videos from patches (AnyP2V), from full frames (AnyI2V), and creative scenarios like transitions, painting, and camera control (AnyV2V).
Does Temporal RoPE Interpolation Really Work?
The cornerstone experiment examined whether Temporal RoPE Interpolation effectively resolved the causal VAE’s ambiguity. The authors tested several alignment methods:
Figure 4: Per-frame PSNR comparison for different temporal alignment strategies. VideoCanvas (red) peaks exactly at the target frame, indicating precise alignment.
- Latent-space Conditioning produced mostly static outputs.
- Without RoPE Interpolation, the fidelity peak shifted away from the target frame, indicating temporal misalignment.
- Pixel-space Padding degraded quality because the VAE had to encode zero-filled video frames, introducing artifacts.
- VideoCanvas (RoPE Interpolation) achieved perfect temporal alignment and high fidelity.
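For reference, a per-frame PSNR curve like the one in Figure 4 can be computed with the standard definition; the dummy arrays below stand in for a generated clip and its conditioning frame:

```python
import numpy as np

def psnr(frame: np.ndarray, reference: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two frames with values in [0, max_val]."""
    mse = np.mean((frame - reference) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Per-frame PSNR of a generated clip against the single conditioning frame:
# a sharp peak exactly at the target index indicates correct temporal alignment.
generated = np.random.rand(81, 64, 64, 3)      # (frames, H, W, C), dummy data
reference = np.random.rand(64, 64, 3)          # dummy conditioning frame
curve = [psnr(f, reference) for f in generated]
```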
Figure 5: Pixel-space padding introduces artifacts, while VideoCanvas maintains color and texture integrity through RoPE alignment.
Paradigm Showdown: ICC vs. The Competition
To compare conditioning strategies directly, the team tested Latent Replacement, Channel Concatenation, and In-Context Conditioning (ICC) on identical backbones. They evaluated metrics such as PSNR, Fréchet Video Distance (FVD), Dynamic Degree, and user preference.
Results from VideoCanvasBench were clear:
- Latent Replacement: High PSNR but extremely low motion (Dynamic Degree)—videos appeared frozen.
- Channel Concatenation: Improved motion yet poorer fidelity and perceptual quality.
- VideoCanvas (ICC): Balanced fidelity and dynamics, achieving top perceptual scores and user preference above 65%.
Figure 6: Qualitative comparison. ICC maintains identity and smooth motion, avoiding static frames and unnatural morphs seen in other paradigms.
Human evaluators overwhelmingly favored ICC-generated videos for their realism, temporal consistency, and narrative flow.
A New Canvas for Creativity
VideoCanvas is more than a technical contribution—it reimagines what controllable video generation can be. By treating the entire spatio-temporal domain as a unified canvas, the model enables creative and practical applications beyond existing frameworks:
- Any-Timestamp Control: Define keyframes at arbitrary points to guide motion and narrative, not just the first frame.
- Spatio-Temporal Painting: Generate full videos from sparse patches scattered across time and space.
- Creative Transitions: Seamlessly morph between unrelated scenes (e.g., a drone turning into a butterfly).
- Infinite Extension & Looping: Autoregressively extend clips into long videos or generate perfect seamless loops.
- Virtual Camera Control: Emulate cinematic movements like pans, zooms, and tilts by manipulating conditioned patches (see the sketch below).
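As a concrete (and entirely hypothetical) example of the camera-control idea, a rightward pan can be approximated by placing the same patch at progressively shifted positions across a few timestamps; frame indices and offsets below are made up for illustration:

```python
import torch

# Hypothetical sketch: emulating a rightward camera pan by conditioning on
# spatially shifted placements of the same patch at increasing timestamps.
H, W, stride = 480, 832, 4
patch = torch.rand(3, 128, 128)

conditions = []                      # list of (pixel-frame index, zero-padded canvas)
for i, frame_idx in enumerate([0, 20, 40, 60]):
    canvas = torch.zeros(3, H, W)
    x = 100 + 150 * i                # patch drifts right over time -> apparent pan
    canvas[:, 176:304, x:x + 128] = patch
    conditions.append((frame_idx, canvas))

# Each canvas would then be encoded independently and placed at fractional RoPE
# position frame_idx / stride, exactly as in the pipeline above.
```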
Figure 7: VideoCanvas uses the same model to perform transitions, extensions, and camera movements under a single unified paradigm.
Conclusion: A Universal Framework for Controllable Video
VideoCanvas introduces and formalizes the concept of arbitrary spatio-temporal video completion. Its elegant fusion of In-Context Conditioning (ICC) with Spatial Zero-Padding and Temporal RoPE Interpolation solves the long-standing problem of fine-grained, frame-level control within causal video VAEs—without retraining or architectural changes.
By unifying tasks previously treated as separate (image-to-video, inpainting, outpainting, extension, and transition) into a single framework, VideoCanvas lays the foundation for the next generation of controllable video synthesis. It transforms video creation from rigid task boundaries into a fluid process of painting across time and space.
As video generation technology continues to evolve, VideoCanvas shows us a glimpse of what artistry might look like in the era of generative modeling: not merely watching AI-created videos, but composing them—moment by moment—on an infinite, dynamic canvas.