If you’ve been following the world of AI, you’ve seen the incredible leap from generating text to creating stunning images from a simple prompt. But the next frontier has always been video. While models that turn text into short video clips are becoming more common, they often feel like one-trick ponies.

What if we wanted an AI that could not only generate a video but also understand complex instructions involving images, edit existing videos in sophisticated ways, and even follow a hand-drawn storyboard?

This is the challenge that today’s specialized video models face. Most are built for a single task, like text-to-video, and struggle to understand multimodal instructions such as “make this person from the image ride a bike in the video.” Editing is often a clunky, separate process requiring task-specific tools.

Enter UniVideo, a new framework from researchers at the University of Waterloo and Kuaishou Technology. It’s a powerful step toward a true multimodal AI assistant for the video domain. UniVideo isn’t just another text-to-video generator—it’s a unified system designed to seamlessly understand, generate, and edit videos using a rich mix of text, image, and video instructions. It can generate a video from a text prompt, create a video starring a character from a photo, edit an object out of a scene, or even bring a visual storyboard to life.

A collage showcasing UniVideo’s capabilities, including text-to-video, in-context generation from images, in-context editing, and free-form editing like green-screening or changing materials.

Figure 1: UniVideo is a unified system that can understand multimodal instructions and generate diverse video content.

In this deep dive, we’ll unpack the architecture that makes UniVideo tick, explore how it masters many different tasks at once, and examine the impressive results showing it matches or even outperforms specialized state-of-the-art models.


The Problem with Video Models Today

The AI world has seen a boom in unified models for images. Generators like Google’s Imagen and OpenAI’s DALL·E have been joined by unified systems that understand and create images within a single, cohesive framework. They can chat about an image, edit it based on instructions, and generate new visual content. This kind of unification is powerful—it allows models to develop holistic understanding and transfer skills between tasks.

But video has lagged behind. Existing video models typically fall into two camps:

  1. Single-Task Generators: These models excel at one thing, usually text-to-video. They use a text encoder to interpret a prompt and a generator to create the video. This limits them to text-only instructions and prevents them from using visual context, such as a reference image of a specific character.
  2. Task-Specific Editors: Video editing models often use specialized modules or complex pipelines for different tasks (e.g., one for changing style, another for object removal). This makes them hard to scale and incapable of handling free-form, creative instructions.

Because of this fragmentation, advanced capabilities like generating a video from multiple reference images, performing complex edits with a single instruction, or understanding a visual prompt have been out of reach for any single model. UniVideo is designed to change that.


The Core Method: A Tale of Two Streams

How does UniVideo juggle so many tasks? The secret lies in its clever dual-stream architecture, which separates the job of understanding from generation.

The model consists of two main components, as shown in Figure 2:

  1. Understanding Branch (MLLM): A Multimodal Large Language Model—essentially the “brain” of the system. It interprets user instructions, whether text, images, or videos, and extracts high-level semantic meaning.
  2. Generation Branch (MMDiT): A Multimodal Diffusion Transformer—the “artist” responsible for rendering and editing the video frames.

A diagram of the UniVideo architecture, showing the MLLM understanding stream and the MMDiT generation stream.

Figure 2: The dual-stream architecture of UniVideo. The MLLM handles understanding, while the MMDiT performs generation, blending reasoning with visual fidelity.

When you give UniVideo a prompt, say, “Generate a video of the woman in [image 1] holding the bag from [image 2] in the scene from [image 3],” both streams spring into action:

  • The MLLM processes text and visual inputs to understand intent. It determines who the woman is, what the bag looks like, and where she should appear.
  • The semantic features it extracts are passed through a trainable connector to align with the MMDiT’s understanding stream, guiding the generator with a coherent semantic plan.
  • Simultaneously, images and videos are encoded through a VAE (Variational Autoencoder) and sent into the MMDiT’s generation stream, ensuring rich, pixel-level accuracy for textures, faces, and environments.

This dual design keeps the model both smart and detailed. Older unified models compressed all inputs into a small token set, losing fine details. By combining a semantic stream and a visual stream, UniVideo achieves nuanced reasoning and high-fidelity reconstruction—critical for preserving identity and precision in video editing.
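
To make that data flow concrete, here is a minimal sketch of how the two streams could be wired together in PyTorch-style code. Every name here (Connector, dual_stream_step, the encode methods, and so on) is an illustrative assumption rather than UniVideo’s actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical building blocks -- names are illustrative, not UniVideo's real API.
class Connector(nn.Module):
    """Trainable bridge that maps MLLM hidden states into the MMDiT's semantic stream."""
    def __init__(self, mllm_dim: int, dit_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, dit_dim), nn.GELU(), nn.Linear(dit_dim, dit_dim)
        )

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(mllm_hidden)

def dual_stream_step(mllm, connector, vae, mmdit, text, ref_images, noisy_latents, t):
    """One denoising step combining semantic guidance (MLLM) with pixel-level detail (VAE)."""
    # Understanding branch: the MLLM reads text + reference images and emits semantic features.
    semantic = mllm.encode(text=text, images=ref_images)   # [B, N_tokens, mllm_dim]
    semantic = connector(semantic)                          # align with the MMDiT's width

    # Generation branch: VAE latents of the references preserve fine-grained appearance.
    ref_latents = torch.cat([vae.encode(img) for img in ref_images], dim=1)

    # The MMDiT attends jointly over noisy video latents, reference latents, and semantic tokens.
    return mmdit(noisy_latents, timestep=t, context=semantic, ref_latents=ref_latents)
```

The key point is that the semantic tokens and the VAE latents enter the MMDiT side by side, so reasoning about what to show never comes at the cost of how it looks.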


One Model, Many Tasks

UniVideo’s authors trained the model across a wide range of multimodal tasks, teaching a single model to handle all of them rather than building a separate tool for each job.

The training data spanned tasks such as:

  • Text-to-Image and Text-to-Video
  • Image-to-Video
  • Image Editing and Style Transfer
  • In-Context Video Generation (videos guided by reference images)
  • In-Context Video Editing (swapping, adding, or deleting objects)

Overview of the multimodal training data used for UniVideo.

Table 1: UniVideo was trained on tens of millions of samples across foundational vision–language tasks.
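
Conceptually, every one of these tasks can be expressed as the same kind of training record: some mix of text, reference images, and video in, a visual target out. The dataclass below is a speculative sketch of such a unified sample; the field names are assumptions for illustration, not the paper’s data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UnifiedSample:
    """A single multimodal training example covering any of the tasks above (field names assumed)."""
    task: str                              # e.g. "t2v", "i2v", "image_edit", "video_edit", "in_context_gen"
    instruction: str                       # natural-language prompt describing what to produce or change
    reference_images: List[str] = field(default_factory=list)  # identity/style references, if any
    source_video: Optional[str] = None     # input clip for editing tasks, None for pure generation
    target: str = ""                       # path to the ground-truth image or video

# The same model sees every task in one stream of samples, so an editing example and a
# text-to-video example differ only in which fields are populated.
edit_example = UnifiedSample(
    task="video_edit",
    instruction="Replace the man's shirt with a red one",
    source_video="clips/street_walk.mp4",
    target="clips/street_walk_red_shirt.mp4",
)
```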

Researchers trained UniVideo in three major stages:

  1. Connector Alignment: The pre-trained MLLM and MMDiT were frozen while a small connector module learned how to link them together smoothly.
  2. Fine-tuning MMDiT: The generator was refined on high-quality image and video data for realism and coherence.
  3. Multi-task Training: The model then practiced all tasks together, learning when and how to generate, edit, or stylize based on multimodal context.

Training hyperparameters for each of the three training stages.

Table 2: Each training stage builds capacity—from alignment to fine-tuning and multi-task mastery.
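
Read as code, the three stages mostly differ in which parameters are allowed to learn. The sketch below shows one plausible way to express that through parameter freezing; the freezing choices, optimizer, and learning rate are assumptions, not the paper’s reported settings.

```python
import torch

def set_trainable(module, flag: bool):
    """Freeze or unfreeze every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage: int, mllm, connector, mmdit):
    """Stage-dependent freezing, mirroring the three-phase recipe described above (illustrative)."""
    if stage == 1:    # Connector alignment: only the connector learns.
        set_trainable(mllm, False)
        set_trainable(mmdit, False)
        set_trainable(connector, True)
    elif stage == 2:  # MMDiT fine-tuning on high-quality image and video data.
        set_trainable(mllm, False)
        set_trainable(connector, True)
        set_trainable(mmdit, True)
    else:             # Stage 3: joint multi-task training across all task types.
        set_trainable(mllm, False)          # assumption: the MLLM stays frozen throughout
        set_trainable(connector, True)
        set_trainable(mmdit, True)
    trainable = [p for m in (mllm, connector, mmdit) for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-5)  # learning rate is a placeholder
```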


Understanding Visual Prompts

A standout capability of UniVideo is its understanding of visual prompts—like turning sketches or annotated screenshots into coherent video narratives.

The MLLM can interpret these illustrations, turning them into structured descriptions that guide the generator. This allows users to sketch storyboards, mark motion paths, or annotate scenes directly on images, and UniVideo translates those cues into videos.

A diagram showing how the MLLM interprets a visual storyboard and generates a dense caption to guide the MMDiT in video synthesis.

Figure 3: The MLLM translates hand-drawn prompts or annotated images into dense captions that guide the generator’s video synthesis.

Instead of complex multi-model agents, UniVideo performs everything within one unified system—simplifying multimodal video generation dramatically.
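
Under that unified design, a visual-prompt request boils down to a short pipeline: the MLLM reads the annotated image or storyboard, produces a dense caption, and its features plus the image latents condition the generator. The function below is a hypothetical sketch of that flow, not UniVideo’s public interface.

```python
def video_from_visual_prompt(mllm, connector, vae, mmdit, storyboard_image, user_text):
    """Turn a hand-drawn storyboard or annotated image into a video (hypothetical interface)."""
    # 1. The MLLM interprets arrows, sketches, and annotations into a dense textual plan.
    dense_caption = mllm.describe(image=storyboard_image,
                                  prompt=f"Describe the intended video: {user_text}")

    # 2. Its hidden states (not just the caption string) carry the semantic plan to the generator.
    semantic = connector(mllm.encode(text=dense_caption, images=[storyboard_image]))

    # 3. The storyboard's VAE latents give the generator pixel-level grounding for layout and identity.
    ref_latents = vae.encode(storyboard_image)

    # 4. The MMDiT denoises a video conditioned on both streams (frame count is a placeholder).
    return mmdit.sample(context=semantic, ref_latents=ref_latents, num_frames=81)
```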


Experiments and Results: Measuring UniVideo’s Impact

To validate UniVideo, the researchers benchmarked it against leading models across generation, understanding, and editing tasks.

Visual Understanding & Generation

Can UniVideo still compete in standard text-to-video and image-to-video tests? Absolutely. As shown in Table 3, the model’s MLLM maintains top-tier understanding scores while its generation matches specialized systems like HunyuanVideo.

Quantitative comparison of UniVideo against other models on visual understanding and video generation benchmarks.

Table 3: UniVideo achieves strong performance in both understanding and generation, comparable to specialized models.


In-Context Video Generation and Editing

UniVideo’s unique strength lies in handling in-context instructions—creating or editing videos using reference images.

Qualitative examples comparing UniVideo with other models on in-context generation and editing tasks.

Figure 4: UniVideo preserves identity and follows multimodal instructions better than competing models.

Human evaluations and automatic benchmarks confirm that UniVideo leads in subject consistency, maintaining identity across frames even in complex multi-character scenarios.

Quantitative results for in-context video generation, showing UniVideo outperforming other models, especially in subject consistency.

Table 4: UniVideo scores highest in subject consistency (SC), outperforming commercial and academic baselines.

For in-context editing, UniVideo’s advantage is usability—it’s mask-free. Users simply describe edits textually, such as “Replace the man’s shirt with a red one” or “Add the dog from the image to the scene.” Despite the lack of explicit mask guidance, UniVideo performs on par with SOTA models that require precise region annotations.

Quantitative results for in-context video editing. UniVideo is the only mask-free model and still achieves top-tier performance.

Table 5: Even without user-provided masks, UniVideo achieves leading performance—making editing natural and intuitive.
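
In practice, mask-free editing means a request is nothing more than a source video, optional reference images, and a sentence. A call might look like the hypothetical wrapper below; the function name, arguments, and model object are assumptions for illustration.

```python
# Hypothetical convenience wrapper around the dual-stream model for mask-free editing.
def edit_video(model, source_video_path: str, instruction: str, reference_images=None):
    """Apply a free-form edit to an existing clip without any user-drawn mask (illustrative API)."""
    return model.generate(
        instruction=instruction,                   # plain language, e.g. "Replace the man's shirt with a red one"
        source_video=source_video_path,            # the clip to edit, encoded through the VAE
        reference_images=reference_images or [],   # e.g. a photo of the dog to insert
    )

edited = edit_video(
    model=univideo,                                # assumed pre-loaded UniVideo model object
    source_video_path="inputs/park_scene.mp4",
    instruction="Add the dog from the image to the scene",
    reference_images=["inputs/dog.jpg"],
)
```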


The Magic of Unification: Zero-Shot Generalization

Perhaps the most compelling outcomes stem from UniVideo’s zero-shot generalization—successfully performing tasks it was never trained on.

Researchers observed two powerful forms:

  1. Unseen Free-Form Editing:
    Although UniVideo’s training covered structured edits (add, delete, swap, stylize), it generalizes to creative free-form video instructions like “green-screen the man,” “make the woman look like glass,” or “turn the day into night.” These abilities transfer from extensive image-editing training data.
  2. Task Composition:
    UniVideo can combine tasks seamlessly—such as “replace the car with the one from this image and stylize it like this painting.” This compositional reasoning emerges naturally through unified training.

Examples of UniVideo’s zero-shot generalization to unseen editing tasks and novel task compositions.

Figure 5: UniVideo handles novel edits and complex combinations—even tasks unseen during training.

The unified design turns multimodal learning into robust generalization, where skills learned from images improve performance in videos.


Understanding Complex Visual Prompts

In more creative tests, UniVideo translated visual cues—hand-drawn annotations or storyboards—into vivid video scenes. The results demonstrate strong interpretation of motion, concept layout, and event transitions.

Qualitative examples of videos generated by UniVideo from various visual prompts and storyboards.

Figure 6: UniVideo can follow both rough sketches and annotated input images, transforming them into dynamic video sequences.

This capability showcases UniVideo’s advantage in blending vision with language reasoning—a step toward intuitive, multimodal creative control.


Why Dual-Stream Design Matters: Ablation Insights

To ensure the design’s effectiveness, the team ran targeted ablations.

  1. Multi-task Learning vs. Single-task:
    A version trained on each task in isolation underperformed. The jointly trained UniVideo scored markedly higher, especially on editing, demonstrating the cross-task synergy of unified training.
  2. Dual-Stream vs. Single-Stream:
    When visual inputs were routed only to the MLLM and not to the MMDiT, identity preservation collapsed. Feeding references directly into the generation stream is therefore essential for preserving subject identity and fine visual detail (see the sketch after the tables below).

Ablation study results showing that multi-task learning improves performance over single-task models.

Table 6: Multi-task UniVideo yields higher scores thanks to cross-task learning.

Ablation study results showing the critical importance of feeding visual information directly to the MMDiT for preserving subject consistency.

Table 7: Removing direct visual input to MMDiT drastically reduces subject consistency—validating the dual-stream design.
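
In terms of the earlier architecture sketch, the single-stream ablation amounts to dropping the reference-latent path so that only the MLLM sees the reference images, roughly:

```python
# Single-stream ablation (illustrative): reference images reach only the MLLM,
# so the MMDiT never sees their VAE latents and fine-grained identity cues are lost.
def single_stream_step(mllm, connector, mmdit, text, ref_images, noisy_latents, t):
    semantic = connector(mllm.encode(text=text, images=ref_images))
    return mmdit(noisy_latents, timestep=t, context=semantic, ref_latents=None)
```

Without those VAE latents, the generator has only high-level semantics to work with, which explains the sharp drop in subject consistency.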


Conclusion: A Unified Future for Video AI

UniVideo represents a milestone in multimodal generative AI—pushing beyond task-specific video systems to a unified framework that balances reasoning and realism.

A table comparing the capabilities of UniVideo to other leading multimodal models, highlighting its unique, comprehensive feature set.

Table 8: UniVideo uniquely supports full-spectrum understanding, generation, and editing for both images and videos.

Key takeaways:

  • Dual-stream architecture blends semantic reasoning with detailed visual control.
  • Unified multi-task training builds generalizable skill, enabling zero-shot creativity.
  • Mask-free editing makes intelligent video manipulation remarkably user-friendly.

Though some limitations remain, such as imperfect motion preservation during edits and occasional over-editing, UniVideo paints a clear picture of where video AI is headed: toward unified, intuitive systems that can understand our language and visuals together, helping us bring ideas to life across text, image, and motion.