Introduction
The race for generative video dominance has been one of the most exciting developments in artificial intelligence over the last few years. While diffusion models have become the standard for generating stunning static images, applying them to video—with its added dimension of time—has introduced massive computational bottlenecks and stability issues.
Most current video models treat video generation as an extension of image generation, often patching together spatial and temporal attention modules. However, a new contender has emerged from researchers at The University of Hong Kong and ByteDance. Named Goku, this family of models proposes a unified, industry-grade solution that handles both images and videos within a single framework.
What makes Goku special? It moves away from the traditional Denoising Diffusion Probabilistic Models (DDPMs) that have dominated the field, instead adopting Rectified Flow Transformers. By combining a massive, curated dataset with a mathematical formulation that simplifies the path from noise to data, Goku achieves state-of-the-art results on major benchmarks like VBench and GenEval.
In this post, we will tear down the architecture of Goku. We will explore how it curates data, why it ditches standard diffusion for “flow,” and how it manages to train on billions of parameters without running out of memory.
The Background: Why Fix What Works?
To understand Goku, we first need to understand the limitations of the current status quo.
Standard latent diffusion models generate data by gradually removing noise. Imagine a drunk person trying to walk home; they might take a winding, inefficient path. In mathematical terms, diffusion models often require many sampling steps to resolve an image or video from noise because the “trajectory” from the noise distribution to the data distribution is curved and complex.
Furthermore, training models to understand motion (video) and semantics (images) simultaneously is difficult. Many models train on images first and then “inflate” to video, often treating the two modalities as separate phases. Goku argues that a joint approach—treating images and videos as different expressions of the same visual data—coupled with a straighter, more direct generation path (Rectified Flow), is the key to higher quality and efficiency.
The Goku Pipeline: From Data to Generation
The Goku framework is built on three pillars: a rigorous Data Curation Pipeline, a Unified Model Architecture, and the Rectified Flow formulation.
1. The Data Curation Pipeline
A generative model is only as good as the data it sees. The researchers behind Goku emphasize that raw internet data is insufficient for industry-grade performance. They constructed a massive pipeline to filter, caption, and balance their training set, resulting in approximately 160 million image-text pairs and 36 million video-text pairs.

As shown in the figure above, the pipeline functions as a funnel:
- Collection & Extraction: Raw videos are collected and split into clips.
- Filtering: This is the most critical step (a minimal sketch follows this list). The system filters based on:
  - Aesthetics: Using scoring models to keep only visually appealing content.
  - Motion: Using optical flow to discard static videos or chaotic, shaky footage.
  - OCR: Removing videos heavily cluttered with text (like overlays or credits).
- Captioning: Good captions drive good generation. Goku uses Multimodal Large Language Models (MLLMs) to generate dense, descriptive captions for videos, describing not just the scene but the camera movement (e.g., “pan right,” “zoom in”).
- Balancing: The raw distribution of internet video is heavily skewed. To prevent the model from only learning to generate common categories (like “people talking”), the data is semantically balanced to ensure rare concepts (like specific sports or animals) are represented.
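To make the funnel concrete, here is a minimal sketch of what clip-level filtering could look like in Python. The `Clip` fields, scores, and thresholds are illustrative placeholders, not the paper's actual tooling or cut-off values.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    aesthetic: float      # score from an aesthetic predictor; higher = more appealing
    mean_flow: float      # average optical-flow magnitude, a proxy for motion
    text_coverage: float  # fraction of frame area covered by detected text (OCR)

# Hypothetical thresholds for illustration; the paper tunes its own filters.
MIN_AESTHETIC = 4.5
MIN_FLOW, MAX_FLOW = 0.3, 20.0    # discard static clips and chaotic, shaky ones
MAX_TEXT_COVERAGE = 0.05          # drop clips dominated by overlays or credits

def keep(clip: Clip) -> bool:
    """Return True if a clip survives the aesthetic, motion, and OCR filters."""
    return (
        clip.aesthetic >= MIN_AESTHETIC
        and MIN_FLOW <= clip.mean_flow <= MAX_FLOW
        and clip.text_coverage <= MAX_TEXT_COVERAGE
    )

clips = [
    Clip("a.mp4", aesthetic=5.2, mean_flow=3.1, text_coverage=0.01),
    Clip("b.mp4", aesthetic=3.0, mean_flow=0.0, text_coverage=0.40),
]
print([c.path for c in clips if keep(c)])  # ['a.mp4']
```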

2. The Core Architecture
Goku is not just one model but a family of Transformers scaling up to 8 billion parameters. It employs a Joint Image-Video Variational Autoencoder (VAE).
Compressing the Visual World
Videos are heavy. To train efficiently, Goku compresses raw pixels into a “latent space”—a compressed numerical representation.
- For Images: It compresses spatial dimensions.
- For Videos: It compresses both spatial dimensions (\(H \times W\)) and the time dimension (\(T\)).
This 3D-VAE allows the model to process a video clip as a sequence of tokens, similar to how an LLM processes a sentence of words.
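As a rough back-of-the-envelope illustration, the snippet below estimates how many tokens a clip becomes after this kind of compression. The temporal stride, spatial stride, and patch size are assumptions made for the sketch, not the exact factors of Goku's VAE.

```python
def latent_token_count(frames: int, height: int, width: int,
                       t_stride: int = 4, s_stride: int = 8, patch: int = 2) -> int:
    """Rough token count after 3D-VAE compression plus transformer patchification."""
    t = max(1, frames // t_stride)   # temporal compression (an image has frames == 1)
    h = height // s_stride // patch  # spatial compression, then patch embedding
    w = width // s_stride // patch
    return t * h * w

# A 4-second, 24 fps clip at 288x512 versus a single image at the same resolution:
print(latent_token_count(frames=96, height=288, width=512))  # 13824 tokens
print(latent_token_count(frames=1,  height=288, width=512))  # 576 tokens
```

Note how an image is simply the `frames=1` case of the same computation, which is exactly the "image as a single-frame video" view exploited later during joint training.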
The Transformer Backbone
Once the visual data is compressed into tokens, it is fed into the Transformer. Goku introduces several architectural tweaks to handle the complexity of video:
- Full Attention: Instead of separating “spatial attention” (looking at one frame) and “temporal attention” (looking across frames), Goku uses full attention. This allows the model to understand complex motion where objects change position and shape simultaneously.
- Patch n’ Pack: To handle videos of different lengths and resolutions in the same batch, Goku “packs” sequences together (see the sketch after this list). This minimizes wasted computation on padding tokens.
- 3D RoPE (Rotary Positional Embeddings): This helps the model understand where a token is located in both space (X, Y) and time (T).
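To see what Patch n’ Pack buys, here is a simplified PyTorch sketch of packing an image sequence and a video sequence into one batch. The `pack_sequences` helper and the token and embedding sizes are assumptions for illustration, not Goku's implementation.

```python
import torch

def pack_sequences(seqs):
    """Concatenate variable-length token sequences into one packed sequence.

    The cumulative lengths let a block-diagonal attention mask (or a
    variable-length attention kernel) stop tokens from attending across
    sample boundaries, so no computation is wasted on padding tokens.
    """
    packed = torch.cat(seqs, dim=0)                      # (sum_len, dim)
    lengths = torch.tensor([s.shape[0] for s in seqs])
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long),
                            lengths.cumsum(0)])          # sequence boundaries
    return packed, cu_seqlens

# Example: a 576-token image and a 13824-token video clip share one batch.
image_tokens = torch.randn(576, 3072)
video_tokens = torch.randn(13824, 3072)
packed, cu_seqlens = pack_sequences([image_tokens, video_tokens])
print(packed.shape, cu_seqlens.tolist())  # torch.Size([14400, 3072]) [0, 576, 14400]
```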

3. Rectified Flow: The Mathematical Engine
This is the most distinct feature of Goku. Instead of standard diffusion, it uses Rectified Flow (RF).
The core idea of Rectified Flow is to connect the noise distribution (start) and the data distribution (end) with a straight line.
The training objective is based on linear interpolation. If \(\mathbf{x}_1\) is your real image or video and \(\mathbf{x}_0\) is pure noise, the model learns to predict the “velocity” needed to travel from \(\mathbf{x}_0\) to \(\mathbf{x}_1\) along the path defined by:
\[ \mathbf{x}_t = t \cdot \mathbf{x}_1 + (1 - t) \cdot \mathbf{x}_0 . \]
Why does this matter? Because the path is a straight line, the velocity the model must learn is simply the constant difference \(\mathbf{x}_1 - \mathbf{x}_0\), the same at every point along the trajectory. That is a far easier target than the curved trajectories of standard diffusion models, and it translates into faster convergence and fewer sampling steps.
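In code, the objective is remarkably compact. Below is a minimal flow-matching training step under the interpolation above; `model` and `text_emb` stand in for Goku's transformer and its text conditioning and are assumptions of the sketch, not the paper's actual implementation.

```python
import torch

def rectified_flow_loss(model, x1, text_emb):
    """One rectified-flow training step (simplified sketch).

    x1: clean latents of shape (B, ...). The network is trained to predict the
    constant velocity x1 - x0 along the straight line between noise and data.
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                    # pure-noise endpoint
    t = torch.rand(b, device=x1.device)          # interpolation time in [0, 1]
    t_ = t.view(b, *([1] * (x1.dim() - 1)))      # broadcast t over latent dims
    xt = t_ * x1 + (1.0 - t_) * x0               # x_t = t*x1 + (1-t)*x0
    v_target = x1 - x0                           # straight-line velocity
    v_pred = model(xt, t, text_emb)              # network predicts the velocity
    return torch.mean((v_pred - v_target) ** 2)
```

At inference time, generation amounts to integrating the learned velocity field from noise to data, for example with a plain Euler loop \(\mathbf{x}_{t+\Delta t} = \mathbf{x}_t + \Delta t \cdot v_\theta(\mathbf{x}_t, t)\), which is where the reduction in sampling steps comes from.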
The researchers validated this efficiency in pilot experiments. As the table below shows, the Rectified Flow version of the model reached better FID (visual quality) scores than the standard DDPM version in significantly fewer training steps (400k vs. 1,000k).

Training Strategy: A Multi-Stage Approach
You cannot simply throw 36 million videos at an 8-billion parameter model and expect it to work. Goku employs a multi-stage training strategy:
- Text-Semantic Pairing: The model first trains on text-to-image tasks. This establishes a strong understanding of visual concepts (what is a “cat”? what is “running”?).
- Joint Learning: The model trains on both images and videos, with images treated as single-frame videos (see the sketch after this list). This prevents the model from forgetting high-quality static details while learning motion.
- Cascaded Resolution: Training starts at low resolution (\(288 \times 512\)) to learn general composition and motion, then scales up to high resolution (\(720 \times 1280\)) to refine details.
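The "images are single-frame videos" convention from the joint-learning stage is simple enough to show directly. A tiny sketch with made-up latent shapes:

```python
import torch

def as_video_latent(x: torch.Tensor) -> torch.Tensor:
    """Treat an image latent (C, H, W) as a one-frame video latent (1, C, H, W)."""
    return x.unsqueeze(0) if x.dim() == 3 else x

image_latent = torch.randn(16, 36, 64)      # (C, H, W) from the VAE
video_latent = torch.randn(24, 16, 36, 64)  # (T, C, H, W) from the VAE
batch = [as_video_latent(image_latent), as_video_latent(video_latent)]
print([tuple(x.shape) for x in batch])      # [(1, 16, 36, 64), (24, 16, 36, 64)]
```

Once both modalities share this layout, the same packed-attention Transformer can consume them in a single batch.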
Experimental Results
Does this complex pipeline and new math actually result in better videos? The benchmarks suggest yes.
Text-to-Image Performance
Even though Goku is a video-focused model, its joint training makes it an exceptional image generator. On the GenEval and DPG-Bench benchmarks (which test how well a model follows complex text prompts), Goku outperformed major competitors like DALL-E 2 and SDXL.

Qualitatively, the model shows a strong ability to render texture and complex object interactions, as seen in these sample outputs:

Text-to-Video Performance
The primary goal, however, is video. On VBench, a comprehensive suite for evaluating video generation across dimensions like “Motion Smoothness” and “Human Action,” Goku achieved the top spot on the leaderboard.

Visual comparisons highlight the difference. In the comparison below, models were asked to generate a surfer. While many models struggled with the physics of the wave or the consistency of the surfer’s body, Goku (bottom row) maintained a coherent figure and realistic water dynamics throughout the clip.

Similarly, in complex underwater scenes involving drones and coral reefs, Goku demonstrates superior temporal stability—meaning the background doesn’t flicker or morph weirdly as the camera moves.

Ablation: Do Size and Joint Training Matter?
The researchers conducted “ablation studies” to verify their design choices.
- Scaling: They compared a 2B parameter model vs. an 8B model. The 8B model showed significantly better structural integrity (limbs didn’t disappear, objects stayed solid).
- Joint Training: They tested training on video only versus video + images. The joint training approach produced much higher photorealism, as the high-quality image data helped “teach” the video model about textures and lighting.

Beyond Text-to-Video: Image-to-Video Animation
A crucial feature for creative professionals is Image-to-Video (I2V)—taking a static image and animating it based on a prompt.
Goku adapts to this by treating the input image as the first frame of the video latent sequence. Because the model was trained jointly on images and video, it naturally understands how to extend a static frame into time.
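A minimal way to picture this first-frame conditioning is sketched below, with hypothetical latent shapes; it is a simplified illustration of the idea rather than Goku's exact conditioning recipe.

```python
import torch

def make_i2v_init(image_latent: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Build an initial latent sequence for image-to-video generation (sketch).

    The reference image's latent is pinned as frame 0, while the remaining
    frames start as pure noise and are denoised by the flow model.
    """
    noise = torch.randn(num_frames, *image_latent.shape)
    noise[0] = image_latent              # keep the given image as the first frame
    return noise

first_frame = torch.randn(16, 36, 64)    # hypothetical VAE latent of the input image
video_latents = make_i2v_init(first_frame, num_frames=24)
print(video_latents.shape)               # torch.Size([24, 16, 36, 64])
```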
The results are impressive. Whether it is animating a splash of water or a pirate ship sailing inside a coffee cup, the model respects the initial image’s content while adding plausible motion.

Conclusion
Goku represents a significant step forward in generative media. It moves beyond the trial-and-error phase of early video generation into a more rigorous, engineered approach. By combining the theoretical efficiency of Rectified Flow with the practical robustness of Joint Image-Video Training, it solves several key problems in the field:
- Convergence Speed: Flow matching learns faster than diffusion.
- Coherence: Joint training ensures high fidelity in individual frames and smooth transitions across time.
- Scale: The infrastructure supports training massive 8B parameter models on high-resolution data.
For students and researchers, Goku offers a blueprint for the next generation of foundation models: don’t just add more data; fix the underlying mathematical formulation (Flow vs. Diffusion) and unify the modalities (Image + Video) for a smarter, more efficient learner.