Introduction

The race for generative video dominance has been one of the most exciting developments in artificial intelligence over the last few years. While diffusion models have become the standard for generating stunning static images, applying them to video—with its added dimension of time—has introduced massive computational bottlenecks and stability issues.

Most current video models treat video generation as an extension of image generation, often patching together spatial and temporal attention modules. However, a new contender has emerged from researchers at The University of Hong Kong and ByteDance. Named Goku, this family of models proposes a unified, industry-grade solution that handles both images and videos within a single framework.

What makes Goku special? It moves away from the traditional Denoising Diffusion Probabilistic Models (DDPMs) that have dominated the field, instead adopting Rectified Flow Transformers. By combining a massive, curated dataset with a mathematical formulation that simplifies the path from noise to data, Goku achieves state-of-the-art results on major benchmarks like VBench and GenEval.

In this post, we will tear down the architecture of Goku. We will explore how it curates data, why it ditches standard diffusion for “flow,” and how it manages to train models with billions of parameters without running out of memory.

The Background: Why Fix What Works?

To understand Goku, we first need to understand the limitations of the current status quo.

Standard latent diffusion models generate data by gradually removing noise. Imagine a drunk person trying to walk home; they might take a winding, inefficient path. In mathematical terms, diffusion models often require many sampling steps to resolve an image or video from noise because the “trajectory” from the noise distribution to the data distribution is curved and complex.

Furthermore, training models to understand motion (video) and semantics (images) simultaneously is difficult. Many models train on images first and then “inflate” to video, often treating the two modalities as separate phases. Goku argues that a joint approach—treating images and videos as different expressions of the same visual data—coupled with a straighter, more direct generation path (Rectified Flow), is the key to higher quality and efficiency.

The Goku Pipeline: From Data to Generation

The Goku framework is built on three pillars: a rigorous Data Curation Pipeline, a Unified Model Architecture, and the Rectified Flow formulation.

1. The Data Curation Pipeline

A generative model is only as good as the data it sees. The researchers behind Goku emphasize that raw internet data is insufficient for industry-grade performance. They constructed a massive pipeline to filter, caption, and balance their training set, resulting in approximately 160 million image-text pairs and 36 million video-text pairs.

The data curation pipeline in Goku. Given a large volume of video/image data collected from the Internet, high-quality video/image-text pairs are generated through a series of data filtering, captioning, and balancing steps.

As shown in the figure above, the pipeline functions as a funnel:

  1. Collection & Extraction: Raw videos are collected and split into clips.
  2. Filtering: This is the most critical step (a minimal sketch of such a filter follows this list). The system filters based on:
  • Aesthetics: Using scoring models to keep only visually appealing content.
  • Motion: Using optical flow to discard static videos or chaotic, shaky footage.
  • OCR: Removing videos heavily cluttered with text (like overlays or credits).
  3. Captioning: Good captions drive good generation. Goku uses Multimodal Large Language Models (MLLMs) to generate dense, descriptive captions for videos, describing not just the scene but also the camera movement (e.g., “pan right,” “zoom in”).
  4. Balancing: The raw distribution of internet video is heavily skewed. To prevent the model from only learning to generate common categories (like “people talking”), the data is semantically balanced so that rarer concepts (like specific sports or animals) are adequately represented.
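
To make the funnel concrete, here is a minimal sketch of what such a filtering stage might look like. The `Clip` fields, scorer outputs, and thresholds are illustrative assumptions, not values from the paper:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """A candidate video clip with pre-computed quality signals (hypothetical fields)."""
    path: str
    aesthetic_score: float    # e.g., from an aesthetic scoring model; higher is better
    mean_optical_flow: float  # average flow magnitude across frames
    ocr_text_ratio: float     # fraction of frame area covered by detected text

def keep_clip(clip: Clip,
              min_aesthetic: float = 4.5,
              flow_range: tuple[float, float] = (0.5, 20.0),
              max_text_ratio: float = 0.05) -> bool:
    """Illustrative filter mirroring the aesthetics / motion / OCR stages."""
    if clip.aesthetic_score < min_aesthetic:            # drop visually unappealing clips
        return False
    low, high = flow_range
    if not (low <= clip.mean_optical_flow <= high):     # drop static or chaotic, shaky footage
        return False
    if clip.ocr_text_ratio > max_text_ratio:            # drop text-heavy clips (overlays, credits)
        return False
    return True

# Usage: filtered = [c for c in candidate_clips if keep_clip(c)]
```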

Figure 3 | Training data distributions. The balanced semantic distributions of primary categories and subcategories are shown in (a) and (b), respectively.

2. The Core Architecture

Goku is not just one model but a family of Transformers scaling up to 8 billion parameters. It employs a Joint Image-Video Variational Autoencoder (VAE).

Compressing the Visual World

Videos are heavy. To train efficiently, Goku compresses raw pixels into a “latent space”—a compressed numerical representation.

  • For Images: It compresses spatial dimensions.
  • For Videos: It compresses both spatial dimensions (\(H \times W\)) and the time dimension (\(T\)).

This 3D-VAE allows the model to process a video clip as a sequence of tokens, similar to how an LLM processes a sentence of words.
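
To get a feel for what this compression buys, here is a back-of-the-envelope token count. The 4× temporal / 8× spatial compression factors and the patch size of 2 are assumptions for illustration, not figures reported in the paper:

```python
def latent_token_count(frames: int, height: int, width: int,
                       t_stride: int = 4, s_stride: int = 8, patch: int = 2) -> int:
    """Number of transformer tokens after 3D-VAE compression and patchification.

    The compression factors and patch size here are illustrative assumptions.
    """
    t = max(frames // t_stride, 1)   # compressed temporal length (a single image keeps t = 1)
    h = height // s_stride // patch  # compressed + patchified height
    w = width // s_stride // patch   # compressed + patchified width
    return t * h * w

# A single 512x288 image -> one "frame" of tokens
print(latent_token_count(frames=1, height=288, width=512))    # 576 tokens
# A 5-second, 24 fps clip at the same resolution
print(latent_token_count(frames=120, height=288, width=512))  # 17,280 tokens
```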

The Transformer Backbone

Once the visual data is compressed into tokens, it is fed into the Transformer. Goku introduces several architectural tweaks to handle the complexity of video:

  • Full Attention: Instead of separating “spatial attention” (looking at one frame) and “temporal attention” (looking across frames), Goku uses full attention. This allows the model to understand complex motion where objects change position and shape simultaneously.
  • Patch n’ Pack: To handle videos of different lengths and resolutions in the same batch, Goku “packs” sequences together. This minimizes wasted computation on padding tokens.
  • 3D RoPE (Rotary Positional Embeddings): This helps the model understand where a token is located in both space (X, Y) and time (T). A toy sketch of full attention over spatiotemporal tokens, and of the 3D positions such a scheme encodes, follows below.
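
To make the contrast with factorized attention concrete, here is a toy sketch of full attention over a flattened spatiotemporal token sequence, together with the (t, y, x) coordinates a 3D RoPE scheme would encode. This is an illustration, not the authors' implementation; projections and the rotary embedding itself are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def full_spatiotemporal_attention(latent: torch.Tensor, num_heads: int = 8) -> torch.Tensor:
    """Toy full attention over a video latent of shape (T, H, W, C).

    Every token attends to every other token across BOTH space and time,
    instead of factorizing into separate spatial and temporal attention passes.
    """
    T, H, W, C = latent.shape
    tokens = latent.reshape(1, T * H * W, C)        # flatten to one long sequence
    q = k = v = tokens.reshape(1, T * H * W, num_heads, C // num_heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)   # (1, heads, T*H*W, head_dim)
    return out.transpose(1, 2).reshape(T, H, W, C)

video = torch.randn(4, 6, 8, 64)                    # tiny latent: 4 frames, 6x8 patches, 64 channels
out = full_spatiotemporal_attention(video)

# Positions for 3D RoPE: each token carries a (t, y, x) coordinate so the model
# knows where it sits in time and space.
t_idx, y_idx, x_idx = torch.meshgrid(
    torch.arange(4), torch.arange(6), torch.arange(8), indexing="ij")
positions = torch.stack([t_idx, y_idx, x_idx], dim=-1).reshape(-1, 3)  # (T*H*W, 3)
```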

Table 1 | Architecture configurations for Goku models. The Goku-1B model is used only for pilot experiments in Section 2.3 of the paper.

3. Rectified Flow: The Mathematical Engine

This is the most distinct feature of Goku. Instead of standard diffusion, it uses Rectified Flow (RF).

The core idea of Rectified Flow is to connect the noise distribution (start) and the data distribution (end) with a straight line.

The training objective is based on linear interpolation. If \(x_1\) is your real image/video and \(x_0\) is pure noise, the model learns to predict the “velocity” needed to travel from \(x_0\) to \(x_1\) along the path defined by:

\[ \mathbf{x}_t = t \cdot \mathbf{x}_1 + (1 - t) \cdot \mathbf{x}_0 . \]


Why does this matter? Because the path is a straight line, the velocity the model needs to predict is constant along the entire trajectory: it is simply \(\mathbf{x}_1 - \mathbf{x}_0\). This makes the learning target far simpler than the curved denoising trajectories of standard diffusion models.
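
The training loop this implies is strikingly simple. The sketch below is a minimal rendering of the velocity-matching objective described above, not the authors' code; `model(x_t, t)` is assumed to predict the velocity of the straight path:

```python
import torch

def rectified_flow_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """One training step of the rectified-flow (velocity-matching) objective.

    x1 is a batch of clean latents; x0 is pure Gaussian noise.
    """
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)  # one timestep per sample in [0, 1)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over the latent dimensions
    x_t = t_ * x1 + (1.0 - t_) * x0                # linear interpolation x_t
    target_velocity = x1 - x0                      # constant along the straight path
    pred_velocity = model(x_t, t)
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)
```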

The researchers proved this efficiency in pilot experiments. As seen in the table below, the Rectified Flow version of the model converged faster and achieved better FID (visual quality) scores than the standard DDPM version with significantly fewer training steps (400k vs 1000k).

Table 2 | Proof-of-concept experiments on class-conditional generation on ImageNet 256×256. Rectified flow achieves faster convergence compared to DDPM.
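
The straight path also pays off at inference time: generating a sample amounts to integrating the learned velocity field from noise (t = 0) toward data (t = 1), which a handful of Euler steps can approximate. A minimal sketch, assuming the same velocity-predicting `model` as above and ignoring text conditioning and guidance:

```python
import torch

@torch.no_grad()
def sample_rectified_flow(model, shape, steps: int = 20, device: str = "cpu") -> torch.Tensor:
    """Euler integration of dx/dt = v(x, t) from noise (t = 0) to data (t = 1)."""
    x = torch.randn(shape, device=device)  # start from pure noise, x0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + model(x, t) * dt           # follow the predicted velocity along the path
    return x                               # approximation of x1; decode with the VAE afterwards
```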

Training Strategy: A Multi-Stage Approach

You cannot simply throw 36 million videos at an 8-billion-parameter model and expect it to work. Goku employs a multi-stage training strategy (sketched in code after the list):

  1. Text-Semantic Pairing: The model first trains on text-to-image tasks. This establishes a strong understanding of visual concepts (what is a “cat”? what is “running”?).
  2. Joint Learning: The model trains on both images and videos. Images are treated as single-frame videos. This prevents the model from forgetting high-quality static details while learning motion.
  3. Cascaded Resolution: Training starts at low resolution (\(288 \times 512\)) to learn general composition and motion, then scales up to high resolution (\(720 \times 1280\)) to refine details.
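
Concretely, you can think of the schedule as a curriculum over data sources and resolutions. The stage names and the per-stage resolution assignments below are shorthand assumptions; only the ordering and the two resolutions mirror the description above:

```python
# Illustrative curriculum; stage names and resolution assignments are placeholders.
TRAINING_STAGES = [
    {"name": "text_semantic_pairing", "data": ["image-text"],               "resolution": (288, 512)},
    {"name": "joint_image_video",     "data": ["image-text", "video-text"], "resolution": (288, 512)},
    {"name": "high_res_finetune",     "data": ["image-text", "video-text"], "resolution": (720, 1280)},
]

for stage in TRAINING_STAGES:
    print(f"Stage {stage['name']}: train on {stage['data']} at {stage['resolution']}")
    # train(model, datasets=stage["data"], resolution=stage["resolution"])  # hypothetical call
```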

Experimental Results

Does this complex pipeline and new math actually result in better videos? The benchmarks suggest yes.

Text-to-Image Performance

Even though Goku is a video-focused model, its joint training makes it an exceptional image generator. On the GenEval and DPG-Bench benchmarks (which test how well a model follows complex text prompts), Goku outperformed major competitors like DALL-E 2 and SDXL.

Table 5 | Comparison with state-of-the-art models on image generation benchmarks. We evaluate on GenEval (Ghosh et al., 2024), T2I-CompBench (Huang et al., 2023), and DPG-Bench (Hu et al., 2024). Following Wang et al. (2024b), we use † to indicate results obtained with prompt rewriting.

Qualitatively, the model shows a strong ability to render texture and complex object interactions, as seen in these sample outputs:

Figure 7 | Qualitative samples of Goku-T2I. Key words are highlighted in RED.

Text-to-Video Performance

The primary goal, however, is video. On VBench, a comprehensive suite for evaluating video generation across dimensions like “Motion Smoothness” and “Human Action,” Goku achieved the top spot on the leaderboard.

Table 7 | Comparison with leading T2V models on VBench. Goku achieves state-of-the-art overall performance. Detailed results across all 16 evaluation dimensions are provided in Table 8 in the appendix.

Visual comparisons highlight the difference. In the comparison below, models were asked to generate a surfer. While many models struggled with the physics of the wave or the consistency of the surfer’s body, Goku (bottom row) maintained a coherent figure and realistic water dynamics throughout the clip.

Figure 11 | Qualitative comparisons of Goku-T2V with SOTA video generation methods. Key words are highlighted in RED.

Similarly, in complex underwater scenes involving drones and coral reefs, Goku demonstrates superior temporal stability—meaning the background doesn’t flicker or morph weirdly as the camera moves.

Figure 6 | Qualitative comparisons with state-of-the-art (SoTA) video generation models. This figure showcases comparisons with leading models…

Ablation: Do Size and Joint Training Matter?

The researchers conducted “ablation studies” to verify their design choices.

  1. Scaling: They compared a 2B parameter model vs. an 8B model. The 8B model showed significantly better structural integrity (limbs didn’t disappear, objects stayed solid).
  2. Joint Training: They tested training on video only versus video + images. The joint training approach produced much higher photorealism, as the high-quality image data helped “teach” the video model about textures and lighting.

Figure 5 | Ablation studies of model scaling and joint training. Panel (a) compares Goku-T2V (2B) with Goku-T2V (8B); panel (b) compares models trained with and without joint image-video training.

Beyond Text-to-Video: Image-to-Video Animation

A crucial feature for creative professionals is Image-to-Video (I2V)—taking a static image and animating it based on a prompt.

Goku adapts to this by treating the input image as the first frame of the video latent sequence. Because the model was trained jointly on images and video, it naturally understands how to extend a static frame into time.
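
One simple way to picture this conditioning is to pin the encoded reference image to the first frame of the video latent at every sampling step. The sketch below is an assumption about how such first-frame conditioning could be wired up, not Goku's exact mechanism:

```python
import torch

@torch.no_grad()
def sample_image_to_video(model, image_latent: torch.Tensor, num_frames: int,
                          steps: int = 20) -> torch.Tensor:
    """Sketch of first-frame conditioning for image-to-video generation.

    `image_latent` has shape (C, H, W); the video latent has shape
    (num_frames, C, H, W). The exact conditioning used by Goku may differ;
    here we simply keep frame 0 pinned to the (noise-free) image latent.
    """
    C, H, W = image_latent.shape
    x = torch.randn(num_frames, C, H, W, device=image_latent.device)
    dt = 1.0 / steps
    for i in range(steps):
        x[0] = image_latent                               # anchor the first frame to the input image
        t = torch.full((1,), i * dt, device=image_latent.device)
        x = x + model(x.unsqueeze(0), t).squeeze(0) * dt  # Euler step on the whole clip
    x[0] = image_latent
    return x
```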

The results are impressive. Whether it is animating a splash of water or a pirate ship sailing inside a coffee cup, the model respects the initial image’s content while adding plausible motion.

Figure 12| Qualitative samples of Goku-I2V. Key words are highlighted in RED.

Conclusion

Goku represents a significant step forward in generative media. It moves beyond the trial-and-error phase of early video generation into a more rigorous, engineered approach. By combining the theoretical efficiency of Rectified Flow with the practical robustness of Joint Image-Video Training, it solves several key problems in the field:

  1. Convergence Speed: Flow matching learns faster than diffusion.
  2. Coherence: Joint training ensures high fidelity in individual frames and smooth transitions across time.
  3. Scale: The infrastructure supports training massive 8B parameter models on high-resolution data.

For students and researchers, Goku offers a blueprint for the next generation of foundation models: don’t just add more data; fix the underlying mathematical formulation (Flow vs. Diffusion) and unify the modalities (Image + Video) for a smarter, more efficient learner.