The landscape of generative AI has shifted dramatically with the adoption of Diffusion Transformers (DiTs). Models like Stable Diffusion 3 and Sora have demonstrated that replacing the traditional U-Net backbone with a Transformer architecture leads to scalable, high-fidelity results. However, this performance comes at a steep computational cost.
Current diffusion models operate on a static paradigm: they allocate a fixed, heavy amount of compute to every single step of the denoising process. Whether the model is resolving the vague outline of a composition or refining the texture of a cat’s fur, it burns the same number of FLOPs (Floating Point Operations).
This blog post explores FlexiDiT, a novel framework that challenges this inefficiency. By “flexifying” DiT models, researchers have found a way to dynamically adjust the compute budget per step, achieving massive efficiency gains—over 40% for images and up to 75% for videos—without sacrificing quality.

The Intuition: Not All Steps Are Created Equal
To understand why FlexiDiT works, we first need to look at how Diffusion Transformers process images.
In a standard Vision Transformer (ViT) or DiT, an image isn’t processed as a grid of pixels. Instead, it is chopped up into patches (e.g., \(2 \times 2\) or \(4 \times 4\) pixels). These patches are flattened and projected into embeddings, creating a sequence of tokens.

As shown above, the patch size (\(p\)) dictates the sequence length (\(N\)). If you double the patch size, you reduce the number of tokens by a factor of four. Since the computational complexity of the self-attention mechanism in Transformers scales quadratically with the number of tokens (\(O(N^2)\)), reducing the token count yields massive speedups.
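To make the scaling concrete, here is a tiny Python sketch (using an illustrative \(64 \times 64\) latent grid) that counts tokens for two patch sizes and compares the resulting self-attention cost.

```python
def token_count(image_size: int, patch_size: int) -> int:
    """Number of tokens when an image_size x image_size grid is split into patches."""
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

# Example: a 64x64 latent grid (sizes are illustrative).
n_small = token_count(64, 2)   # 1024 tokens
n_large = token_count(64, 4)   # 256 tokens -> 4x fewer tokens

# Self-attention cost grows with N^2, so 4x fewer tokens
# means roughly 16x cheaper attention for that step.
print(n_small, n_large, (n_small / n_large) ** 2)  # 1024 256 16.0
```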
The Spectral Reality of Diffusion
Diffusion models generate images by iteratively removing noise. However, the nature of the “signal” being recovered changes throughout this process.
- Early Steps (High Noise): The model focuses on low-frequency details—global structure, layout, and large shapes.
- Late Steps (Low Noise): The model focuses on high-frequency details—textures, sharp edges, and fine noise.
The researchers behind FlexiDiT empirically demonstrated this by applying low-pass and high-pass filters to single diffusion update steps.

The key insight here is that early denoising steps focus on low-frequency information. You do not need a fine-grained grid of tokens to decide where the sky goes versus the ground. You can achieve the same result with coarser tokens (larger patches). Conversely, later steps require high-resolution tokens to generate crisp details.
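Here is a rough NumPy sketch of that kind of frequency check, not the authors' exact experiment: split a single update into low- and high-frequency energy with an FFT. The "updates" below are stand-ins (a smooth blob vs. white noise) rather than outputs of a trained model.

```python
import numpy as np

def band_energy(update: np.ndarray, cutoff_frac: float = 0.1):
    """Split a single denoising update into low- and high-frequency energy.

    `update` is a 2D array (one channel of x_{t-1} - x_t); `cutoff_frac`
    is the radius of the low-pass band as a fraction of the spectrum.
    """
    spec = np.fft.fftshift(np.fft.fft2(update))
    h, w = spec.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)
    low_mask = radius <= cutoff_frac
    power = np.abs(spec) ** 2
    return power[low_mask].sum(), power[~low_mask].sum()

# Stand-in updates: a smooth blob (early-step-like) vs. white noise (late-step-like).
x = np.linspace(-1, 1, 64)
smooth = np.exp(-(x[:, None] ** 2 + x[None, :] ** 2) * 4)
noisy = np.random.randn(64, 64)
print(band_energy(smooth))  # almost all energy in the low band
print(band_energy(noisy))   # energy spread across both bands
```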
Using a “powerful” model (small patches, high compute) for the early, coarse steps is a waste of resources. FlexiDiT solves this by allowing the model to switch between “weak” (large patch) and “powerful” (small patch) modes on the fly.
The FlexiDiT Framework
The goal of FlexiDiT is to take a pre-trained DiT and make it “flexible,” meaning it can process inputs at varying patch sizes. This allows us to use a Dynamic Inference Scheduler: using large patches for early steps and small patches for late steps.
The researchers propose two primary methods to achieve this, depending on whether you want to fine-tune the whole model or keep the original weights frozen.
Method 1: Shared Parameters (Full Fine-Tuning)
If you have access to the original training data and resources, you can fine-tune the entire model. The core Transformer blocks remain the same—a Transformer can technically process a sequence of any length. The challenge lies in the input and output layers, which are tied to specific spatial dimensions.
To fix this, the authors adjust the Tokenization (Embedding) and De-tokenization layers.

As illustrated in the figure above (Left), the model introduces new embedding weights for the new patch sizes.
- Initialization: The new weights are initialized from the pre-trained weights using a pseudo-inverse projection (bi-linear interpolation). This ensures the model starts with a good understanding of the image structure.
- Positional Encodings: These are interpolated to match the new grid size.
- Training: The model is fine-tuned to denoise images using randomly selected patch sizes.
This allows a single set of Transformer weights to process sequences of different lengths.
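As a rough PyTorch sketch of the idea (not the authors' exact procedure), one could resize the pretrained \(2 \times 2\) patch-embedding kernel to \(4 \times 4\) with bilinear interpolation, one simple stand-in for the pseudo-inverse projection described above, and interpolate the positional grid to the coarser token layout. All shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def init_weak_patch_embed(w_small: torch.Tensor, new_patch: int) -> torch.Tensor:
    """Resize a pretrained patch-embedding kernel to a larger patch size.

    w_small: (embed_dim, in_channels, p, p) conv kernel from the pretrained DiT.
    Returns a (embed_dim, in_channels, new_patch, new_patch) kernel. Bilinear
    resizing is a simple approximation of the pseudo-inverse initialization.
    """
    w_large = F.interpolate(w_small, size=(new_patch, new_patch),
                            mode="bilinear", align_corners=False)
    # Rescale so the projection's output stays roughly comparable after
    # summing over the larger patch area.
    return w_large * (w_small.shape[-1] / new_patch) ** 2

def resize_pos_embed(pos: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Interpolate (1, H*W, dim) positional embeddings to a new grid size."""
    n, dim = pos.shape[1], pos.shape[2]
    grid = int(n ** 0.5)
    pos_2d = pos.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)
    pos_2d = F.interpolate(pos_2d, size=(new_grid, new_grid),
                           mode="bilinear", align_corners=False)
    return pos_2d.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# Illustrative shapes: 4 latent channels, 1152-dim embeddings, 32x32 token grid.
w2 = torch.randn(1152, 4, 2, 2)
w4 = init_weak_patch_embed(w2, new_patch=4)    # (1152, 4, 4, 4)
pos = torch.randn(1, 32 * 32, 1152)
pos_weak = resize_pos_embed(pos, new_grid=16)  # (1, 256, 1152)
```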
Method 2: LoRA (Parameter Efficient Fine-Tuning)
In many cases, full fine-tuning is too expensive or the original dataset is unavailable. For this, the authors utilize Low-Rank Adaptation (LoRA).
Instead of retraining the massive DiT block, they freeze the original weights. They then add small, trainable LoRA matrices specifically for the new “weak” patch sizes.

During inference with a “weak” step (large patch), the LoRA adapters are active. During a “powerful” step (original patch size), the adapters are deactivated, ensuring the model retains its original, high-quality behavior exactly.
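A minimal sketch of how such a toggle could be wired up (an illustration, not the authors' implementation): the frozen pretrained projection is always applied, and the low-rank branch is added only during weak steps.

```python
import torch
import torch.nn as nn

class WeakStepLoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a LoRA branch used only for weak (large-patch) steps."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the pretrained weights frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # adapters start as an exact no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, weak_mode: bool) -> torch.Tensor:
        out = self.base(x)
        if weak_mode:                         # adapters only touch weak steps
            out = out + self.scale * self.lora_b(self.lora_a(x))
        return out

layer = WeakStepLoRALinear(nn.Linear(1152, 1152))
x = torch.randn(2, 256, 1152)
y_powerful = layer(x, weak_mode=False)  # exactly the original pretrained behavior
y_weak = layer(x, weak_mode=True)       # adapted path for large-patch steps
```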
Knowledge Distillation
To make the training efficient, the authors use the original “powerful” model as a teacher. They train the “weak” model to mimic the predictions of the powerful model. This is defined by minimizing the distance between the noise prediction of the powerful model (\(\epsilon_{\theta}(\dots; p_{\text{powerful}})\)) and the weak model (\(\epsilon_{\theta}(\dots; p_{\text{weak}})\)):
\[ \mathbb{E}_{t, \mathbf{x}_t} \left\Vert \epsilon_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t; \boldsymbol{p}_{\mathrm{powerful}}) - \epsilon_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t; \boldsymbol{p}_{\mathrm{weak}}) \right\Vert_2 . \]
This ensures that the weak model learns to approximate the powerful model’s output as closely as possible, rather than learning from scratch.
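In code, a distillation step could look roughly like the sketch below; `model(x, t, cond, patch_size=...)` is a hypothetical interface for a flexified DiT, and the optimizer would only update the weak-mode parameters (e.g., the LoRA matrices).

```python
import torch
import torch.nn.functional as F

def distillation_step(model, x_t, t, cond, optimizer):
    """One training step matching the weak model's prediction to the powerful one's."""
    with torch.no_grad():                         # teacher: original powerful mode
        eps_powerful = model(x_t, t, cond, patch_size=2)

    eps_weak = model(x_t, t, cond, patch_size=4)  # student: weak (large-patch) mode
    loss = F.mse_loss(eps_weak, eps_powerful)     # squared error between the two predictions

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```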
Inference: The Dynamic Scheduler
Once the model is trained to handle multiple patch sizes, the inference strategy is straightforward.
We define a schedule where the first \(T_{\text{weak}}\) steps use the large patch size (low compute), and the remaining steps use the original small patch size (high compute).
- Weak Model: Patch size \(4 \times 4\). Each token covers 16 pixels, so the sequence is \(16\times\) shorter than the pixel grid. Fast.
- Powerful Model: Patch size \(2 \times 2\). Each token covers 4 pixels, so the sequence is only \(4\times\) shorter. Slow.
By simply adjusting \(T_{\text{weak}}\), we can slide the dial between maximum quality and maximum speed.
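A sketch of what such a scheduler might look like, together with a very rough compute estimate. The per-step cost of a weak step is an assumption here (somewhere between 1/16 and 1/4 of a powerful step, depending on the attention/MLP split), and `denoise_step` is a hypothetical wrapper around the model.

```python
def patch_schedule(num_steps: int, t_weak: int, p_weak: int = 4, p_powerful: int = 2):
    """Large patches for the first t_weak steps, small patches afterwards."""
    return [p_weak if i < t_weak else p_powerful for i in range(num_steps)]

def relative_cost(schedule, weak_step_cost: float = 0.25) -> float:
    """Rough compute fraction vs. an all-powerful schedule.

    weak_step_cost is an assumed per-step cost of a weak step relative to a
    powerful one; 0.25 is a conservative stand-in.
    """
    costs = [weak_step_cost if p == 4 else 1.0 for p in schedule]
    return sum(costs) / len(costs)

schedule = patch_schedule(num_steps=50, t_weak=25)
print(relative_cost(schedule))   # ~0.62 of baseline FLOPs under these assumptions

# Sampling loop sketch; `denoise_step` and `model` are hypothetical.
# x = initial_noise
# for t, p in zip(reversed(range(50)), schedule):
#     x = denoise_step(model, x, t, patch_size=p)
```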

The graph on the left (Figure 6) is the “money shot.” It shows the FID (Fréchet Inception Distance, where lower is better) relative to the percentage of FLOPs used.
- At 100% FLOPs (baseline), the FID is ~2.25.
- At ~60% FLOPs, the FID is still ~2.25.
This implies that roughly 40% of the computation in standard diffusion models is redundant and can be offloaded to a weaker, faster model without any perceptual loss in quality.
Experiments and Results
The authors tested FlexiDiT across class-conditioned generation (ImageNet), text-to-image (T2I), and video generation.
Text-to-Image Performance
For text-to-image models (like Emu or PIXART), the results hold up just as well. The authors plotted FID against CLIP scores (which measure how well the image matches the text prompt).

The results show that FlexiDiT models (the various colored points) lie along the same Pareto frontier as the fully compute-intensive baseline. In plain English: you can achieve the same alignment and image quality as the heavy model, but you get there much faster.
Video Generation: The Big Wins
Video generation is where the quadratic complexity of Transformers really hurts. Videos add a temporal dimension, causing the number of tokens to explode.
FlexiDiT applies the same logic here: use larger patches across space and time for the early steps.
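A quick token-count sketch (with an illustrative latent size and hypothetical spatiotemporal patch shapes) shows why the savings compound for video.

```python
def video_token_count(frames: int, height: int, width: int,
                      patch_t: int, patch_s: int) -> int:
    """Tokens for a video latent split into patch_t x patch_s x patch_s patches."""
    return (frames // patch_t) * (height // patch_s) * (width // patch_s)

# Illustrative latent: 32 frames of a 64x64 grid.
n_powerful = video_token_count(32, 64, 64, patch_t=1, patch_s=2)  # 32 * 32 * 32 = 32768
n_weak = video_token_count(32, 64, 64, patch_t=2, patch_s=4)      # 16 * 16 * 16 = 4096

# 8x fewer tokens -> up to ~64x cheaper attention for the early, coarse steps.
print(n_powerful, n_weak, n_powerful // n_weak)
```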

For video, the savings are massive. The model maintains its VBench quality score even when using only ~25% of the baseline compute. This is a potential game-changer for the economics of AI video generation.
It’s Not Just Theoretical FLOPs
A common criticism in efficiency research is that “theoretical FLOPs” don’t always translate to real-world latency (speed) due to memory bandwidth bottlenecks or hardware inefficiency.
The authors profiled FlexiDiT on an NVIDIA H100 GPU to verify the gains.

The graph above shows that FlexiDiT (specifically the weak model configurations) maintains high GPU utilization. Because self-attention creates a bottleneck for long sequences (small patches), shifting to larger patches actually utilizes the hardware more effectively, meaning the latency reductions are real and proportional to the FLOPs reduction.
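If you want to sanity-check this on your own hardware, a micro-benchmark along these lines (not the authors' profiling setup) compares a single self-attention layer at the two sequence lengths; exact numbers will depend on the GPU.

```python
import time
import torch

def attention_latency(seq_len: int, dim: int = 1152, heads: int = 16,
                      iters: int = 20, device: str = "cuda") -> float:
    """Average forward latency (seconds) of one self-attention layer at a given sequence length."""
    attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True).to(device).eval()
    x = torch.randn(1, seq_len, dim, device=device)
    with torch.no_grad():
        for _ in range(3):                    # warm-up iterations
            attn(x, x, x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            attn(x, x, x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if torch.cuda.is_available():
    print(attention_latency(1024))  # powerful mode: 2x2 patches on a 64x64 grid
    print(attention_latency(256))   # weak mode: 4x4 patches, 4x fewer tokens
```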
Conclusion
FlexiDiT revisits a fundamental assumption in diffusion models: that every step requires the same amount of “brain power.” By aligning the computational budget with the spectral nature of the denoising process—coarse compute for coarse features, fine compute for fine features—the authors unlocked significant efficiency gains.
Key Takeaways:
- Dynamic Patching: Changing patch sizes during inference allows for variable compute.
- No Quality Loss: You can cut compute by ~40% for images and ~75% for videos with virtually no visual degradation.
- Easy Integration: The LoRA-based method allows this to be applied to existing pre-trained models with minimal training data (~5000 images) and compute (<5% of original training cost).
As models continue to scale, techniques like FlexiDiT that optimize the inference phase will be critical in making high-fidelity generative AI accessible and affordable.