The landscape of generative AI has shifted dramatically with the adoption of Diffusion Transformers (DiTs). Models like Stable Diffusion 3 and Sora have demonstrated that replacing the traditional U-Net backbone with a Transformer architecture leads to scalable, high-fidelity results. However, this performance comes at a steep computational cost.
Current diffusion models operate on a static paradigm: they allocate a fixed, heavy amount of compute to every single step of the denoising process. Whether the model is resolving the vague outline of a composition or refining the texture of a cat’s fur, it burns the same number of FLOPs (Floating Point Operations).
This blog post explores FlexiDiT, a novel framework that challenges this inefficiency. By “flexifying” DiT models, researchers have found a way to dynamically adjust the compute budget per step, achieving massive efficiency gains—over 40% for images and up to 75% for videos—without sacrificing quality.

The Intuition: Not All Steps Are Created Equal
To understand why FlexiDiT works, we first need to look at how Diffusion Transformers process images.
In a standard Vision Transformer (ViT) or DiT, an image isn’t processed as a grid of pixels. Instead, it is chopped up into patches (e.g., \(2 \times 2\) or \(4 \times 4\) pixels). These patches are flattened and projected into embeddings, creating a sequence of tokens.

As shown above, the patch size (\(p\)) dictates the sequence length (\(N\)). If you double the patch size, you reduce the number of tokens by a factor of four. Since the computational complexity of the self-attention mechanism in Transformers scales quadratically with the number of tokens (\(O(N^2)\)), reducing the token count yields massive speedups.
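To make the scaling concrete, here is a tiny Python sketch (using an illustrative \(64 \times 64\) latent grid) that counts tokens for two patch sizes and compares the resulting self-attention cost.

```python
def token_count(image_size: int, patch_size: int) -> int:
    """Number of tokens when an image_size x image_size grid is split into patches."""
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

# Example: a 64x64 latent grid (sizes are illustrative).
n_small = token_count(64, 2)   # 1024 tokens
n_large = token_count(64, 4)   # 256 tokens -> 4x fewer tokens

# Self-attention cost grows with N^2, so 4x fewer tokens
# means roughly 16x cheaper attention for that step.
print(n_small, n_large, (n_small / n_large) ** 2)  # 1024 256 16.0
```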
The Spectral Reality of Diffusion
Diffusion models generate images by iteratively removing noise. However, the nature of the “signal” being recovered changes throughout this process.
- Early Steps (High Noise): The model focuses on low-frequency details—global structure, layout, and large shapes.
- Late Steps (Low Noise): The model focuses on high-frequency details—textures, sharp edges, and fine noise.
The researchers behind FlexiDiT empirically demonstrated this by applying low-pass and high-pass filters to single diffusion update steps.

The key insight here is that early denoising steps focus on low-frequency information. You do not need a fine-grained grid of tokens to decide where the sky goes versus the ground. You can achieve the same result with coarser tokens (larger patches). Conversely, later steps require high-resolution tokens to generate crisp details.
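Here is a rough NumPy sketch of that kind of frequency check, not the authors' exact experiment: split a single update into low- and high-frequency energy with an FFT. The "updates" below are stand-ins (a smooth blob vs. white noise) rather than outputs of a trained model.

```python
import numpy as np

def band_energy(update: np.ndarray, cutoff_frac: float = 0.1):
    """Split a single denoising update into low- and high-frequency energy.

    `update` is a 2D array (one channel of x_{t-1} - x_t); `cutoff_frac`
    is the radius of the low-pass band as a fraction of the spectrum.
    """
    spec = np.fft.fftshift(np.fft.fft2(update))
    h, w = spec.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)
    low_mask = radius <= cutoff_frac
    power = np.abs(spec) ** 2
    return power[low_mask].sum(), power[~low_mask].sum()

# Stand-in updates: a smooth blob (early-step-like) vs. white noise (late-step-like).
x = np.linspace(-1, 1, 64)
smooth = np.exp(-(x[:, None] ** 2 + x[None, :] ** 2) * 4)
noisy = np.random.randn(64, 64)
print(band_energy(smooth))  # almost all energy in the low band
print(band_energy(noisy))   # energy spread across both bands
```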
Using a “powerful” model (small patches, high compute) for the early, coarse steps is a waste of resources. FlexiDiT solves this by allowing the model to switch between “weak” (large patch) and “powerful” (small patch) modes on the fly.
The FlexiDiT Framework
The goal of FlexiDiT is to take a pre-trained DiT and make it “flexible,” meaning it can process inputs at varying patch sizes. This allows us to use a Dynamic Inference Scheduler: using large patches for early steps and small patches for late steps.
The researchers propose two primary methods to achieve this, depending on whether you want to fine-tune the whole model or keep the original weights frozen.
Method 1: Shared Parameters (Full Fine-Tuning)
If you have access to the original training data and resources, you can fine-tune the entire model. The core Transformer blocks remain the same—a Transformer can technically process a sequence of any length. The challenge lies in the input and output layers, which are tied to specific spatial dimensions.
To fix this, the authors adjust the Tokenization (Embedding) and De-tokenization layers.

As illustrated in the figure above (Left), the model introduces new embedding weights for the new patch sizes.
- Initialization: The new weights are initialized from the pre-trained weights using a pseudo-inverse projection (bi-linear interpolation). This ensures the model starts with a good understanding of the image structure.
- Positional Encodings: These are interpolated to match the new grid size.
- Training: The model is fine-tuned to denoise images using randomly selected patch sizes.
This allows a single set of Transformer weights to process sequences of different lengths.
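As a rough PyTorch sketch of the idea (not the authors' exact procedure), one could resize the pretrained \(2 \times 2\) patch-embedding kernel to \(4 \times 4\) with bilinear interpolation, one simple stand-in for the pseudo-inverse projection described above, and interpolate the positional grid to the coarser token layout. All shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def init_weak_patch_embed(w_small: torch.Tensor, new_patch: int) -> torch.Tensor:
    """Resize a pretrained patch-embedding kernel to a larger patch size.

    w_small: (embed_dim, in_channels, p, p) conv kernel from the pretrained DiT.
    Returns a (embed_dim, in_channels, new_patch, new_patch) kernel. Bilinear
    resizing is a simple approximation of the pseudo-inverse initialization.
    """
    w_large = F.interpolate(w_small, size=(new_patch, new_patch),
                            mode="bilinear", align_corners=False)
    # Rescale so the projection's output stays roughly comparable after
    # summing over the larger patch area.
    return w_large * (w_small.shape[-1] / new_patch) ** 2

def resize_pos_embed(pos: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Interpolate (1, H*W, dim) positional embeddings to a new grid size."""
    n, dim = pos.shape[1], pos.shape[2]
    grid = int(n ** 0.5)
    pos_2d = pos.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)
    pos_2d = F.interpolate(pos_2d, size=(new_grid, new_grid),
                           mode="bilinear", align_corners=False)
    return pos_2d.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# Illustrative shapes: 4 latent channels, 1152-dim embeddings, 32x32 token grid.
w2 = torch.randn(1152, 4, 2, 2)
w4 = init_weak_patch_embed(w2, new_patch=4)    # (1152, 4, 4, 4)
pos = torch.randn(1, 32 * 32, 1152)
pos_weak = resize_pos_embed(pos, new_grid=16)  # (1, 256, 1152)
```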
Method 2: LoRA (Parameter Efficient Fine-Tuning)
In many cases, full fine-tuning is too expensive or the original dataset is unavailable. For this, the authors utilize Low-Rank Adaptation (LoRA).
Instead of retraining the massive DiT block, they freeze the original weights. They then add small, trainable LoRA matrices specifically for the new “weak” patch sizes.

During inference with a “weak” step (large patch), the LoRA adapters are active. During a “powerful” step (original patch size), the adapters are deactivated, ensuring the model retains its original, high-quality behavior exactly.
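A minimal sketch of how such a toggle could be wired up (an illustration, not the authors' implementation): the frozen pretrained projection is always applied, and the low-rank branch is added only during weak steps.

```python
import torch
import torch.nn as nn

class WeakStepLoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a LoRA branch used only for weak (large-patch) steps."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the pretrained weights frozen
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # adapters start as an exact no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, weak_mode: bool) -> torch.Tensor:
        out = self.base(x)
        if weak_mode:                         # adapters only touch weak steps
            out = out + self.scale * self.lora_b(self.lora_a(x))
        return out

layer = WeakStepLoRALinear(nn.Linear(1152, 1152))
x = torch.randn(2, 256, 1152)
y_powerful = layer(x, weak_mode=False)  # exactly the original pretrained behavior
y_weak = layer(x, weak_mode=True)       # adapted path for large-patch steps
```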
Knowledge Distillation
To make the training efficient, the authors use the original “powerful” model as a teacher. They train the “weak” model to mimic the predictions of the powerful model. This is defined by minimizing the distance between the noise prediction of the powerful model (\(\epsilon_{\theta}(\dots; p_{\text{powerful}})\)) and the weak model (\(\epsilon_{\theta}(\dots; p_{\text{weak}})\)):
\[ \mathbb{E}_{t, \mathbf{x}_t} \left\Vert \epsilon_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t; \boldsymbol{p}_{\mathrm{powerful}}) - \epsilon_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t; \boldsymbol{p}_{\mathrm{weak}}) \right\Vert_2 . \]
This ensures that the weak model learns to approximate the powerful model’s output as closely as possible, rather than learning from scratch.
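In code, a distillation step could look roughly like the sketch below; `model(x, t, cond, patch_size=...)` is a hypothetical interface for a flexified DiT, and the optimizer would only update the weak-mode parameters (e.g., the LoRA matrices).

```python
import torch
import torch.nn.functional as F

def distillation_step(model, x_t, t, cond, optimizer):
    """One training step matching the weak model's prediction to the powerful one's."""
    with torch.no_grad():                         # teacher: original powerful mode
        eps_powerful = model(x_t, t, cond, patch_size=2)

    eps_weak = model(x_t, t, cond, patch_size=4)  # student: weak (large-patch) mode
    loss = F.mse_loss(eps_weak, eps_powerful)     # squared error between the two predictions

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```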
Inference: The Dynamic Scheduler
Once the model is trained to handle multiple patch sizes, the inference strategy is straightforward.
We define a schedule where the first \(T_{\text{weak}}\) steps use the large patch size (low compute), and the remaining steps use the original small patch size (high compute).
- Weak Model: Patch size \(4 \times 4\). Each token covers 16 pixels, so the sequence is \(16\times\) shorter than the pixel grid. Fast.
- Powerful Model: Patch size \(2 \times 2\). Each token covers 4 pixels, so the sequence is only \(4\times\) shorter. Slow.
By simply adjusting \(T_{\text{weak}}\), we can slide the dial between maximum quality and maximum speed.
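A sketch of what such a scheduler might look like, together with a very rough compute estimate. The per-step cost of a weak step is an assumption here (somewhere between 1/16 and 1/4 of a powerful step, depending on the attention/MLP split), and `denoise_step` is a hypothetical wrapper around the model.

```python
def patch_schedule(num_steps: int, t_weak: int, p_weak: int = 4, p_powerful: int = 2):
    """Large patches for the first t_weak steps, small patches afterwards."""
    return [p_weak if i < t_weak else p_powerful for i in range(num_steps)]

def relative_cost(schedule, weak_step_cost: float = 0.25) -> float:
    """Rough compute fraction vs. an all-powerful schedule.

    weak_step_cost is an assumed per-step cost of a weak step relative to a
    powerful one; 0.25 is a conservative stand-in.
    """
    costs = [weak_step_cost if p == 4 else 1.0 for p in schedule]
    return sum(costs) / len(costs)

schedule = patch_schedule(num_steps=50, t_weak=25)
print(relative_cost(schedule))   # ~0.62 of baseline FLOPs under these assumptions

# Sampling loop sketch; `denoise_step` and `model` are hypothetical.
# x = initial_noise
# for t, p in zip(reversed(range(50)), schedule):
#     x = denoise_step(model, x, t, patch_size=p)
```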

The graph on the left (Figure 6) is the “money shot.” It shows the FID (Fréchet Inception Distance, where lower is better) relative to the percentage of FLOPs used.
- At 100% FLOPs (baseline), the FID is ~2.25.
- At ~60% FLOPs, the FID is still ~2.25.
This implies that roughly 40% of the computation in standard diffusion models is redundant and can be offloaded to a weaker, faster model without any perceptual loss in quality.
Experiments and Results
The authors tested FlexiDiT across class-conditioned generation (ImageNet), text-to-image (T2I), and video generation.
Text-to-Image Performance
For text-to-image models (like Emu or PIXART), the results hold up just as well. The authors plotted FID against CLIP scores (which measure how well the image matches the text prompt).

The results show that FlexiDiT models (the various colored points) lie along the same Pareto frontier as the fully compute-intensive baseline. In plain English: you can achieve the same alignment and image quality as the heavy model, but you get there much faster.
Video Generation: The Big Wins
Video generation is where the quadratic complexity of Transformers really hurts. Videos add a temporal dimension, causing the number of tokens to explode.
FlexiDiT applies the same logic here: use larger patches across space and time for the early steps.
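A quick token-count sketch (with an illustrative latent size and hypothetical spatiotemporal patch shapes) shows why the savings compound for video.

```python
def video_token_count(frames: int, height: int, width: int,
                      patch_t: int, patch_s: int) -> int:
    """Tokens for a video latent split into patch_t x patch_s x patch_s patches."""
    return (frames // patch_t) * (height // patch_s) * (width // patch_s)

# Illustrative latent: 32 frames of a 64x64 grid.
n_powerful = video_token_count(32, 64, 64, patch_t=1, patch_s=2)  # 32 * 32 * 32 = 32768
n_weak = video_token_count(32, 64, 64, patch_t=2, patch_s=4)      # 16 * 16 * 16 = 4096

# 8x fewer tokens -> up to ~64x cheaper attention for the early, coarse steps.
print(n_powerful, n_weak, n_powerful // n_weak)
```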

For video, the savings are massive. The model maintains its VBench quality score even when using only ~25% of the baseline compute. This is a potential game-changer for the economics of AI video generation.
It’s Not Just Theoretical FLOPs
A common criticism in efficiency research is that “theoretical FLOPs” don’t always translate to real-world latency (speed) due to memory bandwidth bottlenecks or hardware inefficiency.
The authors profiled FlexiDiT on an NVIDIA H100 GPU to verify the gains.

The graph above shows that FlexiDiT (specifically the weak model configurations) maintains high GPU utilization. Because self-attention creates a bottleneck for long sequences (small patches), shifting to larger patches actually utilizes the hardware more effectively, meaning the latency reductions are real and proportional to the FLOPs reduction.
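If you want to sanity-check this on your own hardware, a micro-benchmark along these lines (not the authors' profiling setup) compares a single self-attention layer at the two sequence lengths; exact numbers will depend on the GPU.

```python
import time
import torch

def attention_latency(seq_len: int, dim: int = 1152, heads: int = 16,
                      iters: int = 20, device: str = "cuda") -> float:
    """Average forward latency (seconds) of one self-attention layer at a given sequence length."""
    attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True).to(device).eval()
    x = torch.randn(1, seq_len, dim, device=device)
    with torch.no_grad():
        for _ in range(3):                    # warm-up iterations
            attn(x, x, x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            attn(x, x, x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if torch.cuda.is_available():
    print(attention_latency(1024))  # powerful mode: 2x2 patches on a 64x64 grid
    print(attention_latency(256))   # weak mode: 4x4 patches, 4x fewer tokens
```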
Conclusion
FlexiDiT revisits a fundamental assumption in diffusion models: that every step requires the same amount of “brain power.” By aligning the computational budget with the spectral nature of the denoising process—coarse compute for coarse features, fine compute for fine features—the authors unlocked significant efficiency gains.
Key Takeaways:
- Dynamic Patching: Changing patch sizes during inference allows for variable compute.
- No Quality Loss: You can cut compute by ~40% for images and ~75% for videos with virtually no visual degradation.
- Easy Integration: The LoRA-based method allows this to be applied to existing pre-trained models with minimal training data (~5000 images) and compute (<5% of original training cost).
As models continue to scale, techniques like FlexiDiT that optimize the inference phase will be critical in making high-fidelity generative AI accessible and affordable.