TinyFusion: How to Shrink Diffusion Transformers Without Losing Their Magic
If you have been following the generative AI space recently, you know that Diffusion Transformers (DiTs) are the current heavyweights. From OpenAI’s Sora to Stable Diffusion 3, replacing the traditional U-Net backbone with a Transformer architecture has unlocked incredible capabilities in image and video generation.
But there is a catch: these models are massive. They come with excessive parameter counts that make them slow and expensive to run in real-world applications. If you want to deploy a high-quality image generator on a mobile device or a standard consumer GPU, you are often out of luck.
The standard solution is pruning—removing parts of the neural network to make it smaller. But how do you decide what to cut? The conventional wisdom suggests removing the “least important” layers based on error metrics. However, a new paper titled “TinyFusion: Diffusion Transformers Learned Shallow” argues that this conventional wisdom is wrong for diffusion models.
In this post, we will dive deep into TinyFusion. We will explore why traditional pruning fails for DiTs, how the authors introduced a “learnable” pruning method that predicts future performance, and how they achieved a 2x speedup with barely any loss in image quality.
The Problem: Why “Smart” Pruning Often Fails
To make a model faster, we generally have two options:
- Width Pruning: Making the layers narrower (fewer channels/neurons).
- Depth Pruning: Removing entire layers from the network.
For Transformers, depth pruning is usually the better choice for speed. Because GPUs are massively parallel processors, they handle wide layers easily. However, a Transformer's layers must be processed sequentially: if you have 28 layers, the GPU has to finish Layer 1 before starting Layer 2. Therefore, cutting the number of layers in half theoretically doubles your inference speed.
The authors demonstrate this advantage clearly in the graph below. While width pruning (the blue line) struggles to gain speed, depth pruning (the red dashed line) offers a nearly linear speedup.

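To get an intuition for that near-linear relationship, here is a rough timing sketch in PyTorch. It simply compares a 28-layer stack against a 14-layer stack of identical Transformer blocks; the dimensions, batch shape, and repetition counts are my own placeholder choices, not the paper's benchmark setup.

```python
# Rough latency sketch (placeholder sizes, not the paper's benchmark):
# halving the number of sequential layers should roughly halve inference time.
import time
import torch
import torch.nn as nn

def make_stack(n_layers, dim=512):
    block = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    return nn.Sequential(*[block() for _ in range(n_layers)])

device = "cuda" if torch.cuda.is_available() else "cpu"
deep, shallow = make_stack(28).to(device).eval(), make_stack(14).to(device).eval()
x = torch.randn(4, 256, 512, device=device)   # (batch, tokens, channels)

@torch.no_grad()
def bench(model, reps=10):
    for _ in range(3):                         # warm-up
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(reps):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / reps

print(f"28 layers: {bench(deep)*1e3:.1f} ms/iter | 14 layers: {bench(shallow)*1e3:.1f} ms/iter")
```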
The Calibration Loss Paradox
So, depth pruning is the way to go. The challenge is deciding which layers to delete.
Most existing methods use a metric called Calibration Loss. The logic is simple: remove a layer, measure how much the error (loss) increases, keep the layers whose removal would hurt the most, and prune the ones whose removal barely matters. The goal is a pruned model that behaves as closely as possible to the original model right now.
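To make that concrete, here is a minimal sketch of this conventional calibration-based recipe (not TinyFusion). It assumes a generic model exposing a `blocks` list, plus a hypothetical calibration batch and loss function; the idea is just to score each layer by how much the loss rises when that layer is skipped.

```python
# Minimal sketch of the conventional calibration-loss recipe (not TinyFusion).
# `model.blocks`, `calib_batch`, and `loss_fn` are hypothetical placeholders.
import torch

@torch.no_grad()
def layer_sensitivity(model, calib_batch, loss_fn):
    base_loss = loss_fn(model(calib_batch))
    scores = []
    for i, blk in enumerate(model.blocks):
        original_forward = blk.forward
        blk.forward = lambda x, *args, **kwargs: x   # temporarily turn this layer into an identity
        scores.append((i, loss_fn(model(calib_batch)) - base_loss))
        blk.forward = original_forward               # restore the layer
    return sorted(scores, key=lambda s: s[1])        # smallest loss increase = "safest" to prune
```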
The researchers behind TinyFusion discovered a paradox: A pruned model with low initial error does not necessarily learn well during fine-tuning.
They ran an experiment where they randomly pruned a DiT model 100,000 different ways and then fine-tuned them. They found that models with the lowest starting error (Min. Loss) actually performed worse after fine-tuning than models that started with higher error.

As shown above, the models found by standard sensitivity analysis (which minimizes loss) result in mediocre final performance. The “Learnable” area—where TinyFusion operates—starts with higher loss but recovers much better.
The takeaway: We shouldn’t be looking for the model that knows the most after pruning. We should be looking for the model that learns the fastest during fine-tuning. This property is called Recoverability.
The Solution: TinyFusion
TinyFusion is a new framework that treats pruning not as a one-time calculation, but as a learnable process. Instead of using heuristic scores, the method trains the pruning selection itself.
The core idea is to simultaneously optimize two things:
- The Mask: Which layers to keep and which to drop.
- The Weights: A simulation of how the model would adapt if those layers were dropped.
1. Differentiable Sampling
The researchers view layer selection as a probability distribution. For every block of layers, there is a probability assigned to different pruning configurations (masks).
The problem is that “picking a layer” is a discrete decision (you either keep it or you don’t), which breaks the gradient flow needed for backpropagation. To solve this, they use Gumbel-Softmax sampling. This allows the network to “softly” sample masks during training, making the selection process differentiable.
\[
y = \text{one-hot}\left( \frac{\exp\big((g_i + \log p_i)/\tau\big)}{\sum_j \exp\big((g_j + \log p_j)/\tau\big)} \right)
\]

This equation essentially adds random Gumbel noise (\(g\)) to the log-probabilities (\(\log p\)) and uses a temperature parameter (\(\tau\)) to gradually transition from soft probabilities to hard decisions (0 or 1).
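Conveniently, PyTorch ships this trick as `torch.nn.functional.gumbel_softmax`. The snippet below is a minimal sketch of how a keep/drop decision over a block of two layers could be sampled differentiably; the candidate masks and the residual-gating comment are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of differentiable mask sampling with Gumbel-Softmax.
# The candidate masks are illustrative: each keeps 1 of 2 layers in a block.
import torch
import torch.nn.functional as F

candidate_masks = torch.tensor([[1., 0.],    # keep layer 0, drop layer 1
                                [0., 1.]])   # keep layer 1, drop layer 0

logits = torch.zeros(2, requires_grad=True)  # learnable preferences over the candidates (log p)
tau = 1.0                                    # temperature, annealed toward 0 during the search

y = F.gumbel_softmax(logits, tau=tau, hard=True)  # one-hot sample; gradients flow via the soft relaxation
layer_mask = y @ candidate_masks                  # e.g. tensor([1., 0.]): per-layer keep/drop bits

# layer_mask[i] would multiply layer i's residual branch, so a "dropped" layer
# becomes an identity mapping while the choice itself remains trainable.
```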
2. Recoverability Estimation with LoRA
Here is the genius part of TinyFusion. To know if a pruned model is “recoverable,” you normally have to fine-tune it for hours. You can’t do that for every training step of the pruning algorithm.
To get around this, the authors introduce a lightweight, co-optimized weight update. Instead of updating the full massive model to test recoverability, they use LoRA (Low-Rank Adaptation).

As shown in Figure 3 above, the system applies the mask (\(m_i\)) to the layer. Simultaneously, it applies a LoRA update (the orange blocks \(B\) and \(A\)) to the weights.
The optimization objective becomes:
\[
\min_{\{p(m_k)\}} \; \min_{\Delta\Phi} \; \mathbb{E}_{x,\,\{m_k \sim p(m_k)\}} \Big[ \mathcal{L}\big(x,\; \Phi + \Delta\Phi,\; \{m_k\}\big) \Big]
\]

In plain English: "Find the probability distribution of masks (\(p\)) such that, if we update the model slightly (\(\Delta\Phi\)), the loss is minimized."
This essentially simulates “future fine-tuning” inside the pruning loop. The model learns to favor masks that react well to weight updates.
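Here is a minimal sketch of that idea, using a single `nn.Linear` as a stand-in for a full DiT block. The frozen base weights play the role of \(\Phi\), the low-rank matrices \(B\) and \(A\) play the role of \(\Delta\Phi\), and the sampled mask bit gates the whole block; optimizing the mask logits and the LoRA matrices together is what realizes the min-min objective above. The shapes, rank, and residual gating are my own assumptions.

```python
# Minimal sketch: a frozen layer (Phi) with a LoRA update (Delta Phi = B @ A),
# gated by a sampled keep/drop bit. A single nn.Linear stands in for a DiT block.
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base                                  # frozen original weights (Phi)
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # B = 0, so Delta Phi starts at 0

    def forward(self, x):
        return self.base(x) + x @ (self.B @ self.A).T     # Phi x + (Delta Phi) x

def gated_block(x, layer: MaskedLoRALinear, mask_bit):
    # mask_bit is the differentiably sampled keep/drop decision for this block:
    # 0 reduces it to an identity mapping, 1 runs it with its LoRA update.
    return x + mask_bit * layer(x)
```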
3. The Workflow
The entire process works in two phases:
- Search (Training): The model runs with the learnable masks and LoRA adapters. Over time, the probability distribution shifts. Good configurations get higher probabilities; bad ones get dropped.
- Fine-tuning: Once the best layers are identified, the mask is fixed, the LoRA is discarded, and the resulting smaller model is fine-tuned properly.

In the visualization above (Figure 2), you can see the transition from “Mixed Sampling” (exploring different layer combinations) to “Confident Sampling” (settling on the optimal shallow architecture).
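Tying the pieces together, here is a toy end-to-end sketch of the search phase on a dummy stack of linear layers, followed by the hand-off to fine-tuning. It is a deliberate simplification (random data, an arbitrary target, one mask per layer) rather than the paper's DiT training code.

```python
# Toy end-to-end sketch of the search phase (a deliberate simplification with
# random data and an arbitrary target, not the paper's DiT training code).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_layers, rank = 32, 4, 4
layers = [nn.Linear(dim, dim) for _ in range(n_layers)]
for layer in layers:                        # original weights (Phi) stay frozen
    for p in layer.parameters():
        p.requires_grad_(False)

logits = nn.Parameter(torch.zeros(n_layers, 2))            # per layer: [drop, keep]
A = nn.Parameter(torch.randn(n_layers, rank, dim) * 0.01)  # LoRA factors (Delta Phi)
B = nn.Parameter(torch.zeros(n_layers, dim, rank))
opt = torch.optim.Adam([logits, A, B], lr=1e-2)

def forward(x, tau):
    for i, layer in enumerate(layers):
        keep = F.gumbel_softmax(logits[i], tau=tau, hard=True)[1]  # differentiable 0/1 gate
        lora = x @ (B[i] @ A[i]).T                                 # low-rank weight update
        x = x + keep * torch.relu(layer(x) + lora)                 # dropped layer = identity
    return x

search_steps = 200
for step in range(search_steps):                        # Phase 1: search with learnable masks
    tau = max(0.1, 1.0 - step / search_steps)           # anneal: mixed -> confident sampling
    x = torch.randn(16, dim)
    loss = F.mse_loss(forward(x, tau), x.flip(-1))      # stand-in for the diffusion loss
    opt.zero_grad()
    loss.backward()                                     # updates both mask logits and LoRA
    opt.step()

keep_layers = logits.argmax(dim=-1).bool()              # Phase 2: fix the mask, drop the LoRA,
print("layers kept:", keep_layers.tolist())             # then fine-tune the pruned model properly
```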
Turbocharging Recovery: Masked Knowledge Distillation
Once TinyFusion identifies the best layers to keep, the model is pruned. Now, it needs to be retrained to regain its original quality. This is standard procedure, usually done via Knowledge Distillation (KD)—where the small “student” model tries to mimic the large “teacher” model.
However, the authors ran into a problem specific to Diffusion Transformers: Massive Activations.
In large transformers, certain neurons sometimes fire with massive values (outliers). While the teacher model handles these fine, forcing a smaller, pruned student to mimic these exact massive values can destabilize training and cause the loss to explode.

In Figure 8 above, you can see the activation spikes. To fix this, the authors proposed Masked Representation KD.
They simply apply a threshold. If an activation value in the teacher or student is too large (an outlier), it is masked out (ignored) during the loss calculation.

This ensures the student focuses on learning the core structure of the data rather than chasing numerical anomalies, leading to significantly faster and more stable convergence.
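A minimal sketch of that masked distillation term is shown below, assuming the teacher's and student's hidden states are matched with an MSE loss. The threshold value and the loss weighting are illustrative placeholders; the paper's exact recipe may differ in the details.

```python
# Minimal sketch of masked representation distillation: match hidden states,
# but ignore positions where either model produces an outlier activation.
# The threshold value here is an illustrative placeholder.
import torch

def masked_repr_kd_loss(student_h, teacher_h, threshold=10.0):
    valid = (student_h.abs() < threshold) & (teacher_h.abs() < threshold)
    sq_err = (student_h - teacher_h) ** 2
    return (sq_err * valid).sum() / valid.sum().clamp(min=1)

# usage sketch: total_loss = denoising_loss + alpha * masked_repr_kd_loss(h_student, h_teacher)
```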
Experiments and Results
Does it work? The results are impressive.
The researchers tested TinyFusion on DiT-XL/2 (a standard Diffusion Transformer trained on ImageNet). They aimed to compress the 28-layer model down to 14 layers (50% pruning).
Quantitative Results

Looking at Table 1:
- Original DiT-XL/2: FID of 2.27 (lower is better).
- Existing Methods (ShortGPT, Flux-Lite): When pruned to 14 layers, their FID scores skyrocketed to over 20. They essentially broke the model.
- TinyFusion (TinyDiT-D14): Achieved an FID of 2.86.
This is a massive improvement. The TinyDiT model runs at 13.54 iterations per second (nearly 2x the speed of the original 6.91 it/s) while maintaining image quality that is visually comparable to the original.
Furthermore, TinyFusion achieved this with only 7% of the original pre-training cost.
Visualizing the Learning Process
It is fascinating to watch how the model “decides” which layers to keep. The graph below tracks the pruning decisions over training iterations.

- Bottom layers (Indices 0-3): The model quickly decides these are essential (solid lines).
- Middle layers: There is a period of exploration (fuzziness) where the model is unsure.
- Convergence: By step 10,000, the model makes hard decisions, effectively locking in the final architecture.
Qualitative Results
Numbers are great, but for image generation, we need to see the pictures. Here are samples generated by the pruned TinyDiT-D14.

The images are sharp, coherent, and indistinguishable from those generated by much larger models.
The method also generalizes well. The authors applied it to other architectures like SiT (Scalable Interpolant Transformers) and MAR (Masked Autoregressive models), achieving similar success.

Conclusion and Key Takeaways
TinyFusion represents a significant shift in how we think about compressing generative models.
- Don’t trust immediate loss: For diffusion models, a pruned model that looks “broken” initially might actually be the best candidate for fine-tuning.
- Make pruning learnable: By treating layer selection as a differentiable sampling problem, we can use gradient descent to find the optimal architecture.
- Simulate the future: Using LoRA during the search phase allows the pruning algorithm to “peek” into the future and see how well a configuration will recover.
As we move toward running complex AI models on laptops and phones, techniques like TinyFusion will be essential. They allow us to strip away the fat of these massive neural networks, leaving only the muscle required to create the magic.