Stop Averaging Your Data: How Optimal Transport Revolutionizes Dataset Distillation
In the current era of deep learning, we are witnessing a voracious appetite for data. Models like CLIP or modern Large Language Models consume millions, sometimes billions, of data points. This scale is effective, but it creates massive bottlenecks in storage and computation. Training a model from scratch on these massive datasets is becoming a privilege reserved for those with access to supercomputing clusters.
Enter Dataset Distillation (also known as Dataset Condensation). Ideally, this process acts like a “zip file” for training data. The goal is to compress a massive dataset (like ImageNet) into a tiny synthetic set containing only a few images per class. If done correctly, a model trained on this tiny synthetic set should perform almost as well as one trained on the original millions of images.
However, existing distillation methods have a fundamental flaw: they tend to “average out” the data, losing the rich geometric structure of the original images.
In this post, we dive into a CVPR paper titled “OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation.” We will explore how the researchers identified the “Homogeneous Distance” trap and how they used Optimal Transport (OT) theory to fix it, creating a plug-and-play module that boosts performance across the board.
The Problem: The Trap of Homogeneous Distance
To understand the innovation of OPTICAL, we first need to understand how dataset distillation typically works. Most modern methods fall into the category of Subset Synthesis. Instead of selecting real images, they generate new, synthetic pixel patterns.
The optimization process usually looks like this: you have your large real dataset (\(\mathcal{T}\)) and your tiny synthetic dataset (\(\mathcal{S}\)). You define a mathematical function to measure the “distance” or difference between them, and then you update the synthetic pixels to minimize this distance.
The standard objective function often looks like this:

$$\mathcal{S}^{*} = \underset{\mathcal{S}}{\arg\min}\; \mathbf{D}(\mathcal{S}, \mathcal{T})$$
This seems straightforward, but there is a hidden issue in how this distance (\(\mathbf{D}\)) is usually calculated. Most methods treat every real image as having an equal contribution to the synthetic images. They minimize a “homogeneous distance.”
Mathematically, the gradient (the signal telling the synthetic data how to change) usually sums up the errors uniformly, roughly (with \(f\) the feature extractor and \(d\) a pairwise distance):

$$\nabla_{\mathcal{S}}\, \mathbf{D}(\mathcal{S}, \mathcal{T}) \;\propto\; \frac{1}{|\mathcal{T}|} \frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{T}|} \sum_{j=1}^{|\mathcal{S}|} \nabla_{\mathcal{S}}\, d\big(f(\mathbf{t}_i), f(\mathbf{s}_j)\big)$$
Notice the terms \(1/|\mathcal{S}|\) and \(1/|\mathcal{T}|\). These fractions imply that every single real image contributes equally to shaping the synthetic data.
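To make this concrete, here is a minimal sketch of a distribution-matching-style loss in which every real image receives the same \(1/|\mathcal{T}|\) weight and every synthetic image the same \(1/|\mathcal{S}|\) weight. It illustrates the homogeneous objective in general, not any specific method's code; the feature extractor and tensor shapes are assumptions.

```python
import torch

def homogeneous_distance(real_feats: torch.Tensor, syn_feats: torch.Tensor) -> torch.Tensor:
    """Uniform-average matching: every real feature gets weight 1/|T| and
    every synthetic feature weight 1/|S|, so the synthetic set is pulled
    toward the mean of the real features.

    real_feats: (|T|, d) features of real images for one class
    syn_feats:  (|S|, d) features of synthetic images for the same class
    """
    real_mean = real_feats.mean(dim=0)  # (1/|T|) * sum_i f(t_i)
    syn_mean = syn_feats.mean(dim=0)    # (1/|S|) * sum_j f(s_j)
    return ((real_mean - syn_mean) ** 2).sum()
```

Minimizing this loss drives the synthetic features toward a single class mean, which is exactly the collapse described next.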
Why is Equal Contribution Bad?
Real datasets are messy. Within a single class (e.g., “Dog”), you might have Golden Retrievers, Pugs, and Huskies. Some are easy to classify; others are outliers. If you force your synthetic data to match the average of all these real images equally, you lose the specific geometric structure of the distribution.
The authors illustrate this beautifully in the figure below.

In (b) and (d) above, notice how the synthetic data (blue dots) cluster tightly around the center. It doesn’t matter if the real data (gray) was a wide ring or a dense cluster; the synthetic result looks nearly identical because it just gravitated toward the average center. This is the homogeneous distance minimization trap. The synthetic data fails to capture the diversity and intra-class variance of the real world.
The Solution: OPTICAL
The researchers propose a new framework called OPTICAL (OPTImal transport for Contribution ALlocation).
The core idea is simple but powerful: Not all real images should contribute equally to every synthetic image. Instead of a uniform 1-to-1 matching, we need a dynamic system that decides which real images are most relevant to a specific synthetic image and allocates “contribution” accordingly.
They reformulate the distance minimization into a bi-level optimization problem consisting of two steps: Matching and Approximating.

As shown in the pipeline above, the process works as follows:
- Project data into a feature space.
- Match real and synthetic data using Optimal Transport to create a “Contribution Matrix.”
- Approximate (update) the synthetic data using this matrix to minimize the distance.
Step 1: The Contribution Matrix
The researchers rewrite the distance equation to include a weighting matrix, \(\mathbf{P}\):

$$\mathbf{D}(\mathcal{S}, \mathcal{T}) = \sum_{i=1}^{|\mathcal{T}|} \sum_{j=1}^{|\mathcal{S}|} \mathbf{P}_{ij}\, d\big(f(\mathbf{t}_i), f(\mathbf{s}_j)\big)$$
Here, \(\mathbf{P}_{ij}\) represents how much the \(i\)-th real image contributes to the \(j\)-th synthetic image. This is no longer a uniform average. If a synthetic image resembles a specific subset of real images (e.g., the “Husky” subset of dogs), \(\mathbf{P}\) will assign higher weights to those pairs and lower weights to others.
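As a hedged sketch (the exact pairwise distance and normalization in the paper may differ), the reweighted loss could be computed like this, with \(\mathbf{P}\) supplied by the matching step described next:

```python
import torch

def weighted_distance(real_feats: torch.Tensor,
                      syn_feats: torch.Tensor,
                      P: torch.Tensor) -> torch.Tensor:
    """Contribution-weighted matching loss.

    real_feats: (n_real, d) features of real images
    syn_feats:  (n_syn, d)  features of synthetic images
    P:          (n_real, n_syn) contribution matrix; P[i, j] says how much
                real image i should shape synthetic image j.
    """
    # Pairwise squared Euclidean distances between real and synthetic features.
    pairwise = torch.cdist(real_feats, syn_feats).pow(2)  # (n_real, n_syn)
    # Weight each pair by its contribution instead of a uniform average.
    return (P * pairwise).sum()
```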
Step 2: Optimal Transport and Sinkhorn
How do we find the perfect matrix \(\mathbf{P}\)? This is where Optimal Transport (OT) comes in. OT is a mathematical framework used to determine the most efficient way to move “mass” from one distribution to another. In this context, we are calculating the cost of moving the distribution of synthetic data to cover the distribution of real data.
The optimization problem for finding \(\mathbf{P}\) looks like this:

$$\mathbf{P}^{\lambda} = \underset{\mathbf{P} \in \Pi(\mathbf{a},\, \mathbf{b})}{\arg\min}\; \langle \mathbf{P}, \mathbf{C}\rangle - \lambda H(\mathbf{P})$$

where \(\Pi(\mathbf{a}, \mathbf{b})\) is the set of non-negative matrices whose rows and columns sum to the marginals \(\mathbf{a}\) and \(\mathbf{b}\).
The term \(\mathbf{C}\) is the Cost Matrix, measuring the difference between pairs, and \(H(\mathbf{P})\) is an entropy regularization term.
The Sinkhorn Algorithm
Solving exact Optimal Transport is computationally expensive (\(O(N^3)\)), which would make training impossibly slow. To solve this, the authors use the Sinkhorn algorithm, which provides a fast, iterative approximation.
They initialize a kernel matrix \(\mathbf{K} = e^{-\mathbf{C}/\lambda}\) and then iteratively rescale its rows and columns so that they sum to the prescribed marginal weights of the real and synthetic images:

$$\mathbf{K}^{t+1} = \operatorname{normalize}_{\text{cols}}\!\Big(\operatorname{normalize}_{\text{rows}}\!\big(\mathbf{K}^{t}\big)\Big)$$
This iterative process is highly efficient on GPUs. After a fixed number of iterations (\(T\)), the resulting matrix \(\mathbf{K}^T\) becomes our contribution matrix \(\mathbf{P}^\lambda\).
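In code, the Sinkhorn iteration is only a few lines. The sketch below assumes uniform marginals over the real and synthetic samples and a fixed iteration count; the authors' exact hyperparameters and scaling scheme may differ.

```python
import torch

def sinkhorn(C: torch.Tensor, lam: float = 0.05, n_iters: int = 20) -> torch.Tensor:
    """Approximate the entropy-regularized OT plan by alternating
    row/column rescaling of the kernel matrix K = exp(-C / lam).

    C:       (n_real, n_syn) cost matrix (low cost = high relevance)
    lam:     entropy regularization strength (smaller = sharper matching)
    returns: (n_real, n_syn) contribution matrix P whose row and column
             sums match the uniform marginals a and b.
    """
    n_real, n_syn = C.shape
    a = torch.full((n_real,), 1.0 / n_real, device=C.device)  # real marginal
    b = torch.full((n_syn,), 1.0 / n_syn, device=C.device)    # synthetic marginal

    K = torch.exp(-C / lam)                  # kernel matrix from the cost
    u = torch.ones_like(a)
    v = torch.ones_like(b)
    for _ in range(n_iters):                 # alternate the two normalizations
        u = a / (K @ v)                      # enforce the row marginals
        v = b / (K.t() @ u)                  # enforce the column marginals
    return torch.diag(u) @ K @ torch.diag(v)  # transport plan P
```

Because every operation is a dense matrix-vector product, the whole loop stays on the GPU and adds little overhead per distillation step.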
Step 3: The Hilbert Space
One final technical innovation in OPTICAL is the “ruler” used to measure distance. Simple Euclidean distance or Cosine similarity often misses high-order structural nuances in complex data.
To fix this, the authors project the data representations into a Reproducing Kernel Hilbert Space (RKHS). They use Gaussian kernels to measure similarity:

$$k_{\sigma_k}(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^{2}}{2\sigma_k^{2}}\right)$$
By combining multiple kernels with different scales (\(\sigma_k\)), they compute a Relevance Matrix (\(\mathbf{R}\)), e.g. by averaging the kernel values over the \(K\) scales:

$$\mathbf{R}_{ij} = \frac{1}{K}\sum_{k=1}^{K} k_{\sigma_k}\big(f(\mathbf{t}_i), f(\mathbf{s}_j)\big)$$
The Cost Matrix (\(\mathbf{C}\)) used in the Optimal Transport step is then derived simply as \(\mathbf{J} - \mathbf{R}\) (where \(\mathbf{J}\) is a matrix of ones). This means that pairs with high relevance (similarity) have a low transport cost.
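A possible implementation of the multi-scale relevance matrix and the resulting cost matrix might look like the following; the specific bandwidths \(\sigma_k\) and the choice to average the kernels are illustrative assumptions, not values from the paper.

```python
import torch

def relevance_and_cost(real_feats: torch.Tensor,
                       syn_feats: torch.Tensor,
                       sigmas=(0.5, 1.0, 2.0, 4.0)):
    """Multi-scale Gaussian-kernel relevance R and transport cost C = J - R.

    real_feats: (n_real, d) features, syn_feats: (n_syn, d) features.
    Each kernel exp(-||x - y||^2 / (2 * sigma^2)) captures similarity at a
    different scale; averaging keeps R in [0, 1], so C = J - R stays
    non-negative.
    """
    sq_dists = torch.cdist(real_feats, syn_feats).pow(2)   # (n_real, n_syn)
    kernels = [torch.exp(-sq_dists / (2.0 * s ** 2)) for s in sigmas]
    R = torch.stack(kernels, dim=0).mean(dim=0)            # relevance matrix
    C = torch.ones_like(R) - R                              # cost: J - R
    return R, C
```

Putting the pieces together, one distillation step in the match-then-approximate loop could look like the schematic below, reusing the `sinkhorn` and `weighted_distance` sketches above; `feature_net`, `real_batch`, `syn_images`, and `optimizer` are assumed to be the feature extractor, a batch of real images, the learnable synthetic pixels, and their optimizer from whichever base method OPTICAL is plugged into.

```python
# Schematic outer step: match (compute P), then approximate (update S).
real_feats = feature_net(real_batch).detach()       # project real data
syn_feats = feature_net(syn_images)                 # project synthetic data

with torch.no_grad():                               # matching step
    _, C = relevance_and_cost(real_feats, syn_feats)
    P = sinkhorn(C)

loss = weighted_distance(real_feats, syn_feats, P)  # approximating step
optimizer.zero_grad()
loss.backward()
optimizer.step()                                    # updates the synthetic pixels
```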
Experimental Results
The OPTICAL framework is designed to be plug-and-play. It can be added to almost any existing dataset distillation method, whether it’s optimization-oriented (like DC or DREAM) or distribution-matching-based (like DM or M3D).
1. Performance Boost Across Datasets
The authors tested OPTICAL on datasets ranging from MNIST to ImageNet. The results were consistent: adding OPTICAL improved test accuracy, often significantly.

In Table 2, look at the rows for DM (Distribution Matching). On CIFAR-100 (IPC=50), accuracy jumps from 43.6% to 44.5%. That may seem small, but in the world of dataset distillation, consistently beating the baseline across all settings is a strong signal of robustness. The gains are even more pronounced for the IDM method, which sees a 4.3% jump on CIFAR-10 (IPC=1).
2. Faster and Better Convergence
Does the dynamic contribution allocation actually help the training process?

Figure 3 shows the test accuracy over training steps. The red line (M3D + OPTICAL) consistently sits above the green line (Baseline M3D).
- Faster Start: OPTICAL reaches usable accuracy levels much earlier in training.
- Higher Ceiling: It avoids the plateau that the baseline hits, indicating that the synthetic data captures more useful information rather than collapsing toward a mean.
3. Cross-Architecture Generalization
A major critique of dataset distillation is that synthetic data often overfits the specific neural network architecture used to generate it. If you generate data using a ConvNet, it might perform poorly when training a ResNet.
OPTICAL helps alleviate this issue significantly. By capturing the underlying geometric structure of the data rather than just matching gradients or mean statistics, the synthetic data becomes more “universal.”

Table 6 illustrates this cross-architecture robustness. When synthetic data is generated using a simple ConvNet-3 but evaluated on a DenseNet-121:
- The baseline DM method achieves 39.0% accuracy.
- DM + OPTICAL achieves 41.9% accuracy.
- The DANCE method sees a massive jump from 64.5% to 66.8% with OPTICAL.
This suggests that OPTICAL produces synthetic data that contains genuine, transferable features rather than architecture-specific artifacts.
Why This Matters
The “Homogeneous Distance Minimization” problem is a subtle but pervasive issue in data synthesis. By treating every data point as an equal contributor, we inadvertently smooth out the very textures and irregularities that make deep learning models robust.
OPTICAL demonstrates that we don’t need to invent entirely new distillation paradigms to fix this. By borrowing Optimal Transport theory—specifically the concept of calculating the most efficient way to map one distribution to another—we can make existing methods smarter.
Key Takeaways:
- No More Averaging: Synthetic data should not just be the average of real data. It needs to reflect the geometric structure.
- Dynamic Allocation: Using Optimal Transport allows the system to dynamically decide which real images should influence which synthetic images.
- Efficiency: Thanks to the Sinkhorn algorithm, this complex math can be integrated into training loops with minimal computational overhead (adding only milliseconds per iteration).
- Versatility: It works on top of existing methods, improving results for low-resolution and high-resolution datasets alike.
As datasets continue to grow, techniques like OPTICAL will be essential for making AI more sustainable, accessible, and efficient.