Introduction: The Challenge of Aligning Complex Data Distributions
Imagine you have two collections of images: a set of blurry photos and a set of sharp, high-resolution ones. How would you teach a model to transform any blurry photo into a realistic sharp version? Or consider translating summer landscapes into winter scenes. These are examples of a fundamental challenge in modern machine learning: finding a meaningful way to map one complex probability distribution to another.
Optimal Transport (OT) provides a rigorous mathematical framework for this task. OT seeks the most efficient mapping from one distribution to another, minimizing a given transportation cost. While powerful, standard OT produces a single, deterministic mapping. For ill-posed problems like image super-resolution, many plausible high-resolution outputs exist for the same low-resolution input. We need stochastic mappings: one-to-many transformations that produce diverse yet realistic outputs.
Entropic Optimal Transport (EOT) solves this by adding randomness (entropy) to the OT problem, controlled by a regularization parameter \( \varepsilon \). Larger \( \varepsilon \) increases diversity; small \( \varepsilon \) approaches deterministic OT. The trouble is, most existing EOT algorithms become unstable or impractical when \( \varepsilon \) is small—the very regime most useful for high-quality, controlled generation.
The NeurIPS 2023 paper “Entropic Neural Optimal Transport via Diffusion Processes” introduces ENOT, a robust, end-to-end neural solver for EOT that remains stable even for small \( \varepsilon \). By reframing EOT through its connection to the Schrödinger Bridge problem from statistical physics, the authors design an elegant saddle-point optimization scheme that works at scale and achieves state-of-the-art results on both synthetic and large-scale image tasks.
Background: OT, EOT, and Schrödinger’s Bridge
Optimal Transport — Moving Probability Mass Efficiently
OT formalizes the problem of moving “stuff” (probability mass) from one distribution \( \mathbb{P}_0 \) to another \( \mathbb{P}_1 \) while minimizing a cost. The Kantorovich formulation with quadratic cost is:
Equation (1): The OT objective minimizes the average squared distance to move mass from \( x \) to \( y \), over all transport plans with the correct marginals.
Here, \( \Pi(\mathbb{P}_0, \mathbb{P}_1) \) is the set of possible transport plans \( \pi(x,y) \) with marginals \( \mathbb{P}_0 \) and \( \mathbb{P}_1 \). The optimal plan \( \pi^* \) tells us how to transfer mass from every \( x \) to every \( y \).
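For concreteness, the Kantorovich objective described in Equation (1) can be written as follows (a standard reconstruction from the description above, using the quadratic cost \( \tfrac{1}{2}\|x-y\|^2 \); the paper's exact scaling may differ):

\[
\inf_{\pi \in \Pi(\mathbb{P}_0, \mathbb{P}_1)} \int \frac{\|x - y\|^2}{2} \, \mathrm{d}\pi(x, y).
\]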
Entropic OT — Encouraging Stochasticity
EOT modifies OT by adding an entropy term, making the plan “spread out”:
Equations (2)–(3): Two equivalent ways to regularize OT with entropy or KL divergence, controlled by \( \varepsilon > 0 \).
Here, the \( -\varepsilon H(\pi) \) term encourages randomness: larger \( \varepsilon \) yields a more diverse mapping, while small \( \varepsilon \) gives plans close to deterministic OT but is numerically challenging for most existing solvers.
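In the same notation, the entropy-regularized objective of Equations (2)–(3) can be sketched as (my reconstruction, with \( H(\pi) \) the differential entropy of the plan):

\[
\inf_{\pi \in \Pi(\mathbb{P}_0, \mathbb{P}_1)} \int \frac{\|x - y\|^2}{2} \, \mathrm{d}\pi(x, y) \;-\; \varepsilon H(\pi),
\]

which, because the marginals of \( \pi \) are fixed, is equivalent up to an additive constant to penalizing \( \varepsilon \, \mathrm{KL}(\pi \,\|\, \mathbb{P}_0 \times \mathbb{P}_1) \) instead of \( -\varepsilon H(\pi) \).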
Schrödinger’s Bridge — Finding the Most Likely Path
Schrödinger’s Bridge (SB) considers the entire path from \( \mathbb{P}_0 \) at \( t=0 \) to \( \mathbb{P}_1 \) at \( t=1 \) under a stochastic process. The “prior” process is often Brownian motion (Wiener process):
Equation (4): The prior diffusion with variance \( \varepsilon \).
SB asks: among all stochastic processes matching the desired start and end distributions, which one is closest (in KL divergence) to this simple prior?
Equation (5): SB minimizes the KL divergence between candidate processes and the Wiener prior.
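Spelled out, Equations (4)–(5) look roughly like this (reconstructed from the description; here \( W^{\varepsilon} \) denotes the law of the Wiener prior started from \( \mathbb{P}_0 \), and \( \mathcal{F}(\mathbb{P}_0, \mathbb{P}_1) \) the set of processes with the prescribed marginals at \( t=0 \) and \( t=1 \)):

\[
\mathrm{d}X_t = \sqrt{\varepsilon}\, \mathrm{d}W_t,
\qquad
\inf_{T \in \mathcal{F}(\mathbb{P}_0, \mathbb{P}_1)} \mathrm{KL}\left( T \,\|\, W^{\varepsilon} \right).
\]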
The Crucial Link — SB as EOT
A key insight: SB and EOT are equivalent. The KL divergence between processes can be decomposed:
Equation (6): The KL divergence splits into a divergence between the joint start-end distributions plus an expected conditional divergence along the paths in between.
For the optimal SB process \( T^* \), the conditional term vanishes because the optimal process shares the prior's Brownian bridges between its endpoints, leaving only the KL divergence between the joint start-end distribution and the prior's joint distribution:
Equation (8): Optimizing over processes reduces to optimizing over start-end joint distributions—exactly the EOT problem.
Thus, solving SB yields the EOT plan \( \pi^* \). In the dynamic SB form, the optimal process is a diffusion with drift \( f(X_t, t) \), minimizing expected drift energy:
Equation (11): Energy minimization form of SB.
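Schematically, the chain of reductions in Equations (6)–(11) reads as follows (my notation: \( \pi^{T} \) is the joint law of \( (X_0, X_1) \) under the process \( T \), and \( T_{|x,y} \) is the process conditioned on its endpoints):

\[
\mathrm{KL}(T \,\|\, W^{\varepsilon})
= \mathrm{KL}\big(\pi^{T} \,\|\, \pi^{W^{\varepsilon}}\big)
+ \mathbb{E}_{(x,y) \sim \pi^{T}} \, \mathrm{KL}\big(T_{|x,y} \,\|\, W^{\varepsilon}_{|x,y}\big),
\]

and for a diffusion \( \mathrm{d}X_t = f(X_t, t)\,\mathrm{d}t + \sqrt{\varepsilon}\,\mathrm{d}W_t \) started at \( \mathbb{P}_0 \), Girsanov's theorem turns the KL objective into a drift-energy objective, so the dynamic problem is, up to a constant factor,

\[
\inf_{f \,:\, X_1 \sim \mathbb{P}_1} \; \mathbb{E}\left[ \int_0^1 \tfrac{1}{2} \|f(X_t, t)\|^2 \, \mathrm{d}t \right].
\]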
The ENOT Method — Saddle-Point Reformulation
Removing Hard Constraints via Lagrangian Relaxation
Directly enforcing the terminal constraint \( \pi_1^{T_f} = \mathbb{P}_1 \), i.e., that the process \( T_f \) actually ends at the target distribution, is hard. ENOT introduces a Lagrangian-like functional:
Equation (12): Objective with potential \( \beta(y) \) as Lagrange multiplier for matching the final marginal.
Here:
- Drift network \( f_\theta \): defines the process to minimize \( \mathcal{L} \), reducing drift energy and generating final samples that score high under \( \beta \).
- Potential network \( \beta_\phi \): scores samples to maximize \( \mathcal{L} \), pushing generated samples closer to the target distribution.
This adversarial setup forms a saddle-point problem:
Equation (13): Maximizing over potentials, minimizing over drifts solves relaxed SB, hence EOT.
The marginal constraint emerges implicitly at equilibrium, avoiding expensive enforcement. All terms are estimable from samples, enabling stochastic gradient training.
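Consistent with the update rules described in the next subsection, the relaxed objective of Equations (12)–(13) can be sketched as (a schematic reconstruction, not a verbatim quote of the paper):

\[
\mathcal{L}(\beta, f)
= \mathbb{E}_{T_f}\left[ \int_0^1 \tfrac{1}{2}\|f(X_t, t)\|^2 \, \mathrm{d}t - \beta(X_1) \right]
+ \mathbb{E}_{y \sim \mathbb{P}_1}\left[ \beta(y) \right],
\qquad
\sup_{\beta} \, \inf_{f} \, \mathcal{L}(\beta, f).
\]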
Practical Algorithm — Entropic Neural OT
The ENOT algorithm alternates updates:
Critic update (\( \beta_\phi \)):
- Sample \( X_0 \sim \mathbb{P}_0 \), simulate process to get \( X_1 \).
- Sample \( Y \sim \mathbb{P}_1 \).
- Update \( \beta_\phi \) to maximize \( \frac{1}{|Y|}\sum\beta(Y) - \frac{1}{|X_1|}\sum\beta(X_1) \).
Generator update (\( f_\theta \)):
- Sample new \( X_0 \).
- Simulate \( X_t \) and drifts \( f_\theta(X_t, t) \).
- Minimize energy term (mean squared drift) plus adversarial term \(-\frac{1}{|X_1|}\sum\beta(X_1)\).
This repeats until convergence. The learned \( f_\theta \) defines both SB and EOT solutions.
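Below is a minimal PyTorch-style sketch of this alternating loop. It follows the description above rather than the authors' released code; the components `drift_net` (for \( f_\theta \)), `potential_net` (for \( \beta_\phi \)), the samplers `sample_p0`, `sample_p1`, the Euler-Maruyama discretization, and all hyperparameters are illustrative assumptions.

```python
import torch

def simulate(drift_net, x0, eps, n_steps=10):
    """Euler-Maruyama rollout of dX_t = f_theta(X_t, t) dt + sqrt(eps) dW_t on [0, 1].

    Returns the final state X_1 and an estimate of the mean drift energy
    (1/2) * integral of ||f||^2 dt, averaged over the batch.
    """
    dt = 1.0 / n_steps
    x = x0
    energy = x0.new_zeros(())
    for k in range(n_steps):
        t = x0.new_full((x.shape[0], 1), k * dt)   # current time, one value per sample
        f = drift_net(x, t)                        # drift f_theta(X_t, t)
        energy = energy + 0.5 * (f ** 2).flatten(1).sum(dim=1).mean() * dt
        x = x + f * dt + (eps * dt) ** 0.5 * torch.randn_like(x)
    return x, energy


def enot_step(drift_net, potential_net, opt_f, opt_beta, sample_p0, sample_p1, eps):
    """One alternating update: critic (potential) step, then generator (drift) step."""
    # Critic update: maximize mean beta(Y) - mean beta(X_1),
    # implemented by minimizing the negated objective below.
    with torch.no_grad():
        x1_fake, _ = simulate(drift_net, sample_p0(), eps)
    y = sample_p1()
    beta_loss = potential_net(x1_fake).mean() - potential_net(y).mean()
    opt_beta.zero_grad()
    beta_loss.backward()
    opt_beta.step()

    # Generator update: minimize drift energy minus mean beta(X_1).
    x1, energy = simulate(drift_net, sample_p0(), eps)
    f_loss = energy - potential_net(x1).mean()
    opt_f.zero_grad()
    f_loss.backward()
    opt_f.step()
```

At inference time, one simulates the learned SDE from a new input \( X_0 \); re-running with fresh noise yields diverse outputs, with the spread controlled by \( \varepsilon \).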
Experiments
Toy 2D Example — Gaussian to 8 Gaussians
Figure 2: ENOT mappings for \( \varepsilon = 0, 0.01, 0.1 \). Small \( \varepsilon \) gives nearly straight, deterministic paths; larger \( \varepsilon \) gives more stochastic trajectories.
High-Dimensional Gaussians — Quantitative Evaluation
With Gaussian \( \mathbb{P}_0, \mathbb{P}_1 \), closed-form EOT/SB solutions exist. ENOT beats baselines in matching the target distribution and recovering the true plan:
Dim | ENOT | LSOT | SCONES | MLE-SB | DiffSB | FB-SDE-A | FB-SDE-J |
---|---|---|---|---|---|---|---|
2 | 0.01 | 1.82 | 1.74 | 0.41 | 0.70 | 0.87 | 0.03 |
16 | 0.09 | 6.42 | 1.87 | 0.50 | 1.11 | 0.94 | 0.05 |
64 | 0.23 | 32.18 | 6.27 | 1.16 | 1.98 | 1.85 | 0.19 |
128 | 0.50 | 64.32 | 6.88 | 2.13 | 2.20 | 1.95 | 0.39 |
Table 1: Error in matching marginals (lower is better).
Dim | ENOT | LSOT | SCONES | MLE-SB | DiffSB | FB-SDE-A | FB-SDE-J |
---|---|---|---|---|---|---|---|
2 | 0.012 | 6.77 | 0.92 | 0.30 | 0.88 | 0.75 | 0.07 |
16 | 0.05 | 14.56 | 1.36 | 0.90 | 1.70 | 1.36 | 0.22 |
64 | 0.13 | 25.56 | 4.62 | 1.34 | 2.32 | 2.45 | 0.34 |
128 | 0.29 | 47.11 | 5.33 | 1.80 | 2.43 | 2.64 | 0.58 |
Table 2: Error in recovering the true plan (lower is better).
The performance gap over the baselines widens as the dimension grows.
Colored MNIST — Diversity Control
Figure 3: ENOT delivers sharp, diverse “3”s from “2”s even at small \( \varepsilon \).
ENOT achieves FID = 6.28 at \( \varepsilon = 1.0 \), outperforming SCONES (14.73) and DiffSB (93), both of which degrade badly at low \( \varepsilon \).
CelebA Faces — Unpaired Super-Resolution
Figure 4: ENOT preserves identity while adding realistic detail; SCONES with \( \varepsilon=100 \) loses input structure.
ENOT scores FID = 3.78 at \( \varepsilon = 0 \) and 7.63 at \( \varepsilon = 1 \), versus 14.8 for SCONES. Output diversity increases with \( \varepsilon \), traded off against fidelity to the input.
Figure 1 shows progressive deblurring via ENOT’s learned process:
Figure 1: Trajectories of samples for CelebA deblurring with \( \varepsilon=0,1,10 \).
Conclusion & Implications
The ENOT framework brings:
- A new viewpoint: Casting EOT as a dynamic Schrödinger Bridge problem enables efficient, sample-based saddle-point optimization.
- Stability for small \( \varepsilon \): Crucial for practical, high-fidelity mappings with controlled diversity.
- State-of-the-art results: Superior performance in high dimensions and complex image-to-image translations.
Beyond super-resolution and digit translation, ENOT’s tunable stochastic mappings could impact style transfer, domain adaptation, and any application requiring flexible yet faithful distribution alignment. By uniting physics-inspired theory with neural optimization, this work advances both the methodology and the scope of entropic optimal transport.