Introduction: The Challenge of Aligning Complex Data Distributions
Imagine you have two collections of images: a set of blurry photos and a set of sharp, high-resolution ones. How would you teach a model to transform any blurry photo into a realistic sharp version? Or consider translating summer landscapes into winter scenes. These are examples of a fundamental challenge in modern machine learning: finding a meaningful way to map one complex probability distribution to another.
Optimal Transport (OT) provides a rigorous mathematical framework for this task. OT seeks the most efficient mapping from one distribution to another, minimizing a given transportation cost. While powerful, standard OT produces a single, deterministic mapping. For ill-posed problems like image super-resolution, many plausible high-resolution outputs exist for the same low-resolution input. We need stochastic mappings: one-to-many transformations that produce diverse yet realistic outputs.
Entropic Optimal Transport (EOT) solves this by adding randomness (entropy) to the OT problem, controlled by a regularization parameter \( \varepsilon \). Larger \( \varepsilon \) increases diversity; small \( \varepsilon \) approaches deterministic OT. The trouble is, most existing EOT algorithms become unstable or impractical when \( \varepsilon \) is small—the very regime most useful for high-quality, controlled generation.
The NeurIPS 2023 paper “Entropic Neural Optimal Transport via Diffusion Processes” introduces ENOT, a robust, end-to-end neural solver for EOT that remains stable even for small \( \varepsilon \). By reframing EOT through its connection to the Schrödinger Bridge problem from statistical physics, the authors design an elegant saddle-point optimization scheme that works at scale and achieves state-of-the-art results on both synthetic and large-scale image tasks.
Background: OT, EOT, and Schrödinger’s Bridge
Optimal Transport — Moving Probability Mass Efficiently
OT formalizes the problem of moving “stuff” (probability mass) from one distribution \( \mathbb{P}_0 \) to another \( \mathbb{P}_1 \) while minimizing a cost. The Kantorovich formulation with quadratic cost is:
Equation (1): The OT objective minimizes the average squared distance to move mass from \( x \) to \( y \), over all transport plans with the correct marginals.
Here, \( \Pi(\mathbb{P}_0, \mathbb{P}_1) \) is the set of possible transport plans \( \pi(x,y) \) with marginals \( \mathbb{P}_0 \) and \( \mathbb{P}_1 \). The optimal plan \( \pi^* \) tells us how to transfer mass from every \( x \) to every \( y \).
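For concreteness, the Kantorovich objective described in Equation (1) can be written as follows (a standard reconstruction from the description above, using the quadratic cost \( \tfrac{1}{2}\|x-y\|^2 \); the paper's exact scaling may differ):

\[
\inf_{\pi \in \Pi(\mathbb{P}_0, \mathbb{P}_1)} \int \frac{\|x - y\|^2}{2} \, \mathrm{d}\pi(x, y).
\]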
Entropic OT — Encouraging Stochasticity
EOT modifies OT by adding an entropy term, making the plan “spread out”:
Equations (2)–(3): Two equivalent ways to regularize OT with entropy or KL divergence, controlled by \( \varepsilon > 0 \).
Here, the \( -\varepsilon H(\pi) \) term encourages randomness: larger \( \varepsilon \) yields a more diverse mapping, while small \( \varepsilon \) gives plans close to deterministic OT but is numerically challenging for most existing solvers.
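In the same notation, the entropy-regularized objective of Equations (2)–(3) can be sketched as (my reconstruction, with \( H(\pi) \) the differential entropy of the plan):

\[
\inf_{\pi \in \Pi(\mathbb{P}_0, \mathbb{P}_1)} \int \frac{\|x - y\|^2}{2} \, \mathrm{d}\pi(x, y) \;-\; \varepsilon H(\pi),
\]

which, because the marginals of \( \pi \) are fixed, is equivalent up to an additive constant to penalizing \( \varepsilon \, \mathrm{KL}(\pi \,\|\, \mathbb{P}_0 \times \mathbb{P}_1) \) instead of \( -\varepsilon H(\pi) \).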
Schrödinger’s Bridge — Finding the Most Likely Path
Schrödinger’s Bridge (SB) considers the entire path from \( \mathbb{P}_0 \) at \( t=0 \) to \( \mathbb{P}_1 \) at \( t=1 \) under a stochastic process. The “prior” process is often Brownian motion (Wiener process):
Equation (4): The prior diffusion with variance \( \varepsilon \).
SB asks: among all stochastic processes matching the desired start and end distributions, which one is closest (in KL divergence) to this simple prior?
Equation (5): SB minimizes the KL divergence between candidate processes and the Wiener prior.
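Spelled out, Equations (4)–(5) look roughly like this (reconstructed from the description; here \( W^{\varepsilon} \) denotes the law of the Wiener prior started from \( \mathbb{P}_0 \), and \( \mathcal{F}(\mathbb{P}_0, \mathbb{P}_1) \) the set of processes with the prescribed marginals at \( t=0 \) and \( t=1 \)):

\[
\mathrm{d}X_t = \sqrt{\varepsilon}\, \mathrm{d}W_t,
\qquad
\inf_{T \in \mathcal{F}(\mathbb{P}_0, \mathbb{P}_1)} \mathrm{KL}\left( T \,\|\, W^{\varepsilon} \right).
\]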
The Crucial Link — SB as EOT
A key insight: SB and EOT are equivalent. The KL divergence between processes can be decomposed:
Equation (6): The KL divergence splits into a divergence between the joint start-end distributions plus an expected conditional divergence along the paths in between.
For the optimal SB process \( T^* \), the conditional term vanishes because the optimal process shares the prior's Brownian bridges between its endpoints, leaving only the KL divergence between the joint start-end distribution and the prior's joint distribution:
Equation (8): Optimizing over processes reduces to optimizing over start-end joint distributions—exactly the EOT problem.
Thus, solving SB yields the EOT plan \( \pi^* \). In the dynamic SB form, the optimal process is a diffusion with drift \( f(X_t, t) \), minimizing expected drift energy:
Equation (11): Energy minimization form of SB.
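Schematically, the chain of reductions in Equations (6)–(11) reads as follows (my notation: \( \pi^{T} \) is the joint law of \( (X_0, X_1) \) under the process \( T \), and \( T_{|x,y} \) is the process conditioned on its endpoints):

\[
\mathrm{KL}(T \,\|\, W^{\varepsilon})
= \mathrm{KL}\big(\pi^{T} \,\|\, \pi^{W^{\varepsilon}}\big)
+ \mathbb{E}_{(x,y) \sim \pi^{T}} \, \mathrm{KL}\big(T_{|x,y} \,\|\, W^{\varepsilon}_{|x,y}\big),
\]

and for a diffusion \( \mathrm{d}X_t = f(X_t, t)\,\mathrm{d}t + \sqrt{\varepsilon}\,\mathrm{d}W_t \) started at \( \mathbb{P}_0 \), Girsanov's theorem turns the KL objective into a drift-energy objective, so the dynamic problem is, up to a constant factor,

\[
\inf_{f \,:\, X_1 \sim \mathbb{P}_1} \; \mathbb{E}\left[ \int_0^1 \tfrac{1}{2} \|f(X_t, t)\|^2 \, \mathrm{d}t \right].
\]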
The ENOT Method — Saddle-Point Reformulation
Removing Hard Constraints via Lagrangian Relaxation
Directly enforcing the terminal constraint \( \pi_1^{T_f} = \mathbb{P}_1 \), i.e., that the process \( T_f \) actually ends at the target distribution, is hard. ENOT introduces a Lagrangian-like functional:
Equation (12): Objective with potential \( \beta(y) \) as Lagrange multiplier for matching the final marginal.
Here:
- Drift network \( f_\theta \): defines the process to minimize \( \mathcal{L} \), reducing drift energy and generating final samples that score high under \( \beta \).
- Potential network \( \beta_\phi \): scores samples to maximize \( \mathcal{L} \), pushing generated samples closer to the target distribution.
This adversarial setup forms a saddle-point problem:
Equation (13): Maximizing over potentials, minimizing over drifts solves relaxed SB, hence EOT.
The marginal constraint emerges implicitly at equilibrium, avoiding expensive enforcement. All terms are estimable from samples, enabling stochastic gradient training.
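Consistent with the update rules described in the next subsection, the relaxed objective of Equations (12)–(13) can be sketched as (a schematic reconstruction, not a verbatim quote of the paper):

\[
\mathcal{L}(\beta, f)
= \mathbb{E}_{T_f}\left[ \int_0^1 \tfrac{1}{2}\|f(X_t, t)\|^2 \, \mathrm{d}t - \beta(X_1) \right]
+ \mathbb{E}_{y \sim \mathbb{P}_1}\left[ \beta(y) \right],
\qquad
\sup_{\beta} \, \inf_{f} \, \mathcal{L}(\beta, f).
\]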
Practical Algorithm — Entropic Neural OT
The ENOT algorithm alternates updates:
Critic update (\( \beta_\phi \)):
- Sample \( X_0 \sim \mathbb{P}_0 \), simulate process to get \( X_1 \).
- Sample \( Y \sim \mathbb{P}_1 \).
- Update \( \beta_\phi \) to maximize \( \frac{1}{|Y|}\sum\beta(Y) - \frac{1}{|X_1|}\sum\beta(X_1) \).
Generator update (\( f_\theta \)):
- Sample new \( X_0 \).
- Simulate \( X_t \) and drifts \( f_\theta(X_t, t) \).
- Minimize energy term (mean squared drift) plus adversarial term \(-\frac{1}{|X_1|}\sum\beta(X_1)\).
This repeats until convergence. The learned \( f_\theta \) defines both SB and EOT solutions.
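Below is a minimal PyTorch-style sketch of this alternating loop. It follows the description above rather than the authors' released code; the components `drift_net` (for \( f_\theta \)), `potential_net` (for \( \beta_\phi \)), the samplers `sample_p0`, `sample_p1`, the Euler-Maruyama discretization, and all hyperparameters are illustrative assumptions.

```python
import torch

def simulate(drift_net, x0, eps, n_steps=10):
    """Euler-Maruyama rollout of dX_t = f_theta(X_t, t) dt + sqrt(eps) dW_t on [0, 1].

    Returns the final state X_1 and an estimate of the mean drift energy
    (1/2) * integral of ||f||^2 dt, averaged over the batch.
    """
    dt = 1.0 / n_steps
    x = x0
    energy = x0.new_zeros(())
    for k in range(n_steps):
        t = x0.new_full((x.shape[0], 1), k * dt)   # current time, one value per sample
        f = drift_net(x, t)                        # drift f_theta(X_t, t)
        energy = energy + 0.5 * (f ** 2).flatten(1).sum(dim=1).mean() * dt
        x = x + f * dt + (eps * dt) ** 0.5 * torch.randn_like(x)
    return x, energy


def enot_step(drift_net, potential_net, opt_f, opt_beta, sample_p0, sample_p1, eps):
    """One alternating update: critic (potential) step, then generator (drift) step."""
    # Critic update: maximize mean beta(Y) - mean beta(X_1),
    # implemented by minimizing the negated objective below.
    with torch.no_grad():
        x1_fake, _ = simulate(drift_net, sample_p0(), eps)
    y = sample_p1()
    beta_loss = potential_net(x1_fake).mean() - potential_net(y).mean()
    opt_beta.zero_grad()
    beta_loss.backward()
    opt_beta.step()

    # Generator update: minimize drift energy minus mean beta(X_1).
    x1, energy = simulate(drift_net, sample_p0(), eps)
    f_loss = energy - potential_net(x1).mean()
    opt_f.zero_grad()
    f_loss.backward()
    opt_f.step()
```

At inference time, one simulates the learned SDE from a new input \( X_0 \); re-running with fresh noise yields diverse outputs, with the spread controlled by \( \varepsilon \).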
Experiments
Toy 2D Example — Gaussian to 8 Gaussians
Figure 2: ENOT mappings for \( \varepsilon = 0, 0.01, 0.1 \). Small \( \varepsilon \) gives nearly straight, deterministic paths; larger \( \varepsilon \) gives more stochastic trajectories.
High-Dimensional Gaussians — Quantitative Evaluation
With Gaussian \( \mathbb{P}_0, \mathbb{P}_1 \), closed-form EOT/SB solutions exist. ENOT beats baselines in matching the target distribution and recovering the true plan:
Dim | ENOT | LSOT | SCONES | MLE-SB | DiffSB | FB-SDE-A | FB-SDE-J |
---|---|---|---|---|---|---|---|
2 | 0.01 | 1.82 | 1.74 | 0.41 | 0.70 | 0.87 | 0.03 |
16 | 0.09 | 6.42 | 1.87 | 0.50 | 1.11 | 0.94 | 0.05 |
64 | 0.23 | 32.18 | 6.27 | 1.16 | 1.98 | 1.85 | 0.19 |
128 | 0.50 | 64.32 | 6.88 | 2.13 | 2.20 | 1.95 | 0.39 |
Table 1: Error in matching marginals (lower is better).
Dim | ENOT | LSOT | SCONES | MLE-SB | DiffSB | FB-SDE-A | FB-SDE-J |
---|---|---|---|---|---|---|---|
2 | 0.012 | 6.77 | 0.92 | 0.30 | 0.88 | 0.75 | 0.07 |
16 | 0.05 | 14.56 | 1.36 | 0.90 | 1.70 | 1.36 | 0.22 |
64 | 0.13 | 25.56 | 4.62 | 1.34 | 2.32 | 2.45 | 0.34 |
128 | 0.29 | 47.11 | 5.33 | 1.80 | 2.43 | 2.64 | 0.58 |
Table 2: Error in recovering the true plan (lower is better).
The performance gap over the baselines widens as the dimension grows.
Colored MNIST — Diversity Control
Figure 3: ENOT delivers sharp, diverse “3”s from “2”s even at small \( \varepsilon \).
ENOT achieves FID = 6.28 at \( \varepsilon = 1.0 \), outperforming SCONES (14.73) and DiffSB (93), both of which degrade badly at low \( \varepsilon \).
CelebA Faces — Unpaired Super-Resolution
Figure 4: ENOT preserves identity while adding realistic detail; SCONES with \( \varepsilon=100 \) loses input structure.
ENOT scores FID = 3.78 at \( \varepsilon = 0 \) and 7.63 at \( \varepsilon = 1 \), versus 14.8 for SCONES. Output diversity increases with \( \varepsilon \), traded off against fidelity to the input.
Figure 1 shows progressive deblurring via ENOT’s learned process:
Figure 1: Trajectories of samples for CelebA deblurring with \( \varepsilon=0,1,10 \).
Conclusion & Implications
The ENOT framework brings:
- A new viewpoint: Casting EOT as a dynamic Schrödinger Bridge problem enables efficient, sample-based saddle-point optimization.
- Stability for small \( \varepsilon \): Crucial for practical, high-fidelity mappings with controlled diversity.
- State-of-the-art results: Superior performance in high dimensions and complex image-to-image translations.
Beyond super-resolution and digit translation, ENOT’s tunable stochastic mappings could impact style transfer, domain adaptation, and any application requiring flexible yet faithful distribution alignment. By uniting physics-inspired theory with neural optimization, this work advances both the methodology and the scope of entropic optimal transport.