There is an old project management adage that applies frustratingly well to engineering: “Good, Fast, Cheap. Pick two.”
In the world of image compression, this “impossible trinity” dictates the limits of our technology. You can have high visual fidelity (Good) and low file size (Cheap), but it will likely require computationally expensive, slow AI models to decode. Conversely, you can have a codec that is lightning fast and produces tiny files (like standard JPEGs at low quality), but the result will look blocky, blurry, and distinctly “digital.”
For years, research into learned image compression has leaned heavily into the “Good” and “Cheap” corners, utilizing massive generative models like GANs (Generative Adversarial Networks) or Diffusion models. These models can hallucinate realistic textures at incredibly low bitrates, but they are computational heavyweights. Running them on a mobile phone, within its battery and compute budget, is currently a non-starter.
But what if the heavy lifting were done not by the model itself, but by the metric used to train it?
In the paper “Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion,” researchers demonstrate that we don’t necessarily need massive generative networks to achieve generative-level quality. By taking a lightweight, “overfitted” neural codec and training it with a sophisticated model of human perception—Wasserstein Distortion (WD)—they achieved the holy grail: compression that is high quality, low bitrate, and computationally light enough for practical decoding.
The Landscape of Learned Compression
To understand why this approach is novel, we first need to look at the two dominant paradigms in modern image compression.
1. The Generative Approach
Recent advances in AI have birthed “generative” compression (e.g., HiFiC, diffusion-based codecs). These models treat decompression as a generative task. Instead of just reconstructing pixels, they attempt to sample from the distribution of natural images. If the encoder sends a compressed signal saying “this patch is grass,” the decoder uses its learned knowledge of the world to generate a convincing texture of grass. The results are visually stunning, avoiding the blurriness of traditional codecs. However, the computational cost (measured in MACs, or Multiply-Accumulate operations) is enormous.
2. The Overfitted Approach
On the other end of the spectrum are “overfitted” codecs, such as C3 (an evolution of the COOL-CHIC approach). Unlike a massive global network trained on millions of images, an overfitted codec learns a tiny, dedicated neural network for one single image. The network parameters are the bitstream.
The architecture of C3 is incredibly efficient. It rivals modern standards like VVC (Versatile Video Coding) in Rate-Distortion performance but is much simpler. However, C3 has traditionally been optimized using Mean Squared Error (MSE).
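To make the “network parameters are the bitstream” idea concrete, here is a minimal PyTorch sketch of overfitting a tiny network to a single image. It is closer in spirit to COIN than to C3, which adds multi-resolution latent grids and an entropy model; the architecture, sizes, and training schedule below are illustrative assumptions, not the paper’s configuration.

```python
import torch
import torch.nn as nn

# Conceptual sketch of "overfitted" compression: a tiny network is trained on
# ONE image, and its (quantized) parameters become the bitstream. C3 adds
# latent grids and an entropy model on top; those are omitted here.

class TinySynthesis(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 3),  # RGB value for each pixel coordinate
        )

    def forward(self, coords):  # coords: (N, 2) in [-1, 1]
        return self.net(coords)

def overfit(image, steps=5000, lr=1e-3):
    """image: (H, W, 3) tensor in [0, 1]. Returns the network fitted to it."""
    h, w, _ = image.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    target = image.reshape(-1, 3)

    model = TinySynthesis()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()  # plain MSE for now
        loss.backward()
        opt.step()
    # Quantizing and entropy-coding model.parameters() would yield the bitstream.
    return model
```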
The Problem with MSE
MSE is the mathematical equivalent of playing it safe. It calculates the pixel-by-pixel difference between the original and the reconstruction. When a codec is unsure exactly where a blade of grass should go, minimizing MSE results in averaging all possibilities. The visual result? A blurry, brown-green smudge. MSE kills texture because it penalizes any deviation from the exact pixel location, even if the texture looks “real” to a human eye.
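A toy numerical example makes the failure mode concrete (pure illustration, not from the paper): shift a stochastic texture by one pixel and MSE reports a large error, while the perceptually worse blurry average scores better.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "texture": i.i.d. noise standing in for grass or gravel.
texture = rng.random((256, 256))

# Perceptually, a one-pixel shift of a stochastic texture is indistinguishable.
shifted = np.roll(texture, shift=1, axis=1)

# The blurry "safe bet": the per-pixel average over plausible realizations.
blurred = 0.5 * (texture + shifted)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

print(f"MSE(original, shifted): {mse(texture, shifted):.4f}")  # large (~0.17)
print(f"MSE(original, blurred): {mse(texture, blurred):.4f}")  # smaller (~0.04)
# MSE rewards the blur even though the shifted texture looks identical,
# which is exactly why MSE-trained codecs smear grass, hair, and gravel.
```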
The Solution: Modeling Perception, Not Distributions
The researchers propose a shift in philosophy. Instead of trying to model the complex distribution of all natural images (which requires heavy generative models), they focus on modeling human visual perception.
They take the lightweight C3 codec and make two critical changes:
- Wasserstein Distortion (WD): Replacing MSE with a loss function that measures “perceptual distance” and allows for texture resampling.
- Common Randomness (CR): Providing the decoder with a source of noise to help synthesize stochastic textures (like gravel or clouds).
Let’s break down the architecture.
The C3 Architecture with Common Randomness
The baseline is the C3 codec. As shown in the figure below, the process involves decoding latent variables at multiple resolutions. These latents are upsampled and passed through a synthesis network (\(f_{\theta}\)) to create the final image.

The innovation here is the addition of Common Randomness (CR) (indicated by the brown squares in Panel B).
Imagine trying to paint a detailed granite rock. If you have to describe every speck of dust, it takes a lot of words (bits). But if you and the painter share a specific, identical brush that creates random speckles, you can just say “use the speckle brush here.”
In this system, the encoder and decoder share the seed of a pseudo-random number generator. The generator produces “noise maps” that are upsampled and concatenated with the image latents. The neural network learns to use this noise as raw material for creating textures. Because the seed is fixed, no extra bits are needed to send this noise; it is “free” detail.
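A minimal sketch of the shared-seed mechanism follows; the tensor shapes, seed handling, and the way C3 actually injects and upsamples the noise are assumptions made for illustration.

```python
import torch

# Both sides derive identical noise maps from a seed agreed on in advance,
# so the noise costs zero bits in the bitstream.

def common_noise(seed: int, shape, device="cpu"):
    gen = torch.Generator(device=device).manual_seed(seed)
    return torch.randn(shape, generator=gen, device=device)

SEED = 1234                                  # fixed by the codec, never transmitted

latents = torch.randn(1, 8, 64, 64)          # stand-in for the decoded latents

# Decoder side: regenerate the exact same noise from the shared seed...
noise = common_noise(SEED, (1, 4, 64, 64))

# ...and concatenate it with the latents before the synthesis network.
synthesis_input = torch.cat([latents, noise], dim=1)   # (1, 12, 64, 64)
```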
The Core Engine: Wasserstein Distortion
The most significant contribution of this work is the practical implementation of Wasserstein Distortion (WD) as a training objective.
Standard metrics like MSE or SSIM compare images pixel-by-pixel or structure-by-structure. Perceptual metrics like LPIPS compare images in a “feature space” (using activations from a pre-trained network like VGG). WD takes this a step further by incorporating the biology of the human eye—specifically, the difference between foveal vision (center of gaze) and peripheral vision.
In our peripheral vision, we don’t see exact details. We see “summary statistics” of textures. If you look at a brick wall, your periphery registers “brick texture,” not the exact crack in the third brick to the left. WD models this by allowing the reconstruction to be different from the original, as long as the local statistics of the features match.
Calculating WD Efficiently
Calculating true Wasserstein distance is computationally expensive. The authors propose an efficient approximation using VGG features.

The process works in three stages:
- Feature Extraction: Run both the original and compressed image through a VGG network to get feature maps (\(f_i\)).
- Local Statistics: Instead of comparing features directly, compute the local mean (\(\mu\)) and standard deviation (\(\nu\)) of the features within a certain pooling region. This represents the “texture” of that area.
- Aggregation: Calculate the distance between these distributions.
The size of the pooling region is determined by a parameter called \(\sigma\) (sigma); a sketch of the computation follows the list below.
- Small \(\sigma\): Small pooling region. The model must match features precisely. This mimics foveal vision (looking directly at an object).
- Large \(\sigma\): Large pooling region. The model only needs to match the general “vibe” or statistics of the texture. This mimics peripheral vision.
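Here is a hedged sketch of the local-statistics computation for a single feature map and a single pooling scale. The paper uses VGG features and Gaussian pooling windows; this toy version uses random tensors and box pooling, and the “Wasserstein distance between Gaussian summaries” simplification is one reading of the construction rather than the paper’s exact formula.

```python
import torch
import torch.nn.functional as F

def local_stats(feat, window):
    """Local mean and std of a (B, C, H, W) feature map via box pooling."""
    pad = window // 2
    mu = F.avg_pool2d(feat, window, stride=1, padding=pad, count_include_pad=False)
    mu2 = F.avg_pool2d(feat ** 2, window, stride=1, padding=pad, count_include_pad=False)
    var = (mu2 - mu ** 2).clamp(min=0.0)
    return mu, var.sqrt()

def wd_one_scale(feat_x, feat_y, sigma):
    """2-Wasserstein distance between local Gaussian summaries of two feature maps."""
    window = 2 * sigma + 1
    mu_x, nu_x = local_stats(feat_x, window)
    mu_y, nu_y = local_stats(feat_y, window)
    # Squared W2 distance between 1-D Gaussians: (mu_x - mu_y)^2 + (nu_x - nu_y)^2
    return ((mu_x - mu_y) ** 2 + (nu_x - nu_y) ** 2).mean()

# Small sigma -> tiny window -> behaves like a per-feature (foveal) match.
# Large sigma -> wide window -> only texture statistics (peripheral) must match.
feat_x, feat_y = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(wd_one_scale(feat_x, feat_y, sigma=1).item())
print(wd_one_scale(feat_x, feat_y, sigma=8).item())
```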
To make this fast enough for optimization, the authors use a pyramid of pre-computed statistics at power-of-two scales and interpolate the loss for any intermediate \(\sigma\) between the two nearest dyadic scales.
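One plausible form of this interpolation (the exact formula and notation here are an assumption, not copied from the paper) is a log-linear blend between the two nearest dyadic scales:

\[
D_\sigma \;\approx\; (1-\lambda)\,D_{2^k} \;+\; \lambda\,D_{2^{k+1}},
\qquad
\lambda = \log_2 \sigma - k,
\qquad
2^k \le \sigma \le 2^{k+1},
\]

where \(D_{2^k}\) denotes the pre-computed distortion at pooling scale \(2^k\).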


This approach creates a “differentiable” loss function. The neural network can learn exactly how to manipulate the Common Randomness to satisfy these statistical texture constraints.
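Putting the pieces together, a hedged sketch of the per-image (“overfitted”) optimization loop might look like the following. Here `rate_model`, `wd_loss`, and the `synthesis` module are caller-supplied placeholders for the components described above, and the loss weighting is a generic rate-distortion Lagrangian rather than the paper’s exact objective.

```python
import torch

def fit_single_image(image, latents, synthesis, rate_model, wd_loss,
                     sigma_map, lam=0.01, steps=10_000, lr=1e-2, seed=1234):
    """Jointly optimize the latents and the tiny synthesis network for ONE image.

    `rate_model(latents)` returns an estimated bit cost, and
    `wd_loss(image, recon, sigma_map)` the Wasserstein Distortion; both are
    supplied by the caller in this sketch.
    """
    latents = latents.detach().requires_grad_(True)
    noise = torch.randn(latents.shape,
                        generator=torch.Generator().manual_seed(seed))
    opt = torch.optim.Adam(list(synthesis.parameters()) + [latents], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        recon = synthesis(latents, noise)          # decode with common randomness
        rate = rate_model(latents)                 # estimated bits for the latents
        dist = wd_loss(image, recon, sigma_map)    # saliency-aware perceptual loss
        loss = dist + lam * rate                   # rate-distortion trade-off
        loss.backward()
        opt.step()
    return synthesis, latents
```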
The Role of Saliency
If we use a large \(\sigma\) everywhere, the whole image might look like a “texture” version of itself—great for grass, but terrible for text or faces, which become jumbled. If we use a small \(\sigma\) everywhere, we revert to MSE-like behavior, spending too many bits on noise that doesn’t matter.
The solution is Saliency. The researchers use EML-NET, a saliency-prediction network, to guess where a human is likely to look.
They convert the saliency map (\(s\)) into a density map (\(p\)) and finally into a spatially varying \(\sigma\)-map (one possible mapping is sketched after the list below):


- High Saliency (Eye gazing here): High \(p\) \(\rightarrow\) Low \(\sigma\). The codec is forced to reconstruct exact details.
- Low Saliency (Peripheral): Low \(p\) \(\rightarrow\) High \(\sigma\). The codec is allowed to hallucinate statistically similar texture, saving bits.
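Here is a hedged sketch of one way to build such a \(\sigma\)-map; the actual normalization and mapping in the paper may differ, and the range and formula below are assumptions chosen only to reproduce the qualitative behavior in the list above.

```python
import numpy as np

def saliency_to_sigma(saliency, sigma_min=0.0, sigma_max=16.0, eps=1e-6):
    """saliency: (H, W) array of non-negative saliency scores.
    High saliency -> small sigma (exact match); low saliency -> large sigma."""
    p = saliency / (saliency.sum() + eps)      # density map: sums to ~1
    p = p / (p.max() + eps)                    # rescale to [0, 1]
    return sigma_min + (1.0 - p) * (sigma_max - sigma_min)

saliency = np.zeros((64, 64))
saliency[20:30, 20:40] = 1.0                   # e.g. a face or text region
sigma_map = saliency_to_sigma(saliency)
print(sigma_map[25, 30], sigma_map[0, 0])      # ~0 (exact) vs. 16 (texture)
```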
Experimental Results
Does it work? The results are compelling.
Visual Quality vs. Bitrate
The visual difference is stark. In the comparison below, look at the grass. The MSE version (top right) smears the grass into a blur. The WD version (bottom left) retains the grassy texture. Interestingly, the bottom right image shows what happens without Common Randomness: the model tries to synthesize texture using deterministic lines, creating artifacts on the roof. CR allows for natural, stochastic noise shaping.

The “Good, Cheap, Fast” Trade-off
The chart below summarizes the paper’s main achievement. The y-axis represents the Elo score (a measure of human preference from a rater study), and the x-axis represents the bitrate.
- HiFiC (Green line): High quality, low bitrate, but extreme complexity (see the right-hand graph showing MACs).
- C3/WDs (Orange line, the paper’s method): Matches the quality of HiFiC with less than 1% of the computational cost on the decoder side.

This graph effectively proves the thesis: you can have the quality of a generative model without the massive decoder network.
The Importance of Saliency
The impact of the saliency-guided \(\sigma\)-map is evident in images containing text. In the figure below, the “flat \(\sigma\)” version (Top Right) treats the text on the camera lens as a random texture, scrambling the letters. The saliency-guided version (Bottom Right) recognizes the text as important, lowers the \(\sigma\) for that region, and preserves legibility while still hallucinating the skin texture elsewhere.

Also note the comparison to C3/wMSE (Bottom Center). Simply weighting the MSE loss by saliency doesn’t help synthesize texture; it just makes the blurry areas slightly less blurry. You need the statistics-matching behavior of the Wasserstein loss to get genuine texture synthesis.
Bit Allocation
How does the model achieve this efficiency? By offloading detail to the texture-synthesis machinery. The chart below shows how bits are allocated across the codec’s multi-resolution latent arrays.

Notice that the WD models (center/right bars) spend significantly fewer bits on Array 1 (the highest resolution layer, shown in blue) compared to the MSE model (left bars). Instead of explicitly coding every high-frequency pixel change, the WD model relies on the lower-resolution latents and the Common Randomness to synthesize those details during decoding.
WD as a Metric for Evaluation
An unexpected but significant finding was how well Wasserstein Distortion performed as an Image Quality Assessment (IQA) metric. The authors compared how well different metrics predicted human ratings (Elo scores).

As shown above, WD8 (Wasserstein Distortion with \(\sigma=8\)) achieved a Pearson correlation of 0.936 with human ratings, vastly outperforming standard metrics like MS-SSIM (0.540) and even learned metrics like LPIPS (0.711). This suggests that WD isn’t just a good loss function; it’s a highly accurate mathematical proxy for how humans perceive image quality.
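For reference, this kind of metric-versus-human agreement is simply a Pearson correlation between per-image metric values and per-image Elo scores; the numbers in the snippet below are invented purely to show the procedure.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustration only: made-up per-image scores, real procedure.
elo     = np.array([1420.0, 1510.0, 1290.0, 1605.0, 1380.0])  # human preference
wd8     = np.array([0.031, 0.024, 0.048, 0.018, 0.037])       # lower = better
ms_ssim = np.array([0.962, 0.968, 0.948, 0.971, 0.958])       # higher = better

# Negate "lower is better" metrics so a positive correlation means agreement.
r_wd, _ = pearsonr(-wd8, elo)
r_ms, _ = pearsonr(ms_ssim, elo)
print(f"WD8 vs. Elo:     r = {r_wd:.3f}")
print(f"MS-SSIM vs. Elo: r = {r_ms:.3f}")
```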
Additional Visual Examples
The texture synthesis capabilities are consistent across various scenes.
Comparison with LPIPS: In this waterfall scene, the LPIPS-optimized model (center) smears the person in the foreground. The WD model (right) manages to preserve the foreground details while faithfully reconstructing the chaotic texture of the water.

Street Scenes and Signage: Here we see another example of the saliency trade-off. The flat WD model (Center) creates a nice street texture but mangles the “POPPIE’S” sign. The saliency-guided model (Right) detects the high contrast text, protecting it from texture resampling, resulting in readable text and realistic brickwork.

Landscape and Vegetation: In this mountain shot, the MSE optimization (Left) creates “staircasing” artifacts and flat vegetation. The WD version (Right) synthesizes believable foliage using 15% fewer bits.

Conclusion: Breaking the Triangle
The paper “Good, Cheap, and Fast” challenges the prevailing assumption that high-fidelity “generative” image compression requires massive, slow neural networks.
By combining an efficient, overfitted architecture (C3) with a perception-aligned loss function (Wasserstein Distortion) and a splash of randomness (Common Randomness), the authors have created a codec that:
- Looks Great: Comparable to HiFiC and superior to standard codecs.
- Is Small: Competitive bitrates for the quality provided.
- Runs Fast: Decodes orders of magnitude faster than diffusion or GAN-based approaches.
This work highlights a crucial lesson for AI research: sometimes, the bottleneck isn’t the model capacity, but the objective function. By telling the network how to see (using Wasserstein distance and Saliency) rather than just what to match (pixels), we can achieve efficiency that was previously thought impossible.
For the future of media streaming and storage, this implies that the next generation of visual fidelity might not come from bigger chips, but from smarter math.