Balancing Act: How Multi-Objective Optimization Boosts Learned Image Compression

In the world of digital media, we are constantly fighting a tug-of-war. On one side, we want high-quality images that look crisp and true to life (Low Distortion). On the other, we want files that are small enough to stream, store, and share instantly (Low Rate). This trade-off is the heart of image compression.

Traditional codecs like JPEG or HEVC solve this with hand-tuned engineering. But recently, Learned Image Compression (LIC)—using deep neural networks to compress images—has started to outperform these traditional methods. LIC models learn from data how to best represent an image.

However, training these models is tricky. You have to teach the network to minimize file size and maximize quality simultaneously. These two goals often fight each other during training, so the summed objective settles on a compromise that looks balanced but isn’t actually optimal.

In this post, we’ll dive into a recent research paper that proposes a smarter way to train these networks. By treating compression as a Multi-Objective Optimization (MOO) problem, the researchers developed a method to dynamically balance the “Rate” and “Distortion” goals, squeezing out significantly better performance without changing the model architecture itself.

The Problem: When Objectives Collide

To understand the solution, we first need to understand how Learned Image Compression models are typically trained.

The Standard R-D Loss

In a standard setup, an LIC model is trained to minimize a loss function that combines two things:

  1. Rate (\(\mathcal{L}_R\)): The number of bits required to store the compressed code.
  2. Distortion (\(\mathcal{L}_D\)): The difference between the original image and the reconstructed image (often measured by Mean Squared Error).

These are combined using a trade-off parameter, \(\lambda\) (lambda). The loss function looks like this:

\[
\mathcal{L} = \mathcal{L}_R + \lambda\,\mathcal{L}_D
\]

Here, \(\lambda\) acts like a volume knob. Turn it up, and the model focuses on quality (high bitrate). Turn it down, and the model focuses on compression (low bitrate).
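To make this concrete, here is a minimal PyTorch sketch of the standard summed loss. The tensors are toy stand-ins rather than a real LIC model, and the \(\lambda\) value is purely illustrative:

```python
import torch

# Toy stand-ins: in a real LIC model, rate_loss comes from the entropy model
# (estimated bits) and dist_loss from the reconstruction error (e.g., MSE).
rate_loss = torch.tensor(2.5, requires_grad=True)   # surrogate bits-per-pixel
dist_loss = torch.tensor(0.01, requires_grad=True)  # surrogate MSE

lam = 100.0  # illustrative "volume knob": larger lambda favors quality
loss = rate_loss + lam * dist_loss
loss.backward()  # both objectives collapse into a single summed gradient
```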

The Imbalance

Ideally, as the model trains, it improves both rate and distortion simultaneously. In reality, the gradients—the signals telling the neural network how to update its weights—can be vastly different for Rate and Distortion.

Sometimes the gradient for Distortion is massive, while the gradient for Rate is tiny (or vice versa). Or perhaps they point in conflicting directions. When we simply add them together, the stronger signal drowns out the weaker one.

This leads to imbalanced optimization. The model might spend dozens of epochs fixing minor distortion errors while ignoring opportunities to save bits, or vice versa. The result is a model that settles for a “good enough” solution rather than the best possible one.
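You can observe this imbalance directly by comparing per-objective gradient norms. Below is a rough sketch; the linear layer and the surrogate losses are hypothetical stand-ins for a real compression network:

```python
import torch
import torch.nn as nn

# A tiny stand-in network; the actual LIC architecture doesn't matter here.
model = nn.Linear(16, 16)
x = torch.randn(8, 16)
y = model(x)

# Hypothetical surrogates for the Rate and Distortion terms.
rate_loss = y.abs().mean()
dist_loss = ((y - x) ** 2).mean()

params = list(model.parameters())
g_rate = torch.autograd.grad(rate_loss, params, retain_graph=True)
g_dist = torch.autograd.grad(dist_loss, params)

grad_norm = lambda grads: torch.sqrt(sum(g.pow(2).sum() for g in grads))
print(f"|grad_R| = {grad_norm(g_rate):.4f}  |grad_D| = {grad_norm(g_dist):.4f}")
# If one norm dwarfs the other, the summed loss is dominated by one objective.
```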

We can see this phenomenon clearly in the chart below:

Figure 1: Comparison of loss trends and improvement speeds (standard vs. balanced training).

In Figure 1 (a), look at the dashed lines representing the “Standard” approach. Notice how the Distortion loss drops, but the bpp (bits per pixel) loss barely moves—it actually increases slightly. The objectives aren’t moving together.

Now look at the solid lines (the proposed “Balanced” approach). Both curves decrease smoothly and consistently. By balancing the optimization, the model learns to improve both quality and file size simultaneously.

The Solution: Multi-Objective Optimization (MOO)

The researchers propose abandoning the simple sum of losses. Instead, they frame Rate-Distortion optimization as a Multi-Objective Optimization problem.

\[
\min_{\theta}\ \big(\mathcal{L}_R(\theta),\ \mathcal{L}_D(\theta)\big)
\]

The goal is to find a parameter update that improves both objectives. If we view Rate and Distortion as two separate tasks, we want a gradient update direction that benefits both tasks as evenly as possible.

Maximizing Improvement Speed

How do we define “equally”? We look at the improvement speed. If we update the model parameters \(\theta\) in a certain direction \(d_t\), how much does the loss drop relative to its current value?

The relative improvement speed for a task \(i\) (where \(i\) is either Rate or Distortion) is defined as:

\[
s_i(d_t) = \frac{\mathcal{L}_i(\theta_t) - \mathcal{L}_i(\theta_t + d_t)}{\mathcal{L}_i(\theta_t)}, \qquad i \in \{R, D\}
\]

To ensure balanced learning, we want to maximize the minimum improvement speed. Think of it like a convoy of ships; the speed of the convoy is determined by the slowest ship. We want to make sure the “slowest” objective (the one lagging behind) gets the most attention.

This leads to a “minimax” optimization problem. We want to find a direction \(d_t\) that maximizes the worst-case improvement between Rate and Distortion:

\[
\max_{d_t}\ \Big(\min_{i \in \{R, D\}}\ s_i(d_t)\Big) - \frac{1}{2}\,\|d_t\|^2
\]

(The \(-\tfrac{1}{2}\|d_t\|^2\) penalty keeps the problem bounded; without it, scaling \(d_t\) up would inflate the improvement speeds indefinitely.)

Don’t let the math scare you. This equation simply says: “Find the update direction \(d_t\) that guarantees the best possible improvement for whichever objective is currently struggling the most.”
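To make the minimax concrete with made-up numbers: a direction giving \(s_R = 0.01\) and \(s_D = 0.05\) scores \(\min(0.01, 0.05) = 0.01\), while one giving \(s_R = s_D = 0.03\) scores \(0.03\). The second, balanced direction wins, even though its best single-objective improvement is smaller.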

The Dual Problem: Reweighting Gradients

Solving for the update direction \(d_t\) directly in a high-dimensional neural network (with millions of parameters) is computationally expensive. However, the researchers leverage a mathematical trick called Lagrangian duality.

Instead of finding the direction vector directly, they solve for weights (\(w_R\) and \(w_D\)) to apply to the gradients.

It turns out that finding the best direction \(d_t\) is mathematically equivalent to finding weights \(w\) that minimize the magnitude of the combined gradient.

\[
\min_{w_t}\ \frac{1}{2}\,\big\|J_t^{\top} w_t\big\|_2^{2} \qquad \text{s.t.}\quad w_R + w_D = 1
\]

Here, \(J_t\) stacks the gradients of the Rate and Distortion losses. We want weights \(w_t\) (with \(w_R + w_D = 1\)) that minimize the norm of the combined gradient vector. This weighted gradient points in a direction that drives the model toward Pareto stationarity, improving both objectives efficiently.

Once we find these optimal weights, we compute the final update direction \(d_t\) as a weighted sum:

\[
d_t = \frac{w_R}{\mathcal{L}_R}\,\nabla_{\theta}\mathcal{L}_R + \frac{w_D}{\mathcal{L}_D}\,\nabla_{\theta}\mathcal{L}_D
\]

There is one practical catch: the loss values \(\mathcal{L}\) can become very small, causing numerical instability. To fix this, the researchers use a normalization constant \(c_t\):

\[
c_t = \left(\frac{w_R}{\mathcal{L}_R} + \frac{w_D}{\mathcal{L}_D}\right)^{-1}, \qquad d_t \leftarrow c_t\, d_t
\]

This ensures the update steps don’t explode or vanish.
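Putting the pieces together, here is one way this direction could be computed. The loss-scaled form below follows my reading of the equations above; treat it as a sketch, not the paper’s verbatim implementation:

```python
import torch

def balanced_direction(g_r, g_d, loss_r, loss_d, w_r, w_d):
    # Scale each gradient by its current loss (the relative-improvement view),
    # then renormalize by c_t so tiny loss values can't inflate the step.
    c_t = 1.0 / (w_r / loss_r + w_d / loss_d)
    return c_t * (w_r / loss_r * g_r + w_d / loss_d * g_d)

# Toy example with flattened gradients:
g_r, g_d = torch.randn(1000), torch.randn(1000)
d_t = balanced_direction(g_r, g_d, loss_r=2.5, loss_d=0.01, w_r=0.5, w_d=0.5)
print(d_t.norm())
```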

Two Strategies for Balance

The paper introduces two distinct ways to solve this weighting problem. One is an approximation suitable for training from scratch, and the other is an exact analytical solution perfect for fine-tuning.

Solution 1: Gradient Descent (Coarse-to-Fine)

When training a model from scratch, the landscape of the loss function changes rapidly. We don’t necessarily need the perfect weights at every single step; we just need to generally move in the right direction.

Solution 1 treats the weights \(w_t\) themselves as learnable parameters. At each step, it performs a quick gradient descent update on the weights to minimize the objective we defined above.

The weights are parameterized through logits \(\xi_t\), which are updated iteratively:

\[
\xi_{t+1} = \xi_t - \beta\,\nabla_{\xi}\,\frac{1}{2}\big\|J_t^{\top} w_t\big\|_2^{2}
\]

To ensure the weights stay positive and sum to 1 (a valid probability distribution), the researchers map the logits through a Softmax function:

\[
w_t = \mathrm{Softmax}(\xi_t), \qquad w_{t,i} = \frac{e^{\xi_{t,i}}}{e^{\xi_{t,R}} + e^{\xi_{t,D}}}
\]

They also add a decay term to the update rule, which helps stabilize the training by preventing the weights from oscillating too wildly based on a single batch of data:

\[
\xi_{t+1} = \xi_t - \beta\left(\nabla_{\xi}\,\frac{1}{2}\big\|J_t^{\top} w_t\big\|_2^{2} + \gamma\,\xi_t\right)
\]

Best for: Training a new LIC model from scratch. It’s computationally cheaper than Solution 2 and adapts progressively as the model learns.
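A hedged sketch of this inner loop in PyTorch. I differentiate the dual objective \(\|J_t^\top w_t\|^2\) with respect to the logits directly, and the step sizes \(\beta\) and \(\gamma\) are assumed values, not the paper’s:

```python
import torch
import torch.nn.functional as F

xi = torch.zeros(2, requires_grad=True)   # logits [xi_R, xi_D]
beta, gamma = 0.025, 0.001                # assumed weight step size and decay

def update_weights(g_r, g_d):
    # Weights live on the simplex via Softmax over the logits.
    w = F.softmax(xi, dim=0)
    combined = w[0] * g_r + w[1] * g_d    # J_t^T w_t
    objective = 0.5 * combined.pow(2).sum()
    (g_xi,) = torch.autograd.grad(objective, xi)
    with torch.no_grad():
        xi.sub_(beta * (g_xi + gamma * xi))  # decayed gradient step
    return F.softmax(xi, dim=0).detach()

w_t = update_weights(torch.randn(1000), torch.randn(1000))
print(w_t)  # stays near [0.5, 0.5] after one step from uniform logits
```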

Solution 2: Quadratic Programming (Analytical)

Sometimes, we want precision. If we are fine-tuning an already converged model, we want the exact optimal weights to squeeze out the last bit of performance.

The minimization problem we discussed earlier is actually a Quadratic Programming (QP) problem with a specific constraint (weights must sum to 1).

\[
\min_{w}\ \frac{1}{2}\, w^{\top} Q\, w \qquad \text{s.t.}\quad \mathbf{1}^{\top} w = 1, \qquad Q = J_t J_t^{\top}
\]

Because the matrix \(Q\) (the Gram matrix of the Rate and Distortion gradients, and the Hessian of this quadratic objective) is positive definite, there is a closed-form, analytical solution to this problem. We don’t need to iterate; we can just calculate the answer using matrix algebra.

Using Lagrange multipliers, the paper derives the exact formula for the optimal weights:

\[
w_t^{*} = \frac{Q^{-1}\mathbf{1}}{\mathbf{1}^{\top} Q^{-1}\mathbf{1}}
\]

And finally, to ensure numerical stability and non-negativity, they project this solution onto the simplex using Softmax:

\[
w_t = \mathrm{Softmax}\big(w_t^{*}\big)
\]

Best for: Fine-tuning existing models. It requires calculating and inverting a matrix (\(Q^{-1}\)), which adds computational cost, but it provides the mathematically optimal balance for every update.
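For two tasks, \(Q\) is just a 2×2 matrix, so the closed form is cheap to evaluate. A sketch of the analytical solution (the final Softmax mirrors the projection step described above):

```python
import torch

def qp_weights(g_r, g_d):
    # Stack the task gradients and form the 2x2 Gram matrix Q = J J^T.
    J = torch.stack([g_r, g_d])            # shape (2, P)
    Q = J @ J.T                            # shape (2, 2), positive definite
    ones = torch.ones(2)
    w_star = torch.linalg.solve(Q, ones)   # Q^{-1} 1, without an explicit inverse
    w_star = w_star / w_star.sum()         # divide by 1^T Q^{-1} 1
    return torch.softmax(w_star, dim=0)    # keep weights positive and stable

w = qp_weights(torch.randn(1000), torch.randn(1000))
print(w)  # balanced weights for the Rate and Distortion gradients
```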

Experimental Results

So, does balancing the gradients actually make a difference? The researchers tested their methods on standard datasets (Kodak, Tecnick, CLIC2022) using popular architectures like “M&S Hyperprior,” “ELIC,” and “TCM-S.”

R-D Performance

The results show a clear improvement. By simply changing how the model is optimized (without changing the model architecture itself), they achieved better compression performance.

Figure 2: Rate-Distortion curves for standard vs. balanced training.

In Figure 2, we see the Rate-Distortion curves. The horizontal axis is Bitrate (bpp, lower is better for size), and the vertical axis is PSNR (quality, higher is better).

  • Solid lines (Solution 1 & 2) are consistently above the dashed lines (Standard training).
  • This means for the same file size, the Balanced method gives higher quality. Or, for the same quality, it yields a smaller file size.

The table below quantifies these gains using BD-Rate, a metric that measures the average bitrate saving for the same quality. A negative number means the file size is smaller.

Table: BD-Rate comparison.

The proposed methods achieve around a 2% to 3% reduction in BD-Rate. In the world of image compression, where algorithms fight for fractions of a percent, a 2-3% gain purely from optimization strategy is a significant win.

Computational Cost

There is no free lunch. Calculating optimal weights requires extra computation during training.

Table: Training complexity comparison.

  • Solution 1 adds about 20% to the training time.
  • Solution 2 adds about 50% (due to matrix operations).

However, it is crucial to note that this cost applies only to training. The inference time (compressing/decompressing an image after the model is trained) remains exactly the same. The resulting model has the same architecture; it just has better weights.

Ablation Studies: Why the Details Matter

The researchers performed ablation studies to prove that their specific design choices were necessary.

Figure: Ablation study results.

  • Graph (a) - Renormalization: The orange and red lines show performance without the renormalization constant \(c_t\). They perform significantly worse. Normalizing the gradients is essential for convergence.
  • Graph (b) - Weight Decay: In Solution 1, using a decay term (\(\gamma\)) helps smooth out the updates. The blue square (\(\gamma=0.001\)) yields the best performance compared to other values.
  • Graph (c) - Scratch vs. Fine-tuning: This confirms the intended use cases. Solution 1 (Blue) is great for training from scratch. Solution 2 (Orange) is also good but slower. Interestingly, using Solution 1 for fine-tuning (Green) is not as effective as Solution 2.

Conclusion

The paper “Balanced Rate-Distortion Optimization in Learned Image Compression” highlights a fundamental issue in AI training: just because you sum two loss functions doesn’t mean they are being optimized equally.

By reframing image compression as a Multi-Objective Optimization problem, the authors provide a robust framework for handling the competing goals of Rate and Distortion.

  • Solution 1 offers a practical, iterative way to train balanced models from the ground up.
  • Solution 2 provides a mathematically precise tool for fine-tuning models to their theoretical limits.

For students and researchers in deep learning, this work serves as a reminder: sometimes the biggest gains don’t come from a new network architecture, but from simply helping the network learn more effectively. As LIC continues to mature, optimization techniques like these will likely become standard practice to close the gap with—and eventually surpass—traditional codecs.