Imagine you have a race car. You can tune the engine yourself (manual optimization), or you can train an AI to tune it for you (Learning-to-Optimize). The AI version is often significantly faster, zooming past the finish line while you’re still tweaking the carburetor.
But there is a catch: Can you trust the AI?
In classical optimization (like Gradient Descent or Adam), we have mathematical proofs guaranteeing that, eventually, the car will stop at the finish line (a critical point). In Learning-to-Optimize (L2O), the algorithm is often a neural network—a “black box.” Historically, proving that this black box will actually converge has been a nightmare. To get a guarantee, researchers often had to “safeguard” the AI, essentially forcing it to behave like a slow, classical algorithm, which defeats the purpose of using AI in the first place.
In this post, we will dive into a fascinating paper that proposes a new way to solve this dilemma. Instead of forcing the AI to be simple, the authors use probabilistic generalization to prove that if an algorithm looks like it’s converging during training, it will probably converge during testing.
The Core Problem: Speed vs. Safety
Learning-to-optimize (L2O) leverages machine learning to accelerate optimization algorithms. You train a model (usually a Recurrent Neural Network or similar) on a distribution of problems, and it learns update rules that are often much faster than hand-crafted rules.
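To make this concrete, here is a minimal, hypothetical sketch of the L2O recipe in PyTorch: a small network (called `LearnedOptimizer` here purely for illustration) maps gradients to update steps and is trained by unrolling it on randomly sampled quadratic problems. This is a sketch of the general idea, not the architecture or training procedure from the paper.

```python
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    """Hypothetical learned update rule: maps each gradient coordinate to a step."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, grad):
        # Coordinate-wise: each gradient entry is transformed by the same small network.
        return self.net(grad.unsqueeze(-1)).squeeze(-1)

def sample_problem(dim=10):
    # Random convex quadratic instance: loss(z) = 0.5 * z^T A z - b^T z
    M = torch.randn(dim, dim)
    A = M @ M.T + torch.eye(dim)
    b = torch.randn(dim)
    return lambda z: 0.5 * z @ A @ z - b @ z

opt_net = LearnedOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)

for step in range(200):                       # meta-training over a distribution of problems
    loss_fn = sample_problem()
    z = torch.zeros(10, requires_grad=True)
    total = 0.0
    for k in range(20):                       # unrolled inner optimization
        loss = loss_fn(z)
        grad, = torch.autograd.grad(loss, z, create_graph=True)
        z = z - opt_net(grad)                 # learned update rule
        total = total + loss_fn(z)            # meta-loss: how well did the unrolled run do?
    meta_opt.zero_grad()
    total.backward()
    meta_opt.step()
```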
However, theoretical guarantees are missing. In traditional optimization, convergence is proved using geometric arguments (e.g., “the slope always points down, so we must go down”). In L2O, the update step is a complex, non-linear function output by a neural network. You cannot easily apply geometric logic to it.
This leads to two camps in current research:
- No Guarantees: Just train it and hope it works. (Fast, but risky).
- Safeguards: Force the output of the neural network to satisfy strict geometric conditions. (Safe, but restricts performance—often degrading it back to the level of classical algorithms).
The authors of this paper ask: Can we keep the speed of the black box but still get a guarantee?
A New Perspective: Generalization
The authors advocate for a shift in perspective. Instead of treating the L2O model as a static geometric object, they treat it as a statistical learner.
The main advantage we have in L2O is that we can observe the algorithm during training. We can watch thousands of trajectories. If the algorithm reliably converges on the training data, can we mathematically promise it will converge on unseen test data?
This brings us to the core logic of the paper, illustrated simply below:
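In symbols, with \(A\) and \(B\) denoting the observable events and \(C\) the set of convergent sequences (all explained in the list below), the logic is nothing more than monotonicity of probability:

\[
A \cap B \subseteq C \quad \Longrightarrow \quad \mathbb{P}(A \cap B) \;\le\; \mathbb{P}(C).
\]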

Here is the intuition behind the formula above:
- Let C be the set of sequences that converge (what we want).
- Observing convergence directly is hard because it is an “asymptotic” property—it happens at infinity.
- However, there are observable properties—let’s call them A and B—that mathematically imply convergence.
- If we observe A and B happening, we know C is happening.
- Therefore, the probability of A and B occurring is a lower bound on the probability of convergence.
The Mathematical Framework
Let’s formalize the problem. We want to minimize a loss function \(\ell(z, p)\), where \(z\) is our variable and \(p\) represents the parameters of the specific problem instance:
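Written out, with \(\mathcal{P}\) denoting the distribution over problem instances (notation assumed here for readability), each instance asks us to solve:

\[
\min_{z} \; \ell(z, p), \qquad p \sim \mathcal{P}.
\]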

Our learned algorithm \(\mathcal{A}\) updates the state \(z\) iteratively based on hyperparameters \(h\), problem parameters \(p\), and some internal randomness \(r\):
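A generic way to write such an iterative scheme (the paper's exact signature may differ slightly) is:

\[
z^{(k+1)} = \mathcal{A}\big(h, z^{(k)}, p, r^{(k)}\big), \qquad k = 0, 1, 2, \dots
\]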

The goal is to prove that the sequence produced by this update converges to a critical point (where the gradient is zero).
The “Observable” Proxies for Convergence
Since we cannot run the algorithm for infinite time to check convergence, the authors utilize a powerful result from variational analysis known as the Kurdyka-Łojasiewicz (KL) framework.
This framework states that a sequence will converge to a critical point if it satisfies three specific conditions. The authors translate these conditions into measurable sets.
1. The Sufficient-Descent Condition (\(A_{desc}\)): The algorithm must decrease the loss value by an amount proportional to the squared step size. This ensures we are actually making progress.
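A standard way to state this, with some constant \(c > 0\) (the constant is an assumption of this sketch), is:

\[
\ell\big(z^{(k+1)}, p\big) \;\le\; \ell\big(z^{(k)}, p\big) \;-\; c\,\big\|z^{(k+1)} - z^{(k)}\big\|^2 \quad \text{for all } k.
\]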

2. The Relative-Error Condition (\(A_{err}\)): The size of the gradient (or subgradient \(v\)) must be bounded by the step size. This links the geometry of the function to the movement of the algorithm.
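A typical formulation, with a constant \(C > 0\) and \(v^{(k+1)}\) a subgradient at the new iterate (again, the constant is an assumption of this sketch), is:

\[
\exists\, v^{(k+1)} \in \partial \ell\big(z^{(k+1)}, p\big): \quad \big\|v^{(k+1)}\big\| \;\le\; C\,\big\|z^{(k+1)} - z^{(k)}\big\|.
\]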

3. The Boundedness Condition (\(A_{bound}\)): The trajectory of the algorithm shouldn't shoot off to infinity. It must stay within a finite region.
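In symbols, for some radius \(R > 0\) (a sketch of the condition):

\[
\sup_{k \in \mathbb{N}} \big\|z^{(k)}\big\| \;\le\; R.
\]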

The Critical Implication: If a sequence satisfies all three of these conditions (and the function is a “KL function,” which covers most practical problems), then the sequence must converge to a critical point.
We can express this relationship probabilistically:
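Since the intersection of the three observable sets is contained in the set of convergent runs (for KL functions), monotonicity of probability gives:

\[
\mathbb{P}\big(A_{desc} \cap A_{err} \cap A_{bound}\big) \;\le\; \mathbb{P}\big(A_{conv}\big).
\]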

This inequality is the key. We can measure the left side (Descent, Error, Boundedness) during training. Because the left side implies the right side, establishing a high probability for the left side guarantees a high probability for convergence.
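To make the “measure the left side during training” step concrete, here is a minimal sketch of how one could estimate the empirical satisfaction rate from recorded training trajectories. The constants `c`, `C_err`, and `R` and the trajectory format are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def satisfies_conditions(losses, iterates, subgrad_norms, c=1e-3, C_err=1e3, R=1e3):
    """Check the three observable conditions on one recorded trajectory.

    losses:        array of shape (K+1,)   -- loss value at each iterate
    iterates:      array of shape (K+1, d) -- the iterates z^(0), ..., z^(K)
    subgrad_norms: array of shape (K+1,)   -- norm of a (sub)gradient at each iterate
    """
    steps = np.linalg.norm(iterates[1:] - iterates[:-1], axis=1)

    descent = np.all(losses[1:] <= losses[:-1] - c * steps**2)   # sufficient descent
    rel_err = np.all(subgrad_norms[1:] <= C_err * steps)         # relative error
    bounded = np.all(np.linalg.norm(iterates, axis=1) <= R)      # boundedness

    return descent and rel_err and bounded

def empirical_rate(trajectories):
    # Fraction of training problems on which all three conditions hold:
    # this is the observable quantity that feeds the probabilistic bound.
    return np.mean([satisfies_conditions(*t) for t in trajectories])
```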
The Generalization Theorem
Now that we have observable conditions, how do we ensure they hold for new data? This is where PAC-Bayesian Theory comes in.
PAC-Bayes (Probably Approximately Correct, Bayesian style) allows us to bound the “risk” (failure rate) on test data based on the risk observed on training data, plus a “complexity” penalty.
The authors derive a specific theorem for this setting. It looks intimidating, but let’s break it down:

- Left Side (\(\rho[\dots]\)): The true risk on unseen data, i.e., the probability (averaged over the learned “posterior” \(\rho\)) that the convergence conditions fail on a new problem. Controlling this is equivalent to lower-bounding the convergence probability we actually care about.
- Right Side (The Bound):
  - \(\frac{1}{N}\sum...\): The empirical failure rate observed during training (how often did the descent, relative-error, and boundedness conditions fail?).
  - \(D_{KL}\): A term measuring how complex our learned model is (specifically, how far the learned “posterior” distribution \(\rho\) has moved from the initial “prior”).
  - \(\Phi^{-1}\): A function that essentially says “the true risk is close to the empirical risk, plus a margin of error.”
Simplifying the Result: If we denote our observable conditions (Descent + Error + Boundedness) as set A, the theorem allows us to state with high confidence (\(1-\epsilon\)) that:
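Schematically (this is a simplified rendering consistent with the plain-English statement below; \(\psi\) stands in for the exact penalty term, and \(p_1, \dots, p_N\) are the training problems):

\[
\mathbb{P}_{p}\big(A\big) \;\ge\; \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}_{A}(p_n) \;-\; \psi\big(D_{KL},\, N,\, \epsilon\big).
\]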

In plain English: “The probability that our AI optimizer converges on a new problem is at least as high as the rate we observed during training, minus a small statistical penalty.”
Measurability: The Hidden Challenge
While the logic above sounds straightforward, the paper dedicates significant effort to Measurability. In advanced probability theory, you cannot assign a probability to a set if it isn’t “measurable.”
Defining the set of “all sequences that converge to a critical point” (\(A_{conv}\)) is mathematically tricky because it involves limits and existential quantifiers.
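To see why, note that a natural way to write this set (a sketch of the definition, not necessarily the paper's exact notation) already mixes a limit with an existential quantifier:

\[
A_{conv} = \Big\{ \big(z^{(k)}\big)_{k \in \mathbb{N}} \;:\; \exists\, z^{*} \text{ such that } \lim_{k \to \infty} z^{(k)} = z^{*} \text{ and } 0 \in \partial \ell\big(z^{*}, p\big) \Big\}.
\]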

The authors prove—through rigorous lemmas provided in the appendices—that sets like \(A_{conv}\) and the set of critical points \(A_{crit}\) are indeed measurable.

This theoretical legwork is crucial. Without it, the application of the PAC-Bayesian theorem would be mathematically invalid.
Experiments: Does it work?
The authors put their theory to the test on two distinct tasks: solving quadratic problems and training a neural network.
1. Quadratic Problems
They trained an algorithm to solve quadratic minimization problems. The structure of the algorithm was a specialized neural network that processes gradients and momentum.
The Problem:
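The precise parameterization is specified in the paper; a generic convex quadratic of the kind used in such experiments looks like:

\[
\ell(z, p) = \tfrac{1}{2}\, z^{\top} A z - b^{\top} z, \qquad p = (A, b), \quad A \succ 0.
\]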

The Algorithm Architecture: They used a “coordinate-wise” architecture built from 1x1 convolutions. This lets the model process each input dimension independently while sharing learned weights across dimensions, making it efficient and independent of the problem size.
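As a rough illustration of what a coordinate-wise 1x1-convolution update looks like (the input features and layer sizes here are assumptions, not the paper's exact architecture): the gradient and momentum are stacked as channels, and kernel-size-1 convolutions apply the same small network to every coordinate.

```python
import torch
import torch.nn as nn

class CoordinateWiseUpdate(nn.Module):
    """Maps per-coordinate features (e.g. gradient, momentum) to a per-coordinate step.

    Kernel size 1 means no mixing across coordinates: the same weights are applied
    to every dimension, so the module works for problems of any size.
    """
    def __init__(self, in_features=2, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_features, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, grad, momentum):
        # grad, momentum: tensors of shape (dim,)
        x = torch.stack([grad, momentum], dim=0).unsqueeze(0)   # (1, 2, dim)
        return self.net(x).squeeze(0).squeeze(0)                # (dim,)

update = CoordinateWiseUpdate()
step = update(torch.randn(100), torch.randn(100))  # same weights work for any dimension
```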

The Results: The learned algorithm was compared against the “Heavy-ball with friction” (HBF) method, a strong baseline.

- Left Plot: The learned algorithm (pink) converges significantly faster than the baseline (blue).
- Right Plot: This is the most interesting part.
  - The Yellow bar is the estimated probability of satisfying the conditions (Descent, Error, Boundedness).
  - The Purple bar is the actual convergence probability (which they could check since this is a controlled experiment).
  - The Orange Dotted Line is the PAC-bound predicted by the theorem.
- Takeaway: The theoretical bound (orange line) gives a valid lower bound on the true convergence probability. The theory holds.
2. Training a Neural Network
Next, they applied the method to a non-convex, non-smooth problem: training a small neural network to perform regression.
The Algorithm Architecture: This architecture is slightly more complex, using attention-like mechanisms to weight different input features (gradients, momentum, loss differences) before predicting an update step.
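A rough sketch of the idea (the feature set and layer sizes are assumptions for illustration, not the paper's design): compute softmax scores over the per-coordinate input features, reweight them, and let a small head predict the update step.

```python
import torch
import torch.nn as nn

class AttentionWeightedUpdate(nn.Module):
    """Weights several per-coordinate features before predicting an update step."""
    def __init__(self, num_features=3, hidden=16):
        super().__init__()
        self.scores = nn.Linear(num_features, num_features)  # attention-like logits per feature
        self.head = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, features):
        # features: (dim, num_features), e.g. gradient, momentum, loss difference per coordinate
        weights = torch.softmax(self.scores(features), dim=-1)  # weight each input feature
        return self.head(weights * features).squeeze(-1)        # per-coordinate step, shape (dim,)
```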

The Results: They compared the learned optimizer against Adam, the industry standard.

- Left/Middle Plots: The learned algorithm (pink) drives the loss down faster than Adam (blue).
- Right Plot: The PAC-bound (orange dotted line) predicts a success rate of over 90%. This means we can guarantee, with high probability, that this “black box” optimizer will find a critical point on new, unseen data.
Conclusion
This paper bridges a massive gap in Learning-to-Optimize. For years, practitioners had to choose between the raw speed of learned algorithms and the safety of traditional mathematical proofs.
By combining Variational Analysis (to define what convergence “looks like”) with PAC-Bayesian Theory (to prove those looks generalize), the authors provide a framework where we can trust the black box. We don’t need to handcuff the AI with strict safeguards; we just need to observe it carefully during training.
This opens the door for high-performance, specialized optimizers that are both blazing fast and theoretically reliable. Future work aims to extend the framework to stochastic optimization (where updates are computed from noisy mini-batch gradients), which would make it applicable to even larger-scale deep learning tasks.