Imagine you have a race car. You can tune the engine yourself (manual optimization), or you can train an AI to tune it for you (Learning-to-Optimize). The AI version is often significantly faster, zooming past the finish line while you’re still tweaking the carburetor.
But there is a catch: Can you trust the AI?
In classical optimization (like Gradient Descent or Adam), we have mathematical proofs guaranteeing that, eventually, the car will stop at the finish line (a critical point). In Learning-to-Optimize (L2O), the algorithm is often a neural network—a “black box.” Historically, proving that this black box will actually converge has been a nightmare. To get a guarantee, researchers often had to “safeguard” the AI, essentially forcing it to behave like a slow, classical algorithm, which defeats the purpose of using AI in the first place.
In this post, we will dive into a fascinating paper that proposes a new way to solve this dilemma. Instead of forcing the AI to be simple, the authors use probabilistic generalization to prove that if an algorithm looks like it’s converging during training, it will probably converge during testing.
The Core Problem: Speed vs. Safety
Learning-to-optimize (L2O) leverages machine learning to accelerate optimization algorithms. You train a model (usually a Recurrent Neural Network or similar) on a distribution of problems, and it learns update rules that are often much faster than hand-crafted rules.
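To make this concrete, here is a minimal, hypothetical sketch of the L2O recipe in PyTorch: a small network (called `LearnedOptimizer` here purely for illustration) maps gradients to update steps and is trained by unrolling it on randomly sampled quadratic problems. This is a sketch of the general idea, not the architecture or training procedure from the paper.

```python
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    """Hypothetical learned update rule: maps each gradient coordinate to a step."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, grad):
        # Coordinate-wise: each gradient entry is transformed by the same small network.
        return self.net(grad.unsqueeze(-1)).squeeze(-1)

def sample_problem(dim=10):
    # Random convex quadratic instance: loss(z) = 0.5 * z^T A z - b^T z
    M = torch.randn(dim, dim)
    A = M @ M.T + torch.eye(dim)
    b = torch.randn(dim)
    return lambda z: 0.5 * z @ A @ z - b @ z

opt_net = LearnedOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)

for step in range(200):                       # meta-training over a distribution of problems
    loss_fn = sample_problem()
    z = torch.zeros(10, requires_grad=True)
    total = 0.0
    for k in range(20):                       # unrolled inner optimization
        loss = loss_fn(z)
        grad, = torch.autograd.grad(loss, z, create_graph=True)
        z = z - opt_net(grad)                 # learned update rule
        total = total + loss_fn(z)            # meta-loss: how well did the unrolled run do?
    meta_opt.zero_grad()
    total.backward()
    meta_opt.step()
```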
However, theoretical guarantees are missing. In traditional optimization, convergence is proved using geometric arguments (e.g., “the slope always points down, so we must go down”). In L2O, the update step is a complex, non-linear function output by a neural network. You cannot easily apply geometric logic to it.
This leads to two camps in current research:
- No Guarantees: Just train it and hope it works. (Fast, but risky).
- Safeguards: Force the output of the neural network to satisfy strict geometric conditions. (Safe, but restricts performance—often degrading it back to the level of classical algorithms).
The authors of this paper ask: Can we keep the speed of the black box but still get a guarantee?
A New Perspective: Generalization
The authors advocate for a shift in perspective. Instead of treating the L2O model as a static geometric object, they treat it as a statistical learner.
The main advantage we have in L2O is that we can observe the algorithm during training. We can watch thousands of trajectories. If the algorithm reliably converges on the training data, can we mathematically promise it will converge on unseen test data?
This brings us to the core logic of the paper, illustrated simply below:
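In symbols, with \(A\) and \(B\) denoting the observable events and \(C\) the set of convergent sequences (all explained in the list below), the logic is nothing more than monotonicity of probability:

\[
A \cap B \subseteq C \quad \Longrightarrow \quad \mathbb{P}(A \cap B) \;\le\; \mathbb{P}(C).
\]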

Here is the intuition behind the formula above:
- Let C be the set of sequences that converge (what we want).
- Observing convergence directly is hard because it is an “asymptotic” property—it happens at infinity.
- However, there are observable properties—let’s call them A and B—that mathematically imply convergence.
- If we observe A and B happening, we know C is happening.
- Therefore, the probability of A and B occurring is a lower bound on the probability of convergence.
The Mathematical Framework
Let’s formalize the problem. We want to minimize a loss function \(\ell(z, p)\), where \(z\) is our variable and \(p\) represents the parameters of the specific problem instance:
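Written out, with \(\mathcal{P}\) denoting the distribution over problem instances (notation assumed here for readability), each instance asks us to solve:

\[
\min_{z} \; \ell(z, p), \qquad p \sim \mathcal{P}.
\]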

Our learned algorithm \(\mathcal{A}\) updates the state \(z\) iteratively based on hyperparameters \(h\), problem parameters \(p\), and some internal randomness \(r\):
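A generic way to write such an iterative scheme (the paper's exact signature may differ slightly) is:

\[
z^{(k+1)} = \mathcal{A}\big(h, z^{(k)}, p, r^{(k)}\big), \qquad k = 0, 1, 2, \dots
\]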

The goal is to prove that the sequence produced by this update converges to a critical point (where the gradient is zero).
The “Observable” Proxies for Convergence
Since we cannot run the algorithm for infinite time to check convergence, the authors utilize a powerful result from variational analysis known as the Kurdyka-Łojasiewicz (KL) framework.
This framework states that a sequence will converge to a critical point if it satisfies three specific conditions. The authors translate these conditions into measurable sets.
1. The Sufficient-Descent Condition (\(A_{desc}\)): The algorithm must decrease the loss value by an amount proportional to the squared step size. This ensures we are actually making progress.
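A standard way to state this, with some constant \(c > 0\) (the constant is an assumption of this sketch), is:

\[
\ell\big(z^{(k+1)}, p\big) \;\le\; \ell\big(z^{(k)}, p\big) \;-\; c\,\big\|z^{(k+1)} - z^{(k)}\big\|^2 \quad \text{for all } k.
\]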

2. The Relative-Error Condition (\(A_{err}\)): The size of the gradient (or subgradient \(v\)) must be bounded by the step size. This links the geometry of the function to the movement of the algorithm.
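A typical formulation, with a constant \(C > 0\) and \(v^{(k+1)}\) a subgradient at the new iterate (again, the constant is an assumption of this sketch), is:

\[
\exists\, v^{(k+1)} \in \partial \ell\big(z^{(k+1)}, p\big): \quad \big\|v^{(k+1)}\big\| \;\le\; C\,\big\|z^{(k+1)} - z^{(k)}\big\|.
\]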

3. The Boundedness Condition (\(A_{bound}\)): The trajectory of the algorithm shouldn't shoot off to infinity. It must stay within a finite region.
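In symbols, for some radius \(R > 0\) (a sketch of the condition):

\[
\sup_{k \in \mathbb{N}} \big\|z^{(k)}\big\| \;\le\; R.
\]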

The Critical Implication: If a sequence satisfies all three of these conditions (and the function is a “KL function,” which covers most practical problems), then the sequence must converge to a critical point.
We can express this relationship probabilistically:
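Since the intersection of the three observable sets is contained in the set of convergent runs (for KL functions), monotonicity of probability gives:

\[
\mathbb{P}\big(A_{desc} \cap A_{err} \cap A_{bound}\big) \;\le\; \mathbb{P}\big(A_{conv}\big).
\]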

This inequality is the key. We can measure the left side (Descent, Error, Boundedness) during training. Because the left side implies the right side, establishing a high probability for the left side guarantees a high probability for convergence.
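To make the “measure the left side during training” step concrete, here is a minimal sketch of how one could estimate the empirical satisfaction rate from recorded training trajectories. The constants `c`, `C_err`, and `R` and the trajectory format are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def satisfies_conditions(losses, iterates, subgrad_norms, c=1e-3, C_err=1e3, R=1e3):
    """Check the three observable conditions on one recorded trajectory.

    losses:        array of shape (K+1,)   -- loss value at each iterate
    iterates:      array of shape (K+1, d) -- the iterates z^(0), ..., z^(K)
    subgrad_norms: array of shape (K+1,)   -- norm of a (sub)gradient at each iterate
    """
    steps = np.linalg.norm(iterates[1:] - iterates[:-1], axis=1)

    descent = np.all(losses[1:] <= losses[:-1] - c * steps**2)   # sufficient descent
    rel_err = np.all(subgrad_norms[1:] <= C_err * steps)         # relative error
    bounded = np.all(np.linalg.norm(iterates, axis=1) <= R)      # boundedness

    return descent and rel_err and bounded

def empirical_rate(trajectories):
    # Fraction of training problems on which all three conditions hold:
    # this is the observable quantity that feeds the probabilistic bound.
    return np.mean([satisfies_conditions(*t) for t in trajectories])
```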
The Generalization Theorem
Now that we have observable conditions, how do we ensure they hold for new data? This is where PAC-Bayesian Theory comes in.
PAC-Bayes (Probably Approximately Correct, Bayesian style) allows us to bound the “risk” (failure rate) on test data based on the risk observed on training data, plus a “complexity” penalty.
The authors derive a specific theorem for this setting. It looks intimidating, but let’s break it down:

- Left Side (\(\rho[\dots]\)): The true risk on unseen data, i.e., the probability (averaged over the learned “posterior” \(\rho\)) that the convergence conditions fail on a new problem. Controlling this is equivalent to lower-bounding the convergence probability we actually care about.
- Right Side (The Bound):
  - \(\frac{1}{N}\sum...\): The empirical failure rate observed during training (how often did the descent, relative-error, and boundedness conditions fail?).
  - \(D_{KL}\): A term measuring how complex our learned model is (specifically, how far the learned “posterior” distribution \(\rho\) has moved from the initial “prior”).
  - \(\Phi^{-1}\): A function that essentially says “the true risk is close to the empirical risk, plus a margin of error.”
Simplifying the Result: If we denote our observable conditions (Descent + Error + Boundedness) as set A, the theorem allows us to state with high confidence (\(1-\epsilon\)) that:
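Schematically (this is a simplified rendering consistent with the plain-English statement below; \(\psi\) stands in for the exact penalty term, and \(p_1, \dots, p_N\) are the training problems):

\[
\mathbb{P}_{p}\big(A\big) \;\ge\; \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}_{A}(p_n) \;-\; \psi\big(D_{KL},\, N,\, \epsilon\big).
\]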

In plain English: “The probability that our AI optimizer converges on a new problem is at least as high as the rate we observed during training, minus a small statistical penalty.”
Measurability: The Hidden Challenge
While the logic above sounds straightforward, the paper dedicates significant effort to Measurability. In advanced probability theory, you cannot assign a probability to a set if it isn’t “measurable.”
Defining the set of “all sequences that converge to a critical point” (\(A_{conv}\)) is mathematically tricky because it involves limits and existential quantifiers.
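To see why, note that a natural way to write this set (a sketch of the definition, not necessarily the paper's exact notation) already mixes a limit with an existential quantifier:

\[
A_{conv} = \Big\{ \big(z^{(k)}\big)_{k \in \mathbb{N}} \;:\; \exists\, z^{*} \text{ such that } \lim_{k \to \infty} z^{(k)} = z^{*} \text{ and } 0 \in \partial \ell\big(z^{*}, p\big) \Big\}.
\]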

The authors prove—through rigorous lemmas provided in the appendices—that sets like \(A_{conv}\) and the set of critical points \(A_{crit}\) are indeed measurable.

This theoretical legwork is crucial. Without it, the application of the PAC-Bayesian theorem would be mathematically invalid.
Experiments: Does it work?
The authors put their theory to the test on two distinct tasks: solving quadratic problems and training a neural network.
1. Quadratic Problems
They trained an algorithm to solve quadratic minimization problems. The structure of the algorithm was a specialized neural network that processes gradients and momentum.
The Problem:
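The precise parameterization is specified in the paper; a generic convex quadratic of the kind used in such experiments looks like:

\[
\ell(z, p) = \tfrac{1}{2}\, z^{\top} A z - b^{\top} z, \qquad p = (A, b), \quad A \succ 0.
\]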

The Algorithm Architecture: They used a “coordinate-wise” architecture built from 1x1 convolutions. This lets the model process each input dimension independently while sharing learned weights across dimensions, making it efficient and independent of the problem size.
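As a rough illustration of what a coordinate-wise 1x1-convolution update looks like (the input features and layer sizes here are assumptions, not the paper's exact architecture): the gradient and momentum are stacked as channels, and kernel-size-1 convolutions apply the same small network to every coordinate.

```python
import torch
import torch.nn as nn

class CoordinateWiseUpdate(nn.Module):
    """Maps per-coordinate features (e.g. gradient, momentum) to a per-coordinate step.

    Kernel size 1 means no mixing across coordinates: the same weights are applied
    to every dimension, so the module works for problems of any size.
    """
    def __init__(self, in_features=2, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_features, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, grad, momentum):
        # grad, momentum: tensors of shape (dim,)
        x = torch.stack([grad, momentum], dim=0).unsqueeze(0)   # (1, 2, dim)
        return self.net(x).squeeze(0).squeeze(0)                # (dim,)

update = CoordinateWiseUpdate()
step = update(torch.randn(100), torch.randn(100))  # same weights work for any dimension
```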

The Results: The learned algorithm was compared against the “Heavy-ball with friction” (HBF) method, a strong baseline.

- Left Plot: The learned algorithm (pink) converges significantly faster than the baseline (blue).
- Right Plot: This is the most interesting part.
  - The Yellow bar is the estimated probability of satisfying the conditions (Descent, Error, Boundedness).
  - The Purple bar is the actual convergence probability (which they could check since this is a controlled experiment).
  - The Orange Dotted Line is the PAC-bound predicted by the theorem.
- Takeaway: The theoretical bound (orange line) gives a valid lower bound on the true convergence probability. The theory holds.
2. Training a Neural Network
Next, they applied the method to a non-convex, non-smooth problem: training a small neural network to perform regression.
The Algorithm Architecture: This architecture is slightly more complex, using attention-like mechanisms to weight different input features (gradients, momentum, loss differences) before predicting an update step.
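A rough sketch of the idea (the feature set and layer sizes are assumptions for illustration, not the paper's design): compute softmax scores over the per-coordinate input features, reweight them, and let a small head predict the update step.

```python
import torch
import torch.nn as nn

class AttentionWeightedUpdate(nn.Module):
    """Weights several per-coordinate features before predicting an update step."""
    def __init__(self, num_features=3, hidden=16):
        super().__init__()
        self.scores = nn.Linear(num_features, num_features)  # attention-like logits per feature
        self.head = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, features):
        # features: (dim, num_features), e.g. gradient, momentum, loss difference per coordinate
        weights = torch.softmax(self.scores(features), dim=-1)  # weight each input feature
        return self.head(weights * features).squeeze(-1)        # per-coordinate step, shape (dim,)
```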

The Results: They compared the learned optimizer against Adam, the industry standard.

- Left/Middle Plots: The learned algorithm (pink) drives the loss down faster than Adam (blue).
- Right Plot: The PAC-bound (orange dotted line) predicts a success rate of over 90%. This means we can guarantee, with high probability, that this “black box” optimizer will find a critical point on new, unseen data.
Conclusion
This paper bridges a massive gap in Learning-to-Optimize. For years, practitioners had to choose between the raw speed of learned algorithms and the safety of traditional mathematical proofs.
By combining Variational Analysis (to define what convergence “looks like”) with PAC-Bayesian Theory (to prove those looks generalize), the authors provide a framework where we can trust the black box. We don’t need to handcuff the AI with strict safeguards; we just need to observe it carefully during training.
This opens the door for high-performance, specialized optimizers that are both blazing fast and theoretically reliable. Future work aims to extend the framework to stochastic optimization (where updates are computed from noisy mini-batch gradients), which would make it applicable to even larger-scale deep learning tasks.