Introduction: The High-Stakes Game of Minimax Optimization

Imagine a game with two players. One player—the Minimizer—wants to make a certain value as small as possible. The other—the Maximizer—wants to make that same value as large as possible. They take turns, each trying to outsmart the other. This is the essence of a minimax problem, a concept that sits at the heart of many modern machine learning marvels.

From training Generative Adversarial Networks (GANs) that create photorealistic images to building AI models robust to adversarial attacks, minimax optimization is everywhere. The core task is to solve problems of the form:

\[ \min_{x \in \mathcal{X}} \; \max_{y \in \mathcal{Y}} \; f(x, y) \]

The standard mathematical form of a minimax optimization problem, where \( \mathcal{X} \) and \( \mathcal{Y} \) are the players' constraint sets.

Here, we want to find a point \( x \) that minimizes the function \( f(x, y) \) after the second player has maximized it over \( y \).
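To make the nesting concrete, here is a tiny brute-force illustration (the objective and grid are ours, not the paper's): for each candidate \( x \), the inner player responds with the worst-case \( y \), and only then does the outer player minimize.

```python
import numpy as np

# Brute-force min-max on a grid: inner max over y first, then outer min over x.
f = lambda x, y: (x - 1) ** 2 - (y - x) ** 2   # toy objective, concave in y
xs = ys = np.linspace(-2.0, 2.0, 401)
X, Y = np.meshgrid(xs, ys, indexing="ij")      # X varies along rows, Y along columns
inner_max = f(X, Y).max(axis=1)                # worst-case y for each candidate x
x_star = xs[inner_max.argmin()]                # the minimizer of the worst case
print(f"x* = {x_star:.2f}, minimax value = {inner_max.min():.2f}")  # x* = 1.00, value = 0.00
```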

When \( f \) is convex in \( x \) and concave in \( y \) (the convex-concave setting), efficient algorithms abound. But in many cutting-edge applications, \( f(x, y) \) is nonconvex for the minimizer and nonconcave for the maximizer. This “wild west” of optimization—Nonconvex–Nonconcave (NC-NC)—is notoriously difficult and computationally expensive. Existing algorithms are either too complex (multi-loop structures) or too slow, with convergence rates that are impractical in large-scale problems.

This is where a recent NeurIPS 2023 paper, A Single-Loop Accelerated Extra-Gradient Difference Algorithm with Improved Complexity Bounds for Constrained Minimax Optimization, makes its entrance. The authors propose the Extra-Gradient Difference Acceleration (EGDA) algorithm: a single-loop method—simple to implement yet dramatically faster—that achieves a theoretical complexity of \( O(\epsilon^{-2}) \), shattering previous bounds of \( \widetilde{\mathcal{O}}(\epsilon^{-4}) \) and \( \widetilde{\mathcal{O}}(\epsilon^{-2.5}) \).
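To get a feel for those exponents: at a target accuracy of \( \epsilon = 10^{-2} \), a \( \widetilde{\mathcal{O}}(\epsilon^{-4}) \) rate corresponds to roughly \( 10^{8} \) gradient evaluations (ignoring log factors), whereas \( O(\epsilon^{-2}) \) corresponds to roughly \( 10^{4} \): a ten-thousand-fold reduction.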

In this post, we’ll explore why minimax optimization is so challenging, how EGDA works, and how it performs in practice.


Background: The Landscape of Minimax Algorithms

Flavors of Minimax Problems

Minimax problems vary based on the “shape” of \( f(x, y) \):

  • Convex–Concave (C-C): Convex in \( x \) (like a bowl), concave in \( y \) (like an upside-down bowl). This is the best-understood and most tractable setting.
  • Nonconvex–Concave (NC-C): Nonconvex in \( x \) but concave in \( y \). Common in adversarial training of deep neural networks.
  • Nonconvex–Nonconcave (NC-NC): Nonconvex in \( x \) and nonconcave in \( y \). The most general and difficult case—finding global solutions is NP-hard.

This paper focuses on the challenging NC-C and NC-NC settings.

Existing Approaches

The simplest approach is Gradient Descent–Ascent (GDA): move \( x \) in the direction of \(-\nabla_x f\) and \( y \) in the direction of \(+\nabla_y f\). Unfortunately, GDA can cycle without converging for many problems.
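The textbook illustration of this failure is the bilinear game \( f(x, y) = xy \), whose unique saddle point is the origin. A minimal demonstration (this toy problem is our illustration, not an experiment from the paper):

```python
import numpy as np

# Simultaneous GDA on f(x, y) = x * y: descend in x, ascend in y.
eta = 0.1
x, y = 1.0, 1.0
for _ in range(100):
    gx, gy = y, x                        # grad_x f = y, grad_y f = x
    x, y = x - eta * gx, y + eta * gy    # simultaneous update
print(f"distance from the saddle (0, 0): {np.hypot(x, y):.3f}")
# Each step multiplies the squared distance to the origin by (1 + eta^2),
# so the iterates spiral outward instead of converging.
```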

To improve stability:

  • Multi-Loop Algorithms: An outer loop updates one variable, inner loops solve optimization subproblems for the other variable. They can be theoretically sound but slow in practice.
  • Single-Loop Algorithms: More practical, since both variables are updated within a single loop. Smoothed-GDA, for example, adds a proximal smoothing term to \( f \) to stabilize the iterates (see the sketch below) but typically settles for suboptimal complexity bounds.
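As a flavor of the smoothing idea, here is a sketch following the usual Smoothed-GDA construction; the weight \( p \) and the update schedule for the auxiliary anchor \( z \) are algorithm-specific choices, not values taken from this paper.

```python
import numpy as np

# Smoothed-GDA-style surrogate: the proximal term makes the x-subproblem
# strongly convex near the anchor z, stabilizing the GDA dynamics.
p = 10.0

def smoothed_objective(f, x, y, z):
    """K(x, z; y) = f(x, y) + (p / 2) * ||x - z||^2."""
    return f(x, y) + 0.5 * p * np.sum((x - z) ** 2)

# The anchor is then dragged slowly toward the iterate, e.g.
#   z <- z + beta * (x - z)   with a small beta,
# so the added bias vanishes as the method converges.
```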

Another class is Extra-Gradient (EG) methods, which use a “predict–correct” sequence: take a prediction step, evaluate the gradient at that new point, then use it to make a correction from the original position.

\[ z_{t+1/2} = z_t - \eta F(z_t), \qquad z_{t+1} = z_t - \eta F(z_{t+1/2}), \]

where \( z = (x, y) \) and \( F(z) = \big(\nabla_x f(x, y),\, -\nabla_y f(x, y)\big) \) is the saddle operator. The extra-gradient framework adds a stabilizing prediction step before the main update.
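On the same bilinear toy problem where GDA diverged, the look-ahead gradient cancels the rotation (again our illustration, not the paper's):

```python
import numpy as np

# Extra-gradient on f(x, y) = x * y.
eta = 0.1
x, y = 1.0, 1.0
for _ in range(1000):
    # Predict: a provisional GDA step to a look-ahead point.
    x_h, y_h = x - eta * y, y + eta * x
    # Correct: update from the ORIGINAL point using the look-ahead gradient.
    x, y = x - eta * y_h, y + eta * x_h
print(f"distance from the saddle (0, 0): {np.hypot(x, y):.1e}")
# The squared distance now shrinks by ((1 - eta^2)^2 + eta^2) < 1 per step,
# so the iterates converge to the saddle point.
```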


Complexity Landscape

The tables below compare algorithm complexities. A smaller exponent of \( \epsilon \) means faster convergence toward an \(\epsilon\)-stationary point.

Table 1 (from the paper): complexity of algorithms for finding a stationary point of \( f \). EGDA achieves the best listed bound, \( O(\epsilon^{-2}) \), across the NC-C, C-NC, and NC-NC settings.

Table 2 (from the paper): complexities for finding a stationary point of the primal function \( \phi \) in the NC-C setting. EGDA improves the best-known result from \( O(\epsilon^{-3}) \) to \( O(\epsilon^{-2}) \).

The challenge: create a single-loop, directly accelerated algorithm beating state-of-the-art complexities for difficult minimax problems.


The Core Method: Inside the EGDA Algorithm

The EGDA algorithm combines a standard primal update for \( x \) with a novel triple-part accelerated update for \( y \):

  1. Primal Gradient Descent (x):

\[ x_{t+1} = \mathcal{P}_{\mathcal{X}}\!\big(x_t - \eta_x \nabla_x f(x_t, y_t)\big) \]

A projected (proximal) gradient descent step moves \( x \) toward decreasing \( f \) and projects it back onto \( \mathcal{X} \).

  2. Extra-Gradient Difference Prediction (y):

The prediction step constructs a point \( u_{t+1/2} \) from the difference of gradients at earlier iterates; this is the paper's novel extra-gradient difference step for the dual variable \( y \).

This gradient-difference term acts like a second-order correction: by a Taylor expansion, \( \nabla_y f(x, y_t) - \nabla_y f(x, y_{t-1}) \approx \nabla_{yy}^2 f(x, y_t)\,(y_t - y_{t-1}) \), so the difference captures curvature information without ever forming a Hessian.

  3. Gradient Ascent Correction (y):

The correction step performs a standard extra-gradient update from \( y_t \), using the gradient evaluated at the prediction point \( u_{t+1/2} \).

  4. Momentum Acceleration (y):

The momentum step combines \( y_t \) and \( u_{t+1} \) to finalize \( y_{t+1} \), improving the convergence speed.

This dual-acceleration scheme—a gradient-difference prediction, correction, and momentum—is EGDA’s core innovation.
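Putting the four steps together, here is a single-loop sketch of an EGDA-style iteration. To be clear, this is our schematic reading of the update structure: the step sizes, the weight \( \alpha \) on the gradient-difference prediction, the momentum weight \( \beta \), and the exact points at which gradients are evaluated all follow the paper's algorithm, which should be consulted for the precise coefficients.

```python
import numpy as np

def egda_step(x, y, y_prev, grad_x, grad_y, proj_X, proj_Y,
              eta_x=0.05, eta_y=0.05, alpha=0.5, beta=0.5):
    """One EGDA-style iteration (illustrative; coefficients are placeholders).

    grad_x(x, y), grad_y(x, y): partial gradients of f.
    proj_X, proj_Y: Euclidean projections onto the constraint sets.
    """
    # 1. Primal projected gradient descent on x.
    x_new = proj_X(x - eta_x * grad_x(x, y))
    # 2. Extra-gradient difference prediction: the gradient difference at
    #    successive iterates supplies Hessian-free curvature information.
    u_half = y + alpha * (grad_y(x, y) - grad_y(x, y_prev))
    # 3. Gradient ascent correction from y, using the gradient at u_half.
    u_new = proj_Y(y + eta_y * grad_y(x, u_half))
    # 4. Momentum acceleration: blend the old iterate with the corrected one.
    y_new = (1.0 - beta) * y + beta * u_new
    return x_new, y_new, y   # y becomes y_prev on the next call

# Toy usage on f(x, y) = x*y - 0.5*y**2 over the box [-1, 1]^2:
clip = lambda v: np.clip(v, -1.0, 1.0)
gx = lambda x, y: y                      # df/dx
gy = lambda x, y: x - y                  # df/dy
x, y, y_prev = 1.0, 1.0, 1.0
for _ in range(500):
    x, y, y_prev = egda_step(x, y, y_prev, gx, gy, clip, clip)
print(f"({x:.3f}, {y:.3f})")             # approaches the saddle point (0, 0)
```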


Theoretical Linchpin: Quasi-Cocoercivity

In optimization, cocoercivity is a property that helps guarantee convergence, but it rarely holds in NC-NC problems. EGDA instead constructs a tailored variant called quasi-cocoercivity. Recall the standard definition first:

\[ \langle F(u) - F(v),\, u - v \rangle \;\geq\; \frac{1}{L}\, \| F(u) - F(v) \|^2 \]

Cocoercivity lower-bounds the inner product of operator differences by their squared norm. Gradients of smooth convex functions satisfy it, but the saddle operators of nonconvex–nonconcave problems rarely do.
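A quick numerical sanity check (our illustration): the gradient of a smooth convex function satisfies the inequality (this is the Baillon–Haddad theorem), while even a simple concave function violates it.

```python
import numpy as np

# Test <g(u) - g(v), u - v> >= (1/L) * ||g(u) - g(v)||^2 at random point pairs.
rng = np.random.default_rng(0)

def cocoercivity_holds(grad, L, trials=10_000):
    for _ in range(trials):
        u, v = rng.normal(size=3), rng.normal(size=3)
        d, gd = u - v, grad(u) - grad(v)
        if gd @ d < (gd @ gd) / L - 1e-12:
            return False
    return True

print(cocoercivity_holds(lambda z: z, L=1.0))    # True:  f(z) = ||z||^2 / 2 (convex)
print(cocoercivity_holds(lambda z: -z, L=1.0))   # False: f(z) = -||z||^2 / 2 (concave)
```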

By carefully choosing \( u_{t+1/2} \), EGDA ensures a property like cocoercivity exactly where needed in the proof:

Property 1 (quasi-cocoercivity, from the paper): a cocoercivity-like inequality that holds at the specially constructed prediction points.


Convergence Guarantees and Experiments

Optimality Measure

An \(\epsilon\)-stationary point is one where the norm of the gradient mapping \( \pi(\bar{x}, \bar{y}) \) is at most \(\epsilon\):

\[ \pi(\bar{x}, \bar{y}) = \begin{pmatrix} \frac{1}{\eta}\big(\bar{x} - \mathcal{P}_{\mathcal{X}}(\bar{x} - \eta \nabla_x f(\bar{x}, \bar{y}))\big) \\ \frac{1}{\eta}\big(\bar{y} - \mathcal{P}_{\mathcal{Y}}(\bar{y} + \eta \nabla_y f(\bar{x}, \bar{y}))\big) \end{pmatrix} \]

Definition 1: \( (\bar{x}, \bar{y}) \) is \(\epsilon\)-stationary when \( \| \pi(\bar{x}, \bar{y}) \| \leq \epsilon \) (shown here in the standard projected-gradient form; it reduces to the gradient of \( f \) in the unconstrained case).
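In code, this stationarity measure is cheap to evaluate; a minimal sketch, assuming the standard projected-gradient form shown above:

```python
import numpy as np

def grad_mapping_norm(x, y, grad_x, grad_y, proj_X, proj_Y, eta=0.1):
    """Norm of the gradient mapping pi(x, y); it vanishes exactly at
    stationary points of the constrained minimax problem."""
    gm_x = (x - proj_X(x - eta * grad_x(x, y))) / eta   # blocked descent motion
    gm_y = (y - proj_Y(y + eta * grad_y(x, y))) / eta   # blocked ascent motion
    return float(np.sqrt(np.sum(gm_x**2) + np.sum(gm_y**2)))

# Stopping rule: declare (x, y) epsilon-stationary once
# grad_mapping_norm(x, y, ...) <= epsilon.
```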

The analysis uses a potential function \( G_t \):

The potential function G_t used in the convergence analysis.

\( G_t \) combines the objective value with progress indicators and is shown to decrease by a sufficient amount at every iteration.


Main Results

Theorem 1 (NC-NC):

The complexity bound for EGDA in the constrained NC-NC setting.

EGDA achieves \( O(\epsilon^{-2}) \) for NC-NC problems—unprecedented for a single-loop algorithm.

Theorem 2 (NC-C, primal \( \phi \)):

The complexity bound for EGDA in the constrained NC-C setting for the primal function φ.

EGDA improves the best-known complexity for the NC-C primal function \( \phi \) from \( O(\epsilon^{-3}) \) to \( O(\epsilon^{-2}) \).


Experimental Results

1. Toy NC-NC Problem:

Figure 1 (from the paper): on a sample NC-NC problem, EGDA (black diamonds) converges markedly faster than GDA, EG, and FEG, both in distance to the solution and in gradient norm.

2. Convex-Concave Problem:

Figure 2 (from the paper): even in the convex–concave setting, EGDA outpaces GDA and EAG in convergence speed.

3. Adversarially Robust Neural Network Training (NC-C):

Figure 3 (from the paper): on robust neural network training over MNIST and Fashion-MNIST, EGDA (red solid line) drives the training loss down faster than Smoothed-GDA, MGDA, and GDA, matching its faster theoretical rate.


Conclusion and Implications

By introducing the Extra-Gradient Difference Acceleration (EGDA) algorithm, this work achieves a leap in speed and simplicity for minimax optimization:

  • Novel Acceleration Scheme: A dual-variable acceleration with a gradient-difference step capturing second-order effects from first-order data.
  • State-of-the-Art Complexity: \( O(\epsilon^{-2}) \) for constrained NC-NC, NC-C, and C-NC, improving or matching the best known results with weaker assumptions.
  • Practical Impact: Single-loop simplicity, proven faster convergence, and real-world performance boosts—in adversarial training and beyond.

This new approach not only delivers a better algorithm but also a fresh analytical technique—crafted quasi-cocoercivity—for tackling nonconvex–nonconcave optimization. The EGDA framework shows that with thoughtful design, we can break the speed limit in optimization, opening doors for faster, more robust AI across GANs, reinforcement learning, and adversarial defense.