Introduction

In the intersection of Economics, Computer Science, and AI, few concepts are as pivotal as the Nash Equilibrium (NE). It describes a state in a game where no player can benefit by changing their strategy while everyone else keeps theirs unchanged. From poker bots to automated financial trading and multi-agent robotics, finding an NE is often the ultimate goal.

However, calculating an NE is notoriously difficult. As games scale up—think of a game with dozens of players or thousands of possible moves—the computational complexity explodes. This is where Machine Learning (ML) enters the arena. Modern ML has revolutionized optimization, solving complex non-convex problems in image recognition and NLP. Naturally, researchers have asked: Can we use the stochastic optimization power of ML to find Nash Equilibria in massive games?

The answer is yes, but there is a major catch: Variance.

When games are too large to fit in memory, we must rely on sampling, simulating small parts of the game to estimate the whole. Most existing methods for doing this are either mathematically biased (even with unlimited samples, their loss no longer points at the true NE) or, if they are unbiased, they suffer from extreme variance (the estimates are so noisy that the learning algorithm struggles to converge).

In this post, we will dive deep into a research paper that proposes a solution to this dilemma: Nash Advantage Loss (NAL). We will explore how this new loss function leverages a clever mathematical insight—specifically regarding unbiased gradient estimation—to reduce variance by orders of magnitude, allowing for faster and more accurate game solving.

The Background: Why is this Hard?

To understand the solution, we first need to understand the scale of the problem.

Normal-Form Games and the Curse of Dimensionality

A Normal-Form Game (NFG) is the standard representation of a game where players move simultaneously. It consists of:

  1. Players: A set of agents.
  2. Actions: A list of possible moves for each player.
  3. Utilities (Payoffs): A mapping that tells us how much reward a player gets for a specific combination of actions by all players.

Strategies are represented as probability distributions over actions. The goal is to find a distribution for each player such that no one wants to switch.

The problem is storage. For an \(n\)-player game with \(m\) actions each, the payoff matrix contains \(nm^n\) entries. If you have 10 players and 10 actions, that is \(10 \times 10^{10}\) entries. You simply cannot load this into memory to calculate exact gradients.
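
To get a feel for this blow-up, here is a quick back-of-the-envelope sketch (plain Python; the 4-byte-per-entry assumption is ours, just to put the count in memory terms):

```python
# Rough storage estimate for a normal-form game's payoff table.
# Each of the n players needs a payoff for every joint action profile,
# giving n * m**n entries in total.

def payoff_entries(n_players: int, n_actions: int) -> int:
    return n_players * n_actions ** n_players

entries = payoff_entries(10, 10)             # 10 * 10**10 = 1e11 entries
gigabytes = entries * 4 / 1e9                # assuming 4-byte floats
print(f"{entries:.2e} entries ≈ {gigabytes:.0f} GB")   # ~1.00e+11 entries ≈ 400 GB
```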

The Stochastic Optimization Challenge

Because we cannot look at the whole game matrix, we use Sampled Play. We simulate the game, look at the outcome, and try to update our neural network (which represents our strategy) to do better next time. This is “stochastic optimization.”

Here lies the trap. To use ML optimization (like Stochastic Gradient Descent - SGD), we need a Loss Function—a mathematical formula that tells the network how “bad” its current strategy is.

  1. Biased Methods: Many traditional loss functions measure “Exploitability” (how much could an opponent gain by best-responding to me?). However, calculating exploitability involves a max operator (finding the best counter-strategy), and the expectation of a max is not the same as the max of the expectations (\(\mathbb{E}[\max] \neq \max\mathbb{E}\)). Estimating such a loss from samples therefore introduces bias: even with infinite data, the algorithm can converge to the wrong solution (see the short numerical check after this list).
  2. The High-Variance Unbiased Method: More recently, researchers (Gemp et al., 2024) found an unbiased loss function. However, estimating it requires taking the inner product of two independent stochastic estimates of the gradient.
  • If the variance of a single estimate is on the order of \(\sigma\), the variance of the product of two independent estimates scales roughly as \(\sigma^2\).
  • In practice, this creates incredibly “noisy” gradients. The optimizer jumps around erratically, leading to slow convergence or failure.
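
To see why swapping the max and the expectation matters, here is a tiny NumPy experiment (ours, not from the paper): two actions with identical true means, yet the sampled max is systematically optimistic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two actions with the same true expected payoff (0.0), observed with noise.
samples = rng.normal(loc=0.0, scale=1.0, size=(100_000, 2))

max_of_expectations = max(0.0, 0.0)              # max over the true means = 0.0
expectation_of_max = samples.max(axis=1).mean()  # ≈ 0.56, biased upward

print(max_of_expectations, expectation_of_max)
```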

The Core Method: Nash Advantage Loss (NAL)

The researchers behind this paper proposed a novel surrogate loss function called Nash Advantage Loss (NAL).

The Key Insight: Gradient over Loss

The researchers realized that to train a neural network using SGD, you don’t actually need an unbiased estimate of the loss value itself; you only need an unbiased estimate of the gradient.

If we can construct a function whose gradient points toward the Nash Equilibrium without bias, it doesn’t matter if the function itself looks different from standard exploitability metrics.

Mathematical Foundation

The derivation starts with a property of the Nash Equilibrium. At an interior NE (where every action is played with positive probability), every action a player uses must yield the same marginal utility; equivalently, all coordinates of that player’s utility gradient are equal. If one action had a higher marginal utility, the player would shift probability toward it, breaking the equilibrium.

The authors utilize a simple but powerful lemma:

For any vector \(\mathbf{b}\) and any probability vector \(\mathbf{y}\), the equation \(\mathbf{b} - \langle \mathbf{b}, \mathbf{y} \rangle \mathbf{1} = \mathbf{0}\) holds if and only if all coordinates of \(\mathbf{b}\) are equal.

In plain English: If you subtract the weighted average of a vector from the vector itself and get zero, then all elements in that vector must have been identical.
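
A quick NumPy check of the lemma (the numbers are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.dirichlet(np.ones(4))          # an arbitrary probability vector

b_equal = np.full(4, 3.7)              # all coordinates equal
b_mixed = np.array([1.0, 2.0, 3.0, 4.0])

# b - <b, y> * 1 is the zero vector exactly when all coordinates of b match.
def residual(b):
    return b - np.dot(b, y) * np.ones_like(b)

print(np.allclose(residual(b_equal), 0.0))   # True
print(np.allclose(residual(b_mixed), 0.0))   # False
```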

Defining NAL

This leads to the definition of the Nash Advantage Loss. Instead of trying to calculate a complex squared distance between full gradients (which causes variance), NAL formulates the gradient directly using a “Stop-Gradient” operator.

Here is the formula for NAL:

\[
\mathcal{L}_{\mathrm{NAL}}(\boldsymbol{x}) \;=\; \sum_{i} \Big\langle \boldsymbol{x}_i,\; sg\Big[\boldsymbol{F}_i^{\tau, \boldsymbol{x}} - \big\langle \boldsymbol{F}_i^{\tau, \boldsymbol{x}}, \hat{\boldsymbol{x}}_i \big\rangle \mathbf{1}\Big] \Big\rangle
\]

Let’s break down this dense equation:

  • \(\boldsymbol{F}_i^{\tau, \boldsymbol{x}}\): This represents the negative gradient of the player’s utility (adjusted with an entropy regularization term \(\tau\) to ensure the strategy stays in the interior of the simplex).
  • \(\hat{\boldsymbol{x}}_i\): This is an arbitrary “reference” strategy. It acts as a baseline.
  • \(\langle \dots \rangle\): This denotes an inner product. The term \(\langle \boldsymbol{F}, \hat{\boldsymbol{x}} \rangle\) calculates the average gradient value under the reference strategy.
  • \(\mathbf{1}\): A vector of ones.
  • \(sg[\dots]\): This is the most critical part. It stands for Stop-Gradient.

The Role of Stop-Gradient: In modern deep learning frameworks, the stop-gradient operator (tf.stop_gradient in TensorFlow, .detach() in PyTorch) allows a value to pass through during the “forward pass” (calculating the loss value) but prevents any gradients from flowing back through it during the “backward pass” (updating the weights).
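
As a concrete (toy) illustration of the pattern, here is a minimal PyTorch sketch; the advantage vector below is a hand-made stand-in for the sg[...] term, not the paper’s implementation:

```python
import torch

# logits parameterize a single player's strategy x over 3 actions.
logits = torch.zeros(3, requires_grad=True)
x = torch.softmax(logits, dim=0)

# Stand-in for F - <F, x_hat> * 1 (in practice this is estimated from samples).
advantage = torch.tensor([0.2, -0.1, -0.1])

# detach() plays the role of sg[.]: the value is used in the forward pass,
# but no gradient flows back through it in the backward pass.
loss = torch.dot(x, advantage.detach())
loss.backward()

# d(loss)/dx is exactly 'advantage'; the gradient then flows back through
# the softmax into the network parameters (here, the raw logits).
print(logits.grad)
```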

By wrapping the complex gradient estimation terms inside sg[], the authors ensure that the optimization engine treats them as fixed targets during the update step. The only part of the equation that participates in the gradient calculation is the outer \(\boldsymbol{x}_i\). This creates a scenario where the gradient of NAL becomes:

\[
\nabla_{\boldsymbol{x}_i} \mathcal{L}_{\mathrm{NAL}}(\boldsymbol{x}) \;=\; \boldsymbol{F}_i^{\tau, \boldsymbol{x}} - \big\langle \boldsymbol{F}_i^{\tau, \boldsymbol{x}}, \hat{\boldsymbol{x}}_i \big\rangle \mathbf{1}
\]

This gradient is zero exactly when the strategy \(\boldsymbol{x}\) is an interior (entropy-regularized) Nash Equilibrium, which is precisely the condition from the lemma above: all coordinates of \(\boldsymbol{F}_i^{\tau, \boldsymbol{x}}\) are equal.

Why does this reduce variance?

The previous state-of-the-art unbiased loss function (Gemp et al., 2024) looked like this:

The previous unbiased loss function by Gemp et al. (2024).

Notice the squared norm \(\| \dots \|^2\). To estimate this unbiasedly via sampling, one essentially has to multiply two independent estimates of the gradient vectors. If your sampling creates noise (variance), squaring that noise makes it significantly worse.

NAL avoids this squared inner product of random variables. It relies on a single stochastic estimate of the gradient vector \(\boldsymbol{F}\), which enters the loss only linearly.

Additionally, the term \(\langle \boldsymbol{F}, \hat{\boldsymbol{x}} \rangle \mathbf{1}\) acts as a Control Variate. In variance reduction literature, a control variate is a term you subtract from an estimator to reduce its variance without changing its expected value. By subtracting the “average” gradient, NAL centers the gradients, keeping the numbers smaller and less volatile.
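
The effect is easy to reproduce with a generic control-variate example (illustrative numbers only; this is not NAL’s exact baseline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy payoff estimates with true mean 1.0, plus a correlated baseline
# whose expectation (1.0) we know. Subtracting (baseline - E[baseline])
# leaves the mean unchanged but cancels most of the shared noise.
payoff = rng.normal(loc=1.0, scale=2.0, size=100_000)
baseline = payoff + rng.normal(scale=0.5, size=100_000)

plain = payoff
controlled = payoff - (baseline - 1.0)

print(plain.mean(), controlled.mean())   # both ≈ 1.0  -> still unbiased
print(plain.var(), controlled.var())     # ≈ 4.0 vs ≈ 0.25 -> far less variance
```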

The Algorithm in Practice

Implementing NAL involves a specific sampling routine to ensure unbiasedness.

  1. Sampling Actions: For every iteration, the algorithm samples actions from the current strategy \(\boldsymbol{x}^\theta\).
  2. Importance Sampling: Because we only observe payoffs for the actions we actually sampled, we must use importance sampling (dividing by the probability of selecting that action) to reconstruct the full gradient vector.
  3. Payoff Estimation: The algorithm queries the game simulator (or environment) to get the payoff \(r_i\).

The expected payoff calculation looks like this:

Payoff expectation equation.

And the estimation of the baseline term (the inner product) is averaged over \(S\) samples:

Baseline estimation expectation.

Finally, the constructed gradient estimate \(\mathbf{g}_i^s\) uses the difference between the observed payoff and the baseline, weighted by the inverse of the sampling probability \(p\):

Gradient estimate equation.

Because this relies on linear estimates rather than quadratic ones (products of estimates), the variance is theoretically bounded by \(O(\sigma)\) rather than the \(O(\sigma^2)\) of previous methods.
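
To make the sampling routine concrete, here is a rough sketch of an importance-sampled gradient estimate for one player in a small two-player matrix game; the setup and variable names are ours, not the paper’s, but the inverse-probability weighting is the same idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_gradient(payoff_matrix, x_own, x_opp, num_samples=32):
    """Importance-sampled estimate of one player's payoff-gradient vector.

    Only the sampled payoff entries are queried (sampled play); dividing by
    the probability of the sampled action keeps the estimate unbiased:
    E[g] = payoff_matrix @ x_opp.
    """
    m = len(x_own)
    g = np.zeros(m)
    for _ in range(num_samples):
        a = rng.choice(m, p=x_own)               # own sampled action
        b = rng.choice(len(x_opp), p=x_opp)      # opponent's sampled action
        r = payoff_matrix[a, b]                  # single payoff query
        g[a] += r / (x_own[a] * num_samples)     # inverse-probability weight
    return g

# Rock-paper-scissors payoffs for the row player.
A = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
x = np.array([0.5, 0.3, 0.2])
y = np.array([0.2, 0.5, 0.3])

print(sampled_gradient(A, x, y, num_samples=10_000))  # close to A @ y
print(A @ y)
```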

Experiments & Results

Theory is nice, but does it work in practice? The authors tested NAL on eight different games using the OpenSpiel and GAMUT frameworks, including Kuhn Poker, Liar’s Dice, and Goofspiel.

The strategies were represented by Deep Neural Networks (3-layer MLPs). The optimizer used was Adam.
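
The architectural description here is just “3-layer MLPs trained with Adam,” so the following PyTorch sketch of a strategy network (input size, hidden width, and softmax head are our assumptions) is meant only to fix ideas:

```python
import torch
import torch.nn as nn

class StrategyNet(nn.Module):
    """3-layer MLP mapping a player's input features to a probability
    distribution over that player's actions."""

    def __init__(self, in_dim: int, hidden: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(obs), dim=-1)   # strategy on the simplex

policy = StrategyNet(in_dim=16, hidden=128, num_actions=4)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
```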

1. Convergence Speed

The primary metric for success in these experiments is the Duality Gap: roughly, the total amount the players could gain by unilaterally switching to their best responses. A duality gap of 0 means the strategy profile is an exact Nash Equilibrium. We want this curve to hit zero as fast as possible.
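
For a two-player matrix game, the duality gap can be computed directly; the sketch below uses the standard definition (the sum of each player’s best-response gain), which we take to match the paper’s metric:

```python
import numpy as np

def duality_gap(A, B, x, y):
    """Sum over players of the gain from switching to a best response.

    A[i, j] and B[i, j] are the payoffs to players 1 and 2 for actions (i, j);
    x and y are their mixed strategies. The gap is 0 exactly at a Nash Equilibrium.
    """
    gain_1 = np.max(A @ y) - x @ A @ y     # player 1's best deviation gain
    gain_2 = np.max(B.T @ x) - x @ B @ y   # player 2's best deviation gain
    return gain_1 + gain_2

# Matching pennies: uniform play is the unique NE, so the gap is 0 there.
A = np.array([[1., -1.], [-1., 1.]])
B = -A
print(duality_gap(A, B, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0
print(duality_gap(A, B, np.array([0.9, 0.1]), np.array([0.5, 0.5])))  # 0.8 > 0
```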

The figure below compares NAL (Blue lines, “Ours”) against three competitors:

  • Gemp et al., 2024: The previous unbiased (but high variance) loss.
  • ADI & NashApr: Biased loss functions.

Convergence rate comparison graphs across 8 games.

Interpretation: Look at the blue lines (“Ours”). In almost every game, NAL descends rapidly and achieves a lower duality gap much faster than the competitors.

  • In Goofspiel and Minimum Effort, the previous unbiased method (Orange line) fails to converge entirely, likely because the variance is just too high for the optimizer to handle. NAL converges smoothly.
  • The biased methods (Brown and Red lines) sometimes plateau early or fail to converge, illustrating the danger of bias in stochastic settings.

2. Variance Reduction

This is the claim the paper makes in its title, so the results here are critical. How much less “noisy” is NAL compared to the Gemp et al. (2024) loss?

Variance comparison graphs on log scale.

Interpretation: Note that the Y-axis is on a Logarithmic Scale.

  • The variance of the baseline unbiased method (Orange) is consistently 2 to 6 orders of magnitude higher than NAL (Blue).
  • In Liar’s Dice (top right), the gap is massive. This explains why NAL can optimize so much more efficiently—it is receiving a clear signal direction, whereas the competitor is drowning in noise.

3. Unbiasedness Verification

To ensure that NAL is indeed unbiased in practice, the authors plotted the difference between the estimated loss value (from sampling) and the true loss value (calculated exactly). Ideally, this difference should center around zero.

Difference between true and estimated values.

Interpretation: The blue line (“Ours”) stays tightly clustered around zero (or remarkably low values) compared to the other methods, confirming the unbiased nature of the estimator. The error for NAL is typically two orders of magnitude smaller than other functions.

Robustness

The authors didn’t just stop at one configuration. They tested:

  • Different Optimizers: SGD and RMSProp (NAL still won).
  • Different Architectures: Replacing Softmax with Sparsemax. In this setting, competitor algorithms frequently became unstable or output NaN (Not a Number) values due to variance spikes, while NAL remained stable.
  • Ablation Studies: They tested NAL without the baseline term (\(\langle F, \hat{x} \rangle\)). The results deteriorated, proving that the variance reduction from this specific term is essential for the method’s success.

Conclusion and Implications

The “Nash Advantage Loss” paper presents a significant step forward in using Machine Learning to solve large-scale games. By shifting the requirement from an unbiased estimate of the loss value to a low-variance, unbiased estimate of its gradient, the authors have overcome a major hurdle in stochastic optimization for games.

Key Takeaways:

  1. Bias vs. Variance: You usually have to pick one. NAL manages to keep the estimator unbiased while drastically reducing variance through the use of control variates and avoiding squared random variables.
  2. The Power of Stop-Gradient: Creative use of the stop-gradient operator allows us to define “surrogate” losses that guide the optimizer correctly without requiring complex, variance-inducing calculations in the forward pass.
  3. Scalability: Because NAL works under sampled play (stochastic optimization), it opens the door to approximating Nash Equilibria in games that are far too large for traditional tabular methods.

For students and practitioners in Multi-Agent Reinforcement Learning (MARL) or Algorithmic Game Theory, NAL offers a new standard for training equilibrium solvers. It turns the chaotic, noisy process of learning from samples into a stable, converging path toward optimal strategies.