Taming Uncertainty in MCTS with Wasserstein Barycenters and Power Means
Monte-Carlo Tree Search (MCTS) is the engine behind some of the most impressive feats in modern AI, most notably the superhuman performance of AlphaGo and AlphaZero in games like Go and Chess. These algorithms work by building a search tree of possibilities, simulating future outcomes, and backing up the values to make the best decision at the root.
But there is a catch. Traditional MCTS excels in deterministic environments—where moving a piece to square E4 always puts the piece on E4. However, the real world is messy. It is stochastic (actions have random outcomes) and partially observable (we can’t see everything). In these “fog of war” scenarios, standard MCTS struggles. It often relies on simple averages to estimate the value of a state, effectively smoothing over the risks and high-variance outcomes that define stochastic environments.
In this post, we are diving deep into a new approach called Wasserstein MCTS (W-MCTS). This method re-imagines the nodes in a search tree not as single numbers, but as probability distributions. By using sophisticated tools from Optimal Transport theory—specifically \(L^1\)-Wasserstein barycenters—it propagates not just the expected value, but the uncertainty of that value all the way up the tree.
The Problem: When Averages Lie
In standard MCTS (like the popular UCT algorithm), when the search visits a state, it runs a simulation, gets a reward, and updates the “mean value” of that state. If a state leads to a reward of +100 half the time and -100 the other half, the mean is 0. A state that always returns 0 also has a mean of 0.
To a standard MCTS, these two states look identical. But to an agent in a risky environment, they are drastically different. One is a safe haven; the other is a coin toss with your life.
This inability to distinguish between “safe” and “risky” or to handle high variance causes standard MCTS to fail in stochastic environments. It leads to unstable estimates and poor exploration-exploitation balancing.
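As a quick numerical illustration (a toy example, not from the paper), here is how the two states above look to a mean-only backup versus one that also tracks spread:

```python
import numpy as np

# State A: +100 half the time, -100 the other half (the coin toss).
state_a = np.array([100, -100, 100, -100, 100, -100])
# State B: always returns 0 (the safe haven).
state_b = np.zeros(6)

print(state_a.mean(), state_b.mean())  # 0.0 0.0   -> identical to a mean-only backup
print(state_a.std(),  state_b.std())   # 100.0 0.0 -> the spread tells them apart
```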
The Solution: Wasserstein MCTS
The researchers propose a fundamental shift: Model every node in the tree as a Gaussian distribution.
Instead of storing just a mean value \(\mu\), we store a mean and a standard deviation \((\mu, \sigma)\). This allows the algorithm to track how “sure” it is about a specific state’s value.
The magic lies in how these distributions are combined. When a parent node looks at its children, it doesn’t just average their numbers. It computes the Wasserstein Barycenter of the children’s distributions.
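A minimal sketch of what such a node might store (the class and field names are illustrative, not the authors' implementation):

```python
from dataclasses import dataclass, field

@dataclass
class GaussianNode:
    """A search-tree node whose value estimate is a Gaussian."""
    mu: float = 0.0       # mean of the value estimate
    sigma: float = 1.0    # standard deviation (how unsure we are)
    visits: int = 0       # visit count n(s), used to weight backups
    children: list = field(default_factory=list)
```

During backup, the parent's \((\mu, \sigma)\) is recomputed from its children's Gaussians rather than updated as a running average.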
1. The Mathematical Foundation: Optimal Transport
To understand how W-MCTS combines distributions, we need to talk about the Wasserstein distance. Often called the “Earth Mover’s Distance,” it measures the cost to transform one probability distribution into another.
The general definition for the \(L^q\)-Wasserstein distance between distributions \(\mu\) and \(\nu\) is:
\[
W_q(\mu, \nu) \;=\; \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int d(x, y)^q \, \mathrm{d}\gamma(x, y) \right)^{1/q}
\]
Here, \(\Gamma(\mu, \nu)\) represents all possible ways to transport mass from distribution \(\mu\) to \(\nu\). The “Barycenter” is simply the distribution that minimizes the total Wasserstein distance to a set of other distributions. It is the geometric “center of mass” in the space of probability distributions.
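Formally, given distributions \(\mu_1, \dots, \mu_K\) and non-negative weights \(\lambda_1, \dots, \lambda_K\), the barycenter is the minimizer

\[
\bar{\mu} \;=\; \operatorname*{arg\,min}_{\nu} \; \sum_{i=1}^{K} \lambda_i \, W_q(\nu, \mu_i).
\]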

While the \(L^2\)-Wasserstein distance is common in machine learning, this paper opts for the \(L^1\)-Wasserstein distance combined with \(\alpha\)-divergence.
2. Why \(L^1\) and \(\alpha\)-divergence?
The authors choose the \(L^1\) metric because it is more robust to outliers and large deviations, which are common in stochastic reinforcement learning. Furthermore, they pair it with a specific cost function called the \(\alpha\)-divergence.
The \(\alpha\)-divergence is a family of distance measures that allows us to tune how we compare two values. It is defined as:

By adjusting the parameter \(\alpha\), we can change the behavior of the distance metric. This combination leads to the specific \(L^1\)-Wasserstein formulation used in the paper:

This might look heavy, but it leads to a beautiful and practical result which we discuss next: the Power Mean.
The Core Contribution: The Power Mean Backup
The most significant contribution of this paper is proving that if you assume the nodes are Gaussians, and you calculate the \(L^1\)-Wasserstein barycenter using \(\alpha\)-divergence, the complex optimization problem collapses into a clean, closed-form solution.
The updated mean and standard deviation of a parent node become the Power Mean of its children.
Let’s define a parameter \(p = 1 - \alpha\). The update rules for the mean (\(\overline{m}\)) and standard deviation (\(\overline{\sigma}\)) of a value node are:

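As a rough code sketch of such a backup (the visit-count weighting and the identical treatment of \(\sigma\) are assumptions made here for illustration, not a transcription of the paper's formulas):

```python
import numpy as np

def power_mean(values, weights, p):
    """Weighted power mean: (sum_i w_i * x_i**p) ** (1/p).

    Values are assumed non-negative (or shifted to be), so x**p is
    well defined for arbitrary p.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # e.g. normalized visit counts
    return float((w @ x**p) ** (1.0 / p))

def backup(child_means, child_stds, child_visits, p):
    """Combine children's Gaussians into the parent's (mean, std)."""
    m_bar = power_mean(child_means, child_visits, p)
    s_bar = power_mean(child_stds,  child_visits, p)
    return m_bar, s_bar
```

With \(p = 1\) this reduces to a weighted average; as \(p\) grows, it approaches the maximum child value, matching the interpretation below.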
Interpreting the Power Mean (\(p\))
This result is powerful because it unifies different backup strategies into a single parameter \(p\):
- If \(p = 1\): The update becomes a standard arithmetic Average. This is risk-neutral.
- If \(p \to \infty\): The update behaves like a Max operator. This is highly optimistic (like the standard Bellman optimality equation).
- If \(p\) is moderate: It interpolates between the average and the max.
This allows W-MCTS to act as a “Generalized Mean” algorithm. In highly stochastic environments where “max” is dangerous (overestimating lucky transitions), you can set \(p\) closer to 1. In deterministic environments where you want to find the optimal path, you can set \(p\) higher.
Generalized to Particle Filters
The paper notes that this math isn’t just for Gaussians. If you model your uncertainty using particles (samples), the update rule remains structurally identical:

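Under the same illustrative assumptions, a particle-based node would apply the power mean directly to its samples:

```python
import numpy as np

def particle_power_mean(particles, p):
    """Power mean over a node's (non-negative) particle values."""
    x = np.asarray(particles, dtype=float)
    return float(np.mean(x**p) ** (1.0 / p))
```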
The W-MCTS Algorithm in Action
Now that we have the math to combine distributions, how does the actual MCTS loop work?
Step 1: Distributional Backup
When the search reaches a leaf and backs up values, it updates both the mean (\(V_m\)) and the standard deviation (\(V_{std}\)) estimates using the power mean formulas derived above.

The backup operators use the visit counts \(n(s,a)\) to weight the contributions of different paths. This ensures that the variance at the root node accurately reflects the uncertainty propagated from deep in the tree.

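Continuing the GaussianNode and backup sketches from above (again illustrative, not the authors' code), a backup pass after one simulation might walk the visited path from leaf to root:

```python
def backup_path(path, p):
    """Update every node on the simulated path, leaf first.

    Each parent's (mu, sigma) becomes the visit-count-weighted power
    mean of its children's (mu, sigma). A full implementation would
    also fold the simulation outcome into the leaf; omitted here.
    """
    for node in reversed(path):
        node.visits += 1
        if node.children:
            node.mu, node.sigma = backup(
                [c.mu for c in node.children],
                [c.sigma for c in node.children],
                [c.visits for c in node.children],
                p,
            )
```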
Step 2: Action Selection (Exploration)
Standard MCTS selects actions with UCT (Upper Confidence bounds applied to Trees), which adds a visit-count-based exploration bonus to each action’s mean value. W-MCTS introduces two statistically grounded ways to select actions using the propagated distributions.
Strategy A: Optimistic Selection (W-MCTS-OS)
This is similar to UCT but replaces the standard exploration bonus with the propagated standard deviation \(\sigma\). This means exploration is driven by the actual uncertainty of the subtree, not just the visit count.

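A small sketch of what optimistic selection could look like under these assumptions (the exploration constant \(C\) and the exact form of the bonus are illustrative):

```python
import numpy as np

def select_optimistic(means, stds, c=1.0):
    """Pick the action maximizing mean + C * propagated std."""
    scores = np.asarray(means) + c * np.asarray(stds)
    return int(np.argmax(scores))
```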
Strategy B: Thompson Sampling (W-MCTS-TS)
Thompson Sampling is a powerful Bayesian approach. Instead of calculating a bound, we sample a value from the distribution of each action \(\mathcal{N}(m, \sigma^2)\) and pick the action with the highest sample.

This method naturally balances exploration and exploitation. If a node has high variance (uncertainty), it might produce a high sample, prompting the agent to explore it. As the variance decreases, the samples cluster tighter around the mean.
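A minimal sketch of the Thompson-sampling step, assuming one Gaussian \(\mathcal{N}(m, \sigma^2)\) per action:

```python
import numpy as np

def select_thompson(means, stds, rng=None):
    """Draw one sample from each action's Gaussian; pick the argmax."""
    rng = rng or np.random.default_rng()
    samples = rng.normal(np.asarray(means), np.asarray(stds))
    return int(np.argmax(samples))
```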
Theoretical Guarantees
One of the major strengths of this work is its rigorous theoretical backing. Many modifications to MCTS lack convergence proofs. The authors prove that W-MCTS, specifically the Thompson Sampling variant, converges to the optimal policy.
The error of the estimated value function at the root node decreases at a polynomial rate:
\[
\big|\, V_{\text{root}}(n) - V^*(s_{\text{root}}) \,\big| \;\in\; \mathcal{O}\!\left(n^{-1/2}\right),
\]

where \(n\) is the number of simulations.
This \(\mathcal{O}(n^{-1/2})\) rate matches the known lower bounds for this class of problems, confirming that propagating uncertainty doesn’t come at the cost of asymptotic optimality.
Experimental Results
The theory sounds great, but does it work? The authors tested W-MCTS on challenging domains that break standard MCTS: highly stochastic MDPs (like FrozenLake and RiverSwim) and Partially Observable MDPs (Pocman, Rocksample).
Fully Observable Stochastic Environments
In environments like FrozenLake (where the ice is slippery and movement is random) and RiverSwim, the agent must plan long sequences of actions despite constant noise.
The figure below compares W-MCTS (Red line) against standard UCT (Blue dashed) and other baselines like DNG (Bayesian MCTS).

Analysis:
- W-MCTS-TS (Red) consistently learns faster and achieves higher returns than UCT.
- Robustness: In SixArms and Taxi, standard UCT struggles significantly, while W-MCTS maintains high performance. This confirms that propagating variance helps the agent avoid “traps” that look good on average but are risky.
Partially Observable Environments (POMDPs)
POMDPs are notoriously difficult because the agent doesn’t know the true state. It only has a belief. The authors tested on Rocksample (a rover collecting samples) and Pocman (Pacman with limited vision).

In Rocksample (Figure 2), W-MCTS-TS (Red) dominates the baselines, especially as the grid size and difficulty increase (moving from left to right). The gap between W-MCTS and standard UCT (Blue) is massive, highlighting that standard averages simply cannot handle the uncertainty of partial observability.
Table 1 highlights the results in Pocman.

W-MCTS-TS achieves a score of 77.70, significantly higher than UCT’s 28.5. It even outperforms D2NG, a specialized Bayesian algorithm for POMDPs.
Conclusion
Wasserstein MCTS represents a significant step forward in planning algorithms. By accepting that the world is uncertain and modeling that uncertainty explicitly with Gaussian distributions and Optimal Transport, it bridges the gap between the “average” case and the “worst/best” case.
The key takeaways are:
- Don’t ignore variance: Propagating the standard deviation through the tree allows for smarter exploration.
- Power Means are flexible: The \(p\) parameter (derived from \(\alpha\)-divergence) allows the algorithm to adapt its behavior from risk-neutral to optimistic.
- Optimal Transport is a practical tool: The use of \(L^1\)-Wasserstein barycenters provides a mathematically sound way to average distributions that results in simple, implementable update rules.
For students and researchers in RL, this paper illustrates how bringing concepts from other mathematical fields—like Optimal Transport—can solve fundamental limitations in classic AI algorithms.