Taming Uncertainty in MCTS with Wasserstein Barycenters and Power Means
Monte-Carlo Tree Search (MCTS) is the engine behind some of the most impressive feats in modern AI, most notably the superhuman performance of AlphaGo and AlphaZero in games like Go and Chess. These algorithms work by building a search tree of possibilities, simulating future outcomes, and backing up the values to make the best decision at the root.
But there is a catch. Traditional MCTS excels in deterministic environments—where moving a piece to square E4 always puts the piece on E4. However, the real world is messy. It is stochastic (actions have random outcomes) and partially observable (we can’t see everything). In these “fog of war” scenarios, standard MCTS struggles. It often relies on simple averages to estimate the value of a state, effectively smoothing over the risks and high-variance outcomes that define stochastic environments.
In this post, we are diving deep into a new approach called Wasserstein MCTS (W-MCTS). This method re-imagines the nodes in a search tree not as single numbers, but as probability distributions. By using sophisticated tools from Optimal Transport theory—specifically \(L^1\)-Wasserstein barycenters—it propagates not just the expected value, but the uncertainty of that value all the way up the tree.
The Problem: When Averages Lie
In standard MCTS (like the popular UCT algorithm), when the search visits a state, it runs a simulation, gets a reward, and updates the “mean value” of that state. If a state leads to a reward of +100 half the time and -100 the other half, the mean is 0. A state that always returns 0 also has a mean of 0.
To a standard MCTS, these two states look identical. But to an agent in a risky environment, they are drastically different. One is a safe haven; the other is a coin toss with your life.
This inability to distinguish between “safe” and “risky” or to handle high variance causes standard MCTS to fail in stochastic environments. It leads to unstable estimates and poor exploration-exploitation balancing.
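As a quick numerical illustration (a toy example, not from the paper), here is how the two states above look to a mean-only backup versus one that also tracks spread:

```python
import numpy as np

# State A: +100 half the time, -100 the other half (the coin toss).
state_a = np.array([100, -100, 100, -100, 100, -100])
# State B: always returns 0 (the safe haven).
state_b = np.zeros(6)

print(state_a.mean(), state_b.mean())  # 0.0 0.0   -> identical to a mean-only backup
print(state_a.std(),  state_b.std())   # 100.0 0.0 -> the spread tells them apart
```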
The Solution: Wasserstein MCTS
The researchers propose a fundamental shift: Model every node in the tree as a Gaussian distribution.
Instead of storing just a mean value \(\mu\), we store a mean and a standard deviation \((\mu, \sigma)\). This allows the algorithm to track how “sure” it is about a specific state’s value.
The magic lies in how these distributions are combined. When a parent node looks at its children, it doesn’t just average their numbers. It computes the Wasserstein Barycenter of the children’s distributions.
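A minimal sketch of what such a node might store (the class and field names are illustrative, not the authors' implementation):

```python
from dataclasses import dataclass, field

@dataclass
class GaussianNode:
    """A search-tree node whose value estimate is a Gaussian."""
    mu: float = 0.0       # mean of the value estimate
    sigma: float = 1.0    # standard deviation (how unsure we are)
    visits: int = 0       # visit count n(s), used to weight backups
    children: list = field(default_factory=list)
```

During backup, the parent's \((\mu, \sigma)\) is recomputed from its children's Gaussians rather than updated as a running average.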
1. The Mathematical Foundation: Optimal Transport
To understand how W-MCTS combines distributions, we need to talk about the Wasserstein distance. Often called the “Earth Mover’s Distance,” it measures the cost to transform one probability distribution into another.
The general definition for the \(L^q\)-Wasserstein distance between distributions \(\mu\) and \(\nu\) is:
\[
W_q(\mu, \nu) \;=\; \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int d(x, y)^q \, \mathrm{d}\gamma(x, y) \right)^{1/q}
\]
Here, \(\Gamma(\mu, \nu)\) represents all possible ways to transport mass from distribution \(\mu\) to \(\nu\). The “Barycenter” is simply the distribution that minimizes the total Wasserstein distance to a set of other distributions. It is the geometric “center of mass” in the space of probability distributions.
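Formally, given distributions \(\mu_1, \dots, \mu_K\) and non-negative weights \(\lambda_1, \dots, \lambda_K\), the barycenter is the minimizer

\[
\bar{\mu} \;=\; \operatorname*{arg\,min}_{\nu} \; \sum_{i=1}^{K} \lambda_i \, W_q(\nu, \mu_i).
\]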

While the \(L^2\)-Wasserstein distance is common in machine learning, this paper opts for the \(L^1\)-Wasserstein distance combined with \(\alpha\)-divergence.
2. Why \(L^1\) and \(\alpha\)-divergence?
The authors choose the \(L^1\) metric because it is more robust to outliers and large deviations, which are common in stochastic reinforcement learning. Furthermore, they pair it with a specific cost function called the \(\alpha\)-divergence.
The \(\alpha\)-divergence is a family of distance measures that allows us to tune how we compare two values. It is defined as:

By adjusting the parameter \(\alpha\), we can change the behavior of the distance metric. This combination leads to the specific \(L^1\)-Wasserstein formulation used in the paper:

This might look heavy, but it leads to a beautiful and practical result which we discuss next: the Power Mean.
The Core Contribution: The Power Mean Backup
The most significant contribution of this paper is proving that if you assume the nodes are Gaussians, and you calculate the \(L^1\)-Wasserstein barycenter using \(\alpha\)-divergence, the complex optimization problem collapses into a clean, closed-form solution.
The updated mean and standard deviation of a parent node become the Power Mean of its children.
Let’s define a parameter \(p = 1 - \alpha\). The update rules for the mean (\(\overline{m}\)) and standard deviation (\(\overline{\sigma}\)) of a value node are:

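As a rough code sketch of such a backup (the visit-count weighting and the identical treatment of \(\sigma\) are assumptions made here for illustration, not a transcription of the paper's formulas):

```python
import numpy as np

def power_mean(values, weights, p):
    """Weighted power mean: (sum_i w_i * x_i**p) ** (1/p).

    Values are assumed non-negative (or shifted to be), so x**p is
    well defined for arbitrary p.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # e.g. normalized visit counts
    return float((w @ x**p) ** (1.0 / p))

def backup(child_means, child_stds, child_visits, p):
    """Combine children's Gaussians into the parent's (mean, std)."""
    m_bar = power_mean(child_means, child_visits, p)
    s_bar = power_mean(child_stds,  child_visits, p)
    return m_bar, s_bar
```

With \(p = 1\) this reduces to a weighted average; as \(p\) grows, it approaches the maximum child value, matching the interpretation below.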
Interpreting the Power Mean (\(p\))
This result is powerful because it unifies different backup strategies into a single parameter \(p\):
- If \(p = 1\): The update becomes a standard arithmetic Average. This is risk-neutral.
- If \(p \to \infty\): The update behaves like a Max operator. This is highly optimistic (like the standard Bellman optimality equation).
- If \(p\) is moderate: It interpolates between the average and the max.
This allows W-MCTS to act as a “Generalized Mean” algorithm. In highly stochastic environments where “max” is dangerous (overestimating lucky transitions), you can set \(p\) closer to 1. In deterministic environments where you want to find the optimal path, you can set \(p\) higher.
Generalized to Particle Filters
The paper notes that this math isn’t just for Gaussians. If you model your uncertainty using particles (samples), the update rule remains structurally identical:

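Under the same illustrative assumptions, a particle-based node would apply the power mean directly to its samples:

```python
import numpy as np

def particle_power_mean(particles, p):
    """Power mean over a node's (non-negative) particle values."""
    x = np.asarray(particles, dtype=float)
    return float(np.mean(x**p) ** (1.0 / p))
```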
The W-MCTS Algorithm in Action
Now that we have the math to combine distributions, how does the actual MCTS loop work?
Step 1: Distributional Backup
When the search reaches a leaf and backs up values, it updates both the mean (\(V_m\)) and the standard deviation (\(V_{std}\)) estimates using the power mean formulas derived above.

The backup operators use the visit counts \(n(s,a)\) to weight the contributions of different paths. This ensures that the variance at the root node accurately reflects the uncertainty propagated from deep in the tree.

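Continuing the GaussianNode and backup sketches from above (again illustrative, not the authors' code), a backup pass after one simulation might walk the visited path from leaf to root:

```python
def backup_path(path, p):
    """Update every node on the simulated path, leaf first.

    Each parent's (mu, sigma) becomes the visit-count-weighted power
    mean of its children's (mu, sigma). A full implementation would
    also fold the simulation outcome into the leaf; omitted here.
    """
    for node in reversed(path):
        node.visits += 1
        if node.children:
            node.mu, node.sigma = backup(
                [c.mu for c in node.children],
                [c.sigma for c in node.children],
                [c.visits for c in node.children],
                p,
            )
```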
Step 2: Action Selection (Exploration)
Standard MCTS selects actions with UCT (Upper Confidence bounds applied to Trees), which adds a visit-count-based exploration bonus to each action’s mean value. W-MCTS introduces two statistically grounded ways to select actions using the propagated distributions.
Strategy A: Optimistic Selection (W-MCTS-OS)
This is similar to UCT but replaces the standard exploration bonus with the propagated standard deviation \(\sigma\). This means exploration is driven by the actual uncertainty of the subtree, not just the visit count.

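A small sketch of what optimistic selection could look like under these assumptions (the exploration constant \(C\) and the exact form of the bonus are illustrative):

```python
import numpy as np

def select_optimistic(means, stds, c=1.0):
    """Pick the action maximizing mean + C * propagated std."""
    scores = np.asarray(means) + c * np.asarray(stds)
    return int(np.argmax(scores))
```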
Strategy B: Thompson Sampling (W-MCTS-TS)
Thompson Sampling is a powerful Bayesian approach. Instead of calculating a bound, we sample a value from the distribution of each action \(\mathcal{N}(m, \sigma^2)\) and pick the action with the highest sample.

This method naturally balances exploration and exploitation. If a node has high variance (uncertainty), it might produce a high sample, prompting the agent to explore it. As the variance decreases, the samples cluster tighter around the mean.
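A minimal sketch of the Thompson-sampling step, assuming one Gaussian \(\mathcal{N}(m, \sigma^2)\) per action:

```python
import numpy as np

def select_thompson(means, stds, rng=None):
    """Draw one sample from each action's Gaussian; pick the argmax."""
    rng = rng or np.random.default_rng()
    samples = rng.normal(np.asarray(means), np.asarray(stds))
    return int(np.argmax(samples))
```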
Theoretical Guarantees
One of the major strengths of this work is its rigorous theoretical backing. Many modifications to MCTS lack convergence proofs. The authors prove that W-MCTS, specifically the Thompson Sampling variant, converges to the optimal policy.
The error of the estimated value function at the root node decreases at a polynomial rate:
\[
\big|\, V_{\text{root}}(n) - V^*(s_{\text{root}}) \,\big| \;\in\; \mathcal{O}\!\left(n^{-1/2}\right),
\]

where \(n\) is the number of simulations.
This \(\mathcal{O}(n^{-1/2})\) rate matches the known lower bounds for this class of problems, confirming that propagating uncertainty doesn’t come at the cost of asymptotic optimality.
Experimental Results
The theory sounds great, but does it work? The authors tested W-MCTS on challenging domains that break standard MCTS: highly stochastic MDPs (like FrozenLake and RiverSwim) and Partially Observable MDPs (Pocman, Rocksample).
Fully Observable Stochastic Environments
In environments like FrozenLake (where the ice is slippery and movement is random) and RiverSwim, the agent must plan long sequences of actions despite constant noise.
The figure below compares W-MCTS (Red line) against standard UCT (Blue dashed) and other baselines like DNG (Bayesian MCTS).

Analysis:
- W-MCTS-TS (Red) consistently learns faster and achieves higher returns than UCT.
- Robustness: In SixArms and Taxi, standard UCT struggles significantly, while W-MCTS maintains high performance. This confirms that propagating variance helps the agent avoid “traps” that look good on average but are risky.
Partially Observable Environments (POMDPs)
POMDPs are notoriously difficult because the agent doesn’t know the true state. It only has a belief. The authors tested on Rocksample (a rover collecting samples) and Pocman (Pacman with limited vision).

In Rocksample (Figure 2), W-MCTS-TS (Red) dominates the baselines, especially as the grid size and difficulty increase (moving from left to right). The gap between W-MCTS and standard UCT (Blue) is massive, highlighting that standard averages simply cannot handle the uncertainty of partial observability.
Table 1 highlights the results in Pocman.

W-MCTS-TS achieves a score of 77.70, significantly higher than UCT’s 28.5. It even outperforms D2NG, a specialized Bayesian algorithm for POMDPs.
Conclusion
Wasserstein MCTS represents a significant step forward in planning algorithms. By accepting that the world is uncertain and modeling that uncertainty explicitly with Gaussian distributions and Optimal Transport, it bridges the gap between the “average” case and the “worst/best” case.
The key takeaways are:
- Don’t ignore variance: Propagating the standard deviation through the tree allows for smarter exploration.
- Power Means are flexible: The \(p\) parameter (derived from \(\alpha\)-divergence) allows the algorithm to adapt its behavior from risk-neutral to optimistic.
- Optimal Transport is a practical tool: The use of \(L^1\)-Wasserstein barycenters provides a mathematically sound way to average distributions that results in simple, implementable update rules.
For students and researchers in RL, this paper illustrates how bringing concepts from other mathematical fields—like Optimal Transport—can solve fundamental limitations in classic AI algorithms.