In the world of Reinforcement Learning (RL), we usually frame problems as maximizing the sum of rewards. You take an action, get a reward, and try to get as much of it as possible over time. But what if your goal isn’t just about accumulating points? What if you want an agent to explore an environment as diversely as possible, or imitate a human expert’s behavior distribution?

These complex goals fall under the umbrella of General-Utility Markov Decision Processes (GUMDPs). Unlike standard RL, where the objective is linear, GUMDPs allow for objective functions that depend on the statistics (or occupancy) of the agent’s behavior.

For years, a subtle assumption has persisted in the theory behind these processes: that we can evaluate an agent based on an infinite number of trials. A new research paper, The Number of Trials Matters in Infinite-Horizon General-Utility Markov Decision Processes, challenges this assumption. The researchers demonstrate that in these complex settings, the number of times you test the agent (the number of trials, \(K\)) fundamentally changes the calculated performance.

In this post, we will break down why this happens, the mathematics behind the mismatch, and when you need to worry about it.

The Problem: Beyond Standard Rewards

To understand the paper’s contribution, we first need to establish the difference between standard RL and GUMDPs.

Standard MDPs vs. GUMDPs

In a standard Markov Decision Process (MDP), the goal is simple: maximize the expected sum of rewards. The objective function is linear. Because of the mathematical property of linearity, the expected return of a policy is the same whether you calculate it over one trial or average it over a million trials (mathematically speaking, the expectation of the sum is the sum of the expectations).

However, many modern AI problems are more nuanced. Consider these examples:

  1. Pure Exploration: You want an agent to visit all states equally. This requires maximizing the entropy of the state visitation distribution.
  2. Imitation Learning: You want the agent’s behavior distribution to match a teacher’s distribution (minimizing the KL-divergence).

These are General-Utility MDPs. Here, the objective function \(f\) is applied to the state-action occupancy \(d_\pi\) (the long-term frequency of visiting states and actions).
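
To make this concrete, here is a minimal sketch in Python/NumPy (the function names and the toy trajectory are ours, not the paper's) of what occupancy-based objectives look like: count state-action visits along a trajectory, normalize them into an empirical occupancy estimate, and apply a non-linear \(f\) such as negative entropy (for exploration) or a KL-divergence to a target occupancy (for imitation).

```python
import numpy as np

def empirical_occupancy(trajectory, n_states, n_actions, gamma=None):
    """Empirical state-action occupancy of a single trajectory.

    trajectory: list of (state, action) pairs.
    If gamma is given, the visit at time t is weighted by gamma**t before
    normalizing (a discounted occupancy); otherwise plain visit frequencies.
    """
    d = np.zeros((n_states, n_actions))
    for t, (s, a) in enumerate(trajectory):
        d[s, a] += 1.0 if gamma is None else gamma**t
    return d / d.sum()

def neg_entropy(d, eps=1e-12):
    """Convex objective for exploration: minimizing it maximizes entropy."""
    p = d.flatten()
    return float(np.sum(p * np.log(p + eps)))

def kl_to_target(d, d_target, eps=1e-12):
    """Convex objective for imitation: KL(d || d_target)."""
    p, q = d.flatten(), d_target.flatten()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy usage: a short trajectory in a 2-state, 2-action problem.
traj = [(0, 1), (1, 0), (1, 1), (0, 0)]
d_hat = empirical_occupancy(traj, n_states=2, n_actions=2, gamma=0.9)
print(neg_entropy(d_hat))
print(kl_to_target(d_hat, d_target=np.full((2, 2), 0.25)))
```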


Figure 1: Examples of GUMDPs. (a) Entropy maximization requires visiting states diversely. (b) Imitation learning tries to match a target policy \(\beta\). (c) Quadratic minimization penalizes certain distribution shapes.

The Infinite vs. Finite Trap

Here is where the trouble begins. Theoretical work often assumes we are optimizing the infinite trials objective, denoted as \(f_\infty\). This assumes we know the true expected occupancy of the policy, effectively averaging over infinite episodes.

\[ \min_{\pi} f_{\infty}(\pi) = \min_{\pi} f(d_{\pi}) = \min_{\pi} f\!\left( \mathbb{E}_{\mathcal{T}_K}\!\left[ \hat{d}_{\mathcal{T}_K} \right] \right). \]

In practice, however, we can only run a finite number of trials (\(K\)). We collect data from \(K\) trajectories, calculate the empirical occupancy \(\hat{d}\), and then plug that into our objective function. This is the finite trials objective, \(f_K\).

\[ \min_{\pi} f_{K}(\pi) = \min_{\pi} \mathbb{E}_{\mathcal{T}_K}\!\left[ f\!\left( \hat{d}_{\mathcal{T}_K} \right) \right]. \]

The Core Question: Is minimizing \(f_K\) (what we do in practice) the same as minimizing \(f_\infty\) (what we want in theory)?

In standard RL, the answer is yes. In GUMDPs, as this paper proves, the answer is generally no.
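
The distinction is easy to state in code. The sketch below (ours, not the paper's; the per-trajectory occupancies are Dirichlet draws standing in for real rollouts of a policy) estimates both quantities by Monte Carlo: \(f_\infty\) applies \(f\) to the expected occupancy, while \(f_K\) averages \(f\) over empirical occupancies built from batches of \(K\) trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_entropy(d, eps=1e-12):
    """A convex GUMDP objective: negative entropy of the occupancy."""
    return float(np.sum(d * np.log(d + eps)))

def sample_trajectory_occupancy():
    # Stand-in for "roll the policy out once and compute its empirical occupancy":
    # each trajectory's occupancy over 4 state-action pairs is a Dirichlet draw.
    return rng.dirichlet([0.3, 0.3, 0.3, 0.3])

def f_finite(f, K, n_repeats=5000):
    """Monte Carlo estimate of f_K: average f over K-trajectory occupancies."""
    vals = [f(np.mean([sample_trajectory_occupancy() for _ in range(K)], axis=0))
            for _ in range(n_repeats)]
    return float(np.mean(vals))

# f_infinity applies f to the expected occupancy d_pi (estimated here by brute force).
d_pi = np.mean([sample_trajectory_occupancy() for _ in range(100_000)], axis=0)
print("f_inf :", neg_entropy(d_pi))
for K in (1, 10, 100):
    print(f"f_{K}  :", f_finite(neg_entropy, K))
# For the convex neg_entropy, every f_K sits above f_inf, and the gap shrinks
# as K grows; with a linear f (a reward vector dotted with d), they coincide.
```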

The Mathematical Root: Jensen’s Inequality

Why is there a mismatch? It comes down to the convexity of the objective function \(f\).

If \(f\) is linear (standard RL), the expectation of the function equals the function of the expectation. However, most GUMDP objectives (like Entropy or KL-divergence) are convex functions.
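
Concretely, if \(f(d) = \sum_{s,a} d(s,a)\, r(s,a)\) for some fixed reward (or cost) vector \(r\), the expectation passes straight through \(f\):

\[ \mathbb{E}_{\mathcal{T}_K}\!\left[ f\big(\hat{d}_{\mathcal{T}_K}\big) \right] = \sum_{s,a} \mathbb{E}_{\mathcal{T}_K}\!\left[ \hat{d}_{\mathcal{T}_K}(s,a) \right] r(s,a) = f(d_\pi), \]

so \(f_K(\pi) = f_\infty(\pi)\) for every \(K\).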

According to Jensen’s Inequality, for a convex function \(f\):

\[ f_{\infty}(\pi) = f(d_{\pi}) = f\!\left( \mathbb{E}_{\mathcal{T}_K \sim \zeta_{\pi}}\!\left[ \hat{d}_{\mathcal{T}_K} \right] \right) \leq \mathbb{E}_{\mathcal{T}_K \sim \zeta_{\pi}}\!\left[ f\!\left( \hat{d}_{\mathcal{T}_K} \right) \right] = f_{K}(\pi). \]

This inequality implies that \(f_\infty(\pi) \leq f_K(\pi)\). The estimated cost over finite trials will effectively be an overestimation of the true infinite-trial cost. This bias means that a policy that looks optimal for \(K=10\) trials might not be optimal for \(K=1000\) trials.
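
A tiny two-state example (ours, not taken from the paper) makes the gap concrete. Suppose each trajectory's occupancy ends up concentrated entirely on one of two states, each with probability \(1/2\), and \(f\) is the negative entropy \(f(d) = \sum_s d(s) \log d(s)\). Then

\[ f_\infty = f\big(\tfrac{1}{2}, \tfrac{1}{2}\big) = -\log 2 \approx -0.69, \qquad f_1 = \tfrac{1}{2} f(1, 0) + \tfrac{1}{2} f(0, 1) = 0, \]

so a single trial reports a cost of \(0\) even though the policy's true occupancy is perfectly uniform.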

Scenario 1: Discounted GUMDPs

The authors first analyze the Discounted Setting, where we care about the frequency of visits weighted by a discount factor \(\gamma\). This is common in tasks where immediate actions matter more than distant ones.

In this setting, the paper proves that the number of trials always matters if the objective is non-linear. The empirical distribution of a single trajectory is highly stochastic; it depends heavily on where the agent starts and the randomness of transitions.

The authors derive a lower bound for the error between the finite and infinite objectives. Notice how the error is inversely proportional to \(K\) (the number of trials):

\[ \begin{aligned} f_{K}(\pi) - f_{\infty}(\pi) &\geq \frac{c}{2K} \sum_{\substack{s \in \mathcal{S} \\ a \in \mathcal{A}}} \operatorname{Var}_{T \sim \zeta_{\pi}}\!\left[ \hat{d}_{T}(s,a) \right] \\ &= \frac{c(1-\gamma)^{2}}{2K} \sum_{\substack{s \in \mathcal{S} \\ a \in \mathcal{A}}} \operatorname{Var}_{T \sim \zeta_{\pi}}\!\left[ J^{\gamma,\pi}_{r_{s,a}} \right]. \end{aligned} \]

What this means:

  1. The error is driven by the variance of the occupancy. If your environment is very noisy, the gap between what you measure and the truth is larger.
  2. The error decreases as \(1/K\). To cut the error in half, you need to double the number of trials (the sketch below checks this numerically).
  3. \(c\) measures how “curved” (strongly convex) your objective function is. The more curved the objective, the larger the gap can be.
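
The \(1/K\) rate is easy to check numerically. The sketch below (our own toy two-state chain; the transition matrix, horizon, and discount are arbitrary choices, and the objective is negative entropy) estimates \(f_K\) by Monte Carlo and compares it to \(f_\infty\) computed from the exact discounted occupancy:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-state, single-action chain (all numbers are arbitrary choices).
P = np.array([[0.9, 0.1],    # transition probabilities from state 0
              [0.2, 0.8]])   # transition probabilities from state 1
mu0 = np.array([0.5, 0.5])   # initial-state distribution
gamma, H = 0.9, 100          # discount factor and truncation horizon

def neg_entropy(d, eps=1e-12):
    return float(np.sum(d * np.log(d + eps)))

def one_trajectory_occupancy():
    """Discounted empirical state occupancy of a single rollout."""
    d = np.zeros(2)
    s = rng.choice(2, p=mu0)
    for t in range(H):
        d[s] += (1.0 - gamma) * gamma**t
        s = rng.choice(2, p=P[s])
    return d / d.sum()          # renormalize away the truncated tail

def f_K_estimate(K, n_repeats=1000):
    """Monte Carlo estimate of f_K: average f over K-rollout occupancies."""
    vals = [neg_entropy(np.mean([one_trajectory_occupancy() for _ in range(K)], axis=0))
            for _ in range(n_repeats)]
    return float(np.mean(vals))

# Exact expected discounted occupancy: d_pi = (1 - gamma) * mu0 (I - gamma P)^{-1}.
d_pi = (1.0 - gamma) * mu0 @ np.linalg.inv(np.eye(2) - gamma * P)
f_inf = neg_entropy(d_pi)

for K in (1, 2, 4, 8, 16):
    print(K, f_K_estimate(K) - f_inf)   # the gap roughly halves as K doubles
```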

Scenario 2: Average GUMDPs

The second setting is the Average Setting, where we look at the long-term average behavior of the agent as time goes to infinity (\(H \to \infty\)). Here, the results are more nuanced and depend on the structure of the environment.

Unichain vs. Multichain

The paper distinguishes between two types of MDP structures:

  1. Unichain GUMDPs: The whole environment is “connected.” No matter where you start or what you do, you can eventually reach any part of the state space that matters. There is only one “recurrent class.”
  2. Multichain GUMDPs: The environment has separated islands. Depending on your starting state or early choices, you might get trapped in one region (Recurrent Class A) and never see another (Recurrent Class B).

The Verdict

The authors present a striking dichotomy for the average setting:

  • For Unichain GUMDPs: The number of trials does not matter. Because the agent runs for infinite time (\(H \to \infty\)) in a connected world, a single trajectory eventually visits every state according to the true distribution. The empirical distribution converges to the true distribution, so \(f_K = f_\infty\).

  • For Multichain GUMDPs: The number of trials matters. If you only run \(K=1\) trial, the agent might get stuck in “Island A.” The empirical distribution will look like “100% Island A.” But the true expectation might be “50% Island A, 50% Island B.” Because the objective \(f\) is non-linear, the average of the function outcomes is not the same as the function of the average outcome.

The table below summarizes these findings:

| Objective \(f\) | Discounted | Average (unichain) | Average (multichain) |
| --- | --- | --- | --- |
| Linear (standard RL) | \(f_K = f_\infty\) | \(f_K = f_\infty\) | \(f_K = f_\infty\) |
| Non-linear (general utility) | \(f_K \neq f_\infty\) | \(f_K = f_\infty\) | \(f_K \neq f_\infty\) |

For Multichain models, the authors prove a similar lower bound on the error, which depends on the variance of the probability of getting absorbed into different recurrent classes:

\[ f_{K}(\pi) - f_{\infty}(\pi) \geq \frac{c}{2K} \sum_{l=1}^{L} \operatorname{Var}_{B \sim \operatorname{Ber}(\alpha_{l})}\!\left[ B \right] \sum_{s \in \mathcal{R}_{l}} \sum_{a \in \mathcal{A}} \pi(a \mid s)^{2} \, \mu_{l}(s)^{2}. \]

This equation essentially says: if the starting state determines your destiny (high variance in which “island” you end up in), your single-trial evaluation will be wrong.
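
A minimal simulation of this "islands" effect (our own toy construction, not the paper's experiments): each trial is absorbed into island A with probability \(\alpha\) and island B otherwise, each island is a single absorbing state, and the objective is negative entropy. Because the only randomness is how many of the \(K\) trials land in island A, \(f_K\) can be computed exactly from a Binomial distribution.

```python
import numpy as np
from math import comb

def neg_entropy(d, eps=1e-12):
    d = np.asarray(d, dtype=float)
    return float(np.sum(d * np.log(d + eps)))

def f_K(alpha, K):
    """Exact finite-trials objective for the two-island toy model.

    Each trial lands in island A (occupancy concentrated on state 0) with
    probability alpha, else island B (state 1), so the K-trial empirical
    occupancy is (n/K, 1 - n/K) with n ~ Binomial(K, alpha).
    """
    return sum(
        comb(K, n) * alpha**n * (1.0 - alpha) ** (K - n)
        * neg_entropy([n / K, 1.0 - n / K])
        for n in range(K + 1)
    )

def f_inf(alpha):
    return neg_entropy([alpha, 1.0 - alpha])

for alpha in (0.5, 0.9, 1.0):    # Bernoulli variance alpha*(1-alpha): 0.25, 0.09, 0.0
    for K in (1, 4, 16, 64):
        print(f"alpha={alpha:.1f}  K={K:<3d}  gap={f_K(alpha, K) - f_inf(alpha):.4f}")
# The gap is zero when alpha is 0 or 1 (no randomness over islands), is largest
# near alpha = 0.5 for small K (where the absorption variance peaks), and decays
# roughly like 1/K otherwise.
```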

Empirical Evidence

The authors validated their theoretical bounds using the environments shown in Figure 1.

Discounted Setting Results

In the figure below, we see the value of the objective function (y-axis) as the trajectory length \(H\) increases. The dashed black line is the “true” infinite-trial value (\(f_\infty\)).

Graph showing the empirical study of discounted GUMDPs. As \(K\) increases (red lines), the value approaches the true infinite-trials value \(f_\infty\) (black dashed line).

Notice that for low \(K\) (blue lines), the estimated value is far from the black dashed line. As \(K\) increases to 10 (red lines), the gap closes. This confirms that in discounted settings, you simply need enough trials to get an accurate evaluation.

Average Setting Results (Multichain)

The results for the average setting are perhaps the most interesting. The plot below shows what happens as the discount factor \(\gamma \to 1\) (approaching the average setting).

Graph showing the empirical study of average GUMDPs. For \(\mathcal{M}_{f,3}\), even as \(\gamma\) approaches 1, the gap remains for low \(K\).

Look at the third panel (\(\mathcal{M}_{f,3}\)). This is a multichain environment. Even as \(\gamma\) approaches 1.0, the curves do not converge to the black dashed line for low \(K\). This confirms that for multichain environments, simply running the agent for a long time (high \(H\)) isn’t enough; you also need high \(K\) (many trials) to smooth out the variance caused by different starting conditions.

Why This Matters

This research highlights a “hidden cost” in complex Reinforcement Learning. When we move away from simple rewards to sophisticated goals like diversity or imitation, we lose the luxury of linearity.

If you are a student or researcher working with GUMDPs, take note:

  1. Evaluation Protocol: You cannot rely on a single rollout to evaluate your policy, even if that rollout is very long.
  2. Environment Structure: Know your MDP. Is it Unichain or Multichain? If it’s Multichain, your evaluation strategy needs to account for the different “islands” of the state space.
  3. Sample Efficiency: The \(1/K\) convergence rate implies that high-precision evaluation might be computationally expensive.

The number of trials isn’t just a hyperparameter to be tuned; in General-Utility MDPs, it’s a fundamental component of the objective function itself.