Introduction
Reinforcement Learning (RL) has achieved remarkable feats, from mastering complex strategy games to controlling robotic limbs. However, one bottleneck persistently stifles progress: the difficulty of exploring efficiently. In environments where rewards are “sparse”—meaning the agent receives feedback only rarely, perhaps only upon completing a complex task—an agent can spend eons flailing randomly, never stumbling upon the specific sequence of actions required to earn a reward.
Imagine you are dropped into a massive, dark maze with a single treasure chest hidden somewhere deep inside. If you wander randomly, you might eventually find it, but it could take a lifetime. However, if you realized that passing through a doorway (a junction) opens up a whole new section of the maze, you would prioritize finding those doors. This structural knowledge is crucial.
This is the core premise of the research paper “Door(s): Junction State Estimation for Efficient Exploration in Reinforcement Learning.” The authors propose a novel heuristic called Door(s). Instead of exploring blindly or relying on complex models of control, this method seeks out “junction states”—metaphorical (and sometimes literal) doors that grant access to a vast number of future states.

As shown in Figure 1 above, the heuristic (a) specifically lights up narrow passages between rooms in a grid world. By prioritizing these bottlenecks, the agent can traverse the environment more efficiently than if it were trying to visit every state with equal probability.
In this post, we will dissect how this method works, the mathematics behind identifying these “doors” without human labeling, and how it outperforms existing methods like Empowerment in complex, continuous environments.
Background: The Quest for Intrinsic Motivation
To solve the sparse reward problem, researchers often turn to Intrinsic Motivation (IM). This gives the agent an internal reward signal—a sense of “curiosity” or “desire for mastery”—to guide it when the environment is silent.
There are several flavors of IM:
- Novelty-based: The agent gets a reward for visiting a state it hasn’t seen before.
- Information-theoretic: The agent maximizes the mutual information between states and actions.
A leading concept in the second category is Empowerment. Empowerment measures how much “control” an agent has over its future. A state is “empowered” if the agent can reach many different future states and select which one it ends up in via specific action sequences.
While theoretically sound, Empowerment has a major flaw: it requires accurate modeling of long sequences of actions (\(a_t, a_{t+1}, \dots, a_{t+H}\)). Predicting the exact outcome of a 100-step action sequence is notoriously difficult due to compounding errors. Consequently, Empowerment often struggles with long time horizons.
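For reference, empowerment over a horizon \(H\) is usually formalized as the channel capacity between a sequence of actions and the state that sequence leads to, where \(\omega\) is a distribution over action sequences. This is the standard formulation from the empowerment literature rather than an equation quoted from the Door(s) paper:

\[
\mathcal{E}^{(H)}(s_t) = \max_{\omega(a_t, \dots, a_{t+H-1})} I\!\left(a_t, \dots, a_{t+H-1};\ s_{t+H} \mid s_t\right)
\]

Maximizing this mutual information requires modeling how each full action sequence maps to an outcome, which is exactly where the compounding-error problem bites.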
This is where Door(s) differentiates itself. The authors argue that we don’t necessarily need to know how (that is, through which actions) we would reach a future state in order to know that the current state is valuable. We just need to know that the current state acts as a gateway to a diverse set of future possibilities.
The Core Method: Estimating Junction States
The goal of the Door(s) heuristic is to quantify the “dispersity” of states reachable from a current state \(s\) over a time horizon \(H\). If a state allows you to reach a highly diverse set of future locations, it is likely a junction or a bottleneck.
To formalize this, the authors build a mathematical framework centered on the State Occupancy Distribution.
Step 1: The Environment Model
First, we need to define the probability of transitioning from state \(s\) to state \(s'\) in exactly \(t\) steps. This is defined recursively:

\[
\rho(s \xrightarrow{t} s') = \sum_{x} \rho(s \xrightarrow{t-1} x)\,\rho(x \xrightarrow{1} s')
\]

Here, \(\rho(s \xrightarrow{t} s')\) is the probability of landing in \(s'\) after \(t\) steps starting from \(s\). The recursion sums over all intermediate states \(x\), with the base case \(\rho(s \xrightarrow{1} s')\) given by the one-step dynamics. Crucially, this probability marginalizes over actions: it reflects the natural dynamics of the environment under a uniform exploration policy rather than a specific learned policy.
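To make the recursion concrete, here is a minimal NumPy sketch for a small discrete environment. The transition matrix `P` (with actions already marginalized out under a uniform exploration policy), the function name `t_step_distribution`, and the toy three-state dynamics are illustrative assumptions, not details from the paper.

```python
import numpy as np

def t_step_distribution(P, s, t):
    """Probability of landing in each state after exactly t steps from state s.

    P is a row-stochastic (S, S) matrix: P[x, y] = Pr(next state y | state x),
    with actions already marginalized out under a uniform exploration policy.
    """
    rho = np.zeros(P.shape[0])
    rho[s] = 1.0                      # start concentrated on s
    for _ in range(t):
        rho = rho @ P                 # one unrolling of the recursion
    return rho

# Toy 3-state chain (illustrative dynamics, not from the paper).
P = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
print(t_step_distribution(P, s=0, t=4))
```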
Step 2: The State Occupancy Distribution
We are rarely interested in just where the agent is at exactly step \(t\). We care about where the agent spends its time over a window of steps up to a horizon \(h\). The authors define the State Occupancy Distribution, \(\Psi^{(h)}\), which represents the fraction of time spent in state \(s'\) over a horizon \(h\), given a starting state \(s\):

\[
\Psi^{(h)}(s' \mid s) = \frac{1}{h} \sum_{t=1}^{h} \rho(s \xrightarrow{t} s')
\]

This equation averages the transition probabilities over the time steps \(1\) to \(h\).
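A direct translation of this averaging into code, continuing the toy discrete setup above (again assuming an action-marginalized transition matrix `P`; names are illustrative):

```python
import numpy as np

def occupancy_distribution(P, s, h):
    """Psi^(h)(. | s): average of the t-step distributions for t = 1..h."""
    rho = np.zeros(P.shape[0])
    rho[s] = 1.0
    psi = np.zeros(P.shape[0])
    for _ in range(h):
        rho = rho @ P                 # rho(s -> s') for the current step count t
        psi += rho
    return psi / h                    # fraction of time spent in each state
```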
Step 3: The Door(s) Measure
Now we get to the heart of the heuristic. How do we convert this distribution into a single value that represents “door-ness”? We use Entropy.
In information theory, entropy measures the uncertainty or “spread” of a distribution. A distribution concentrated on a single point has low entropy (0). A distribution spread uniformly across all possible points has maximum entropy.
If a state is a “door,” passing through it should allow the agent to access a wide variety of states. Therefore, the state occupancy distribution starting from a door should have high entropy.
The Door(s) value for a state \(s\) is computed as the average entropy of the state occupancy distribution over multiple horizons ranging from \(1\) to \(H\):

\[
\mathrm{Door}(s) = \frac{1}{H} \sum_{h=1}^{H} \mathcal{H}^{(h)}(s)
\]
In this equation:
- \(H\) is the maximum horizon (a hyperparameter).
- \(\mathcal{H}^{(h)}\) is the entropy of the distribution \(\Psi^{(h)}\).
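Folding the two previous sketches together, a minimal discrete implementation of the measure might look like this (still assuming an action-marginalized transition matrix `P`; the function name `door_value` is illustrative):

```python
import numpy as np

def door_value(P, s, H, eps=1e-12):
    """Average entropy of Psi^(h)(. | s) over horizons h = 1..H."""
    n_states = P.shape[0]
    rho = np.zeros(n_states)
    rho[s] = 1.0
    psi_sum = np.zeros(n_states)      # running sum of rho over steps 1..h
    entropies = []
    for h in range(1, H + 1):
        rho = rho @ P
        psi_sum += rho
        psi = psi_sum / h             # Psi^(h)(. | s)
        entropies.append(-np.sum(psi * np.log(psi + eps)))
    return float(np.mean(entropies))  # Door(s)
```

Evaluating `door_value` for every state of a tabular grid world and plotting the result should produce the kind of junction-highlighting pattern described for Figure 1.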
Why Multiple Horizons?
You might wonder, why average over all horizons \(h=1 \dots H\)? Why not just look at the maximum horizon \(H\)?
The dynamics of an environment change over time. A state might be a local bottleneck (important in the short term) or a global gateway (important in the long term). By averaging, the measure captures both.

As shown in Figure 5 (above), querying multiple horizons (top row) provides a much cleaner signal. It accurately highlights the junctions (a), detects dead-ends (b), and prioritizes central states (c). Using only the final horizon (bottom row) often washes out the signal, making it harder to distinguish true junctions from open spaces.
Step 4: Implementation in Continuous Space
The math above works perfectly for a grid world where we can count discrete states. But in robotics, states are continuous vectors (positions, velocities, angles). We cannot sum over infinitely many states, nor can we compute exact entropies easily.
To solve this, the authors employ Mixture Density Networks (MDNs).
An MDN is a neural network that, instead of outputting a single value, outputs the parameters (means \(\mu\), variances \(\Sigma\), and weights \(\alpha\)) of a mixture of Gaussian distributions. This allows the network to approximate complex, multi-modal probability distributions.
The approximated state occupancy distribution \(\hat{\Psi}\) is defined as a mixture of \(K\) Gaussians whose parameters are produced by the network:

\[
\hat{\Psi}^{(h)}(s' \mid s) = \sum_{k=1}^{K} \alpha_k(s, h)\, \mathcal{N}\!\left(s';\ \mu_k(s, h),\ \Sigma_k(s, h)\right)
\]

Here, the neural network takes the current state \(s\) and the horizon \(h\) as inputs and predicts where the agent might end up.
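As a rough illustration of such an architecture, here is a small PyTorch sketch of an MDN that maps a (state, horizon) pair to mixture parameters. The class name `OccupancyMDN`, the layer sizes, the number of components, and the choice of diagonal covariances are assumptions made for the example, not details taken from the paper.

```python
import torch
import torch.nn as nn

class OccupancyMDN(nn.Module):
    """Maps (state, horizon) to the parameters of a Gaussian mixture over future states."""

    def __init__(self, state_dim, n_components=5, hidden=128):
        super().__init__()
        self.k, self.d = n_components, state_dim
        self.body = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.alpha_head = nn.Linear(hidden, n_components)                  # mixture weights
        self.mu_head = nn.Linear(hidden, n_components * state_dim)         # component means
        self.log_sigma_head = nn.Linear(hidden, n_components * state_dim)  # diagonal std-devs

    def forward(self, s, h):
        # s: (batch, state_dim), h: (batch, 1) horizon fed as an extra input feature
        x = self.body(torch.cat([s, h], dim=-1))
        alpha = torch.softmax(self.alpha_head(x), dim=-1)
        mu = self.mu_head(x).view(-1, self.k, self.d)
        sigma = self.log_sigma_head(x).view(-1, self.k, self.d).exp()
        return alpha, mu, sigma
```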
The network is trained by minimizing the negative log-likelihood of real trajectories collected by the agent:

\[
\mathcal{L} = - \sum_{(s,\, h,\, s') \in \mathcal{D}} \log \hat{\Psi}^{(h)}(s' \mid s)
\]

where \(\mathcal{D}\) denotes (state, horizon, future state) samples taken from those trajectories.
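With `torch.distributions`, the mixture likelihood can be evaluated directly. A sketch of such a loss, assuming the diagonal-covariance outputs of the example `OccupancyMDN` above (the helper name `mdn_nll` is made up for illustration):

```python
import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

def mdn_nll(alpha, mu, sigma, s_future):
    """Negative log-likelihood of observed future states under the predicted mixture.

    alpha: (batch, K) weights, mu/sigma: (batch, K, D), s_future: (batch, D).
    """
    mixture = Categorical(probs=alpha)
    components = Independent(Normal(mu, sigma), 1)   # diagonal Gaussians over D dims
    gmm = MixtureSameFamily(mixture, components)
    return -gmm.log_prob(s_future).mean()            # minimize this during training
```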
Once the MDN is trained, we can estimate the entropy. While the entropy of a Gaussian mixture doesn’t have a closed-form solution, it can be approximated efficiently. The authors use a weighted sum of the entropies of the individual Gaussian components:

\[
\hat{\mathcal{H}}^{(h)}(s) \approx \sum_{k=1}^{K} \alpha_k\, \mathcal{H}\!\left(\mathcal{N}(\mu_k, \Sigma_k)\right)
\]
This implementation detail is vital. It allows the Door(s) heuristic to scale to high-dimensional spaces (like a 30-dimensional robot arm state) where counting methods fail.
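A minimal sketch of this weighted-sum approximation for a diagonal-covariance mixture (the diagonal form carries over from the example network above; the paper's exact computation may differ in detail):

```python
import numpy as np

def approx_mixture_entropy(alpha, sigma):
    """Weighted sum of component entropies as a cheap stand-in for the mixture entropy.

    alpha: (K,) mixture weights, sigma: (K, D) diagonal standard deviations.
    """
    d = sigma.shape[1]
    # Entropy of a diagonal Gaussian: d/2 * log(2*pi*e) + sum_i log(sigma_i)
    component_entropy = 0.5 * d * np.log(2 * np.pi * np.e) + np.sum(np.log(sigma), axis=1)
    return float(np.sum(alpha * component_entropy))
```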
Comparing the continuous approximation to a discretized version confirms the method’s validity:

In Figure 7, we see that the MDN approach (c) captures the same dynamics as the discrete truth (a) and the computationally expensive “query all horizons” approach (b), but with significantly better efficiency.
Experiments and Results
The authors tested Door(s) across various environments, including a simple Pendulum, a complex PointMaze, and a robotic Fetch arm manipulation task.
1. Visualizing the Reward Landscape
One of the most telling experiments compares the “heatmaps” generated by Door(s) versus Empowerment.

In Figure 2, we see a maze environment.
- Door(s) (Left): Clearly highlights the central intersections and high-velocity states. It assigns very low value to dead ends (purple areas).
- Empowerment (Right): While it identifies some structure, it is inconsistent. It assigns high values to some corners and fails to clearly distinguish the main corridors.
This highlights the robustness of Door(s) over long horizons (\(H=500\)), whereas Empowerment struggles because the action-sequence modeling becomes unreliable at that depth.
2. The “Throwing” Insight
A fascinating result emerged in the FetchPickAndPlace task. In this environment, a robot arm must pick up an object.

Figure 3 plots the heuristic value against the distance to the object.
- Door(s) (Blue line): Shows high value even at large distances.
- Empowerment (Orange/Green): Drops off as distance increases.
Why? The Door(s) heuristic realizes that if the robot throws the object, the object can visit a massive number of states (flying through the air, bouncing, rolling). This creates a high-entropy occupancy distribution. Empowerment, however, is based on control. Once the object leaves the gripper, the robot loses control over it. Therefore, Empowerment views throwing as “bad” (low value).
However, for exploration, throwing is excellent! It helps the agent learn the physics of the environment and how the object moves. This illustrates how Door(s) captures “potential influence” rather than strict control.
3. Exploration Efficiency
Does maximizing the Door(s) metric actually lead to better exploration?

Figure 4(a) shows the “Coefficient of Variation” of state visitation in a maze. A lower value means the agent is visiting states more uniformly (good exploration).
- Door(s) (Blue): Achieves the best (lowest) variation early in training, exploring the maze efficiently.
- Empowerment (Orange): Performs similarly to a baseline with no intrinsic reward.
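As a concrete reference for this metric, here is a minimal sketch of a coefficient of variation over per-state visit counts (standard deviation divided by mean). The function name is made up, and the paper's exact computation, for example how unvisited states are handled, may differ.

```python
import numpy as np

def visitation_cv(visit_counts):
    """Coefficient of variation of per-state visit counts.

    Lower values mean visits are spread more uniformly across states.
    """
    counts = np.asarray(visit_counts, dtype=float)
    return counts.std() / counts.mean()
```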
4. Downstream Learning (Transfer)
The ultimate test is whether this exploration helps the agent learn a specific task later. The researchers pre-trained an agent to simply maximize the Door(s) reward (pure exploration), and then fine-tuned it to solve specific tasks like Pushing or Sliding a puck.

Table 1 reveals the results.
- DIAYN: A popular skill-discovery method. It converges fast during pre-training but fails completely on the downstream tasks (0 successes).
- Door(s): Converges quickly during pre-training and, crucially, provides consistent, successful results in the downstream Pick-and-Place, Push, and Slide tasks. It requires significantly fewer steps than learning from scratch.
This suggests that the behaviors learned by maximizing “Door-ness”—interacting with objects, passing through choke points—are highly reusable fundamental skills.
Limitations and Future Work
While promising, the method is not a silver bullet.
- Stochasticity: In environments with random noise (e.g., a “noisy TV” that displays random static), the entropy measure might get tricked into thinking the noise is meaningful diversity.
- Discontinuities: The MDN assumes smooth Gaussian transitions. It might struggle with teleportation or sudden, sharp boundaries in state space.
- Computational Cost: While more efficient than Empowerment, estimating entropy via an MDN still requires significant compute, and the current implementation assumes uniform action distributions for the heuristic.
Conclusion
Door(s) offers a fresh perspective on the exploration problem in Reinforcement Learning. By shifting the focus from “control” (Empowerment) to “potential reachability” (Junction States), the authors provide a heuristic that is computationally tractable and robust over long time horizons.
The method effectively identifies the “keys” to the environment—the narrow passages and interaction points that unlock the rest of the state space. For students and researchers looking at intrinsic motivation, this paper demonstrates that sometimes, you don’t need to know exactly how to control every step of a journey; simply knowing which door to walk through is enough to open up a world of possibilities.