When Robots Hallucinate: Making AI Safe in an Uncertain World
Imagine you are playing a high-stakes game of Jenga. You carefully tap a block, analyzing how the tower wobbles. You predict that if you pull it slightly to the left, the tower remains stable. If you pull it to the right, it crashes. Your brain is running a “world model”—simulating the physics of the tower to keep you safe from losing.
Now, imagine a robot trying to do the same thing. To handle complex visual data (like a Jenga tower seen through a camera), modern robots use latent world models. These are AI systems that compress high-dimensional camera images into compact representations and “imagine” future outcomes.
But there is a catch. AI models are only as good as their training data. If the robot encounters a situation it has never seen before—an “Out-of-Distribution” (OOD) scenario—its world model might hallucinate. It might confidently predict that knocking the tower over is perfectly safe, simply because it doesn’t understand the physics of that specific angle.
How do we stop robots from acting on these dangerous hallucinations?
In this post, we dive into the research paper “Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures”. We will explore how researchers at Carnegie Mellon University developed UNISafe, a framework that teaches robots to recognize when they are confused and proactively steer back to safety.
The Problem: The Confidence of Ignorance
To understand the solution, we first need to understand the architecture of modern robot learning.
1. The Latent World Model
Robots operating in the real world deal with massive amounts of data—pixels in a video feed. Processing raw pixels for every decision is computationally expensive and inefficient. Instead, researchers train a Latent World Model.
- Encoder: Compresses the image observation (\(o_t\)) into a compact latent vector (\(z_t\)).
- Dynamics Model: Predicts the next latent state (\(z_{t+1}\)) given an action (\(a_t\)).
This allows the robot to “dream” or “imagine” sequences of future states to plan its actions.
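To make this concrete, here is a minimal sketch of latent imagination in PyTorch. The class names, layer sizes, and the deterministic latent are illustrative simplifications, not the paper's world-model architecture:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Compresses an image observation o_t into a compact latent vector z_t."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )

    def forward(self, obs):
        return self.net(obs)

class LatentDynamics(nn.Module):
    """Predicts the next latent state z_{t+1} from (z_t, a_t)."""
    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ELU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

def imagine(encoder, dynamics, obs, actions):
    """Roll out a candidate action sequence entirely in latent space."""
    z = encoder(obs)
    trajectory = [z]
    for a in actions:          # no rendering, no pixels: pure "dreaming"
        z = dynamics(z, a)
        trajectory.append(z)
    return trajectory
```

The key design choice is that planning happens entirely on the compact vectors \(z_t\); pixels are only touched once, at encoding time.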
2. The Safety Filter
Mere planning isn’t enough; we need guarantees. This is where Hamilton-Jacobi (HJ) Reachability comes in. It’s a control-theoretic method that calculates a “backward reachable set”—the set of all states from which a failure (like a crash) is inevitable, no matter what you do.
- Safety Value Function (\(V\)): A score representing how safe a state is (see the formula just after this list). If \(V < 0\), you are doomed.
- Safety Filter: If the robot’s intended action leads to a state where \(V\) is too low, the filter overrides it with a safe backup action.
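For intuition, the safety value function can be written as the worst future safety margin achievable under the best possible policy. A standard (undiscounted) form from the reachability literature, paraphrased here rather than quoted from the paper, is:

$$
V(z_0) \;=\; \max_{\pi} \; \min_{t \ge 0} \; \ell\big(z_t^{\pi}\big),
$$

where \(\ell\) is a safety margin that is negative inside the failure set. States with \(V(z_0) < 0\) are precisely those from which every policy eventually hits a failure, which is why the filter treats them as off-limits.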
The OOD Trap
Here is where the system breaks down. Classical safety filters assume the dynamics model is perfect. But learned world models are not.

As shown in Figure 2 above, consider a simple robot car (Dubins car) trying to avoid a purple failure region.
- True Outcome: If the car drives into the purple zone, it fails.
- Predicted Outcome: The world model hasn’t seen enough data about the purple zone. When it imagines driving there, it goes “Out-of-Distribution” (OOD). Instead of predicting a crash, it “hallucinates” that the car teleports to a safe area.
Because the model predicts safety (incorrectly), the standard safety filter lets the robot drive straight into disaster. The robot isn’t just wrong; it is confidently wrong.
The Solution: UNISafe
The core insight of the UNISafe (UNcertainty-aware Imagination for Safety filtering) framework is simple but powerful: Treat the unknown as a failure.
If the robot’s world model is highly uncertain about a future state, the safety filter should treat that state as just as dangerous as hitting a wall. To achieve this, the authors propose a three-step pipeline:

Let’s break down the three phases illustrated in Figure 1:
- Quantify Uncertainty: Measure how much the model “doesn’t know.”
- Calibrate: Determine a threshold for how much uncertainty is too much.
- Augment & Filter: Build a safety filter that avoids both known failures (collisions) and unknown failures (high uncertainty).
Step 1: Quantifying Epistemic Uncertainty
Not all uncertainty is created equal.
- Aleatoric Uncertainty: Inherent noise in the system (e.g., a slippery floor).
- Epistemic Uncertainty: Lack of knowledge (e.g., “I have never been in this room before”).
We want to detect Epistemic Uncertainty. To do this, the authors use an Ensemble of latent dynamics models. They train multiple independent models to predict the future.
- If the models agree, the data is likely familiar (In-Distribution).
- If the models disagree, the robot is in uncharted territory (OOD).
They measure this disagreement using the Jensen-Rényi Divergence (JRD):
$$
D(z_t, a_t) \;=\; H\!\Big(\tfrac{1}{N}\sum_{i=1}^{N} p_{\theta_i}(z_{t+1} \mid z_t, a_t)\Big) \;-\; \tfrac{1}{N}\sum_{i=1}^{N} H\big(p_{\theta_i}(z_{t+1} \mid z_t, a_t)\big),
$$

where \(H\) denotes the Rényi entropy and \(p_{\theta_i}\) is the next-state distribution predicted by the \(i\)-th ensemble member.
In this equation, the term \(D(z_t, a_t)\) represents the epistemic uncertainty. It effectively subtracts the average internal noise (aleatoric) from the total uncertainty of the mixture, leaving us with a pure measure of “model disagreement.”
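To make this tangible, here is a sketch of the disagreement score for an ensemble whose members each output a diagonal Gaussian over the next latent state. It uses the order-2 Rényi entropy, which has a closed form for Gaussian mixtures; the choice of order 2, the function names, and the diagonal-Gaussian assumption are illustrative simplifications rather than the paper's exact implementation:

```python
import numpy as np

def diag_gaussian_density(x, mean, var):
    """Density of a diagonal Gaussian N(mean, diag(var)) evaluated at point x."""
    d = x.shape[-1]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var)))
    quad = np.sum((x - mean) ** 2 / var)
    return np.exp(log_norm - 0.5 * quad)

def jensen_renyi_divergence(means, variances):
    """
    Order-2 Jensen-Renyi divergence of an equally weighted mixture of N
    diagonal Gaussians (one next-state prediction per ensemble member).
    means, variances: arrays of shape (N, d).
    """
    N, d = means.shape

    # Renyi-2 entropy of the mixture: -log of the integral of p(x)^2,
    # which has a closed form for Gaussian mixtures.
    pairwise = np.array([
        [diag_gaussian_density(means[i], means[j], variances[i] + variances[j])
         for j in range(N)]
        for i in range(N)
    ])
    h_mixture = -np.log(pairwise.mean())  # mean = (1/N^2) * double sum

    # Average Renyi-2 entropy of the individual members (the aleatoric part).
    h_members = 0.5 * (d * np.log(4 * np.pi) + np.sum(np.log(variances), axis=-1))

    return h_mixture - h_members.mean()

# Example: 5 ensemble members predicting an 8-dimensional latent state.
means = np.random.randn(5, 8)
variances = np.full((5, 8), 0.1)
print(jensen_renyi_divergence(means, variances))  # grows as the means disagree
```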
Step 2: Calibrating with Conformal Prediction
Okay, we have an uncertainty score. But is a score of 0.5 high? Is 100 high? Arbitrary thresholds are dangerous.
The authors use Conformal Prediction, a statistical technique that uses a calibration dataset to rigorously determine a threshold, \(\epsilon\).
$$
\mathbb{P}\big( D(z_t, a_t) \le \epsilon \big) \;\ge\; 1 - \alpha,
$$

where \(\alpha\) is a small, user-chosen miscoverage level.
This equation guarantees that for “normal” (in-distribution) data, the uncertainty will stay below \(\epsilon\) with a high probability (e.g., 95%). If the uncertainty crosses this line during operation, we can statistically assert that we are witnessing an Out-of-Distribution event.
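A minimal sketch of the calibration step, assuming we already have uncertainty scores computed on a held-out, in-distribution calibration set (the quantile rule below is the standard split conformal recipe; the function name and example data are illustrative):

```python
import numpy as np

def calibrate_threshold(calibration_scores, alpha=0.05):
    """
    Split conformal calibration: pick epsilon as the ceil((n + 1) * (1 - alpha)) / n
    empirical quantile of the held-out uncertainty scores. A fresh in-distribution
    score then exceeds epsilon with probability at most alpha.
    """
    n = len(calibration_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(calibration_scores, level, method="higher")

# Example: scores collected by running the world model on held-out, in-distribution data.
scores = np.random.rand(1000)                  # stand-in for D(z_t, a_t) values
eps = calibrate_threshold(scores, alpha=0.05)  # anything above eps is flagged as OOD
```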
Step 3: Uncertainty-Aware Reachability
This is the heart of the method. The researchers augment the robot’s latent state to include the uncertainty: \(\tilde{z} = (z, u)\).
They then redefine the Safety Margin Function (\(\ell\)). Usually, this function just asks, “Did I hit an obstacle?” Now, it asks, “Did I hit an obstacle OR is my uncertainty too high?”
$$
\ell_\Xi(\tilde{z}_t) \;=\; \min\big( \ell_z(z_t),\; \epsilon - u_t \big)
$$
Here, \(\ell_\Xi\) is the new safety margin. It takes the minimum of the physical safety (\(\ell_z\)) and the uncertainty margin (\(\epsilon - u_t\)).
Finally, they solve the Safety Bellman Equation in this new augmented space. This trains a Value Function (\(V\)) that learns to spot the "Point of No Return", not just for crashes, but for entering confusing situations.
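For intuition, the discounted safety Bellman backup commonly used to train such reachability value functions (a standard form from reinforcement-learning-based reachability, shown as a reference rather than the paper's exact objective) is:

$$
V(\tilde{z}) \;=\; (1-\gamma)\,\ell_\Xi(\tilde{z}) \;+\; \gamma \,\min\Big\{ \ell_\Xi(\tilde{z}),\ \max_{a}\, V\big(\tilde{f}(\tilde{z}, a)\big) \Big\},
$$

where \(\tilde{f}\) denotes the augmented latent dynamics and \(\gamma\) is a discount factor. Because \(\ell_\Xi\) already folds in the uncertainty margin, a state ends up with \(V < 0\) whenever every action sequence eventually either collides or drifts out of the model's comfort zone.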

Runtime Execution
When the robot is running, the safety filter watches the task policy (the robot’s main brain).

If the filter predicts that the proposed action will lead to a state where the value \(V\) is too low (meaning a crash or high uncertainty is inevitable), it intervenes.
The final logic, sketched in code below, looks like this:

- Check: Is the future safe and certain? \(\rightarrow\) Proceed.
- Else: Is the future dangerous or confusing? \(\rightarrow\) Override with the safety policy \(\pi_{\text{safe}}\).
- Critical Failure: Is even the safety policy uncertain? \(\rightarrow\) HALT.
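Here is a minimal sketch of that switching logic, assuming hypothetical `task_policy`, `safe_policy`, `dynamics`, and `value_fn` callables; it mirrors the three cases above rather than reproducing the paper's exact runtime rule:

```python
def filtered_action(z, u, task_policy, safe_policy, dynamics, value_fn):
    """
    Least-restrictive safety filter over the augmented latent state (z, u).
    value_fn returns the learned safety value; negative means a failure or an
    out-of-distribution region is unavoidable from that state.
    """
    # 1. Imagine where the task policy's proposed action leads.
    a_task = task_policy(z)
    z_next, u_next = dynamics(z, a_task)
    if value_fn(z_next, u_next) >= 0:
        return a_task              # Safe and certain: let the task proceed.

    # 2. Otherwise, override with the learned safety policy.
    a_safe = safe_policy(z)
    z_next, u_next = dynamics(z, a_safe)
    if value_fn(z_next, u_next) >= 0:
        return a_safe              # Steer back toward familiar, safe states.

    # 3. Even the fallback leads somewhere the model cannot vouch for: stop.
    return None                    # HALT and hand control back to a human.
```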
Does It Work? Experiments & Results
The team tested UNISafe against standard methods (like LatentSafe, which ignores epistemic uncertainty) in three environments.
1. The Dubins Car (Simulation)
In this setup, a car must drive around a 2D plane while avoiding a “known” failure zone. However, the training data only covers specific parts of the map.

Look at Figure 4.
- LatentSafe (Blue): The model doesn’t know about the empty spaces in the map. It assumes they are safe. The “Approx. unsafe set” is too small, leading to a high False Positive Rate (FPR)—it thinks unsafe states are safe.
- UNISafe (Orange): It correctly identifies the “Unknown” regions (where data is sparse) as dangerous. The safety filter blocks the car from entering these unmapped areas, effectively creating a “fence” around the known safe world.
2. Block Plucking (Vision-based Manipulation)
A robotic arm must pluck a block from a stack without toppling it. This is tricky physics.

In Figure 5, notice the difference in foresight:
- LatentSafe: Waits until the very last second. By the time it realizes the stack is falling, the momentum is too great. The physics model hallucinated stability until it was too late.
- UNISafe: Detects that the action is moving the system toward a state the model doesn’t understand well (OOD). It triggers the red “Unsafe” warning before the failure becomes irreversible.
The authors also tested a “Hard” setting with different physics (friction/mass).

As seen in Figure 13, in the “Hard” setting, the standard task policy (No Filter) causes the block to fall. LatentSafe tries to intervene but does so clumsily, eventually dropping the block because it overestimates its own competence. UNISafe proposes a correction early, stabilizing the block safely.
3. Playing Jenga (Real Hardware)
The ultimate test: A human teleoperating a real Franka Emika robot to play Jenga. The human might try risky moves that the robot hasn’t been trained on.

Figure 6 shows the real-world performance.
- Panel 3: When the human operator tries a risky, unfamiliar move, UNISafe intervenes to keep the block within the “In-Distribution” safe zone.
- Panel 4: The graph at the bottom shows the uncertainty spiking. The moment the robot’s imagination goes OOD (the “Unknown” zone), the uncertainty crosses the threshold, and the filter activates.
Crucially, the system knows when to quit.

In Figure 8, we see the Halt mechanism.
- Top Row: The target block changes color, but it’s still similar enough to training data. The model remains confident (\(D(z)\) stays low), and operation continues.
- Bottom Row: The visual input is drastically different (major OOD). The model’s uncertainty skyrockets. The safety filter realizes that no action is safe because it’s effectively blind. It triggers a HALT to prevent damage.
The Numbers
The quantitative results reinforce the visual evidence.

In Figure 7, when filtering the task policy on hardware, LatentSafe had a failure rate of over 80%. UNISafe dropped this to under 10%. By acknowledging its own ignorance, the robot became significantly safer.
Conclusion
The “black box” nature of deep learning is often cited as a safety risk. If we don’t know why a neural network makes a decision, how can we trust it?
UNISafe offers a compelling answer: We don’t need to perfectly understand the black box, provided the black box knows when it is confused. By combining Generative World Models with Epistemic Uncertainty Quantification and Control Theory, we can build robots that are:
- Capable: Leveraging visual data and latent imagination.
- Humble: Recognizing when a situation is new and potentially dangerous.
- Safe: Using rigorous math to steer back to the familiar.
This research bridges the gap between the messy, data-driven world of modern AI and the rigorous, safety-critical world of robotics. As we push robots into more open-ended environments—from self-driving cars to home assistants—giving them the ability to say “I’m not sure, so I’m going to play it safe” is a massive step forward.
This post is based on the paper “Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures” by Junwon Seo, Kensuke Nakamura, and Andrea Bajcsy from Carnegie Mellon University.