When Robots Hallucinate: Making AI Safe in an Uncertain World
Imagine you are playing a high-stakes game of Jenga. You carefully tap a block, analyzing how the tower wobbles. You predict that if you pull it slightly to the left, the tower remains stable. If you pull it to the right, it crashes. Your brain is running a “world model”—simulating the physics of the tower to keep you safe from losing.
Now, imagine a robot trying to do the same thing. To handle complex visual data (like a Jenga tower seen through a camera), modern robots use latent world models. These are AI systems that compress high-dimensional camera images into compact representations and “imagine” future outcomes.
But there is a catch. AI models are only as good as their training data. If the robot encounters a situation it has never seen before—an “Out-of-Distribution” (OOD) scenario—its world model might hallucinate. It might confidently predict that knocking the tower over is perfectly safe, simply because it doesn’t understand the physics of that specific angle.
How do we stop robots from acting on these dangerous hallucinations?
In this post, we dive into the research paper “Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures”. We will explore how researchers at Carnegie Mellon University developed UNISafe, a framework that teaches robots to recognize when they are confused and proactively steer back to safety.
The Problem: The Confidence of Ignorance
To understand the solution, we first need to understand the architecture of modern robot learning.
1. The Latent World Model
Robots operating in the real world deal with massive amounts of data—pixels in a video feed. Processing raw pixels for every decision is computationally expensive and inefficient. Instead, researchers train a Latent World Model.
- Encoder: Compresses the image observation (\(o_t\)) into a compact latent vector (\(z_t\)).
- Dynamics Model: Predicts the next latent state (\(z_{t+1}\)) given an action (\(a_t\)).
This allows the robot to “dream” or “imagine” sequences of future states to plan its actions.
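To make this concrete, here is a minimal sketch of latent imagination in PyTorch. The class names, layer sizes, and the deterministic latent are illustrative simplifications, not the paper's world-model architecture:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Compresses an image observation o_t into a compact latent vector z_t."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )

    def forward(self, obs):
        return self.net(obs)

class LatentDynamics(nn.Module):
    """Predicts the next latent state z_{t+1} from (z_t, a_t)."""
    def __init__(self, latent_dim=32, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ELU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

def imagine(encoder, dynamics, obs, actions):
    """Roll out a candidate action sequence entirely in latent space."""
    z = encoder(obs)
    trajectory = [z]
    for a in actions:          # no rendering, no pixels: pure "dreaming"
        z = dynamics(z, a)
        trajectory.append(z)
    return trajectory
```

The key design choice is that planning happens entirely on the compact vectors \(z_t\); pixels are only touched once, at encoding time.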
2. The Safety Filter
Mere planning isn’t enough; we need guarantees. This is where Hamilton-Jacobi (HJ) Reachability comes in. It’s a control-theoretic method that calculates a “backward reachable set”—the set of all states from which a failure (like a crash) is inevitable, no matter what you do.
- Safety Value Function (\(V\)): A score representing how safe a state is (see the formula just after this list). If \(V < 0\), you are doomed.
- Safety Filter: If the robot’s intended action leads to a state where \(V\) is too low, the filter overrides it with a safe backup action.
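For intuition, the safety value function can be written as the worst future safety margin achievable under the best possible policy. A standard (undiscounted) form from the reachability literature, paraphrased here rather than quoted from the paper, is:

$$
V(z_0) \;=\; \max_{\pi} \; \min_{t \ge 0} \; \ell\big(z_t^{\pi}\big),
$$

where \(\ell\) is a safety margin that is negative inside the failure set. States with \(V(z_0) < 0\) are precisely those from which every policy eventually hits a failure, which is why the filter treats them as off-limits.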
The OOD Trap
Here is where the system breaks down. Classical safety filters assume the dynamics model is perfect. But learned world models are not.

As shown in Figure 2 above, consider a simple robot car (Dubins car) trying to avoid a purple failure region.
- True Outcome: If the car drives into the purple zone, it fails.
- Predicted Outcome: The world model hasn’t seen enough data about the purple zone. When it imagines driving there, it goes “Out-of-Distribution” (OOD). Instead of predicting a crash, it “hallucinates” that the car teleports to a safe area.
Because the model predicts safety (incorrectly), the standard safety filter lets the robot drive straight into disaster. The robot isn’t just wrong; it is confidently wrong.
The Solution: UNISafe
The core insight of the UNISafe (UNcertainty-aware Imagination for Safety filtering) framework is simple but powerful: Treat the unknown as a failure.
If the robot’s world model is highly uncertain about a future state, the safety filter should treat that state as just as dangerous as hitting a wall. To achieve this, the authors propose a three-step pipeline:

Let’s break down the three phases illustrated in Figure 1:
- Quantify Uncertainty: Measure how much the model “doesn’t know.”
- Calibrate: Determine a threshold for how much uncertainty is too much.
- Augment & Filter: Build a safety filter that avoids both known failures (collisions) and unknown failures (high uncertainty).
Step 1: Quantifying Epistemic Uncertainty
Not all uncertainty is created equal.
- Aleatoric Uncertainty: Inherent noise in the system (e.g., a slippery floor).
- Epistemic Uncertainty: Lack of knowledge (e.g., “I have never been in this room before”).
We want to detect Epistemic Uncertainty. To do this, the authors use an Ensemble of latent dynamics models. They train multiple independent models to predict the future.
- If the models agree, the data is likely familiar (In-Distribution).
- If the models disagree, the robot is in uncharted territory (OOD).
They measure this disagreement using the Jensen-Rényi Divergence (JRD):
$$
D(z_t, a_t) \;=\; H\!\Big(\tfrac{1}{N}\sum_{i=1}^{N} p_{\theta_i}(z_{t+1} \mid z_t, a_t)\Big) \;-\; \tfrac{1}{N}\sum_{i=1}^{N} H\big(p_{\theta_i}(z_{t+1} \mid z_t, a_t)\big),
$$

where \(H\) denotes the Rényi entropy and \(p_{\theta_i}\) is the next-state distribution predicted by the \(i\)-th ensemble member.
In this equation, the term \(D(z_t, a_t)\) represents the epistemic uncertainty. It effectively subtracts the average internal noise (aleatoric) from the total uncertainty of the mixture, leaving us with a pure measure of “model disagreement.”
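To make this tangible, here is a sketch of the disagreement score for an ensemble whose members each output a diagonal Gaussian over the next latent state. It uses the order-2 Rényi entropy, which has a closed form for Gaussian mixtures; the choice of order 2, the function names, and the diagonal-Gaussian assumption are illustrative simplifications rather than the paper's exact implementation:

```python
import numpy as np

def diag_gaussian_density(x, mean, var):
    """Density of a diagonal Gaussian N(mean, diag(var)) evaluated at point x."""
    d = x.shape[-1]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var)))
    quad = np.sum((x - mean) ** 2 / var)
    return np.exp(log_norm - 0.5 * quad)

def jensen_renyi_divergence(means, variances):
    """
    Order-2 Jensen-Renyi divergence of an equally weighted mixture of N
    diagonal Gaussians (one next-state prediction per ensemble member).
    means, variances: arrays of shape (N, d).
    """
    N, d = means.shape

    # Renyi-2 entropy of the mixture: -log of the integral of p(x)^2,
    # which has a closed form for Gaussian mixtures.
    pairwise = np.array([
        [diag_gaussian_density(means[i], means[j], variances[i] + variances[j])
         for j in range(N)]
        for i in range(N)
    ])
    h_mixture = -np.log(pairwise.mean())  # mean = (1/N^2) * double sum

    # Average Renyi-2 entropy of the individual members (the aleatoric part).
    h_members = 0.5 * (d * np.log(4 * np.pi) + np.sum(np.log(variances), axis=-1))

    return h_mixture - h_members.mean()

# Example: 5 ensemble members predicting an 8-dimensional latent state.
means = np.random.randn(5, 8)
variances = np.full((5, 8), 0.1)
print(jensen_renyi_divergence(means, variances))  # grows as the means disagree
```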
Step 2: Calibrating with Conformal Prediction
Okay, we have an uncertainty score. But is a score of 0.5 high? Is 100 high? Arbitrary thresholds are dangerous.
The authors use Conformal Prediction, a statistical technique that uses a calibration dataset to rigorously determine a threshold, \(\epsilon\).
$$
\mathbb{P}\big( D(z_t, a_t) \le \epsilon \big) \;\ge\; 1 - \alpha,
$$

where \(\alpha\) is a small, user-chosen miscoverage level.
This equation guarantees that for “normal” (in-distribution) data, the uncertainty will stay below \(\epsilon\) with a high probability (e.g., 95%). If the uncertainty crosses this line during operation, we can statistically assert that we are witnessing an Out-of-Distribution event.
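A minimal sketch of the calibration step, assuming we already have uncertainty scores computed on a held-out, in-distribution calibration set (the quantile rule below is the standard split conformal recipe; the function name and example data are illustrative):

```python
import numpy as np

def calibrate_threshold(calibration_scores, alpha=0.05):
    """
    Split conformal calibration: pick epsilon as the ceil((n + 1) * (1 - alpha)) / n
    empirical quantile of the held-out uncertainty scores. A fresh in-distribution
    score then exceeds epsilon with probability at most alpha.
    """
    n = len(calibration_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(calibration_scores, level, method="higher")

# Example: scores collected by running the world model on held-out, in-distribution data.
scores = np.random.rand(1000)                  # stand-in for D(z_t, a_t) values
eps = calibrate_threshold(scores, alpha=0.05)  # anything above eps is flagged as OOD
```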
Step 3: Uncertainty-Aware Reachability
This is the heart of the method. The researchers augment the robot’s latent state to include the uncertainty: \(\tilde{z} = (z, u)\).
They then redefine the Safety Margin Function (\(\ell\)). Usually, this function just asks, “Did I hit an obstacle?” Now, it asks, “Did I hit an obstacle OR is my uncertainty too high?”
$$
\ell_\Xi(\tilde{z}_t) \;=\; \min\big( \ell_z(z_t),\; \epsilon - u_t \big)
$$
Here, \(\ell_\Xi\) is the new safety margin. It takes the minimum of the physical safety (\(\ell_z\)) and the uncertainty margin (\(\epsilon - u_t\)).
Finally, they solve the Safety Bellman Equation in this new augmented space. This trains a Value Function (\(V\)) that learns to spot the "Point of No Return", not just for crashes, but for entering confusing situations.
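For intuition, the discounted safety Bellman backup commonly used to train such reachability value functions (a standard form from reinforcement-learning-based reachability, shown as a reference rather than the paper's exact objective) is:

$$
V(\tilde{z}) \;=\; (1-\gamma)\,\ell_\Xi(\tilde{z}) \;+\; \gamma \,\min\Big\{ \ell_\Xi(\tilde{z}),\ \max_{a}\, V\big(\tilde{f}(\tilde{z}, a)\big) \Big\},
$$

where \(\tilde{f}\) denotes the augmented latent dynamics and \(\gamma\) is a discount factor. Because \(\ell_\Xi\) already folds in the uncertainty margin, a state ends up with \(V < 0\) whenever every action sequence eventually either collides or drifts out of the model's comfort zone.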

Runtime Execution
When the robot is running, the safety filter watches the task policy (the robot’s main brain).

If the filter predicts that the proposed action will lead to a state where the value \(V\) is too low (meaning a crash or high uncertainty is inevitable), it intervenes.
The final logic, sketched in code below, looks like this:

- Check: Is the future safe and certain? \(\rightarrow\) Proceed.
- Else: Is the future dangerous or confusing? \(\rightarrow\) Override with the safety policy \(\pi_{\text{safe}}\).
- Critical Failure: Is even the safety policy uncertain? \(\rightarrow\) HALT.
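Here is a minimal sketch of that switching logic, assuming hypothetical `task_policy`, `safe_policy`, `dynamics`, and `value_fn` callables; it mirrors the three cases above rather than reproducing the paper's exact runtime rule:

```python
def filtered_action(z, u, task_policy, safe_policy, dynamics, value_fn):
    """
    Least-restrictive safety filter over the augmented latent state (z, u).
    value_fn returns the learned safety value; negative means a failure or an
    out-of-distribution region is unavoidable from that state.
    """
    # 1. Imagine where the task policy's proposed action leads.
    a_task = task_policy(z)
    z_next, u_next = dynamics(z, a_task)
    if value_fn(z_next, u_next) >= 0:
        return a_task              # Safe and certain: let the task proceed.

    # 2. Otherwise, override with the learned safety policy.
    a_safe = safe_policy(z)
    z_next, u_next = dynamics(z, a_safe)
    if value_fn(z_next, u_next) >= 0:
        return a_safe              # Steer back toward familiar, safe states.

    # 3. Even the fallback leads somewhere the model cannot vouch for: stop.
    return None                    # HALT and hand control back to a human.
```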
Does It Work? Experiments & Results
The team tested UNISafe against standard methods (like LatentSafe, which ignores epistemic uncertainty) in three environments.
1. The Dubins Car (Simulation)
In this setup, a car must drive around a 2D plane while avoiding a “known” failure zone. However, the training data only covers specific parts of the map.

Look at Figure 4.
- LatentSafe (Blue): The model doesn’t know about the empty spaces in the map. It assumes they are safe. The “Approx. unsafe set” is too small, leading to a high False Positive Rate (FPR)—it thinks unsafe states are safe.
- UNISafe (Orange): It correctly identifies the “Unknown” regions (where data is sparse) as dangerous. The safety filter blocks the car from entering these unmapped areas, effectively creating a “fence” around the known safe world.
2. Block Plucking (Vision-based Manipulation)
A robotic arm must pluck a block from a stack without toppling it. This is tricky physics.

In Figure 5, notice the difference in foresight:
- LatentSafe: Waits until the very last second. By the time it realizes the stack is falling, the momentum is too great. The physics model hallucinated stability until it was too late.
- UNISafe: Detects that the action is moving the system toward a state the model doesn’t understand well (OOD). It triggers the red “Unsafe” warning before the failure becomes irreversible.
The authors also tested a “Hard” setting with different physics (friction/mass).

As seen in Figure 13, in the “Hard” setting, the standard task policy (No Filter) causes the block to fall. LatentSafe tries to intervene but does so clumsily, eventually dropping the block because it overestimates its own competence. UNISafe proposes a correction early, stabilizing the block safely.
3. Playing Jenga (Real Hardware)
The ultimate test: A human teleoperating a real Franka Emika robot to play Jenga. The human might try risky moves that the robot hasn’t been trained on.

Figure 6 shows the real-world performance.
- Panel 3: When the human operator tries a risky, unfamiliar move, UNISafe intervenes to keep the block within the “In-Distribution” safe zone.
- Panel 4: The graph at the bottom shows the uncertainty spiking. The moment the robot’s imagination goes OOD (the “Unknown” zone), the uncertainty crosses the threshold, and the filter activates.
Crucially, the system knows when to quit.

In Figure 8, we see the Halt mechanism.
- Top Row: The target block changes color, but it’s still similar enough to training data. The model remains confident (\(D(z)\) stays low), and operation continues.
- Bottom Row: The visual input is drastically different (major OOD). The model’s uncertainty skyrockets. The safety filter realizes that no action is safe because it’s effectively blind. It triggers a HALT to prevent damage.
The Numbers
The quantitative results reinforce the visual evidence.

In Figure 7, when filtering the task policy on hardware, LatentSafe had a failure rate of over 80%. UNISafe dropped this to under 10%. By acknowledging its own ignorance, the robot became significantly safer.
Conclusion
The “black box” nature of deep learning is often cited as a safety risk. If we don’t know why a neural network makes a decision, how can we trust it?
UNISafe offers a compelling answer: We don’t need to perfectly understand the black box, provided the black box knows when it is confused. By combining Generative World Models with Epistemic Uncertainty Quantification and Control Theory, we can build robots that are:
- Capable: Leveraging visual data and latent imagination.
- Humble: Recognizing when a situation is new and potentially dangerous.
- Safe: Using rigorous math to steer back to the familiar.
This research bridges the gap between the messy, data-driven world of modern AI and the rigorous, safety-critical world of robotics. As we push robots into more open-ended environments—from self-driving cars to home assistants—giving them the ability to say “I’m not sure, so I’m going to play it safe” is a massive step forward.
This post is based on the paper “Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures” by Junwon Seo, Kensuke Nakamura, and Andrea Bajcsy from Carnegie Mellon University.