Introduction
Imagine a delivery drone navigating a busy city. It has been trained on thousands of hours of flight data—blue skies, clear landing pads, and standard obstacles. But today, the world throws a curveball. As the drone approaches its destination, it encounters a construction site that wasn’t there yesterday. There is a crane moving unpredictably, a person balancing on a ladder, and yellow caution tape fluttering in the wind.
To a human, the danger is obvious: “Don’t fly near the person on the ladder.” But to a classical robotic control system, these are just undefined obstacles or, worse, confusing sensor noise. This is the Out-of-Distribution (OOD) problem. The robot is in a situation it wasn’t explicitly trained for, and its standard operating procedures might lead to a catastrophic failure.
We are currently witnessing a revolution in Foundation Models (FMs), such as Large Language Models (LLMs) and Vision-Language Models (VLMs). These models possess the “common sense” to understand that a fire on a roof or a chemical spill is dangerous. However, there is a catch: they are slow. Querying a VLM can take seconds—an eternity for a drone falling out of the sky or a quadruped robot slipping on debris.
How do we combine the semantic intelligence of foundation models with the split-second reaction times required for robotic safety?
This is the question answered by the paper “Real-Time Out-of-Distribution Failure Prevention via Multi-Modal Reasoning.” The researchers introduce FORTRESS, a framework that bridges the gap between slow, high-level reasoning and fast, low-level control. In this post, we will break down how FORTRESS allows robots to “think” about safety ahead of time so they can “act” safely in real time.

The Core Problem: Intelligence vs. Speed
To understand why this research is significant, we need to look at the current landscape of robotic safety.
Traditionally, safety is handled via hard-coded constraints. You tell the robot, “Keep 2 meters away from all objects.” This is fast and verifiable, but it lacks nuance. It doesn’t know that staying 2 meters away from a concrete wall is fine, but staying 2 meters away from a chaotic fire might not be.
On the other end of the spectrum, we have Semantic Reasoning using models like GPT-4 or Gemini. You could show the model an image and ask, “Is it safe to land here?” The model might correctly say, “No, that is a construction zone.” However, integrating this into a control loop is dangerous because:
- Latency: If the robot needs to update its controls at 100Hz (100 times per second), it cannot wait for a 2-second API call.
- Hallucination: Generative models sometimes make things up.
- Dynamics-Agnostic: An LLM might suggest a safe landing spot that is physically impossible for the robot to reach without crashing.
FORTRESS (Failure Prevention in Real Time by Generating and Reasoning about Fallback Strategies) solves this by decoupling the reasoning (Slow) from the execution (Fast).
The FORTRESS Architecture
The framework operates on a “Slow-Fast” hierarchical approach, similar to human cognition—we have slow, conscious logical reasoning, and fast, instinctive muscle memory.

As shown in Figure 2 above, the system is split into two timelines:
- Nominal Operation (Low Frequency / Offline): While the robot is operating normally (or even before it deploys), it uses heavy foundation models to “daydream” about what could go wrong and identify safe fallback goals. It essentially pre-compiles semantic knowledge into mathematical functions.
- Runtime Response (High Frequency / Online): When a safety monitor triggers an alarm (e.g., “Something is wrong!”), the robot switches to a fallback mode. It uses the pre-computed functions to instantly generate a safe path, avoiding the semantic dangers identified earlier. (A minimal sketch of this two-rate structure follows the list.)
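To make the decoupling concrete, here is a minimal Python sketch of the two-rate structure. Everything in it (function names, rates, the random “sensor” reading) is an illustrative placeholder rather than the paper’s implementation; the point is that the slow loop only ever writes to a cache, and the fast loop only ever reads from it.

```python
import threading
import time
import random

# Illustrative placeholders -- not the paper's actual APIs.
def query_vlm_for_fallback_goals():
    """Stands in for a slow VLM query (seconds per call in reality)."""
    return [(random.uniform(-5, 5), random.uniform(-5, 5), 0.0)]  # (x, y, z) goals

def anomaly_detected(observation):
    """Stands in for the fast runtime safety monitor."""
    return observation > 0.95

goal_cache, cache_lock = [], threading.Lock()

def slow_reasoner(period_s=2.0):
    """Low-frequency 'thinking' loop: refresh the cache of fallback goals."""
    while True:
        goals = query_vlm_for_fallback_goals()
        with cache_lock:
            goal_cache[:] = goals
        time.sleep(period_s)

def fast_control_loop(rate_hz=100, duration_s=5.0):
    """High-frequency 'acting' loop: only ever reads the pre-computed cache."""
    dt, t_end = 1.0 / rate_hz, time.time() + duration_s
    while time.time() < t_end:
        obs = random.random()                      # placeholder sensor reading
        if anomaly_detected(obs) and goal_cache:
            with cache_lock:
                goal = goal_cache[0]
            print("Failure detected -> fallback goal:", goal)
        time.sleep(dt)

threading.Thread(target=slow_reasoner, daemon=True).start()
fast_control_loop()
```

Because the fast loop never blocks on a foundation-model call, a triggered fallback can react within a single control cycle.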
Let’s break down the three main modules of this process: Generating Goals, Reasoning about Safety, and Synthesizing Plans.
Module 1: Generating Fallback Goals (M1)
When a robot detects a failure, it needs a “Plan B.” It cannot just freeze in mid-air. It needs a specific location to go to—a fallback goal.
Abstract instructions like “land on a roof” are useless to a flight controller; it needs \((x, y, z)\) coordinates. FORTRESS uses Vision-Language Models (VLMs) like Molmo to translate semantic desires into physical coordinates.

As seen in Figure 3, the system feeds the VLM an image of the scene and a prompt like “Point to empty, horizontal building roofs.” The VLM returns pixel coordinates, which are then converted into 3D world coordinates using the robot’s depth sensors.
Why do this proactively? Because querying the VLM takes time. By running this process at a low frequency (e.g., every few seconds) during normal operation, the robot maintains a fresh cache of “emergency exits.” If a failure occurs, it simply grabs the nearest valid coordinate from the cache instantly.
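As a concrete sketch of the pixel-to-world step, the snippet below deprojects a pointed pixel through a standard pinhole camera model using a depth reading and a known camera pose. The intrinsics, pose, and pixel values are made up for illustration; the post does not specify the exact projection pipeline FORTRESS uses.

```python
import numpy as np

def pixel_to_world(u, v, depth_m, K, T_world_cam):
    """Deproject a pixel (u, v) with a measured depth into 3D world coordinates.

    K           : 3x3 camera intrinsic matrix
    T_world_cam : 4x4 homogeneous transform from camera frame to world frame
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project the pixel into a camera-frame point scaled by the depth reading.
    p_cam = np.array([(u - cx) * depth_m / fx,
                      (v - cy) * depth_m / fy,
                      depth_m,
                      1.0])
    return (T_world_cam @ p_cam)[:3]

# Example: the VLM "points" at pixel (412, 230) on a roof; the depth sensor reads 18.5 m.
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
T = np.eye(4)                       # identity camera pose, for illustration only
goal_xyz = pixel_to_world(412, 230, 18.5, K, T)
print("Cached fallback goal:", goal_xyz)
```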
Module 2: Reasoning About Semantic Safety (M2)
This is perhaps the most innovative part of the paper. How does a robot know that a “Person on a ladder” is dangerous if it has never seen one before?
The researchers use a technique called Failure Mode Anticipation. Instead of hard-coding a list of dangers, they ask an LLM to generate them.
Step 1: Anticipating Failure Modes
The robot provides an LLM with a description of its environment (e.g., “Drone in a city”). The LLM generates a list of potential semantic failure modes \(\Phi\), such as the following (a minimal prompt sketch appears after the list):
- “Worker Injury”
- “High Temperature”
- “Cable Entanglement”
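Here is a minimal sketch of that prompt, using the OpenAI Python client. The model choice and prompt wording are assumptions for illustration; the paper’s exact prompts are not reproduced in this post.

```python
from openai import OpenAI  # assumes the `openai` package and an API key are configured

client = OpenAI()

def anticipate_failure_modes(environment: str, n: int = 20) -> list[str]:
    """Ask an LLM for short semantic failure-mode phrases for this environment."""
    prompt = (
        f"A robot is operating in the following environment: {environment}. "
        f"List {n} short phrases describing semantic failure modes or hazards "
        "it should avoid, one per line, with no numbering."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

print(anticipate_failure_modes("delivery drone flying over a city"))
```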
Step 2: Embedding the Danger
Computers can’t calculate the distance to the concept of “Worker Injury.” But they can calculate distance in Embedding Space.
Embeddings allow us to represent text or images as vectors of numbers. Concepts that are semantically similar end up close together in this vector space. The system computes embeddings for the generated failure modes (\(e_\phi\)) and for “safe” normal data (\(\Omega_s\)).
To check if a new observation is dangerous, the system measures the Cosine Similarity between the current observation’s embedding (\(e_i\)) and the failure mode embedding (\(e_\phi\)):

\[ \text{sim}(e_i, e_\phi) = \frac{e_i \cdot e_\phi}{\lVert e_i \rVert \, \lVert e_\phi \rVert} \]
If the current scene is semantically closer to “Worker Injury” than it is to “Safe Normal Operations,” the system flags it.
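A minimal sketch of this check, assuming an off-the-shelf text embedding model (here OpenAI’s `text-embedding-3-small`, one of the model families evaluated later in the post) and hand-written captions standing in for the robot’s current observation and its “safe” reference:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Embed a list of strings; the embedding model choice is illustrative."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

failure_modes = ["worker injury", "high temperature", "cable entanglement"]
e_phi = embed(failure_modes)                                        # failure-mode embeddings
e_obs = embed(["person standing on a ladder next to scaffolding"])[0]  # current scene
e_safe = embed(["normal safe drone operation over clear terrain"])[0]  # safe reference

for phi, e in zip(failure_modes, e_phi):
    s_fail = cosine_sim(e_obs, e)      # how close the scene is to this failure mode
    s_safe = cosine_sim(e_obs, e_safe) # how close the scene is to "safe" operation
    flag = "FLAG" if s_fail > s_safe else "ok"
    print(f"{phi}: sim_to_failure={s_fail:.3f}  sim_to_safe={s_safe:.3f}  -> {flag}")
```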
Step 3: Calibrating the Threshold
How close is too close? The researchers use Conformal Prediction to set a rigorous mathematical threshold:

\[ \Delta_\phi = \operatorname{Quantile}_{1-\epsilon}\big(\{\, \text{sim}(e_j, e_\phi) : e_j \in \Omega_s \,\}\big) \]
In simple terms, this equation looks at the “safe” training data and finds a similarity score threshold (\(\Delta_\phi\)) such that the vast majority (e.g., 95%) of safe data falls below it.
Finally, this creates a Semantic Safety Cost Function \(\theta_\phi(x)\):

\[ \theta_\phi(x) = \text{sim}\big(e_i(x), e_\phi\big) - \Delta_\phi \]
If \(\theta_\phi(x) > 0\), the state \(x\) is considered semantically unsafe. This function is fast to evaluate. It transforms the abstract, fuzzy fear of “danger” into a concrete number the robot can use in math equations.
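The sketch below shows the calibration and the resulting cost function in simplified form: it uses a plain empirical quantile over similarity scores from “safe” calibration data (standing in for the paper’s conformal prediction step), and the numbers are synthetic, for illustration only.

```python
import numpy as np

def calibrate_threshold(safe_sims, epsilon=0.05):
    """Pick Delta_phi so roughly (1 - epsilon) of safe data scores below it.
    (A simplified empirical quantile standing in for conformal prediction.)"""
    return float(np.quantile(np.asarray(safe_sims), 1.0 - epsilon))

def semantic_safety_cost(sim_to_failure, delta_phi):
    """theta_phi(x): positive means semantically unsafe for this failure mode."""
    return sim_to_failure - delta_phi

# Similarities of "safe" calibration observations to the "worker injury" embedding
# (synthetic numbers, not from the paper).
safe_sims = np.random.default_rng(0).uniform(0.05, 0.35, size=200)
delta = calibrate_threshold(safe_sims, epsilon=0.05)

print("Delta_phi =", round(delta, 3))
print("cost(sim=0.62) =", round(semantic_safety_cost(0.62, delta), 3), "-> unsafe if > 0")
print("cost(sim=0.20) =", round(semantic_safety_cost(0.20, delta), 3))
```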

Figure 4 illustrates this pipeline. On the left, the “slow” reasoner imagines failure modes. On the right, the “fast” runtime monitor uses those embeddings to instantly classify a “Person on a Ladder” as a hazard, distinguishing it from a generic (and safe) “Person” or “Ladder.”
Module 3: Synthesizing Fallback Plans (M3)
Now the robot has a list of physical goals (from M1) and a mathematical function that screams “HIGH COST” if a path gets too close to a semantic danger (from M2).
When a fallback is triggered, FORTRESS must generate a trajectory \(\tau\) (a sequence of states) that reaches the goal region while keeping the safety cost below zero. This is formulated as a Reach-Avoid Optimization problem:

\[ \tau^* = \arg\min_{\tau = (x_1, \dots, x_T)} \; \max_{x_k \in \tau} \theta_h(x_k) \quad \text{s.t.} \quad x_1 = b, \quad x_T \in \mathcal{B}_\rho(g), \quad x_{k+1} = f(x_k, u_k) \]
The equation seeks a trajectory \(\tau^*\) that minimizes the maximum danger encountered along the path (\(\max \theta_h(x)\)).
- Constraint 1: Start where the robot is (\(x_1 = b\)).
- Constraint 2: Eventually reach the goal region (\(\mathcal{B}_\rho(g)\)).
- Constraint 3: Obey physics (system dynamics \(f\)).
To solve this in real-time, the authors use RRT (Rapidly-exploring Random Trees) guided by the semantic cost functions, followed by a trajectory smoothing controller (like LQR or MPC). Because the cost functions are just geometric checks in the embedding space (which is pre-calibrated), this planning happens in milliseconds.
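The following is a heavily simplified 2D sketch of the idea: geometric hazard zones stand in for the embedding-based cost \(\theta_\phi\), dynamics are reduced to straight-line steering, and instead of solving the full min-max problem the planner simply rejects nodes where the cost is positive. It illustrates the reach-avoid structure, not the paper’s planner.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hazards as (center, radius): theta(x) > 0 inside a hazard's zone.
# This stands in for the embedding-based semantic cost from Module 2.
hazards = [(np.array([5.0, 5.0]), 2.0), (np.array([2.0, 7.0]), 1.5)]

def theta(x):
    return max(r - np.linalg.norm(x - c) for c, r in hazards)

def rrt_reach_avoid(start, goal, goal_radius=0.5, step=0.5, iters=5000, lo=0.0, hi=10.0):
    """Grow a tree toward the goal region, rejecting nodes where theta(x) > 0.
    Dynamics are simplified to straight-line steering (single integrator)."""
    nodes = [np.asarray(start, dtype=float)]
    parents = [0]
    for _ in range(iters):
        sample = goal if rng.random() < 0.1 else rng.uniform(lo, hi, size=2)
        i = int(np.argmin([np.linalg.norm(n - sample) for n in nodes]))
        diff = sample - nodes[i]
        new = nodes[i] + step * diff / (np.linalg.norm(diff) + 1e-9)
        if theta(new) > 0:            # would enter a semantically unsafe region
            continue
        nodes.append(new)
        parents.append(i)
        if np.linalg.norm(new - goal) < goal_radius:   # reached the goal ball
            path, j = [new], len(nodes) - 1
            while j != 0:
                j = parents[j]
                path.append(nodes[j])
            return path[::-1]
    return None

path = rrt_reach_avoid(start=[0.0, 0.0], goal=np.array([9.0, 9.0]))
print("no safe path found" if path is None else f"fallback path with {len(path)} waypoints")
```

A smoothing controller (such as LQR or MPC, as the authors use) would then track the resulting waypoints.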
Experiments and Results
The researchers validated FORTRESS across three distinct domains: synthetic data analysis, simulation, and real-world hardware.
1. Can Embeddings Detect “Semantic” Danger?
The first hypothesis was that embedding models (calibrated with generated failure modes) are better at detecting unsafe situations than naively prompting an LLM on the fly.
They tested various embedding models (like OpenAI’s text-embedding-3, Voyage AI, etc.) on datasets involving drones, boats, and autonomous vehicles.

The ROC curves in Figure 7 show the trade-off between True Positive Rate (catching real dangers) and False Positive Rate (false alarms). The results are impressive: robust models like OpenAI (purple line) and Voyage AI (gray line) achieve very high accuracy (Area Under the Curve above 0.90). This shows that the calibrated embedding approach reliably distinguishes safe scenes from unsafe ones.
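For readers who want to reproduce this style of evaluation, the sketch below computes an ROC curve and AUC from per-scene detector scores (for example, the maximum similarity margin over failure modes) using scikit-learn. The scores here are synthetic, not the paper’s data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Detector score per scene, e.g. max over failure modes of (similarity - Delta_phi).
# Synthetic scores for illustration: unsafe scenes tend to score higher.
safe_scores = rng.normal(loc=-0.10, scale=0.08, size=300)    # label 0 (safe)
unsafe_scores = rng.normal(loc=0.12, scale=0.10, size=300)   # label 1 (unsafe)

scores = np.concatenate([safe_scores, unsafe_scores])
labels = np.concatenate([np.zeros(300), np.ones(300)])

auc = roc_auc_score(labels, scores)
fpr, tpr, thresholds = roc_curve(labels, scores)
print(f"AUC = {auc:.3f}")  # higher is better; the strongest embedding models exceed 0.90
```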
Interestingly, they found that generating more failure modes (up to ~50) improved accuracy, as shown below. The models need a rich vocabulary of “what can go wrong” to be effective.

2. Simulation Results (CARLA)
The team tested the full pipeline in the CARLA urban simulator. A drone had to execute an emergency landing in a city filled with hazards like firetrucks, crowds, and traffic.
They compared FORTRESS against:
- AESOP: A baseline that detects anomalies but plans naive paths to fixed goals.
- Safe-Lang: An approach that uses language for safety but relies on simpler object avoidance.

The results (Figure 6) were stark. FORTRESS achieved a 92.2% success rate, compared to 40% for AESOP. Why? AESOP would often plan a path that was geometrically possible but semantically disastrous (e.g., landing on a roof occupied by people). FORTRESS recognized the semantic attributes of the “people” and “fire” as high-cost zones and planned around them.
3. Real-World Hardware
Finally, they deployed the system on a quadrotor drone and an ANYmal quadruped robot.
In one experiment, the ANYmal robot navigated a room under construction. The environment contained ladders, debris, and spills.

Figure 13 showcases the “Semantic Awareness” in action.
- Left: A person on a ladder is detected. The system maps this specifically to the “Worker Injury” failure mode.
- Right: Debris and spills are mapped to “Chemical Spill” or “Uncontrolled Access.”
Crucially, this detection happened in real-time. The heavy lifting of defining “Worker Injury” was done beforehand. During the run, the robot simply matched the camera stream to that pre-computed concept.
The researchers also demonstrated the system’s ability to handle dynamic replanning. In the sequence below, a drone initially plans to land on the ground. However, a “skydiver” (a moving semantic hazard) enters the airspace. The system detects the new high-cost region and instantly replans to a secondary goal (a building roof).

Conclusion
The “Open World” is messy. Robots cannot be pre-programmed for every possible combination of construction sites, weather events, or human behaviors. FORTRESS represents a significant step forward in making robots robust to this chaos.
The key takeaway is the power of hybrid architectures:
- Use Foundation Models for what they are good at: Broad, semantic reasoning and “imagining” potential failures (Slow Thinking).
- Use Control Theory for what it is good at: Fast, geometric precision and guarantee-based planning (Fast Thinking).
- Bridge them with Embeddings: Use the vector space as the common language between the two.
By caching the “wisdom” of large models into fast-to-compute cost functions, FORTRESS allows robots to exhibit common sense without sacrificing the reflexes needed to stay safe. As we push autonomous systems further into unstructured environments—from delivery drones to search-and-rescue quadrupeds—frameworks like this will be essential to ensure they don’t just work when things go right, but also recover safely when things go wrong.
For those interested in the computational specifics, the hardware experiments (using a Jetson Nano) showed that while the “slow” reasoning queries took several seconds, the “fast” safety inference ran in roughly 12 ms and full fallback planning in about 1.28 s, making the approach viable for real-world deployment.
