Introduction

Imagine unboxing a new robot, turning it on, and telling it to “go to the kitchen.” Thanks to recent advancements in foundation models and Vision-Language Models (VLMs), this is becoming a reality. Robots can now understand high-level instructions and navigate through environments they have never seen before.

However, there is a catch. While these modern AI models are excellent at understanding where to go based on visual context, they often lack a precise understanding of physical geometry. They might recognize a path to the kitchen but fail to notice the small cardboard box left in the hallway or a chair leg protruding from under a table. The result? Collisions.

Traditionally, fixing this required expensive sensors (like LiDAR) or extensive retraining of the neural networks on massive datasets containing every possible obstacle type. But what if we could make these “blind” navigation models safer without retraining them and without adding new hardware?

In this post, we dive into CARE (Collision Avoidance via Repulsive Estimation), a method introduced in a paper presented at CoRL 2025. CARE is a plug-and-play module that acts as a safety layer for existing visual navigation models. By combining monocular depth estimation with a physics-inspired repulsive force method, CARE allows robots to dodge obstacles in real time, even objects they have never seen before, significantly reducing collisions in out-of-distribution environments.

Comparison of trajectory outputs under out-of-distribution (OOD) obstacle settings.

The Problem: Great Generalization, Poor Safety

To understand why CARE is necessary, we first need to look at the state of current visual navigation. Models like ViNT (Visual Navigation Transformer) and NoMaD are “foundation models” for robotics. They take RGB images from a simple camera and output waypoints or trajectories for the robot to follow.

These models are trained on diverse datasets, allowing them to generalize well. They can navigate a sidewalk, an office, or a home without needing a map. However, they operate primarily on appearance-based reasoning. They learn that “floor looks traversable” and “walls look solid.”

The limitation arises when these models face Out-of-Distribution (OOD) scenes—environments that look different from their training data. This could be:

  1. Unseen Objects: A random box, a pile of laundry, or strange furniture.
  2. Different Camera Setups: Changing the field of view (FOV) or the height of the camera on the robot.

As shown in Figure 1 above, standard models (Panel c) often generate trajectories that clip or intersect with obstacles because they don’t explicitly calculate “how far away is that object?” They simply predict a path based on visual patterns. When the visual pattern is unfamiliar, the robot crashes.

The Solution: The CARE Framework

The researchers propose CARE as an “attachable” module. It sits between the navigation model and the robot’s motor controller. It doesn’t replace the advanced AI that knows how to get to the goal; instead, it acts as a reflex system that nudges the robot away from immediate danger.

The beauty of CARE is its simplicity and compatibility. It requires:

  • No new sensors: It uses the same RGB camera as the navigation model.
  • No fine-tuning: You don’t need to retrain the massive foundation model.
  • No heavy compute: It runs in real-time alongside the navigation policy.

How It Works

The CARE framework operates in a three-stage pipeline: Top-Down Range Estimation, Repulsive Force Estimation, and Safety-Enhancing Trajectory Adjustment.

Overview of CARE system architecture.

As illustrated in Figure 2, the system takes the RGB image and feeds it into two parallel streams. One stream is the original vision-based model (like ViNT), which says, “I want to go there.” The other stream is the CARE module, which says, “Wait, there’s something in the way; let’s adjust.”

Let’s break down the mathematical and logical steps of this pipeline.
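
To make the data flow concrete, here is a minimal Python sketch of one control cycle. It is an illustrative skeleton, not the authors’ implementation: the model interfaces (predict_waypoints, infer) and the helper functions are hypothetical names, and the helpers are sketched stage by stage in the sections below.

```python
def care_step(rgb_image, nav_model, depth_model, intrinsics):
    """One control cycle of a CARE-style pipeline (illustrative sketch)."""
    # Stream 1: the frozen navigation policy proposes a trajectory of waypoints.
    trajectory = nav_model.predict_waypoints(rgb_image)        # (K, 2) waypoints

    # Stream 2: CARE estimates scene geometry from the very same image.
    depth_map = depth_model.infer(rgb_image)                   # dense per-pixel depth
    obstacles = depth_to_top_down(depth_map, *intrinsics)      # Stage 1
    theta_rep = repulsive_adjustment(trajectory, obstacles)    # Stage 2
    return rotate_trajectory(trajectory, theta_rep)            # Stage 3
```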

Stage 1: Seeing Geometry from a Single Image

Since the system relies on a standard camera, it lacks the direct distance measurements provided by LiDAR or depth cameras. To solve this, CARE uses a pretrained monocular depth estimation model (specifically UniDepthV2). This model looks at a flat 2D image and predicts a dense depth map, estimating how far away every pixel is.

Once the depth map is generated, CARE projects this data into a 3D point cloud and then flattens it into a top-down local map.

Top-down projection of estimated depth.

In Figure 3a, you can see this transformation. The left image is what the robot sees (a room with chairs). The right image is the top-down map generated by CARE. The black circle is the robot, and the colored dots represent obstacles detected from the depth map.
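
To give a feel for how this projection could be implemented, here is a minimal NumPy sketch that back-projects a depth map with pinhole intrinsics and keeps a top-down (x, z) slice. The intrinsics, the camera height, and the thresholds are illustrative assumptions, not values from the paper.

```python
import numpy as np

def depth_to_top_down(depth_map, fx, fy, cx, cy,
                      camera_height=0.3, max_obstacle_height=1.0, max_range=5.0):
    """Back-project a dense depth map and keep a top-down (x, z) obstacle slice."""
    h, w = depth_map.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    z = depth_map                  # forward distance
    x = (u - cx) * z / fx          # lateral offset (right positive)
    y = (v - cy) * z / fy          # vertical offset (image y grows downward)

    # Height above the floor, assuming a level camera mounted at camera_height.
    height = camera_height - y

    # Keep points in range that sit in the "obstacle band": above the floor
    # plane but below ceiling height.
    valid = (z > 0.1) & (z < max_range) & \
            (height > 0.05) & (height < max_obstacle_height)
    return np.stack([x[valid], z[valid]], axis=-1)   # (N, 2) top-down points
```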

To filter out noise (like the ceiling or distant walls), the system discretizes the x-axis (width) into bins and selects the closest point (\(z^*\)) in each bin:

Equation for finding the closest obstacle point.

This results in a clean set of obstacle coordinates \(\mathcal{O}\) relative to the robot’s position.
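
In notation, one way to write this selection (a sketch based on the description above, not the paper’s exact equation) is as follows, with \(B_b\) denoting the set of top-down points whose lateral coordinate falls in bin \(b\), and \(x_b\) the bin’s representative lateral coordinate:

\[
z_b^{*} = \min_{(x,\, z) \in B_b} z,
\qquad
\mathcal{O} = \{\, (x_b, z_b^{*}) \mid B_b \neq \emptyset \,\}.
\]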

Stage 2: The Physics of Repulsion

Now that the robot knows where the obstacles are, how does it avoid them? CARE borrows a concept from classical robotics called Artificial Potential Fields (APF).

Imagine the robot’s goal is a magnet pulling it forward (attractive force), and every obstacle is a magnet pushing it away (repulsive force).

CARE looks at the trajectory proposed by the navigation model (\(\mathbf{p}_k\)) and calculates the repulsive force exerted by every detected obstacle (\(\mathbf{o}_m\)). The formula follows an inverse-square law: the closer an obstacle is, the more strongly it pushes the robot away, with the push growing rapidly as the distance shrinks.

Equation for repulsive force calculation.
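
The paper’s exact formulation is shown above; to convey the structure, a generic APF-style inverse-square repulsion acting on waypoint \(\mathbf{p}_k\) from obstacle \(\mathbf{o}_m\) (with an illustrative gain \(\eta\) that is an assumption here) looks like:

\[
\mathbf{f}_{k,m} = \eta\, \frac{\mathbf{p}_k - \mathbf{o}_m}{\lVert \mathbf{p}_k - \mathbf{o}_m \rVert^{3}},
\qquad
\mathbf{f}_k = \sum_{m} \mathbf{f}_{k,m}.
\]

The numerator points away from the obstacle, and dividing by the cubed distance leaves a unit direction scaled by \(1/\lVert \mathbf{p}_k - \mathbf{o}_m \rVert^{2}\), which is the inverse-square falloff.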

The system identifies which point along the planned path is experiencing the strongest repulsive force—essentially finding the “most dangerous” part of the trajectory:

Equation for finding the waypoint with maximum repulsion.

Based on the direction of this force, CARE calculates an adjustment angle, \(\theta_{rep}\).
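
Putting the last two steps together, here is a minimal NumPy sketch of how the per-waypoint repulsion, the most-threatened waypoint, and the adjustment angle could be computed. The gain, the frame convention (x lateral, z forward), and the exact angle definition are illustrative assumptions rather than the paper’s choices.

```python
import numpy as np

def repulsive_adjustment(trajectory, obstacles, gain=1.0, eps=1e-6):
    """Estimate a steering angle from APF-style repulsion along a proposed path.

    trajectory: (K, 2) waypoints in the robot frame (x lateral, z forward)
    obstacles:  (M, 2) top-down obstacle points in the same frame
    """
    # Vectors from every obstacle to every waypoint: shape (K, M, 2).
    diff = trajectory[:, None, :] - obstacles[None, :, :]
    dist = np.linalg.norm(diff, axis=-1) + eps

    # Inverse-square repulsion pointing away from each obstacle, summed per waypoint.
    forces = gain * diff / dist[..., None] ** 3
    net_force = forces.sum(axis=1)                                # (K, 2)

    # The waypoint experiencing the strongest net repulsion: the "most
    # dangerous" part of the trajectory.
    k_star = np.argmax(np.linalg.norm(net_force, axis=-1))

    # Steering adjustment: the lateral-vs-forward direction of that force.
    f_x, f_z = net_force[k_star]
    return np.arctan2(f_x, f_z)
```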

Stage 3: Trajectory Adjustment and Safe-FOV

Instead of discarding the original path, CARE rotates it. It applies the rotation angle \(\theta_{rep}\) (clipped to a maximum safe limit) to the entire trajectory. This effectively steers the robot around the obstacle while trying to maintain the general direction of the original goal.
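
A rotation of the whole path by the clipped angle can be sketched as follows; the clipping limit theta_max is an assumed value, not a parameter reported in the paper, and the (x, z) convention matches the sketch above.

```python
import numpy as np

def rotate_trajectory(trajectory, theta_rep, theta_max=np.radians(30)):
    """Rotate the proposed path around the robot by a clipped steering angle."""
    theta = np.clip(theta_rep, -theta_max, theta_max)
    x, z = trajectory[:, 0], trajectory[:, 1]

    # Rotating about the origin pivots the path around the robot itself,
    # preserving its shape while steering it toward +x for positive theta.
    x_new = x * np.cos(theta) + z * np.sin(theta)
    z_new = -x * np.sin(theta) + z * np.cos(theta)
    return np.stack([x_new, z_new], axis=-1)
```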

Figure 3b (in the image panel above) visualizes this elegantly. The yellow path is the original trajectory heading straight for a collision. The vectors show the repulsive force from the obstacle. The purple path is the result: a safe curve around the danger.

The “Safe-FOV” Mechanism

A major risk with camera-based navigation is the limited field of view (FOV). If a robot turns sharply to avoid a box, it might turn blindly into a wall that was previously outside the camera’s frame.

To prevent this, CARE implements a Safe-FOV rule. It checks the desired heading change (\(\theta_{des}\)).

Equation for Safe-FOV motion control.

  • If the turn is sharp (\(|\theta_{des}| > \theta_{thres}\)): The robot stops moving forward (\(v=0\)) and only rotates in place. This allows the camera to pan over to the new area, revealing any hidden obstacles before the robot commits to moving there.
  • If the turn is gentle: The robot moves forward and turns simultaneously (standard steering).
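
Here is a minimal sketch of this rule as a velocity command, assuming a differential-drive style (v, ω) interface; the threshold, nominal speed, and angular gain are illustrative assumptions, not values from the paper.

```python
import numpy as np

def safe_fov_command(theta_des, v_nominal=0.2, omega_gain=1.0,
                     theta_thres=np.radians(45)):
    """Convert the desired heading change into (v, omega) commands."""
    if abs(theta_des) > theta_thres:
        # Sharp turn: rotate in place first so the camera can sweep the
        # unseen region before the robot drives into it.
        v = 0.0
    else:
        # Gentle turn: drive and steer at the same time.
        v = v_nominal
    omega = omega_gain * theta_des
    return v, omega
```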

Experimental Validation

The researchers didn’t just test this in simulation; they deployed CARE on three different real-world robot platforms with varying heights, camera angles, and wheelbases: LoCoBot, TurtleBot4, and RoboMaster S1.

Mobile robot platforms used for evaluation.

They conducted two primary types of experiments: Undirected Exploration and Image Goal-Conditioned Navigation.

Experiment 1: Undirected Exploration

In this test, the robots were placed in a confined space (\(3.5m \times 2.8m\)) filled with random boxes (unseen obstacles). The goal was simply to wander around as much as possible without crashing.

Diagram of exploration and navigation tasks.

The results were dramatic. Without CARE, the baseline model (NoMaD) frequently crashed into the boxes almost immediately. With CARE, the robots navigated safely for significantly longer distances.

Graph comparing distance traveled before collision.

As shown in Figure 5 and Table 1 below, the LoCoBot equipped with CARE traveled over 21 meters on average before a collision, compared to just 2 meters without it. That is a 10.7x improvement.

Table of mean distance traveled before collision.

The performance varied by robot—TurtleBot4 saw less improvement than LoCoBot. The authors attribute this to the camera setup. The LoCoBot has a wide-angle fisheye lens, allowing CARE to see obstacles even in its periphery. The TurtleBot4 has a narrower view, meaning obstacles could slip out of frame and cause collisions during turns.

Experiment 2: Reaching the Goal

The second experiment was more structured. The robots had to travel down a 24-meter corridor using a topological graph (a series of image landmarks) to reach a destination. The researchers cluttered the hallway with random obstacles that were not in the map.

Photos of the test environment hallways.

The metrics focused on specific trade-offs: Does safety make the robot slower? Does it make the path longer?

Table comparing navigation performance metrics.

Table 2 reveals the key takeaways:

  1. Safety First: CARE significantly reduced collision counts. For the LoCoBot, it achieved 0 collisions across the board.
  2. Arrival Rate: The success rate (reaching the goal) skyrocketed. For TurtleBot4 running the ViNT model, the arrival rate went from 70% to 100%.
  3. Efficiency Cost: The path length and completion time increased slightly (by roughly 4-20%). This is expected behavior—taking a detour to avoid a box is naturally longer than crashing into it.

Handling Dynamic Obstacles

Perhaps the most impressive test involved dynamic obstacles—specifically, humans jumping in front of the robot. Standard models usually fail here because they don’t react fast enough to sudden changes in geometry.

Table showing collision rates with dynamic obstacles.

Table 3 shows the results of this “stress test.”

  • Corner-appear: A person jumps out from a corner.
  • Front-approach: A person walks directly at the robot.

In almost every baseline trial (NoMaD/ViNT), the robot collided with the human. With CARE, the collision rate dropped to 0/10. The repulsive force estimation reacts instantly to the depth change caused by the person’s legs, and the Safe-FOV mechanism forces the robot to stop or turn immediately.

Conclusion & Implications

The CARE paper highlights a critical gap in the current era of “foundation models” for robotics: generalization does not equal safety. While large models can understand context, they often struggle with precise spatial awareness in novel environments.

CARE offers a compelling solution because it bridges the gap between high-level learning and low-level reactive control.

  1. It is accessible: You don’t need a $10,000 LiDAR sensor.
  2. It is transferable: The same module worked on three different robots.
  3. It is effective: Up to 100% collision reduction in specific tasks.

By integrating explicit depth estimation and physics-based repulsion, CARE gives visual navigation models the “reflexes” they need to survive in the real world. For students and researchers in robotics, this underscores the importance of hybrid systems—combining deep learning with classical robotics principles—to create machines that are not just smart, but also safe.