Visual navigation is one of the holy grails of robotics. We want robots to enter a new environment, look around, and navigate to a goal just like a human would. However, the current dominant paradigm—using raw images to estimate control—has a significant flaw: it is obsessed with the robot’s specific perspective.

If you train a robot to navigate a hallway using a camera mounted at eye level, and then you lower that camera to knee height (simulating a smaller robot), the navigation system often breaks completely. The images look different, even though the geometry of the hallway hasn’t changed.

In this deep dive, we are exploring a fascinating paper titled “ObjectReact: Learning Object-Relative Control for Visual Navigation.” The researchers propose a shift from image-relative navigation to object-relative navigation. Instead of memorizing pixels, the robot learns to react to the objects in its environment.

By the end of this post, you will understand how representing the world as a “WayObject Costmap” allows robots to take shortcuts, reverse their paths, and swap physical bodies without losing their way.

The Problem with Image-Relative Navigation

To understand the innovation here, we first need to look at the status quo. Most modern visual topological navigation systems work on a “teach-and-repeat” basis.

  1. Mapping: The robot is driven through an environment, saving a sequence of images.
  2. Navigation: To get from point A to point B, the robot looks at its current camera view and compares it to the next “subgoal” image from its memory.
  3. Control: A neural network predicts the velocity needed to make the current view look like the subgoal image.

This is called image-relative control. It works well for simple retracing of steps. However, it is brittle. Because images are tied strictly to the robot’s pose (where it is and where it’s looking) and embodiment (how tall it is), the system struggles if:

  • The robot is shorter or taller than the one that mapped the area.
  • The robot needs to take a shortcut that wasn’t in the original path.
  • The robot needs to navigate the path in reverse (images look totally different when you turn 180 degrees).
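To make this baseline concrete, here is a minimal sketch of the teach-and-repeat loop described above. It is illustrative Python, not any particular system's implementation; the controller, robot interface, and "close enough" check are hypothetical placeholders. The brittleness is visible in the structure itself: the map is literally a sequence of images tied to one viewpoint and one embodiment.

```python
from typing import Callable, List, Tuple
import numpy as np

def teach_and_repeat(
    map_images: List[np.ndarray],
    get_current_view: Callable[[], np.ndarray],
    controller: Callable[[np.ndarray, np.ndarray], Tuple[float, float]],
    is_close: Callable[[np.ndarray, np.ndarray], bool],
) -> None:
    """Skeleton of image-relative navigation (illustrative only).

    The map is a stored sequence of images; navigating means chasing the next
    stored image until the live view matches it, then moving on to the next one.
    """
    for subgoal_image in map_images:          # replay the taught sequence in order
        while not is_close(get_current_view(), subgoal_image):
            # "Predict the velocity that makes my current view look like the subgoal."
            v, omega = controller(get_current_view(), subgoal_image)
            send_velocity_command(v, omega)   # hypothetical robot interface

def send_velocity_command(v: float, omega: float) -> None:
    print(f"cmd_vel: v={v:.2f} m/s, omega={omega:.2f} rad/s")
```

If the live views never resemble the stored subgoals (shorter robot, reversed direction, untravelled shortcut), the inner loop has nothing to lock onto.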

The Solution: Object-Relative Navigation

The core insight of ObjectReact is simple but profound: Objects are properties of the map, not the robot.

A chair is a chair, whether you view it from 1.5 meters high or 0.5 meters high. A door is a door, whether you approach it from the left or the right. By anchoring navigation to objects rather than whole images, we can create a system that is invariant to the robot’s specific physical configuration or exact trajectory.

The researchers propose a pipeline that decouples the “what is this?” (perception) from the “how do I move?” (control).

The Four Challenging Tasks

To prove the necessity of this shift, the authors define four navigation tasks that break traditional image-based methods.

Figure 1: Tasks. Each column shows a top-down view with the prior experience (map) trajectory displayed as a purple path from the purple circle (start) to the green point (goal). The tasks are as follows: Imitate, which is akin to teach-and-repeat; Alt-Goal, where the goal object was previously seen but never visited; Shortcut, where the prior trajectory is made longer so the agent can take a shortcut during inference; and Reverse, where the agent travels in the opposite direction.

As shown in Figure 1 above:

  1. Imitate: The standard “teach-and-repeat.” Image-relative methods are good at this.
  2. Alt-Goal: The robot must reach an object seen previously, but never visited directly.
  3. Shortcut: The robot realizes the goal is close and skips a long loop it took during training.
  4. Reverse: The robot must navigate the mapped path backward.

Image-relative methods fail hard on the last three because the visual inputs during execution don’t match the stored sequence of images. ObjectReact aims to solve them all.

The ObjectReact Methodology

The ObjectReact pipeline consists of three main phases: Mapping, Execution (Planning), and Training (Control). Let’s break down the architecture.

Figure 2: Object-Relative Navigation Pipeline. a) Mapping: We construct a topometric map as a relative 3D scene graph, where image segments are used as object nodes, which are connected intra-image using 3D Euclidean distances and inter-image using object association. b) Execution: Given the map, we localize each of the query objects and compute its path to the goal node; we assign these path lengths to the object’s segmentation mask, forming a “WayObject Costmap” for control prediction. c) Training: We train a model to learn an “ObjectReact” controller that predicts trajectory rollouts from WayObject Costmaps.

1. Mapping: The Relative 3D Scene Graph

Instead of a dense metric map (like a heavy LiDAR point cloud) or a pure topological map (just a string of images), the authors build a Relative 3D Scene Graph (3DSG).

  • Nodes: These are objects detected in the video stream. The system uses foundation models such as SAM (Segment Anything Model) or FastSAM to extract segmentation masks (the shapes of objects).
  • Intra-image Edges (Inside a frame): The system estimates the relative 3D distance between objects in the same image. For example, “the chair is 1.5 meters from the table.” This creates a local geometric understanding.
  • Inter-image Edges (Between frames): As the robot moves, it tracks objects across frames using feature matching (SuperPoint/LightGlue). If the “red chair” in frame 1 is the same as the “red chair” in frame 2, they are connected with zero cost.

This graph structure allows the robot to understand how objects connect spatially without needing a precise global GPS.
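As a rough sketch of that data structure (using networkx; the node keys, attributes, and numbers below are illustrative assumptions, not the paper's exact schema): object segments become nodes, intra-image edges carry relative 3D distances, and inter-image edges that re-identify the same object across frames carry zero cost.

```python
import networkx as nx

# A minimal sketch of the relative 3D scene graph, assuming we already have
# per-frame object segments and pairwise relative-distance estimates.
G = nx.Graph()

# Nodes: one per object segment, keyed here by (frame, object) for readability.
G.add_node(("frame0", "chair"), mask_id=3)
G.add_node(("frame0", "table"), mask_id=7)
G.add_node(("frame1", "chair"), mask_id=1)

# Intra-image edge: relative 3D Euclidean distance between objects in the same frame.
G.add_edge(("frame0", "chair"), ("frame0", "table"), weight=1.5)

# Inter-image edge: the same physical object re-identified in the next frame
# (e.g. via SuperPoint/LightGlue matches) is connected with zero cost.
G.add_edge(("frame0", "chair"), ("frame1", "chair"), weight=0.0)

# Path lengths over this graph are what later become "WayObject" costs.
print(nx.shortest_path_length(G, ("frame0", "table"), ("frame1", "chair"), weight="weight"))
```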

2. Execution: Planning with “WayObjects”

When the robot needs to move, it doesn’t just look for a target image.

  1. Localize: It identifies objects in its current view.
  2. Match: It matches these current objects to nodes in its map.
  3. Plan: It runs a standard pathfinding algorithm (Dijkstra’s algorithm) on the graph.

Here is the twist: The planner calculates the distance to the goal for every single object currently visible.

If the robot sees a door, a plant, and a desk, the planner might say:

  • Door: 2 meters from goal.
  • Plant: 15 meters from goal.
  • Desk: 50 meters from goal.

The robot now knows that the Door is the most attractive “WayObject.”
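In graph terms, this is a single-source shortest-path query: run Dijkstra once, rooted at the goal node, and read off a cost for every object in the current view. A minimal sketch (the function and node names are hypothetical), reusing a networkx graph like the one above:

```python
import networkx as nx

def wayobject_costs(G, visible_nodes, goal_node):
    """Path length from every currently visible object node to the goal node.

    A single Dijkstra rooted at the goal yields all costs at once; objects
    with no path to the goal get an infinite (i.e. maximum) cost.
    """
    dist_from_goal = nx.single_source_dijkstra_path_length(G, goal_node, weight="weight")
    return {node: dist_from_goal.get(node, float("inf")) for node in visible_nodes}

# Hypothetical usage, matching the example above:
# costs = wayobject_costs(G, visible_nodes=["door", "plant", "desk"], goal_node="goal")
# -> {"door": 2.0, "plant": 15.0, "desk": 50.0}
```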

3. The Core Innovation: The WayObject Costmap

This is the most critical part of the paper. How do you feed “distance to goal” into a neural network controller? You could give it a list of numbers, but that loses spatial information.

The authors create a WayObject Costmap. They take the segmentation mask of the current view (the outlines of all visible objects) and fill each object’s shape with its calculated path length (cost).

  • Low Cost: Objects on or near the shortest path to the goal (rendered darker, or in a distinct color, in the visualizations).
  • High Cost: Objects far from the goal or leading in the wrong direction.
  • Outliers: Objects that aren’t in the map (new obstacles or dynamic objects) are assigned the maximum cost.

Instead of a simple scalar value, they encode these costs using a high-dimensional sine-cosine embedding, similar to Positional Encodings in Transformers.

Equation defining the WayObject Costmap and the sinusoidal encoding of path lengths.

In this equation:

  • \(\mathcal{W}\) is the costmap.
  • \(\mathbf{M}\) represents the binary masks of the objects.
  • \(\mathbf{E}\) is the positional encoding of the path length \(l\).

This turns a 2D image of objects into a rich, semantic “heatmap” of traversability and goal desirability.
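Here is a minimal sketch of how such a costmap can be assembled, assuming we already have the binary masks \(\mathbf{M}\) and per-object path lengths \(l\). The embedding dimension, frequency schedule, and maximum cost below mirror standard Transformer-style positional encodings and are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

def sincos_embedding(length, dim=8):
    """Sine-cosine encoding of a scalar path length, Transformer-style."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = length * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])  # shape (dim,)

def wayobject_costmap(masks, path_lengths, dim=8, max_cost=100.0):
    """Fill each object's binary mask with the encoding of its path-to-goal length.

    masks: list of (H, W) boolean arrays, one per visible object.
    path_lengths: path length to goal per object; unmatched objects get `max_cost`.
    Returns an (H, W, dim) array: a per-pixel embedding of "distance to goal".
    """
    H, W = masks[0].shape
    costmap = np.zeros((H, W, dim))
    for mask, length in zip(masks, path_lengths):
        cost = max_cost if not np.isfinite(length) else length
        costmap[mask] = sincos_embedding(cost, dim)
    return costmap

# Hypothetical example: one object near the goal, one far from it.
masks = [np.zeros((4, 4), bool), np.zeros((4, 4), bool)]
masks[0][:2, :] = True   # e.g. the door, 2 m from the goal
masks[1][2:, :] = True   # e.g. the desk, 50 m from the goal
print(wayobject_costmap(masks, [2.0, 50.0]).shape)  # (4, 4, 8)
```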

4. The Controller: Learning to React

Finally, the ObjectReact controller is a neural network trained to look at this WayObject Costmap and predict velocity (linear and angular).

Crucially, the controller does NOT take the RGB image as input. It only looks at the Costmap.

Why? If you feed the RGB image, the network might overfit to visual textures (“drive towards the brown carpet”). By forcing it to use the Costmap, the network learns a general rule: “steer towards low-cost regions, avoid high-cost regions.” This is what allows it to generalize to new environments and embodiments.
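A hedged sketch of what such a controller might look like in PyTorch (the architecture and dimensions are assumptions; the one essential point, which the paper does specify, is that the input is the WayObject Costmap rather than the RGB image):

```python
import torch
import torch.nn as nn

class ObjectReactStyleController(nn.Module):
    """Illustrative controller: WayObject Costmap in, velocity command out.

    Note the input channels are the costmap's embedding dimension, not RGB.
    """
    def __init__(self, cost_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(cost_dim, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 2),  # linear and angular velocity
        )

    def forward(self, costmap):
        # costmap: (B, cost_dim, H, W) -- no appearance information at all.
        return self.net(costmap)

controller = ObjectReactStyleController()
velocity = controller(torch.rand(1, 8, 96, 96))
```

Swapping the segmentation backbone or the robot body changes how the costmap is produced, but not this controller's input contract.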

Experiments and Results

The authors evaluated ObjectReact using the Habitat-Matterport 3D (HM3D) dataset, comparing it against GNM (a state-of-the-art image-relative method).

Performance on Challenging Tasks

Table 1 reveals the stark difference in capability.

Table 1: Comparing image-relative and object-relative controllers across four navigation tasks.

  • Imitate: Both methods perform well (~58-59% SPL). This is expected; standard methods are designed for this.
  • Alt-Goal: ObjectReact dominates (21.74 vs 2.17 SPL). The image-relative method fails because it never stored an image “looking at” the alternative goal. ObjectReact just calculates a new path on the graph.
  • Shortcut: ObjectReact sees a nearby object with a low graph cost and takes the shortcut. The image-based method blindly follows the long recorded loop.
  • Reverse: ObjectReact wins again. Objects (like a sofa) can be recognized from the back, allowing the graph to orient the robot even when traveling backward.

The Embodiment Test (Robot Height)

This is arguably the most practical win for real-world robotics. The authors trained the map using a sensor height of 1.3m (like a human or tall robot). They then tested the robot with a sensor at 0.4m (like a robot dog).

Table 2: Effect of embodiment (height) variations during execution, for a fixed map height of 1.3 m.

Looking at Table 2:

  • Image Relative: Performance collapses from 81.82 SPL to 33.33 SPL when the height changes. The pictures just don’t match anymore.
  • Object Relative: Performance is stable (actually slightly better at 60.60 SPL vs 57.56 SPL).

Because the objects (the nodes in the graph) are the same regardless of height, the WayObject Costmap remains consistent. The controller simply sees “low cost object on the left” and turns left, regardless of whether the camera is high or low.

Real-World Deployment

The authors deployed this on a Unitree Go1 robot dog. Note in Figure 4 how the Costmap (middle row) guides the robot.

Figure 4: Real-world Experiments. We deploy our approach on the Unitree Go1 robot dog. Here, we show egocentric RGB images, their corresponding WayObject Costmaps, and the predicted trajectory rollout at several timesteps during autonomous navigation to the goal object.

In the highlighted sequence:

  1. t=5s: The robot identifies a region of lower-cost objects to the left and turns.
  2. t=20s: It encounters an obstacle. Because the obstacle isn’t in the map (or is matched to a high-cost node), it appears as a “high cost” zone. The robot naturally navigates around the “low cost” visible space.

This behavior emerges without explicitly training for obstacle avoidance—it’s just “reacting” to the costmap.

Implications and Conclusion

The ObjectReact paper presents a compelling argument for moving away from purely image-based visual navigation. By introducing the Relative 3D Scene Graph and the WayObject Costmap, the authors have created a system that is:

  1. Trajectory Invariant: It doesn’t need to follow the exact path of the training run (Shortcuts).
  2. Embodiment Invariant: It works even if the camera height changes drastically.
  3. Modular: The control problem is decoupled from the vision problem. You can upgrade the object detector (e.g., to a better version of SAM) without retraining the navigation controller.

Why This Matters for Students

For students entering robotics or computer vision, this paper illustrates the importance of representation.

  • Deep Learning isn’t just about throwing raw pixels into a CNN.
  • How you structure your data (Graphs vs. Sequences) and how you represent your inputs (Costmaps vs. RGB) defines the ceiling of your system’s intelligence.

ObjectReact suggests that for robots to truly understand navigation, they shouldn’t just look at the world; they should understand the things in it.