Introduction
Imagine teaching a robot to navigate a crowded office. In the past, this was a modular problem: one piece of software built a map, another located the robot on that map, and a third calculated a path. Today, the cutting edge of Embodied AI uses “end-to-end” (E2E) reinforcement learning: you feed the robot visual data (pixels), and it outputs motor commands (actions). It’s a “black box” approach that has yielded impressive results in simulation.
But there is a catch. Most simulations treat robot movement like a video game character: precise, instant, and frictionless. If the algorithm says “stop,” the agent stops instantly. In the real world, however, physics gets in the way. Real robots have mass; they drift, skid, and take time to accelerate or brake. When you take a brain trained in a “perfect” simulation and put it in a messy physical body, it often fails—a problem known as the Sim2Real gap.
The research paper we are discussing today, Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach, tackles this problem head-on. The researchers didn’t just want to build a better robot; they wanted to open the “black box” and understand what these neural networks are actually learning.
Does an AI agent simply memorize “if I see a wall, turn left”? Or does it unknowingly become a physicist, learning an internal model of inertia and momentum? Through a massive study involving 262 navigation episodes on real robots, this paper reveals that these agents are doing something remarkably sophisticated: they are learning to predict the future and correct it with their senses, much like a classical control system, but learned entirely from scratch.

Background: The Evolution of Visual Navigation
To appreciate the findings of this paper, we need to understand the two main schools of thought in robotic navigation.
1. The Classical Robotics Approach: Traditionally, navigation is treated as a geometry problem. The robot uses sensors (LiDAR or cameras) to build a map (SLAM), localizes itself within that map, and then uses a path-planning algorithm (like A*) to find a route. Finally, a low-level controller (like a PID controller) tries to execute that path. This is robust but requires heavy engineering and often struggles with semantic “understanding” of a scene.
2. The Embodied AI Approach (RL): Here, an agent is trained via Reinforcement Learning (RL). It explores an environment millions of times, getting “rewards” for reaching a goal. The agent learns a policy—a function that maps observations directly to actions.
The Missing Link: Dynamics
Historically, RL agents were trained in simulators where movement was abstract—often just teleporting the agent 25cm forward per step. Consequently, when deployed on real robots, these agents were slow and jerky because they had to stop completely between steps to realign.
Recent advancements have integrated dynamical models into simulators. Instead of teleporting, the simulator calculates friction, acceleration, and response time. The researchers built upon previous work (Bono et al.) which showed that training with these realistic physics is key to Sim2Real transfer.
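To make that difference concrete, here is a minimal sketch (plain Python; the parameter names `response_time` and `damping` and their values are illustrative, not taken from the paper) contrasting an “instant” step with a dynamics-aware step:

```python
import numpy as np

def instant_step(position, heading, step_distance=0.25):
    """'Teleport' step: the agent moves exactly 25 cm in its heading direction, instantly."""
    return position + step_distance * np.array([np.cos(heading), np.sin(heading)])

def dynamics_step(position, velocity, commanded_velocity, dt=0.05,
                  response_time=0.3, damping=0.1):
    """Dynamics-aware step: velocity lags behind the command (response time)
    and decays over time (damping), so the robot drifts and takes time to brake."""
    velocity = velocity + (commanded_velocity - velocity) * (dt / response_time)  # first-order lag
    velocity = velocity * (1.0 - damping * dt)                                    # friction / drag
    position = position + velocity * dt
    return position, velocity
```

An agent trained only against `instant_step` never has to learn about momentum; an agent trained against something like `dynamics_step` does.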

As shown in Figure 2, simply by improving the engineering (training longer, using better data augmentation, and incorporating realistic dynamics), the researchers achieved a 92.5% success rate in real-world tests, a massive jump over previous methods. But the performance isn’t the main story here—the analysis is.
Core Method: Dissecting the Agent
The researchers set out to probe the internal state of the trained agent. They wanted to know if the agent was effectively learning a Dynamical System.
In control theory, a dynamical system often uses a process called Prediction-Correction (similar to a Kalman Filter):
- Prediction (Open-Loop): Use an internal model of physics to guess where you will be next based on your current speed and action.
- Correction (Closed-Loop): Use your sensors (eyes/cameras) to see where you actually are and fix the error in your guess.
The hypothesis was that an end-to-end RL agent, forced to deal with momentum and drift, would naturally reinvent this process inside its neural network layers.
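As a point of reference, this is what a hand-written prediction-correction loop looks like for a toy 1-D position estimate. It is a Kalman-style filter written from scratch, not part of the paper’s agent; it just makes the classical pattern explicit:

```python
def predict(x_est, p_var, velocity, dt, process_var):
    """Open-loop prediction: propagate the estimate with the motion model."""
    x_pred = x_est + velocity * dt      # where we think we will be
    p_pred = p_var + process_var        # uncertainty grows without a new observation
    return x_pred, p_pred

def correct(x_pred, p_pred, z, obs_var):
    """Closed-loop correction: blend the prediction with a sensor reading z."""
    gain = p_pred / (p_pred + obs_var)  # how much to trust the sensor vs. the prediction
    x_est = x_pred + gain * (z - x_pred)
    p_var = (1.0 - gain) * p_pred
    return x_est, p_var

# One step of the loop: predict forward, then correct with the latest measurement.
# x, p = predict(x, p, velocity=0.5, dt=0.05, process_var=1e-3)
# x, p = correct(x, p, z=sensor_reading, obs_var=1e-2)
```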
The Architecture
The agent uses a Recurrent Neural Network (specifically a GRU - Gated Recurrent Unit). It takes in several inputs at every time step \(t\):
- Visuals (\(I_t\)): RGB images processed by a ResNet.
- Depth (\(S_t\)): Lidar-like range data processed by a 1D-CNN.
- Goal (\(g_0\)): Where it needs to go.
- Odometry (\(\hat{p}_t\)): The robot’s estimated position from wheel encoders.
The core equation governing the agent’s “memory” or hidden state (\(h_t\)) is a recurrent update that folds the encoded inputs into the previous state. Schematically:

\[
h_t = \mathrm{GRU}\!\left(h_{t-1},\; \big[\, e(I_t),\; e(S_t),\; e(g_0),\; e(\hat{p}_t) \,\big]\right)
\]

where each \(e(\cdot)\) denotes the corresponding modality encoder.
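A rough PyTorch-style sketch of this kind of architecture is shown below; the layer sizes, the ResNet-18 backbone, and the discrete action head are illustrative stand-ins rather than the paper’s exact configuration:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class NavAgent(nn.Module):
    def __init__(self, hidden_size=512, num_actions=4):
        super().__init__()
        self.visual_enc = models.resnet18(weights=None)              # RGB encoder
        self.visual_enc.fc = nn.Linear(512, 256)
        self.scan_enc = nn.Sequential(                               # 1D-CNN over the range scan
            nn.Conv1d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 64),
        )
        self.goal_enc = nn.Linear(2, 32)                             # goal vector
        self.odom_enc = nn.Linear(3, 32)                             # estimated (x, y, heading)
        self.gru = nn.GRUCell(256 + 64 + 32 + 32, hidden_size)
        self.policy = nn.Linear(hidden_size, num_actions)            # actor head
        self.value = nn.Linear(hidden_size, 1)                       # critic head

    def forward(self, image, scan, goal, odom, h_prev):
        x = torch.cat([
            self.visual_enc(image), self.scan_enc(scan),
            self.goal_enc(goal), self.odom_enc(odom),
        ], dim=-1)
        h_t = self.gru(x, h_prev)        # recurrent "memory" update
        return self.policy(h_t), self.value(h_t), h_t
```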
The researchers created two versions of this agent to compare:
- D28-instant: Trained in a simulation with “magic” instant movement (no inertia).
- D28-dynamics: Trained in a simulation with realistic physics (inertia, acceleration, drag).
The “Distance to Belief” Metric
To prove the agent learns physics, the researchers needed to test how sensitive it was to changes in the environment. If the robot enters a room with a slippery floor (changing the friction/damping), does it crash?
Comparing these changes is hard. How do you compare a 10% change in friction to a 10% increase in sensor noise? They introduced a novel metric called Distance to Belief (\(D_{belief}\)).

As illustrated in Figure 13, \(D_{belief}\) measures the physical discrepancy between a trajectory in the “perfect” training world and a “corrupted” world (e.g., one with more drag), assuming the robot took the exact same actions. This allows the researchers to normalize different types of disturbances.
By plotting the agent’s success rate against this \(D_{belief}\), they could see which agents were robust and which were fragile.
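One schematic way to compute such a discrepancy, using a placeholder dynamics step function and a simple average positional error (the paper’s exact formula may aggregate the trajectories differently):

```python
import numpy as np

def rollout(actions, step_fn, state0):
    """Replay a fixed action sequence through a dynamics model and return positions.
    step_fn(state, action) is assumed to return a fresh state dict with a 'position' array."""
    state, positions = dict(state0), []
    for a in actions:
        state = step_fn(state, a)
        positions.append(np.asarray(state["position"]))
    return np.stack(positions)

def distance_to_belief(actions, nominal_step, corrupted_step, state0):
    """Average positional gap between the 'believed' (training-time) dynamics and the
    corrupted dynamics when the robot replays the exact same action sequence."""
    p_nominal = rollout(actions, nominal_step, state0)
    p_corrupt = rollout(actions, corrupted_step, state0)
    return float(np.mean(np.linalg.norm(p_nominal - p_corrupt, axis=-1)))
```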

Figure 4 reveals a crucial insight:
- Left Graph (D28-dynamics): The curves for damping (blue) and response time (orange) stay high even as the physical properties change. This means the agent has “learned” generalized physics and can adapt. However, the purple line drops fast—meaning it relies heavily on its odometry (prediction) to navigate.
- Right Graph (D28-instant): The agent trained without physics falls apart quickly. It overfitted to the “teleportation” movement of the simulation.
Experiments & Results
The team conducted extensive probing to see what information was stored in the agent’s hidden memory vector (\(h_t\)).
1. Does the Agent Predict the Future?
If the agent has learned an internal physics model, it should be able to predict where it will be in the future, even without new visual input.
The researchers trained a “probe” (a small separate neural network) to look at the agent’s frozen memory state at time \(t\) and try to guess the robot’s position at time \(t+20\).

The results in Figure 7 are striking. The red dots (predictions based only on the agent’s memory) closely follow the black line (the actual future trajectory). This confirms that the agent effectively hallucinates a short-term future path, proving it has learned a latent dynamical model.
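A minimal version of such a probe: collect detached hidden states \(h_t\) paired with the ground-truth position 20 steps later, then regress with a small MLP. The probe width and training schedule below are illustrative choices, not the paper’s:

```python
import torch
import torch.nn as nn

class FuturePositionProbe(nn.Module):
    """Small MLP that reads the frozen hidden state h_t and predicts
    the robot's (x, y) position 20 steps into the future."""
    def __init__(self, hidden_size=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, h_t):
        return self.net(h_t)

def train_probe(probe, hidden_states, future_positions, epochs=50, lr=1e-3):
    """hidden_states: (N, hidden_size) detached h_t vectors collected from rollouts.
    future_positions: (N, 2) ground-truth positions at t+20."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(probe(hidden_states), future_positions)
        loss.backward()
        opt.step()
    return probe
```

Crucially, the agent itself stays frozen: only the probe is trained, so good predictions mean the information was already present in \(h_t\).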
2. Does the Agent Map the Room?
The agent isn’t given a map. But does it build one in its head? The researchers again used a probe to see if they could reconstruct an occupancy map (a top-down view of walls and free space) from the hidden vector.

Figure 9 shows that the agent’s memory implicitly stores the geometry of the room. The reconstructed maps (bottom row) align very well with reality, even capturing details like doorways that are critical for navigation.
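The map probe follows the same recipe, swapping the regression head for a small decoder that upsamples \(h_t\) into an occupancy grid. The 64×64 resolution and layer sizes here are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class OccupancyProbe(nn.Module):
    """Decode a local top-down occupancy grid (64x64, free vs. occupied)
    from the agent's frozen hidden state."""
    def __init__(self, hidden_size=512):
        super().__init__()
        self.fc = nn.Linear(hidden_size, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),               # 64x64 logits
        )

    def forward(self, h_t):
        x = self.fc(h_t).view(-1, 128, 8, 8)
        return self.deconv(x)  # train with BCEWithLogitsLoss against the true grid
```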
3. Does the Agent “Plan”?
Planning implies looking far ahead, weighing options, and making a choice. To test this, the researchers analyzed the Value Function (critic) of the RL agent. In Reinforcement Learning, the “value” is the agent’s estimation of how much reward it will get in the future.
If the agent is “planning,” we should see the value drop when it realizes a path is blocked and spike when it finds a new solution.

Figure 8 tells a fascinating story of a single episode:
- Point 3: The robot tries a path, but it’s blocked. The value (blue line) drops.
- Point 4: It tries to go North. Blocked by glass. Value drops further (goes negative!).
- Point 5: The robot “decides” to abandon this route and try a completely different door. The value spikes immediately. It hasn’t reached the goal yet, but it anticipates success.
This behavior suggests the agent isn’t just reacting to the pixel in front of it; it is maintaining a high-level plan and updating its confidence based on geometric reasoning.
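Reproducing this kind of analysis mostly amounts to logging the critic output at every step of an episode. A sketch, assuming the agent interface from the earlier architecture snippet and a simplified, hypothetical `env` API:

```python
import torch

@torch.no_grad()
def log_values(agent, env, hidden_size=512, max_steps=500):
    """Run one episode and record the critic's value estimate at each step;
    dips and spikes in this trace are what the per-episode analysis visualizes."""
    obs = env.reset()                      # hypothetical wrapper returning dict observations
    h = torch.zeros(1, hidden_size)        # initial hidden state
    values = []
    for _ in range(max_steps):
        logits, value, h = agent(obs["rgb"], obs["scan"], obs["goal"], obs["odom"], h)
        values.append(value.item())
        obs, done = env.step(logits.argmax(dim=-1))
        if done:
            break
    return values
```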
4. The Limits: Tunnel Vision
Despite these capabilities, the agent isn’t a perfect planner. The study found evidence of “tunnel vision.” Because the agent relies on its internal memory and visual inputs, it sometimes commits to a path that a human (or a global map planner) would instantly recognize as a dead end.

As seen in Figure 15, the agent sometimes struggles with long-horizon geometric reasoning: it gets stuck trying to navigate through obstacles that are clearly impassable, highlighting a limitation in its “common sense” reasoning capabilities.
Robustness and Adaptation
One of the coolest additions to the paper is the use of Rapid Motor Adaptation (RMA). Since the researchers identified that changes in physics (like a heavier robot or slippery floor) impact performance, they tried to make the agent adaptive.
They trained a version of the agent that estimates the environmental parameters on the fly and adjusts its policy.
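In the original RMA recipe (Kumar et al., 2021), a privileged encoder compresses the true physics parameters into a latent vector during simulation training, and an adaptation module then learns to estimate that latent from recent state-action history so it can run on the real robot without privileged information. A schematic sketch, with dimensions and module names chosen for illustration:

```python
import torch
import torch.nn as nn

class EnvEncoder(nn.Module):
    """Phase 1 (simulation only): compress privileged physics parameters
    (e.g. mass, damping, response time) into a latent z."""
    def __init__(self, num_params=4, z_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_params, 32), nn.ReLU(), nn.Linear(32, z_dim))

    def forward(self, env_params):
        return self.net(env_params)

class AdaptationModule(nn.Module):
    """Phase 2: estimate z from a short history of states and actions,
    so no privileged information is needed at deployment."""
    def __init__(self, obs_act_dim=16, history=20, z_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(obs_act_dim * history, 64), nn.ReLU(), nn.Linear(64, z_dim))

    def forward(self, history):            # history: (B, history, obs_act_dim)
        return self.net(history)

# Phase-2 training is supervised regression onto the privileged latent, e.g.:
# loss = nn.functional.mse_loss(adaptation(history_batch), env_encoder(params_batch).detach())
```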

Figure 6 shows that when the environment is corrupted (e.g., increased damping or response time), the standard agent (red) fails. The adaptive agent (blue) recovers almost all the lost performance, proving that these black-box agents can be trained to dynamically adjust to the physical world.
Conclusion & Implications
This paper provides a comprehensive look “under the hood” of end-to-end navigation agents. The key takeaways for students and roboticists are:
- Physics Matters: You cannot train a robust real-world robot in a simulation that ignores inertia and momentum.
- Emergent Structures: We don’t need to explicitly program a Kalman Filter or a Mapping system. If we train an RNN with the right data and physics, these structures emerge naturally within the network weights.
- Prediction-Correction: The agent learns to trust its internal prediction of movement and corrects it with visual data, balancing these two sources of information just like a classical control system.
- Short-term vs. Long-term: While the agent is a master of short-term dynamics and local mapping, it still struggles with long-term geometric planning (tunnel vision).
This research bridges the gap between Classical Robotics and Modern AI. It suggests that the future of robotics isn’t about choosing between “programming physics” or “learning everything,” but rather setting up learning environments where the agent is forced to discover physics on its own.
By understanding what the robot learns, we can design better simulators, better architectures, and ultimately, robots that move through our world as naturally as we do.