Introduction

Imagine you are thirsty. You decide to reach for a cup of coffee sitting on your desk. What happens first? Before your arm muscles even engage, your eyes move. You scan the desk, saccade toward the cup to lock in its position, and then guide your hand toward it. Once you grasp the cup, your eyes might immediately dart to the coaster where you plan to place it.

This sequence feels instantaneous, but it reveals a fundamental truth about biological intelligence: we do not passively absorb the visual world like a video camera; we actively look in order to act.

In robotics, however, vision has traditionally been much more passive. Robots usually rely on fixed cameras mounted on stands (which suffer from occlusion or low resolution) or cameras mounted on their wrists (which lose sight of the world as soon as the hand moves).

In a fascinating new paper titled “Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop,” researchers from UC Berkeley introduce a robotic system that breaks this mold. They have built a robot with a mechanical, moving eye that learns—without being explicitly told how—to look around its environment to help its hand complete tasks.

EyeRobot Overview: A robot arm and a moving mechanical eye working together to find a towel and place it in a bucket.

As shown in Figure 1, the system, dubbed “EyeRobot,” learns to perform complex tasks like finding a towel (which isn’t even in the initial view), picking it up, and then looking for a bucket to place it in. This behavior isn’t hard-coded. It emerges naturally through a clever combination of Imitation Learning and Reinforcement Learning.

In this post, we will tear down the EyeRobot system, exploring how 360° video data, a foveated vision architecture, and a novel training loop allow a robot to master active vision.

Background: The Problem with Passive Vision

To understand why EyeRobot is significant, we have to look at the limitations of standard robotic perception.

  1. Fixed “Exo” Cameras: A camera on a tripod provides a stable view of the workspace. However, if the workspace is large, the camera must be placed far back, reducing resolution. If the robot reaches behind an object, the view is blocked (occlusion).
  2. Wrist Cameras: Mounting a camera on the robot’s hand allows for high-precision grasping. But there is a catch: the camera goes wherever the hand goes. If the robot needs to pick up an object on the left and place it on the right, the wrist camera cannot locate the drop-off zone until the hand actually moves there.

Active Vision is the concept that an agent should control its sensors to gather the information it needs. As J.J. Gibson famously noted, “We perceive in order to act and we act in order to perceive.”

EyeRobot implements this by decoupling the “eye” from the “hand.” The eye is a camera on a gimbal (a motorized mount) that can rotate freely. The challenge, however, is training it. How do you teach an eye where to look?

The Core Method: EyeRobot

The researchers propose a system where the gaze policy (where to look) and the manipulation policy (how to move the arm) are trained together.

1. The Hardware: A Mechanical Eyeball

The physical setup mimics biological constraints. The team mounted a global-shutter RGB camera with a fisheye lens onto a high-speed gimbal. This “eyeball” sits rigidly near the base of the robot arm. It has two degrees of freedom (pan and tilt), allowing it to scan the environment rapidly.

EyeRobot Hardware and Framework. Left: The physical mechanical eye. Right: The training loop diagram.

As seen on the left side of Figure 2, the setup is relatively simple hardware-wise, but the magic lies in the software and training methodology.
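For concreteness, here is a minimal Python sketch of how a two-degree-of-freedom gaze command could be represented and clipped to the gimbal’s mechanical range. The limit values and function name are illustrative assumptions, not the paper’s actual hardware specification:

```python
import numpy as np

# Illustrative (assumed) mechanical limits for a pan/tilt gimbal, in radians.
PAN_LIMIT = np.radians(170.0)   # rotation left/right
TILT_LIMIT = np.radians(60.0)   # rotation up/down

def clamp_gaze(pan: float, tilt: float) -> tuple[float, float]:
    """Clip a requested gaze direction to the gimbal's range of motion."""
    return (float(np.clip(pan, -PAN_LIMIT, PAN_LIMIT)),
            float(np.clip(tilt, -TILT_LIMIT, TILT_LIMIT)))
```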

2. Scalable Data Collection with EyeGym

Training a reinforcement learning (RL) agent usually requires a simulation environment where the agent can try millions of actions. Building a photorealistic 3D simulation of the real world (using engines like Unity or MuJoCo) is difficult and often suffers from the “sim-to-real gap”—where things learned in the simulator don’t work in the real world because the physics or lighting aren’t perfect matches.

The authors devised a brilliant workaround called EyeGym.

Instead of building a 3D model of the lab, they recorded the real world using a 360° camera (specifically, an Insta360 X4). They teleoperated the robot to perform tasks while recording the entire panoramic scene.

Data Collection Setup. A large robot arm and a 360 camera are used to capture demonstrations.

Figure 5 shows the setup. By capturing the full 360° sphere of visual data during a demonstration, they effectively captured “every possible place the robot could have looked” at any moment.

This data is then imported into EyeGym. During training, the “eye” agent chooses a viewing angle (azimuth and elevation). EyeGym simply crops the corresponding section from the recorded 360° video.

EyeGym Visualization. Shows how the system simulates looking around by cropping 360-degree video data.

As Figure 3 illustrates, this creates a “Real-to-Sim” environment. The robot “moves” its eye in simulation, but the images it sees are real photographs. This allows the system to train on authentic textures and lighting without the overhead of 3D rendering.
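To make the cropping idea concrete, the sketch below shows one common way to render a pinhole-style view from an equirectangular 360° frame given an azimuth/elevation gaze command. It is a minimal reconstruction of the concept, not the paper’s actual EyeGym code; the function name, field of view, and output size are assumptions:

```python
import numpy as np

def crop_from_equirect(pano, yaw, pitch, fov_deg=90.0, out_hw=(224, 224)):
    """Render a pinhole-camera view from an equirectangular 360 frame.

    pano:  (H, W, 3) equirectangular image
    yaw:   gaze azimuth in radians (left/right)
    pitch: gaze elevation in radians (up/down)
    """
    H, W, _ = pano.shape
    out_h, out_w = out_hw
    f = 0.5 * out_w / np.tan(0.5 * np.radians(fov_deg))  # focal length in pixels

    # Pixel grid -> camera-frame ray directions (z forward, x right, y down).
    u, v = np.meshgrid(np.arange(out_w), np.arange(out_h))
    x = (u - 0.5 * out_w) / f
    y = (v - 0.5 * out_h) / f
    dirs = np.stack([x, y, np.ones_like(x)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate the rays by the gaze command (pitch about x, then yaw about y).
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    dirs = dirs @ (Ry @ Rx).T

    # Ray direction -> (longitude, latitude) -> pixel coordinates in the panorama.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])          # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))     # [-pi/2, pi/2]
    px = ((lon / np.pi + 1.0) * 0.5 * (W - 1)).astype(int)
    py = ((lat / (0.5 * np.pi) + 1.0) * 0.5 * (H - 1)).astype(int)
    return pano[py, px]
```

Because the crop is just a lookup into a recorded panorama, “moving the eye” costs almost nothing at training time, which is what makes large-scale RL over real footage practical.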

3. The BC-RL Perception-Action Loop

This is the heart of the paper. The goal is to train two agents:

  1. The Hand (Arm) Agent: Needs to manipulate objects.
  2. The Eye Agent: Needs to provide the best view for the Hand.

The researchers use a co-training loop combining Behavior Cloning (BC) and Reinforcement Learning (RL).

The Loop (refer back to Figure 2, Right Panel):

  1. The Eye Looks: The RL Eye Policy observes the history and decides on a gaze direction (an action).
  2. The View is Rendered: EyeGym crops the 360° video based on that gaze.
  3. The Hand Predicts: The BC Arm Policy takes this cropped image and tries to predict the correct robot arm movement (the “Action Chunk”) to match the human demonstration.
  4. The Reward: This is the key. The Eye Agent is rewarded based on how accurately the Arm Agent predicted the action.

If the eye looks at a blank wall, the arm has no visual context and will likely predict the wrong movement. The Eye Agent gets a low reward. If the eye looks directly at the target object, the arm’s prediction accuracy improves, and the Eye Agent gets a high reward.

Thus, gaze emerges from the need to act. The eye learns that to get a reward, it must look at the things that matter to the hand.
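A minimal sketch of that reward signal is shown below. The interfaces (`eye_policy`, `arm_policy`, `eyegym.render`, and the exponential reward shaping) are hypothetical stand-ins for the idea described above, not the paper’s exact implementation:

```python
import numpy as np

def bc_rl_step(eye_policy, arm_policy, eyegym, demo, history):
    """One interaction of the perception-action loop (hypothetical interfaces).

    eye_policy : observation history -> gaze command (azimuth, elevation)
    arm_policy : co-trained BC policy, (image, proprio) -> action chunk
    eyegym     : real-to-sim environment backed by a 360-degree recording
    demo       : the human demonstration at the current timestep
    """
    gaze = eye_policy(history)                    # 1. the eye looks
    view = eyegym.render(gaze)                    # 2. EyeGym crops the 360 video
    pred_chunk = arm_policy(view, demo.proprio)   # 3. the arm predicts an action chunk
    bc_error = np.mean(np.abs(pred_chunk - demo.action_chunk))
    reward = np.exp(-bc_error)                    # 4. low BC error -> high eye reward
    # The eye policy is updated with RL on `reward`;
    # the arm policy is updated with the supervised BC loss derived from `bc_error`.
    return reward, bc_error
```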

4. Foveal Robot Transformer (FoRT)

Biological eyes do not see everything in high resolution. We have a “fovea”—a small center area of high acuity—and a lower-resolution periphery. This is computationally efficient.

The authors replicated this using the Foveal Robot Transformer (FoRT).

Architecture of the Foveal Robot Transformer (FoRT). Visual inputs are processed at multiple scales.

As shown in Figure 4, the architecture processes observations in a foveated manner:

  • Multi-Scale Crops: The system takes the current view and creates a pyramid of crops (zoomed in, medium, zoomed out).
  • Tokenization: These crops are fed into a Transformer network along with proprioception (joint angles) and gaze direction.

This architecture allows the robot to maintain a broad awareness of the scene (periphery) while focusing detail where it’s needed (fovea). As we will see in the results, this is crucial for tracking objects and ignoring distractors.
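As a rough illustration, the sketch below builds such a multi-scale stack of crops from a single frame, resizing every level to the same patch resolution before tokenization. The crop fractions and patch size are assumed values, and the nearest-neighbour resize is only there to keep the example dependency-free:

```python
import numpy as np

def foveal_pyramid(frame, levels=(1.0, 0.5, 0.25), patch=224):
    """Build a periphery-to-fovea stack of crops, all resized to `patch` pixels.

    `levels` are the fractions of the frame covered by each crop (assumed values):
    1.0 is the full view (periphery), 0.25 a tight central crop (fovea).
    """
    H, W, _ = frame.shape
    crops = []
    for frac in levels:
        ch, cw = int(H * frac), int(W * frac)
        y0, x0 = (H - ch) // 2, (W - cw) // 2
        crop = frame[y0:y0 + ch, x0:x0 + cw]
        # Nearest-neighbour resize to a fixed token resolution; a real pipeline
        # would use a proper image-resize routine.
        ys = np.linspace(0, ch - 1, patch).astype(int)
        xs = np.linspace(0, cw - 1, patch).astype(int)
        crops.append(crop[ys][:, xs])
    return np.stack(crops)   # (num_levels, patch, patch, 3), ready for tokenization
```

Each level costs the same number of tokens, so the model pays a fixed compute budget regardless of how wide or narrow its attention needs to be at that moment.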

Experiments & Results

The team evaluated EyeRobot on 5 challenging tasks in a large, panoramic workspace. These tasks required the robot to interact with objects across a 210° arc—far wider than a standard static camera could cover effectively.

The 5 Evaluation Tasks: Eraser, E-Stop, Screwdriver, Towel, and Brush Handoff.

Figure 6 displays the tasks, which include:

  • Eraser: Picking an eraser and placing it on a shelf.
  • E-Stop: Reaching for an emergency stop button that moves.
  • Screwdriver: Aligning a tool tip to a target.
  • Towel: The “Towel in Bucket” task (seen in the intro).
  • Brush Handoff: Handing a brush to a human.

Performance vs. Conventional Cameras

The researchers compared EyeRobot against standard setups: an Exo (external fixed) camera and a Wrist camera.

Comparison of views from External, Eyeball, and Wrist cameras.

Figure 8 highlights the visual difference. The external camera (top right) is too zoomed out to see fine details. The wrist camera (bottom right) is too focused and misses the context. The Gimbal Eye (bottom left) offers high resolution and flexibility.

The quantitative results were stark.

Table 1: Comparison of success rates between EyeRobot, Exo, and Wrist cameras.

Table 1 shows that EyeRobot achieved 100% success on the E-Stop task, whereas the wrist camera struggled (80%) and had higher spatial error. On the Eraser task, specifically when the placement location was perturbed (moved), the Wrist camera failed almost completely (10% success) because it couldn’t look around to find the new shelf location while holding the eraser. EyeRobot maintained 100% success.

Emergent Behaviors

The most exciting part of this research is observing what the eye learned to do. Remember, the researchers never programmed “look at the bucket.”

Emergent Eye Behaviors: Target Switching, Fixation, and Independent Tracking.

Figure 7 illustrates three distinct behaviors that emerged purely from the BC-RL loop:

  1. Target Switching (Left): In the towel-bucket task, the eye watches the towel while the hand grasps it. Once the grasp is secure, the eye automatically snaps its gaze to the bucket to guide the placement. It understands the sequence of the task.
  2. Fixation (Middle): The eye learns to keep the target object centered in its fovea. This stabilizes the view for the arm agent.
  3. Independent Tracking (Right): If an object moves (like the E-Stop button), the eye tracks it. Crucially, the eye can track the object independently of the hand’s position, something a wrist camera cannot do.

The Importance of Foveation

Is the fancy “foveated” transformer architecture actually necessary? The authors performed an ablation study to find out.

Table 2: Ablation study showing the impact of foveation on error and speed.

Table 2 compares EyeRobot (with foveation) to a model using uniform resolution (No Foveal). The results show that foveation leads to lower error (4.4cm vs 6.4cm) and faster task completion (5.0s vs 6.2s).

Why? The multi-scale visual tokens help the model distinguish between the target and background distractors. In one video demonstration, when a human waved a yellow cup (distractor) near the target, the non-foveated model got confused and looked at the cup. The foveated model ignored it and stayed locked on the target.

Conclusion

“Eye, Robot” represents a significant step toward more biological, embodied robotic perception. By moving away from static camera inputs and embracing active vision, the system overcomes the trade-off between field-of-view and resolution.

The key takeaways are:

  1. Perception serves Action: The eye policy was trained solely to maximize the success of the hand, leading to intelligent, emergent behaviors like look-ahead search and object tracking.
  2. Sim-to-Real via Video: Using cropped 360° video (EyeGym) is a highly effective, scalable way to train active vision agents without building complex 3D renders.
  3. Foveation Matters: Processing images at multiple scales (fovea + periphery) improves tracking stability and robustness against distractors.

While the system currently lacks motion parallax (since the head doesn’t move side-to-side, only rotates), this work lays the foundation for future mobile robots that don’t just see the world, but actively watch it to perform their work better.