Introduction: The Bull in the China Shop Problem

Imagine asking a robot to grab a specific bottle of soda from a densely packed refrigerator. For a human, this is trivial. We instinctively know how to reach in, avoid knocking over the yogurt, and infer where the back of the bottle is even if we can’t see it. For a robot, however, this “cluttered scene” scenario is a nightmare.

Densely packed objects create severe occlusions. Traditional depth sensors struggle with transparent or reflective surfaces (like glass bottles), often returning noisy data or “holes” in the depth map. Furthermore, collecting real-world training data for every possible arrangement of objects is prohibitively expensive. Consequently, most robots today struggle to fetch objects safely without disrupting their environment.

In this post, we dive into FetchBot, a new framework presented at the intersection of robotics and computer vision. FetchBot addresses the challenge of generalizable object fetching in cluttered scenes through a clever Zero-Shot Sim-to-Real approach.

Figure 1: FetchBot overview showing the pipeline from synthetic data to real-world deployment on diverse objects.

As shown in Figure 1, the system is designed to handle diverse geometries, varying layouts, and multiple end-effectors (suction cups and grippers) without ever seeing the real-world scene during training.

The Data Bottleneck and the Simulation Solution

The first hurdle in training a robust fetching policy is data. Real-world data collection is slow and risky—robots break things. Simulation is the obvious alternative, but existing simulators often generate “sparse” scenes where objects are far apart to avoid physics instability. This doesn’t help the robot learn to navigate a crowded shelf.

To solve this, the researchers developed UniVoxGen (Unified Voxel-based Scene Generator).

Voxel-Based Scene Generation

Instead of relying on heavy physics engines to check for collisions during scene generation (which is slow), UniVoxGen voxelizes objects. It treats 3D space as a grid of tiny cubes (voxels).

Figure 3: Fundamental voxel operations including Union, Intersection, Difference, and Transformation.

As illustrated in Figure 3, the system uses fundamental set operations:

  • Union: To add an object to the scene.
  • Intersection: To instantly check if a new object collides with existing ones.
  • Difference: To remove objects.
  • Transformation: To rotate and position objects.

This approach is computationally lightweight, allowing the team to generate 1 million diverse cluttered scenes. These scenes aren’t just random piles; they mimic shelves, tabletops, drawers, and racks (see Figure 10 below), providing the dense ground-truth data needed to train a capable policy. A minimal sketch of the voxel operations follows Figure 10.

Figure 10: Examples of cluttered scenes generated by UniVoxGen across different environments.
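
To make these operations concrete, here is a minimal sketch in Python, assuming objects and the scene are boolean occupancy grids of a shared, fixed resolution. The grid size, helper names, and placement loop are illustrative assumptions, not the UniVoxGen implementation.

```python
import numpy as np

GRID = (128, 128, 128)  # hypothetical scene resolution (an assumption, not the paper's value)

def union(scene: np.ndarray, obj: np.ndarray) -> np.ndarray:
    """Add an object's voxels to the scene."""
    return scene | obj

def intersects(scene: np.ndarray, obj: np.ndarray) -> bool:
    """Collision check: is any voxel occupied by both the scene and the object?"""
    return bool(np.any(scene & obj))

def difference(scene: np.ndarray, obj: np.ndarray) -> np.ndarray:
    """Remove an object's voxels from the scene."""
    return scene & ~obj

def translate(obj: np.ndarray, offset) -> np.ndarray:
    """Shift an object's voxels by an integer voxel offset, discarding voxels that leave the grid."""
    shifted = np.zeros_like(obj)
    src, dst = [], []
    for size, s in zip(obj.shape, offset):
        src.append(slice(max(0, -s), size - max(0, s)))
        dst.append(slice(max(0, s), size + min(0, s)))
    shifted[tuple(dst)] = obj[tuple(src)]
    return shifted

# Placing an object: propose a pose, keep it only if the cheap logical-AND check finds no collision.
scene = np.zeros(GRID, dtype=bool)
obj = np.zeros(GRID, dtype=bool)
obj[10:20, 10:20, 0:30] = True           # toy box-shaped object
candidate = translate(obj, (40, 40, 0))  # proposed placement
if not intersects(scene, candidate):
    scene = union(scene, candidate)
```

Because every check is vectorized boolean arithmetic rather than a physics query, scenes can be assembled and validated far faster than with a full physics engine.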

The Teacher: Dynamics-Aware Oracle Policy

With millions of scenes available, how do we teach the robot to act? Training a vision-based policy directly is impractical: learning what to do from pixels is too hard when the agent does not yet know how to move safely.

The researchers first train an Oracle Policy using Reinforcement Learning (RL). The Oracle is a “teacher” that has cheating privileges: it has access to the perfect, ground-truth state of the simulator (exact positions of all objects, no visual occlusion).

Shaping Behavior with Rewards

The Oracle learns to fetch targets while minimizing disturbance to surrounding items. The reward function is critical here. It isn’t enough to just grab the object; the robot must be gentle.

The researchers used a composite reward function that includes:

  1. Task Reward: For successfully lifting the object.
  2. Safety Constraints: To prevent the robot from moving awkwardly or hitting barriers.
  3. Environment Penalty: To punish the robot if nearby obstacles move (collision).

Equation showing the penalty for excessive acceleration during penetration/collision.

For example, the penalty on sudden accelerations caused by penetration (i.e., collisions) ensures the agent learns not to smash into the shelf.

Equation showing the penalty for translational movement of obstacles.

Similarly, the system penalizes any translational or rotational movement of obstacle objects (as shown above), forcing the policy to find “needle-in-a-haystack” trajectories.
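
As an illustration of how such a composite reward might be assembled, the sketch below combines a task term, a safety term, and environment-disturbance penalties. The weights and exact penalty forms are assumptions made for clarity, not the values or formulas from the paper.

```python
import numpy as np

def composite_reward(lift_height, joint_limit_violation,
                     obstacle_displacements, obstacle_accels,
                     w_task=1.0, w_safety=1.0, w_disturb=0.5, w_accel=0.1):
    """Illustrative composite reward; weights and term definitions are assumptions."""
    # 1. Task reward: progress toward lifting the target object.
    r_task = w_task * lift_height

    # 2. Safety constraint: penalize unsafe arm configurations (e.g., hitting joint limits).
    r_safety = -w_safety * float(joint_limit_violation)

    # 3. Environment penalty: punish any translation of surrounding obstacles ...
    r_disturb = -w_disturb * float(np.sum(np.linalg.norm(obstacle_displacements, axis=-1)))

    # ... and sudden obstacle accelerations caused by penetration (collisions).
    r_accel = -w_accel * float(np.sum(np.linalg.norm(obstacle_accels, axis=-1)))

    return r_task + r_safety + r_disturb + r_accel
```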

The Student: Vision-Based Policy

Once the Oracle (teacher) is trained, it generates 500k high-quality demonstrations. Now, the goal is to train a Vision Policy (student) that can mimic the Oracle using only visual inputs (RGB cameras), which is all the robot will have in the real world.
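
Conceptually, this distillation step looks like behavior cloning: regress the oracle's action from the student's visual observation of the same state. The sketch below is a minimal placeholder (a small MLP head and an MSE loss), not the paper's actual architecture, which relies on the occupancy-pretrained vision encoder described later.

```python
import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Placeholder student: maps encoded visual features to an action vector."""
    def __init__(self, feat_dim=512, act_dim=7):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, visual_features):
        return self.head(visual_features)

policy = StudentPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_step(visual_features, oracle_actions):
    """One imitation step: match the oracle's action on the same simulated state."""
    pred = policy(visual_features)
    loss = nn.functional.mse_loss(pred, oracle_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```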

This is where the Sim-to-Real gap usually breaks systems. Simulation renders perfect depth. Real-world depth sensors (like Intel RealSense) produce noise, especially on shiny or clear objects. If you train on perfect sim depth, the robot fails in the real world.

FetchBot introduces a two-stage pipeline to bridge this gap, summarized in Figure 2.

Figure 2: The full FetchBot pipeline, from data generation (A) to zero-shot sim-to-real transfer (D).

1. Bridging the Gap with Predicted Depth

Instead of using raw depth from sensors, FetchBot uses a Depth Foundation Model (specifically, “Depth Anything”).

  • In Simulation: RGB images are fed into the foundation model to predict depth.
  • In Real World: Real RGB images are fed into the same foundation model to predict depth.

By using the foundation model as a middleman, the input space becomes consistent. The foundation model tends to smooth out noise and, crucially, can infer the depth of transparent objects based on context, where physical sensors would see “through” the object.
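
In practice, this “middleman” step can be as simple as running every RGB frame, simulated or real, through the same monocular depth model. The sketch below uses the Hugging Face depth-estimation pipeline with a Depth Anything checkpoint; the model ID and file paths are assumptions, not details taken from the paper.

```python
from transformers import pipeline
from PIL import Image

# Assumed checkpoint; any Depth Anything variant would play the same role.
depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

def rgb_to_depth(image_path: str):
    """Predict a dense (relative) depth map from a single RGB image, sim or real."""
    image = Image.open(image_path).convert("RGB")
    return depth_estimator(image)["depth"]  # PIL image of predicted depth

# The same call is applied to simulated renders and to real camera frames,
# keeping the policy's input distribution consistent across both domains.
sim_depth = rgb_to_depth("render_from_sim.png")        # hypothetical paths
real_depth = rgb_to_depth("frame_from_realsense.png")
```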

2. Occupancy Prediction: The “X-Ray” Vision

Even with good depth maps, a single camera view has blind spots (occlusions). If a soda can is behind a box, the camera can’t see it. A standard approach is to project the depth map into a 3D point cloud or voxel grid. However, this results in “shadows” or empty spaces where the camera view is blocked.

FetchBot trains its vision encoder to perform Semantic Occupancy Prediction. It forces the model to look at the partial view and guess what the full geometry looks like, including the back of objects and occluded areas.

Figure 6: Comparison of standard RGB-D voxelization (A) vs. FetchBot’s occupancy query (B). The standard method misses the occluded area, leading to a collision.

As shown in Figure 6, standard RGB-D projection leaves an “incomplete” map. The robot thinks the space behind the obstacle is empty and might path through it, causing a collision. FetchBot’s occupancy method infers the hidden geometry, allowing the robot to plan a safe path around the obstacle.
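
The “shadow” problem is easy to see in code. A naive back-projection only marks voxels that a depth pixel actually hits, so everything behind a visible surface stays empty. The intrinsics, grid layout, and centering below are illustrative assumptions.

```python
import numpy as np

def depth_to_voxels(depth, fx, fy, cx, cy, voxel_size=0.01, grid=(128, 128, 128)):
    """Back-project a depth map (meters) into a boolean voxel grid in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))       # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    points = points[np.isfinite(points[:, 2]) & (points[:, 2] > 0)]

    idx = np.floor(points / voxel_size).astype(int)
    idx[:, :2] += np.array(grid[:2]) // 2                # shift x/y so indices are non-negative
    valid = np.all((idx >= 0) & (idx < np.array(grid)), axis=1)

    occ = np.zeros(grid, dtype=bool)
    occ[tuple(idx[valid].T)] = True
    # Every voxel behind a visible surface remains False: these are the "shadow"
    # regions that FetchBot's occupancy prediction is trained to fill in.
    return occ
```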

The Mechanism: Deformable Cross-Attention

To achieve this 3D understanding from 2D images, the system uses a mechanism called Deformable Cross-Attention (DCA).

Equation for Deformable Cross-Attention (DCA) used to fuse multi-view features.

The system defines a grid of 3D queries (voxels). For each voxel query (\(q_p\)), it projects the 3D point onto the 2D feature maps from the cameras. It then aggregates features from those specific 2D locations to determine if that 3D voxel is occupied. This effectively fuses information from multiple camera views into a coherent 3D understanding.
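
A simplified sketch of this mechanism is shown below: each voxel query is projected into every camera's feature map, a few offsets around that reference point are sampled, and the samples are blended with learned attention weights. The projection model, tensor shapes, and the linear layers producing offsets and weights are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def deformable_cross_attention(queries, voxel_xyz, feat_maps, proj_mats,
                               offset_net, weight_net, n_points=4):
    """queries: (N, C) voxel query features; voxel_xyz: (N, 3) voxel centers;
    feat_maps: (V, C, H, W) per-view image features; proj_mats: (V, 3, 4) camera projections."""
    V, C, H, W = feat_maps.shape
    N = queries.shape[0]
    out = torch.zeros_like(queries)
    homog = torch.cat([voxel_xyz, torch.ones(N, 1)], dim=1)            # (N, 4) homogeneous points

    for v in range(V):
        uvw = homog @ proj_mats[v].T                                   # project into view v
        ref = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)                  # 2D reference points (pixels)
        offsets = offset_net(queries).view(N, n_points, 2)             # learned sampling offsets
        weights = weight_net(queries).softmax(dim=-1)                  # (N, P) attention weights
        sample = ref.unsqueeze(1) + offsets                            # (N, P, 2) sampling locations
        sample = sample / torch.tensor([W - 1, H - 1]) * 2 - 1         # normalize to [-1, 1]
        sampled = F.grid_sample(feat_maps[v:v + 1], sample.unsqueeze(0),
                                align_corners=True)                    # (1, C, N, P)
        out = out + (sampled[0] * weights.unsqueeze(0)).sum(-1).T      # weighted sum -> (N, C)
    return out / V

# Hypothetical instantiation of the learned sampling modules:
# offset_net = torch.nn.Linear(C, n_points * 2); weight_net = torch.nn.Linear(C, n_points)
```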

Experiments and Results

The researchers evaluated FetchBot in both simulation and the real world, comparing it against heuristics, motion planning (cuRobo), and other learning-based methods (Diffusion Policy, RGB-D methods).

Simulation Performance

In simulation, they measured success rate (did it fetch the item?) and disturbance (how much did other items move?).
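
As a rough illustration, a translation-disturbance metric of this kind can be computed by summing how far every non-target object moved over an episode; the exact definition used in the paper may differ (for instance, it may also track rotation).

```python
import numpy as np

def translation_disturbance(initial_positions: np.ndarray, final_positions: np.ndarray) -> float:
    """Sum of per-obstacle center displacements (meters); positions are (N_obstacles, 3) arrays."""
    return float(np.linalg.norm(final_positions - initial_positions, axis=1).sum())
```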

Table comparing FetchBot against baselines in simulation.

FetchBot (labeled “Occupancy (Ours)”) outperformed all baselines. Notably, it achieved an 81.46% success rate with suction cups and 91.02% with grippers, while keeping translation disturbance significantly lower than other methods.

Real-World Zero-Shot Transfer

The ultimate test is the real world. The system was deployed on a Flexiv Rizon 4S robot arm with no real-world fine-tuning (Zero-Shot).

Figure 8: Bar charts comparing success rates across different input modalities in the real world.

The results in Figure 8 are striking.

  • Standard Point Cloud methods (DP3) struggled (approx 46-53% success) because real-world depth sensors are noisy.
  • RGB-only methods failed to grasp the 3D geometry needed for obstacle avoidance.
  • FetchBot (Ours) achieved 86.6% (Suction) and 93.3% (Gripper) success rates.

Why does it work better?

The qualitative difference is visible in the scene reconstruction. Figure 7 compares the standard voxelization against FetchBot’s occupancy prediction.

Figure 7: Real-world reconstruction comparison. (A) shows noisy, incomplete voxels from raw sensors. (B) shows the clean, complete reconstruction from FetchBot.

FetchBot’s reconstruction (Figure 7B) is complete and smooth, filling in the gaps that confuse other planners. This robustness extends to challenging materials. As seen in Figure 13 below, the system can reconstruct the geometry of transparent bottles and reflective cans—objects that usually appear invisible or distorted to infrared depth sensors.

Figure 13: Real-world occupancy reconstruction handling varying shapes and materials.

Robustness and Generalization

The system proved capable of generalizing beyond just shelves. It was tested on tabletops and drawers, demonstrating the ability to handle different fetching contexts without retraining.

Figure 9: FetchBot extending to tabletop suction tasks and drawer fetching tasks.

Conclusion and Future Directions

FetchBot represents a significant step forward in embodied AI. By acknowledging the limitations of real-world sensors and the scarcity of data, the authors crafted a solution that leverages the best of both worlds: the infinite scalability of simulation and the semantic power of foundation models.

Key Takeaways:

  1. Intermediate Representations Matter: Using predicted depth rather than raw sensor depth standardizes the input, effectively closing the visual Sim-to-Real gap.
  2. Inference over Perception: Don’t just trust what the sensor sees. Training the model to predict occupancy (filling in the blanks) allows for safe navigation in occluded spaces.
  3. Scale via Procedural Generation: Simple, voxel-based generation (UniVoxGen) is sufficient to learn complex, generalizable behaviors.

Limitations: While impressive, the system has limits. Figure 18 highlights scenarios where the robot may hit joint limits while trying to avoid obstacles, or where objects are so large they require dual-arm manipulation. Furthermore, completely occluded objects that are physically unreachable require high-level reasoning (move object A to reach object B), which is beyond the current scope of the policy.

Figure 18: Analysis of failure modes, including joint limits, large objects, and completely unreachable targets.

FetchBot demonstrates that with the right abstraction of visual data, robots can learn to navigate the messy, cluttered reality of our world, one voxel at a time.