Introduction: The Quest for the Generalist Household Robot

Imagine a robot that can walk into any kitchen, identify the ingredients for a meal, find the necessary cookware, and start cooking—all without having ever seen that specific room before. This is the “Holy Grail” of Embodied AI: a generalist robot capable of performing complex, multi-stage tasks in diverse, unstructured environments.

However, there is a massive roadblock standing between us and that future: Data.

Training these intelligent agents requires oceans of data—specifically, interaction data where a robot manipulates objects. Collecting this in the real world is slow, expensive, and potentially dangerous. While simulation has long been the proposed solution, existing platforms have suffered from a “Goldilocks” problem. Some are great at generating diverse static scenes but lack realistic physics. Others have great physics but rely on small, fixed sets of objects. Most critically, very few effectively handle mobile manipulation—the ability for a robot to move around a room while using its hands.

Enter AgentWorld, a new interactive simulation platform presented at CoRL 2025. This research introduces a unified framework that combines high-fidelity, procedurally generated environments with a robust mobile teleoperation system.

In this deep dive, we will explore how AgentWorld leverages game engines like Unreal Engine and NVIDIA’s Isaac Sim to construct realistic homes, how it solves the complex problem of collecting training data for mobile robots, and what this means for the future of imitation learning.

Figure 1: Overview of AgentWorld. The AgentWorld simulation platform features several core capabilities for embodied AI: (1) procedural scene construction supporting varied layout generation; (2) an abundant repository of semantic 3D assets with realistic visual materials and physical properties; (3) a mobile-base teleoperation system for robotic manipulation.

Background: The Simulation Landscape

To appreciate what AgentWorld brings to the table, we first need to understand the current landscape of robotic simulation.

For years, researchers have relied on platforms like AI2-THOR, Gibson, or ManiSkill. These tools have been instrumental, but they often impose hard limits on what a researcher can do. Typically, you encounter one of two trade-offs:

  1. Fixed Bases: The robot is bolted to the floor. It can learn to pick up a cup, but it can’t learn to walk to the sink.
  2. Static Assets: The environments look nice, but you can’t interact with them meaningfully (e.g., doors don’t open, or materials don’t behave physically).

The researchers behind AgentWorld identified a gap: there wasn’t a platform that combined Procedural Scene Construction (making infinite unique rooms) with Mobile-based Teleoperation (controlling a moving robot to collect data).

Table 1: Comparison of robotic simulation platforms in terms of asset properties, robotic platform support, and data collection capabilities. Fixed-B and Mobile-B stand for fixed-base and mobile-base robots. The teleoperation column indicates support for joint action control (J), floating base control (FB), and locomotion control (L) for humanoid robots. AgentWorld represents our proposed platform integrating all key capabilities.

As shown in Table 1 above, AgentWorld distinguishes itself by checking every box. It supports over 9,000 assets with material selection, realistic physics configurations, and critical support for mobile bases (both wheeled and legged). This holistic approach is designed to bridge the “Sim-to-Real” gap—the difficulty of taking a brain trained in a video game and making it work on a physical robot.

Core Method: Constructing the World

The heart of the AgentWorld platform is its ability to generate the world itself. The researchers built a pipeline that doesn’t just place random objects in a void; it constructs “semantically meaningful” environments. This ensures that a toaster appears in a kitchen, not a bedroom, and that it sits on a counter, not the floor.

The construction pipeline operates in four distinct stages, leveraging the rendering power of Unreal Engine and the physics accuracy of NVIDIA Omniverse Isaac Sim.

Figure 2: Pipeline of the scene construction module in AgentWorld.

1. Layout Generation

The process begins with the architectural shell. Instead of using pre-scanned 3D meshes of rooms (which are static and hard to modify), AgentWorld procedurally generates layouts. It determines walls, ceilings, floors, and even stairs for multi-story environments. It supports three primary room types: Living Rooms, Kitchens, and Bedrooms. By algorithmically varying the dimensions and connections between rooms, the system ensures that the robot never “overfits” to a single floor plan.
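To make this concrete, here is a minimal sketch of what procedural layout sampling could look like. The three room types and the idea of randomizing dimensions come from the paper; the `Room` dataclass, the dimension ranges, and the simple chain-style layout are illustrative assumptions, not AgentWorld's actual implementation.

```python
import random
from dataclasses import dataclass

ROOM_TYPES = ["living_room", "kitchen", "bedroom"]  # room types supported by AgentWorld

@dataclass
class Room:
    """Hypothetical container for a procedurally sampled room."""
    kind: str
    width: float   # meters (ranges below are illustrative, not from the paper)
    depth: float
    height: float

def sample_layout(num_rooms: int, seed: int = 0) -> list:
    """Sample a simple chain of rooms with randomized types and dimensions."""
    rng = random.Random(seed)
    layout = []
    for _ in range(num_rooms):
        layout.append(Room(
            kind=rng.choice(ROOM_TYPES),
            width=rng.uniform(3.0, 6.0),
            depth=rng.uniform(3.0, 6.0),
            height=rng.uniform(2.6, 3.2),
        ))
    return layout

if __name__ == "__main__":
    for room in sample_layout(num_rooms=3, seed=42):
        print(room)
```

The real pipeline also has to solve wall and door placement, room connectivity, and stair generation for multi-story environments, all of which this sketch leaves out.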

2. Semantic Asset Selection and Placement

Once the walls are up, the room needs furniture. This is where the Semantic Asset engine comes in. The system utilizes a massive library of 3D assets categorized into:

  • Basic Assets: Furniture like sofas, beds, and tables.
  • Interactable Assets: Objects the robot manipulates, such as microwaves (articulated) or fruits and knives (rigid).

The placement isn’t random. The system uses semantic rules to ensure functional plausibility. For example, pillows are spawned on beds, and food items are placed on dining tables. This seemingly small detail is crucial for training “common sense” into AI agents—teaching them where to look for specific items.
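A minimal sketch of how such semantic rules might be encoded is shown below, assuming a simple lookup from asset category to allowed support surfaces. The pillow-on-bed and food-on-table pairings come from the article; the `PLACEMENT_RULES` table and the `choose_support` helper are hypothetical, not AgentWorld's API.

```python
import random
from typing import Optional

# Hypothetical mapping from asset category to semantically valid support surfaces.
PLACEMENT_RULES = {
    "pillow":  ["bed", "sofa"],
    "food":    ["dining_table", "kitchen_counter"],
    "toaster": ["kitchen_counter"],
    "book":    ["shelf", "desk"],
}

def choose_support(asset_category: str, available_surfaces: list) -> Optional[str]:
    """Pick a semantically valid support surface for an asset, if one exists."""
    allowed = PLACEMENT_RULES.get(asset_category, [])
    candidates = [s for s in available_surfaces if s in allowed]
    return random.choice(candidates) if candidates else None

# Example: a kitchen scene exposes these support surfaces.
surfaces = ["kitchen_counter", "dining_table", "floor"]
print(choose_support("toaster", surfaces))  # -> "kitchen_counter"
print(choose_support("pillow", surfaces))   # -> None (no bed in a kitchen)
```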

3. Visual Material Configuration

One of the biggest reasons robots fail when moving from simulation to reality is visual discrepancy. A simulated wooden table might look like a flat brown texture, whereas a real table has grain, gloss, and imperfections.

AgentWorld tackles this using Physically Based Rendering (PBR) materials. The system can dynamically swap materials on objects—changing a floor from marble to brick, or a cabinet from wood to metal. This variety acts as a powerful form of “data augmentation,” forcing the robot’s vision system to focus on the shape and utility of an object rather than just its color or texture.
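The sketch below illustrates material randomization as a form of domain randomization, assuming a small dictionary of PBR parameters per material family. The marble, brick, wood, and metal swaps are mentioned in the article; the specific parameter values and the `randomize_material` function are illustrative assumptions rather than AgentWorld's actual material system.

```python
import random

# Hypothetical PBR parameter presets per material family (values are illustrative).
PBR_MATERIALS = {
    "wood":   {"base_color": (0.45, 0.30, 0.18), "roughness": 0.7, "metallic": 0.0},
    "marble": {"base_color": (0.90, 0.88, 0.85), "roughness": 0.2, "metallic": 0.0},
    "brick":  {"base_color": (0.60, 0.25, 0.20), "roughness": 0.9, "metallic": 0.0},
    "metal":  {"base_color": (0.75, 0.75, 0.78), "roughness": 0.3, "metallic": 1.0},
}

def randomize_material(surface_name: str, rng: random.Random) -> dict:
    """Assign a randomly chosen PBR material (with jittered roughness) to a surface."""
    family = rng.choice(list(PBR_MATERIALS))
    params = dict(PBR_MATERIALS[family])
    params["roughness"] = min(1.0, max(0.0, params["roughness"] + rng.uniform(-0.1, 0.1)))
    return {"surface": surface_name, "material": family, **params}

rng = random.Random(0)
for surface in ["floor", "cabinet_door", "countertop"]:
    print(randomize_material(surface, rng))
```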

4. Interactive Physics Simulation

Finally, the world must behave according to the laws of physics. This stage is handled by NVIDIA Isaac Sim’s GPU-accelerated PhysX 5.0 engine.

Visuals are handled in Unreal Engine for beauty, but physics requires math. The system automatically calculates (see the sketch after this list):

  • Collision Primitives: Simplified shapes (convex hulls) that approximate complex objects so the robot doesn’t clip through them.
  • Mass and Friction: A metal pot should slide differently than a cardboard box. The system assigns friction coefficients (e.g., Wood: \(0.4 \pm 0.1\)) based on the material type.
  • Articulation: For objects with doors (fridges, ovens), the system configures joints and movement ranges, ensuring a drawer pulls out rather than swinging open.
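The sketch below ties the first two of these ingredients together for a single rigid object, assuming a small lookup of friction coefficients and densities per material. The wood friction value of \(0.4 \pm 0.1\) is quoted above; every other number, and the `configure_physics` helper itself, is an illustrative assumption rather than AgentWorld's actual configuration.

```python
import random

# Friction coefficients as (mean, spread); the wood value comes from the article,
# the others are placeholder assumptions. Densities are likewise illustrative.
FRICTION = {"wood": (0.4, 0.1), "metal": (0.3, 0.05), "cardboard": (0.5, 0.1)}
DENSITY_KG_M3 = {"wood": 600.0, "metal": 7800.0, "cardboard": 100.0}

def configure_physics(material: str, volume_m3: float, rng: random.Random) -> dict:
    """Build a physics config: convex-hull collider, sampled friction, and mass."""
    mu, spread = FRICTION[material]
    return {
        "collision_approximation": "convexHull",         # simplified collision primitive
        "static_friction": rng.uniform(mu - spread, mu + spread),
        "mass_kg": DENSITY_KG_M3[material] * volume_m3,  # mass from density * volume
    }

rng = random.Random(1)
print(configure_physics("wood", volume_m3=0.02, rng=rng))    # e.g. a small wooden box
print(configure_physics("metal", volume_m3=0.002, rng=rng))  # e.g. a metal pot
```

Articulated objects such as fridges and drawers additionally need joint types and limits, which a config like this would extend with per-joint entries.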

The Mobile-Based Teleoperation System

Building a realistic world is only half the battle. To train an imitation learning agent (an AI that learns by watching), you need demonstrations. You need a human to “pilot” the robot to show it how to perform tasks.

This is notoriously difficult for mobile manipulation. Controlling a robotic arm is hard; controlling a robotic arm while driving is a nightmare. AgentWorld introduces a dual-mode data collection system that splits the cognitive load.

Figure 3: Data collection system of AgentWorld. For mobile-base control, we allow users to use the keyboard to control robots, both wheeled and legged. For arm & hand control, we use the VR headset to get the hand pose, compute IK to obtain the arm action, and utilize retargeting methods to drive robotic hands.

Mode A: Mobile-Base Control

The researchers simplified navigation by mapping it to standard WASD keyboard controls, familiar to anyone who has played a PC video game.

  • W/S: Move Forward/Backward.
  • A/D: Turn Left/Right.
  • Q/E: Rotate Torso.

For wheeled robots, this directly controls velocity. For humanoid (legged) robots, the system utilizes a reinforcement learning-based locomotion policy. This means the human operator doesn’t need to manually control every joint in the robot’s legs. They simply press “W” to go forward, and the underlying policy handles the complex balance and foot placement required to walk. This abstraction allows the operator to focus on the task at hand.
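Below is a minimal sketch of this key-to-command mapping. The W/S/A/D/Q/E bindings come from the article; the velocity magnitudes and the `BaseCommand` structure are assumptions for illustration. For a legged robot, the resulting command would be handed to the learned locomotion policy rather than applied directly as wheel velocities.

```python
from dataclasses import dataclass

@dataclass
class BaseCommand:
    forward_velocity: float = 0.0  # m/s, positive = forward
    yaw_rate: float = 0.0          # rad/s, positive = turn left
    torso_yaw_rate: float = 0.0    # rad/s, torso rotation

# Key bindings from the article; magnitudes are illustrative assumptions.
KEY_BINDINGS = {
    "w": ("forward_velocity",  0.5),
    "s": ("forward_velocity", -0.5),
    "a": ("yaw_rate",          0.8),
    "d": ("yaw_rate",         -0.8),
    "q": ("torso_yaw_rate",    0.5),
    "e": ("torso_yaw_rate",   -0.5),
}

def keys_to_command(pressed_keys: set) -> BaseCommand:
    """Translate currently pressed keys into a base velocity command."""
    cmd = BaseCommand()
    for key in pressed_keys:
        if key in KEY_BINDINGS:
            field, value = KEY_BINDINGS[key]
            setattr(cmd, field, getattr(cmd, field) + value)
    return cmd

print(keys_to_command({"w", "a"}))  # drive forward while turning left
```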

Mode B: Arm & Hand Control

While the keyboard handles movement, the complex manipulation is handled via a VR headset. This creates an immersive interface where the operator’s real hand movements are mapped to the robot.

The system uses Dex-Retargeting. Since a human hand has different dimensions and kinematics than a robot gripper or a multi-fingered robot hand, direct mapping doesn’t work. Retargeting algorithms translate the position of human finger keypoints into the joint angles of the robot.

  • Grippers: The system calculates the normalized distance between the operator’s thumb and index finger to control the open/close state of a gripper (see the sketch after this list).
  • Dexterous Hands: For complex hands (like the TRX-Hand5), the system maps full finger articulation, allowing for intricate gestures.
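Here is a minimal sketch of the gripper case, assuming the VR hand tracker exposes 3D fingertip keypoints: the thumb-index distance is normalized and clipped into [0, 1] to set the gripper opening. The keypoint indexing, the distance bounds, and the `gripper_command` helper are assumptions for illustration; full dexterous retargeting for hands like the TRX-Hand5 solves an optimization over all finger keypoints and is not shown here.

```python
import numpy as np

def thumb_index_distance(hand_keypoints: np.ndarray) -> float:
    """Euclidean distance between thumb tip and index tip (x, y, z keypoints)."""
    thumb_tip, index_tip = hand_keypoints[4], hand_keypoints[8]  # MediaPipe-style indexing (assumption)
    return float(np.linalg.norm(thumb_tip - index_tip))

def gripper_command(hand_keypoints: np.ndarray,
                    closed_dist: float = 0.02,  # meters: fingers fully pinched
                    open_dist: float = 0.10) -> float:
    """Map the pinch distance to a normalized gripper opening in [0, 1]."""
    d = thumb_index_distance(hand_keypoints)
    return float(np.clip((d - closed_dist) / (open_dist - closed_dist), 0.0, 1.0))

# Example: 21 hand keypoints from a VR hand-tracking stream (dummy values here).
keypoints = np.zeros((21, 3))
keypoints[4] = [0.00, 0.00, 0.00]  # thumb tip
keypoints[8] = [0.06, 0.00, 0.00]  # index tip, 6 cm away -> half-open gripper
print(gripper_command(keypoints))   # 0.5
```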

By combining these two modes, AgentWorld enables the collection of long-horizon data. A user can walk a robot from the living room to the kitchen (keyboard), approach a fridge, open it (VR), pick up an apple (VR), close the door (VR), and walk back (keyboard).

The AgentWorld Dataset and Experiments

Using this platform, the researchers constructed the AgentWorld Dataset. It contains over 1,000 manipulation trajectories across 150 unique scenes, utilizing 4 different robot embodiments (including the Unitree G1 and H1 humanoids).

The tasks are divided into two categories:

  1. Basic Tasks: Primitive actions like Pick & Place, Open & Close, Push & Pull.
  2. Multistage Tasks: Complex activities requiring sequential logic, such as “Serve Drinks,” “Organize Books,” or “Heat up Food.”

Benchmarking Imitation Learning

To prove the dataset’s value, the researchers benchmarked several state-of-the-art imitation learning algorithms: Behavior Cloning (BC), Action Chunking Transformers (ACT), Diffusion Policy (DP), and \(\pi_0\).
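For readers unfamiliar with the simplest of these baselines, here is a minimal sketch of behavior cloning: a policy network is trained with supervised learning to regress the demonstrated action from the observation. The network size, the input/output dimensions, and the use of PyTorch are illustrative assumptions; ACT, Diffusion Policy, and \(\pi_0\) use substantially more sophisticated architectures and training objectives.

```python
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """A small MLP policy regressing actions from flattened observations."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Dummy batch standing in for (observation, action) pairs from teleoperated demos.
obs = torch.randn(64, 128)     # e.g. flattened proprioception + visual features (assumption)
actions = torch.randn(64, 14)  # e.g. arm joint targets + gripper command (assumption)

policy = BCPolicy(obs_dim=128, act_dim=14)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

for step in range(3):  # a few illustrative gradient steps
    loss = nn.functional.mse_loss(policy(obs), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```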

The results revealed significant insights into the current state of robotic learning:

  • Short-term success: For basic tasks like “Open & Close,” algorithms like ACT and Diffusion Policy performed well, achieving success rates of 70-80%.
  • The Long-Horizon Struggle: For multistage tasks, performance dropped sharply. The \(\pi_0\) model, which integrates language and vision, performed best but still only achieved 20-30% success rates on tasks like “Living Room Organization.”

This low success rate is actually a positive indicator for the dataset—it shows that AgentWorld provides a sufficiently challenging benchmark that existing models have not yet saturated. It highlights the difficulty of combining precise manipulation with mobile navigation.

Figure 4: Qualitative results for different imitation learning algorithms on the AgentWorld Dataset, and a sim-to-real transfer example to validate the availability and generalizability of our data.

Sim-to-Real Transfer

The ultimate test of any simulation is reality. Can a robot trained in AgentWorld function in the real world?

The researchers trained a policy entirely in simulation for a “pick and place” task (putting objects into a bowl) and then fine-tuned it with a small amount of real-world data (few-shot learning). As shown in the qualitative results above (Figure 4), the robot successfully transferred the skill. The rigorous physical parameters and PBR materials used in AgentWorld allowed the policy to generalize to real-world visuals and physics, validating the platform’s fidelity.

Conclusion and Implications

AgentWorld represents a significant step forward in the democratization of Embodied AI research. By automating the tedious process of scene construction and solving the interface challenges of mobile teleoperation, it provides a scalable path to generating the massive datasets robotics desperately needs.

For students and researchers, this paper highlights two critical trends:

  1. The convergence of tools: The future of robotics lies in the integration of diverse technologies—gaming engines for visuals, industrial sims for physics, and VR for human interaction.
  2. The shift to mobile manipulation: As we master stationary pick-and-place, the frontier is moving toward whole-body control and navigation.

While limitations remain—such as the simulation of deformable objects like cloth—AgentWorld offers a robust foundation for the next generation of generalist household robots. It moves us away from static, scripted environments toward dynamic worlds where robots can learn to truly live and work alongside us.