Introduction
In the field of Artificial Intelligence, language models like GPT-4 have achieved remarkable capabilities largely because they were trained on the entire textual internet. Robotics, however, faces a distinct “data starvation” problem. While text and images are abundant, robotic data—specifically, data that links visual perception to physical action—is incredibly scarce.
Collecting data on real robots is slow, expensive, and potentially dangerous. The alternative has traditionally been simulation (Sim-to-Real), where we build digital twins of the real world. But creating these digital twins usually requires complex setups: multi-view camera rigs, 3D scanners, and manual asset creation. You can’t just take a photo of your messy kitchen and expect a robot to learn how to clean it… until now.
In the paper “Robot Learning from Any Images,” researchers introduce RoLA, a framework that democratizes robotic data generation. RoLA can take a single, standard image—whether it’s a photo from your phone or a random image downloaded from the internet—and transform it into a fully interactive, physics-enabled environment.

As shown in Figure 1, the pipeline takes a static image, recovers the 3D geometry and physical properties, allows a robot to practice in that virtual space, and generates training data that works in the real world. This capability unlocks the potential to use the millions of images already existing on the internet as training grounds for intelligent robots.
The Context: Breaking the Hardware Barrier
To understand why RoLA is significant, we need to look at how “Real-to-Sim” usually works. Traditionally, if you wanted to simulate a specific table with objects on it, you would need to reconstruct the scene geometry from multiple angles (using techniques like photogrammetry or NeRFs). This confines data collection to controlled laboratory settings with dedicated multi-camera rigs.
The researchers asked a fundamental question: Can we obtain robot-complete data from a single image?
Their insight relies on the power of modern generative AI. We no longer need fifty photos to understand the 3D shape of an apple; foundation models have seen enough apples to guess the shape from a single view. By leveraging these priors, RoLA eliminates the need for complex hardware, bridging the gap between passive visual data (photos) and embodied robotic action.
The RoLA Framework
The RoLA method is a pipeline divided into three logical steps: Recovering the Scene, Generating Data in Simulation, and Sim-to-Real Deployment.

Let’s break down how the system effectively “hallucinates” a physics engine from a flat JPEG.
Step 1: Recovering the Physical Scene
The goal here is to solve an inverse problem. We start with an image \(I\) and want to find the physical scene \(S\) and camera parameters \(C\) such that:

$$I = \pi(S, C)$$
Where \(\pi\) is the camera projection. This is inherently difficult because a single image lacks depth information. RoLA solves this by breaking the image down into its constituent parts: objects and background.
Geometry and Appearance
First, the system uses a segmentation model (Grounded SAM) to identify objects in the image.
- Objects: Once an object is masked out, it is passed to an image-to-3D generative model. This creates a textured 3D mesh of the object.
- Background: When you lift an object out of a 2D image, it leaves a “hole” in the background. RoLA uses an image inpainting model to fill this hole, creating a clean background plate.
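To make the decomposition concrete, here is a minimal sketch of how such a pipeline could be orchestrated. The `segmenter`, `image_to_3d`, and `inpainter` wrappers are hypothetical placeholders standing in for Grounded SAM, the image-to-3D generator, and the inpainting model; their interfaces are assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class SceneAsset:
    """A reconstructed object: its 2D mask in the photo plus a generated textured mesh."""
    name: str
    mask: np.ndarray    # (H, W) boolean mask from the segmenter
    mesh_path: str      # path to the image-to-3D model's output mesh


def recover_objects_and_background(image, segmenter, image_to_3d, inpainter):
    """Split a single photo into per-object 3D assets and a clean background plate."""
    assets: List[SceneAsset] = []

    # 1. Detect and mask every object in the image.
    for name, mask in segmenter.segment(image):
        # 2. Lift the masked object to a textured 3D mesh from this single view.
        mesh_path = image_to_3d.generate(image, mask)
        assets.append(SceneAsset(name=name, mask=mask, mesh_path=mesh_path))

    # 3. Removing the objects leaves holes; inpaint them to get a clean background plate.
    background = image.copy()
    if assets:
        holes = np.any(np.stack([a.mask for a in assets]), axis=0)
        background = inpainter.fill(background, holes)

    return assets, background
```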
To understand the 3D structure of the room, the system uses a metric depth prediction model. This predicts how far away every pixel is, allowing the researchers to construct a “point cloud”—a set of data points in space representing the scene.

$$\mathbf{P}(u, v) = D(u, v)\,\mathbf{K}^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$
Here, \(D(u,v)\) is the depth at pixel \((u,v)\) and \(\mathbf{K}\) is the camera intrinsic matrix. This equation effectively lifts the 2D image into 3D space.
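In code, this back-projection is just a few vectorized array operations. The NumPy sketch below assumes a pinhole camera with intrinsics \(\mathbf{K}\) and a dense metric depth map:

```python
import numpy as np


def backproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift a metric depth map into a 3D point cloud in camera coordinates.

    Implements P(u, v) = D(u, v) * K^{-1} [u, v, 1]^T for every pixel.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))        # pixel grid (H, W)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixel coords
    rays = pixels @ np.linalg.inv(K).T                    # K^{-1} [u, v, 1]^T per pixel
    points = rays * depth[..., None]                      # scale each ray by its depth
    return points.reshape(-1, 3)                          # (H*W, 3) point cloud
```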
Scene Configuration and Alignment
Having 3D meshes isn’t enough; they need to be placed correctly in a physics simulator. Gravity matters. If the floor in the simulation is tilted relative to the “floor” in the point cloud, objects will slide away immediately.
RoLA assumes the existence of a “supporting plane” (like a table or floor) perpendicular to gravity. It estimates the normal vector of the ground \(\mathbf{n}\) and calculates a rotation matrix \(\mathbf{R}\) to align the scene with the simulation’s Z-axis (gravity):

$$\mathbf{R}\,\mathbf{n} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$
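A standard way to build such a rotation is the Rodrigues formula for rotating one unit vector onto another. The sketch below is one possible implementation under that assumption, not necessarily the paper's exact procedure:

```python
import numpy as np


def rotation_to_gravity(n: np.ndarray) -> np.ndarray:
    """Rotation R such that R @ n points along the simulator's +Z (gravity-aligned) axis."""
    n = n / np.linalg.norm(n)                 # estimated support-plane normal
    z = np.array([0.0, 0.0, 1.0])             # simulator "up" axis
    v = np.cross(n, z)                        # rotation axis (unnormalized)
    c = float(np.dot(n, z))                   # cosine of the rotation angle
    if np.isclose(c, -1.0):                   # normal points straight down: 180° flip about X
        return np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0, -v[2], v[1]],
                   [v[2], 0, -v[0]],
                   [-v[1], v[0], 0]])         # skew-symmetric cross-product matrix
    return np.eye(3) + vx + vx @ vx / (1.0 + c)   # Rodrigues formula
```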
Physics Properties
A mesh has shape, but not mass or friction. How does the simulator know if an object is a heavy brick or a light sponge? RoLA uses Large Language Models (LLMs). By prompting an LLM with the object’s class name and visual context, the system infers plausible physical parameters (density, friction) to populate the physics engine.
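As a rough illustration, the query can be as simple as the sketch below. The prompt text, the generic `llm` callable, and the clamping ranges are all assumptions for demonstration; the paper's actual prompts and model are not reproduced here:

```python
import json

# Hypothetical prompt asking a language model for plausible physical parameters.
PHYSICS_PROMPT = """You are configuring a rigid-body physics simulator.
Object: "{name}" seen on a {surface} in a household scene.
Return JSON with keys "density_kg_m3", "friction", and "restitution"
containing plausible numeric values for this object."""


def infer_physics(name: str, surface: str, llm) -> dict:
    """Ask an LLM for plausible density/friction values for one object.

    `llm` is any callable mapping a prompt string to a text completion.
    """
    reply = llm(PHYSICS_PROMPT.format(name=name, surface=surface))
    params = json.loads(reply)                               # expect a JSON object back
    # Clamp to sane ranges so a bad completion cannot break the simulator.
    params["density_kg_m3"] = min(max(params["density_kg_m3"], 50.0), 20000.0)
    params["friction"] = min(max(params["friction"], 0.05), 2.0)
    return params


# Example: infer_physics("ceramic mug", "wooden table", my_llm)
# might return {"density_kg_m3": 2400.0, "friction": 0.5, "restitution": 0.1}
```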
Step 2: Robotic Data Generation
Once the scene is built, we need a robot to interact with it.
If the original image was taken by a robot, the system knows where the robot should be. But for random internet images, the robot’s position is unknown. RoLA employs a sampling-based method to find valid spots for the robot base. It calculates a “reachable workspace” shell and samples positions where the robot can reach the objects without clipping through the table.
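A simplified version of this idea is rejection sampling: draw candidate base positions, keep those inside the reachable shell around the objects, and discard any that collide with the table. The radii, sampling bounds, and table-footprint check below are illustrative assumptions:

```python
import numpy as np


def sample_base_positions(object_xy: np.ndarray,       # (N, 2) object positions on the support plane
                          table_half_extents: tuple,   # table footprint, assumed centered at the origin
                          reach_min: float = 0.35,     # inner radius of the reachable shell (assumed)
                          reach_max: float = 0.85,     # outer radius of the reachable shell (assumed)
                          n_samples: int = 1000) -> np.ndarray:
    """Rejection-sample robot base positions that can reach every object
    without the base overlapping the table footprint."""
    valid = []
    for _ in range(n_samples):
        xy = np.random.uniform(-1.5, 1.5, size=2)              # candidate base position
        dists = np.linalg.norm(object_xy - xy, axis=1)          # distance to each object
        reachable = np.all((dists > reach_min) & (dists < reach_max))
        inside_table = (abs(xy[0]) < table_half_extents[0]) and (abs(xy[1]) < table_half_extents[1])
        if reachable and not inside_table:
            valid.append(xy)
    return np.array(valid)
```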

With the robot placed, the system can now generate thousands of demonstrations. It can use motion planners or pretrained policies to make the robot perform tasks like “pick up the banana” or “pour the water” inside this hallucinated world.
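Conceptually, the data-generation loop might look like the sketch below, where `sim`, `planner`, and `task` are hypothetical interfaces standing in for whichever simulator, motion planner, and task specification are actually used:

```python
def generate_demonstrations(sim, planner, task, n_episodes: int = 1000):
    """Roll out a planner in the recovered scene and record (image, action) pairs."""
    dataset = []
    for _ in range(n_episodes):
        obs = sim.reset(randomize=True)                      # jitter object poses for diversity
        trajectory = planner.plan(sim.robot_state(), task)   # e.g. "pick up the banana"
        episode = []
        for action in trajectory:
            episode.append({"image": obs["blended_rgb"],     # photorealistic blended frame
                            "action": action,
                            "instruction": task.language})
            obs = sim.step(action)
        if task.is_success(sim):                             # keep only successful rollouts
            dataset.extend(episode)
    return dataset
```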
Step 3: Visual Blending for Photorealism
This is perhaps the most critical component for effective Sim-to-Real transfer.
When you render a robot in a simulation, it often looks “fake” or distinct from the real-world background, creating a domain gap that confuses the AI when it tries to operate in the real world. RoLA solves this using a technique called Visual Blending.
Instead of rendering the whole scene from scratch, RoLA keeps the original pixels of the background image \(I_B\). It only renders the robot and the manipulated objects when they are physically in front of the background. This is determined using a Z-buffer (depth comparison).

The blending process ensures that the robot looks like it is truly inside the original photograph. Given the rendered frame \(\hat{I}_t\), the blended image \(I'_t\) is:

$$I'_t = M_t \odot \hat{I}_t + (1 - M_t) \odot I_B$$

Here, \(M_t\) is a binary mask. It is 1 (show the render) if the rendered depth \(D_t\) is closer to the camera than the background depth \(D_B\), and 0 (show the original photo) otherwise. This simple but effective trick maintains the photorealistic lighting and textures of the original scene while inserting the dynamic robot.
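Because the rule is a per-pixel depth test, the blend itself is only a couple of lines of array code. A minimal NumPy sketch, assuming the renderer reports an "infinite" depth wherever it draws nothing:

```python
import numpy as np


def blend_frame(render_rgb: np.ndarray,    # (H, W, 3) rendered robot + manipulated objects
                render_depth: np.ndarray,  # (H, W) D_t, render depth (np.inf where empty)
                bg_rgb: np.ndarray,        # (H, W, 3) I_B, original photo
                bg_depth: np.ndarray):     # (H, W) D_B, metric depth of the photo
    """Z-buffered visual blending: show the render only where it is closer than the background."""
    M = render_depth < bg_depth                          # binary mask M_t from the depth test
    return np.where(M[..., None], render_rgb, bg_rgb)    # M_t * render + (1 - M_t) * I_B
```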
Experiments and Results
The authors subjected RoLA to rigorous testing to answer several key questions. Can single-view reconstruction compete with multi-view? Can we learn from internet images?
Single-View vs. Multi-View
The researchers compared RoLA against a traditional multi-view reconstruction pipeline (which requires video scanning). They found that policies trained in RoLA’s single-image environments achieved a 72.2% success rate, comparable to the 75.5% of the multi-view approach. This suggests that the massive effort of scanning scenes from every angle might not be necessary for many manipulation tasks.

Comparison with Baselines
RoLA was also compared against other single-image methods like ACDC (retrieval-based) and RoboEngine (augmentation-based). RoLA significantly outperformed them.

As seen in the graphs above, RoLA (red line) achieves high success rates across tasks like putting broccoli in a bowl or carrots on a burner, whereas other methods struggle to generalize.
Real-World Deployment
The ultimate test is putting the code on a physical robot. The authors tested RoLA on a Franka Emika Panda robot and a Unitree humanoid.

The visual blending creates a strong correspondence between the simulation (top rows) and the real world (bottom rows). The robot successfully performed tasks like manipulation in cluttered scenes and pouring water, despite only seeing a single static image of the scene during setup.
Learning from the Internet
One of the most exciting applications is learning from “in-the-wild” internet images. The authors used RoLA to generate demonstrations for apple picking using random photos of apples found online.


By pre-training on this diverse internet data, the robot learned a “grasping prior”—a general understanding of how to grab an apple regardless of lighting, size, or background. When fine-tuned on a real robot with just a few examples, the system with the internet prior achieved an 80% success rate with 50 demos, compared to just 30% without the prior.
Scaling Up: Vision-Language-Action (VLA) Models
Finally, the authors demonstrated that RoLA can generate data at scale. They generated over 60,000 demonstrations to train a Vision-Language-Action model (similar to models like RT-2 or OpenVLA).

The model, trained purely on RoLA-generated data, showed strong generalization capabilities in simulation, successfully following language instructions like “put the green pepper beside the lemon.”

Conclusion and Implications
RoLA represents a significant step toward solving the robotics data bottleneck. By turning “any image” into a “robotic environment,” it unlocks the vast visual resources of the internet for embodied AI.
The framework’s core innovations—robust single-view scene recovery, physics-aware placement, and z-buffered visual blending—allow for the creation of photorealistic training data without expensive hardware. While limitations exist (such as the fidelity of physics simulations or occlusions in single images), the ability to generate effectively unlimited training data from passive photos suggests a future where robots can learn to interact with the world before they ever set foot in it.
As we move toward general-purpose robots, tools like RoLA will likely be the engines that feed these systems the massive, diverse diet of experiences they need to operate in our complex, unstructured world.