Introduction
Imagine you are teaching a robot to clean a table. You show it how to pick up a single cup and place it in a bin. Now, you scatter twenty cups across a long dining table and tell the robot to clean it up.
For a human, this is trivial. We intuitively understand the concept of “picking up” and “placing,” and we can apply that concept repeatedly, regardless of where the cups are or how many there are. For a robot, however, this is a nightmare.
Most modern robot learning methods, like imitation learning, are great at mimicking specific movements. But if you train them on one cup, they often fail when faced with twenty, or when the environment looks slightly different. They lack the ability to abstract the logic of the task from the pixels or poses of the demonstration.
Traditionally, the solution has been for human engineers to hand-write symbols and rules—telling the robot explicitly what “On(Cup, Table)” means and writing code for “PickUp(Cup).” But hand-coding these worlds is tedious, brittle, and unscalable.
What if a robot could look at a few raw demonstrations and invent its own logic?
In the paper “From Real World to Logic and Back,” researchers from Arizona State University and Brown University introduce LAMP (Learning Abstract Models for Planning). This framework allows a robot to autonomously discover symbolic concepts (like “holding” or “clear”) and high-level actions directly from continuous, unlabeled trajectory data.

As shown in Figure 1 above, the results are striking. A robot trained on picking up a single object can zero-shot generalize to complex scenarios involving up to 18 objects—tasks far beyond what it saw during training.
In this post, we will deconstruct how LAMP bridges the gap between the continuous “real world” of robot sensors and the discrete “logic” of planning algorithms.
The Background: The Two Worlds of Robotics
To understand why LAMP is significant, we need to understand the divide in robotics.
- The Continuous World (Motion Planning): Robots live in a continuous space. They have joints that rotate by degrees and grippers that move in millimeters. To move from point A to point B, they compute a trajectory—a path through this continuous space. This is computationally heavy and difficult to scale over long horizons (e.g., thousands of small moves).
- The Discrete World (Task Planning): To solve complex problems, we usually think in symbols. “Pick up the apple” is a symbolic action. It has preconditions (hand must be empty) and effects (hand is now full). This is the world of Task and Motion Planning (TAMP), often using languages like PDDL (Planning Domain Definition Language).
The problem is the Interface. Who defines the symbols? Who tells the robot that a specific set of joint angles counts as “Holding the Apple”? Historically, humans did.
Recent attempts to automate this using Large Language Models (LLMs) or Behavior Cloning (BC) have limitations. BC struggles to generalize to new horizons (it memorizes trajectories). LLMs often “hallucinate” plans or require pre-existing APIs to function.
LAMP proposes a third way: let the robot look at the geometry of the world and statistically determine what “symbols” matter.
The Core Method: How LAMP Works
LAMP stands for Learning Abstract Models for Planning. The goal is to take a set of raw demonstrations (unlabeled kinematic trajectories) and output a fully functioning symbolic world model that can be used by a standard planner.
The architecture is broken down into a pipeline that transforms raw data into high-level intelligence.

Let’s break this down into three steps: finding critical regions, inventing relations, and inventing actions.
Step 1: Discovering Relational Critical Regions (RCRs)
The researchers hypothesize that high-level actions (like “grasping”) are actually just transitions into and out of “salient regions” in the environment.
However, absolute position doesn’t matter much. A “grasp” looks the same whether the cup is on the left side of the table or the right. What matters is the relative pose between the gripper and the object.
LAMP analyzes the training data to find Relational Critical Regions (RCRs).
- Data Processing: The system converts raw trajectories into relative poses between pairs of objects (e.g., Gripper-relative-to-Cup, Cup-relative-to-Table).
- Clustering: It looks for regions in this relative space where the system spends a lot of time or where specific interactions happen reliably.
- GMMs: It fits Gaussian Mixture Models (GMMs) to these clusters.

In Figure 2 (b) and (c), you can see this visualized. The red shaded area represents a learned RCR. The robot realizes, “Hey, whenever I am in this specific position relative to the can, something interesting happens.” It doesn’t know the word “Grasp,” but it has mathematically defined the region of grasping.
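To make the idea concrete, here is a minimal sketch (not the authors' code) of how relative poses could be clustered with scikit-learn's `GaussianMixture`. The planar 2-D poses, the synthetic data, and the number of components are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch of RCR discovery: cluster gripper-relative-to-object poses with a GMM.
# Array shapes and component counts are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def relative_poses(gripper_xy: np.ndarray, object_xy: np.ndarray) -> np.ndarray:
    """Convert absolute 2-D positions into gripper-relative-to-object offsets."""
    return gripper_xy - object_xy

# Stand-in demonstration data: (T, 2) arrays of positions over time.
rng = np.random.default_rng(0)
gripper_xy = rng.normal(size=(500, 2))
object_xy = rng.normal(size=(500, 2))

rel = relative_poses(gripper_xy, object_xy)

# Fit a GMM over the relative-pose space; high-density components are
# candidate Relational Critical Regions (e.g., the "grasping" region).
gmm = GaussianMixture(n_components=3, random_state=0).fit(rel)
print(gmm.means_)    # centers of candidate RCRs
print(gmm.weights_)  # how much demonstration time falls in each region
```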
Step 2: Inventing Semantically Well-Founded Concepts
Once the robot has identified these critical regions (the GMMs), it needs to turn them into logic. Logic is binary: True or False.
LAMP introduces a Relation Inventor. For every RCR identified between two object types (e.g., gripper and can), LAMP creates a binary predicate.
- If the relative pose of the gripper and can falls inside the GMM’s high-probability zone, the relation is True.
- Otherwise, it is False.
This effectively discretizes the continuous world. The robot automatically generates a vocabulary. It might invent a predicate Relation_1(Gripper, Can) which we humans would interpret as Holding(Gripper, Can). It might invent Relation_2(Can, Table) which effectively means On(Can, Table).

Figure 3(a) illustrates this beautifully. The red dots show sampled poses that satisfy the invented relation. The robot has autonomously grounded the concept of “near” or “holding” into physical geometry.
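A minimal sketch of how an invented relation might be evaluated: a relative pose is declared True when its log-density under the fitted GMM exceeds a threshold. The threshold value and this particular truth test are assumptions for illustration, not the paper's exact criterion.

```python
# Sketch: turn a fitted GMM (an RCR) into a binary predicate over relative poses.
import numpy as np
from sklearn.mixture import GaussianMixture

def make_relation(gmm: GaussianMixture, log_density_threshold: float):
    """Return a function that tests whether a relative pose lies inside the RCR."""
    def relation(rel_pose: np.ndarray) -> bool:
        # score_samples returns the log-density of each pose under the GMM.
        return bool(gmm.score_samples(rel_pose.reshape(1, -1))[0] > log_density_threshold)
    return relation

# Example usage (assumes `gmm` was fitted on gripper-relative-to-can poses):
# holding = make_relation(gmm, log_density_threshold=-2.0)
# holding(np.array([0.01, 0.02]))  # True if the pose falls inside the RCR
```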
Step 3: Inventing High-Level Actions
Now the robot has a vocabulary (Relation_1, Relation_2, etc.). It can now look at its training demonstrations and translate them from a stream of continuous numbers into a sequence of abstract states.
The Abstraction Process:
- Lifted States: The robot looks at a demonstration. At time \(t=0\), Relation_2 is True. At time \(t=50\), Relation_2 becomes False and Relation_1 becomes True.
- Transition Identification: This change represents a high-level action. The robot records the “Preconditions” (what was true before) and the “Effects” (what changed).
- Action Clustering: It groups similar transitions together to define a standardized symbolic action.

Look at Figures 6 and 7. To a standard camera, these are different scenes (different cup colors, different locations). But to LAMP, the relational changes are identical. The Can-On-Table relation turns off, and the Gripper-Holding-Can relation turns on.
By aggregating these observations, LAMP writes its own PDDL action files, complete with parameters, preconditions, and effects (as seen in Figure 3b).
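Here is a rough sketch of how a symbolic operator could be read off from consecutive abstract states. Representing a state as the set of ground relations that are True, and the flat operator format, are simplifications of the lifted PDDL output the paper actually produces.

```python
# Sketch: derive a symbolic operator from two consecutive abstract states.
from dataclasses import dataclass

@dataclass
class Operator:
    preconditions: frozenset  # relations that held before the transition
    add_effects: frozenset    # relations that became True
    del_effects: frozenset    # relations that became False

def induce_operator(state_before: set, state_after: set) -> Operator:
    return Operator(
        preconditions=frozenset(state_before),
        add_effects=frozenset(state_after - state_before),
        del_effects=frozenset(state_before - state_after),
    )

# Example: the "pick" transition seen in the demonstrations.
s0 = {"Relation_2(Can, Table)"}    # can resting on the table
s1 = {"Relation_1(Gripper, Can)"}  # gripper now holding the can
pick = induce_operator(s0, s1)
print(pick.add_effects)  # frozenset({'Relation_1(Gripper, Can)'})
print(pick.del_effects)  # frozenset({'Relation_2(Can, Table)'})
```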
The Planning Loop
Once the model is learned, the robot no longer needs demonstrations. It has a symbolic world model. When given a new task (e.g., “Make sure all 10 cups satisfy Relation_2”), it uses a classical planner to search for a solution.
Because the planner operates on symbols, it is incredibly fast and can solve tasks with horizons much longer than the training data. The planner outputs a sequence of high-level actions, which LAMP then refines back into motor movements using the generative properties of the GMMs (sampling a pose from the “Critical Region”).
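The refinement step can be pictured as sampling from the GMM of the action's target region and handing that pose to a motion planner. In the sketch below, `symbolic_plan`, `action.rcr_gmm`, `action.obj`, `current_pose_of`, and `motion_plan` are hypothetical names standing in for the planner output and the robot's low-level stack.

```python
# Sketch of the logic-to-real-world direction: a symbolic action is refined
# into a concrete pose by sampling its Relational Critical Region's GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def refine_action(rcr_gmm: GaussianMixture, object_pose: np.ndarray) -> np.ndarray:
    """Sample a relative pose from the RCR and convert it to an absolute target."""
    rel_sample, _ = rcr_gmm.sample(1)    # generative use of the learned GMM
    return object_pose + rel_sample[0]   # absolute gripper target pose

# for action in symbolic_plan:           # output of the classical planner
#     target = refine_action(action.rcr_gmm, current_pose_of(action.obj))
#     motion_plan(target)                # hypothetical low-level motion planner
```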
Empirical Evaluation: Does it Work?
The researchers tested LAMP in five domains, including box packing, setting a dinner table, and building structures with Keva planks. The training data was sparse—at most 200 demonstrations of simple tasks (often with just 1-3 objects).
The test tasks, however, were massive.
Generalization Factor
The primary metric used was the Generalization Factor: the ratio of the number of objects in the test task vs. the training task.
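In symbols (notation ours, not the paper's):

\[
\text{Generalization Factor} = \frac{\#\,\text{objects in the test task}}{\#\,\text{objects in the training task}}
\]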
- Imitation Learning (BC): Usually has a generalization factor of 1. If you train on 3 blocks, it fails on 4.
- LAMP: Achieved factors up to 18x. In the “Cafe” domain, it trained on 1 object and solved scenarios with 18 objects.

In Figure 4(a), look at the blue bars versus the red line. The red line represents the “imitation learning zone.” LAMP shatters this ceiling.
Comparison with Baselines
The authors compared LAMP against:
- STAMP: A robot with human-engineered symbolic models. LAMP matched its performance, proving that the learned abstractions were as good as those designed by experts.
- Code-as-Policies (CoP): An LLM-based approach. CoP struggled significantly, solving under 35% of tasks, mostly because it lacks the precise geometric grounding that LAMP derives from the data.
Figure 4(b) shows that LAMP is also highly sample efficient. It begins to invent effective world models with as few as 40 successful demonstrations.
Conclusion and Implications
The “From Real World to Logic and Back” paper presents a significant step forward in robotic autonomy. By enabling robots to invent their own concepts, we remove one of the biggest bottlenecks in robotics: the human engineer who has to manually define what “picking up” means.
Key Takeaways:
- Abstraction is Key: Robots don’t need to memorize pixels; they need to understand relationships.
- Geometry \(\rightarrow\) Logic: The bridge between continuous sensing and discrete planning lies in “Relational Critical Regions.”
- Zero-Shot Generalization: Once a robot understands the logic of a task, it can scale that task to complexities far beyond its training data.
This work suggests a future where we can show a robot a simple example of a chore—like loading a single dish into a dishwasher—and the robot can autonomously construct a mental model robust enough to clean up an entire banquet hall. It moves us from robots that simply parrot our movements to robots that understand the structure of the world.