Imagine you have just finished a dinner party. The table is cluttered with stacked bowls, plates with cups resting on them, and scattered cutlery. Your task is to “bus” the table—move everything into a neat stack to be carried to the kitchen. To you, this is trivial. To a robot, it is a nightmare.

This is a long-horizon manipulation task. It requires reasoning about physics (you can’t stack a plate on a cup), geometry (do I have space to set this down?), and sequencing (I need to move the cup before I can move the plate beneath it).

Traditionally, roboticists have solved this using symbolic planning. They would turn the world into text: on(cup, plate), is_free(table). The robot plans a sequence of logical steps, then executes it. But this requires perfect knowledge of the world and a human to manually define every possible object relationship. On the other hand, modern end-to-end learning (like Diffusion Policies) tries to map pixels directly to robot movements. While powerful, these methods struggle to “think ahead” on tasks requiring many complex steps.

In this post, we are diving into SPOT: Search over Point cloud Object Transformations, a paper presented at CoRL 2025. This research proposes a hybrid approach: it keeps the raw visual data (point clouds) but applies rigorous search algorithms (A*) to find a solution. It is a system that plans by visualizing the consequences of its actions in 3D space.

Figure 1: SPOT: Search over Point cloud Object Transformations. Our method solves multi-object rearrangement tasks by planning directly in point cloud space, without using any privileged ground-truth information such as the ground-truth object states. Starting from a segmented point cloud, it performs A* search over object-wise SE(3) transformations until a goal configuration is found. The output plan is then executed in the real world.

The Core Problem: Discrete vs. Continuous

To understand why SPOT is significant, we need to understand the “Discretization Bottleneck.”

Old-school planners work in discrete spaces. You are either in Room A or Room B. A cup is either on the table or held. But the real world is continuous. If you place a cup 1cm too far to the left, it might collide with a bowl. If you stack a block slightly off-center, the tower falls.

Most planners solve this by simplifying the world into symbols, losing critical geometric details. SPOT asks a different question: Can we run a search algorithm directly over the continuous geometry of the world?

The authors formulate the problem as Multi-Object Rearrangement. Given a 3D scan (point cloud) of a messy scene, the robot must find a sequence of actions—specifically, rigid body transformations (SE(3))—to move objects into a goal configuration.
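Concretely, an action in this formulation is a rigid SE(3) transform applied to one object's points while the rest of the scene stays fixed. As a rough illustration (my own numpy sketch, not the paper's code, with a hypothetical `apply_se3` helper), here is what applying such a transform to an object's segmented points looks like:

```python
import numpy as np

def apply_se3(points, R, t):
    """Apply a rigid SE(3) transform (rotation R, translation t)
    to an (N, 3) array of object points."""
    return points @ R.T + t

# Rotate an object's points 90 degrees about the z-axis
# and slide it 0.1 m along x.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.1, 0.0, 0.0])

obj_points = np.array([[1.0, 0.0, 0.0]])
moved = apply_se3(obj_points, R, t)  # -> approximately [[0.1, 1.0, 0.0]]
```

A plan in SPOT is just a sequence of such (object, transform) pairs, which is why the planner can "imagine" future scenes by re-rendering the point cloud after each move.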

The SPOT Architecture

SPOT is a “hybrid” system. It uses Planning (A* Search) to organize the decision-making process, but it uses Deep Learning to generate the specific moves the planner considers.

It doesn’t scan the whole world and try every mathematical possibility—that would take forever. Instead, it uses learned “Suggesters” to propose likely actions, and then searches through those proposals to find a valid sequence.

Here is the high-level system overview:

Figure 3: System overview. Our method takes an RGB-D image, object names, and a goal function as input. It generates a segmented point cloud and uses A* search to plan over continuous actions (object transformations) until a goal-satisfying point cloud is found. Node expansion occurs by sampling an object to move from the object suggester, then sampling a corresponding transformation from the placement suggester. The model deviation estimator biases search toward physically plausible actions. The output is a sequence of object transformations, which is given to the robot for execution.

The system is composed of three main learned components that guide the A* search:

  1. Object Suggester: Decides what to move.
  2. Placement Suggester: Decides where to move it.
  3. Model Deviation Estimator (MDE): Decides if the move is physically safe.

Let’s break these down.

1. The Learned Suggesters

In a standard search algorithm, you expand a “node” (a state of the world) by listing every possible next move. In a continuous 3D world, there are infinite moves. SPOT limits this infinity using learned priors.

The Object Suggester

First, the robot looks at the point cloud and asks: “Which of these objects should I move next?”

It doesn’t make sense to move a plate if a heavy bowl is sitting on top of it. The Object Suggester is a neural network (PointNet++) that takes the current point cloud and predicts a probability distribution over the objects. It effectively learns preconditions from demonstration data—learning, for example, that top-most objects are usually moved first.
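A minimal sketch of this sampling step, with made-up logits standing in for the PointNet++ output (the real network consumes the segmented point cloud; everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_object(logits, rng):
    """Sample which object to move from a categorical distribution.
    In SPOT the logits would come from a PointNet++ network over the
    segmented scene; here they are supplied directly."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

# Hypothetical scores for [plate, bowl, cup];
# the cup (the top-most object) scores highest.
logits = np.array([0.1, 0.5, 2.0])
idx, probs = sample_object(logits, rng)
```

Sampling (rather than always taking the argmax) matters: it lets the A* search expand several "what should I move?" branches from the same node.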

The Placement Suggester

Once an object is selected (e.g., the robot decides to move the “Red Block”), it needs to know where to put it.

The Placement Suggester is a Conditional Variational Autoencoder (cVAE) based on TAXPose-D. It takes the object and the scene as input and samples potential 3D transformations. It might suggest: “Place the red block on top of the blue block” or “Place the red block on the table.”

Critically, this model handles multimodality. There isn’t just one right place to put an object. You could stack the cup, or you could set it aside to make room. The placement suggester outputs multiple options, allowing the planner to explore different branches.

Figure 2: Learned object and placement suggesters. Left: The object suggester predicts a probability distribution over which objects in the scene can be feasibly moved, given a point cloud observation of the scene. Right: Given an object and the scene point cloud, the placement suggester samples candidate transformations indicating where the object could be moved next.
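The multimodality comes from the cVAE's latent variable: each latent sample decodes to a different candidate placement. The toy sketch below (my own stand-in `toy_decode`, not TAXPose-D) shows the sampling pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_placements(decode, n_samples, rng, latent_dim=8):
    """Draw several candidate placements by sampling the cVAE latent.
    `decode` stands in for the trained TAXPose-D decoder, which maps a
    latent code (plus object/scene features) to an SE(3) transform."""
    return [decode(rng.standard_normal(latent_dim)) for _ in range(n_samples)]

# Toy decoder: different latents land the object in different spots.
def toy_decode(z):
    t = np.array([z[0], z[1], 0.0])  # translation only, for brevity
    return np.eye(3), t              # (rotation, translation) pair

candidates = sample_placements(toy_decode, n_samples=4, rng=rng)
```

Each of the four candidates becomes a separate branch for the planner to evaluate, which is exactly how "stack the cup" and "set the cup aside" can coexist in the search tree.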

The mathematical formulation for sampling an action involves drawing from both these distributions:

\[ a = (o, T), \qquad o \sim p_{\text{obj}}(\cdot \mid x), \quad T \sim p_{\text{place}}(\cdot \mid x, o) \]

where \(x\) is the current scene point cloud, \(o\) the selected object, and \(T\) the sampled SE(3) transformation.

2. The Model Deviation Estimator (MDE)

The suggesters are optimistic—they suggest where an object should go. But they don’t simulate physics. If the Placement Suggester says “put the plate here,” but there is already a cup there, a blind execution would result in a crash.

The authors introduce a Model Deviation Estimator (MDE). This model looks at a proposed action and predicts the “deviation”—how much the actual result will differ from the intended result.

  • Low Deviation: The robot moves a free-standing cup to an empty spot. The prediction matches reality. Safe.
  • High Deviation: The robot tries to move a plate that has a bowl on top of it. The bowl would fall off (physics), meaning the resulting point cloud would look very different from a simple rigid transformation. The MDE flags this as “High Deviation.”

The MDE acts as a “physics intuition” module, guiding the search away from unstable or dangerous actions.
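In pseudocode terms, the MDE is a learned regressor that scores a predicted next point cloud, and the planner prunes actions whose predicted deviation is too high. The sketch below uses a toy stand-in (`toy_mde`, a crude overlap heuristic I made up) in place of the trained network:

```python
import numpy as np

def deviation_cost(predicted_cloud, mde, threshold=0.5):
    """Score a proposed action with a model deviation estimator.
    `mde` stands in for the learned regressor: it returns the expected
    mismatch between the rigid-transform prediction and what would
    actually happen. Above the threshold, the action is pruned."""
    d = mde(predicted_cloud)
    return d, d > threshold

def toy_mde(cloud):
    # Toy stand-in: deviation rises as any two points nearly coincide,
    # a crude proxy for interpenetrating (physically implausible) objects.
    dists = np.linalg.norm(cloud[:, None, :] - cloud[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    return float(np.exp(-dists.min()))

safe = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # objects well apart
bad = np.array([[0.0, 0.0, 0.0], [0.001, 0.0, 0.0]])  # predicted overlap

d_safe, prune_safe = deviation_cost(safe, toy_mde)
d_bad, prune_bad = deviation_cost(bad, toy_mde)
```

The key design point survives the simplification: the MDE never needs to simulate physics, it only needs to predict when the rigid-transform assumption will break.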

3. A* Search over Point Clouds

Now comes the planning engine. SPOT uses A* search, a classic pathfinding algorithm.

  • Nodes: Each node in the search tree is a full 3D point cloud of the scene.
  • Edges: Actions (moving one object).
  • Goal: A point cloud configuration that satisfies a specific condition (e.g., “all blocks are stacked”).

A* needs a cost function (\(g(n)\)) to determine the best path. SPOT’s cost function is a weighted sum of several factors:

\[ g(n) = C_a(n) + w_c C_c(n) + w_d C_d(n) + w_p C_p(n) \]

Where:

  • \(C_a\): Action Cost (prefer shorter plans).
  • \(C_c\): Collision Cost (avoid intersecting objects).
  • \(C_d\): Deviation Cost (avoid physics violations, provided by the MDE).
  • \(C_p\): Probability Cost (prefer actions the learned suggesters are confident about).

This probability cost \(C_p\) is particularly interesting. It penalizes actions that the neural networks think are unlikely (low probability), effectively keeping the search grounded in the distribution of the training data:

\[ C_p(n) = -\log p_{\text{obj}}(o \mid x) - \log p_{\text{place}}(T \mid x, o) \]

By combining these costs, SPOT can explore the “tree” of possibilities. It might simulate moving a cup to the left, realize that blocks the plate, backtrack, and try moving the cup to the right instead.
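Putting the pieces together, the planner is essentially a best-first search in which each expansion draws candidate actions from the suggesters and prices them with the cost terms above. The sketch below is a toy domain of my own (hashable tuples stand in for point clouds, unit step costs stand in for \(g(n)\), and the heuristic is omitted, so it degenerates to uniform-cost search), but the skeleton mirrors the structure described here:

```python
import heapq
import itertools

def a_star(start, goal_fn, expand_fn, max_nodes=10_000):
    """Best-first search over scene states (stand-ins for point clouds).
    `expand_fn(state)` yields (next_state, step_cost) pairs, where the
    step cost bundles the action, collision, deviation, and probability
    terms of g(n). Returns the list of states from start to goal."""
    counter = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(0.0, next(counter), start, [start])]
    seen = set()
    while frontier and len(seen) < max_nodes:
        g, _, state, path = heapq.heappop(frontier)
        if goal_fn(state):
            return path
        if state in seen:
            continue
        seen.add(state)
        for nxt, cost in expand_fn(state):
            if nxt not in seen:
                heapq.heappush(frontier, (g + cost, next(counter), nxt, path + [nxt]))
    return None

# Toy domain: three objects on slots 0-3; the goal is all objects on
# slot 0, moving one object per step (each move costs 1).
def expand(state):
    for i, pos in enumerate(state):
        for slot in range(4):
            if slot != pos:
                nxt = list(state)
                nxt[i] = slot
                yield tuple(nxt), 1.0

plan = a_star((1, 2, 3), lambda s: s == (0, 0, 0), expand)
# `plan` holds start, two intermediate states, and the goal: 3 moves.
```

Backtracking falls out for free: a branch whose accumulated cost balloons (say, from a collision penalty) simply sinks in the priority queue, and the search resumes from a cheaper node elsewhere in the tree.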

Experimental Setup

The researchers tested SPOT in three distinct environments requiring complex sequential reasoning:

  1. Block Stacking (Simulation): Stacking blocks in a specific color order (Red -> Green -> Blue). If the blocks are initially stacked wrong, the robot must unstack them first.
  2. Constrained Packing (Simulation): Fitting objects into a tight shelf.
  3. Table Bussing (Real World): The “dinner party” scenario. Consolidating plates, bowls, and cups into a single stack.

Figure 4: Initial configurations. We use initial configurations of varying complexity for each task. In a 4-step table bussing task, the robot first has to remove the cup from the plate before stacking all objects. For block stacking, where the goal is to arrange blocks in the order red, green, blue from top to bottom, a 4-step task involves unstacking all blocks before restacking them in the correct order. The constrained packing task requires a robot to grasp and place four blocks inside a shelf.

To validate the real-world capabilities, they built a physical setup with a Franka Emika Panda arm and RGB-D cameras.

Figure 8: Real-world setup. Left: Our real-world setup of the table bussing environment with the RGB-D camera, Franka arm, and a set of objects. Right: All of the objects (plates, bowls, and cups) seen in the table bussing environment.

Results: Why Planning Matters

The results highlight a crucial limitation of pure learning-based approaches and the strength of search-based planning.

Beating End-to-End Learning

The authors compared SPOT against a 3D Diffusion Policy (DP3). DP3 is a state-of-the-art imitation learning method that tries to copy human actions directly.

In tasks requiring only 2 steps, DP3 performed decently. But as the horizon grew to 3 or 4 steps (e.g., needing to unstack two blocks before restacking), DP3 failed completely (0% success). It simply couldn’t track the long-term dependencies. SPOT, however, maintained high success rates even as complexity increased.

Table 2: Execution success rates on block stacking tasks. Complexity is measured by the length of a ground-truth optimal plan to reach the goal. Results are averaged over 4 seeds for 3D Diffusion Policy and 5 seeds for all other methods.

This trend is visualized clearly in the simulation success rates below. Notice how the baseline (green) crashes as the number of steps increases, while SPOT (blue) remains robust.

Figure 9: Task execution success rate as a function of task complexity in the simulation block stacking environment. Results are averaged over 5 seeds and show a 95% confidence interval.

Surpassing Human Demonstrators

One of the most fascinating results is that SPOT can be more efficient than the humans it learned from.

The “Placement Suggester” is trained on human demonstrations. However, because SPOT uses A* search, it can combine these learned movements in novel ways. In one table bussing example, a human demonstrator moved a cup off a plate, set it down, and then stacked the bowls.

SPOT realized it could move the cup directly to a final position that allowed the bowls to be stacked immediately, saving an entire step. Because it explores the tree of possibilities, it found a shortcut the human missed.

Figure 6: Efficient Path-finding. SPOT finds a more efficient plan than the video demonstration by first moving the cup beside the plate, creating space to stack the bowls directly. In contrast, the demonstration spends an extra step to remove the cup from the plate.

The power of this method is best seen in the Plan Graphs. These images visualize the different paths the A* algorithm explored.

In the example below, the robot considers different strategies. It looks at paths involving moving the cup first, or moving the bowl first. It prunes paths that result in high collision costs (illegal plans) and selects the path that satisfies the goal with the lowest cost. The ability to “imagine” these branching futures in point-cloud space is the core innovation of SPOT.

Figure 16: Plan graph example 3. The graph represents all of the plans and goals found for a 3-step table bussing configuration where a cup is placed inside a bowl on a plate, alongside another bowl placed separately on the table. We see in this graph that A* search finds multimodal paths of varying lengths to achieve the goal. The cup may be first moved to the table, then the bowls can be stacked, before moving the cup back inside the bowls. Or, the cup can be precisely moved to the side of the plate, which leaves an opening for the bowls to be stacked directly. The former plan achieves the goal in 3 steps, while the latter achieves the goal in just 2 steps. Illegal plans with high collision are not selected.

Conclusion

SPOT represents a shift in how we think about robot planning. It moves away from the rigid, hard-to-define world of symbolic states (cup_on_table = True) without falling into the trap of black-box policies that cannot reason over long horizons.

By treating raw point clouds as states and using learned models to guide a classic search algorithm, SPOT achieves the best of both worlds: the flexibility of learning and the reasoning reliability of planning.

Key Takeaways:

  • No Discretization: Planning happens in the continuous space of point clouds and 6-DoF transforms.
  • Hybrid Intelligence: Neural networks propose plausible moves; A* search vets them and stitches them into a coherent plan.
  • Physics Awareness: A Model Deviation Estimator prevents the planner from hallucinating physically impossible moves.
  • Long-Horizon Robustness: Unlike pure imitation learning, SPOT excels at tasks requiring 3, 4, or more sequential steps.

As robots move out of factories and into messy homes (bussing tables, packing groceries), approaches like SPOT that can reason about geometry and sequence simultaneously will be essential.