Introduction

In the context of human motor skills, placing an object is deceptively simple. Whether you are hanging a mug on a rack, sliding a book onto a shelf, or inserting a battery into a remote, your brain seamlessly processes the geometry of the objects, identifies valid target locations, and coordinates your hand to execute the move. You don’t need to be retrained from scratch every time you encounter a slightly different mug or a new type of shelf.

For robots, however, this task is notoriously difficult. The challenges are threefold:

  1. Geometric Diversity: Objects come in infinite shapes and sizes.
  2. Multimodality: There isn’t just one “correct” answer. A rack might have five hooks; a box might have space for twelve vials. A robust system must identify all valid possibilities.
  3. Precision: A stacking task might tolerate a centimeter of error, but inserting a peg into a hole requires sub-millimeter accuracy.

Traditional approaches have struggled to solve all three simultaneously. Some methods rely on massive amounts of specific demonstration data, making them brittle when faced with new objects. Others focus on “one-shot” imitation but fail to generalize to scenes with multiple placement options.

In this post, we will dive deep into AnyPlace, a new research paper presented at CoRL 2025. This work proposes a two-stage pipeline that solves the placement problem by combining the semantic reasoning of Vision-Language Models (VLMs) with the geometric precision of Diffusion Models. Perhaps most impressively, the system is trained entirely on synthetic data yet generalizes zero-shot to the real world.

Execution of the AnyPlace approach by the robot.

Background: The Challenges of Robotic Placement

To understand why AnyPlace is a significant contribution, we need to establish a few foundational concepts in robotic manipulation.

The Problem of Rearrangement

Robotic manipulation is often framed as an object rearrangement problem. Given a source object (like a vial) and a target object (like a rack), the goal is to predict the relative transformation—specifically a Rigid Body Transformation in \(SE(3)\)—that moves the source object from its initial pose to a stable, valid final pose.

\(SE(3)\) stands for the Special Euclidean group in 3 dimensions, which essentially means the combination of translation (moving in x, y, z) and rotation (orientation).
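To make the notation concrete, here is a minimal sketch (plain NumPy/SciPy, not the paper's code) of what an \(SE(3)\) pose looks like and how it moves a source object's point cloud. The 90-degree rotation and 10 cm offset are arbitrary illustrative values.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# A pose in SE(3) is a 3x3 rotation matrix R plus a 3D translation t,
# usually packed into a single 4x4 homogeneous matrix.
def make_se3(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hypothetical example: rotate the object 90 degrees about z, shift it 10 cm in x.
R = Rotation.from_euler("z", 90, degrees=True).as_matrix()
t = np.array([0.10, 0.0, 0.0])  # metres
T = make_se3(R, t)

# Applying T to an (N, 3) point cloud of the source object.
points = np.random.rand(100, 3)                    # stand-in for an object point cloud
points_h = np.hstack([points, np.ones((100, 1))])  # homogeneous coordinates
placed = (T @ points_h.T).T[:, :3]
```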

The Limitations of End-to-End Learning

A common approach is to feed a full image of the scene into a neural network and ask it to predict the final end-effector pose. While this can work for specific tasks, it suffers from poor generalization. If the model is trained on a specific brown shelf, it might fail when presented with a white shelf or a shelf with a slightly different shape. Furthermore, vision foundation models (VFMs) often lack the precise spatial reasoning required for fine-grained tasks like “peg-in-hole” insertion.

The Role of Synthetic Data

Gathering real-world robot data is expensive and slow. Ideally, we want to train robots in a simulator (where physics is fast and data is infinite) and have them work in the real world. This is known as Sim-to-Real transfer. AnyPlace leans heavily into this, generating a massive dataset of “shape mating” pairs (objects that fit together) to teach the robot general geometric compatibility.

The AnyPlace Method

The core insight of the AnyPlace paper is that we should not ask a single model to do everything. Instead, the authors decompose the problem into two distinct stages:

  1. Coarse Location Proposal (Semantic): Use a VLM to look at the whole image and figure out roughly where the action should happen.
  2. Fine Pose Prediction (Geometric): Use a specialized diffusion model to look only at that specific local region to figure out exactly how the object fits.

Overview of the AnyPlace placement pose prediction approach.

Let’s break down these two stages in detail.
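Before diving in, here is some hypothetical glue code that captures the coarse-to-fine split. The function and parameter names (`vlm_propose`, `diffusion_refine`, `crop_radius`) are illustrative placeholders, not the authors' API; the point is simply that Stage 2 only ever sees a local crop.

```python
from typing import Callable, List
import numpy as np

# Hypothetical glue code for the two-stage split; the actual model
# interfaces used in the paper are not shown here.
def predict_placements(
    rgbd_image: np.ndarray,
    base_cloud: np.ndarray,       # (N, 3) point cloud of the base object
    object_cloud: np.ndarray,     # (M, 3) point cloud of the object to place
    instruction: str,
    vlm_propose: Callable,        # Stage 1: image + text -> coarse 3D points
    diffusion_refine: Callable,   # Stage 2: two clouds -> 4x4 SE(3) pose
    crop_radius: float = 0.10,    # metres around each proposed point (assumed value)
) -> List[np.ndarray]:
    poses = []
    for point in vlm_propose(rgbd_image, instruction):
        # Keep only base-object points near the proposed location.
        mask = np.linalg.norm(base_cloud - point, axis=1) < crop_radius
        local_crop = base_cloud[mask]
        # The diffusion model only ever sees the local crop, never the full scene.
        poses.append(diffusion_refine(object_cloud, local_crop))
    return poses
```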

Stage 1: VLM-Guided Coarse Location Prediction

When you look at a drying rack, you immediately see the “spikes” where a cup could go. You don’t need to calculate precise geometry to know that the spikes are the interesting part.

AnyPlace replicates this intuition using a Vision-Language Model (specifically Molmo). The system takes an RGBD (Red-Green-Blue-Depth) image of the scene and a text instruction, such as “Put the vial into the vial rack.”

The pipeline works as follows:

  1. Instruction: The user provides a text prompt.
  2. Segmentation: A model like SAM-2 (Segment Anything Model) isolates the target object (the vial) and the base object (the rack).
  3. VLM Query: The VLM is prompted to identify placement points. For example, the prompt might be “point to the empty positions in the vialplate.”
  4. Cropping: This is the crucial step. Once the VLM outputs a 2D point on the image, the system crops the 3D point cloud to a small region around that point.

By cropping the data, the system discards the “global” noise. The subsequent model doesn’t need to care about the table, the lighting in the corner, or the other objects nearby. It only has to solve a local geometry problem: “How does object A fit into this specific chunk of object B?”
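Here is a minimal sketch of that cropping step under standard assumptions (a pinhole camera with intrinsics \(K\), and a base point cloud expressed in the camera frame); the 10 cm radius is an arbitrary illustrative value.

```python
import numpy as np

# Minimal sketch, assuming a pinhole camera model: lift the VLM's 2D pixel into 3D
# using the depth image, then keep only nearby points of the base point cloud.
def crop_around_vlm_point(pixel_uv, depth_image, K, base_cloud, radius=0.10):
    u, v = pixel_uv
    z = depth_image[v, u]                 # depth (metres) at that pixel
    x = (u - K[0, 2]) * z / K[0, 0]       # back-project with intrinsics K
    y = (v - K[1, 2]) * z / K[1, 1]
    center = np.array([x, y, z])          # 3D point in the camera frame

    # Keep base-object points (also in the camera frame) within `radius` of it.
    mask = np.linalg.norm(base_cloud - center, axis=1) < radius
    return base_cloud[mask]
```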

Additional Language Prompts and VLM Output Visualization in the Real-World Evaluation.

As shown in the image above, the VLM is surprisingly good at understanding diverse instructions, whether it’s finding the “base of the drawer,” the “tip of sticks,” or “empty slots” on a plate.

Stage 2: Fine-Grained Placement via Diffusion

Now that the system has a localized “region of interest” (a cropped point cloud), it needs to calculate the precise \(SE(3)\) transformation to place the object.

The authors employ a Diffusion Model for this task. In the context of image generation, diffusion models learn to denoise random static into a coherent image. Here, the concept is adapted for geometry. The model learns to “denoise” a random transformation into a correct placement pose.

The Architecture

The low-level model uses a Transformer-based architecture:

  • Input: Two point clouds—the target object \(\mathcal{P}_c\) and the cropped base region \(\mathcal{P}_{b\_crop}\).
  • Encoder: These point clouds are processed through self-attention layers (to understand their own shape) and cross-attention layers (to understand how the shapes relate to each other); see the sketch after this list.
  • Decoder: A diffusion decoder takes the encoded features and predicts a transformation update.
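The sketch below shows roughly what such a self-/cross-attention encoder looks like in PyTorch. The layer sizes, number of heads, and shared self-attention module are simplifications for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Illustrative encoder sketch: self-attention within each cloud, then
# cross-attention from the object to the cropped base region.
class CrossCloudEncoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj_feats: torch.Tensor, crop_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats:  (B, M, dim) per-point features of the object to place
        # crop_feats: (B, N, dim) per-point features of the cropped base region
        obj, _ = self.self_attn(obj_feats, obj_feats, obj_feats)      # object's own shape
        crop, _ = self.self_attn(crop_feats, crop_feats, crop_feats)  # crop's own shape (module shared for brevity)
        # Cross-attention: object queries attend to the crop, so the output
        # features encode how the two geometries relate.
        fused, _ = self.cross_attn(obj, crop, crop)
        return fused
```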

The Diffusion Process

The process is iterative. It starts with the object at a random rotation and translation relative to the crop.

\[ \mathcal{P}_c^{(0)} = T_{\mathrm{init}} \, \mathcal{P}_c, \]

The initialization logic is defined as:

\[ T_{\mathrm{init}} = (\mathbf{R}, \mathbf{t}), \quad \mathbf{R} \sim \mathcal{U}(\mathrm{SO}(3)), \quad \mathbf{t} \sim \mathcal{U}(\mathrm{bbox}(\mathcal{P}_{b\_crop})). \]

This essentially says: Start with a random rotation \(\mathbf{R}\) and a random translation \(\mathbf{t}\) somewhere inside the bounding box of the cropped region.

Over several timesteps, the model predicts a delta transformation (a small adjustment) to move the object closer to a valid fit. By repeating this “denoising” step, the object “slides” into place. Because diffusion models are probabilistic, they can represent multimodal distributions. If you run the diffusion process multiple times on the same input, it can discover different valid orientations or fits for the object.
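Here is a minimal sketch of this refinement loop, assuming a `predict_delta` callable that stands in for the trained diffusion decoder (its interface is invented for illustration, not the authors' code).

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Sketch of the iterative refinement loop; `predict_delta` is a stand-in for the
# trained diffusion decoder, which returns a small corrective 4x4 transform.
def refine_pose(object_cloud, crop_cloud, predict_delta, steps: int = 50):
    # Random initialisation: R ~ U(SO(3)), t uniform inside the crop's bounding box.
    T = np.eye(4)
    T[:3, :3] = Rotation.random().as_matrix()
    lo, hi = crop_cloud.min(axis=0), crop_cloud.max(axis=0)
    T[:3, 3] = np.random.uniform(lo, hi)

    for k in range(steps):
        # Transform the object by the current pose estimate.
        moved = object_cloud @ T[:3, :3].T + T[:3, 3]
        # The model predicts a small corrective update (the "denoising" step).
        delta = predict_delta(moved, crop_cloud, timestep=k)
        T = delta @ T
    return T
```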

Training on Synthetic Data

One of the standout features of AnyPlace is that this low-level model is trained entirely on synthetic data. The authors built a procedural generation pipeline in Blender and NVIDIA IsaacSim.

Dataset generation and the robot performing various placement tasks in simulation.

They generated thousands of objects and defined three primary task types:

  1. Insertion: Pegs in holes, vials in racks.
  2. Stacking: Boxes on pallets, items on shelves.
  3. Hanging: Mugs on racks, rings on hooks.

Because the data is generated programmatically, the “ground truth” placement poses are mathematically perfect. This allows the model to learn pure geometric relationships—convex fits into concave, flat rests on flat, hook loops over cylinder—which transfer exceptionally well to real-world objects.
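One reason the synthetic labels are "mathematically perfect" is that the supervision reduces to a relative transform between two simulator poses, both of which are known exactly. Below is a sketch of that general recipe (an assumption about the typical setup, not the authors' exact pipeline).

```python
import numpy as np

# With perfect world-frame poses read from the simulator, the ground-truth label
# for a placement is simply the object's pose expressed in the base object's frame.
def relative_placement_pose(T_world_base: np.ndarray, T_world_object: np.ndarray) -> np.ndarray:
    # Both inputs are 4x4 homogeneous poses; the result is the relative SE(3) transform.
    return np.linalg.inv(T_world_base) @ T_world_object
```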

Experiments and Results

The authors compared AnyPlace against several strong baselines:

  • NSM (Neural Shape Mating): A regression-based approach.
  • RPDiff: A similar diffusion-based approach but without the VLM-guided local cropping (it looks at the global scene).
  • AnyPlace-EBM: An Energy-Based Model variant of their own system.

Simulation Performance

The simulation results reveal the strengths of the coarse-to-fine architecture.

Table 1: Success rate (%) on synthetic dataset.

In Table 1, notice the “Peg Insertion” column. This is a high-precision task. The single-task AnyPlace model achieves a 30.95% success rate compared to 7.63% for NSM. While these numbers may seem low in absolute terms (insertion is very hard!), AnyPlace significantly outperforms the baselines. In “Vial Insertion” (multi-mode), AnyPlace dominates with 92.74% success versus 16-18% for the baselines.

Coverage and Multimodality

A key claim of the paper is that AnyPlace can find all valid placement spots, not just one.

Coverage comparison across different models in vial insertion and hanging tasks.

Figure 4 illustrates “Coverage,” which measures how many of the available slots were found across multiple trials.

  • The red line (AnyPlace) shoots up to near 1.0 (100% coverage) almost immediately.
  • The baselines (RPDiff, NSM) struggle to find diverse solutions, often collapsing to a single mode or failing entirely.
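To make the coverage metric concrete, here is a toy computation under an assumed definition: a slot counts as "found" if any trial's predicted position lands within a small tolerance of it.

```python
import numpy as np

# Toy coverage computation (assumed definition, not the paper's exact metric):
# over repeated trials, what fraction of the ground-truth slots were matched
# by at least one predicted placement?
def coverage(predicted_positions, slot_positions, tol: float = 0.02):
    found = set()
    for pred in predicted_positions:                          # one prediction per trial
        dists = np.linalg.norm(slot_positions - pred, axis=1)
        if dists.min() < tol:                                 # within 2 cm of a slot
            found.add(int(dists.argmin()))
    return len(found) / len(slot_positions)
```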

Precision

Why does cropping the point cloud matter? It forces the model to focus on the fine details needed for precision.

Errors on insertion tasks.

Figure 5 shows histograms of translation (distance) and rotation errors.

  • NSM (Blue) has a wider spread of errors.
  • AnyPlace (Green) has a tight cluster near zero error, particularly for distance. This sub-centimeter precision is what enables tasks like peg insertion to succeed.
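For reference, such histograms are built from standard pose-error metrics, which can be computed as follows (these definitions are common practice and assumed here, not taken from the paper's code).

```python
import numpy as np

# Standard translation and rotation errors between a predicted and a
# ground-truth 4x4 pose.
def pose_errors(T_pred: np.ndarray, T_gt: np.ndarray):
    # Translation error: Euclidean distance between predicted and true positions.
    trans_err = np.linalg.norm(T_pred[:3, 3] - T_gt[:3, 3])

    # Rotation error: geodesic angle of the relative rotation R_gt^T R_pred.
    R_rel = T_gt[:3, :3].T @ T_pred[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_angle))
    return trans_err, rot_err_deg
```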

Real-World Evaluation (Sim-to-Real)

The ultimate test is deploying the model on a physical robot with objects it has never seen before. The authors tested 16 distinct real-world tasks.

Coverage rate of the real robot executing three placement tasks.

Table 2 shows real-world coverage. AnyPlace achieves 80% coverage on vial insertion and ring hanging. The baselines (NSM and RPDiff) often fail completely (0%) or achieve very low coverage. This is a damning result for global scene methods—they simply cannot handle the noise and complexity of real-world point clouds as effectively as the local, cropped approach of AnyPlace.

Demonstration of robot executing diverse real-world tasks using AnyPlace predictions.

The qualitative results in Figure 6 are compelling. The robot is shown successfully stacking batteries, hanging rings, and inserting vials. The overlaid point clouds show the model predicting accurate “ghost” poses for the objects before execution.

Failure Analysis

No system is perfect. The paper honestly analyzes failure cases, usually stemming from poor sensing data.

Visualization of Object Point Clouds at Predicted Poses from Different Models.

In Figure A5, we see a comparison of predictions.

  • AnyPlace (Left column): Predicts a clean fit.
  • RPDiff (Third column): Often predicts a pose where the object intersects (clips through) the rack or base. This is a hallmark of a model that hasn’t learned precise local geometry boundaries.

Conclusion and Implications

AnyPlace presents a compelling argument for modular robotic design. Rather than throwing a massive neural network at a raw image and hoping for the best, the authors utilized the specific strengths of different AI architectures:

  1. VLMs act as the “Semantic Eye,” understanding instructions and identifying broad regions of interest.
  2. Diffusion Models act as the “Geometric Brain,” solving the complex 3D puzzle of fitting shapes together.

By using the VLM to crop the input for the diffusion model, the researchers turned a hard global problem into a manageable local one. This allowed them to train on cheap synthetic data and deploy effectively in the real world.

Key Takeaways for Students:

  • Divide and Conquer: Breaking a robotic task into high-level planning (Where?) and low-level control (How?) often yields better generalization than end-to-end learning.
  • Local > Global: For manipulation, what happens 50cm away from the target usually doesn’t matter. Cropping data helps models focus.
  • Synthetic Data is Viable: With enough randomization (domain randomization) and the right abstraction (point clouds), you can train effective policies without real-world demonstrations.

The future of robotic manipulation likely looks like AnyPlace: systems that can read a user’s intent from language, understand the physics of the world through simulation, and execute tasks with the precision of specialized geometric models.