Introduction

In the world of industrial automation, robots are incredibly proficient at repeating identical tasks in structured environments. A robot arm in an automotive factory can weld the same spot on a chassis millions of times with sub-millimeter precision. However, the moment you ask that same robot to assemble a piece of furniture—where parts might be scattered randomly, fit tightly, or require complex sequencing—the system often fails.

This creates a significant gap in robotics: the inability to handle long-horizon, contact-rich assembly tasks. “Long-horizon” means the robot must perform a sequence of many dependent steps (e.g., grasp leg A, reorient leg A, insert leg A, repeat for legs B, C, and D). “Contact-rich” means the parts interact physically through friction and force, requiring a level of finesse that pure position control cannot achieve.

Current approaches usually fall into one of two extremes. Reinforcement Learning (RL) is great at learning specific, contact-heavy skills like inserting a peg, but it struggles to learn long sequences because the search space becomes too vast. Conversely, Imitation Learning (IL) can learn sequences from human demonstrations, but it typically requires massive datasets and struggles with the high precision needed for tight assemblies.

In the paper “ARCH: Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly,” researchers from Stanford, MIT, and Autodesk Research propose a solution that bridges these gaps. They introduce ARCH (Adaptive Robotic Compositional Hierarchy), a framework that doesn’t just rely on one learning method. Instead, it hybridizes classical Motion Planning (MP) with Reinforcement Learning, all orchestrated by a high-level “brain” trained via Imitation Learning.

Figure 1 shows the experimental setup including a UR10e robot, the insertion board, and the specific assembly tasks like the Beam and Stool assembly.

As shown in Figure 1, the system is tested on complex tasks like assembling a multi-part beam or a 9-part stool. The results are compelling: ARCH achieves high success rates with high data efficiency, requiring only a handful of human demonstrations.

Background: The Assembly Challenge

To appreciate the innovation behind ARCH, we must first understand why this problem is so difficult.

The Limits of End-to-End Learning

In an ideal world, we would train a single neural network that takes camera images as input and outputs robot motor torques. This is “end-to-end” learning.

  • End-to-End Imitation Learning: A human operates the robot (teleoperation) to perform the task hundreds of times. The robot mimics this behavior. However, for precise assembly, collecting enough high-quality data is expensive and time-consuming.
  • End-to-End Reinforcement Learning: The robot tries to solve the task by trial and error. For a 30-second assembly task, the number of possible action sequences is astronomical, and the robot almost never stumbles into success by chance. The resulting reward signal is sparse, which makes learning nearly impossible.

The Hierarchical Approach

A common solution to the complexity of long-horizon tasks is Hierarchical Reinforcement Learning. The idea is to break the problem down: a “manager” (high-level policy) decides what to do (e.g., “pick up the wrench”), and a “worker” (low-level policy) decides how to do it (e.g., specific joint movements).

ARCH adopts this hierarchical structure but introduces a critical twist: the low-level workers are not all the same. Some are learned, and some are algorithmic.

Core Method: The ARCH Framework

The authors model the assembly task as a Parameterized-Action Markov Decision Process (PAMDP). In simple terms, this means the robot must decide which primitive action to take (e.g., Grasp, Insert) and what parameters to use for that action (e.g., the exact coordinate to grasp).
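To make the PAMDP idea concrete, here is a minimal sketch (hypothetical types and field names, not the authors’ code) of what one such parameterized action might look like:

```python
from dataclasses import dataclass
from enum import Enum, auto

import numpy as np


class Primitive(Enum):
    """Discrete skill types the high-level policy can choose from."""
    GRASP = auto()
    MOVE = auto()
    PLACE = auto()
    INSERT = auto()


@dataclass
class ParameterizedAction:
    """One PAMDP step: a discrete primitive plus its continuous parameters."""
    primitive: Primitive
    target_pose: np.ndarray  # e.g., xyz position + quaternion for the end-effector goal


# Example: "grasp the object at this pose"
action = ParameterizedAction(
    primitive=Primitive.GRASP,
    target_pose=np.array([0.42, -0.10, 0.05, 0.0, 0.0, 0.0, 1.0]),
)
```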

The architecture is split into three main components: the Hybrid Low-Level Primitive Library, the High-Level Policy, and the Perception System.

Figure 2 illustrates the ARCH framework. A high-level policy selects primitives based on observations. These primitives are executed by either MP or RL policies.

1. The Hybrid Low-Level Primitive Library

This is one of the paper’s most pragmatic contributions. The researchers recognized that we don’t need deep learning for everything. If a robot needs to move its hand through empty space, classical motion planners already handle that reliably. However, if a robot needs to press a tight peg into a hole, classical algorithms struggle with the complex physics of friction and jamming.

Therefore, ARCH uses a Hybrid Library:

Motion Planning (MP) Primitives

For tasks that occur in free space, ARCH uses classical motion planners.

  • GRASP: Uses a planner to move the gripper to a grasp pose and close it.
  • PLACE / MOVE: Uses a planner to move the end-effector to a target pose.

Reinforcement Learning (RL) Primitives

For tasks involving contact, ARCH uses policies trained via RL.

  • INSERT: This is the critical skill. Insertion requires reacting to forces. If the peg hits the rim of the hole, the robot must adjust. An RL policy is trained in simulation to handle this.
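Conceptually, the library is a dispatch from primitive type to the executor best suited to it. The sketch below (hypothetical `robot`, `motion_planner`, and `insertion_policy` interfaces, reusing the `Primitive` and `ParameterizedAction` types from the earlier snippet) illustrates the split:

```python
MAX_INSERT_STEPS = 200  # safety cap on the closed-loop insertion rollout


def execute_primitive(action, robot, motion_planner, insertion_policy):
    """Route a parameterized action to a classical planner or a learned policy."""
    if action.primitive in (Primitive.GRASP, Primitive.MOVE, Primitive.PLACE):
        # Free-space motion: plan a collision-free path, then track it.
        trajectory = motion_planner.plan(robot.current_pose(), action.target_pose)
        robot.follow(trajectory)
        if action.primitive == Primitive.GRASP:
            robot.close_gripper()
        elif action.primitive == Primitive.PLACE:
            robot.open_gripper()
    elif action.primitive == Primitive.INSERT:
        # Contact-rich insertion: run the RL policy closed-loop until success or timeout.
        for _ in range(MAX_INSERT_STEPS):
            obs = robot.observe()  # poses plus force-torque readings
            delta = insertion_policy(obs, goal=action.target_pose)
            robot.apply_action(delta)
            if robot.insertion_succeeded(action.target_pose):
                break
```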

The objective function for the RL-based insertion policy is defined as:

Equation for the RL policy objective function maximizing cumulative return based on distance to goal.

Here, the policy optimizes the trajectory to minimize the distance between the current pose and the goal pose (\(g\)), effectively learning how to wiggle and push the object into place. This policy is trained once in simulation and transferred to the real world (Sim-to-Real).
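A standard way to write this objective (a sketch; the paper’s exact reward shaping and observation space may differ) is to maximize the expected discounted return of a reward that penalizes distance to the goal pose:

\[
\max_{\pi}\;\mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
r(s_t, a_t) = -\,\lVert x_t - g \rVert,
\]

where \(x_t\) is the pose of the held object at time \(t\), \(g\) is the goal pose, and \(\gamma\) is the discount factor.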

2. The High-Level Policy

While the low-level primitives handle the physical execution, the High-Level Policy acts as the conductor. It decides the sequence of operations.

The authors use a Diffusion Transformer (DiT) for this role. Diffusion models are generative models (the same family behind image generators such as Stable Diffusion) that excel at capturing multi-modal distributions. In this context, the DiT takes the current state (robot joint angles + object positions) and predicts the best primitive to execute next.
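Tying the two levels together, the overall control loop can be sketched as follows (hypothetical interfaces; the real DiT involves observation encoding and iterative denoising, abstracted here into a single call):

```python
def run_assembly(high_level_policy, robot, motion_planner, insertion_policy, max_steps=20):
    """Alternate between high-level decisions and low-level primitive execution."""
    for _ in range(max_steps):
        obs = robot.observe()            # joint angles + estimated object poses
        action = high_level_policy(obs)  # returns a ParameterizedAction
        execute_primitive(action, robot, motion_planner, insertion_policy)
        if robot.assembly_complete():
            break
```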

Crucially, this high-level policy is trained via Imitation Learning from a very small number of human demonstrations. Because the “actions” are high-level commands (e.g., “Insert Object 1”) rather than low-level motor twitches, the search space is small, and the model learns very quickly.

The imitation learning objective is to find parameters \(\theta\) that maximize the likelihood of the expert’s choices:

Equation for the Imitation Learning objective maximizing the likelihood of expert demonstrations.
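In its standard maximum-likelihood form (a sketch; in practice a diffusion model is trained with a denoising loss that acts as a surrogate for this likelihood), the objective reads:

\[
\theta^{*} = \arg\max_{\theta}\;\mathbb{E}_{(s,\,a) \sim \mathcal{D}}\!\left[\log \pi_{\theta}(a \mid s)\right],
\]

where \(\mathcal{D}\) is the set of expert demonstrations, \(s\) is the observed state, and \(a\) is the expert’s high-level action (primitive plus parameters).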

3. Pose Estimation with CPPF++

For the high-level policy to make good decisions, it needs to know where the parts are. The authors employ a robust pose estimation pipeline, adapting CPPF++, a category-level pose estimation method based on point pair feature voting.

To ensure the high precision required for assembly, they add a post-optimization step. They compute the Chamfer Distance between the observed point cloud and the CAD model of the object, refining the estimated pose iteratively.

Equation for the one-way Chamfer distance used to refine pose estimation.
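A standard form of the one-way Chamfer distance from the observed point cloud \(P\) to the CAD model points \(Q\), transformed by the estimated pose \(T\), is (whether the paper uses the squared or unsquared variant is a detail of the original work):

\[
d_{\mathrm{CD}}(P, Q; T) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \lVert p - T(q) \rVert^{2},
\]

and the refinement step adjusts \(T\) to minimize this distance.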

This step aligns the known 3D model of the part with the camera data, correcting for small errors. As seen below, the system can accurately detect the position and orientation of various geometric shapes.

Figure 4 shows qualitative results of the pose estimation on different shapes like squares, ovals, and beams.

Experiments and Results

The team evaluated ARCH on both a real robot (UR10e) and in simulation (IsaacLab). They designed three challenging tasks:

  1. FMB Assembly (Real World): Assembling 9 different geometric objects into a board.
  2. 5-Part Beam Assembly (Simulation): Connecting legs and feet to a central beam.
  3. 9-Part Stool Assembly (Simulation): A complex, multi-stage furniture assembly.

Figure 3 shows the simulation environments for the Beam and Stool assembly tasks.

Comparative Performance

The researchers compared ARCH against several strong baselines, including End-to-End RL, End-to-End Diffusion Policy (IL), and other hierarchical methods like MimicPlay.

The results, summarized in Table 1, are stark.

Table 1 compares the Success Rate (SR) and SPL of ARCH against baselines. ARCH significantly outperforms others.

Key Takeaways from the Data:

  • End-to-End Failure: Both pure RL and pure IL (Diffusion Policy) failed completely (0% success) on the long-horizon tasks. The tasks were simply too long and complex for unstructured learning.
  • Hierarchical Superiority: While other hierarchical methods (MimicPlay, Luo et al.) managed some success (10-25%), they struggled with the contact-rich insertion phases.
  • ARCH’s Dominance: ARCH achieved success rates between 45% and 55%. While this isn’t 100%, it is surprisingly high given the complexity and the fact that it was trained with only 10 demonstrations.
  • Upper Bound: The “Human Oracle” represents the theoretical maximum if the high-level choices were perfect. ARCH performs close to this upper bound, suggesting the remaining failures are mostly due to physical execution (e.g., a gripper slip) rather than bad decision-making.

To rigorously measure efficiency, the authors used a metric called Success Weighted by Path Length (SPL):

Equation for Success Weighted by Path Length (SPL).
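SPL is a standard metric originally proposed for embodied navigation; for \(N\) evaluation episodes it is defined as:

\[
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\;\ell_i)},
\]

where \(S_i\) is a binary success indicator for episode \(i\), \(\ell_i\) is the optimal (shortest) number of steps, and \(p_i\) is the number of steps actually taken.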

This metric ensures the robot isn’t just succeeding by taking a chaotic, inefficient route. It rewards success achieved via the optimal number of steps. ARCH consistently scored highest on SPL.

Generalization to Unseen Objects

A major claim of the paper is the system’s ability to generalize. Can a robot trained to insert a hexagon also insert a star or a circle?

The answer appears to be yes. The high-level policy was trained on limited objects, yet it successfully manipulated “unseen” objects during testing.

Table 2 shows the success rates broken down by object type, demonstrating generalization to unseen shapes.

Table 2 highlights that while some shapes are harder than others (e.g., the “SquareCircle” is tricky to grasp), the system maintains a respectable success rate across novel geometries without retraining. This robustness comes from the generalized nature of the RL insertion primitive—pushing a peg into a hole feels similar regardless of the peg’s shape, provided the goal pose is correct.

Conclusion and Implications

The ARCH framework represents a pragmatic step forward in robotic assembly. By acknowledging that not everything needs to be learned, the authors created a system that is both data-efficient and precise.

Here is a summary of why this approach works:

  1. Hybridization is Efficient: Using Motion Planning for movement saves the neural networks from learning basic physics. Using RL for insertion solves the contact problem that Motion Planning can’t handle.
  2. Hierarchy Reduces Complexity: The high-level policy only has to choose “which skill” to use, not “which motor torque” to apply. This shrinks the search space, allowing the robot to learn from just 10 human demos.
  3. Simulation Training: The difficult contact skills (RL) are learned in simulation, safe from physical damage and time constraints, and then transferred to the real robot.

For students and researchers, ARCH illustrates the power of modular design. Rather than treating the robot as a black box to be trained end-to-end, decomposing the problem into planning, perception, and skill-specific learning yields a system capable of tackling the kind of long-horizon, messy tasks that have traditionally kept robots confined to the assembly line.

Hyperparameters

For those interested in reproducing the High-Level Policy (DiT), the authors provided their configuration:

Table 3 lists hyperparameters for the high-level policy, including hidden dimensions and block counts.

ARCH demonstrates that the future of robotic assembly likely isn’t pure AI or pure engineering—it’s a carefully architected combination of both.