Introduction
In the world of industrial automation, robots are incredibly proficient at repeating identical tasks in structured environments. A robot arm in an automotive factory can weld the same spot on a chassis millions of times with sub-millimeter precision. However, the moment you ask that same robot to assemble a piece of furniture—where parts might be scattered randomly, fit tightly, or require complex sequencing—the system often fails.
This creates a significant gap in robotics: the inability to handle long-horizon, contact-rich assembly tasks. “Long-horizon” means the robot must perform a sequence of many dependent steps (e.g., grasp leg A, reorient leg A, insert leg A, repeat for legs B, C, and D). “Contact-rich” implies that the parts interact physically with friction and force, requiring a finesse that pure position control cannot achieve.
Current approaches usually fall into one of two extremes. Reinforcement Learning (RL) is great at learning specific, contact-heavy skills like inserting a peg, but it struggles to learn long sequences because the search space becomes too vast. Conversely, Imitation Learning (IL) can learn sequences from human demonstrations, but it typically requires massive datasets and struggles with the high precision needed for tight assemblies.
In the paper “ARCH: Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly,” researchers from Stanford, MIT, and Autodesk Research propose a solution that bridges these gaps. They introduce ARCH (Adaptive Robotic Compositional Hierarchy), a framework that doesn’t just rely on one learning method. Instead, it hybridizes classical Motion Planning (MP) with Reinforcement Learning, all orchestrated by a high-level “brain” trained via Imitation Learning.

As shown in Figure 1, the system is tested on complex tasks like assembling a multi-part beam or a 9-part stool. The results are compelling: ARCH achieves high success rates with high data efficiency, requiring only a handful of human demonstrations.
Background: The Assembly Challenge
To appreciate the innovation behind ARCH, we must first understand why this problem is so difficult.
The Limits of End-to-End Learning
In an ideal world, we would train a single neural network that takes camera images as input and outputs robot motor torques. This is “end-to-end” learning.
- End-to-End Imitation Learning: A human operates the robot (teleoperation) to perform the task hundreds of times. The robot mimics this behavior. However, for precise assembly, collecting enough high-quality data is expensive and time-consuming.
- End-to-End Reinforcement Learning: The robot tries to solve the task by trial and error. For a 30-second assembly task, the space of possible action sequences is astronomical, so the robot almost never succeeds by chance. Rewards are therefore “sparse,” making learning nearly impossible.
The Hierarchical Approach
A common solution to the complexity of long-horizon tasks is Hierarchical Reinforcement Learning. The idea is to break the problem down: a “manager” (high-level policy) decides what to do (e.g., “pick up the wrench”), and a “worker” (low-level policy) decides how to do it (e.g., specific joint movements).
ARCH adopts this hierarchical structure but introduces a critical twist: the low-level workers are not all the same. Some are learned, and some are algorithmic.
Core Method: The ARCH Framework
The authors model the assembly task as a Parameterized-Action Markov Decision Process (PAMDP). In simple terms, this means the robot must decide which primitive action to take (e.g., Grasp, Insert) and what parameters to use for that action (e.g., the exact coordinate to grasp).
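To make the PAMDP concrete, here is a minimal Python sketch of a parameterized action; the primitive names mirror those described in the paper, while the exact field layout is an assumption for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto

import numpy as np


class Primitive(Enum):
    """Discrete primitive types (names follow the paper's skills)."""
    GRASP = auto()
    MOVE = auto()
    INSERT = auto()


@dataclass
class ParameterizedAction:
    """A PAMDP action: a discrete primitive plus its continuous parameters.

    For GRASP/MOVE the parameters could be a target end-effector pose;
    for INSERT, the goal pose of the part being inserted.
    """
    primitive: Primitive
    params: np.ndarray  # e.g., a 7-D pose: xyz position + quaternion


# Example: "grasp at this pose"
action = ParameterizedAction(
    primitive=Primitive.GRASP,
    params=np.array([0.4, 0.1, 0.02, 0.0, 1.0, 0.0, 0.0]),
)
```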
The architecture is split into three main components: the Hybrid Low-Level Primitive Library, the High-Level Policy, and the Perception System.

1. The Hybrid Low-Level Primitive Library
This is one of the paper’s most pragmatic contributions. The researchers recognized that we don’t need deep learning for everything. If a robot needs to move its hand through empty space, classical motion planners already handle that reliably. However, if a robot needs to press a tight-fitting peg into a hole, classical methods struggle with the complex physics of friction and jamming.
Therefore, ARCH uses a Hybrid Library:
Motion Planning (MP) Primitives
For tasks that occur in free space, ARCH uses classical motion planners.
- GRASP: Uses a planner to move the gripper to a grasp pose and close it.
- PLACE / MOVE: Uses a planner to move the end-effector to a target pose.
Reinforcement Learning (RL) Primitives
For tasks involving contact, ARCH uses policies trained via RL.
- INSERT: This is the critical skill. Insertion requires reacting to forces. If the peg hits the rim of the hole, the robot must adjust. An RL policy is trained in simulation to handle this.
The RL-based insertion policy optimizes an objective that rewards progress toward the goal pose; in sketch form (notation assumed here), it can be written as:
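$$
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} -\,d(x_t,\, g)\right]
$$

with \(x_t\) the pose of the held part at time \(t\) and \(d(\cdot,\cdot)\) a pose distance combining translation and rotation error.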

Here, the policy optimizes the trajectory to minimize the distance between the current pose and the goal pose (\(g\)), effectively learning how to wiggle and push the object into place. This policy is trained once in simulation and transferred to the real world (Sim-to-Real).
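To ground the idea, here is a minimal sketch of how such a hybrid library could be wired together. Everything here is illustrative: `plan_to_pose` stands in for a classical motion planner, `insertion_policy` for the RL policy trained in simulation, and the `robot` methods for a generic hardware interface.

```python
import numpy as np


def plan_to_pose(target_pose: np.ndarray) -> list:
    """Hypothetical classical motion planner (e.g., RRT-based): returns a
    collision-free trajectory to the target end-effector pose."""
    raise NotImplementedError


def insertion_policy(observation: np.ndarray) -> np.ndarray:
    """Hypothetical RL policy trained in simulation: maps pose-error and
    force observations to small corrective end-effector motions."""
    raise NotImplementedError


def grasp(robot, grasp_pose: np.ndarray) -> None:
    """MP primitive: move to the grasp pose, then close the gripper."""
    robot.execute(plan_to_pose(grasp_pose))
    robot.close_gripper()


def move(robot, target_pose: np.ndarray) -> None:
    """MP primitive: free-space transfer of the end-effector."""
    robot.execute(plan_to_pose(target_pose))


def insert(robot, goal_pose: np.ndarray, max_steps: int = 200) -> bool:
    """RL primitive: closed-loop, force-aware insertion toward goal_pose."""
    for _ in range(max_steps):
        obs = robot.observe(goal_pose)             # pose error + wrench
        robot.apply_delta(insertion_policy(obs))   # small compliant motion
        if robot.insertion_succeeded(goal_pose):
            return True
    return False
```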
2. The High-Level Policy
While the low-level primitives handle the physical execution, the High-Level Policy acts as the conductor. It decides the sequence of operations.
The authors use a Diffusion Transformer (DiT) for this role. Diffusion models are generative models (famous for image generation, as in Stable Diffusion) that excel at capturing multi-modal distributions. In this context, the DiT takes the current state (robot joint angles plus object poses) and predicts which primitive to execute next, along with its parameters.
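Putting the two levels together, the decision loop looks roughly like the sketch below, which reuses the primitive functions from the library sketch above; `robot` and `high_level_policy` are hypothetical interfaces standing in for the hardware API and the trained DiT.

```python
def run_assembly(robot, high_level_policy, max_primitives: int = 20) -> bool:
    """Illustrative outer loop: the high-level policy selects a primitive
    and its parameters; the hybrid low-level library executes it."""
    dispatch = {"GRASP": grasp, "MOVE": move, "INSERT": insert}
    for _ in range(max_primitives):
        state = robot.get_state()            # joint angles + object poses
        primitive, params = high_level_policy(state)
        dispatch[primitive](robot, params)   # execute the chosen skill
        if robot.assembly_complete():
            return True
    return False
```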
Crucially, this high-level policy is trained via Imitation Learning from a very small number of human demonstrations. Because the “actions” are high-level commands (e.g., “Insert Object 1”) rather than low-level motor twitches, the search space is small, and the model learns very quickly.
The imitation learning objective is to find parameters \(\theta\) that maximize the likelihood of the expert’s choices:
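In standard maximum-likelihood form, with \(k\) the expert’s chosen primitive and \(\omega\) its parameters (notation assumed here):

$$
\theta^{*} \;=\; \arg\max_{\theta}\; \mathbb{E}_{(s,\,k,\,\omega) \sim \mathcal{D}}\big[\log \pi_{\theta}(k, \omega \mid s)\big]
$$

where \(\mathcal{D}\) is the set of expert demonstrations.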

3. Pose Estimation with CPPF++
For the high-level policy to make good decisions, it needs to know where the parts are. The authors employ a robust pose estimation pipeline, adapting a method called CPPF++, a category-level pose estimator built on point pair feature voting.
To ensure the high precision required for assembly, they add a post-optimization step. They compute the Chamfer Distance between the observed point cloud and the CAD model of the object, refining the estimated pose iteratively.
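As a concrete illustration, the Chamfer Distance between the observed cloud and the CAD model’s points can be computed as in the sketch below (a minimal NumPy/SciPy version, assumed here for illustration); the refinement then searches for the small pose correction that minimizes it.

```python
import numpy as np
from scipy.spatial import cKDTree


def chamfer_distance(observed: np.ndarray, model: np.ndarray) -> float:
    """Symmetric Chamfer distance between two (N, 3) point clouds: for each
    point in one cloud, find its nearest neighbor in the other, then average
    the squared distances in both directions."""
    d_obs, _ = cKDTree(model).query(observed)   # observed -> nearest model point
    d_mod, _ = cKDTree(observed).query(model)   # model -> nearest observed point
    return float(np.mean(d_obs**2) + np.mean(d_mod**2))


# Refinement idea: transform the CAD points by a candidate pose and minimize
# chamfer_distance(observed_cloud, posed_cad_points) over small perturbations
# around the CPPF++ estimate.
```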

This step aligns the known 3D model of the part with the camera data, correcting for small errors. As seen below, the system can accurately detect the position and orientation of various geometric shapes.

Experiments and Results
The team evaluated ARCH on both a real robot (UR10e) and in simulation (IsaacLab). They designed three challenging tasks:
- FMB Assembly (Real World): Assembling 9 different geometric objects into a board.
- 5-Part Beam Assembly (Simulation): Connecting legs and feet to a central beam.
- 9-Part Stool Assembly (Simulation): A complex, multi-stage furniture assembly.

Comparative Performance
The researchers compared ARCH against several strong baselines, including End-to-End RL, End-to-End Diffusion Policy (IL), and other hierarchical methods like MimicPlay.
The results, summarized in Table 1, are stark.

Key Takeaways from the Data:
- End-to-End Failure: Both pure RL and pure IL (Diffusion Policy) failed completely (0% success) on the long-horizon tasks. The tasks were simply too long and complex for unstructured learning.
- Hierarchical Superiority: While other hierarchical methods (MimicPlay, Luo et al.) managed some success (10-25%), they struggled with the contact-rich insertion phases.
- ARCH’s Dominance: ARCH achieved success rates between 45% and 55%. While this isn’t 100%, it is surprisingly high given the complexity and the fact that it was trained with only 10 demonstrations.
- Upper Bound: The “Human Oracle” represents the theoretical maximum if the high-level choices were perfect. ARCH performs close to this upper bound, suggesting the remaining failures are mostly due to physical execution (e.g., a gripper slip) rather than bad decision-making.
To rigorously measure efficiency, the authors used a metric called Success Weighted by Path Length (SPL):
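In its standard form (Anderson et al., 2018), over \(N\) evaluation episodes:

$$
\mathrm{SPL} \;=\; \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\; \ell_i)}
$$

where \(S_i\) is a binary success indicator, \(\ell_i\) is the optimal (minimum-step) path length, and \(p_i\) is the length of the path the robot actually took.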

This metric ensures the robot isn’t just succeeding by taking a chaotic, inefficient route. It rewards success achieved via the optimal number of steps. ARCH consistently scored highest on SPL.
Generalization to Unseen Objects
A major claim of the paper is the system’s ability to generalize. Can a robot trained to insert a hexagon also insert a star or a circle?
The answer appears to be yes. The high-level policy was trained on limited objects, yet it successfully manipulated “unseen” objects during testing.

Table 2 highlights that while some shapes are harder than others (e.g., the “SquareCircle” is tricky to grasp), the system maintains a respectable success rate across novel geometries without retraining. This robustness comes from the generalized nature of the RL insertion primitive—pushing a peg into a hole feels similar regardless of the peg’s shape, provided the goal pose is correct.
Conclusion and Implications
The ARCH framework represents a pragmatic step forward in robotic assembly. By acknowledging that not everything needs to be learned, the authors created a system that is both data-efficient and precise.
Here is a summary of why this approach works:
- Hybridization is Efficient: Using Motion Planning for movement saves the neural networks from learning basic physics. Using RL for insertion solves the contact problem that Motion Planning can’t handle.
- Hierarchy Reduces Complexity: The high-level policy only has to choose “which skill” to use, not “which motor torque” to apply. This shrinks the search space, allowing the robot to learn from just 10 human demos.
- Simulation Training: The difficult contact skills (RL) are learned in simulation, safe from physical damage and time constraints, and then transferred to the real robot.
For students and researchers, ARCH illustrates the power of modular design. Rather than treating the robot as a black box to be trained end-to-end, decomposing the problem into planning, perception, and skill-specific learning yields a system capable of tackling the kind of long-horizon, messy tasks that have traditionally kept robots confined to the assembly line.
Hyperparameters
For those interested in reproducing the High-Level Policy (DiT), the authors provided their configuration:

ARCH demonstrates that the future of robotic assembly likely isn’t pure AI or pure engineering—it’s a carefully architected combination of both.