Imagine you are walking down a crowded hallway. A friend calls out to you: “Hey, catch up to Alice, but make sure you pass Bob on his left, and try to stay on the right side of the carpet.”

For a human, this instruction is complex but manageable. We instinctively break it down:

  1. Locate Alice (Goal).
  2. Locate Bob and plan a left-side pass (Constraint A).
  3. Identify the carpet and stay right (Constraint B).

We execute these behaviors simultaneously. However, for a robot, this is a nightmare of combinatorial complexity. Standard robotic learning approaches often try to learn a single policy for every possible scenario. But as the number of potential constraints grows—yielding, following, avoiding, passing left/right—the number of combinations explodes exponentially. Training a robot for every possible permutation of instructions is computationally infeasible.

In this post, we are doing a deep dive into ComposableNav, a research paper that proposes an elegant solution to this problem. Instead of learning everything at once, ComposableNav learns individual “atomic” skills (motion primitives) and uses the score-based structure of diffusion models to “add” them together at runtime.

Figure 1: Instruction-Following Navigation in Dynamic Environments. Given an instruction that specifies how a robot should interact with entities in the scene (a), ComposableNav leverages the composability of diffusion models (b) to compose motion primitives to generate instruction-following trajectories (c).

As shown in Figure 1 above, the system takes a complex instruction, breaks it down into primitives (like “Pass Left” and “Pass Right”), and fuses them into a single, smooth trajectory that satisfies every constraint at once.

The Core Problem: The Combinatorial Explosion

Robots operating in human spaces (social navigation) need to follow natural language instructions. Current approaches usually fall into two buckets:

  1. Rule-based systems: These are brittle and hard to scale.
  2. End-to-end Learning (RL/Imitation Learning): These require massive datasets.

The problem with learning-based approaches is the specification space. If a robot knows 5 different skills, an instruction might ask for any combination of them.

  • “Do A.”
  • “Do A and B.”
  • “Do A, C, and D.”

If you try to train a separate policy for “Doing A while Doing B,” you quickly run out of data and compute.

ComposableNav flips this paradigm. The core intuition is: Follow an instruction by independently satisfying its constituent specifications.

If we treat “Pass Left” and “Yield to Pedestrian” as independent probability distributions over possible trajectories, can we just combine those distributions? With standard neural networks, this is hard. But with Diffusion Models, it turns out to be surprisingly elegant.

Background: Diffusion Models and Robotics

To understand ComposableNav, we need a quick primer on Diffusion Models in the context of robotics.

Standard Diffusion Models (like those used in DALL-E or Midjourney) generate images by reversing a noise process. They start with random Gaussian noise and iteratively “denoise” it until structure emerges.

In robotics, we treat a trajectory (a sequence of future \((x, y)\) positions) as an image.

  1. Forward Process: Take a good trajectory and add noise until it’s random junk.
  2. Reverse Process: Train a network to predict the noise at each step, allowing us to turn random junk back into a feasible trajectory.
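The two steps above can be sketched in a few lines of NumPy. This is a minimal toy, not the paper's model: the noise schedule is arbitrary, and the "denoiser" is a stand-in that predicts zero noise (a trained network would predict the actual noise).

```python
# Toy DDPM forward/reverse pass over a 2-D trajectory treated like an "image".
# Schedule values and the stand-in network are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T = 50                               # diffusion steps
H = 16                               # trajectory horizon (waypoints)
betas = np.linspace(1e-4, 0.05, T)   # noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_noise(tau0, t):
    """Forward process: jump straight to step t in closed form."""
    eps = rng.standard_normal(tau0.shape)
    return np.sqrt(alpha_bars[t]) * tau0 + np.sqrt(1 - alpha_bars[t]) * eps

def fake_denoiser(tau_t, t):
    """Stand-in for the trained network f_theta; a real model predicts eps."""
    return np.zeros_like(tau_t)

def reverse_step(tau_t, t, eps_hat):
    """One DDPM reverse step using the predicted noise."""
    mean = (tau_t - (betas[t] / np.sqrt(1 - alpha_bars[t])) * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        mean += np.sqrt(betas[t]) * rng.standard_normal(tau_t.shape)
    return mean

tau0 = np.stack([np.linspace(0, 5, H), np.zeros(H)], axis=-1)  # straight-line path
tau = forward_noise(tau0, T - 1)      # 1. noise it into random junk
for t in reversed(range(T)):          # 2. denoise it back
    tau = reverse_step(tau, t, fake_denoiser(tau, t))
print(tau.shape)  # (16, 2)
```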

Mathematically, the reverse process at step \(t\) is modeled as a Gaussian distribution:

\[
p_{\theta}(\tau^{t-1} \mid \tau^{t}) = \mathcal{N}\!\left(\tau^{t-1};\; \frac{1}{\sqrt{\alpha_t}}\left(\tau^{t} - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, f_{\theta}(\tau^{t}, t)\right),\; \sigma_t^2 I\right) \tag{3}
\]

Here, \(f_{\theta}\) is the denoising network (the “brain”) that predicts the noise to remove.

The magic of diffusion models lies in their interpretation as Score-Based Models. The “score” is essentially the gradient of the log-probability of the data. It points in the direction of “higher probability” samples. This property is vital because sums of scores correspond to products of probabilities. This is the mathematical foundation that allows ComposableNav to combine skills.
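This identity is easy to verify numerically: since \(\log(p_1 p_2) = \log p_1 + \log p_2\), differentiating gives score-of-product = sum-of-scores. A quick check with two arbitrary 1-D Gaussians:

```python
# Numeric check: the score of a product of densities is the sum of the
# individual scores. Means and variances below are arbitrary.
import numpy as np

def logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def num_score(f, x, h=1e-5):
    # Central-difference approximation of d/dx log p(x)
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
s1 = num_score(lambda v: logpdf(v, 0.0, 1.0), x)
s2 = num_score(lambda v: logpdf(v, 2.0, 0.5), x)
s_prod = num_score(lambda v: logpdf(v, 0.0, 1.0) + logpdf(v, 2.0, 0.5), x)

print(abs(s_prod - (s1 + s2)) < 1e-6)  # True
```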

The ComposableNav Framework

The researchers propose a pipeline that consists of two distinct phases: Training (learning the skills) and Deployment (combining the skills).

Figure 2: ComposableNav Overview. Illustrating the two-stage training procedure (Pre-training and RL Fine-tuning) and the deployment phase where primitives are composed.

Phase 1: Training Motion Primitives

One of the biggest hurdles in robot learning is getting data. Collecting a dataset of a robot “passing someone on the left” is tedious. Collecting a dataset for “passing on the left while avoiding a puddle” is even harder.

ComposableNav circumvents this by using a Two-Stage Training Procedure that doesn’t require specific demonstrations for every skill.

Stage 1: Supervised Pre-training (The Base Model)

First, the researchers train a Base Diffusion Model. This model isn’t social; it just knows how to drive.

  • Data: They generate synthetic data of collision-free trajectories in dynamic environments using standard geometric planners (like Hybrid A*).
  • Goal: Learn the physics of the robot, smoothness, and basic obstacle avoidance.
  • Result: A model that generates valid paths, but doesn’t follow specific instructions like “yield” or “follow.”

Stage 2: RL Fine-tuning (The Skills)

This is where the system gets smart. Instead of needing demonstrations for “Following,” they use Reinforcement Learning (RL).

Why RL? Because verifying a behavior is easier than demonstrating it. It is very easy to write a code snippet that says: If the robot is behind the person, give +1 reward.
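As a concrete (hypothetical) example, a verification reward for a “Follow” primitive might look like the snippet below. The geometry test and distance threshold are my assumptions, not the paper's exact reward code.

```python
# Hedged sketch of a verification-style reward for a "follow" primitive:
# checking a behavior is much easier than demonstrating it.
import numpy as np

def follow_reward(robot_xy, person_xy, person_heading, max_dist=2.0):
    """+1 if the robot ends up behind the person and reasonably close."""
    offset = np.asarray(robot_xy) - np.asarray(person_xy)
    heading = np.array([np.cos(person_heading), np.sin(person_heading)])
    behind = offset @ heading < 0.0       # negative projection = behind
    close = np.linalg.norm(offset) <= max_dist
    return 1.0 if (behind and close) else 0.0

print(follow_reward((-1.0, 0.0), (0.0, 0.0), person_heading=0.0))  # 1.0
```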

The researchers use Denoising Diffusion Policy Optimization (DDPO). They treat the diffusion denoising steps as a Markov Decision Process (MDP).

\[
s_t \triangleq (\tau^{t}, t), \qquad
a_t \triangleq \tau^{t-1}, \qquad
\pi_{\theta}(a_t \mid s_t) \triangleq p_{\theta}(\tau^{t-1} \mid \tau^{t}), \qquad
r(s_t, a_t) \triangleq \begin{cases} r(\tau^{0}) & t = 0 \\ 0 & \text{otherwise} \end{cases}
\]

The “action” at each step is the denoised trajectory segment. The “reward” is given only at the end (\(t=0\)) based on whether the final trajectory satisfied the instruction (e.g., did it actually pass on the left?).

The objective function for this fine-tuning looks like this:

\[
\nabla_{\theta} \mathcal{J}(\theta) = \mathbb{E}\!\left[\, r(\tau^{0}) \sum_{t=1}^{T} \nabla_{\theta} \log p_{\theta}(\tau^{t-1} \mid \tau^{t}) \,\right] \tag{7}
\]

By maximizing this objective, the Base Model is “molded” into separate versions. One version becomes an expert at “Passing Left,” another at “Yielding,” another at “Following.” These are the Motion Primitives.
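The gradient above is a plain REINFORCE estimator over the denoising chain. The toy below (my construction, not the paper's trainer) shows the mechanics with a single scalar parameter shifting each reverse step's mean, rewarded only on the final sample:

```python
# Toy REINFORCE update in the spirit of DDPO: the "policy" is a chain of
# Gaussian reverse steps, the learning signal is a terminal-only reward.
# All hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0      # single denoiser parameter
sigma = 0.5      # per-step noise scale
T = 5            # denoising steps per rollout
lr = 0.05

for _ in range(1000):
    x = rng.standard_normal()          # "x_T": start from pure noise
    grad_logp = 0.0
    for t in range(T):
        noise = rng.standard_normal()
        x = x + theta + sigma * noise
        # d/dtheta log N(x_t; x_{t-1} + theta, sigma^2) = noise / sigma
        grad_logp += noise / sigma
    reward = 1.0 if x > 2.0 else 0.0   # terminal-only reward, as in DDPO
    theta += lr * grad_logp * reward   # REINFORCE ascent on E[reward]

# theta drifts positive so final samples land in the rewarded region
print(round(theta, 2))
```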

Phase 2: Deployment via Composition

Now we have a library of diffusion models, each specialized for a single task. At deployment time, the robot receives a complex instruction like:

“Overtake the pedestrian while staying on the right side of the road.”

This instruction \(I\) is decomposed (e.g., by an LLM) into specifications: \(\phi^{(1)}\) (Overtake) and \(\phi^{(2)}\) (Stay Right).
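For illustration, the decomposition step might emit a structure like the one below. The schema (primitive name plus target entity) is an assumption for this post, not the paper's actual interface.

```python
# Hypothetical output of the LLM-based instruction decomposition.
instruction = "Overtake the pedestrian while staying on the right side of the road."

specifications = [
    {"primitive": "overtake", "target": "pedestrian"},   # phi^(1)
    {"primitive": "stay_right", "target": "road"},       # phi^(2)
]

# Each specification selects one fine-tuned diffusion primitive at deployment.
print(len(specifications))  # 2
```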

The robot needs to generate a trajectory \(\tau\) that satisfies both. In probability terms, we want to sample from the joint conditional distribution. ComposableNav assumes these specifications are conditionally independent:

\[
p(\tau \mid \phi^{(1)}, \phi^{(2)}) \;\propto\; p(\tau) \prod_{i=1}^{2} \frac{p(\tau \mid \phi^{(i)})}{p(\tau)} \tag{9}
\]

This equation essentially says: The probability of a trajectory satisfying both conditions is proportional to the product of the probabilities of satisfying each condition individually, scaled by the base distribution.

The “Score” Trick: Because we are working with diffusion models, we are operating in the “score space” (gradients of log-probability). Multiplying probabilities corresponds to adding their logarithms, and therefore to adding their scores.

Therefore, to generate a trajectory that satisfies multiple primitives, we simply sum the noise predictions from the relevant diffusion models.

\[
\hat{\epsilon} = f_{\theta}(\tau^{t}, t) + \sum_{i=1}^{n} \left( f_{\theta}^{\phi^{(i)}}(\tau^{t}, t) - f_{\theta}(\tau^{t}, t) \right) \tag{10}
\]

In this equation:

  • \(\hat{\epsilon}\) is the final combined noise used to update the trajectory.
  • \(f_{\theta}\) is the noise predicted by the base (pre-trained) model.
  • \(f_{\theta}^{\phi^{(i)}}\) is the noise predicted by the \(i\)-th primitive (e.g., the “Pass Left” model).
  • The system iteratively denoises the trajectory, guided simultaneously by the “gradient” of every required skill.
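Putting the deployment phase together: at every reverse step, query each primitive, combine the predictions relative to the base model (sum of scores ⇔ product of probabilities), and take one denoising step. In this sketch the networks are stand-in closures and the schedule is an assumption:

```python
# Sketch of composed denoising with stand-in networks, not the paper's code.
import numpy as np

rng = np.random.default_rng(1)
T, H = 20, 16
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def base_model(tau, t):      # f_theta: base collision-free prior (stand-in)
    return np.zeros_like(tau)

def pass_left(tau, t):       # f_theta^{phi(1)} (stand-in)
    return 0.1 * np.ones_like(tau)

def stay_right(tau, t):      # f_theta^{phi(2)} (stand-in)
    return -0.1 * np.ones_like(tau)

primitives = [pass_left, stay_right]

tau = rng.standard_normal((H, 2))       # start from pure noise
for t in reversed(range(T)):
    eps_base = base_model(tau, t)
    # Sum of scores <=> product of probabilities: base + per-primitive deltas.
    eps_hat = eps_base + sum(p(tau, t) - eps_base for p in primitives)
    mean = (tau - (betas[t] / np.sqrt(1 - alpha_bars[t])) * eps_hat) / np.sqrt(alphas[t])
    tau = mean + (np.sqrt(betas[t]) * rng.standard_normal(tau.shape) if t > 0 else 0.0)

print(tau.shape)  # (16, 2)
```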

Experiments: Does it work?

The researchers evaluated ComposableNav in both rigorous simulations and on a physical Clearpath Jackal robot. They compared it against baseline methods that use Vision-Language Models (VLMs) directly for planning (like VLM-Social-Nav) or generate costmaps (BehAV).

1. Can it learn the skills without demos?

Yes. The two-stage training (Pre-training + RL) proved highly effective.

Table 2: Comparison showing how fine-tuning significantly improves success rates over the pre-trained base model.

As seen in Table 2, the Base (Pre-trained) model has low success rates for specific social tasks (like “Following” at 27%), which is expected—it wasn’t trained for them. After RL fine-tuning, the primitives achieve nearly 100% success on their specific tasks.

2. Can it handle unseen combinations?

This is the main claim of the paper. The researchers tested the robot with combinations of 2, 3, and 4 simultaneous constraints—combinations the model had never seen during training.

Figure 4: Bar plots showing success rates. ComposableNav maintains performance as complexity increases, while baselines collapse.

Figure 4 paints a stark picture.

  • 1 Specification: All methods do reasonably well.
  • 3 & 4 Specifications: The baselines collapse. VLM-based methods drop to 0% success because the reasoning burden becomes too high, or the generated costmaps conflict.
  • ComposableNav: Maintains robust performance (e.g., ~35% success on 4 specs vs <10% for baselines). While performance does drop with complexity, it remains far more viable than alternatives.

We can see the qualitative difference in the trajectories below. ComposableNav (blue lines) creates smooth paths that weave between constraints, whereas baselines often get stuck or violate safety rules.

Figure 7: Qualitative comparison showing ComposableNav succeeding where other methods fail to yield or avoid regions.

3. Real-World Deployment

The team deployed the system on a Clearpath Jackal robot equipped with LiDAR and cameras.

Figure 8: Robot Setup showing the Clearpath Jackal with Zed 2i camera and Ouster Lidar.

The real-world experiments highlighted a key benefit: Customizability. In one scenario (Figure 5 below), the robot is approaching a doorway with a human.

  • Instruction A: “Follow the person.” The robot politely tucks in behind (Green path).
  • Instruction B: “Enter before the person.” The robot accelerates to overtake (not shown in this specific crop, but supported by the model).

This capability allows users to tailor robot behavior to social preferences on the fly without retraining.

Figure 5: Real world scenarios showing how the robot adapts its trajectory (green) versus the default behavior (yellow) based on instructions.

The system also runs in real-time. By utilizing Model Predictive Control (MPC) to track the diffusion trajectory and a smart “re-planning” strategy that partially denoises the previous plan, they achieved low latency suitable for dynamic interaction.

Table 4: Inference latency table showing real-time performance.

As shown in Table 4, even with 4 composed primitives, the replanning step takes only 0.060 seconds, which is well within the 10Hz control loop of the robot.
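The warm-start replanning idea can be sketched as follows: rather than denoising from pure noise each control cycle, re-noise the previous plan to an intermediate step and denoise only from there. Step counts below are assumptions for illustration.

```python
# Sketch of partial-denoising replanning: forward-noise last cycle's plan
# to an intermediate step t_warm, then resume the reverse process from there.
import numpy as np

rng = np.random.default_rng(2)
T = 50
betas = np.linspace(1e-4, 0.05, T)
alpha_bars = np.cumprod(1.0 - betas)

def renoise(prev_plan, t_warm):
    """Forward-noise the previous plan to step t_warm (closed form)."""
    eps = rng.standard_normal(prev_plan.shape)
    return np.sqrt(alpha_bars[t_warm]) * prev_plan + np.sqrt(1 - alpha_bars[t_warm]) * eps

prev_plan = np.zeros((16, 2))   # plan from the previous control cycle (placeholder)
t_warm = 10                     # only 10 of the 50 steps need re-running
tau = renoise(prev_plan, t_warm)
# ...then run the (composed) reverse process for steps t_warm-1 .. 0 only,
# cutting per-cycle latency roughly in proportion to the skipped steps.
print(tau.shape)
```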

Conclusion and Implications

ComposableNav represents a significant step forward in Generalized Robotic Autonomy. The key takeaway is that we don’t need to teach robots every possible combination of tasks. By learning robust, independent skills (primitives) via Diffusion Models, we can treat behaviors like arithmetic: simple addition creates complex, nuanced results.

The approach solves two major bottlenecks:

  1. Data Scarcity: By using RL fine-tuning, we avoid the need for complex demonstration datasets for every skill.
  2. Combinatorial Complexity: By composing skills at inference time, we can handle exponentially many instructions with a linear number of models.

While limitations exist—performance does degrade with very high numbers of constraints, and the system currently relies on hand-crafted reward functions for the primitives—this “compositional” mindset is likely the future of adaptable, instruction-following robots.