Introduction

Imagine you are teaching a robot to wipe a whiteboard. You show it the motion, and it learns to mimic the trajectory perfectly. But then you notice a stubborn stain. You tell the robot, “Wipe harder,” or perhaps “Wipe faster.” In a typical robotic system, this is where things break down. Most imitation learning models treat a task as a static sequence: they learn what to do, but they struggle to adapt how they do it based on qualitative feedback during execution.

As Large Language Models (LLMs) enter the robotics space, we are getting better at giving high-level commands like “Pick up the apple.” However, bridging the gap between a high-level command and the fine-grained, continuous control of a robot’s muscles—its speed, force, and smoothness—remains a significant hurdle.

In the paper “Imitation Learning Based on Disentangled Representation Learning of Behavioral Characteristics,” researchers from Saitama University and the University of Tsukuba propose a novel solution. They have developed a motion generation model that allows a human to adjust a robot’s behavior in real-time using “modifier directives”—instructions like “strong,” “weak,” “fast,” or “slow.”

Figure 1: Overview of the proposed method. It generates the next motion trajectory based on a human-given modifier directive and the current robot state.

As shown in Figure 1, the system takes human input (depicted as sliders for physical and temporal parameters) and modifies the robot’s trajectory on the fly. This blog post will dive deep into how they achieved this by combining imitation learning with disentangled representation learning, allowing robots to separate the “what” of a task from the “how.”

The Challenge: The “Black Box” of Latent Spaces

To understand the innovation here, we first need to look at how modern robots learn from demonstration. A popular approach involves Variational Autoencoders (VAEs).

In a standard setup, a robot watches a human perform a task (like wiping a board). The VAE compresses this high-dimensional motion data into a low-dimensional “latent space”—a compact numerical summary of the movement. Later, the robot samples from this latent space to reconstruct the motion.
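To make this concrete, here is a minimal sketch of a trajectory VAE in PyTorch. The layer sizes and the flattened-trajectory input are illustrative assumptions, not the architecture used in the paper:

```python
import torch
import torch.nn as nn

class TrajectoryVAE(nn.Module):
    """Minimal VAE that compresses a flattened motion trajectory into a small latent vector."""
    def __init__(self, traj_dim=7 * 100, latent_dim=8):  # e.g. 7 joints x 100 timesteps, flattened
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(traj_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)        # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)    # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, traj_dim))

    def forward(self, traj):
        h = self.encoder(traj)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z, then reconstruct the motion from it.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar
```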

The problem is that this latent space is usually a “black box.” The variables inside are entangled. If you try to tweak one number in the latent space to make the robot move faster, it might also change how much force the robot applies or alter its trajectory entirely. The robot hasn’t learned “speed” as a separate concept; it just learned a messy combination of features.

The researchers address this by using Disentangled Representation Learning (DRL). Their goal is to organize the latent space so that specific variables correspond to specific physical concepts (like force or speed), while the rest handle the general motion.

The Proposed Method

The core of this research is a modified Conditional VAE (CVAE) that allows for online motion generation—meaning the robot can adjust its path while it is moving, not just before it starts.

1. Data Collection with “Weak” Labels

Training a robot to understand “force” usually requires expensive sensors and precise physics data. However, humans don’t think in Newtons; we think in qualitative terms.

The authors used a teleoperation system (bilateral control) where a human controls a follower robot. This allows them to record not just position, but also the force applied during contact tasks.

Data collection through bilateral control and weakly supervised labeling of modifier directives.

Crucially, they used weak supervision. Instead of labeling every millisecond of data with exact force measurements, the demonstrator simply labels a whole sequence with a qualitative tag:

  • Physical: Weak (\(0.0\)), Moderate (\(0.5\)), Strong (\(1.0\))
  • Temporal: Slow (\(0.0\)), Moderate (\(0.5\)), Fast (\(1.0\))

This approach makes data collection much easier and more intuitive, as it aligns with how humans naturally give instructions.
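As a rough illustration, a single demonstration episode might be stored like this, with one sequence-level tag per axis instead of per-timestep annotations (the field names and array shapes below are hypothetical):

```python
import numpy as np

# Map the qualitative tags to the scalar weak labels used in the paper.
PHYSICAL = {"weak": 0.0, "moderate": 0.5, "strong": 1.0}
TEMPORAL = {"slow": 0.0, "moderate": 0.5, "fast": 1.0}

# Hypothetical record for one demonstration episode: positions and forces are
# logged at every timestep, but the directive labels apply to the whole sequence.
episode = {
    "states":  np.zeros((300, 7)),   # placeholder: 300 timesteps x 7 joint angles
    "actions": np.zeros((300, 7)),   # placeholder: commanded follower trajectory
    "labels": {"physical": PHYSICAL["strong"], "temporal": TEMPORAL["slow"]},
}
```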

2. The Learning Architecture

The architecture is designed to split the latent space into two distinct parts. Let’s look at the system overview:

Figure 2: Overview of the offline learning architecture and online inference.

In Figure 2(A), you see the offline training process. The model takes the current robot state (\(s_t\)) and a sequence of future actions (\(A_t\)). It compresses this into a latent variable \(z\).

Here is the key innovation: The latent variable \(z\) is mathematically forced to split into two groups:

  1. Constrained Variables (\(z^c\)): These are specifically trained to represent the modifier directives (speed, force).
  2. Unconstrained Variables (\(z^u\)): These capture everything else needed to perform the task (trajectory, geometry) that isn’t related to the modifiers.

Formally, the latent vector is partitioned as

\[ \boldsymbol{z} = \{ z_{s}^{c}, z_{n}^{u} \} = \{ \underbrace{z_{1}^{c}, \dots, z_{S}^{c}}_{\text{constrained}}, \underbrace{z_{1}^{u}, \dots, z_{N}^{u}}_{\text{unconstrained}} \} \]

To enforce this separation, the constrained variables (\(z^c\)) are passed through a small classifier (an MLP) that tries to predict the weak label (e.g., “Strong” or “Fast”). If the variable \(z^c\) doesn’t contain enough information to predict the label, the model is penalized.
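Here is a minimal sketch of that split. The paper’s model is an LSTM-based CVAE, so the simple MLP encoder, dimensions, and names below are stand-ins used only to show how \(z\) is partitioned and how \(z^c\) feeds a label classifier:

```python
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """Encodes (state, action chunk) into z = [z_c | z_u] and predicts weak labels from z_c."""
    def __init__(self, input_dim=64, n_constrained=2, n_unconstrained=6):
        super().__init__()
        latent_dim = n_constrained + n_unconstrained
        self.n_constrained = n_constrained
        self.backbone = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        # Small head standing in for the paper's MLP classifier: one logit per directive.
        self.label_head = nn.Linear(n_constrained, n_constrained)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Split into constrained (directive) and unconstrained (everything else) variables.
        z_c, z_u = z[:, :self.n_constrained], z[:, self.n_constrained:]
        label_logits = self.label_head(z_c)   # used by the modifier loss below
        return z_c, z_u, label_logits, mu, logvar
```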

The Loss Function

The training involves balancing three different objectives (loss functions), as defined below:

\[ \mathcal{L}_{s} = \mathcal{L}_{bce}\big( y_{s}, \hat{y}_{s} \big) = -\big[ y_{s} \cdot \log(\sigma(\hat{y}_{s})) + (1 - y_{s}) \cdot \log(1 - \sigma(\hat{y}_{s})) \big] \]

\[ \mathcal{L}_{modi} = \sum_{s=1}^{S} \mathcal{L}_{s} \]

\[ \mathcal{L} = \alpha \mathcal{L}_{rec} + \beta \mathcal{L}_{kl} + \gamma \mathcal{L}_{modi} \]
  1. \(\mathcal{L}_{rec}\) (Reconstruction Loss): Ensures the robot can actually perform the motion.
  2. \(\mathcal{L}_{kl}\) (KL Divergence): Keeps the latent space organized and smooth (standard VAE practice).
  3. \(\mathcal{L}_{modi}\) (Modifier Loss): This is the new component. It forces the constrained part of the latent space to accurately predict the “weak label” (the human instruction).

By optimizing these three together, the model learns a latent space where you can manually tweak \(z^c\) to control speed or force without breaking the motion itself.
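Sketched in code, a training step could combine the three terms roughly as follows; the weighting values and helper names are assumptions, not the paper’s hyperparameters:

```python
import torch
import torch.nn.functional as F

def total_loss(pred_actions, true_actions, mu, logvar, label_logits, weak_labels,
               alpha=1.0, beta=0.01, gamma=0.1):
    """Combined objective: reconstruction + KL regularization + modifier prediction."""
    # 1. Reconstruction: reproduce the demonstrated action chunk.
    l_rec = F.mse_loss(pred_actions, true_actions)
    # 2. KL divergence against a standard normal prior (standard VAE term).
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # 3. Modifier loss: each constrained latent must predict its weak label (0.0 / 0.5 / 1.0),
    #    here via binary cross-entropy with soft targets.
    l_modi = F.binary_cross_entropy_with_logits(label_logits, weak_labels)
    return alpha * l_rec + beta * l_kl + gamma * l_modi
```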

3. Online Inference and Action Chunking

Once trained, we move to Figure 2(B) (Online Inference). Here, the human operator acts as the controller. By adjusting a slider, they feed a specific value into \(z^c\) (e.g., setting the “force” variable to 1.0). The unconstrained variables \(z^u\) are set to 0.
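A rough sketch of this inference step, continuing the hypothetical names from the snippets above (the decoder call signature is assumed, not taken from the paper):

```python
import torch

def generate_chunk(decoder, state, force_slider, speed_slider, n_unconstrained=6):
    """Decode an action chunk from human slider values; z_u is zeroed at inference time."""
    z_c = torch.tensor([[force_slider, speed_slider]])   # e.g. 1.0 = "strong", 0.0 = "slow"
    z_u = torch.zeros(1, n_unconstrained)                # unconstrained part set to 0
    z = torch.cat([z_c, z_u], dim=1)
    return decoder(z, state)                             # hypothetical decoder: chunk of future actions
```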

However, changing a control input in the middle of a motion can cause the robot to jerk or jitter. To solve this, the authors utilize Action Chunking. Instead of predicting just the next timestep, the model predicts a short sequence (a “chunk”) of future actions.

To make the transition smooth when the human changes the command, the robot calculates the next position as a weighted average of previous predictions.

\[ \hat{\pmb{s}}_{t+1} = \frac{\sum_{i=1}^{\min(t, W-1)} w_{i}\, \hat{\pmb{A}}_{t+1-i}[i]}{\sum_{i=1}^{\min(t, W-1)} w_{i}}, \qquad w_{i} = \frac{1}{\log(i+1)}. \]

The weighting function \(w_i = 1 / \log(i+1)\) gives the most weight to the most recent predictions while still blending in older ones, so a new instruction takes effect promptly without discarding the motion already underway. As we will see in the experiments, this specific weighting scheme is critical for stability.
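A minimal sketch of this weighted blending, assuming a simple buffer of past chunk predictions:

```python
import math

def ensemble_next_state(chunk_history, W=10):
    """Blend overlapping chunk predictions for timestep t+1 with weights w_i = 1/log(i+1).

    chunk_history[0] is the chunk predicted at the current timestep t,
    chunk_history[1] the chunk predicted at t-1, and so on. A chunk predicted
    i steps ago stores its estimate of timestep t+1 at offset i-1 (0-indexed).
    Assumes at least one chunk has been predicted.
    """
    num, den = 0.0, 0.0
    for i in range(1, min(len(chunk_history), W - 1) + 1):
        w = 1.0 / math.log(i + 1)               # larger weight for more recent chunks
        num += w * chunk_history[i - 1][i - 1]  # that chunk's prediction for timestep t+1
        den += w
    return num / den
```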

Experiments and Results

The authors evaluated their method on a Wiping Task and a Pick-and-Place Task using a CRANE-X7 robot arm.

The Wiping Task

The primary experiment involved wiping a whiteboard. This task is ideal because it requires continuous contact and has clear qualitative variations: speed (temporal) and pressure (physical).

Wiping task: the robot grasps the whiteboard eraser and uses its entire body and joints to wipe the whiteboard.

The researchers compared their proposed method (incorporating the disentangled loss) against a standard CVAE-LSTM and ACT (Action Chunking with Transformers) without the disentanglement constraints.

Did the Robot Listen?

The researchers measured Modifier Directive Errors (MDE)—a metric calculating how well the robot’s actual motion (force/speed) aligned with the commanded latent variable. A lower MDE means better control.

Table 2: Success rate and modifier directive conformity index in the Wiping task.

Table 2 shows the results.

  • CVAE-LSTM (Proposed): Achieved a 100% Task Success Rate. More importantly, look at the \(z_2\) (temporal) column. The MDE is 0.22, significantly lower than the standard CVAE-LSTM (1.06). This proves that the latent variable \(z_2\) successfully captured the concept of “speed.”
  • Entanglement: Ideally, changing the “speed” variable shouldn’t change the “force.” The proposed method improved this separation, though some entanglement remained (e.g., changing speed had a small effect on force, but much less than in the baseline models).

The Importance of Smoothing

One of the paper’s critical findings was about the stability of online adjustments. Because the human can change commands at any millisecond, the robot needs to blend these commands smoothly.

Table 3: Relationship between weight parameters and task success rate (TSR) in action chunking.

Table 3 compares different weighting strategies for Action Chunking.

  • No weight: The robot failed completely (0% success) because the motion oscillated wildly.
  • Proposed Weight (\(1/\log(i+1)\)): Achieved 100% success.

This demonstrates that simply having a disentangled latent space isn’t enough; you also need a robust mechanism to synthesize these commands into a smooth trajectory in real-time.

Limitations: The Pick-and-Place Task

To test generalization, the authors tried a Pick-and-Place task, involving spatial directives (placing an object Left, Center, or Right).

Top view of workspace for the Pick-and-Place task.

While the robot completed the tasks successfully, the disentanglement was less effective. The spatial instructions (Left/Center/Right) are discrete and symbolic, unlike the continuous nature of speed or force.

The results showed that their method works best for continuous, dynamic characteristics (like “faster” or “harder”) rather than discrete logic (like “go left”). This suggests that different types of instructions might need different architectural approaches.

Conclusion

This paper presents a significant step forward in making imitation learning more interactive and adaptive. By using Disentangled Representation Learning, the authors successfully created a system where:

  1. Qualitative instructions (“strong,” “fast”) are mapped to specific axes in the latent space.
  2. Weak supervision allows for easy training data collection without complex sensors.
  3. Online inference with weighted action chunking allows humans to guide the robot in real-time without causing instability.

The implications are exciting. Instead of retraining a robot for every slight variation of a task, we can teach it the general motion and then “tune” it like a radio—turning up the speed knob or turning down the force knob—until the behavior is exactly what we need. While challenges remain with discrete, symbolic instructions, this work paves the way for robots that are not just automated playback machines, but responsive collaborators.