Introduction
In the quest to build general-purpose robots, Behavior Cloning (BC) has been a dominant strategy. The premise is simple: collect a massive amount of data showing a human performing a task, and train a neural network to copy those actions given the current visual observation. With the rise of expressive models like Diffusion Policies and Transformers, robots have become remarkably good at mimicking complex movements.
However, there is a catch. Behavior Cloning tends to be a “pixel-perfect” copycat. If you train a robot to pick up a mug with the camera fixed at a 45-degree angle, and then you bump the camera slightly to the left, the policy often fails catastrophically. The robot hasn’t learned how to pick up the mug; it has learned how to react to a specific arrangement of pixels.
This fragility creates a massive bottleneck. We want to train robots on large, diverse datasets collected from different viewpoints, lighting conditions, and environments (what researchers call heterogeneous data). But when standard BC is applied to this messy data, it struggles to find the shared structure. It overfits to the specific visual details of individual demonstrations rather than learning the underlying behavior.
In this post, we will dive deep into a solution called CLASS (Contrastive Learning via Action Sequence Supervision). This method flips the script on representation learning. Instead of just looking at the images, CLASS asks: “regardless of what the camera sees, are the actions that follow similar?” By aligning visual representations based on the similarity of future action sequences, CLASS allows robots to ignore visual noise (like camera shifts) and focus on the task at hand.

The Problem: Visual Overfitting in Behavior Cloning
To understand why CLASS is necessary, we first need to understand the limitation of standard imitation learning.
In a typical setup, a robot observes a state \(o_t\) (usually an image) and predicts an action \(a_t\). Modern approaches often predict a sequence of future actions (action chunking) to ensure smooth motion. However, the connection between \(o_t\) and the action sequence is learned via direct supervision.
If you have two demonstrations of the same task—say, stacking a block—recorded from two different camera angles, the pixel inputs are completely different. A standard encoder (like a ResNet) might map these two images to very different places in the “latent space” (the internal numerical representation of the image). Consequently, the policy has to learn two separate mappings for the exact same physical behavior. This is inefficient and leads to overfitting. The model memorizes the background and camera angle rather than the relative position of the block.
We need a way to force the encoder to map these two visually distinct images to the same latent representation if they result in the same behavior.
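To make this concrete, here is a minimal PyTorch sketch of the standard setup: an image encoder followed by a head that regresses a chunk of future actions. The class name and dimensions (`ChunkedBCPolicy`, a 7-DoF action, a 16-step chunk) are illustrative choices, not the paper's:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ChunkedBCPolicy(nn.Module):
    """Vanilla behavior cloning: one image in, a chunk of future actions out."""
    def __init__(self, action_dim=7, chunk_len=16):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()            # expose the 512-d feature
        self.encoder = backbone
        self.head = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, action_dim * chunk_len),
        )
        self.action_dim, self.chunk_len = action_dim, chunk_len

    def forward(self, obs):                    # obs: (B, 3, H, W)
        z = self.encoder(obs)                  # (B, 512)
        return self.head(z).view(-1, self.chunk_len, self.action_dim)

# One supervised step: direct regression onto the demonstrated action chunk.
policy = ChunkedBCPolicy()
obs = torch.randn(8, 3, 224, 224)              # dummy image batch
demo_chunks = torch.randn(8, 16, 7)            # demonstrated action sequences
loss = nn.functional.mse_loss(policy(obs), demo_chunks)
loss.backward()
```

Note that the encoder receives gradients only from the action regression loss, so nothing stops it from latching onto viewpoint-specific pixels.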
The Solution: CLASS
CLASS tackles this by decoupling representation learning from policy learning. It pre-trains the visual encoder using Supervised Contrastive Learning, but with a twist: the supervision comes from the similarity of action sequences.
The intuition is straightforward: If two robot states lead to the same sequence of actions (e.g., “move forward 10cm, close gripper, lift”), those states are semantically equivalent, even if one is viewed from the top and the other from the side.
The CLASS framework consists of two main stages:
- Action Sequence Similarity: Calculating how similar two trajectories are using Dynamic Time Warping (DTW).
- Soft Contrastive Learning: Training the encoder to pull visually different but behaviorally similar states together.
Let’s break these down.
1. Measuring Behavior with Dynamic Time Warping (DTW)
We cannot simply compare actions step-by-step using Euclidean distance (L2 error). Human demonstrations are noisy. One demonstration might be slightly faster than another, or have a slight pause. A frame-by-frame comparison would penalize these temporal misalignments heavily, even if the overall motion is identical.
To solve this, the authors utilize Dynamic Time Warping (DTW). DTW is an algorithm that finds the optimal alignment between two time-series sequences. It allows for “warping” the time axis—stretching or compressing sections of the sequence—to find the best match.
Given two action sequences \(\mathbf{A}^m\) and \(\mathbf{A}^n\) from two different demonstrations, DTW computes the minimum cumulative distance between them:

\[
D_{\mathrm{DTW}}\!\left(\mathbf{A}^m, \mathbf{A}^n\right) = \min_{\Gamma} \sum_{(i,\, j) \in \Gamma} \left\| \mathbf{a}_i^m - \mathbf{a}_j^n \right\|
\]
Here, \(\Gamma\) represents the alignment path. By using DTW, CLASS can identify that a “slow pick-up” and a “fast pick-up” are fundamentally the same behavior.
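For concreteness, here is a minimal NumPy sketch of the DTW recurrence (the paper's exact per-step cost or normalization may differ):

```python
import numpy as np

def dtw_distance(A, B):
    """Minimum cumulative distance between two action sequences.

    A: (T1, action_dim), B: (T2, action_dim). Classic O(T1 * T2)
    dynamic program; the alignment path may stretch or compress time.
    """
    T1, T2 = len(A), len(B)
    # Pairwise L2 cost between individual actions.
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j],      # repeat a step of B (stretch A's clock)
                D[i, j - 1],      # repeat a step of A (stretch B's clock)
                D[i - 1, j - 1],  # advance both in lockstep
            )
    return D[T1, T2]

# A slow and a fast version of the same motion score as near-identical.
slow = np.repeat(np.linspace(0, 1, 10)[:, None], 2, axis=1)
fast = np.repeat(np.linspace(0, 1, 5)[:, None], 2, axis=1)
print(dtw_distance(slow, fast))   # small, despite the different lengths
```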
2. Soft Contrastive Learning
Once we have the DTW distances, we know which images represent similar behaviors. The next step is to train the neural network (the encoder) to recognize this.
Classic contrastive learning (like SimCLR) is binary: two augmented views of the same image are “positive” pairs (pull them together), and any other image is a “negative” pair (push them apart).
CLASS extends this to a Soft InfoNCE objective. It doesn’t just treat pairs as binary positives or negatives. Instead, it assigns a continuous weight \(w_{ij}\) to every pair of images based on how similar their future actions are.
First, the method defines a weight \(w_{ij}\) derived from the DTW distance \(d_{ij}\) between the action sequences of samples \(i\) and \(j\). If the DTW distance is low (the actions are very similar), the weight is high; if the distance is high, the weight falls to 0:

\[
w_{ij} = 1 - \mathrm{CDF}\!\left(d_{ij}\right)
\]
Here, CDF is the empirical cumulative distribution function of the pairwise DTW distances, so \(w_{ij}\) reflects how close a pair is relative to all other pairs. Effectively, this equation says: “The closer your action sequences are, the stronger the bond between your visual representations should be.”
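As a sketch, soft weights of this kind can be computed with an empirical CDF over a batch's pairwise DTW distances; the exact mapping used in the paper may differ:

```python
import numpy as np

def soft_weights(dtw_matrix):
    """Map pairwise DTW distances to soft weights in [0, 1].

    Uses the empirical CDF of all pairwise distances: smaller distance
    means larger weight, and the farthest pairs get weight 0.
    """
    d = dtw_matrix.flatten()
    # Empirical CDF: the fraction of pairs with distance <= d_ij.
    ranks = np.searchsorted(np.sort(d), dtw_matrix, side="right")
    return 1.0 - ranks / d.size

dists = np.array([[0.0, 0.2, 3.1],
                  [0.2, 0.0, 2.8],
                  [3.1, 2.8, 0.0]])
print(soft_weights(dists))   # large for similar pairs, 0 for the farthest
```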
This weight is then injected into the contrastive loss function:

\[
\mathcal{L} = -\sum_{i} \sum_{j \neq i} w_{ij} \log p_{ij},
\qquad
p_{ij} = \frac{\exp\left(S_{ij} / \tau\right)}{\sum_{k \neq i} \exp\left(S_{ik} / \tau\right)}
\]

where \(\tau\) is a temperature hyperparameter.
Let’s decode this loss function:
- The Goal: We want to maximize the probability \(p_{ij}\) for pairs whose action sequences are similar (high \(w_{ij}\)).
- \(S_{ij}\): This is the cosine similarity between the learned visual features of image \(i\) and image \(j\).
- \(w_{ij}\): This is our behavior-based weight.
- The Mechanism: The loss encourages the encoder to produce high similarity scores (\(S_{ij}\)) for pairs where the action similarity weight (\(w_{ij}\)) is high.
Crucially, this allows the model to learn a gradient of similarity. It creates a structured latent space where states are clustered not by how they look, but by what the robot is about to do.
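A compact PyTorch sketch of this soft InfoNCE objective follows; it assumes the weights come from the DTW step above, and the function name and temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def soft_info_nce(features, weights, temperature=0.1):
    """Soft contrastive loss: pull pair (i, j) together in proportion
    to its action-sequence similarity weight w_ij.

    features: (B, D) encoder outputs; weights: (B, B) DTW-based w_ij.
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature                       # S_ij / tau
    mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))        # drop self-pairs
    log_p = F.log_softmax(sim, dim=1)                 # log p_ij
    log_p = log_p.masked_fill(mask, 0.0)              # avoid 0 * (-inf)
    w = weights.masked_fill(mask, 0.0)
    # Weighted cross-entropy over all non-self pairs in the batch.
    return -(w * log_p).sum() / w.sum().clamp(min=1e-8)

feats = torch.randn(8, 128, requires_grad=True)       # encoder outputs
w = torch.rand(8, 8)                                   # DTW-derived weights
loss = soft_info_nce(feats, w)
loss.backward()
```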
The Full Architecture
Visualizing the training pipeline helps clarify how these pieces fit together.

- Inner Block (Representation Learning): This is the unique CLASS contribution. We take a batch of data (Anchors). We find other samples in the dataset that have similar action sequences (Positives) and those that don’t (Negatives). We calculate the DTW scores and apply the Soft InfoNCE loss to train the ResNet-18 Encoder.
- Outer Block (Policy Learning): Once the encoder is trained, its weights can be frozen (or fine-tuned with a low learning rate). We then attach a Policy Head (like a Diffusion Policy or MLP) that takes the robust features from the encoder and predicts actions using standard Behavior Cloning.
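Schematically, the two stages compose as in the sketch below, which reuses the encoder and loss from the earlier snippets; a real diffusion head would replace the stand-in MLP:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# --- Stage 1: pre-train the encoder with the soft contrastive loss ---
encoder = resnet18(weights=None)
encoder.fc = nn.Identity()
# ... optimize soft_info_nce(encoder(obs), dtw_weights) over the dataset ...

# --- Stage 2: freeze the encoder and train only the policy head ---
for p in encoder.parameters():
    p.requires_grad = False                 # frozen, behavior-aligned backbone

policy_head = nn.Sequential(                # stand-in for a diffusion head
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 16 * 7),                 # chunk_len * action_dim
)
optimizer = torch.optim.Adam(policy_head.parameters(), lr=1e-4)

obs = torch.randn(8, 3, 224, 224)
demo_chunks = torch.randn(8, 16 * 7)
with torch.no_grad():
    z = encoder(obs)                         # features are fixed inputs now
loss = nn.functional.mse_loss(policy_head(z), demo_chunks)
loss.backward()
optimizer.step()
```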
Experimental Setup: The Challenge of Heterogeneity
To prove that CLASS solves the visual generalization problem, the researchers set up experiments specifically designed to break standard Behavior Cloning. They introduced three heterogeneous data-collection setups:

- Dynamic Camera (Dyn-Cam): The camera moves randomly during the episode.
- Random Static Camera (Rand-Cam): The camera is placed in a random location for each episode but stays fixed.
- Random Object Color (Rand-Color): The objects change color every episode.
These variations mean that the pixel inputs for the “same” state look wildly different across demonstrations.
Key Results
The experiments spanned 5 simulation benchmarks (including Robomimic and LIBERO) and 3 real-world tasks. The results were compared against standard Behavior Cloning (MLP and Diffusion Policy) and other representation learning methods like R3M, MVP, and TCN.
1. Simulation Performance
The summary table below highlights the dramatic difference in performance, particularly in the “Dyn-Cam” (Dynamic Camera) columns.

Take a look at the Square task under Dyn-Cam:
- Standard Diffusion Policy (DP) achieved a 6% success rate. It completely failed to generalize to the moving camera.
- CLASS-DP (Diffusion Policy using CLASS representations) achieved a 62% success rate (parametric) and 67% (non-parametric).
This is an order-of-magnitude improvement. It confirms that while standard BC overfits to the camera angle, CLASS successfully extracts the invariant state information.
2. Visualizing the Latent Space
Why does it work so well? We can look “inside” the neural network using t-SNE, a technique for visualizing high-dimensional data.

On the left (Standard BC), the trajectories are scattered. The model is confused by the changing camera angles, mapping similar physical states to totally different areas of the embedding space. On the right (CLASS), we see clean, tight clusters. All the “approaching block” states are grouped together, and all the “lifting block” states are grouped together, regardless of the camera angle.
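To reproduce this kind of plot for your own encoder, a typical scikit-learn recipe looks like the following; the arrays here are random stand-ins for real features:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: (N, D) encoder features for frames from many episodes;
# progress: (N,) task progress in [0, 1], used only for coloring.
embeddings = np.random.randn(500, 128)
progress = np.random.rand(500)

xy = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
plt.scatter(xy[:, 0], xy[:, 1], c=progress, cmap="viridis", s=5)
plt.colorbar(label="task progress")
plt.title("t-SNE of encoder features")
plt.show()
```

A well-structured encoder should produce clusters ordered by task progress rather than by camera pose.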
We can further verify this by looking at Nearest Neighbors. If we query the dataset using an image from a specific angle, what does the model think is “similar”?

For Standard BC (above), the nearest neighbors are visually identical. It retrieves images with the same camera angle. This confirms it is overfitting to the viewpoint.

For CLASS (above), the nearest neighbors show the same physical state (the robot arm is in the same position relative to the block) but from different camera angles. This is exactly what we want: viewpoint invariance.
3. Real-World Robustness
Simulations are useful, but the real test is physical hardware. The researchers tested CLASS on three tasks: Stacking, Mug Hanging, and Toaster Loading. They moved the camera on a tripod to random positions between demonstrations.

The results mirrored the simulation. In the Mug-Hang task, standard ImageNet-pretrained Diffusion Policy achieved a 0% success rate in hanging the mug (it could grasp it, but failed the precision placement). CLASS-DP achieved a 55-65% success rate.
4. Why Design Choices Matter
The researchers performed ablation studies to justify their architectural decisions.

- Hard vs. Soft (Chart a): Using “Soft” weights (the gradient of similarity) is crucial. Reverting to binary “Hard” contrastive learning drops performance significantly.
- Sequence Length (Chart b): You need to look at a sequence of actions. If the window is too short, many distinct states share nearly identical next actions, so the similarity signal becomes ambiguous.
- Similarity Metric (Chart c): DTW outperforms simple L2 distance because it handles temporal misalignment in human demonstrations.
5. Efficiency and Scaling
A common concern with pre-training is the computational cost. Interestingly, CLASS speeds up the overall process.

Because the representation is learned first and is highly structured, the downstream policy learning (fine-tuning) converges much faster than training BC from scratch.
Furthermore, CLASS scales well. As you add more demonstrations, the quality of the positive pairs improves, leading to better representations.

Parametric vs. Non-Parametric Inference
One fascinating aspect of the CLASS paper is that it evaluates the representation in two ways:
- Parametric: Training a neural network (Policy Head) to predict actions.
- Non-Parametric (Retrieval): Not training a policy head at all. Instead, at test time, the robot encodes the current image, finds the nearest neighbors in the training set (using cosine similarity), and averages their action sequences.
The formula for this retrieval-based rollout is:

\[
\hat{\mathbf{A}}_t = \frac{1}{K} \sum_{k \in \mathcal{N}_K(z_t)} \mathbf{A}^{k}
\]

where \(z_t\) is the embedding of the current observation and \(\mathcal{N}_K(z_t)\) indexes the \(K\) training samples whose embeddings are most cosine-similar to \(z_t\).
Ideally, if the representation is perfect, you don’t even need a policy network—you can just copy the actions of the most similar training examples. The experiments showed that Rep-Only (Non-parametric) performance was surprisingly competitive, often matching or beating standard BC, with significantly faster inference times (since it avoids heavy diffusion steps).
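Here is a minimal sketch of that retrieval rollout, assuming the training embeddings and action chunks have been precomputed (all names illustrative):

```python
import numpy as np

def retrieval_policy(z_query, train_z, train_chunks, k=5):
    """Non-parametric rollout: average the action chunks of the k
    training states most cosine-similar to the current embedding.

    z_query: (D,); train_z: (N, D); train_chunks: (N, T, action_dim).
    """
    zq = z_query / np.linalg.norm(z_query)
    zt = train_z / np.linalg.norm(train_z, axis=1, keepdims=True)
    sims = zt @ zq                               # cosine similarities, (N,)
    topk = np.argsort(sims)[-k:]                 # k nearest neighbors
    return train_chunks[topk].mean(axis=0)       # averaged action chunk

# At test time: encode the current image, retrieve, execute the chunk.
train_z = np.random.randn(1000, 128)
train_chunks = np.random.randn(1000, 16, 7)
z_now = np.random.randn(128)
action_chunk = retrieval_policy(z_now, train_z, train_chunks)   # (16, 7)
```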

Conclusion
The CLASS framework highlights a critical insight for the future of robotic learning: Actions are the best supervisor for vision.
When we want robots to perform tasks, the specific pixel values of the background or the exact angle of the camera are irrelevant. What matters is the functional relationship between the robot and the object. By forcing the visual encoder to organize the world based on “what happens next” (action sequence similarity) rather than “what it looks like” (pixel similarity), CLASS creates representations that are robust to the chaos of the real world.
For students and researchers entering the field, CLASS serves as a prime example of how Self-Supervised Learning (SSL) and Contrastive Learning can be adapted from computer vision to robotics. It moves us away from brittle, single-view imitation toward robust, generalizable behavior learning.
As we move toward training robots on massive, “in-the-wild” datasets (like YouTube videos or diverse robot fleets), methods like CLASS that can extract shared behavior from heterogeneous data will be essential building blocks.