Breaking the Codebook Collapse: How STAR Teaches Robots Diverse Skills via Geometric Rotation
Imagine trying to teach a robot to cook a meal. You don’t dictate every single millisecond of motor movement required to crack an egg. Instead, you think in terms of “skills”: grasp the egg, crack it against the edge of the pan, pull the shell halves apart.
This hierarchical approach—breaking complex long-horizon tasks into discrete, reusable skills—is the holy grail of robotic manipulation. However, translating continuous robot actions into these discrete “words” or “tokens” is notoriously difficult. Current methods often suffer from codebook collapse, where the robot ignores most of the skills it could learn, relying on just a tiny subset of repetitive actions. Furthermore, even if the robot learns the skills, stringing them together smoothly (composition) is a separate headache.
In this post, we are diving deep into STAR (Skill Training with Augmented Rotation), a new framework presented at ICML 2025. STAR introduces a clever geometric trick to fix the codebook collapse problem and utilizes a causal transformer to chain these skills together for complex tasks like opening drawers or organizing objects.
The Problem: Why Robot Skills “Collapse”
To understand STAR, we first need to understand how modern robots learn “skills.” A popular method is Vector Quantization (VQ).
The Intuition Behind VQ
Think of VQ as a translator. The robot sees a continuous stream of messy action data (joint angles, velocities). VQ attempts to map these continuous movements to the nearest “prototype” movement in a fixed dictionary, known as a codebook.
If the codebook has 100 entries, ideally, the robot should learn 100 distinct skills (e.g., push left, lift up, twist knob).
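To make the lookup step concrete, here is a minimal sketch in numpy. The codebook size, dimensionality, and function name are illustrative, not taken from the paper:

```python
import numpy as np

# Illustrative toy codebook: 100 "prototype" action chunks, each 8-dimensional.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(100, 8))

def quantize(action_embedding: np.ndarray) -> tuple[int, np.ndarray]:
    """Snap a continuous action embedding to its nearest codebook entry."""
    distances = np.linalg.norm(codebook - action_embedding, axis=1)
    index = int(np.argmin(distances))   # which "skill word" was chosen
    return index, codebook[index]       # discrete id + prototype vector

index, prototype = quantize(rng.normal(size=8))
print(index, prototype.shape)
```

The discrete `index` is what the rest of the pipeline reasons over; the prototype vector is what gets decoded back into motion.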
The Reality: Codebook Collapse
In practice, neural networks are lazy. When training a VQ-VAE (Vector Quantized Variational Autoencoder), the model often finds that using just 3 or 4 of those 100 codes is “good enough” to minimize error early in training. It stops exploring the rest of the dictionary.
This is called codebook collapse. The result? A robot that lacks diversity. It might know how to “push forward” generally, but it lacks the nuance to “push gently” versus “shove hard” because all those variations collapsed into a single, crude skill code.
The culprit is often the way gradients are calculated. Because the “snapping” operation (rounding to the nearest code) is not differentiable, researchers use a hack called the Straight-Through Estimator (STE). STE essentially pretends the gradient is passed through unchanged. But this ignores the geometry of the embedding space, leading to poor updates and the eventual death of codebook diversity.
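In code, the STE is usually implemented with a detach trick: the forward pass uses the snapped code, but the backward pass behaves as if the encoder output had passed through untouched. A minimal PyTorch-style sketch (illustrative, not the paper’s code):

```python
import torch

def ste_quantize(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Straight-Through Estimator: forward uses the nearest code,
    backward copies the gradient straight onto the encoder output z_e."""
    distances = torch.cdist(z_e, codebook)   # (batch, num_codes)
    indices = distances.argmin(dim=-1)
    z_q = codebook[indices]                  # snapped ("quantized") vectors
    # Forward value is z_q; the gradient flows as if the output were z_e.
    return z_e + (z_q - z_e).detach()

codebook = torch.randn(16, 8)
z_e = torch.randn(4, 8, requires_grad=True)
out = ste_quantize(z_e, codebook)
out.sum().backward()
print(z_e.grad.shape)  # every encoder output receives the same pass-through treatment
```

Notice that the gradient a point receives has nothing to do with where that point sits relative to its code, which is exactly the geometric blindness STAR attacks.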
Enter STAR: A Two-Stage Solution
The STAR framework addresses these issues in two distinct stages:
- RaRSQ (Rotation-augmented Residual Skill Quantization): A better way to learn the dictionary of skills without collapsing.
- CST (Causal Skill Transformer): A better way to string those skills together to perform tasks.
Let’s look at the high-level architecture:

As shown in Figure 2, the top half focuses on learning the skills (RaRSQ), while the bottom half focuses on using them (CST).
Part 1: RaRSQ (Learning Diverse Skills)
The core innovation of STAR is how it handles the quantization process. The authors propose Rotation-augmented Residual Skill Quantization (RaRSQ).
The “Residual” Aspect
Instead of mapping an action to just one code, RaRSQ uses a hierarchy. It’s like describing a location:
- Level 1 (Coarse): “New York City”
- Level 2 (Fine): “Times Square”
Mathematically, the system calculates a residual. It finds the closest code for the action, subtracts that code from the action, and then tries to quantize the remainder (the residual) using a second codebook.
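A two-level residual quantizer can be sketched in a few lines. Here `codebooks` is a hypothetical list of per-level codebooks, and the nearest-neighbour lookup is the same as before:

```python
import torch

def residual_quantize(x: torch.Tensor, codebooks: list[torch.Tensor]):
    """Quantize x coarse-to-fine: each level encodes what the previous ones missed."""
    residual = x
    indices, reconstruction = [], torch.zeros_like(x)
    for codebook in codebooks:               # level 1 = coarse, level 2 = fine, ...
        idx = torch.cdist(residual, codebook).argmin(dim=-1)
        code = codebook[idx]
        indices.append(idx)
        reconstruction = reconstruction + code   # running sum of selected codes
        residual = residual - code               # what is still left to explain
    return indices, reconstruction

codebooks = [torch.randn(16, 8), torch.randn(16, 8)]
x = torch.randn(4, 8)
idx, x_hat = residual_quantize(x, codebooks)
print([i.tolist() for i in idx])
```

The action is ultimately represented by the sum of the selected codes, so the fine level only has to encode the small correction left over after the coarse level.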
The “Rotation” Trick
This is the mathematical heart of the paper. Standard VQ methods simply copy the gradient from the decoder to the encoder (STE). This ignores the angular relationship between the encoder’s output and the codebook vector.
STAR replaces this with a rotation-based gradient flow. Instead of just snapping the vector to the code, it calculates a rotation matrix \(\mathbf{R}\) that aligns the input residual \(\mathbf{r}\) with the codebook vector \(\mathbf{e}\).
Following the rotation-trick formulation, the update rule wraps the rotation (and a rescaling) inside a stop-gradient:

\[
\tilde{\mathbf{e}}_d \;=\; \operatorname{sg}\!\left[\frac{\lVert\mathbf{e}_d\rVert}{\lVert\mathbf{r}_{d-1}\rVert}\,\mathbf{R}_d\right]\mathbf{r}_{d-1}
\]
Here, sg stands for stop-gradient. The system rotates the residual to match the codebook vector’s direction.
Why does this matter?
Look at Figure 1 below. It compares Naive VQ (using STE) with RaRSQ.

In the Naive VQ (middle row), the gradients (arrows) are identical for all points assigned to a specific code. They all get pushed in the exact same direction. This causes the embeddings to clump together aggressively, leaving other codes unused.
In RaRSQ (bottom row), the gradients are derived from rotation. This preserves the relative angles. Points are pushed or pulled based on their geometric relationship to the code. This “fanning out” of gradients prevents the embeddings from collapsing into a single point, maintaining a rich, diverse space where many skills can coexist.
The Rotation Matrix
For those interested in the math, the rotation matrix \(\mathbf{R}_d\) is constructed to align the normalized residual \(\hat{\mathbf{r}}_{d-1}\) with the chosen code direction \(\hat{\mathbf{e}}_d\). Using the standard two-reflection (Householder) construction from the rotation trick,

\[
\mathbf{R}_d \;=\; \mathbf{I} \;-\; 2\,\boldsymbol{\lambda}\boldsymbol{\lambda}^{\top} \;+\; 2\,\hat{\mathbf{e}}_d\,\hat{\mathbf{r}}_{d-1}^{\top},
\qquad
\boldsymbol{\lambda} \;=\; \frac{\hat{\mathbf{r}}_{d-1} + \hat{\mathbf{e}}_d}{\lVert \hat{\mathbf{r}}_{d-1} + \hat{\mathbf{e}}_d \rVert},
\]

so the geometric structure is encoded directly into the gradient flow.
This mechanism ensures that when the network updates its weights during backpropagation, it respects the geometry of the latent space, forcing the model to distinguish between slightly different actions rather than lumping them together.
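To make the mechanism concrete, here is a small PyTorch sketch of one rotation-based quantization step. It follows the generic rotation-trick recipe (two Householder reflections plus a rescale, all frozen inside a stop-gradient); the function names and details are a sketch under those assumptions, not the paper’s exact implementation:

```python
import torch

def rotate_to(r: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    """Build a rotation matrix R such that R @ r_hat = e_hat (unit vectors)."""
    r_hat = r / r.norm()
    e_hat = e / e.norm()
    lam = r_hat + e_hat
    lam = lam / lam.norm()
    eye = torch.eye(r.shape[-1])
    # Two Householder reflections compose into a rotation aligning r_hat with e_hat.
    return eye - 2 * torch.outer(lam, lam) + 2 * torch.outer(e_hat, r_hat)

def rotation_quantize(r: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    """Forward value equals the code e; backward sees a frozen rotation of r,
    so each residual gets a gradient consistent with its own angle to the code."""
    with torch.no_grad():                 # the rotation and scale are treated as constants
        R = rotate_to(r, e)
        scale = e.norm() / r.norm()
    return scale * (R @ r)                # numerically equals e, but gradients flow through r

r = torch.randn(8, requires_grad=True)   # residual from the encoder
e = torch.randn(8)                       # selected codebook vector
q = rotation_quantize(r, e)
q.sum().backward()
print(torch.allclose(q, e, atol=1e-5), r.grad is not None)
```

The forward pass is numerically identical to plain snapping, which is why this change is “free” at inference time; only the backward geometry differs.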
Part 2: CST (Composing Skills)
Once RaRSQ has learned a diverse library of skills (represented as discrete codes), we need a brain to select them. This is the Causal Skill Transformer (CST).
Autoregressive Prediction
Robotic tasks are sequential. You must grasp before you lift. CST models this dependency explicitly. It uses a transformer to predict the next skill code based on the history of observations (images, robot state) and previous skills.
The probability of a sequence of skills factorizes autoregressively over time steps and residual levels:

\[
p\bigl(z_{1:T}\mid o\bigr) \;=\; \prod_{t=1}^{T}\prod_{d=1}^{D} p\!\left(z_t^{(d)} \,\middle|\, z_t^{(<d)},\, z_{<t},\, o_{\le t}\right)
\]
Because RaRSQ uses a residual hierarchy (Coarse \(\to\) Fine), CST predicts the Level 1 code first, and then conditions on that to predict the Level 2 code. This mirrors how humans plan: decide on the general motion first, then refine the details.
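A minimal sketch of that coarse-then-fine decision, with hypothetical classifier heads sitting on top of a transformer feature `h` (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

num_codes, hidden = 16, 256

# Hypothetical heads: one per residual level; the fine head also sees the coarse choice.
coarse_head = nn.Linear(hidden, num_codes)
coarse_embed = nn.Embedding(num_codes, hidden)
fine_head = nn.Linear(hidden, num_codes)

def predict_skill(h: torch.Tensor):
    """h: transformer feature summarizing observations and previously chosen skills."""
    coarse_logits = coarse_head(h)            # level 1: general motion
    coarse_idx = coarse_logits.argmax(dim=-1)
    h_fine = h + coarse_embed(coarse_idx)     # condition on the coarse choice
    fine_logits = fine_head(h_fine)           # level 2: refinement
    return coarse_idx, fine_logits.argmax(dim=-1)

h = torch.randn(4, hidden)
print(predict_skill(h))
```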
Action Refinement (The Offset)
Discrete codes are great for reasoning, but the real world is continuous. A discrete code might say “move hand to coordinate (10, 10),” but the object is actually at (10.1, 9.9). If the robot relies solely on the discrete code, it will be clumsy.
To solve this, CST includes a continuous offset head. It predicts a small, continuous adjustment \(\zeta_{\text{ref}}\) to add to the decoded action.
\[
\hat{\mathbf{a}} \;=\; \mathcal{D}(\mathbf{z}) \;+\; \zeta_{\text{ref}},
\]

where \(\mathcal{D}(\mathbf{z})\) is the action decoded from the selected skill codes.
This hybrid approach gives the robot the best of both worlds: the structured reasoning of discrete skills and the high-precision control of continuous regression.
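In code, the hybrid output is just a sum. `decoder` and `offset_head` below are hypothetical stand-ins for the RaRSQ action decoder and the continuous refinement head:

```python
import torch
import torch.nn as nn

hidden, action_dim = 256, 7

decoder = nn.Linear(hidden, action_dim)      # skill embedding -> prototype action
offset_head = nn.Linear(hidden, action_dim)  # transformer feature -> small correction

def act(skill_embedding: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    base_action = decoder(skill_embedding)   # coarse motion recovered from the discrete code
    offset = offset_head(h)                  # zeta_ref: small continuous adjustment
    return base_action + offset              # refined action sent to the robot

print(act(torch.randn(1, hidden), torch.randn(1, hidden)).shape)
```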
Training Objective
The CST is trained with a dual objective: it must accurately classify the correct skill code (using Cross-Entropy loss) and accurately predict the continuous action trajectory (using Mean Squared Error).
\[
\mathcal{L}_{\text{CST}} \;=\; \mathcal{L}_{\text{CE}} \;+\; \lambda\,\mathcal{L}_{\text{MSE}},
\]

where \(\lambda\) balances the continuous-refinement term against the skill-classification term.
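A hedged sketch of how such a combined loss could be computed, assuming per-level logits, ground-truth code indices, and a predicted/target action chunk (the weighting term `lam` is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def cst_loss(coarse_logits, fine_logits, coarse_target, fine_target,
             pred_action, target_action, lam: float = 1.0):
    """Cross-entropy on the discrete skill codes + MSE on the continuous actions."""
    ce = F.cross_entropy(coarse_logits, coarse_target) + \
         F.cross_entropy(fine_logits, fine_target)
    mse = F.mse_loss(pred_action, target_action)
    return ce + lam * mse   # lam trades classification accuracy against motor precision

loss = cst_loss(torch.randn(4, 16), torch.randn(4, 16),
                torch.randint(0, 16, (4,)), torch.randint(0, 16, (4,)),
                torch.randn(4, 7), torch.randn(4, 7))
print(loss.item())
```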
Experiments and Results
The researchers tested STAR on two major benchmarks: LIBERO (a suite of 130 language-conditioned tasks) and MetaWorld MT50.
Does it beat the state-of-the-art?
Yes, and by a significant margin.
In the MetaWorld MT50 benchmark, which consists of 50 distinct manipulation tasks, STAR achieved a 92.7% success rate, consistently outperforming strong baselines like Diffusion Policy and VQ-BeT.

The results on LIBERO were even more telling. LIBERO includes “Long-Horizon” tasks, which are notoriously difficult because errors accumulate over time.

Looking at Table 1 (above), notice the LIBERO-Long column.
- ACT: 44.0%
- VQ-BeT: 59.7%
- QueST: 69.1%
- STAR (Ours): 88.5%
That is a massive jump of roughly 19 percentage points over the previous state-of-the-art (QueST). This suggests that STAR’s ability to maintain diverse skills and refine them allows it to handle long sequences without losing track or precision.
Did it fix Codebook Collapse?
The authors claim RaRSQ prevents collapse. They back this up by analyzing how often each code in the codebook is actually used during training.

Figure 4 is the “smoking gun.”
- Naive VQ-VAE (Dark Blue): It heavily overuses Code 0 and Code 4, while ignoring almost half the dictionary (codes 8-13 are barely touched). This is classic collapse.
- RaRSQ (Light Blue): The usage is distributed across all 16 codes. The robot effectively learned 16 distinct skill primitives instead of just 7, giving it a much richer vocabulary of movements to draw from.
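This kind of diagnostic is easy to reproduce on any quantizer by counting how often each index is selected. In the sketch below the indices are randomly generated placeholders standing in for the codes chosen across a real training set:

```python
import torch

num_codes = 16
# Placeholder: in practice these would be the code indices selected over a dataset.
chosen_indices = torch.randint(0, num_codes, (10_000,))

usage = torch.bincount(chosen_indices, minlength=num_codes).float()
usage = usage / usage.sum()              # fraction of assignments per code

active = (usage > 0.01).sum().item()     # codes used more than 1% of the time
print(f"active codes: {active}/{num_codes}")
print((usage * 100).round().tolist())    # per-code usage in percent
```

A collapsed codebook shows a couple of spikes and a long tail of near-zero bars; a healthy one, like RaRSQ’s in Figure 4, is spread across all entries.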
Real-World Performance
Simulation is fine, but can it work on a real robot? The authors tested STAR on an ALOHA robot arm performing sequential tasks, such as “Pick the cube into the plate and pick the toy into the box.”

In these tests, STAR showed superior stability. For example, in a drawer opening/closing task, baseline methods often failed after the first step (opening), unable to transition to placing the object. STAR maintained coherence throughout the sequence.

As shown in Table 3, while other methods (VQ-BeT, QueST) struggled to complete the full sequence (0/10 and 1/10 success), STAR successfully completed the full “Open \(\to\) Place \(\to\) Close” chain 30% of the time, with high success rates on the individual stages.
Why This Matters
The STAR framework provides a significant step forward in Hierarchical Imitation Learning.
- Geometry Matters: It highlights that we cannot simply treat vector quantization as a “black box.” By respecting the geometry of the latent space (via rotation), we get better gradients and richer representations.
- Diversity is Key: A robot with a small vocabulary of skills is a clumsy robot. Preventing codebook collapse is essential for handling the variability of the real world.
- Composition + Refinement: Merely picking a skill isn’t enough; you need to refine it. The combination of the Causal Transformer with the continuous offset head ensures that high-level planning meets low-level motor control.
By solving the “collapse” of discrete representations, STAR allows robots to learn a true library of diverse skills, bringing us closer to general-purpose robotic helpers that can cook, clean, and organize without needing a hard-coded script for every movement.