Imagine you are making a cup of tea. When you reach for the kettle, your eyes lock onto the handle. When you pour the water, your focus shifts entirely to the spout and the water level in the cup. You essentially ignore the toaster, the fridge, and the pattern on the tablecloth. Your visual perception is task-aware and evolves as the task progresses.

Now, consider how robots typically “see.” In standard robotic manipulation pipelines, the robot takes an image of the scene and compresses it into a representation (a set of features). Crucially, this process is usually task-agnostic. The robot processes the toaster, the fridge, and the kettle with equal importance, regardless of whether it’s trying to boil water or make toast.

This disconnect creates a massive inefficiency. If a robot treats every pixel as equally important, it struggles to generalize and requires significantly more data to learn simple tasks.

In this post, we dive into HyperTASR, a new framework presented at CoRL 2025 that bridges this gap. By using Hypernetworks, this method allows robots to dynamically rewire their visual processing based on what they are doing and when they are doing it.

Concept comparison: Task-aware vs. Task-agnostic pipelines. Top-left shows HyperTASR modulating features based on the task. Bottom-left shows standard static processing.

The Problem: Static Eyes in a Dynamic World

To understand why HyperTASR is necessary, we first need to look at the standard recipe for “End-to-End” robotic learning.

A typical policy learning pipeline has two main parts:

  1. Representation Extractor (\(\phi\)): Takes raw sensory data (like an RGB-D image, \(o_t\)) and turns it into a compact vector or feature map (\(z_t\)).
  2. Policy Network (\(\pi\)): Takes that representation (\(z_t\)) and the task instruction (\(\tau\)) to predict the robot’s next action (\(a_t\)).
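To make the split concrete, here is a minimal sketch of that two-part pipeline, assuming plain MLPs and made-up dimensions (illustrative pseudocode, not any specific system's implementation). Note that the task \(\tau\) only ever reaches the policy, never the extractor:

```python
import torch
import torch.nn as nn

class RepresentationExtractor(nn.Module):
    """phi: turns a raw observation o_t into a compact feature vector z_t.
    The task never enters here, so the extraction is task-agnostic."""
    def __init__(self, obs_dim=512, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

    def forward(self, o_t):
        return self.net(o_t)  # z_t

class Policy(nn.Module):
    """pi: maps the representation z_t plus the task instruction tau to an action a_t."""
    def __init__(self, feat_dim=128, task_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + task_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, z_t, tau):
        return self.net(torch.cat([z_t, tau], dim=-1))  # a_t

z_t = RepresentationExtractor()(torch.randn(512))   # same features regardless of the task
a_t = Policy()(z_t, torch.randn(32))                # the task only matters here
```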

The issue is that the Representation Extractor is usually static. It learns a “one-size-fits-all” way to look at the world. Whether the robot is asked to “open the jar” or “pick up the fork,” the visual features extracted are identical until they hit the policy network.

This contradicts cognitive science. Humans employ dynamic perceptual adaptation. We filter out irrelevant noise based on our goals. HyperTASR (Hypernetwork-Driven Task-Aware Scene Representations) aims to bring this biological efficiency to robotics.

The Solution: HyperTASR

The core idea of HyperTASR is simple but powerful: don’t just condition the action on the task; condition the vision on the task.

Instead of a static encoding \(z_t = \phi(o_t)\), we want a dynamic encoding \(z_t = \phi(o_t, \tau)\), where the extraction process itself changes based on the task (\(\tau\)). Furthermore, because tasks change over time (localization \(\rightarrow\) grasping \(\rightarrow\) manipulating), the vision should also depend on the progression state of the task.

The Architecture

How do you make a neural network change its weights on the fly? The authors utilize Hypernetworks.

A Hypernetwork is a neural network that, instead of outputting a classification or an image, outputs the weights (parameters) for another neural network.
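A toy example makes the mechanism tangible. Below, a small MLP (the hypernetwork) emits the weight matrix and bias of a linear target layer, which is then applied with `torch.nn.functional.linear`; the sizes and the single generated layer are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperNet(nn.Module):
    """A network whose output is not a label or an image but the parameters
    (weight + bias) of another, smaller network, as a function of a context."""
    def __init__(self, ctx_dim, in_dim, out_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.gen = nn.Sequential(
            nn.Linear(ctx_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim * in_dim + out_dim),  # flattened weight matrix + bias vector
        )

    def forward(self, ctx):
        params = self.gen(ctx)
        weight = params[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
        bias = params[self.out_dim * self.in_dim:]
        return weight, bias

hyper = HyperNet(ctx_dim=16, in_dim=64, out_dim=32)
ctx = torch.randn(16)        # e.g. an embedding of the current task
x = torch.randn(64)          # input to the generated target layer
w, b = hyper(ctx)            # the layer's weights are produced, not stored
y = F.linear(x, w, b)        # apply the generated layer
```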

Figure 2: HyperTASR framework overview. The Hypernetwork (green) takes task info and generates parameters for the Representation Transformation (blue dashed box).

As shown in Figure 2 above, the HyperTASR pipeline works as follows:

  1. Input: The system takes an observation (\(o_t\)).
  2. Base Representation: A standard encoder extracts a generic representation (\(z_t\)).
  3. The Hypernetwork (\(\mathcal{H}\)): This is the brain of the operation. It takes the Task Information (\(\tau\)) and the Task Progression State (\(\psi_t\)). It outputs specific parameters (\(\theta\)).
  4. Transformation: These parameters (\(\theta\)) are loaded into a lightweight “Transformation Network” (an autoencoder). This network transforms the generic representation into a Task-Aware Representation (\(z_t^*\)).
  5. Policy: The policy network uses this focused, relevant representation to predict the action.
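A hedged, end-to-end sketch of these five steps might look like the following (the linear layers, dimensions, and unbatched tensors are simplifications for illustration; the actual transformation network is a convolutional autoencoder, described below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperTASRSketch(nn.Module):
    """Illustrative wiring of the five steps above (unbatched for simplicity)."""
    def __init__(self, obs_dim=512, feat_dim=128, task_dim=32, act_dim=7):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, feat_dim)               # step 2: generic z_t
        self.hyper = nn.Linear(task_dim + 1, feat_dim * feat_dim) # step 3: H(tau, psi_t) -> theta
        self.g = nn.Linear(feat_dim, feat_dim)                    # decoder of the transform
        self.policy = nn.Linear(feat_dim, act_dim)                # step 5: action head

    def forward(self, o_t, tau, psi_t):
        z_t = self.encoder(o_t)                                   # generic representation
        ctx = torch.cat([tau, psi_t], dim=-1)                     # task + progression state
        theta = self.hyper(ctx).view(z_t.shape[-1], -1)           # generated weights for f
        z_star = self.g(F.linear(z_t, theta))                     # step 4: task-aware z_t^*
        return self.policy(z_star)                                # predicted action a_t

model = HyperTASRSketch()
a_t = model(torch.randn(512), torch.randn(32), torch.tensor([0.4]))  # psi_t: ~40% through the task
```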

Under the Hood: The Math of Adaptation

Let’s break down the transformation mathematically. The authors introduce a transformation layer that consists of an encoder \(f\) and a decoder \(g\).

The transformed representation \(z_t^*\) is calculated as:

\[ z_t^{*} = g\big(f(z_t;\, \theta)\big) \]

Equation 1: Basic transformation function.

Here, \(f\) encodes the features and \(g\) decodes them back to the original dimension (so it fits into existing policy networks). The innovation lies in \(\theta\) (the weights of \(f\)). In standard networks, \(\theta\) is learned once and stays fixed. In HyperTASR, \(\theta\) is a function of the context:

\[ z_t^{*} = g\big(f(z_t;\, \theta(\tau, \psi_t))\big) \]

Equation 2: Context-conditioned transformation.

This implies that the way the robot compresses the scene changes based on the task \(\tau\) and the current progress \(\psi_t\). The parameters \(\theta\) are generated by the Hypernetwork \(\mathcal{H}\):

\[ \theta = \mathcal{H}(\tau, \psi_t) \]

Equation 3: The Hypernetwork function.

This separation is crucial. By keeping the main visual backbone static and only dynamically generating the weights for this lightweight transformation layer, the system is computationally efficient and modular. It separates “seeing” (the backbone) from “attending” (the hypernetwork).
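To make the efficiency argument concrete, here is a minimal sketch, assuming made-up dimensions: the backbone stays frozen, and the hypernetwork only has to produce the small number of weights belonging to the lightweight transformation layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, bottleneck, ctx_dim = 128, 64, 33        # context = task embedding + progression scalar

backbone = nn.Linear(512, feat_dim)                # "seeing": generic encoder, kept frozen
for p in backbone.parameters():
    p.requires_grad_(False)

hyper = nn.Linear(ctx_dim, bottleneck * feat_dim)  # "attending": generates theta for f
g = nn.Linear(bottleneck, feat_dim)                # decoder g, trained normally

z_t = backbone(torch.randn(512))                                 # generic representation
theta = hyper(torch.randn(ctx_dim)).view(bottleneck, feat_dim)   # theta = H(tau, psi_t)
z_star = g(F.linear(z_t, theta))                                 # z_t^*, same size as z_t

print(theta.numel())  # only 8192 weights are generated per step, a tiny fraction
                      # of what a full visual backbone contains
```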

Detailed Model Structure

The transformation network isn’t just a simple linear layer; it needs to understand spatial relationships. The authors implement it as a U-Net style autoencoder with skip connections.

Detailed model structure showing the Convolutional Blocks and Transposed Convolutional Blocks.
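A rough sketch of such a block is shown below; the channel counts and depth are assumptions, and in HyperTASR the encoder-side weights of this block would be supplied by the hypernetwork rather than learned directly:

```python
import torch
import torch.nn as nn

class TinyUNetTransform(nn.Module):
    """Minimal U-Net-style autoencoder over a feature map: downsample with a
    convolution, upsample with a transposed convolution, and concatenate a
    skip connection so spatial detail survives the bottleneck."""
    def __init__(self, channels=64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(channels, 2 * channels, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(2 * channels, channels, 4, stride=2, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)   # merge skip + upsampled path

    def forward(self, z):                       # z: (B, C, H, W) feature map from the backbone
        skip = z
        h = self.down(z)                        # (B, 2C, H/2, W/2)
        h = self.up(h)                          # (B, C, H, W)
        return self.fuse(torch.cat([h, skip], dim=1))  # task-aware feature map, same shape as z

z = torch.randn(1, 64, 32, 32)
z_star = TinyUNetTransform()(z)                 # torch.Size([1, 64, 32, 32])
```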

The Hypernetwork doesn’t just predict one big blob of weights in a single shot. As shown below, it uses an optimization-biased approach, iteratively predicting parameter updates, which improves stability and makes training easier.

Detailed structure of the Hypernetwork predicting parameter updates iteratively.
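The exact update scheme isn't spelled out here, but the pattern it describes, starting from a learned initialization and repeatedly predicting small corrections, can be sketched as follows (step count, update rule, and shapes are assumptions):

```python
import torch
import torch.nn as nn

class IterativeHyperNet(nn.Module):
    """Starts from a learned base parameter vector and predicts a small
    correction at each step, conditioned on the context and the current theta,
    so each prediction stays close to a known-good initialization."""
    def __init__(self, ctx_dim=33, n_params=1024, n_steps=3):
        super().__init__()
        self.theta0 = nn.Parameter(torch.zeros(n_params))     # learned initialization
        self.step = nn.Linear(ctx_dim + n_params, n_params)   # predicts delta-theta
        self.n_steps = n_steps

    def forward(self, ctx):
        theta = self.theta0
        for _ in range(self.n_steps):
            delta = self.step(torch.cat([ctx, theta], dim=-1))
            theta = theta + delta                              # iterative refinement
        return theta                                           # flattened weights for f

theta = IterativeHyperNet()(torch.randn(33))                   # shape: (1024,)
```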

Experiments: Does it Work?

The researchers integrated HyperTASR into two state-of-the-art baselines:

  1. GNFactor: A method that builds 3D volumetric representations.
  2. 3D Diffuser Actor (3D DA): A strong baseline using pre-trained 2D backbones and point clouds.

They tested these on RLBench, a challenging simulation benchmark, specifically in a single-view setting. Single-view is notoriously difficult because the robot has to infer depth and geometry from just one camera angle, making efficient feature extraction critical.

Quantitative Results

The improvements were substantial. Simply adding HyperTASR to these existing methods raised success rates for both baselines.

Table 1: Evaluation results on RLBench (average success rate).

  • GNFactor: 33.3% → GNFactor + HyperTASR: 42.6%
  • 3D Diffuser Actor: 79.0% → 3D Diffuser Actor + HyperTASR: 81.3%

Key takeaways from the data:

  • GNFactor Integration: Success rate jumped by nearly 10 percentage points (a 27% relative improvement).
  • 3D Diffuser Actor Integration: Pushed the success rate to over 81%, achieving state-of-the-art results for single-view manipulation.
  • Efficiency: Despite adding a “network generating a network,” the overall inference time remained low, and training efficiency actually improved because the model converged faster.

Visualizing “Attention”

Numbers are great, but seeing what the robot sees is more telling. The researchers visualized the attention maps of the policies.

In the figure below, look at the “slide block” task (top row).

  • Baseline (Standard): The attention (red/yellow areas) is scattered. It looks at the block, the table edges, and empty space.
  • HyperTASR (Ours): The attention is laser-focused on the block and the target zone.

Attention visualization comparison. HyperTASR focuses tightly on relevant objects, whereas baselines show dispersed attention.

This confirms the hypothesis: the hypernetwork successfully modulated the features to suppress background noise and amplify task-relevant signals.

Real-World Validation

Simulations are useful, but the real world is messy. The authors deployed HyperTASR on a physical robot arm (Piper with a parallel gripper) performing tasks like stacking cups and cleaning.

Even with limited demonstration data (only 15 demos per task), HyperTASR outperformed the baseline 3D Diffuser Actor.

Figure 11: Real robot visualization. Top row: 3D Diffuser Actor fails to stack properly. Bottom row: HyperTASR successfully identifies the target and stacks the yellow cup.

In the visualization above (Figure 11), you can see the attention shift.

  1. Approach: Attention is on the yellow cup (target) and the gripper.
  2. Lift: Attention tracks the cup as it moves.
  3. Place: Attention expands to include the grey cup (destination).

This dynamic shift—tracking the “what” and the “where” as the task evolves—is exactly what HyperTASR promised to deliver.

Comparison with GNFactor

The benefits are even more obvious when comparing the action sequences directly against GNFactor in simulation.

Figure 8: Comparison of task execution. GNFactor (top) fails to grasp or align properly. HyperTASR (bottom) executes smooth, successful manipulations.

In the “slide block” task (Figure 8), GNFactor struggles to maintain a consistent understanding of where the block is relative to the target. HyperTASR maintains a robust lock on the object’s affordances throughout the trajectory.

Why This Matters

HyperTASR represents a shift in how we think about robotic perception. For a long time, the community focused on making “bigger and better” generic backbones (like CLIP or ResNet) that see everything.

This paper argues that context is king. A smaller, adaptable representation that knows what it’s looking for is often more effective than a massive, static representation that sees everything but prioritizes nothing.

By using hypernetworks to separate the “task context” from the “visual processing,” HyperTASR offers a modular way to improve almost any robotic policy. It doesn’t require retraining the massive visual backbones; it just inserts a smart, adaptive lens in front of them.

Conclusion

Robotic manipulation is moving away from static pattern matching toward dynamic, cognitive-like processes. HyperTASR demonstrates that giving robots the ability to adapt their “eyes” to the task at hand leads to more robust, efficient, and successful behavior.

Whether it’s sliding a block in a simulator or stacking cups in the real world, the lesson is clear: for a robot to act intelligently, it must first learn to see selectively.

Key Takeaways:

  • Problem: Standard robots use static scene representations that don’t adapt to tasks.
  • Method: HyperTASR uses Hypernetworks to dynamically generate weights for a representation transformation layer.
  • Input: The system conditions on both the Task Objective and the Task Progression State.
  • Result: State-of-the-art performance in single-view manipulation and highly focused, human-like attention patterns.