Imagine you are making a cup of tea. When you reach for the kettle, your eyes lock onto the handle. When you pour the water, your focus shifts entirely to the spout and the water level in the cup. You essentially ignore the toaster, the fridge, and the pattern on the tablecloth. Your visual perception is task-aware and evolves as the task progresses.
Now, consider how robots typically “see.” In standard robotic manipulation pipelines, the robot takes an image of the scene and compresses it into a representation (a set of features). Crucially, this process is usually task-agnostic. The robot processes the toaster, the fridge, and the kettle with equal importance, regardless of whether it’s trying to boil water or make toast.
This disconnect creates a massive inefficiency. If a robot treats every pixel as equally important, it struggles to generalize and requires significantly more data to learn simple tasks.
In this post, we dive into HyperTASR, a new framework presented at CoRL 2025 that bridges this gap. By using Hypernetworks, this method allows robots to dynamically rewire their visual processing based on what they are doing and when they are doing it.

The Problem: Static Eyes in a Dynamic World
To understand why HyperTASR is necessary, we first need to look at the standard recipe for “End-to-End” robotic learning.
A typical policy learning pipeline has two main parts:
- Representation Extractor (\(\phi\)): Takes raw sensory data (like an RGB-D image, \(o_t\)) and turns it into a compact vector or feature map (\(z_t\)).
- Policy Network (\(\pi\)): Takes that representation (\(z_t\)) and the task instruction (\(\tau\)) to predict the robot’s next action (\(a_t\)).
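To make this pipeline concrete, here is a minimal PyTorch-style sketch. The module sizes and names are illustrative assumptions, not the architectures used in the paper; the point is simply that the task only enters at the policy, never at the encoder.

```python
import torch
import torch.nn as nn

class StaticEncoder(nn.Module):
    """Task-agnostic representation extractor phi: o_t -> z_t."""
    def __init__(self, obs_dim=512, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, o_t):
        return self.net(o_t)          # same features for every task

class Policy(nn.Module):
    """Policy pi: (z_t, tau) -> a_t. Only here does the task enter."""
    def __init__(self, feat_dim=128, task_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + task_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))

    def forward(self, z_t, tau):
        return self.net(torch.cat([z_t, tau], dim=-1))

phi, pi = StaticEncoder(), Policy()
o_t = torch.randn(1, 512)             # flattened observation (illustrative)
tau = torch.randn(1, 32)              # task embedding (illustrative)
a_t = pi(phi(o_t), tau)               # the encoder never saw the task
```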
The issue is that the Representation Extractor is usually static. It learns a “one-size-fits-all” way to look at the world. Whether the robot is asked to “open the jar” or “pick up the fork,” the visual features extracted are identical until they hit the policy network.
This runs counter to what cognitive science tells us about human perception: we adapt our perception dynamically, filtering out irrelevant information based on our current goals. HyperTASR (Hypernetwork-Driven Task-Aware Scene Representations) aims to bring this biological efficiency to robotics.
The Solution: HyperTASR
The core idea of HyperTASR is simple but powerful: don’t just condition the action on the task; condition the vision on the task.
Instead of a static encoding \(z_t = \phi(o_t)\), we want a dynamic encoding \(z_t = \phi(o_t, \tau)\), where the extraction process itself changes based on the task (\(\tau\)). Furthermore, because tasks change over time (localization \(\rightarrow\) grasping \(\rightarrow\) manipulating), the vision should also depend on the progression state of the task.
The Architecture
How do you make a neural network change its weights on the fly? The authors utilize Hypernetworks.
A Hypernetwork is a neural network that, instead of outputting a classification or an image, outputs the weights (parameters) for another neural network.
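Here is a minimal, self-contained sketch of the concept (not the paper's implementation): a tiny hypernetwork takes a task embedding and emits the weight matrix and bias of a target linear layer, which is then applied functionally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperNet(nn.Module):
    """Maps a task embedding to the parameters of a target linear layer."""
    def __init__(self, task_dim=32, in_dim=128, out_dim=128):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # one head for the weight matrix, one for the bias
        self.weight_head = nn.Linear(task_dim, out_dim * in_dim)
        self.bias_head = nn.Linear(task_dim, out_dim)

    def forward(self, tau):
        W = self.weight_head(tau).view(self.out_dim, self.in_dim)
        b = self.bias_head(tau)
        return W, b

hyper = HyperNet()
tau = torch.randn(32)                 # task embedding (illustrative)
z = torch.randn(128)                  # generic feature vector
W, b = hyper(tau)                     # weights generated on the fly
z_task = F.linear(z, W, b)            # target layer applied with those weights
```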

As shown in Figure 2 above, the HyperTASR pipeline works as follows:
- Input: The system takes an observation (\(o_t\)).
- Base Representation: A standard encoder extracts a generic representation (\(z_t\)).
- The Hypernetwork (\(\mathcal{H}\)): This is the brain of the operation. It takes the Task Information (\(\tau\)) and the Task Progression State (\(\psi_t\)). It outputs specific parameters (\(\theta\)).
- Transformation: These parameters (\(\theta\)) are loaded into a lightweight “Transformation Network” (an autoencoder). This network transforms the generic representation into a Task-Aware Representation (\(z_t^*\)).
- Policy: The policy network uses this focused, relevant representation to predict the action.
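A heavily simplified forward pass through this pipeline might look like the sketch below. All dimensions and module choices are assumptions for illustration; in particular, a plain linear map stands in for the paper's U-Net-style transformation network, which is shown later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, HID, CTX = 128, 64, 48              # illustrative dimensions

class TaskHyperNet(nn.Module):
    """H: (task, progression) -> parameters theta of the encoder f."""
    def __init__(self):
        super().__init__()
        self.w_head = nn.Linear(CTX, HID * FEAT)
        self.b_head = nn.Linear(CTX, HID)

    def forward(self, tau, psi_t):
        ctx = torch.cat([tau, psi_t], dim=-1)
        return self.w_head(ctx).view(HID, FEAT), self.b_head(ctx)

base_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, FEAT))
decoder_g = nn.Linear(HID, FEAT)          # g: back to the original feature size
policy = nn.Linear(FEAT, 7)               # predicts a 7-DoF action (illustrative)
hyper = TaskHyperNet()

o_t = torch.randn(512)                    # observation features
tau = torch.randn(32)                     # task embedding
psi_t = torch.randn(16)                   # task-progression embedding

z_t = base_encoder(o_t)                   # generic, task-agnostic features
W_f, b_f = hyper(tau, psi_t)              # theta generated from the context
z_star = decoder_g(F.linear(z_t, W_f, b_f))   # z* = g(f(z; theta))
a_t = policy(z_star)                      # policy acts on task-aware features
```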
Under the Hood: The Math of Adaptation
Let’s break down the transformation mathematically. The authors introduce a transformation layer that consists of an encoder \(f\) and a decoder \(g\).
The transformed representation \(z_t^*\) is calculated as:
\[ z_t^* = g\big(f(z_t;\, \theta)\big) \]
Here, \(f\) encodes the features and \(g\) decodes them back to the original dimension (so it fits into existing policy networks). The innovation lies in \(\theta\) (the weights of \(f\)). In standard networks, \(\theta\) is learned once and stays fixed. In HyperTASR, \(\theta\) is a function of the context:
\[ \theta = \theta(\tau, \psi_t) \]
This implies that the way the robot compresses the scene changes based on the task \(\tau\) and the current progress \(\psi_t\). The parameters \(\theta\) are generated by the Hypernetwork \(\mathcal{H}\):
\[ \theta = \mathcal{H}(\tau, \psi_t) \]
This separation is crucial. By keeping the main visual backbone static and only dynamically generating the weights for this lightweight transformation layer, the system is computationally efficient and modular. It separates “seeing” (the backbone) from “attending” (the hypernetwork).
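Putting the pieces together (using the same notation as above), the whole adaptation collapses into a single composition in which only the inner, hypernetwork-generated map depends on the context:

\[ z_t^* = g\big(f(z_t;\, \mathcal{H}(\tau, \psi_t))\big) \]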
Detailed Model Structure
The transformation network isn’t just a simple linear layer; it needs to understand spatial relationships. The authors implement it as a U-Net style autoencoder with skip connections.
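As a rough sketch of what a U-Net-style transformation over a feature map looks like, consider the toy module below. The depth, channel counts, and directly learned encoder weights are assumptions; in HyperTASR the encoder parameters would instead be supplied by the hypernetwork.

```python
import torch
import torch.nn as nn

class UNetTransform(nn.Module):
    """U-Net-style autoencoder over a feature map, with one skip connection."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.down = nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1)
        self.up = nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1)
        self.dec1 = nn.Conv2d(2 * ch, ch, 3, padding=1)   # takes skip + upsampled
        self.act = nn.ReLU()

    def forward(self, z):
        e1 = self.act(self.enc1(z))           # high-resolution features
        bottleneck = self.act(self.down(e1))  # compressed, task-filtered features
        u = self.act(self.up(bottleneck))     # back to the original resolution
        return self.dec1(torch.cat([u, e1], dim=1))  # skip connection keeps detail

z_t = torch.randn(1, 64, 16, 16)              # generic feature map (illustrative)
z_star = UNetTransform()(z_t)                 # same shape, task-aware content
print(z_star.shape)                           # torch.Size([1, 64, 16, 16])
```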

The Hypernetwork doesn’t just predict one big blob of weights. As shown below, it uses an optimization-biased approach where it iteratively predicts parameter updates, ensuring stability and easier training.
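In very simplified form, such an iterative scheme starts from a shared initialization and repeatedly predicts small parameter updates conditioned on the context, rather than emitting all weights in one shot. The step count and update rule below are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

class IterativeHyperNet(nn.Module):
    """Predicts parameter *updates* over several steps instead of weights in one shot."""
    def __init__(self, ctx_dim=48, n_params=1024, n_steps=3):
        super().__init__()
        self.n_steps = n_steps
        self.theta_init = nn.Parameter(torch.zeros(n_params))   # shared starting point
        # the update head also sees the current theta, mimicking an optimizer step
        self.update_head = nn.Linear(ctx_dim + n_params, n_params)

    def forward(self, ctx):
        theta = self.theta_init.expand(ctx.shape[0], -1)
        for _ in range(self.n_steps):
            delta = self.update_head(torch.cat([ctx, theta], dim=-1))
            theta = theta + delta                 # small refinement each step
        return theta

ctx = torch.randn(4, 48)                          # batched (task, progression) context
theta = IterativeHyperNet()(ctx)                  # final generated parameters
```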

Experiments: Does it Work?
The researchers integrated HyperTASR into two state-of-the-art baselines:
- GNFactor: A method that builds 3D volumetric representations.
- 3D Diffuser Actor (3D DA): A strong baseline using pre-trained 2D backbones and point clouds.
They tested these on RLBench, a challenging simulation benchmark, specifically in a single-view setting. Single-view is notoriously difficult because the robot has to infer depth and geometry from just one camera angle, making efficient feature extraction critical.
Quantitative Results
The improvements were substantial. Simply adding HyperTASR to these existing methods produced a marked jump in performance.

Key takeaways from the data:
- GNFactor Integration: Success rate jumped by nearly 10% absolute (27% relative improvement).
- 3D Diffuser Actor Integration: Pushed the success rate to over 81%, achieving state-of-the-art results for single-view manipulation.
- Efficiency: Despite adding a “network generating a network,” the overall inference time remained low, and training efficiency actually improved because the model converged faster.
Visualizing “Attention”
Numbers are great, but seeing what the robot sees is more telling. The researchers visualized the attention maps of the policies.
In the figure below, look at the “slide block” task (top row).
- Baseline (Standard): The attention (red/yellow areas) is scattered. It looks at the block, the table edges, and empty space.
- HyperTASR (Ours): The attention is laser-focused on the block and the target zone.

This confirms the hypothesis: the hypernetwork successfully modulated the features to suppress background noise and amplify task-relevant signals.
Real-World Validation
Simulations are useful, but the real world is messy. The authors deployed HyperTASR on a physical robot arm (Piper with a parallel gripper) performing tasks like stacking cups and cleaning.
Even with limited demonstration data (only 15 demos per task), HyperTASR outperformed the baseline 3D Diffuser Actor.

In the visualization above (Figure 11), you can see the attention shift.
- Approach: Attention is on the yellow cup (target) and the gripper.
- Lift: Attention tracks the cup as it moves.
- Place: Attention expands to include the grey cup (destination).
This dynamic shift—tracking the “what” and the “where” as the task evolves—is exactly what HyperTASR promised to deliver.
Comparison with GNFactor
The benefits are even more obvious when comparing the action sequences directly against GNFactor in simulation.

In the “slide block” task (Fig 8), GNFactor struggles to maintain a consistent understanding of where the block is relative to the target. HyperTASR maintains a robust lock on the object’s affordances throughout the trajectory.
Why This Matters
HyperTASR represents a shift in how we think about robotic perception. For a long time, the community focused on making “bigger and better” generic backbones (like CLIP or ResNet) that see everything.
This paper argues that context is king. A smaller, adaptable representation that knows what it’s looking for is often more effective than a massive, static representation that sees everything but prioritizes nothing.
By using hypernetworks to separate the “task context” from the “visual processing,” HyperTASR offers a modular way to improve almost any robotic policy. It doesn’t require retraining the massive visual backbones; it just inserts a smart, adaptive lens in front of them.
Conclusion
Robotic manipulation is moving away from static pattern matching toward dynamic, cognitive-like processes. HyperTASR demonstrates that giving robots the ability to adapt their “eyes” to the task at hand leads to more robust, efficient, and successful behavior.
Whether it’s sliding a block in a simulator or stacking cups in the real world, the lesson is clear: for a robot to act intelligently, it must first learn to see selectively.
Key Takeaways:
- Problem: Standard robots use static scene representations that don’t adapt to tasks.
- Method: HyperTASR uses Hypernetworks to dynamically generate weights for a representation transformation layer.
- Input: The system conditions on both the Task Objective and the Task Progression State.
- Result: State-of-the-art performance in single-view manipulation and highly focused, human-like attention patterns.