Introduction
When you look at a coffee mug, you don’t just see a cylindrical object with a curved protrusion; you intuitively see a handle that can be grasped. When you look at a drawer, you see a knob meant to be pulled. In psychology and robotics, this concept is known as affordance—the actionable properties of an object that define how an agent can interact with it.
For humans, recognizing affordances is effortless. For robots, it is an immense challenge. While recent advancements in Vision-Language Models (VLMs) have given machines the ability to describe scenes and answer questions, bridging the gap between high-level semantic understanding (“That is a mug”) and low-level robotic control (“Grasp the handle at these coordinates”) remains a bottleneck.
Robots need more than just bounding boxes; they need specific, actionable points in 3D space. They need to know exactly where to touch and how to interact, regardless of whether the object is familiar or brand new.
This brings us to GLOVER++, a new research framework designed to bridge the gap between human demonstrations and robotic manipulation. In this post, we will dissect the GLOVER++ paper, exploring how it leverages a massive new dataset (HOVA-500K) and a novel global-to-local decoding architecture to achieve state-of-the-art results in affordance reasoning.

As illustrated in Figure 1 above, GLOVER++ is designed to observe human behaviors (like opening a drawer) and transfer that “actionable knowledge” to robots, allowing them to manipulate objects in diverse environments, from simulations to the real world.
The Data Problem: Introducing HOVA-500K
Before we dive into the model architecture, we must address the fuel that powers it: data.
The researchers identified a critical flaw in existing robotic datasets. Previous datasets were either too small, lacked diversity, or provided annotations that were too vague (like large segmentation masks) rather than precise interaction points. To solve this, they introduced HOVA-500K, a large-scale affordance-annotated dataset.
Why HOVA-500K Matters
HOVA-500K stands for Human-Object Visual Affordance. Its scale is impressive compared to predecessors:
- 500,000 images
- 1,726 object categories
- 675 action categories
But scale isn’t the only contribution. The dataset marks a shift from identifying “regions” to identifying points. In robotic manipulation, knowing the general region of a handle isn’t enough; the end-effector needs a precise coordinate to execute a grasp.

As shown in Figure 3, the dataset provides heatmaps (Gaussian distributions) centered on the precise interaction point. Whether it’s turning a faucet, pulling a door, or holding a bowl, the annotation highlights the exact spot where the interaction occurs.
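To make this annotation format concrete, here is a minimal NumPy sketch of how a Gaussian heatmap centered on an annotated interaction point can be rendered. The image size, point, and sigma are illustrative values, not parameters taken from HOVA-500K.

```python
import numpy as np

def point_to_heatmap(h, w, cx, cy, sigma=8.0):
    """Render a 2D Gaussian "hotspot" centered on an interaction point (cx, cy).

    Returns an (h, w) float array in [0, 1] that peaks at the annotated pixel.
    The sigma value is illustrative, not taken from HOVA-500K.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# Example: a 480x640 image with a grasp point annotated at pixel (320, 240)
hm = point_to_heatmap(480, 640, cx=320, cy=240)
print(hm.shape, hm.max())  # (480, 640) 1.0
```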
How Was It Collected?
Manually annotating half a million interaction points would be prohibitively expensive and slow. The researchers developed a clever semi-automatic pipeline to mine this data from human demonstration videos (like Ego4D and EPIC-KITCHENS).
The challenge with human videos is occlusion: when a human grabs a mug, their hand covers the very affordance point we want the robot to see. To solve this, the team used a homography-based approach.

Here is the process illustrated in Figure 8:
- Contact Frame Detection: Identify the moment the human touches the object.
- Skin Segmentation: Find the hand and the object.
- Back-Projection: The system looks at previous frames (observation frames) where the hand hadn’t yet covered the object. By calculating the homography (the geometric transformation between frames), it projects the contact point from the contact frame back onto the unobstructed view of the object.
This allows the dataset to contain clean images of objects with precise annotations of where a human will touch them.
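To give a feel for the back-projection step, here is a hedged OpenCV sketch of the core idea: estimate a homography between the contact frame and an earlier, unobstructed observation frame, then warp the contact point through it. The choice of feature matcher, the thresholds, and the omission of skin segmentation are simplifications, not the paper's exact pipeline.

```python
import cv2
import numpy as np

def back_project_contact_point(contact_frame, observation_frame, contact_pt):
    """Project a contact point from the contact frame onto an earlier,
    unobstructed observation frame via a homography between the two frames.

    contact_pt: (x, y) pixel coordinates in the contact frame.
    Returns (x, y) in the observation frame, or None if matching fails.
    """
    # 1. Detect and match keypoints between the two frames (ORB for brevity).
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(contact_frame, None)
    kp2, des2 = orb.detectAndCompute(observation_frame, None)
    if des1 is None or des2 is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]
    if len(matches) < 4:
        return None

    # 2. Estimate the homography H mapping contact-frame pixels to observation-frame pixels.
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None

    # 3. Warp the contact point back onto the unobstructed view.
    pt = np.float32([[contact_pt]])  # shape (1, 1, 2)
    projected = cv2.perspectiveTransform(pt, H)
    return tuple(projected[0, 0])
```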
The Method: Inside GLOVER++
With a massive dataset in hand, the researchers needed an architecture capable of utilizing it effectively. The core innovation of GLOVER++ is its Global-to-Local affordance tuning policy.
The Challenge of Scale vs. Precision
Standard VLMs are great at global semantics—understanding the scene as a whole. However, they often struggle with fine-grained localization. If you ask a standard model to “find the handle,” it might highlight the whole drawer or get confused by background noise. Conversely, models trained only on local geometry might miss the semantic context of which drawer to open.
GLOVER++ solves this by processing visual information in two stages.

Step 1: Multi-Modal Encoding
The system takes an RGB image and a language instruction (e.g., “Open the top drawer”). These are fed into a Vision-Language Model (VLM), specifically LLaVA-1.5. The model introduces a special token, <AFF> (Affordance Token), which aggregates the visual and linguistic features into a hidden latent representation.
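As a rough sketch of what this looks like in practice, here is one way to add an <AFF> token and read out its hidden state, assuming a Hugging Face LLaVA-1.5 checkpoint. The prompt template, the placement of <AFF> at the end of the prompt, and the readout are my assumptions, not GLOVER++'s released code.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed setup: a LLaVA-1.5 checkpoint plus an extra <AFF> special token whose
# final hidden state carries the fused visual-linguistic affordance latent.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

processor.tokenizer.add_special_tokens({"additional_special_tokens": ["<AFF>"]})
model.resize_token_embeddings(len(processor.tokenizer))

image = Image.open("drawer.jpg")  # placeholder image path
# Placing <AFF> last means its hidden state is simply the final position.
prompt = "USER: <image>\nOpen the top drawer. ASSISTANT: <AFF>"
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Last-layer hidden state at the <AFF> position: the latent the decoders consume.
aff_latent = out.hidden_states[-1][0, -1]  # shape (hidden_dim,)
```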
Step 2: Global Decoding
The <AFF> token is passed to a Global Decoder. This decoder acts as a high-level semantic filter. It looks at the entire image and identifies the general areas relevant to the instruction.
The output of this stage is a semantic logits map (\(\mathcal{M}_{sem}^{2D}\)). While this map captures the correct object, it often contains “background noise”—irrelevant regions that are semantically similar but not actionable.
Step 3: Local Decoding
This is where GLOVER++ shines. The semantic map from the global decoder is used as a mask prompt for a second, Local Decoder. This decoder refines the prediction, focusing on local geometric details to pinpoint the exact affordance region.
\[ \mathcal{M}_{A}^{2D} = \mathbf{F}_{dec}^{loc}(\mathcal{M}_{sem}^{2D}, v) \]

By cascading these decoders, the model filters out noise. The global decoder says, “Look at the top drawer,” and the local decoder says, “Specifically, grasp this handle.”
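Here is a minimal PyTorch sketch of the cascade. The real decoders are more sophisticated mask decoders; this toy version only shows the data flow: the <AFF> latent and visual features produce a global semantic map, which then serves as a mask prompt for the local decoder. All layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalToLocalAffordance(nn.Module):
    """Toy cascade: a global decoder produces a coarse semantic logits map from
    the <AFF> latent and visual features; a local decoder refines it using the
    semantic map as a mask prompt. Layers and shapes are illustrative only."""

    def __init__(self, hidden_dim=512, feat_dim=256):
        super().__init__()
        self.global_dec = nn.Sequential(
            nn.Conv2d(feat_dim + hidden_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 1, 1),
        )
        self.local_dec = nn.Sequential(
            nn.Conv2d(feat_dim + 1, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 1, 1),
        )

    def forward(self, vis_feat, aff_latent):
        # vis_feat: (B, feat_dim, H, W); aff_latent: (B, hidden_dim)
        B, _, H, W = vis_feat.shape
        aff_map = aff_latent[:, :, None, None].expand(B, -1, H, W)

        # Global stage: coarse, semantics-driven logits map M_sem
        m_sem = self.global_dec(torch.cat([vis_feat, aff_map], dim=1))

        # Local stage: M_sem acts as a mask prompt; output is the refined M_A
        m_aff = self.local_dec(torch.cat([vis_feat, m_sem.sigmoid()], dim=1))
        return m_sem, m_aff

model = GlobalToLocalAffordance()
m_sem, m_aff = model(torch.randn(1, 256, 64, 64), torch.randn(1, 512))
print(m_sem.shape, m_aff.shape)  # torch.Size([1, 1, 64, 64]) for both
```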

Figure 4 demonstrates this effect clearly. Look at the “Pick up saw” example (left). The global decoder (middle column) highlights the saw but also picks up some background noise. The local decoder (right column) cleans this up, focusing intensely on the handle where the grasp should happen.
Training Objective
To train this system, the researchers used a combination of Focal Loss (to handle class imbalance between the tiny affordance point and the rest of the image) and Kullback-Leibler Divergence (KLD) loss.
\[ \mathcal{L} = \mathcal{L}_{FL}(\mathcal{M}_{A}^{2D}, \mathcal{M}_{gt}^{2D}) + \mathcal{L}_{KL}(\mathcal{M}_{A}^{2D}, \mathcal{M}_{gt}^{2D}) \]

The KLD loss is particularly important because it forces the predicted probability distribution to match the Gaussian distribution of the ground truth. This ensures the model predicts a coherent “hotspot” rather than scattered pixels.
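A hedged PyTorch sketch of this combined objective is below; the focal-loss hyperparameters and the way the maps are normalized into distributions for the KL term are assumptions on my part, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def affordance_loss(pred_logits, gt_heatmap, alpha=0.25, gamma=2.0):
    """Combined focal + KL-divergence loss between a predicted affordance map
    (raw logits) and a Gaussian ground-truth heatmap. Hyperparameters are
    illustrative, not the paper's exact values."""
    # Focal loss: per-pixel BCE reweighted to focus on the rare "hotspot" pixels.
    prob = torch.sigmoid(pred_logits)
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt_heatmap, reduction="none")
    p_t = prob * gt_heatmap + (1 - prob) * (1 - gt_heatmap)
    alpha_t = alpha * gt_heatmap + (1 - alpha) * (1 - gt_heatmap)
    focal = (alpha_t * (1 - p_t) ** gamma * bce).mean()

    # KL divergence: treat both maps as spatial probability distributions so the
    # prediction matches the Gaussian shape of the ground truth, not just its peak.
    pred_dist = F.log_softmax(pred_logits.flatten(1), dim=1)
    gt_flat = gt_heatmap.flatten(1)
    gt_dist = gt_flat / (gt_flat.sum(dim=1, keepdim=True) + 1e-8)
    kld = F.kl_div(pred_dist, gt_dist, reduction="batchmean")

    return focal + kld

loss = affordance_loss(torch.randn(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```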
Experiments and Results
The researchers evaluated GLOVER++ across several domains, comparing it against both specialist affordance models and generalist VLMs like Qwen-2.5-VL.
Vision-Language Affordance Reasoning
On the HOVA-500K benchmark, GLOVER++ achieved state-of-the-art results. But numbers in a table are one thing; qualitative results tell a better story.
When compared to Qwen-2.5-VL (a powerful 7B parameter model), GLOVER++ showed superior physical grounding.

In Figure 6, look at the task “Pick up the left mug.” Both models identify the mug. However, GLOVER++ (bottom row) identifies a grasp point on the handle, whereas Qwen-2.5-VL (top row) acts more generally on the object body, which might lead to a failed grasp. The same applies to “Use plug”—GLOVER++ targets the button/interface, while the baseline is less precise.
Zero-Shot Manipulation
One of the most robust tests for a robotic model is Zero-Shot Manipulation—handling objects it has never seen before.
The researchers tested this in both simulation (IsaacGym) and the real world. In real-world tests using a UFactory xArm, GLOVER++ achieved a 73.3% average success rate, significantly outperforming the retrieval-based baseline RAM (46.7%).
This success stems from GLOVER++’s “open-vocabulary” nature. Because it is built on top of a VLM (LLaVA), it understands language concepts (“left,” “top,” “red”) and can apply that understanding to find affordances on novel objects without retraining.
Imitation Learning
Affordance prediction isn’t just about simple grasping; it can also accelerate learning complex skills. The researchers integrated GLOVER++ into an Imitation Learning pipeline (using RLBench).

As shown in Figure 14, instead of just feeding raw pixels to the robot’s policy network, they used the affordance maps generated by GLOVER++ as an attention prior. This effectively tells the robot, “Pay attention to this part of the image.”
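One simple way to realize such a prior (my illustration, not necessarily how the paper wires it into the RVT policy) is to re-weight the policy's visual features with the affordance heatmap:

```python
import torch

def apply_affordance_prior(visual_features, affordance_map, strength=1.0):
    """Re-weight policy features with an affordance heatmap used as a soft
    spatial attention prior. Shapes: features (B, C, H, W), map (B, 1, H, W)."""
    prior = 1.0 + strength * affordance_map  # keep all features, boost the hotspot
    return visual_features * prior

feats = torch.randn(1, 64, 32, 32)
aff = torch.rand(1, 1, 32, 32)
weighted = apply_affordance_prior(feats, aff)
```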
The results were telling. In tasks requiring high precision, like “insert peg” or “stack cups,” adding the affordance prior significantly boosted success rates compared to the RVT baseline.
Extending Capabilities: Long-Horizon and Bimanual Tasks
The paper explores two exciting extensions that showcase the versatility of GLOVER++.
1. Long-Horizon Planning with VLM
GLOVER++ is a perception module; it finds points. It is not a planner. However, it can be paired with a “brain” like Qwen-2.5-VL to perform multi-step tasks.

In the left side of Figure 7, the user asks to “Put the jar into the top drawer.”
- Planner (VLM): Decomposes the task: “Open top drawer” -> “Pick up jar” -> “Place in drawer.”
- Perception (GLOVER++):
- Finds the handle of the top drawer.
- Finds the grasp point on the jar.
- Finds the placement point inside the drawer.
This modular approach allows the robot to execute complex chains of actions by constantly grounding the planner’s instructions into physical coordinates.
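A structural sketch of this loop might look like the following, where the planner, affordance model, deprojection, camera, and robot controller are all injected placeholders standing in for Qwen-2.5-VL, GLOVER++, and the real hardware rather than actual APIs.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    rgb: object    # camera image (placeholder type)
    depth: object  # aligned depth map (placeholder type)

def execute_long_horizon(instruction, planner, affordance_model, deproject,
                         get_observation, robot):
    """Planner-plus-grounder loop. All arguments are injected stand-ins:
    `planner` for the VLM "brain", `affordance_model` for GLOVER++,
    `deproject` for the depth-camera lift, `get_observation` for the camera,
    and `robot` for the arm controller."""
    subtasks = planner(instruction)
    # e.g. ["Open top drawer", "Pick up jar", "Place jar in drawer"]
    for subtask in subtasks:
        obs = get_observation()                        # fresh view before each step
        point_2d = affordance_model(obs.rgb, subtask)  # (u, v) affordance hotspot
        point_3d = deproject(point_2d, obs.depth)      # lift to 3D with depth
        robot.move_and_act(point_3d, subtask)          # execute the grounded action
```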
2. Bimanual (Two-Handed) Manipulation
Because GLOVER++ understands spatial language, it can reason about left and right simultaneously.
In the right side of Figure 7 (and demonstrated on a Unitree G1 humanoid robot), the model successfully identifies affordances for both hands—holding a cabinet open with the left hand while manipulating an object inside with the right.

Figure 19 illustrates the motion planning used for the humanoid robot. The ability to predict two distinct, spatially aware affordance points from a single image is a significant step toward human-like dexterity.
Limitations and Future Directions
No system is perfect, and the authors are transparent about GLOVER++’s limitations.
1. Static vs. Dynamic: HOVA-500K is built on static images. While it captures the “moment of contact,” it doesn’t capture the trajectory or force dynamics of the interaction.
2. 2D to 3D Ambiguity: The model predicts affordances in 2D images, which are then projected into 3D using depth cameras. This usually works well but can fail in cluttered scenes or with overlapping objects (a minimal deprojection sketch follows this list).

Figure 20 highlights failure cases. In (a), a distant viewpoint makes the object too small for precise affordance prediction. In (b), overlapping objects create “background noise” in the probability map, confusing the system.
3. Execution Failures: Even if the vision is perfect, the robot might fail. Figure 21 (in the paper’s appendix) shows instances where the grasp failed due to self-collision or inaccuracies in the depth sensor (z-axis error), reminding us that perception is only half the battle in robotics.
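For reference, the 2D-to-3D step from limitation 2 is typically a standard pinhole-camera deprojection, sketched below with placeholder intrinsics; any error in the measured depth propagates directly into the z-axis inaccuracies mentioned in limitation 3.

```python
import numpy as np

def deproject_pixel_to_3d(u, v, depth, fx, fy, cx, cy):
    """Lift a 2D affordance pixel (u, v) with depth (meters) into a 3D point in
    the camera frame using the pinhole model. A noisy or missing depth value
    here is exactly the z-axis failure mode discussed above."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Placeholder intrinsics for illustration; a real system reads them from the camera.
point = deproject_pixel_to_3d(u=412, v=305, depth=0.62,
                              fx=615.0, fy=615.0, cx=320.0, cy=240.0)
```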
Conclusion
GLOVER++ represents a significant leap forward in robotic perception. By moving away from generic object detection and towards actionable affordance detection, it gives robots a much more human-like understanding of their environment.
The contributions are twofold:
- HOVA-500K: A massive, high-quality dataset that will likely serve as a benchmark for future affordance research.
- GLOVER++ Framework: A smart, global-to-local architecture that balances semantic understanding with geometric precision.
As robots move out of factories and into our homes, the ability to understand “open this drawer” or “pick up that mug by the handle” will be non-negotiable. Work like GLOVER++ paves the way for that future.