Introduction: The Problem of the “Norman Door”
We have all been there. You walk up to a door, grab the handle, and pull. It doesn’t budge. You realize, slightly embarrassed, that you were supposed to push.
This scenario, often called the “Norman Door” problem, highlights a fundamental challenge in robotics: visual ambiguity. To a robot’s camera, a push-door and a pull-door often look identical. The same goes for two boxes that look alike from the outside, where one is empty and the other is loaded with lead bricks on its left side.
In the robotics world, a system relying solely on a single snapshot of the world is doomed to fail in these scenarios. To succeed, a robot must act like a human: try an action, observe the outcome (did the door move?), update its internal model, and try again. This is interactive perception.
However, teaching robots to learn from history effectively is notoriously difficult. Today, we are diving deep into a new paper, “Learn from What We HAVE,” which proposes a novel architecture called the History-Aware VErifier (HAVE). Instead of trying to train a single massive “brain” to predict the perfect action from history, the researchers propose a smarter split: use a creative generator to brainstorm options, and a strict, history-aware judge to verify them.
The Core Challenge: Ambiguity and Multimodality
In a Partially Observable Markov Decision Process (POMDP), the robot doesn’t know the true state of the world (e.g., the friction of a hinge or the center of mass). It only sees observations (point clouds or images).
When an environment is ambiguous, there are often multiple valid ways to interact with it, or “modes.” For a closed door, valid initial actions might include “push left,” “push right,” “pull left,” or “pull right.”
The standard Deep Learning approach is Conditional Generation: training a diffusion model or a policy that takes the history of interactions as input and outputs the next action. While this sounds logical, it often fails in practice. Why?
- Mode Collapse: Conditional models often struggle to capture all distinct possibilities (modes), reverting to an “average” behavior that fails at everything.
- Data Inefficiency: Training a model to implicitly understand physics from video history requires massive amounts of data.
The authors of HAVE take a different route inspired by recent successes in Large Language Models (LLMs): The Generation-Verification Paradigm.
The Method: Decoupling Creativity from Logic
The core insight of HAVE is that generating potential actions and selecting the best action are two different problems that should be solved separately.
The system consists of two distinct modules:
- The Generator: An unconditional diffusion model that proposes multiple candidate actions based only on the current observation. It doesn’t care about history; it just suggests actions that are geometrically plausible (e.g., “here is a handle, maybe grab it”).
- The Verifier (HAVE): A history-aware model that scores those proposals. It looks at the past (e.g., “we just tried grabbing that handle and pulling, and it failed”) and assigns high scores to actions likely to succeed and low scores to actions likely to repeat past failures.
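The division of labor above can be sketched as a simple loop. Everything here is a toy stand-in for the paper's modules: `generate_candidates`, `score`, and the four-mode action space are illustrative assumptions, not HAVE's actual interfaces.

```python
import random

# Toy sketch of generate-then-verify. The action space, the uniform
# "generator," and the hard -1/+1 "verifier" are all illustrative
# simplifications of the paper's learned modules.

ACTIONS = ["push-left", "push-right", "pull-left", "pull-right"]

def generate_candidates(observation, n=8, rng=None):
    """Unconditional generator: propose plausible actions while ignoring
    history entirely (here, a diverse shuffle over the modes)."""
    rng = rng or random
    pool = ACTIONS * (n // len(ACTIONS) + 1)
    rng.shuffle(pool)
    return pool[:n]

def score(candidate, history):
    """History-aware verifier: penalize candidates that repeat past
    failures (the real model is learned, not a hard rule)."""
    failed = {action for action, moved in history if not moved}
    return -1.0 if candidate in failed else 1.0

def select_action(observation, history, rng=None):
    candidates = generate_candidates(observation, rng=rng)
    return max(candidates, key=lambda a: score(a, history))

# After failing to push left, the verifier steers us to an untried mode.
history = [("push-left", False)]
action = select_action(observation=None, history=history,
                       rng=random.Random(0))
print(action)  # never "push-left"
```

The key design point survives even in this caricature: the generator stays deliberately ignorant of history so its proposals remain diverse, and all the history-dependent reasoning lives in the scoring function.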
1. The Generator
The generator uses a Diffusion Transformer (DiT). It takes a noisy 3D Articulation Flow together with the current point cloud observation and iteratively denoises it into a clean action proposal.

As shown in Figure 12, the generator uses PointNet++ to encode the geometry. Importantly, this generator is unconditional regarding history. It simply asks: “What are valid ways to grasp this object?” This ensures high diversity in the proposals, preventing the robot from getting stuck in a loop of trying the same failed action.
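To make the iterative-denoising idea concrete, here is a minimal sketch in one dimension. The hand-wired `denoise_step` stands in for the trained DiT (there is no PointNet++ conditioning here), and the two modes, step size, and schedule are all assumptions for illustration.

```python
import numpy as np

# Toy sketch of iterative denoising. Instead of a trained Diffusion
# Transformer conditioned on point cloud features, a hand-wired update
# pulls noisy samples toward two "plausible action" modes.

MODES = np.array([-1.0, 1.0])  # e.g. two valid grasp directions

def denoise_step(x, step_size=0.3):
    """Move each sample toward its nearest mode (a stand-in for the
    learned network's predicted denoising update)."""
    nearest = MODES[np.abs(x[:, None] - MODES[None, :]).argmin(axis=1)]
    return x + step_size * (nearest - x)

def sample_proposals(n=16, steps=20, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 2.0, size=n)  # start from pure noise
    for _ in range(steps):
        x = denoise_step(x)
    return x

proposals = sample_proposals()
print(np.round(proposals, 3))
```

Because nothing conditions the process on history, samples started on either side of zero converge to different modes, so a batch typically covers multiple distinct proposals rather than collapsing to one.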
2. The History-Aware Verifier (HAVE)
This is the heart of the paper. The verifier’s job is to look at a batch of actions proposed by the generator and pick the winner.
The architecture (Figure 1) is designed specifically for reasoning about cause and effect.

How it works:
- Encoders: The system encodes the Action Proposal (the candidate being judged), the History of Actions (what the robot did before), and the History of Results (how the object moved—or didn’t move—after previous actions).
- Observation Flow: The “Results” aren’t just static images. The model calculates the 3D flow between timesteps to explicitly represent movement.
- Attention Mechanism: This is the critical step. The model uses an attention layer where:
  - Query (Q): The current Action Proposal.
  - Key (K): Past Actions.
  - Value (V): Past Results.
This structure forces the network to ask: “How similar is this proposed action (Q) to what I did before (K), and what was the result (V)?” If the proposal is similar to a past action that resulted in zero movement (failure), the network learns to assign it a low score.
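The Q/K/V assignment above can be sketched with plain scaled dot-product attention. The 2-D action embeddings and the scalar "flow magnitude" results below are illustrative assumptions, not the paper's actual feature dimensions.

```python
import numpy as np

# Sketch of the verifier's attention step: the candidate action (query)
# attends over past actions (keys) to retrieve their outcomes (values).

def attend(proposal, past_actions, past_results):
    """proposal: (d,), past_actions: (T, d), past_results: (T, d_v).
    Returns past results weighted by action similarity."""
    d = proposal.shape[-1]
    logits = past_actions @ proposal / np.sqrt(d)  # Q.K^T / sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                       # softmax over history
    return weights @ past_results                  # similarity-weighted V

# Toy embeddings for two action modes.
push_left = np.array([1.0, 0.0])
pull_right = np.array([0.0, 1.0])

# History: pushing left produced no motion (flow ~ 0); pulling right
# moved the door (flow ~ 1).
past_actions = np.stack([push_left, pull_right])
past_results = np.array([[0.0], [1.0]])

# A proposal similar to the failed action retrieves mostly the "no
# motion" outcome, so a downstream scoring head can rank it low.
print(attend(push_left, past_actions, past_results))   # ~ [0.33]
print(attend(pull_right, past_actions, past_results))  # ~ [0.67]
```

The weighting does exactly what the prose describes: similarity between the proposal and a past action decides how much that action's observed result contributes to the summary the scorer sees.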
Theoretical Motivation: Why Verification Wins
You might ask: “Why not just train a better generator?”
The authors provide a compelling theoretical justification. They prove that even if your generator is mediocre, a verifier that is only slightly better than random guessing can significantly improve the expected reward.

Figure 7 illustrates this concept. Even if the generator (Success Probability \(p_G\)) is low (e.g., 0.2), using a verifier with reasonable accuracy (\(p_V\)) allows the system’s performance to skyrocket as you sample more candidates (\(N\)).
By sampling \(N\) actions, you increase the probability that at least one “good” action exists in the batch. The verifier simply needs to identify it. This is far easier than forcing a generator to output the single best action on the first try.
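A back-of-the-envelope model makes the argument tangible. Note this is a simplification of the paper's analysis, not its exact bound: assume each of the \(N\) samples is independently "good" with probability \(p_G\), and that when at least one good sample exists the verifier picks one with probability \(p_V\).

```python
# Toy model of generate-then-verify (a simplification, not the paper's
# exact analysis): success requires (a) at least one good candidate in
# the batch and (b) the verifier identifying it.

def success_probability(p_g, p_v, n):
    at_least_one_good = 1.0 - (1.0 - p_g) ** n
    return p_v * at_least_one_good

# Even a mediocre generator (p_g = 0.2) paired with a decent verifier
# (p_v = 0.8) improves sharply as N grows:
for n in (1, 5, 10):
    print(n, round(success_probability(0.2, 0.8, n), 3))
# 1  -> 0.16
# 5  -> 0.538
# 10 -> 0.714
```

Under these assumptions, success climbs from 16% to over 70% just by sampling more candidates, with the verifier's accuracy \(p_V\) setting the asymptotic ceiling.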
Experiments and Results
The team tested HAVE in simulation and the real world across three distinct tasks that involve fundamental ambiguity.
1. Multimodal Doors (The Push/Pull Problem)
They created a dataset of doors that look identical but open in different directions (push/pull, left/right).

In Figure 2, we see the failure rates. “Generator Only” (guessing based on geometry) fails often because it’s a coin toss. Baselines like FlowBot3D and Conditional Diffusion also struggle. HAVE, however, drastically reduces the failure rate. It also opens doors in fewer steps, proving it learns efficient exploration strategies.
2. Real-World Verification
Simulation is useful, but the real world is messy. The authors deployed HAVE on a Franka Emika Panda robot facing a custom ambiguous door.

The results in Figure 3 mimic the simulation. The “Baseline” (FlowBot3D) often gets stuck or takes many steps to open the door. HAVE consistently achieves a 100% success rate across different modes (Push-L, Push-R, etc.) and does so with fewer mean steps. The robot tries an action, realizes it failed, and immediately pivots to the correct mode.
To visualize this “thinking” process, look at the analysis below:

In Figure 18, the generator suggests “Push Left” frequently (the tall blue bar). However, the robot previously tried pushing left and failed (Step 1). Consequently, the Verifier assigns “Push Left” a very low score (negative value in the right-hand chart) and selects “Pull Right” instead, solving the task. This is explicit history-aware reasoning in action.
3. Uneven Object Pickup
The third task involves picking up objects with unknown centers of mass—like a heavy hammer or an unevenly weighted rod. If you grab it in the geometric center, it tilts and falls. You must explore to find the balance point.

The system needs to look at how the object tilted in previous attempts to guess where the mass is.

Figure 4 is particularly illuminating.
- Left (Conditional Diffusion): The baseline flails around. It tries far left, then far right, failing to narrow down the solution.
- Right (HAVE): The shaded bars represent the “theoretical center of mass range” derived from previous failures. HAVE’s selected actions (red dots) stay strictly within these logical bounds. It acts like a binary search algorithm, narrowing down the possibilities until it succeeds in just 3 steps.
The failure rates (Table 2) confirm this: HAVE reduces failure rates significantly on both known rods and unseen objects (like knives and bookmarks).
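The bisection-like behavior can be made explicit with a few lines of code. To be clear, the interval bookkeeping below is an analogy for what HAVE's learned verifier does implicitly, not the paper's algorithm, and the toy tilt physics is an assumption.

```python
# Analogy for HAVE's exploration on the uneven-pickup task: each failed
# grasp prunes part of the feasible center-of-mass interval, much like
# the shaded "theoretical center of mass range" described for Figure 4.

def tilt(grasp, com, tolerance=0.05):
    """Toy physics: the object balances only if we grasp within
    `tolerance` of the true center of mass `com` (all in [0, 1])."""
    if abs(grasp - com) <= tolerance:
        return 0                      # balanced: success
    return 1 if com > grasp else -1   # tilts toward the heavy side

def find_balance(com, lo=0.0, hi=1.0, max_steps=10):
    """Bisect the feasible interval using the tilt direction observed
    after each failed attempt."""
    for step in range(1, max_steps + 1):
        grasp = (lo + hi) / 2.0
        outcome = tilt(grasp, com)
        if outcome == 0:
            return step, grasp
        if outcome > 0:   # tilted right -> mass is further right
            lo = grasp
        else:             # tilted left -> mass is further left
            hi = grasp
    return max_steps, grasp

steps, grasp = find_balance(com=0.8)
print(steps, round(grasp, 3))
```

Because the interval halves on every failure, the grasp is guaranteed to land within tolerance of the center of mass in a handful of attempts, which is the qualitative behavior the red dots in Figure 4 display.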
Why Does HAVE Work So Well?
We can break down the success of HAVE into a few key analytical insights provided by the paper.
Robustness to Noise
The authors compared HAVE’s dual-branch architecture against a “Vanilla Transformer” that just concatenates actions and observations together.

Figure 5 shows that while both work well with perfect data (GT obs flow), HAVE is much more robust when using estimated flow (DELTA), which is what a real robot uses. By explicitly structuring the attention between “What I proposed” and “What happened historically,” HAVE is less prone to getting confused by noisy sensor data.
Sample Efficiency
How many generated actions does the verifier need to see to make a good decision?

Figure 6 (Right) shows that performance improves drastically as you increase the sample count from 1 to about 5. After that, returns diminish. This is great news for real-time robotics—you don’t need to generate 1,000 samples. Just generating 5-10 candidates and verifying them is enough to beat the baseline.
Learning from Failure
Finally, the researchers visualized the internal scores of the verifier to see if it truly “understands” failure.

Figure 21 confirms this. When the history contains a failure (e.g., 1-Step Failure History), the score for that specific “Failure Mode” drops near zero, while “Other Modes” rise. As the robot accumulates more failures (2-step, 3-step), the distinction becomes even sharper. The model effectively learns: “I have tried X and Y, so the answer must be Z.”
Conclusion
The HAVE paper presents a shift in how we approach robotic manipulation in ambiguous environments. Rather than trying to force a single generative model to “know it all,” the authors demonstrate the power of decoupling.
By using an unconditional generator to provide diversity and a history-aware verifier to provide logic, robots can:
- Reason about past interactions to avoid repeating mistakes.
- Narrow down possibilities in ambiguous scenarios (like finding a center of mass).
- Operate efficiently in the real world with noisy data.
This “Generate-then-Verify” approach mirrors how humans solve problems—we brainstorm solutions and then critically evaluate them based on our experience. Giving robots this same capability brings us one step closer to machines that can truly handle the unpredictable nature of the real world.