Introduction: The Problem of the “Norman Door”
We have all been there. You walk up to a door, grab the handle, and pull. It doesn’t budge. You realize, slightly embarrassed, that you were supposed to push.
This scenario, often called the “Norman Door” problem, highlights a fundamental challenge in robotics: visual ambiguity. To a robot’s camera, a push-door and a pull-door often look identical. The same goes for two boxes that look alike from the outside, where one is empty and the other is loaded with lead bricks on its left side.
In the robotics world, a system relying solely on a single snapshot of the world is doomed to fail in these scenarios. To succeed, a robot must act like a human: try an action, observe the outcome (did the door move?), update its internal model, and try again. This is interactive perception.
However, teaching robots to learn from history effectively is notoriously difficult. Today, we are diving deep into a new paper, “Learn from What We HAVE,” which proposes a novel architecture called the History-Aware VErifier (HAVE). Instead of trying to train a single massive “brain” to predict the perfect action from history, the researchers propose a smarter split: use a creative generator to brainstorm options, and a strict, history-aware judge to verify them.
The Core Challenge: Ambiguity and Multimodality
In a Partially Observable Markov Decision Process (POMDP), the robot doesn’t know the true state of the world (e.g., the friction of a hinge or the center of mass). It only sees observations (point clouds or images).
When an environment is ambiguous, there are often multiple valid ways to interact with it, or “modes.” For a closed door, valid initial actions might include “push left,” “push right,” “pull left,” or “pull right.”
The standard Deep Learning approach is Conditional Generation: training a diffusion model or a policy that takes the history of interactions as input and outputs the next action. While this sounds logical, it often fails in practice. Why?
- Mode Collapse: Conditional models often struggle to capture all distinct possibilities (modes), reverting to an “average” behavior that fails at everything.
- Data Inefficiency: Training a model to implicitly understand physics from video history requires massive amounts of data.
The authors of HAVE take a different route inspired by recent successes in Large Language Models (LLMs): The Generation-Verification Paradigm.
The Method: Decoupling Creativity from Logic
The core insight of HAVE is that generating potential actions and selecting the best action are two different problems that should be solved separately.
The system consists of two distinct modules:
- The Generator: An unconditional diffusion model that proposes multiple candidate actions based only on the current observation. It doesn’t care about history; it just suggests actions that are geometrically plausible (e.g., “here is a handle, maybe grab it”).
- The Verifier (HAVE): A history-aware model that scores those proposals. It looks at the past (e.g., “we just tried grabbing that handle and pulling, and it failed”) and assigns high scores to actions likely to succeed and low scores to actions likely to repeat past failures.
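The division of labor above can be sketched as a simple loop. Everything here is a toy stand-in for the paper's modules: `generate_candidates`, `score`, and the four-mode action space are illustrative assumptions, not HAVE's actual interfaces.

```python
import random

# Toy sketch of generate-then-verify. The action space, the uniform
# "generator," and the hard -1/+1 "verifier" are all illustrative
# simplifications of the paper's learned modules.

ACTIONS = ["push-left", "push-right", "pull-left", "pull-right"]

def generate_candidates(observation, n=8, rng=None):
    """Unconditional generator: propose plausible actions while ignoring
    history entirely (here, a diverse shuffle over the modes)."""
    rng = rng or random
    pool = ACTIONS * (n // len(ACTIONS) + 1)
    rng.shuffle(pool)
    return pool[:n]

def score(candidate, history):
    """History-aware verifier: penalize candidates that repeat past
    failures (the real model is learned, not a hard rule)."""
    failed = {action for action, moved in history if not moved}
    return -1.0 if candidate in failed else 1.0

def select_action(observation, history, rng=None):
    candidates = generate_candidates(observation, rng=rng)
    return max(candidates, key=lambda a: score(a, history))

# After failing to push left, the verifier steers us to an untried mode.
history = [("push-left", False)]
action = select_action(observation=None, history=history,
                       rng=random.Random(0))
print(action)  # never "push-left"
```

The key design point survives even in this caricature: the generator stays deliberately ignorant of history so its proposals remain diverse, and all the history-dependent reasoning lives in the scoring function.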
1. The Generator
The generator uses a Diffusion Transformer (DiT). It takes a noisy 3D Articulation Flow together with the current point cloud observation and iteratively denoises it into a clean action proposal.

As shown in Figure 12, the generator uses PointNet++ to encode the geometry. Importantly, this generator is unconditional regarding history. It simply asks: “What are valid ways to grasp this object?” This ensures high diversity in the proposals, preventing the robot from getting stuck in a loop of trying the same failed action.
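To make the iterative-denoising idea concrete, here is a minimal sketch in one dimension. The hand-wired `denoise_step` stands in for the trained DiT (there is no PointNet++ conditioning here), and the two modes, step size, and schedule are all assumptions for illustration.

```python
import numpy as np

# Toy sketch of iterative denoising. Instead of a trained Diffusion
# Transformer conditioned on point cloud features, a hand-wired update
# pulls noisy samples toward two "plausible action" modes.

MODES = np.array([-1.0, 1.0])  # e.g. two valid grasp directions

def denoise_step(x, step_size=0.3):
    """Move each sample toward its nearest mode (a stand-in for the
    learned network's predicted denoising update)."""
    nearest = MODES[np.abs(x[:, None] - MODES[None, :]).argmin(axis=1)]
    return x + step_size * (nearest - x)

def sample_proposals(n=16, steps=20, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 2.0, size=n)  # start from pure noise
    for _ in range(steps):
        x = denoise_step(x)
    return x

proposals = sample_proposals()
print(np.round(proposals, 3))
```

Because nothing conditions the process on history, samples started on either side of zero converge to different modes, so a batch typically covers multiple distinct proposals rather than collapsing to one.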
2. The History-Aware Verifier (HAVE)
This is the heart of the paper. The verifier’s job is to look at a batch of actions proposed by the generator and pick the winner.
The architecture (Figure 1) is designed specifically for reasoning about cause and effect.

How it works:
- Encoders: The system encodes the Action Proposal (the candidate being judged), the History of Actions (what the robot did before), and the History of Results (how the object moved—or didn’t move—after previous actions).
- Observation Flow: The “Results” aren’t just static images. The model calculates the 3D flow between timesteps to explicitly represent movement.
- Attention Mechanism: This is the critical step. The model uses an attention layer where:
  - Query (Q): The current Action Proposal.
  - Key (K): Past Actions.
  - Value (V): Past Results.
This structure forces the network to ask: “How similar is this proposed action (Q) to what I did before (K), and what was the result (V)?” If the proposal is similar to a past action that resulted in zero movement (failure), the network learns to assign it a low score.
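The Q/K/V assignment above can be sketched with plain scaled dot-product attention. The 2-D action embeddings and the scalar "flow magnitude" results below are illustrative assumptions, not the paper's actual feature dimensions.

```python
import numpy as np

# Sketch of the verifier's attention step: the candidate action (query)
# attends over past actions (keys) to retrieve their outcomes (values).

def attend(proposal, past_actions, past_results):
    """proposal: (d,), past_actions: (T, d), past_results: (T, d_v).
    Returns past results weighted by action similarity."""
    d = proposal.shape[-1]
    logits = past_actions @ proposal / np.sqrt(d)  # Q.K^T / sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                       # softmax over history
    return weights @ past_results                  # similarity-weighted V

# Toy embeddings for two action modes.
push_left = np.array([1.0, 0.0])
pull_right = np.array([0.0, 1.0])

# History: pushing left produced no motion (flow ~ 0); pulling right
# moved the door (flow ~ 1).
past_actions = np.stack([push_left, pull_right])
past_results = np.array([[0.0], [1.0]])

# A proposal similar to the failed action retrieves mostly the "no
# motion" outcome, so a downstream scoring head can rank it low.
print(attend(push_left, past_actions, past_results))   # ~ [0.33]
print(attend(pull_right, past_actions, past_results))  # ~ [0.67]
```

The weighting does exactly what the prose describes: similarity between the proposal and a past action decides how much that action's observed result contributes to the summary the scorer sees.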
Theoretical Motivation: Why Verification Wins
You might ask: “Why not just train a better generator?”
The authors provide a compelling theoretical justification. They prove that even if your generator is mediocre, a verifier that is only slightly better than random guessing can significantly improve the expected reward.

Figure 7 illustrates this concept. Even if the generator (Success Probability \(p_G\)) is low (e.g., 0.2), using a verifier with reasonable accuracy (\(p_V\)) allows the system’s performance to skyrocket as you sample more candidates (\(N\)).
By sampling \(N\) actions, you increase the probability that at least one “good” action exists in the batch. The verifier simply needs to identify it. This is far easier than forcing a generator to output the single best action on the first try.
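A back-of-the-envelope model makes the argument tangible. Note this is a simplification of the paper's analysis, not its exact bound: assume each of the \(N\) samples is independently "good" with probability \(p_G\), and that when at least one good sample exists the verifier picks one with probability \(p_V\).

```python
# Toy model of generate-then-verify (a simplification, not the paper's
# exact analysis): success requires (a) at least one good candidate in
# the batch and (b) the verifier identifying it.

def success_probability(p_g, p_v, n):
    at_least_one_good = 1.0 - (1.0 - p_g) ** n
    return p_v * at_least_one_good

# Even a mediocre generator (p_g = 0.2) paired with a decent verifier
# (p_v = 0.8) improves sharply as N grows:
for n in (1, 5, 10):
    print(n, round(success_probability(0.2, 0.8, n), 3))
# 1  -> 0.16
# 5  -> 0.538
# 10 -> 0.714
```

Under these assumptions, success climbs from 16% to over 70% just by sampling more candidates, with the verifier's accuracy \(p_V\) setting the asymptotic ceiling.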
Experiments and Results
The team tested HAVE in simulation and the real world across three distinct tasks that involve fundamental ambiguity.
1. Multimodal Doors (The Push/Pull Problem)
They created a dataset of doors that look identical but open in different directions (push/pull, left/right).

In Figure 2, we see the failure rates. “Generator Only” (guessing based on geometry) fails often because it’s a coin toss. Baselines like FlowBot3D and Conditional Diffusion also struggle. HAVE, however, drastically reduces the failure rate. It also opens doors in fewer steps, proving it learns efficient exploration strategies.
2. Real-World Verification
Simulation is useful, but the real world is messy. The authors deployed HAVE on a Franka Emika Panda robot facing a custom ambiguous door.

The results in Figure 3 mimic the simulation. The “Baseline” (FlowBot3D) often gets stuck or takes many steps to open the door. HAVE consistently achieves a 100% success rate across different modes (Push-L, Push-R, etc.) and does so with fewer mean steps. The robot tries an action, realizes it failed, and immediately pivots to the correct mode.
To visualize this “thinking” process, look at the analysis below:

In Figure 18, the generator suggests “Push Left” frequently (the tall blue bar). However, the robot previously tried pushing left and failed (Step 1). Consequently, the Verifier assigns “Push Left” a very low score (negative value in the right-hand chart) and selects “Pull Right” instead, solving the task. This is explicit history-aware reasoning in action.
3. Uneven Object Pickup
The third task involves picking up objects with unknown centers of mass—like a heavy hammer or an unevenly weighted rod. If you grab it in the geometric center, it tilts and falls. You must explore to find the balance point.

The system needs to look at how the object tilted in previous attempts to guess where the mass is.

Figure 4 is particularly illuminating.
- Left (Conditional Diffusion): The baseline flails around. It tries far left, then far right, failing to narrow down the solution.
- Right (HAVE): The shaded bars represent the “theoretical center of mass range” derived from previous failures. HAVE’s selected actions (red dots) stay strictly within these logical bounds. It acts like a binary search algorithm, narrowing down the possibilities until it succeeds in just 3 steps.
The failure rates (Table 2) confirm this: HAVE reduces failure rates significantly on both known rods and unseen objects (like knives and bookmarks).
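The bisection-like behavior can be made explicit with a few lines of code. To be clear, the interval bookkeeping below is an analogy for what HAVE's learned verifier does implicitly, not the paper's algorithm, and the toy tilt physics is an assumption.

```python
# Analogy for HAVE's exploration on the uneven-pickup task: each failed
# grasp prunes part of the feasible center-of-mass interval, much like
# the shaded "theoretical center of mass range" described for Figure 4.

def tilt(grasp, com, tolerance=0.05):
    """Toy physics: the object balances only if we grasp within
    `tolerance` of the true center of mass `com` (all in [0, 1])."""
    if abs(grasp - com) <= tolerance:
        return 0                      # balanced: success
    return 1 if com > grasp else -1   # tilts toward the heavy side

def find_balance(com, lo=0.0, hi=1.0, max_steps=10):
    """Bisect the feasible interval using the tilt direction observed
    after each failed attempt."""
    for step in range(1, max_steps + 1):
        grasp = (lo + hi) / 2.0
        outcome = tilt(grasp, com)
        if outcome == 0:
            return step, grasp
        if outcome > 0:   # tilted right -> mass is further right
            lo = grasp
        else:             # tilted left -> mass is further left
            hi = grasp
    return max_steps, grasp

steps, grasp = find_balance(com=0.8)
print(steps, round(grasp, 3))
```

Because the interval halves on every failure, the grasp is guaranteed to land within tolerance of the center of mass in a handful of attempts, which is the qualitative behavior the red dots in Figure 4 display.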
Why Does HAVE Work So Well?
We can break down the success of HAVE into a few key analytical insights provided by the paper.
Robustness to Noise
The authors compared HAVE’s dual-branch architecture against a “Vanilla Transformer” that just concatenates actions and observations together.

Figure 5 shows that while both work well with perfect data (GT obs flow), HAVE is much more robust when using estimated flow (DELTA), which is what a real robot uses. By explicitly structuring the attention between “What I proposed” and “What happened historically,” HAVE is less prone to getting confused by noisy sensor data.
Sample Efficiency
How many generated actions does the verifier need to see to make a good decision?

Figure 6 (Right) shows that performance improves drastically as you increase the sample count from 1 to about 5. After that, returns diminish. This is great news for real-time robotics—you don’t need to generate 1,000 samples. Just generating 5-10 candidates and verifying them is enough to beat the baseline.
Learning from Failure
Finally, the researchers visualized the internal scores of the verifier to see if it truly “understands” failure.

Figure 21 confirms this. When the history contains a failure (e.g., 1-Step Failure History), the score for that specific “Failure Mode” drops near zero, while “Other Modes” rise. As the robot accumulates more failures (2-step, 3-step), the distinction becomes even sharper. The model effectively learns: “I have tried X and Y, so the answer must be Z.”
Conclusion
The HAVE paper presents a shift in how we approach robotic manipulation in ambiguous environments. Rather than trying to force a single generative model to “know it all,” the authors demonstrate the power of decoupling.
By using an unconditional generator to provide diversity and a history-aware verifier to provide logic, robots can:
- Reason about past interactions to avoid repeating mistakes.
- Narrow down possibilities in ambiguous scenarios (like finding a center of mass).
- Operate efficiently in the real world with noisy data.
This “Generate-then-Verify” approach mirrors how humans solve problems—we brainstorm solutions and then critically evaluate them based on our experience. Giving robots this same capability brings us one step closer to machines that can truly handle the unpredictable nature of the real world.