Learn from What We HAVE: History-Aware VErifier that Reasons about Past Interactions Online

Anonymous Authors

Abstract

We introduce a novel History-Aware VErifier (HAVE) to disambiguate uncertain scenarios online by leveraging past interactions. Robots frequently encounter visually ambiguous objects whose manipulation outcomes remain uncertain until the robot physically interacts with them. While generative models alone could theoretically adapt to such ambiguity, in practice they achieve suboptimal performance in ambiguous cases, even when conditioned on action history.

To address this, we propose explicitly decoupling action generation from verification: we use an unconditional diffusion-based generator to propose multiple candidate actions and employ our history-aware verifier to select the most promising action by reasoning about past interactions.

Through theoretical analysis, we demonstrate that employing a verifier significantly improves expected action quality. Empirical evaluations and analysis across multiple simulated and real-world environments, including articulated objects, multi-modal doors, and uneven object pick-up, confirm the effectiveness of our method and its improvements over baselines.

Real World Demos

We demonstrate the performance of HAVE on a real-world ambiguous door:

HAVE explores possible opening modes efficiently, avoiding repeating failed actions.

Example 1:

On a pull-left door, the agent succeeds within 3 steps: the verifier suppresses the failed pull-right mode after step 1, and both the generator and verifier recognize the slight opening after step 2, guiding the correct action.

Example 2:

On a pull-right door, the agent fails with a push-left action at step 1; with the verifier consistently suppressing this failure mode, it predicts the correct action at step 2 despite ambiguity in the generator.

Theoretical Motivation

Setting: Discrete Reward. An action is either good (reward = 1) or bad (reward = 0).

  • The GENERATOR proposes action candidates \( a \sim \pi_G \), with accuracy \(p_G = P\bigl(R_{gt}(a)=1\bigr)\).
  • The VERIFIER \(V(a)\in\{0,1\}\) predicts whether an action is good with accuracy \(p_V = P\bigl(V(a)=R_{gt}(a)\bigr)\).

We sample \(N\) independent candidates from the generator and execute the candidate with the highest verifier score (ties, and the case where all candidates receive score 0, are resolved arbitrarily).


Question: Does this "pick-best-by-verifier" rule improve the expected reward over picking a single random action?


Theorem A (Independent Case). If the verifier's overall accuracy satisfies \(p_V>0.5\) and \(N\ge2\), then selecting the top-scoring candidate increases the expected reward compared with using one generator sample.
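For intuition (a sketch, not the paper's proof), consider \(N=2\) in this independent setting with \(0<p_G<1\). If both candidates are good or both are bad, the selection rule cannot change the outcome. With one good and one bad candidate (probability \(2p_G(1-p_G)\)), the good one is selected with probability \(p_V^2 + p_V(1-p_V) = p_V\): it wins outright when the verifier labels it 1 and the bad one 0, and it wins half of the tie cases. The expected reward is therefore \(p_G^2 + 2p_G(1-p_G)\,p_V\), which exceeds the single-sample baseline \(p_G\) exactly when \(p_V>0.5\).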


Theorem A′ (General Case). Let \(p_{V1}\) be the verifier's true-positive rate (good→1) and \(p_{V0}\) its true-negative rate (bad→0). If \(p_{V1}+p_{V0}>1\) and \(N\ge2\), the same selection rule still improves expected reward.
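As a quick sanity check, the selection rule above is easy to simulate. The Python sketch below assumes the discrete-reward setting: candidates are good with probability \(p_G\), the verifier votes independently given the true label with rates \(p_{V1}\) and \(p_{V0}\), and ties (including the all-zero case) are broken uniformly at random. The function name and parameter values are illustrative, not taken from the paper.

```python
import random

def expected_reward(p_g, p_v1, p_v0, n_candidates, trials=100_000, seed=0):
    """Monte Carlo estimate of the reward when executing the candidate with the
    highest verifier score (ties and the all-zero case broken uniformly at random).

    p_g  : probability a generator sample is good (reward 1)
    p_v1 : verifier true-positive rate  P(V=1 | good)
    p_v0 : verifier true-negative rate  P(V=0 | bad)
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        labels = [1 if rng.random() < p_g else 0 for _ in range(n_candidates)]
        votes = [
            (1 if rng.random() < p_v1 else 0) if y == 1
            else (1 if rng.random() >= p_v0 else 0)
            for y in labels
        ]
        best = max(votes)
        picked = rng.choice([i for i, v in enumerate(votes) if v == best])
        total += labels[picked]
    return total / trials

# With p_V1 + p_V0 > 1 and N >= 2, the selected candidate beats a single sample (~p_G).
print(expected_reward(0.4, 0.7, 0.8, n_candidates=1))  # ~0.40, single-sample baseline
print(expected_reward(0.4, 0.7, 0.8, n_candidates=8))  # noticeably higher
```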


See Appendix A for numerical simulations and the extension to continuous reward.

See Section 4.1 for the generation-verification gap.

Quantitative Results

Simulation: Ambiguous Door


Real World: Ambiguous Door


Articulated Objects (Full Set)


Articulated Objects (Held-out)


Uneven Object Pick-up



Analysis

Performance with Different Sample Count

Time and Accuracy vs. Number of Samples: On the left, we plot the time used by the generator and verifier with respect to the number of generated samples; on the right, we plot the failure rate with respect to sample count, comparing HAVE with an oracle verifier (which always selects the best sample from a given batch).


Robustness to Noisy Flow

Robustness to Noisy Flow: HAVE outperforms a vanilla transformer baseline under noisy observation flow, demonstrating improved robustness across all articulated object environments.


Performance with History Length

History Length Analysis: Increasing the history length from 1 to 5 steps significantly improves verifier performance, while longer histories yield diminishing returns due to redundancy and potential instability.


Learning from Failure History (Articulated)

Failure History Analysis: Actions matching failed modes receive lower verifier scores, leading to nearly 100% of top-selected actions avoiding repeated failures, demonstrating HAVE’s ability to learn from failure histories.


Learning from Success History (Articulated)

Success History Analysis: Given a successful history, the verifier accurately identifies better grasp points and directions under severe occlusion, demonstrating its ability to generalize from past successes even with large viewpoint changes.


Learning from Failure History (Uneven)

Failure History (Uneven Objects): HAVE selects actions within the theoretical center-of-mass range at each step, converging to the ground truth with fewer attempts, while the generator baseline fails to leverage informative failure cues.


Model Architecture

HAVE scores action candidates by encoding past actions and results with PointNet++ and attention.
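To make this concrete, below is a minimal PyTorch sketch of one way such a history-aware scorer could be wired up. It is not the released implementation: the small shared MLP stands in for a PointNet++ backbone, and the module names, feature dimensions, and the way actions and outcomes are fused into per-step history tokens are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class HistoryAwareScorer(nn.Module):
    """Illustrative sketch: score a candidate action given past (action, outcome) steps.

    Assumptions (not from the paper): a generic point-cloud encoder stands in for
    PointNet++, each history step becomes one token built from its observation
    feature, action, and outcome, and the candidate action attends over those
    tokens before an MLP emits a scalar score.
    """

    def __init__(self, action_dim=7, obs_feat_dim=128, d_model=128, n_heads=4):
        super().__init__()
        # Stand-in for a PointNet++ backbone: maps (N_points, 3) clouds to features.
        self.obs_encoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, obs_feat_dim)
        )
        # One token per history step from (observation feature, action, outcome).
        self.step_proj = nn.Linear(obs_feat_dim + action_dim + 1, d_model)
        # Candidate action embedding used as the attention query.
        self.cand_proj = nn.Linear(action_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def encode_cloud(self, cloud):
        # cloud: (B, T, N_points, 3) -> (B, T, obs_feat_dim), max-pooled over points.
        return self.obs_encoder(cloud).max(dim=-2).values

    def forward(self, clouds, past_actions, past_outcomes, candidate):
        # clouds: (B, T, N, 3); past_actions: (B, T, A); past_outcomes: (B, T, 1)
        # candidate: (B, A) -> returns (B,) scores.
        obs = self.encode_cloud(clouds)
        tokens = self.step_proj(torch.cat([obs, past_actions, past_outcomes], dim=-1))
        query = self.cand_proj(candidate).unsqueeze(1)   # (B, 1, d_model)
        fused, _ = self.attn(query, tokens, tokens)      # attend over history tokens
        return self.head(fused.squeeze(1)).squeeze(-1)

# Toy usage: score 4 candidate actions against the same 3-step history.
model = HistoryAwareScorer()
clouds = torch.randn(4, 3, 256, 3)     # history repeated once per candidate
actions = torch.randn(4, 3, 7)
outcomes = torch.zeros(4, 3, 1)        # e.g., all past attempts failed
cands = torch.randn(4, 7)
scores = model(clouds, actions, outcomes, cands)  # execute the argmax candidate
```

At inference time, each generator candidate is scored against the shared interaction history and the highest-scoring candidate is executed, matching the pick-best-by-verifier rule above.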