Introduction
We often expect Artificial Intelligence to be an oracle: a system that provides the correct answer every time. But in the dynamic, messy real world, this expectation is unrealistic. Large Language Models (LLMs) and vision-language systems frequently “hallucinate”—they generate confident-sounding instructions that are factually incorrect or physically impossible to follow.
In the context of navigation—imagine a robot or an AI assistant guiding a visually impaired person through a building—a hallucination can be more than just annoying; it can be disorienting. If an AI tells you to “walk through the glass door” when there is only a solid wall, trust breaks down immediately.
But what if the AI didn’t need to be perfect? What if, instead of pretending to know everything, the AI could admit when it’s unsure?
This question implies a shift from autonomous problem solving to collaborative problem solving. This blog post explores the research paper “Successfully Guiding Humans with Imperfect Instructions by Highlighting Potential Errors and Suggesting Corrections,” which introduces a system called HEAR (Hallucination Detection and Remedy).

As illustrated in Figure 1, rather than simply generating a path, HEAR detects potentially wrong parts of its own instructions, highlights them for the user, and suggests alternatives. The results are surprising: showing users where the AI might be wrong actually helps them reach their destination significantly faster and more reliably.
Background: The Challenge of Grounded Instruction
The specific problem addressed in this research is Vision-and-Language Navigation (VLN). In this setup, a “Speaker” model looks at a route through a 3D simulated environment (specifically, the Matterport3D simulator) and generates natural language instructions to guide a human.
For example, a route might involve moving from a bedroom to a kitchen. The model analyzes the sequence of images along the path and outputs: “Walk out of the bedroom, turn left, and stop by the kitchen counter.”
The Hallucination Problem
State-of-the-art models (often based on architectures like T5) frequently generate hallucinations—phrases inconsistent with the visual reality. The researchers categorize these into two distinct types:
- Intrinsic Hallucinations: The object or direction exists but is described incorrectly.
  - Example: The instruction says “turn right,” but the correct path is “turn left.” Or it says “past the blue couch,” but the couch is red.
- Extrinsic Hallucinations: The instruction describes something that doesn’t exist at all on the route.
  - Example: “Walk down the hallway,” when the room actually leads directly outside with no hallway involved.
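Because the error type determines the remedy (a replacement for an intrinsic error, a deletion for an extrinsic one), it helps to picture each flagged phrase as carrying a type label. The following is a minimal, illustrative sketch; the class names and offsets are invented here, not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class HallucinationType(Enum):
    INTRINSIC = "intrinsic"   # exists on the route but is described incorrectly -> replace
    EXTRINSIC = "extrinsic"   # does not exist on the route at all -> delete

@dataclass
class FlaggedPhrase:
    text: str                 # the suspect phrase, e.g. "turn right"
    start: int                # character offsets within the instruction
    end: int
    kind: HallucinationType   # determines the remedy: replacement vs. deletion

# Example: "turn right" should have been "turn left" (intrinsic),
# and the mentioned hallway does not exist on the route (extrinsic).
flags = [
    FlaggedPhrase("turn right", 25, 35, HallucinationType.INTRINSIC),
    FlaggedPhrase("down the hallway", 42, 58, HallucinationType.EXTRINSIC),
]
```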
Upon inspecting standard speaker models, the researchers found that 67.5% of generated instructions contained hallucinations. This high error rate makes raw model outputs dangerous to rely on.
The Core Method: HEAR
To solve this, the researchers developed HEAR. The system does not try to generate a perfect instruction from scratch. Instead, it acts as a post-processing layer that audits the generated instruction, identifying risks and proposing fixes.
The architecture is split into two distinct models:
- Hallucination Detection: Finds the errors.
- Correction Suggestion: Suggests how to fix them.

1. Hallucination Detection
The first component (shown in the top half of Figure 2) is a binary classifier. It takes the visual route and the generated text as input, identifies candidate phrases within the instruction using Part-of-Speech tagging, and determines whether each phrase matches the visual evidence.
The model is fine-tuned from Airbert, a vision-language model pre-trained on captioning household scenes. It essentially asks: Does the phrase “turn right” match the visual transition from this image to the next? If the answer is no, the phrase is flagged as a hallucination.
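To make the detection pass concrete, here is a minimal sketch in Python. The `extract_phrases` and `classifier` callables are hypothetical stand-ins for the POS-based phrase extractor and the fine-tuned Airbert classifier; the paper's actual implementation may differ in its details.

```python
# Illustrative sketch of the detection pass (not the authors' code).
def detect_hallucinations(route_images, instruction, extract_phrases, classifier,
                          threshold=0.5):
    """Return phrases whose match probability with the visual route falls below threshold."""
    flagged = []
    for phrase in extract_phrases(instruction):        # e.g. "turn right", "the blue couch"
        # Probability that the phrase is consistent with the visual route,
        # as judged by the binary classifier.
        p_match = classifier(route_images, instruction, phrase)
        if p_match < threshold:
            flagged.append((phrase, 1.0 - p_match))    # keep a hallucination score
    return flagged
```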
2. Suggesting and Ranking Corrections
Once a phrase is flagged, the system needs to offer a solution. It generates a list of candidates (e.g., “turn left,” “go straight,” “stop”) and ranks them.
This creates a complex decision matrix. Should the system replace the phrase (Intrinsic error)? Or should it delete the phrase entirely (Extrinsic error)?
To handle this, the researchers use a scoring function that combines two probabilities. If we have a potentially erroneous phrase \(x\) and a suggested replacement \(\hat{x}\), the ranking score \(R(\hat{x})\) is calculated as:

\[ R(\hat{x}) = P_{I}(z=1 \mid x, y_{x}=1) \cdot P_{H}(y=1 \mid \hat{x}) \]
Let’s break this equation down:
- \(P_{I}(z=1 | x, y_{x}=1)\): This is the probability that the error is Intrinsic (requires replacement) rather than Extrinsic (requires deletion).
- \(P_{H}(y=1 | \hat{x})\): This is the probability that the new suggestion \(\hat{x}\) is actually a valid hallucination-free description.
Essentially, the system asks: How likely is it that we need a replacement, and how good is this specific replacement?
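The sketch below shows how that ranking might look in code. Here `p_intrinsic` and `p_clean` stand in for the outputs of the two models described above, and the way a deletion option competes with replacements is an assumption made for illustration rather than the paper's exact procedure.

```python
def rank_corrections(p_intrinsic, candidates, p_clean, top_k=3):
    """Rank candidate fixes for a flagged phrase; deletion wins when the error looks extrinsic."""
    scored = []
    for c in candidates:                         # e.g. ["turn left", "go straight", "stop"]
        r = p_intrinsic * p_clean(c)             # R(c) = P_I * P_H, as in the equation above
        scored.append((c, r))
    # Assumed here: a deletion option is favoured when the error is likely
    # extrinsic, i.e. when (1 - p_intrinsic) is high.
    scored.append(("[delete phrase]", 1.0 - p_intrinsic))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

print(rank_corrections(0.8, ["turn left", "go straight", "stop"],
                       {"turn left": 0.9, "go straight": 0.4, "stop": 0.2}.get))
```

In this toy call, “turn left” wins with a score of 0.72, followed by “go straight” (0.32) and the deletion option (0.2).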
3. Synthetic Training with “Broken” Data
A major hurdle in training these models is the lack of labeled data. We don’t have massive datasets of “bad instructions” paired with “corrections.”
The authors devised a clever solution: Synthetic Data Generation. They took correct, human-annotated instructions and intentionally “broke” them using two strategies:
- Rule-based perturbations: Swapping room names (e.g., replacing “kitchen” with “bedroom”).
- LLM perturbations: Asking GPT-3.5 and GPT-4 to rewrite phrases so they contradict the route (e.g., changing “walk through the door” to “walk past the door”).
This created a massive dataset of positive and negative examples to train the detection and correction models without expensive human annotation.
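A toy version of the two perturbation strategies might look like the sketch below. The room list, prompt wording, and the `call_llm` wrapper are invented for illustration; the paper's actual rules and prompts are more involved.

```python
import random

ROOMS = ["kitchen", "bedroom", "bathroom", "living room", "hallway"]

def rule_based_perturb(instruction):
    """Swap the first room name found for a different one, creating an intrinsic error."""
    for room in ROOMS:
        if room in instruction:
            wrong = random.choice([r for r in ROOMS if r != room])
            return instruction.replace(room, wrong, 1)
    return instruction

def llm_perturb(instruction, call_llm):
    """Ask an LLM (e.g. GPT-3.5/GPT-4) to make one phrase contradict the route."""
    prompt = (
        "Rewrite the following navigation instruction so that one directional "
        f"phrase becomes inconsistent with the original route:\n{instruction}"
    )
    return call_llm(prompt)

# Each (correct, broken) pair yields a positive and a negative training example.
correct = "Walk out of the bedroom, turn left, and stop by the kitchen counter."
broken = rule_based_perturb(correct)   # e.g. swaps "kitchen" for another room
```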
The User Interface
The technical backend is only half the battle. The success of HEAR depends on how this uncertainty is communicated to the user.
If you overwhelm a user with probability scores (e.g., “75% chance this is wrong”), they may get confused. Instead, HEAR uses a clean, intuitive design:
- Highlights: Potential errors are highlighted in orange.
- Interaction: Users can click a highlighted phrase to see a dropdown menu of the top-3 ranked corrections.

As shown in Figure 6 above, the user sees the instruction. If they suspect the orange text “turn right” is wrong, they click it, and the system might suggest “turn left” or “go straight.” This keeps the mental load low—information is only provided on demand.
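As a rough illustration of this interface idea, the sketch below wraps a flagged phrase in an orange highlight and attaches its top-3 suggestions as a dropdown. The real interface reveals suggestions only when the user clicks; this static HTML rendering is a simplification, not the study's actual front end.

```python
import html

def render_instruction(instruction, flags):
    """flags: list of (phrase, suggestions) pairs, suggestions already ranked (top 3)."""
    out = html.escape(instruction)
    for phrase, suggestions in flags:
        options = "".join(f"<option>{html.escape(s)}</option>" for s in suggestions)
        widget = (
            f'<span style="background:orange" title="possible error">{html.escape(phrase)}'
            f' <select>{options}</select></span>'
        )
        out = out.replace(html.escape(phrase), widget, 1)
    return out

print(render_instruction(
    "Walk out of the bedroom, turn right, and stop by the kitchen counter.",
    [("turn right", ["turn left", "go straight", "[delete]"])],
))
```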
Experiments and Results
The researchers conducted two types of evaluations: intrinsic (testing the model’s accuracy) and extrinsic (testing how well humans navigate using the system).
Intrinsic Model Performance
First, can HEAR actually find errors? The team compared HEAR against random baselines and ablated versions of their model.

Table 1 shows that HEAR significantly outperforms random guessing. While it isn’t perfect (an F1 score of 66.5 on the test set), it achieves a Recall@3 of 70.6%, meaning the correct fix appears among the top 3 suggestions roughly 70% of the time. This is reliable enough to be useful to a human.
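For reference, Recall@3 can be computed along these lines; the exact evaluation protocol (for example, how deletions are matched) may differ from this simplified sketch.

```python
def recall_at_k(ranked_suggestions, gold_corrections, k=3):
    """Fraction of flagged phrases whose gold fix appears among the top-k suggestions.

    ranked_suggestions: one ranked candidate list per flagged phrase.
    gold_corrections:   the human-labelled correct fixes, aligned by index.
    """
    hits = sum(
        1 for cands, gold in zip(ranked_suggestions, gold_corrections)
        if gold in cands[:k]
    )
    return hits / max(len(gold_corrections), 1)

# 2 of 3 phrases have their correct fix in the top 3 -> recall@3 ≈ 0.67
print(recall_at_k(
    [["turn left", "go straight", "stop"], ["the red couch", "the sofa"], ["[delete]"]],
    ["turn left", "the red couch", "go upstairs"],
))
```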
Extrinsic Human Evaluation
The real test was with 80 human users navigating virtual environments. They compared five setups:
- No communication: Standard instructions, no warnings.
- HEAR (no suggestion): Highlights errors but offers no fixes.
- HEAR: Highlights errors and suggests fixes.
- Oracle (no suggestion): Perfect (human-labeled) highlights.
- Oracle: Perfect highlights and suggestions.
The results were statistically significant and highly encouraging.

As seen in Figure 3 (Left), HEAR increased the success rate by roughly 13% compared to providing no communication (rising from ~68% to ~78%). Furthermore, the navigation error (distance from the goal) dropped by 29%.
Notably, HEAR performed competitively with the “Oracle” systems. This implies that even though the AI’s error detection isn’t perfect, it’s “good enough” to trigger better human decision-making.
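For readers unfamiliar with these metrics, here is a minimal sketch of how success rate and navigation error are commonly computed in VLN studies. The 3-meter success radius is the conventional threshold in this literature and is an assumption here, not a number taken from the paper.

```python
def navigation_metrics(final_distances, success_radius=3.0):
    """final_distances: final distance (in meters) from each participant to the goal."""
    successes = sum(1 for d in final_distances if d <= success_radius)
    return {
        "success_rate": successes / len(final_distances),                # fraction reaching the goal
        "navigation_error": sum(final_distances) / len(final_distances), # mean distance remaining
    }

print(navigation_metrics([1.2, 0.0, 5.4, 2.8]))  # -> 75% success, mean error 2.35 m
```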
Why Did Performance Improve?
The improvement wasn’t just because the AI gave the right answers. It was because the AI changed the human’s behavior.
Looking at the “Checks” chart in Figure 9 (below), we see that users provided with highlights and suggestions used the “Check if I’m at the goal” button significantly more often.

By highlighting uncertainty, HEAR prevented users from blindly following instructions. It encouraged them to:
- Stop and think.
- Look around the environment more carefully.
- Verify their position more frequently.
Even when the highlights were slightly wrong, they signaled to the user: “Be careful here.”

Figure 4 (above) illustrates a success case. The blue path shows a user guided by HEAR correctly turning left because the system flagged “turn right” as an error. The red path shows a user without HEAR blindly following the bad instruction and failing.
The “Complementary” Effect
One of the most fascinating findings is shown in the qualitative analysis below.

In this specific case, the instruction was wrong, and HEAR highlighted it. However, the top suggestion ([deleted]) was also technically incorrect. Yet, the user still succeeded.
Why? Because the highlight and the confusing suggestions reinforced the user’s suspicion that the instruction was garbage. The user stopped trusting the specific words and used their own intuition to analyze the scene, eventually finding the correct path. The AI didn’t provide the answer; it provided the doubt necessary for the human to find the answer.
Conclusion
The HEAR system demonstrates a vital lesson for the future of AI development: Perfection is not the only path to utility.
Trying to build a hallucination-free language model is an ongoing struggle. However, this research shows that we can dramatically improve human-AI collaboration simply by communicating uncertainty.
By equipping models with the ability to say “I might be wrong about this phrase, maybe try X instead,” we transform the AI from an unreliable authority into a helpful, albeit imperfect, assistant. The 13% increase in navigation success suggests that future systems shouldn’t just focus on generating better text, but on better meta-cognition—knowing what they don’t know, and sharing that with the user.
This approach creates a symbiotic relationship where the AI reduces the search space, and the human applies common sense and visual verification to close the gap.