Stop Guessing, Start Planning: How Blueprints Solve LLM Hallucinations

We have all seen it happen. You ask a Large Language Model (LLM) to write a biography about a niche author or summarize a recent news event. The output looks perfect—the grammar is flawless, the tone is authoritative, and the structure is logical. But upon closer inspection, you realize the model has invented a university degree the author never earned or cited an award that doesn’t exist.

This phenomenon is known as hallucination. It occurs because LLMs are probabilistic engines, not databases. When they reach the limits of their internal (parametric) memory—often due to the obscurity of the topic or the recency of the event—they fill in the gaps with statistically plausible, yet factually incorrect, information.

In a recent paper titled “Analysis of Plan-based Retrieval for Grounded Text Generation,” researchers from Google and UMass Amherst propose a compelling solution. They suggest that the key to fixing hallucinations isn’t just giving the model access to Google Search; it is teaching the model to plan its research before it writes a single word.

In this post, we will deconstruct their method, explore why “planning” is the missing link in Retrieval Augmented Generation (RAG), and analyze the empirical evidence showing how this approach leads to more trustworthy AI.

The Problem: The Confidence Trap

To understand the severity of the problem, consider the paper’s opening example. When the Falcon 180B model was asked to write a bio for the author Lorrie Moore, it produced a convincing paragraph. It listed her birth year, her education, and her awards.

The problem? Almost all the specific details were wrong. It claimed she was born in Glasgow, Kentucky (she was born in New York) and that she earned her MFA from the University of Wisconsin (she went to Cornell).

This is the “Confidence Trap.” The model generates text that looks right because it mimics the style of a biography, but it fails to ground that text in reality.

The Limits of Standard Retrieval

The standard fix for this is Retrieval Augmented Generation (RAG). In a typical RAG setup, the system takes the user’s prompt (e.g., “Write a bio for Lorrie Moore”), runs a web search, pastes the top search results into the model’s context window, and says, “Use these results to answer the user.”
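
To make that concrete, here is a minimal sketch of such a single-shot RAG loop. The `web_search` and `call_llm` helpers are hypothetical stand-ins for whatever search API and LLM client you use; none of this is the paper's implementation.

```python
# Minimal sketch of a standard "one-retrieval" RAG loop (assumed helpers,
# not the paper's implementation). web_search and call_llm are stubs
# standing in for a real search API and LLM client.

def web_search(query: str, k: int = 5) -> list[str]:
    """Return the top-k result snippets for a query (stubbed)."""
    return [f"[snippet {i} for: {query}]" for i in range(k)]

def call_llm(prompt: str) -> str:
    """Call an LLM of your choice (stubbed)."""
    return f"<model output for a {len(prompt)}-character prompt>"

def one_retrieval_rag(user_prompt: str) -> str:
    # 1. Run a single generic search using the user's prompt as the query.
    snippets = web_search(user_prompt)
    # 2. Paste the snippets into the context and ask the model to answer.
    prompt = (
        "Use only these search results to answer.\n\n"
        "Search results:\n" + "\n".join(snippets) + "\n\n"
        "Task: " + user_prompt
    )
    return call_llm(prompt)

print(one_retrieval_rag("Write a bio for Lorrie Moore"))
```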

While this helps, the researchers found it insufficient for long-form writing. A single generic search often fails to capture the fine-grained details needed for a comprehensive biography. It might find a general Wikipedia summary, but miss specific details about early life, specific literary themes, or a complete list of awards.

The Solution: Plan-Based Retrieval

The core contribution of this research is a method that mimics how a human researcher operates. If you were writing a biography, you wouldn’t just type a name into a search engine and copy the first result. You would:

  1. Outline the sections you want to write (Early Life, Career, Awards).
  2. Formulate specific questions for each section.
  3. Research those specific questions.
  4. Write the content based on the answers found.

The researchers implemented this exact workflow for LLMs.

The Architecture

The method, visualized below, breaks the generation process into distinct steps.

Figure 1: Summary of Planning and Retrieval used to generate text. Given an initial prompt, a plan is first generated that outlines the segments to be written. Next, search queries are generated for each segment, which are then used for fine-grained retrieval of source documents. The final response is generated conditioned on the plan, the queries, and the retrieved documents.

Let’s break down the flow shown in Figure 1:

  1. Initial Prompt & Search: The user asks for a bio. The system does a quick, high-level search to get basic context.
  2. QA Plan Generation: The model is prompted to create a “blueprint.” It decides, for example, that Paragraph 1 should cover early life, Paragraph 2 should cover writing style, and Paragraph 3 should cover awards.
  3. Question Generation: For each paragraph in the plan, the model generates specific search queries (e.g., “Where was Lorrie Moore born?”, “What awards did Lorrie Moore win?”).
  4. Fine-Grained Retrieval: The system performs searches for these specific questions, gathering targeted evidence.
  5. Grounded Generation: The final answer is generated using the blueprint and the specific answers found.

This approach transforms the LLM from a passive text predictor into an active researcher.
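
For contrast with the one-shot loop above, here is a minimal sketch of the plan-based pipeline, reusing the hypothetical `web_search` and `call_llm` stubs from the earlier snippet. The prompt wording is paraphrased for illustration; it is not the paper's exact templates.

```python
# Sketch of the plan-based pipeline: initial search -> outline -> questions
# -> fine-grained retrieval -> grounded generation. Reuses the web_search
# and call_llm stubs defined earlier; prompts are paraphrased.

def plan_based_generation(user_prompt: str) -> str:
    # Step 1: quick, high-level search for basic context.
    initial_snippets = web_search(user_prompt)

    # Step 2: ask the model for a paragraph-level blueprint.
    plan = call_llm(
        "Outline the paragraphs to write, one per line.\n"
        f"Task: {user_prompt}\nContext:\n" + "\n".join(initial_snippets)
    )

    # Step 3: generate specific research questions for each planned paragraph.
    raw_questions = call_llm(
        "For each paragraph in this outline, list the specific questions "
        f"that need answering, one per line.\nOutline:\n{plan}"
    )
    questions = [q.strip() for q in raw_questions.splitlines() if q.strip()]

    # Step 4: fine-grained retrieval, one targeted search per question.
    evidence = {q: web_search(q, k=3) for q in questions}

    # Step 5: generate the final text conditioned on plan, questions, evidence.
    return call_llm(
        "Write the response following the outline, using only the evidence "
        "gathered for each question. Skip any question without evidence.\n"
        f"Task: {user_prompt}\nOutline:\n{plan}\nEvidence:\n{evidence}"
    )
```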

Handling the “Unanswerable”

One of the most clever aspects of this method is how it handles missing information. In the Question Answering (QA) phase, if the retrieval system cannot find a confident answer to a generated question (e.g., “What is Lorrie Moore’s favorite color?”), the system marks it as unanswerable.

In the final prompt, the model is explicitly told not to write about questions that lack evidence. This simple step—acknowledging ignorance—significantly reduces the rate at which the model makes things up.
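
A minimal sketch of what that filtering could look like when assembling the final prompt, assuming each question arrives with an answer and a retrieval confidence; the threshold and prompt wording here are illustrative assumptions, not the paper's.

```python
# Sketch of flagging unanswerable questions in the final generation prompt.
# The confidence threshold and prompt wording are illustrative assumptions.

def build_final_prompt(task, plan, qa_pairs, min_confidence=0.5):
    lines = [f"Task: {task}", "Outline:", plan, "", "Evidence:"]
    for question, (answer, confidence) in qa_pairs.items():
        if answer is None or confidence < min_confidence:
            # Explicitly mark missing evidence so the model does not speculate.
            lines.append(f"Q: {question}\nA: UNANSWERABLE")
        else:
            lines.append(f"Q: {question}\nA: {answer}")
    lines.append(
        "Write the response following the outline. Do not make any claim "
        "about a question marked UNANSWERABLE."
    )
    return "\n".join(lines)

qa = {
    "Where was Lorrie Moore born?": ("New York", 0.9),
    "What is Lorrie Moore's favorite color?": (None, 0.0),
}
print(build_final_prompt("Write a bio for Lorrie Moore",
                         "1. Early life\n2. Awards", qa))
```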

Experimental Setup

To test this hypothesis, the authors compared their Plan-Based Retrieval against standard methods across four diverse datasets:

  • Wiki-Ent (Head): Popular entities with Wikipedia pages.
  • Wiki-Event (Head): Famous historical events.
  • Researcher (Tail): A challenging dataset of 106 researchers who are notable but do not have Wikipedia pages. This tests the model’s ability to handle “long-tail” knowledge.
  • News Events (Recency): Breaking news events that occurred after the model’s training cutoff.

They evaluated the models using AIS (Attributable to Identified Sources). This metric checks every sentence generated by the model and verifies if it is supported by the retrieved documents.
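
Based on that description (strict AIS credits a response only when every sentence is supported), the aggregation might look roughly like this; `sentence_is_supported` is a stand-in for the human or model-based attribution judgment used in practice.

```python
# Rough sketch of aggregating a strict attribution score: a response counts
# only if every one of its sentences is judged supported by the sources.
# sentence_is_supported is a stand-in for the real attribution judgment.

def strict_ais(responses, sentence_is_supported):
    """responses: list of (sentences, source_documents) pairs."""
    fully_supported = sum(
        1 for sentences, sources in responses
        if all(sentence_is_supported(s, sources) for s in sentences)
    )
    return fully_supported / len(responses)

# Toy usage with a naive substring-based "judgment".
toy = [
    (["Moore was born in New York."], ["Lorrie Moore was born in New York."]),
    (["She won the Nobel Prize."], ["She won the PEN/Malamud Award."]),
]
print(strict_ais(toy, lambda sent, docs: any(sent in d for d in docs)))  # 0.5
```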

Key Results

The results were decisive: Planning significantly outperforms standard retrieval.

1. Superior Attribution

The table below shows the performance using the text-unicorn model (a large, capable PaLM-2 variant).

Table 2: Comparison of Generation Approaches using the text-unicorn-001 model. We observe that Plan-based Retrieval improves upon the One-Retrieval and No Retrieval methods, even when One-Retrieval retrieves more results. Plan-based Retrieval Var. B produces much longer texts, which are more attributable compared to One-Retrieval. Var. A produces slightly more attributable text at a shorter length.

Looking at Table 2, pay attention to the “AIS Strict” column (which requires every sentence in the output to be supported by evidence):

  • No Retrieval: Scores near 0.00. Without external data, the model almost always hallucinates at least one detail.
  • One-Retrieval (Standard RAG): Scores around 60-68%. Better, but still prone to errors.
  • Plan-based Retrieval (Var A & B): Jumps to 80-88%.

This is a massive improvement. By simply structuring the retrieval process, the researchers reduced the hallucination rate dramatically. Even when they doubled the number of search snippets for the Standard RAG approach (“2x snippets”), it still couldn’t beat the Plan-based approach. This suggests that how you search matters more than how much you search.

2. The Power of “Chain of Thought” Outlining

Is the outline really necessary? Could we just ask the model to generate questions immediately?

The researchers ran an ablation study to find out. They compared the full method against a version that skipped the “paragraph outline” step and went straight to generating search questions.

Table 5: Importance of Using an Outline for Question Generation (on News Events). We compare different retrieval methods while using the same final generation prompt with text-unicorn-001. We see that using the chain-of-thought style paragraph outline helps to produce more grounded responses than an approach that simply generates questions from the initial search results.

As shown in Table 5, the “w/o plan” version dropped significantly in strict attribution (from 87.18 down to 68.59).

This supports the “Chain of Thought” theory: forcing the model to articulate a high-level plan first (the blueprint) helps it generate better, more relevant questions later. The planning step acts as a cognitive scaffold, ensuring the search queries are comprehensive and logically organized.
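
As a rough sketch of what this ablation changes, here are the two question-generation variants side by side, with paraphrased prompts (not the paper's templates) and the hypothetical `call_llm` stub from the first snippet passed in.

```python
# Sketch of the two question-generation variants from the ablation.
# Prompt wording is paraphrased, not the paper's exact templates.

def questions_with_plan(task, initial_snippets, call_llm):
    # Chain-of-thought style: write the outline first, then derive questions
    # from it, paragraph by paragraph.
    plan = call_llm(
        f"Outline the paragraphs needed for: {task}\n"
        "Context:\n" + "\n".join(initial_snippets)
    )
    return call_llm(
        "For each paragraph in the outline, list the questions to research.\n"
        f"Outline:\n{plan}"
    )

def questions_without_plan(task, initial_snippets, call_llm):
    # Ablated variant: questions come straight from the initial results,
    # with no intermediate blueprint to organize them.
    return call_llm(
        f"List the questions to research for: {task}\n"
        "Context:\n" + "\n".join(initial_snippets)
    )
```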

3. Qualitative Improvements

Beyond the numbers, the plan-based approach produces better content. It avoids the “repetitive stitching” often seen in standard RAG, where models just paste together search snippets.

Figure 2: Example Generation. One of the facts hallucinated by the One-Retrieval model is the focus of one of the questions in the question-based plan.

Figure 2 illustrates this with the example of the “2023 Johannesburg building fire.”

  • One-Retrieval (Standard): The model hallucinates that the fire started on the “second floor.”
  • Plan-based Retrieval: The model explicitly asked “What was the cause?” and “Where did it start?” during the planning phase. The retrieval returned that the cause was unknown. Consequently, the Plan-based generation correctly states the cause is under investigation, avoiding the hallucination.

4. Human Preference

Does this rigorous fact-checking make the text dry or robotic? Surprisingly, no.

Table 9: Head-to-head Informativeness Comparison. Plan-based Retrieval leads to model generations that are judged more informative about 55% of the time compared to standard retrieval. The improvements are clearer when the LLM is queried about long-tail entities and recent events.

In a head-to-head human evaluation (Table 9), annotators found the Plan-based generations to be more informative about 55% of the time, while maintaining equal fluency. Because the model plans its content, it covers more ground and provides richer details than a model simply summarizing a top-ranking search result.

Generalization: It Works on Open Source Models

A common criticism of such papers is that they only work on proprietary, massive models like Google’s PaLM or OpenAI’s GPT-4. To address this, the authors tested their method on Mistral-7B-Instruct, a popular open-weight model.

Table 7: Comparison of Generation Approaches using Mistral-7B-Instruct-v0.3. We see that Plan-based Retrieval (Var. B) leads to the generation of more attributable text than the One-Retrieval and One-Retrieval (2x snippets) baseline methods. In this setting, Plan-based Retrieval (Var. A) improves over One-Retrieval but not One-Retrieval (2x snippets). These results demonstrate that plan-based retrieval with model-specific tuning is a useful approach for grounded, long-form generation.

Table 7 shows that the benefits hold true even for smaller, open models. Plan-based Retrieval (Var. B) achieved a strict AIS score of 25.0, significantly higher than the standard One-Retrieval score of 11.7. This is encouraging for students and developers who want to implement grounded generation systems without relying solely on massive API-based models.

Why Does the Second Search Matter?

You might wonder: does the system really need to search the web twice (once for the initial context, and again for the specific questions)?

The data says yes.

Table 3: Importance of Gathering Additional Information from a Second Search. We compare Plan-based Retrieval (Var. B) with text-bison-001 in two settings: one using the typical secondary retrieval step, and one using only the original search results as the source documents. We measure performance on the Wiki-Ent dataset. We find that attribution is indeed improved by gathering more facts and information.

Table 3 compares the full method against a version that plans questions but tries to answer them using only the initial search results (“w/o 2nd search”). The attribution score drops when the second search is removed.

This confirms that the “Planning” step isn’t just about organizing thoughts; it is about information discovery. The initial broad search simply doesn’t contain the specific details needed to answer specific questions like “What year did the subject graduate?” or “What was the cause of the event?”. The plan guides the retriever to find information that would otherwise be missed.
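
In code terms, the ablation simply swaps where the evidence for each planned question comes from; a minimal sketch, again using the hypothetical `web_search` stub from the first snippet.

```python
# Sketch of the two evidence-gathering settings compared in Table 3.

def evidence_without_second_search(questions, initial_snippets):
    # "w/o 2nd search": every question must be answered from the fixed pool
    # of documents returned by the initial, generic search.
    return {q: initial_snippets for q in questions}

def evidence_with_second_search(questions, web_search):
    # Full method: each question triggers its own targeted retrieval,
    # surfacing details the broad initial search never returned.
    return {q: web_search(q, k=3) for q in questions}
```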

Conclusion and Implications

The paper “Analysis of Plan-based Retrieval for Grounded Text Generation” provides a clear roadmap for the future of reliable AI writing. The takeaways are significant for anyone building RAG systems:

  1. Don’t just Retrieve, Plan: A simple “Search → Generate” loop is insufficient for complex topics. Adding a “Plan → Question → Search” loop drastically improves reliability.
  2. Explicitly Handle the Unknown: Designing systems that can identify and label “unanswerable” questions is a powerful defense against hallucination.
  3. Structure Beats Volume: Better prompts and structured workflows often yield better results than simply increasing the number of documents retrieved.

As we move toward AI agents that perform research and report writing, methods like Plan-Based Retrieval move us away from stochastic parrots and toward systems that can reason, research, and cite their sources with genuine authority.