MindRef: Teaching LLMs to Remember Like Humans Using Hierarchical Retrieval
Imagine you are trying to remember a specific detail from a book you read years ago—perhaps the name of a minor character in Harry Potter. Your brain doesn’t scan every sentence of every book you’ve ever read in a linear fashion. Instead, you likely perform a hierarchical search: first, you recall the specific book (Harry Potter and the Goblet of Fire), and then you mentally zoom in on the relevant scene to retrieve the name.
This “document-to-detail” process is efficient, flexible, and natural. However, traditional Information Retrieval (IR) systems used by AI do not work this way. Instead, they rely on pre-segmenting (chunking) massive databases of text into fixed sizes (e.g., 100-word blocks) and indexing them. When you ask a question, the system searches these millions of disjointed chunks to find a match. This often breaks context and limits flexibility.
But what if Large Language Models (LLMs) could mimic human memory? What if they could first identify the correct document and then “read” the specific passage starting from any location, without relying on arbitrary pre-cut chunks?
This is the premise of MindRef, a framework proposed by researchers Wang, Xu, and Ding. In this post, we will break down their research paper to understand how they combined classical data structures (Tries and FM-Indices) with modern LLMs to create a retrieval system that is both cognitively plausible and computationally efficient.
The Problem with “Chunking”
Before diving into the solution, we need to understand the bottleneck of current methods.
In standard Retrieval-Augmented Generation (RAG) pipelines, the knowledge source (like Wikipedia) is split into chunks. If a relevant answer spans the boundary of two chunks, or if the chunk size is too small to provide context, the retrieval model fails.
Recent advances have proposed Generative Retrieval, where an LLM is trained to memorize document identifiers (like titles). When asked a question, the LLM generates the title of the relevant document. While promising, this approach usually points to a whole document (which is too long to read completely) or a pre-defined passage ID (which returns us to the chunking problem).
The researchers behind MindRef asked a simple question: “Can LLMs bypass chunking to recall references from any position?”

As illustrated above, the goal is to simulate the human two-step process:
- Stage One: Recall the broader context (the document title).
- Stage Two: Zero in on the specific details (the passage) required to answer the query.
The MindRef Framework
MindRef is a two-stage retrieval framework designed to enable “chunkless” reference localization. It doesn’t rely on an external search engine or vector index (such as Elasticsearch or Faiss) during inference. Instead, it leverages the parametric knowledge the LLM already acquired during pre-training, guided by clever data structures to ensure accuracy.
The architecture is elegant in its separation of concerns:

Let’s break down these two stages in detail.
Stage 1: Coarse-Grained Document Recall
The first step is narrowing down the search space from “the entire internet” to “a few relevant documents.” The model is given a query (e.g., “How old was Sasuke when his clan died?”) and must generate the title of a Wikipedia page that likely contains the answer.
However, LLMs are prone to hallucination. If we simply ask an LLM to “Name a Wikipedia page about Sasuke,” it might invent a title that doesn’t exist. To prevent this, MindRef uses Constrained Decoding via a Trie (Prefix Tree).
The Role of the Trie
A Trie (prefix tree) is a tree data structure for storing a set of strings, in which all strings sharing a common prefix share the same path from the root. In MindRef, the researchers built a Trie containing all valid Wikipedia titles.
When the LLM generates the document title, it isn’t allowed to pick just any token from its vocabulary. It is “guided” by the Trie. If the LLM has generated “Sasuke Uch”, the Trie looks at the valid next letters stored in its branches (e.g., “i”) and forces the LLM to choose from valid continuations (leading to “Sasuke Uchiha”). This guarantees that every generated title actually exists in the database.
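To make this concrete, here is a minimal sketch of how a title Trie can drive constrained decoding. The class and method names (`TitleTrie`, `allowed_next_tokens`) are illustrative rather than the paper’s actual implementation, and a character-level “tokenizer” stands in for a real subword vocabulary.

```python
# Minimal sketch of Trie-constrained title decoding (illustrative, not the
# authors' code). Titles are stored as token sequences; at each decoding step
# only tokens that keep the output on a valid title path are allowed.

class TitleTrie:
    def __init__(self):
        self.children = {}      # token -> child TitleTrie node
        self.is_end = False     # True if a complete title ends at this node

    def insert(self, tokens):
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, TitleTrie())
        node.is_end = True

    def allowed_next_tokens(self, prefix):
        """Tokens that extend `prefix` toward at least one valid title."""
        node = self
        for tok in prefix:
            node = node.children.get(tok)
            if node is None:
                return set()    # the prefix has already left every valid title
        return set(node.children)


# Toy usage with characters standing in for subword tokens.
trie = TitleTrie()
for title in ["Sasuke Uchiha", "Sasuke Sarutobi", "Itachi Uchiha"]:
    trie.insert(list(title))

print(trie.allowed_next_tokens(list("Sasuke ")))     # {'U', 'S'}
print(trie.allowed_next_tokens(list("Sasuke Uch")))  # {'i'}
```

In practice, a lookup like this can be wired into the decoding loop (for instance via the `prefix_allowed_tokens_fn` hook of Hugging Face’s `generate()`), so that beam search never leaves the space of real Wikipedia titles.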

As seen in panel (a) of the image above, the Trie restricts the search space, ensuring valid title generation. The model generates multiple candidate titles using beam search, and each title is scored. The score for a generated title \(t\) is calculated based on the average log-probability of its tokens:
\[
\text{score}_{\text{title}}(q, t) = \frac{1}{|y_t|} \sum_{i=1}^{|y_t|} \log P\left(y_{t,i} \mid q, y_{t,<i}\right) \tag{1}
\]
Here, \(y_t\) represents the tokens in the title, and the score reflects how confident the LLM is that this title is relevant to the prompt.
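As a quick illustration of this length-normalized scoring, the snippet below averages per-token log-probabilities for two candidate titles; the numbers are invented for the example.

```python
def average_log_prob(token_log_probs):
    """Length-normalized confidence score for a generated sequence."""
    return sum(token_log_probs) / len(token_log_probs)

# Hypothetical per-token log-probabilities produced during beam search.
candidates = {
    "Sasuke Uchiha": [-0.1, -0.3, -0.2],
    "Itachi Uchiha": [-0.8, -0.9, -1.1],
}

scores = {title: average_log_prob(lps) for title, lps in candidates.items()}
print(max(scores, key=scores.get))   # 'Sasuke Uchiha' (higher, i.e. less negative, score)
```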
Stage 2: Fine-Grained Passage Recall
Once the system has the top \(k\) relevant documents (e.g., the Wikipedia pages for “Sasuke Uchiha” and “Itachi Uchiha”), the next challenge is finding the exact sentence or paragraph that answers the question.
We cannot simply feed the full documents into the LLM context window—they might be too long. We also don’t want to use pre-cut chunks. Instead, we ask the LLM to generate the relevant text verbatim from the document.
But again, we face the hallucination risk. The LLM might generate a sentence that sounds plausible but isn’t actually in the source text. To solve this, MindRef employs a more complex data structure: the FM-Index.
The FM-Index
While a Trie works well for short strings like titles, it is too memory-intensive to index the full text of thousands of Wikipedia articles. The FM-Index (based on the Burrows-Wheeler Transform) is a compressed data structure that allows for fast substring queries.
For the top \(k\) documents retrieved in Stage 1, MindRef constructs a targeted FM-Index. When the LLM starts generating the passage, the FM-Index acts as a constraint mechanism similar to the Trie: it dynamically checks the tokens generated so far and restricts the LLM’s next-token distribution to tokens that extend a sequence actually present in the source documents.
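The sketch below captures the constraint idea with a deliberately naive stand-in: instead of a real FM-Index it tests substring membership with Python’s `in` operator, which gives the same behavior on toy data but none of the FM-Index’s memory or speed advantages. The vocabulary and document text are invented for the example.

```python
# Naive stand-in for FM-Index constrained decoding: a token is allowed only if
# the generated sequence extended by that token still occurs verbatim in the
# retrieved documents. A real FM-Index answers this substring query in
# compressed space; plain `in` is used here purely for clarity.

def allowed_continuations(generated_prefix, corpus_text, vocab):
    return {tok for tok in vocab if (generated_prefix + tok) in corpus_text}

corpus = "Sasuke Uchiha is a fictional character. His clan was massacred by Itachi."
vocab = [" massacred", " invented", " Uchiha", " clan"]   # toy subword-style vocabulary

print(allowed_continuations("His clan was", corpus, vocab))   # {' massacred'}
```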
The underlying mechanism relies on the Burrows-Wheeler Transform matrix, which permutes the text to make it compressible and searchable:

By using the FM-Index, the LLM can start generating text from any arbitrary position in the document. It effectively “reads” the document from memory, but the FM-Index ensures it doesn’t misquote a single word.
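If the Burrows-Wheeler Transform sounds abstract, here is a toy version showing where the searchable structure comes from; the FM-Index used in the paper adds rank and occurrence tables on top of this transformed string, which this sketch omits.

```python
# Toy Burrows-Wheeler Transform: list every rotation of the text (with a '$'
# end marker), sort the rotations, and read off the last column. Runs of
# repeated characters in the output are what make the index compressible.

def bwt(text):
    text = text + "$"                                      # unique end-of-text marker
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("banana"))   # 'annb$aa'
```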
The scoring for this stage is similar to the first, calculating the confidence of the generated passage:
\[
\text{score}_{\text{psg}}(q, p) = \frac{1}{|y_p|} \sum_{i=1}^{|y_p|} \log P\left(y_{p,i} \mid q, y_{p,<i}\right) \tag{2}
\]
Combining the Signals
The final relevance of a retrieved passage isn’t just about how well the passage matches the query, but also how relevant the parent document was. MindRef combines the scores from both stages using a weighted sum:
\[
\text{score}(q, p) = \alpha \cdot \text{score}_{\text{title}}(q, t) + (1 - \alpha) \cdot \text{score}_{\text{psg}}(q, p) \tag{3}
\]
Here, \(\alpha\) is a hyperparameter that balances the importance of the document title relevance versus the specific passage relevance.
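Here is a small sketch of that combination with made-up scores, writing the weighted sum as a convex combination, which is one plausible reading of the formula above.

```python
# Combine document-level and passage-level confidence (cf. Equation 3).
# alpha is a hyperparameter; the exact weighting form is an assumption here.

def combined_score(title_score, passage_score, alpha=0.5):
    return alpha * title_score + (1 - alpha) * passage_score

# Hypothetical (title_score, passage_score) pairs, as average log-probabilities.
candidates = {
    "passage A": (-0.2, -0.6),   # strong document match, weaker passage match
    "passage B": (-0.5, -0.3),   # weaker document match, stronger passage match
}

for name, (t_score, p_score) in candidates.items():
    print(name, round(combined_score(t_score, p_score, alpha=0.7), 3))
# With alpha=0.7 the document-level signal dominates, so passage A ranks first.
```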
Optimization: Short Prefix Recall & Localization (SPRL)
Generating a full 200-token paragraph token-by-token is slow. To make MindRef practical, the researchers introduced Short Prefix Recall and Localization (SPRL).
Instead of generating the whole passage, the LLM is prompted to generate only a short prefix (e.g., the first 16 tokens of the relevant sentence). Because of the FM-Index constraint, this short sequence is usually unique enough to pinpoint a specific location in the document.
The process works like this:
- Recall: The LLM generates a short prefix (e.g., “The main substrates of chymotrypsin are…”).
- Locate: The system uses the KMP (Knuth-Morris-Pratt) string-matching algorithm to find the exact offset of this prefix in the document.
- Expand: Once the location is found, the system simply extracts the surrounding text (e.g., the next 150 words) directly from the original file.
This technique speeds up inference by 4x while maintaining over 95% of the accuracy.
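Below is a minimal sketch of the locate-and-expand step, with Python’s `str.find` standing in for KMP and a character-based window standing in for the paper’s word-based expansion; the document text, prefix, and function name are illustrative.

```python
# Locate the recalled prefix in the source document, then slice out the
# surrounding text instead of generating it token by token. The window is in
# characters here; the paper expands by words.

def locate_and_expand(document, prefix, window=150):
    offset = document.find(prefix)   # exact-match search (KMP plays this role in the paper)
    if offset == -1:
        return None                  # the prefix was not recalled verbatim
    return document[offset:offset + window]

doc = ("Chymotrypsin is a digestive enzyme. The main substrates of chymotrypsin "
       "are peptide bonds in which the amino acid N-terminal to the bond is "
       "tryptophan, tyrosine, or phenylalanine.")

print(locate_and_expand(doc, "The main substrates of chymotrypsin"))
```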
Experiments and Results
The researchers evaluated MindRef on the KILT benchmark, a suite of knowledge-intensive tasks including Open-Domain QA (Natural Questions, TriviaQA), Fact Checking (FEVER), and Dialogue (Wizard of Wikipedia).
They compared MindRef against:
- BM25: The standard keyword-based sparse retrieval method.
- Contriever: A popular unsupervised dense retrieval model.
- DPR: Dense Passage Retrieval (supervised/fine-tuned).
1. Can it find the right document? (Page-Level Results)
The first test was whether MindRef could identify the correct Wikipedia page.

As shown in Table 1, MindRef (using LLaMA-2-13b) dominates the unsupervised baselines. On the FEVER dataset (Fact Checking), it achieved an R-Precision of 83.69, vastly outperforming BM25 (52.09). It even outperformed the fully supervised DPR model on several tasks. This confirms that LLMs have a strong internal “index” of knowledge that can be unlocked via title generation.
2. Can it find the right answer? (Passage-Level Results)
Finding the document is only half the battle. Table 2 shows how well MindRef retrieved the specific passage containing the answer.

MindRef consistently achieved the highest scores on TriviaQA, HotpotQA, and FEVER. While DPR performed better on Natural Questions (NQ), MindRef’s ability to locate passages without any retrieval-specific training is impressive. It shows that the “recall then locate” method is a viable alternative to “index and search.”
3. Does it help downstream tasks?
Finally, the researchers fed the retrieved passages into a reader model to answer the actual user questions.

The results in Table 3 confirm that better retrieval leads to better answers. MindRef-boosted LLaMA-2 achieved state-of-the-art results on FEVER (78.79% accuracy) and strong performance on TriviaQA.
Ablation Studies: What matters most?
The researchers stripped parts of the model away to see what was driving performance.

- w/o weight: Removing the weighted score (Equation 3) hurt performance, proving that document relevance (Stage 1) is crucial context for passage selection.
- w/o SPRL: Surprisingly, removing the short prefix optimization (and generating full text instead) sometimes decreased performance. The authors suggest that generating longer text introduces more chances for the beam search to drift or introduce noise.
- w/o first stage: Skipping the document recall and trying to search the whole corpus immediately caused a massive drop in accuracy. The hierarchical approach is essential.
Analysis: The Impact of Prefix Length
One of the most interesting analyses in the paper concerns the length of the prefix used in SPRL. Intuitively, you might expect a longer prefix to lead to better accuracy.

Figure 5 reveals a counter-intuitive finding: longer isn’t always better. On the HotpotQA and NQ datasets, performance peaks at a relatively short prefix length (around 8-16 tokens) and then degrades or plateaus. This validates the SPRL strategy: the LLM only needs to recall enough of the passage to “point” to its location. Once the location is identified, a simple string lookup in the original document does the rest of the work.
Conclusion and Future Implications
MindRef represents a significant shift in how we think about retrieval. Instead of treating text as a collection of static chunks to be indexed by an external search engine, MindRef treats the text as a continuous stream that an LLM can navigate using its internal memory.
Key Takeaways:
- Mimicking Humans: The two-stage “Document Title -> Specific Passage” workflow aligns more closely with human cognitive processes than traditional dense retrieval.
- No Pre-Chunking: By using FM-Indices, retrieval can start at any character, offering granular flexibility that chunk-based methods lack.
- Efficiency: The SPRL method proves that LLMs function best as “pointers” rather than “readers” during the retrieval phase—recalling a short snippet is enough to find the full data.
Limitations: While MindRef is powerful, it relies heavily on the LLM’s pre-training. If a document title is obscure or wasn’t in the training data, the model struggles to recall it in Stage 1. Furthermore, updating the model with new knowledge requires retraining (or complex prompting), whereas a traditional search engine just needs to index the new file.
Nevertheless, MindRef offers a glimpse into a future where the boundary between “memory” (the neural network) and “storage” (the database) becomes increasingly blurred, leading to more fluid and capable AI systems.