The current standard for making Large Language Models (LLMs) factual is Retrieval Augmented Generation, or RAG. The premise is simple: before the LLM answers a question, a separate “retriever” system scans a database, finds relevant documents, and pastes them into the LLM’s context window.

It works, but it is architecturally clunky. You have two distinct models—a dense retriever (like a dual-encoder) and a generator (the LLM)—that often don’t speak the same language. They have to be trained or tuned separately, and the pipeline requires handing off data between systems.

What if we didn’t need a separate retriever? What if the LLM could simply “generate” the document it needs from the database, verbatim, as part of its natural writing process?

This is the question posed by researchers at Google DeepMind in their paper “From RAG to RICHES: Retrieval Interlaced with Sequence Generation.” They propose a system that eliminates the separate retrieval step entirely, allowing a single LLM to perform multi-hop reasoning and evidence retrieval in a single decoding pass.

Figure 1: Example RICHES outputs for multi-hop queries with a single LLM and decoding pass. The green quoted text is “retrieved”, i.e., generated verbatim from the retrieval corpus. RICHES generation natively interleaves thoughts and multiple pieces of retrieved evidence.

As shown in Figure 1, the model doesn’t just answer; it thinks, decides it needs evidence, “generates” that evidence directly from a corpus (represented by the green text), and then synthesizes an answer.

The Problem with Pipelines

To understand why RICHES is significant, we have to look at the friction in current RAG systems.

In a standard setup, you have a “Retriever” and a “Reader.” The Retriever converts your query into a vector embedding and searches a massive vector database for similar embeddings. This relies heavily on the quality of the embeddings. If the Retriever’s training distribution differs from your current task (domain shift), it might fail to find the right document.

Furthermore, this is a distinct step. The LLM pauses, the retrieval happens, the context is updated, and then the LLM resumes. If the question requires multiple steps (multi-hop QA), this “stop-and-go” interaction becomes a complex loop of sub-queries and API calls.

RICHES (Retrieval Interlaced with Sequence Generation) proposes a radical simplification: LLM decoding is a search process.

When an LLM generates text, it searches through the space of all possible token sequences. If we can constrain that search space so that the model is only allowed to generate sequences that actually exist in our document corpus, the “generation” becomes a “retrieval.”

The Core Method: Constrained Decoding

The heart of RICHES is constrained beam decoding.

In a normal LLM, the probability of the next token is determined solely by the model’s training weights and the context so far.

\[ P_{\theta}(\mathbf{y} \mid \mathbf{x}) = \prod_{i=0}^{|\mathbf{y}|} P(y_i \mid y_0, \ldots, y_{i-1}, \mathbf{x}, \theta) \]
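As a minimal sketch of this chain-rule factorization, the snippet below scores an output sequence token by token. The `model_logprobs(context_tokens) -> dict[token, log_prob]` interface is a hypothetical stand-in for an LLM's next-token distribution, not a real library API.

```python
def sequence_log_prob(model_logprobs, prompt_tokens, output_tokens):
    """Chain-rule scoring of P(y | x): accumulate log P(y_i | y_<i, x) token by token."""
    context = list(prompt_tokens)       # x
    total = 0.0
    for tok in output_tokens:           # y_0 ... y_n
        dist = model_logprobs(context)  # log P( . | y_<i, x, theta)
        total += dist[tok]
        context.append(tok)             # condition on the token just emitted
    return total                        # log P(y | x)
```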

RICHES modifies this probability distribution. It introduces a constraint that forces specific spans of the output (the “retrieval keys”) to exist within a pre-defined set of valid documents \(K\).

\[ P_{\theta}(\mathbf{y} \mid \mathbf{x}, K) = \frac{1}{Z} \prod_{\mathbf{q} \in \mathcal{Q}(\mathbf{y})} \mathbb{1}_{K}(\mathbf{q}) \times \prod_{i=0}^{n} P(y_i \mid y_0, \ldots, y_{i-1}, \mathbf{x}, \theta) \]

While the equation above looks dense, the implementation is intuitive. When the model enters a “retrieval mode” (marked by special tokens like « and »), we restrict the vocabulary. We look at the prefix the model has generated so far, check which tokens could validly continue that sentence according to our document corpus, and set the probability of all other tokens to zero (or negative infinity).
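A minimal sketch of that masking step is shown below. `logits` is a dict mapping tokens to scores, and `corpus_index.continuations(prefix)` is a hypothetical lookup returning the tokens that can follow `prefix` in the corpus (RICHES answers this query with an FM-index; a toy version of the same interface is sketched in the FM-Index section below).

```python
NEG_INF = float("-inf")

def mask_to_corpus(logits, retrieval_prefix, corpus_index):
    """Inside a « ... » retrieval block, block every token that cannot extend
    the span generated so far into a string present in the corpus."""
    allowed = corpus_index.continuations(retrieval_prefix)
    return {tok: (score if tok in allowed else NEG_INF)
            for tok, score in logits.items()}
```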

Visualizing the Constraint

Imagine the model wants to cite who played the Joker. It has generated the prefix “Joker is played by”.

Figure 3: Illustration of the constrained decoding process. Given the prefix “Joker is played by”, the continuation “Nolan” is not found in the corpus and is therefore masked out.

As visualized in Figure 3, the LLM might inherently want to predict “Nolan” (perhaps confusing the director with the actor) or “Jared”. However, the system checks the corpus. The corpus contains sentences like “Joker is played by Ledger…” or “Joker is played by actor…”. It does not contain “Joker is played by Nolan”.

Consequently, the system applies a mask. The token “Nolan” gets a probability of 0. The token “Ledger”, which exists in the corpus, is allowed to pass through. The model is forced to be factual by the constraints of the database.

The FM-Index

You might wonder: How can we check if a sequence exists in a billion-token corpus at every single step of generation without slowing the model to a crawl?

The researchers use an FM-Index. This is a compressed suffix array structure often used in bioinformatics for DNA sequencing. It allows for incredibly fast substring searches. Crucially, the time it takes to find the next allowed token is independent of the corpus size—it depends only on the vocabulary size. This allows RICHES to scale to corpora like Wikipedia (millions of documents) while still generating text in real-time.
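To make the interface concrete, here is a naive toy stand-in for that lookup, with the same `continuations` method assumed by the masking sketch above. Unlike a real FM-index, it only indexes prefixes of whole keys and its memory grows with the corpus; it exists purely to illustrate the query being answered. The strings are illustrative, not corpus text.

```python
from collections import defaultdict

class ToyCorpusIndex:
    """Naive stand-in for the FM-index used by RICHES: precompute, for every
    prefix of every retrieval key, the set of tokens that may follow it.
    A real FM-index answers the same query over arbitrary substrings of a huge
    corpus in time that depends only on the vocabulary size."""

    def __init__(self, tokenized_keys):
        self._next = defaultdict(set)
        for key in tokenized_keys:
            for i in range(len(key)):
                self._next[tuple(key[:i])].add(key[i])

    def continuations(self, prefix_tokens):
        return self._next.get(tuple(prefix_tokens), set())

# Toy usage with whitespace "tokens":
index = ToyCorpusIndex([
    "Joker is played by Ledger in The Dark Knight".split(),
    "Joker is played by actor Joaquin Phoenix".split(),
])
print(index.continuations("Joker is played by".split()))
# {'Ledger', 'actor'} (set order may vary); "Nolan" is absent, so it gets masked.
```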

Strict constraints can be dangerous. If the model starts generating a sentence that almost matches a document but gets one word slightly wrong, the constraints might leave it with zero valid next tokens. The generation would crash.

To solve this, RICHES uses Beam Search. Instead of just keeping the single best sequence, it keeps the top \(k\) most likely sequences (beams) active at any time.

Figure 2: Visualization of the constrained beam for the query “when did marathon change its name to snickers?”. The final RICHES output is “Marathon was renamed Snickers in 1990.” Bold boxes track the progress of the top-beam sequence. Grey crossed-out boxes are sequences that the LLM preferred but were blocked by corpus constraints.

In Figure 2, we see the beam search in action for the query about Snickers bars.

  1. t=2: The model explores options like “The” and “Marathon”.
  2. t=5: It explores paths like “Marathon bars were” or “Marathon was renamed”. Some paths hit a dead end (the crossed-out boxes) because they deviate from the corpus text.
  3. t=12: The beam has converged on the factual statement found in the corpus: “Marathon was renamed Snickers in 1990.”

The researchers introduced an Adaptive Beam. When the model is generating free text (thoughts or planning), it uses greedy decoding (fast, simple). When it enters a retrieval block («...»), it widens the beam to explore multiple potential document matches simultaneously. This ensures high recall for facts without wasting compute on the “thinking” parts.
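Below is a heavily simplified sketch of how adaptive, constrained beam decoding could be wired together, reusing the hypothetical `model_logprobs` and `corpus_index.continuations` interfaces from the earlier snippets. The « » bookkeeping and the beam-width heuristic here are my own simplifications, not the paper's exact algorithm.

```python
def adaptive_constrained_decode(model_logprobs, prompt_tokens, corpus_index,
                                max_steps=64, wide_beam=4, eos="<eos>"):
    """Greedy decoding outside « ... »; inside a retrieval block, widen the beam
    and only allow tokens the corpus index accepts as continuations."""
    # Each beam is (log-score, tokens so far, current retrieval span or None).
    beams = [(0.0, list(prompt_tokens), None)]
    for _ in range(max_steps):
        candidates = []
        for score, toks, span in beams:
            if toks[-1] == eos:                   # finished hypothesis: keep as-is
                candidates.append((score, toks, span))
                continue
            dist = model_logprobs(toks)
            if span is not None:                  # retrieval mode: apply corpus mask
                # (closing the block mid-key is allowed here for simplicity)
                allowed = corpus_index.continuations(span) | {"»"}
                dist = {t: lp for t, lp in dist.items() if t in allowed}
            for t, lp in dist.items():
                if t == "«":                      # open a retrieval block
                    new_span = []
                elif t == "»":                    # close the retrieval block
                    new_span = None
                elif span is not None:            # extend the constrained span
                    new_span = span + [t]
                else:                             # unconstrained "thought" token
                    new_span = None
                candidates.append((score + lp, toks + [t], new_span))
        if not candidates:                        # every hypothesis hit a dead end
            break
        # Adaptive width: widen only while some hypothesis is inside « ... ».
        width = wide_beam if any(c[2] is not None for c in candidates) else 1
        beams = sorted(candidates, key=lambda c: -c[0])[:width]
        if all(b[1][-1] == eos for b in beams):
            break
    return beams[0][1]                            # prompt + generated tokens
```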

What Are We Retrieving?

In traditional RAG, we usually index paragraphs (chunks of text). However, generating a whole paragraph verbatim is hard for an LLM—the wording has to match exactly.

The researchers explored several “Retrieval Keys”:

  1. Titles: Just generating the document title.
  2. Sentences: Generating a specific sentence from the text.
  3. Propositions: Atomic, standalone facts derived from the text.

The study found that Propositions were the superior retrieval key. A large sentence might contain complex clauses that trip up the constrained decoding. A proposition is simplified, e.g., changing “He was born in…” to “Obama was born in…” This makes it easier for the model to generate the fact verbatim.

Table 3: Comparison of Retrieval Keys on NQ

As seen in Table 3, using Propositions (33.9 Hits@1) significantly outperformed searching for raw Paragraphs (19.0 Hits@1) or Titles (19.5 Hits@1).
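As a purely illustrative contrast, here is roughly what the three kinds of retrieval keys might look like for the same underlying passage; the strings below are placeholders patterned on the example above, not entries from the paper's index.

```python
# Three possible retrieval keys for the same source passage (illustrative only).
title_key = "Barack Obama"                  # easy to generate, but weak evidence
sentence_key = "He was born in ..."         # pronoun only resolves with surrounding context
proposition_key = "Obama was born in ..."   # atomic, standalone fact; easiest to emit verbatim

# Each proposition becomes one string in the constrained-decoding corpus, e.g.:
# ToyCorpusIndex([proposition_key.split(), ...])
```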

Interlaced Generation: The Multi-Hop Advantage

The true power of RICHES appears in multi-hop questions—questions that require finding fact A to find fact B.

In standard RAG, this is an “Iterative Retrieval” process. The model retrieves a doc, reads it, formulates a new query, calls the retriever again, reads the new doc, and finally answers. This requires complex orchestration.

RICHES does this natively. Because retrieval is just text generation, the model can interleave “thoughts” (unconstrained planning) with “evidence” (constrained retrieval).

Table 2: Example iterative retrieval outputs from RICHES. Remarks are annotated as (# Comments).

Look at the third example in Table 2 regarding the “Charlotte Hornets.”

  1. Thought: The model first identifies it needs to find the member of the Hornets from 1992-93.
  2. Retrieval: It generates a constrained fact: « Muggsy Bogues played for the Charlotte Hornets... ».
  3. Thought: It now knows the entity is “Muggsy Bogues” and plans to find his distinction.
  4. Retrieval: It generates the next fact: « Muggsy Bogues is the shortest player ever... ».
  5. Answer: It synthesizes the final answer.

This entire sequence happens in one continuous stream of tokens. There is no external retriever call, no context switching, and no complex pipeline code.
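Because the whole interaction is a single token stream, downstream code only needs to parse out the delimited spans. A minimal sketch, assuming the « » delimiters shown above (the example string is paraphrased from Table 2, not copied):

```python
import re

output = (
    "I need to find the Hornets player from the 1992-93 season. "
    "« Muggsy Bogues played for the Charlotte Hornets ... » "
    "Now I need what he is known for. "
    "« Muggsy Bogues is the shortest player ever ... » "
    "Answer: Muggsy Bogues."
)

evidence = re.findall(r"«\s*(.*?)\s*»", output)                     # verbatim, attributable spans
thoughts = [t.strip() for t in re.split(r"«.*?»", output) if t.strip()]

print(evidence)   # each string is guaranteed to occur in the corpus
print(thoughts)   # the free-form planning text and the final answer
```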

Results and Performance

The researchers evaluated RICHES against strong baselines, including GTR (a generalizable T5-based dense retriever) and iterative RAG pipelines.

Single-Hop Performance: On the Natural Questions (NQ) dataset, RICHES proved it could stand toe-to-toe with dedicated dense retrievers.

Table 1: Example of RICHES vs dense retrieval for single-hop QA. Only the retrieved text is shown for illustration.

Table 1 highlights a qualitative difference. Dense retrievers often rely on keyword matching (e.g., retrieving “Prudential Center” just because the location matches). RICHES, driven by the LLM’s semantic understanding, tends to retrieve evidence that semantically answers the question (e.g., “Prudential Center is home to the New Jersey Devils”).

The Impact of Interleaving: The ablation study in Table 6 demonstrates the importance of mixing “thoughts” (unconstrained keywords) with retrieval.

Table 6: Interleaving unconstrained keywords and retrieval keys with the adaptive beam. Greedily decoding unconstrained sub-sequences allows constrained retrievals to make the most of the beam search.

When the model is forced to just retrieve without “thinking” (Keywords: X), performance drops. When it can generate unconstrained keywords to guide the search, and uses the Adaptive Beam to manage the search space, performance peaks (40.2 F1 on NQ).

Win/Loss Analysis: RICHES isn’t perfect. The authors categorized failure modes in Table 7.

Table 7: Loss categories for RICHES on Hotpot-QA

“Search Failure” is the most common issue (52%). This means the correct evidence existed in the index, but the constrained beam search couldn’t find the path to generate it. This highlights the trade-off: constrained decoding is precise, but if the LLM’s internal language model doesn’t assign high probability to the phrasing used in the corpus, the beam might prune the correct document too early.

However, the “Wins” are significant. Table 15 shows instances where the unconstrained LLM would have hallucinated an answer (e.g., saying “Air Supply” sang “Only in My Dreams”), but the constraints forced it onto the correct path (“Debbie Gibson”).

Table 15: Unconstrained vs constrained generation. Examples where the unconstrained LLM emits an incorrect answer, but constraining on the corpus helps RICHES override this pre-existing knowledge and obtain the correct answer.

Conclusion

RICHES represents a fascinating shift in how we think about “knowledge” in AI. Instead of treating the LLM as a brain in a jar that needs to be fed documents by an assistant, RICHES treats the LLM as an agent capable of looking up information itself simply by speaking it.

Key Takeaways:

  1. Unified Architecture: Retrieval and Generation are unified into a single probabilistic process.
  2. No Training Required: The method works with off-the-shelf instruction-tuned models (like PaLM 2) via prompting.
  3. High Utility: It naturally supports attribution (citing sources) and multi-hop reasoning without complex loops.
  4. Constraints as Guardrails: By forcing the model to generate text that exists in the corpus, the constraints mechanically block hallucinations during the evidence-gathering phase.

While there are challenges—specifically the computational cost of beam search and the need for high-quality propositional indexing—RICHES offers a glimpse into a future where “search” is just another form of “generation.”