In the world of computer science research, benchmarks often rely on “static” data. We train a model on Wikipedia dumps from 2018, test it on questions about that data, and call it a day. But in the real world, information is fluid. News breaks, laws change, and new scientific discoveries are made every hour. A search engine that excels at retrieving history but fails to index today’s news is functionally useless.
This problem brings us to a fascinating showdown in Information Retrieval (IR): the battle between the reigning champion, Dual Encoders (DE), and the rising challenger, Generative Retrieval (GR).
In the paper Exploring the Practicality of Generative Retrieval on Dynamic Corpora, researchers from KAIST AI, Seoul National University, and others investigate how these two distinct architectures handle the “DynamicIR” problem. Can they learn new information without forgetting the old? How expensive is it to update them? The results challenge the current industry standards and suggest a significant shift in how we might build future search engines.
The Contenders: Dual Encoders vs. Generative Retrieval
Before diving into the experiments, we need to understand the fundamental difference in how these models “think.”
1. Dual Encoders (DE): The Matching Game
This is the standard approach used in most modern vector search systems (like BERT-based retrievers).
- Mechanism: You have two separate neural networks (encoders). One turns your search Query into a vector (a list of numbers). The other turns every Document in your database into a vector.
- Retrieval: To find an answer, the system calculates the mathematical similarity (dot product) between the query vector and millions of document vectors to find the closest match (sketched in the code after this list).
- The weak point: The “knowledge” is stored in an external index of vectors. Updating the corpus means re-encoding documents and updating a massive index.
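To make the mechanism concrete, here is a minimal NumPy sketch of the dual-encoder pipeline. The `encode` function is a toy stand-in for a trained BERT-style encoder, and the corpus is invented for illustration:

```python
import numpy as np

def encode(texts):
    # Toy stand-in for a trained encoder: hash each text into a pseudo-random
    # dense vector. A real DE would run a BERT-style network here.
    vecs = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        vecs.append(rng.normal(size=8))
    return np.asarray(vecs, dtype=np.float32)

corpus = ["2019: parliament passes the budget", "2018: a new vaccine trial begins"]
doc_index = encode(corpus)                      # the external vector index

def search(query, k=1):
    q = encode([query])[0]
    scores = doc_index @ q                      # dot-product similarity vs. every document
    return [corpus[i] for i in np.argsort(-scores)[:k]]

# Indexing-based update: a 2020 article arrives -> encode it and grow the index.
corpus.append("2020: nationwide lockdown announced")
doc_index = np.vstack([doc_index, encode([corpus[-1]])])
print(search("What was announced in 2020?"))
```

Updating the corpus here is just the `append` and `vstack` step: the model weights never change, but the external index keeps growing with the corpus.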
2. Generative Retrieval (GR): The Memorizer
This is a newer paradigm powered by sequence-to-sequence models (like T5 or BART).
- Mechanism: There is no external index of vectors. The model is trained to take a Query as input and generate the identifier of the document (like its title or a unique substring) as output (see the sketch after this list).
- Retrieval: The model “memorizes” the documents within its own internal parameters (weights).
- The potential: It simplifies the architecture (no separate vector database needed), but critics worried it would be hard to update or prone to hallucination.
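As a sketch of the GR side, retrieval becomes a single `generate` call. This assumes a seq2seq checkpoint fine-tuned to map queries to document titles; plain `t5-small` below is only a placeholder for such a model:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder weights: a real GR system would load a model fine-tuned so that
# "query in -> document identifier (e.g. a title) out".
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

query = "Which country announced a nationwide lockdown in 2020?"
inputs = tokenizer(query, return_tensors="pt")

# Retrieval is just text generation: the "index" lives in the model's parameters.
beams = model.generate(**inputs, max_new_tokens=16, num_beams=4, num_return_sequences=4)
for seq in beams:
    print(tokenizer.decode(seq, skip_special_tokens=True))   # candidate identifiers
```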
The Challenge: Dynamic Information Retrieval
The researchers established a setup called DynamicIR to simulate the real world. They utilized the StreamingQA benchmark, which contains news articles and questions spanning from 2007 to 2020. This dataset is crucial because it includes timestamps, allowing the team to test how models handle the flow of time.
As illustrated in Figure 1, the researchers designed three distinct scenarios to test the models:

- StaticIR (A & B): The baseline. Models are trained on an initial corpus (news from 2007–2019) and tested on it.
- Indexing-based Update (C): A new corpus (news from 2020) arrives. The model cannot change its parameters (no training). It must simply “index” the new data.
- For DE, this means encoding the new docs and adding them to the pile.
- For GR, this means using constrained decoding so the model can only output identifiers from the updated list (see the sketch after this list).
- Training-based Update (D & E): The gold standard. The model is allowed to “continually pretrain” on the new 2020 corpus to internalize the new information, and then update its index.
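One common way to realize the constrained decoding mentioned above is a prefix-constrained `generate` call: at every step the decoder may only emit tokens that keep the output a prefix of some known identifier. The sketch below is illustrative (toy titles, untrained `t5-small` backbone); the paper's GR models such as SEAL and MINDER actually constrain generation over corpus substrings with an FM-index rather than a simple title list:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")           # illustrative backbone
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Known identifiers. An indexing-based update for GR is just extending this list
# with the 2020 titles; no re-encoding and no parameter change.
titles = ["Parliament passes the budget 2019",
          "A new vaccine trial begins 2018",
          "Nationwide lockdown announced 2020"]
allowed = [tokenizer(t, add_special_tokens=False).input_ids for t in titles]

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # Only permit next tokens that keep the output a prefix of some known title.
    generated = input_ids.tolist()[1:]            # drop the decoder start token
    nxt = {seq[len(generated)]
           for seq in allowed
           if seq[:len(generated)] == generated and len(seq) > len(generated)}
    return list(nxt) or [tokenizer.eos_token_id]

inputs = tokenizer("What happened in 2020?", return_tensors="pt")
out = model.generate(**inputs, num_beams=3, max_new_tokens=24,
                     prefix_allowed_tokens_fn=prefix_allowed_tokens_fn)
print(tokenizer.decode(out[0], skip_special_tokens=True))     # always a valid title
```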
The Core Method: How to Update a Generative Model?
One of the paper’s most significant technical contributions is determining how to efficiently train Generative Retrieval models on new data. If you retrain the whole model, it’s slow and you risk “catastrophic forgetting” (overwriting old knowledge).
The researchers analyzed which parts of the neural network change the most when learning new information. They defined these as Dynamic Parameters (DPs).

As shown in Figure 2, they compared the model before and after learning new data. They discovered that the “knowledge” doesn’t live evenly across the model.

Table 2 confirms their hypothesis: the Feed-Forward Networks (FFN, labeled FC1 and FC2) contain roughly 25x more dynamic parameters than the Attention layers. This suggests that while Attention layers handle the “reasoning” or routing of information, the FFN layers act as the “key-value memory” storing the actual facts.
Based on this, the authors proposed a targeted update strategy. Instead of standard LoRA (Low-Rank Adaptation), which usually targets attention mechanisms, they applied LoRA specifically to the FFN layers. This allowed the GR model to absorb new 2020 news efficiently without forgetting the 2007–2019 history.
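With Hugging Face’s peft library, this kind of FFN-targeted adapter takes only a few lines. The snippet is a sketch rather than the authors’ exact recipe: it assumes a T5-style backbone (whose FFN projections are named `wi` and `wo`), and the rank and alpha values are illustrative:

```python
from peft import LoraConfig, get_peft_model
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")   # stand-in backbone

# Standard LoRA recipes attach adapters to the attention projections (e.g. "q", "v").
# Following the paper's observation that new facts concentrate in the FFN,
# we target the feed-forward projections instead and leave attention frozen.
ffn_lora = LoraConfig(
    r=16,                          # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["wi", "wo"],   # T5's FFN layers (FC1 / FC2 in the paper's terms)
)

model = get_peft_model(model, ffn_lora)
model.print_trainable_parameters()   # only the small FFN adapters will be updated

# `model` can now be continually pretrained on the 2020 corpus with an ordinary
# training loop; the frozen base weights keep the 2007-2019 knowledge intact.
```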
Experiments & Results
The team compared GR models (specifically SEAL, MINDER, and LTRGR) against DE models (DPR, Spider, Contriever). The results highlight three major wins for Generative Retrieval.
1. Adaptability: GR Learns Better
When exposed to the new 2020 corpus, Generative Retrieval models adapted significantly better than Dual Encoders.
- Indexing-based Update: Even without retraining, GR showed 4% better adaptability.
- Training-based Update: When allowed to train on the new data, GR outperformed DE by an average of 11%.
The visualization below breaks down the performance gap:

In Figure 3, look at the performance of SEAL and MINDER (GR models). They maintain high accuracy across both initial (red stars) and new queries (blue stars). In contrast, the DE models (Spider, Contriever) struggle to balance the two.
2. Robustness: The “Temporal Bias” Trap
One of the most shocking findings was that Dual Encoders were “cheating.”
In the dataset, queries about the new corpus often contained the string “2020,” and the target documents also contained “2020.” Dual Encoders, which lean heavily on this kind of surface-level lexical overlap, were latching onto the shared timestamp as a matching shortcut.
When the researchers performed an ablation study by removing the specific timestamps from the text, the performance of Dual Encoders collapsed.

As Table 9 shows, when the bias-inducing timestamp is removed (w/o timestamp), the performance of Spider DE drops from roughly 34% (seen in previous tables) down to 17.40%.
In comparison, the GR models (SEAL GR, MINDER GR) remained robust, dropping only slightly or maintaining high performance (around 37-39%). This indicates that GR models were actually learning the content of the new news, while DE models were relying on surface-level shortcuts.
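A hypothetical recreation of that “w/o timestamp” ablation is just a preprocessing pass that deletes explicit year mentions from queries and documents before retrieval. The regex and examples below are illustrative; the paper’s exact procedure may differ:

```python
import re

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")   # match explicit four-digit years

def strip_timestamps(text):
    # Delete the year tokens, then tidy up any doubled whitespace left behind.
    return re.sub(r"\s{2,}", " ", YEAR.sub("", text)).strip()

query = "In 2020, which country announced a nationwide lockdown?"
doc = "Published 18 March 2020: the government announced a nationwide lockdown."

print(strip_timestamps(query))   # the lexical "2020" shortcut is gone
print(strip_timestamps(doc))     # the event content itself remains
```

If a retriever’s score collapses after an edit like this, it was matching on the date string rather than on the event itself.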
3. Efficiency: Doing More with Less
For real-world deployment, accuracy isn’t everything. Cost and speed matter. The paper presents a compelling case for GR’s efficiency.
Inference FLOPs (Computational Cost): Dual Encoders have a complexity of \(O(N)\), where \(N\) is the corpus size. To find a document, you technically need to compare your query against the entire database (or search a massive index). Generative Retrieval has a complexity of \(O(1)\) relative to the corpus size—it just generates tokens.

Figure 4 illustrates this dramatic difference. The GR line (orange circles) stays flat and low. The DE lines (green) skyrocket as the system scales.
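A back-of-envelope calculation makes the scaling visible. Per query, a DE pays a fixed cost to encode the query plus one similarity score against each of the \(N\) document vectors, while GR pays a fixed cost per generated token regardless of \(N\). The constants below are illustrative, not the paper’s measurements:

```python
def de_query_flops(n_docs, dim=768, query_encode_flops=2.2e10):
    # Encode the query once, then one dot product per document vector:
    # the second term grows linearly with the corpus size N.
    return query_encode_flops + 2 * dim * n_docs

def gr_query_flops(decode_steps=32, params=2.2e8):
    # Roughly 2 * params FLOPs per decoded token, independent of corpus size.
    return 2 * params * decode_steps

for n in (10**6, 10**7, 10**8, 10**9):
    print(f"N={n:>13,}  DE ~ {de_query_flops(n):.1e} FLOPs   GR ~ {gr_query_flops():.1e} FLOPs")
```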
Storage and Indexing:
- Storage: GR requires 4x less storage. DE needs to store a dense vector for every document (hundreds of gigabytes for large corpora). GR compresses this knowledge into its parameters.
- Indexing Time: When updating the model, DE requires re-indexing the entire corpus, which took 20.4 hours in the experiment. GR only took 3.1 hours to update.
Conclusion and Implications
This research fundamentally challenges the dominance of Dual Encoders in dynamic environments. While Dual Encoders (like those powering many RAG systems today) are effective for static data, they show cracks when the world changes:
- They struggle to integrate new knowledge without expensive re-indexing.
- They rely on surface-level heuristics (like matching dates) rather than deep semantic understanding.
- They are computationally heavy and storage-intensive at scale.
Generative Retrieval, specifically with the FFN-targeted training proposed by the authors, offers a more “organic” way to handle information. It learns new facts much like a human does—by updating its internal memory—rather than just filing new pages into a cabinet.
The Catch? The paper notes one limitation: Latency. While GR uses fewer computations (FLOPs), its wall-clock search time is currently higher than DE’s (DE answers in milliseconds, while GR takes roughly half a second). This is because vector search (used by DE) is heavily optimized by libraries like FAISS, whereas autoregressive generation (used by GR) is sequential and slower.
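For context on why DE look-ups are so fast in practice, here is what that heavily optimized path looks like with FAISS (random vectors stand in for real document embeddings):

```python
import numpy as np
import faiss

dim, n_docs = 768, 100_000
doc_vectors = np.random.random((n_docs, dim)).astype("float32")

index = faiss.IndexFlatIP(dim)         # exact maximum-inner-product search
index.add(doc_vectors)                 # one-off indexing cost
query = np.random.random((1, dim)).astype("float32")

scores, ids = index.search(query, 10)  # millisecond-scale, SIMD/GPU-optimized
print(ids[0])
```

Autoregressive GR decoding has no comparably mature tooling yet, which is where the current wall-clock gap comes from.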
However, as hardware accelerators and techniques like speculative decoding improve, the “slow generation” problem may vanish, leaving us with a retrieval paradigm that is smarter, lighter, and far more adaptable to our ever-changing world.