Search engines have come a long way from simply matching keywords, but they still struggle with a fundamental problem: ambiguity. When a user searches “Is the US a member of WHO?”, a traditional system sees the word “us” (the pronoun) and “who” (the question word), potentially missing the crucial entities “United States” and “World Health Organization.”
This disconnection happens because many modern retrieval models rely on tokenization—breaking words down into smaller fragments called “word pieces.” While this helps computers handle rare words, it often shatters meaningful concepts into nonsensical syllables.
In this post, we’ll dive into DyVo (Dynamic Vocabularies), a research framework that attempts to fix this by teaching search models to “think” in terms of both words and Wikipedia entities. We will explore how DyVo dynamically injects millions of entities into sparse retrieval models, significantly improving their ability to understand complex queries.
The Vocabulary Gap in Neural Search
To understand DyVo, we first need to look at the current state of Learned Sparse Retrieval (LSR).
LSR models, such as the popular SPLADE architecture, have become a dominant approach in first-stage retrieval. They work by encoding queries and documents into sparse vectors (vectors with mostly zero values), similar to traditional Bag-of-Words models like BM25. However, unlike BM25, which only counts the words that appear in the text, LSR models learn which words are important and can even expand the text with related terms.
The Tokenization Problem
LSR models typically use a Word Piece vocabulary derived from models like BERT. This vocabulary consists of roughly 30,000 sub-word units.
The problem arises when these models encounter complex entities. Consider the company “BioNTech.” A standard tokenizer might split this into ['bio', '##nte', '##ch']. Individually, these fragments lose the semantic meaning of the original entity. The model struggles to understand that ##nte is part of a biotechnology company.
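To see this fragmentation concretely, here is a minimal sketch using the Hugging Face `transformers` tokenizer. The exact split depends on the tokenizer's vocabulary, so the outputs shown in the comments are illustrative rather than guaranteed.

```python
from transformers import AutoTokenizer

# Standard BERT WordPiece vocabulary (~30k sub-word units, lowercased).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Entity names are often shattered into sub-word fragments.
print(tokenizer.tokenize("BioNTech"))
# e.g. ['bio', '##nte', '##ch']

# After lowercasing, "US" and "WHO" look like ordinary function words.
print(tokenizer.tokenize("Is the US a member of WHO?"))
# e.g. ['is', 'the', 'us', 'a', 'member', 'of', 'who', '?']
```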
Furthermore, bag-of-word-piece models struggle with homonyms. As shown in the image below, “US” and “WHO” are indistinguishable from common pronouns in a standard vocabulary.

Figure 1 illustrates the DyVo approach. Instead of relying solely on the blue word pieces (which are ambiguous), the model augments the representation with explicit orange entities like “World Health Organization” and “United States.”
Introducing DyVo: Dynamic Vocabularies
The researchers propose DyVo, a method to enrich LSR models with a massive vocabulary of entities (specifically, over 5 million entities from Wikipedia).
The core idea is to represent a document or query not just as a Bag of Words, but as a joint representation: Bag of Words + Bag of Entities.
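Conceptually, the joint representation is a single sparse vector whose dimensions span both vocabularies. A toy sketch with made-up weights:

```python
# Toy DyVo-style representation of "Is the US a member of WHO?".
# Keys are dimensions of the joint vocabulary; values are learned weights.
joint_query_vector = {
    # Word-piece dimensions (roughly 30k possible)
    "us": 1.2,
    "who": 1.1,
    "member": 0.9,
    # Entity dimensions (millions possible, only a few candidates are non-zero)
    "ENTITY:United_States": 1.8,
    "ENTITY:World_Health_Organization": 1.7,
}
```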
The Scale Challenge
Integrating entities into a neural network sounds straightforward, but there is a major computational hurdle: Vocabulary Size.
Standard LSR models (like SPLADE) work by predicting a weight for every single term in the vocabulary for every input document. This is feasible when the vocabulary contains roughly 30,000 word pieces (standard BERT). However, Wikipedia has over 5 million entities, and predicting 5 million weights for every document we index is computationally infeasible.
The DyVo Solution
DyVo solves this by making the vocabulary dynamic. Instead of scoring every possible entity, the model follows a two-step process:
- Candidate Retrieval: Identify a small set of potentially relevant entities for the specific input text.
- Scoring: Only calculate weights for those specific candidates.

As shown in Figure 2, the architecture works as follows:
- Input: The text (query or document) is fed into the system.
- Entity Retriever: An external component identifies relevant entity candidates (e.g., “United States”, “WHO”).
- LSR Encoder: A transformer (like DistilBERT) processes the text to create hidden states (contextual representations of tokens).
- DyVo Head: The model computes scores only for the retrieved candidates using their pre-computed embeddings.
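The last step, the DyVo head, can be sketched in a few lines of PyTorch. This is a minimal reconstruction based on the scoring recipe described in the next section, not the paper's actual code; the function and argument names are mine.

```python
import torch

def dyvo_entity_scores(hidden_states: torch.Tensor,
                       candidate_embeddings: torch.Tensor,
                       lambda_ent: torch.Tensor) -> torch.Tensor:
    """Score only the retrieved candidate entities against the encoded text.

    hidden_states:        (seq_len, dim) contextual token vectors from the encoder
    candidate_embeddings: (n_cand, dim) pre-computed embeddings of candidate entities
    lambda_ent:           trainable scalar that rescales the entity branch
    """
    # Dot product of every candidate entity with every token position.
    logits = candidate_embeddings @ hidden_states.T               # (n_cand, seq_len)
    # Max-pool over token positions, keep positive activations, log-scale.
    pooled = torch.log1p(torch.relu(logits.max(dim=-1).values))   # (n_cand,)
    return lambda_ent * pooled
```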
The Mathematical Framework
Let’s break down how DyVo constructs its final sparse representation. The total relevance score \(S(q,d)\) between a query \(q\) and a document \(d\) is the sum of the word-piece alignment and the entity alignment.
1. The Standard Word-Piece Score
First, the model calculates the standard LSR score using the word-piece vocabulary. This is done by encoding the query \(f_q(q)\) and document \(f_d(d)\) into sparse vectors \(s_q\) and \(s_d\), and taking their dot product.
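In symbols (my notation, reconstructed from the description rather than copied from the paper):

\[
S_{\text{word}}(q, d) \;=\; s_q \cdot s_d \;=\; \sum_{i \in V_{\text{wp}}} s_{q,i}\, s_{d,i},
\qquad s_q = f_q(q), \quad s_d = f_d(d)
\]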

To get the weight for a specific word \(v_i\) in the document, the model looks at the hidden states \(h_j\) of the transformer. It calculates the dot product between the word embedding \(e_i\) and the hidden states, keeping the maximum activation (passed through a ReLU to ensure positivity and log-scaled).
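Written out, the document-side weight for term \(v_i\) takes the familiar SPLADE-style form (again, my reconstruction):

\[
s_{d,i} \;=\; \log\!\Big(1 + \max_{j}\, \operatorname{ReLU}\big(e_i^{\top} h_j\big)\Big)
\]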

For the query side, the researchers found that a simpler MLP projection (scoring only the tokens that actually appear in the query) works better than the complex expansion used for documents. This keeps the query representation focused.
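A plausible form for this query-side weighting, assuming an MLP applied to each query token's hidden state with no expansion (a common choice in the LSR literature; the paper's exact parameterization may differ):

\[
s_{q,i} \;=\; \operatorname{ReLU}\big(\mathrm{MLP}(h_{j(i)})\big),
\qquad \text{where } h_{j(i)} \text{ is the hidden state of query token } v_i
\]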

2. The Entity Score
This is where DyVo introduces its novelty. To score an entity (which is not a token in the original text), the model compares the Entity Embedding \(e_i^{entity}\) against the hidden states of the input text \(h_j\).
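Mirroring the word-piece case, the entity weight can be reconstructed as follows (where exactly \(\lambda_{ent}\) is applied, to the embeddings or to the pooled score, is an implementation detail; I place it outside for readability):

\[
s_{d,i}^{\text{ent}} \;=\; \lambda_{ent} \cdot \log\!\Big(1 + \max_{j}\, \operatorname{ReLU}\big((e_i^{\text{entity}})^{\top} h_j\big)\Big)
\]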

Notice the term \(\lambda_{ent}\). This is a trainable scaling factor. The researchers found this crucial because entity embeddings often have different magnitudes than word embeddings. Without \(\lambda_{ent}\), the entity scores might dominate the word scores, or collapse to zero during training (more on this later).
3. The Joint Score
Finally, the system combines both worlds. The final similarity score is a summation of the word-piece overlap and the entity overlap.
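Putting the two parts together (my notation):

\[
S(q, d) \;=\; \underbrace{\sum_{i \in V_{\text{wp}}} s_{q,i}\, s_{d,i}}_{\text{word-piece match}}
\;+\; \underbrace{\sum_{i \in E_{\text{cand}}} s_{q,i}^{\text{ent}}\, s_{d,i}^{\text{ent}}}_{\text{entity match}}
\]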

This joint representation allows the model to leverage exact keyword matching (via word pieces) while simultaneously using high-level conceptual matching (via entities).
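Because both branches are just sparse vectors, scoring remains an inverted-index-friendly dot product. Continuing the toy dictionary representation from earlier:

```python
def joint_score(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    """Sparse dot product over the joint word-piece + entity vocabulary."""
    return sum(weight * doc_vec.get(term, 0.0) for term, weight in query_vec.items())
```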
Where Do the Entities Come From?
A critical component of DyVo is the Entity Retriever (from Figure 2). The quality of the sparse representation depends entirely on finding the right candidates to score. The researchers explored several approaches:
1. Entity Linking (REL)
The traditional approach. An entity linker scans the text, identifies mentions (like “Paris”), and links them to Wikipedia IDs.
- Pros: precise.
- Cons: misses implicit entities. If a text discusses “The Big Apple,” a linker finds “New York City,” but might miss related concepts like “Wall Street” if they aren’t explicitly named.
2. Generative Retrieval (LLMs)
This is the best-performing approach explored in the paper. The researchers use Large Language Models (LLMs) like Mixtral or GPT-4 to “hallucinate” relevant entities.
They provide the LLM with the query and ask: “Identify Wikipedia entities that are helpful to retrieve documents relevant to this search query.”
This effectively uses the LLM’s vast internal knowledge to perform expansion. For a query about “NFTs,” an LLM can suggest “Non-fungible token,” “Ethereum,” and “Blockchain,” even if those words aren’t in the query.
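A hedged sketch of this expansion step is shown below; `call_llm` is a placeholder for whatever LLM client you use, and the prompt simply reuses the instruction quoted above.

```python
def generate_candidate_entities(query: str, call_llm) -> list[str]:
    """Ask an LLM to propose Wikipedia entities useful for retrieving relevant documents."""
    prompt = (
        "Identify Wikipedia entities that are helpful to retrieve documents "
        f"relevant to this search query.\n\nQuery: {query}\n\n"
        "Return one entity title per line."
    )
    response = call_llm(prompt)  # placeholder: any chat/completion API works here
    # One Wikipedia title per line; these become the candidate entity set for DyVo.
    return [line.strip() for line in response.splitlines() if line.strip()]
```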

Table 4 highlights the difference. For the query about “NFTs,” the REL linker fails completely (linking “Next plc” and “Toronto”). BM25 finds random noise. GPT-4, however, correctly identifies “Non-fungible token,” “Cryptocurrency,” and “Bitcoin.”
Experimental Results
The researchers tested DyVo on three entity-rich datasets: TREC Robust04, TREC Core 2018, and CODEC.
Performance Gains
The results were clear: incorporating entities consistently improves retrieval performance.

In Table 2, LSR-w represents the baseline model using only word pieces. DyVo consistently outperforms it. Notably, the Generative approaches (Mixtral and GPT-4) yield the highest scores, showing that using LLMs to infer relevant entities is more powerful than entity linking alone.
The Problem of Entity Collapse
Training these models wasn’t without challenges. One fascinating insight from the paper is the phenomenon of Entity Representation Collapse.
Because the entity embeddings come from a different source (e.g., Wikipedia2Vec or LaQue) than the trained BERT word embeddings, they live in different vector spaces. During the early stages of training, the model struggles to balance the two.

As shown in Figure 3, without careful tuning, the model might decide to suppress the entities entirely, pushing their weights to zero (the sharp drop in the top graph). Once the weights hit zero, the ReLU activation kills the gradient, and the entity head “dies”—it stops learning forever.
The researchers solved this by introducing the learnable scaling factor \(\lambda_{ent}\) mentioned in the math section, which helps stabilize the magnitudes of the two different vector spaces.
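In code, the fix is as small as a single trainable scalar on the entity branch. A minimal sketch, assuming a PyTorch implementation (not the paper's actual code):

```python
import torch
import torch.nn as nn

class EntityBranchScale(nn.Module):
    """Learnable scaling factor lambda_ent applied to the entity scores."""

    def __init__(self, init_value: float = 1.0):
        super().__init__()
        # A single scalar, optimized jointly with the rest of the model.
        self.lambda_ent = nn.Parameter(torch.tensor(init_value))

    def forward(self, pooled_entity_scores: torch.Tensor) -> torch.Tensor:
        # Rescale the entity branch to keep it on a magnitude comparable
        # to the word-piece branch during training.
        return self.lambda_ent * pooled_entity_scores
```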
Does the Embedding Type Matter?
The researchers also asked: How should we represent the entities? They tested several embedding types:
- Token Aggregation: Just averaging the word embeddings of the entity’s name.
- Wikipedia2Vec: Pre-trained static embeddings.
- LaQue / BLINK: Advanced transformer-based entity encoders.
While advanced encoders like BLINK provided the best performance, even the simple Wikipedia2Vec embeddings delivered a significant boost over the word-piece-only baseline. This suggests that giving entities their own dimensions in the vocabulary is the primary driver of the gains, rather than the fine-grained quality of the embedding vectors themselves.
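The simplest option, token aggregation, is easy to sketch, assuming access to the encoder's word-piece embedding table (a tensor of shape vocab_size x dim, e.g. `model.get_input_embeddings().weight` in Hugging Face models):

```python
import torch

def token_aggregation_embedding(entity_name: str, tokenizer,
                                embedding_table: torch.Tensor) -> torch.Tensor:
    """Entity embedding = mean of the word-piece embeddings of the entity's name."""
    token_ids = tokenizer(entity_name, add_special_tokens=False)["input_ids"]
    return embedding_table[torch.tensor(token_ids)].mean(dim=0)
```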
Conclusion and Implications
DyVo represents a significant step forward in bridging the gap between lexical search (matching words) and semantic search (matching concepts). By treating Wikipedia entities as a dynamic extension of the model’s vocabulary, DyVo allows search engines to:
- Disambiguate complex terms (US vs. us).
- Update knowledge easily (by updating the entity database without re-training the whole model).
- Leverage LLMs for reasoning while keeping the efficiency of sparse retrieval (inverted indexes).
For students and researchers in Information Retrieval, DyVo illustrates a powerful design pattern: Decoupling the vocabulary from the model architecture. We are no longer limited to the tokens our model was pre-trained on; we can dynamically inject world knowledge precisely where it is needed.