Search engines have come a long way from simply matching keywords, but they still struggle with a fundamental problem: ambiguity. When a user searches “Is the US a member of WHO?”, a traditional system sees the words “us” (the pronoun) and “who” (the question word), potentially missing the crucial entities “United States” and “World Health Organization.”

This disconnect happens because many modern retrieval models rely on tokenization: breaking words down into smaller fragments called “word pieces.” While this helps computers handle rare words, it often shatters meaningful concepts into nonsensical syllables.

In this post, we’ll dive into DyVo (Dynamic Vocabularies), a research framework that attempts to fix this by teaching search models to “think” in terms of both words and Wikipedia entities. We will explore how DyVo dynamically injects millions of entities into sparse retrieval models, significantly improving their ability to understand complex queries.

To understand DyVo, we first need to look at the current state of Learned Sparse Retrieval (LSR).

LSR models, such as the popular SPLADE architecture, have become a dominant approach in first-stage retrieval. They work by encoding queries and documents into sparse vectors (vectors with mostly zero values), similar to traditional Bag-of-Words models like BM25. However, unlike BM25, which only counts the words that appear in the text, LSR models learn which words are important and can even expand the text with related terms.
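
To make the sparse-vector idea concrete, here is a minimal sketch (not code from the paper) that models sparse vectors as plain Python dicts and scores a query/document pair with a dot product; the terms and weights are invented for illustration.

```python
# Minimal sketch: both BM25-style and learned sparse retrieval score a
# query/document pair as a dot product over sparse term-weight vectors,
# modeled here as dicts. The weights below are made up.

def sparse_dot(query_vec: dict, doc_vec: dict) -> float:
    """Dot product over the terms the two sparse vectors share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# An LSR model can assign weight to an expansion term like "vaccine"
# even if the original text never contains that exact word.
query = {"covid": 1.2, "vaccine": 0.8}
doc = {"covid": 0.9, "vaccine": 0.4, "biontech": 1.1}
print(sparse_dot(query, doc))  # ~1.4
```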

The Tokenization Problem

LSR models typically use a Word Piece vocabulary derived from models like BERT. This vocabulary consists of roughly 30,000 sub-word units.

The problem arises when these models encounter complex entities. Consider the company “BioNTech.” A standard tokenizer might split this into ['bio', '##nte', '##ch']. Individually, these fragments lose the semantic meaning of the original entity. The model struggles to understand that ##nte is part of a biotechnology company.
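
You can check this yourself with a standard BERT tokenizer via the Hugging Face transformers library; the exact split can vary across tokenizers, so treat the output as illustrative.

```python
# Requires: pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("BioNTech"))
# e.g. ['bio', '##nte', '##ch'] -- the entity shatters into fragments
```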

Furthermore, bag-of-word-piece models struggle with homonyms. As Figure 1 below shows, “US” and “WHO” are indistinguishable from the everyday words “us” and “who” in a standard vocabulary.

Figure 1: DyVo augments BERT’s word piece vocabulary with an entity vocabulary to help disambiguate a query (or document). Word pieces are in blue and entities are in orange. Darker terms have a higher weight in the sparse representation.

Figure 1 illustrates the DyVo approach. Instead of relying solely on the blue word pieces (which are ambiguous), the model augments the representation with explicit orange entities like “World Health Organization” and “United States.”

Introducing DyVo: Dynamic Vocabularies

The researchers propose DyVo, a method to enrich LSR models with a massive vocabulary of entities (specifically, over 5 million entities from Wikipedia).

The core idea is to represent a document or query not just as a Bag of Words, but as a joint representation: Bag of Words + Bag of Entities.

The Scale Challenge

Integrating entities into a neural network sounds straightforward, but there is a major computational hurdle: Vocabulary Size.

Standard LSR models (like SPLADE) work by predicting a weight for every single term in the vocabulary for every input document. This is feasible when the vocabulary size is 30,000 (standard BERT). However, Wikipedia has over 5 million entities, and predicting 5 million weights for every document we index is computationally infeasible: with 768-dimensional hidden states, the output projection alone would need roughly 3.8 billion parameters.

The DyVo Solution

DyVo solves this by making the vocabulary dynamic. Instead of scoring every possible entity, the model follows a two-step process:

  1. Candidate Retrieval: Identify a small set of potentially relevant entities for the specific input text.
  2. Scoring: Only calculate weights for those specific candidates.

Figure 2: DyVo model with large entity vocabulary. The DyVo head scores entity candidates from an Entity Retriever component.

As shown in Figure 2, the architecture works as follows:

  1. Input: The text (query or document) is fed into the system.
  2. Entity Retriever: An external component identifies relevant entity candidates (e.g., “United States”, “WHO”).
  3. LSR Encoder: A transformer (like DistilBERT) processes the text to create hidden states (contextual representations of tokens).
  4. DyVo Head: The model computes scores only for the retrieved candidates using their pre-computed embeddings.
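
Sketched as code, with every component name a placeholder rather than the paper’s actual API, the flow looks roughly like this:

```python
import torch

def dyvo_represent(text, tokenizer, encoder, entity_retriever,
                   entity_embeddings, word_head, entity_head):
    """Schematic DyVo encoding pass (all components are placeholders)."""
    # Step 2: the retriever proposes a small set of candidate entity IDs.
    candidates = entity_retriever(text)

    # Step 3: a transformer (e.g., DistilBERT) yields contextual hidden states.
    inputs = tokenizer(text, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state[0]  # (seq_len, dim)

    # Step 4: weights over the ~30k word pieces, plus weights computed
    # only for the retrieved candidates using pre-computed embeddings.
    word_vec = word_head(hidden)                          # (vocab_size,)
    ent_vec = entity_head(hidden, entity_embeddings[candidates])
    return word_vec, (candidates, ent_vec)
```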

The Mathematical Framework

Let’s break down how DyVo constructs its final sparse representation. The total relevance score \(S(q,d)\) between a query \(q\) and a document \(d\) is the sum of the word-piece alignment and the entity alignment.

1. The Standard Word-Piece Score

First, the model calculates the standard LSR score using the word-piece vocabulary. This is done by encoding the query and document into sparse vectors \(s_q = f_q(q)\) and \(s_d = f_d(d)\), then taking their dot product.

\[
S(q, d) = s_q \cdot s_d = \sum_i s_{q,i} \, s_{d,i}
\]

Equation 1: The similarity between a query and a document is computed as the dot product between the two corresponding sparse vectors.

To get the weight for a specific word \(v_i\) in the document, the model looks at the hidden states \(h_j\) of the transformer. It calculates the dot product between the word embedding \(e_i\) and the hidden states, keeping the maximum activation (passed through a ReLU to ensure positivity and log-scaled).

\[
s_{d,i} = \max_{j} \log\bigl(1 + \mathrm{ReLU}(e_i \cdot h_j)\bigr)
\]

Equation 2: Calculating the weight of the i-th vocabulary item based on the maximum activation across all token hidden states.
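
In PyTorch, Equation 2 amounts to a few lines; this sketch is a direct transcription of the formula above, assuming the stated tensor shapes.

```python
import torch

def doc_word_weights(hidden: torch.Tensor, vocab_emb: torch.Tensor) -> torch.Tensor:
    """hidden: (seq_len, dim) hidden states h_j;
    vocab_emb: (vocab_size, dim) word-piece embeddings e_i;
    returns (vocab_size,) document weights s_d."""
    acts = torch.log1p(torch.relu(hidden @ vocab_emb.T))  # (seq_len, vocab_size)
    return acts.max(dim=0).values  # max activation per vocabulary item
```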

For the query side, the researchers found that a simpler scheme works better than the full expansion used for documents: each query token’s hidden state is passed through a small MLP to produce a single weight. This keeps the query representation focused on the terms actually present in the query.

\[
s_{q,i} = \mathrm{ReLU}\bigl(w^\top h_j + b\bigr), \quad \text{where position } j \text{ holds vocabulary item } v_i \text{ in the query}
\]

Equation 3: The query weights are calculated using a linear projection of the hidden states for tokens present in the query.

2. The Entity Score

This is where DyVo introduces its novelty. To score an entity (which is not a token in the original text), the model compares the Entity Embedding \(e_i^{entity}\) against the hidden states of the input text \(h_j\).

\[
s_{i}^{ent} = \lambda_{ent} \cdot \max_{j} \log\bigl(1 + \mathrm{ReLU}(e_i^{entity} \cdot h_j)\bigr)
\]

Equation 4: The weight of the i-th entity is calculated by finding the maximum similarity between the entity embedding and the text’s hidden states, scaled by a factor \(\lambda_{ent}\).

Notice the term \(\lambda_{ent}\). This is a trainable scaling factor. The researchers found this crucial because entity embeddings often have different magnitudes than word embeddings. Without \(\lambda_{ent}\), the entity scores might dominate the word scores, or collapse to zero during training (more on this later).
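
The entity side mirrors the word-piece computation, except that it runs only over the retrieved candidates and applies \(\lambda_{ent}\); here is a sketch under the same shape assumptions as before.

```python
import torch
from torch import nn

# Trainable scalar that reconciles entity- and word-embedding magnitudes.
lambda_ent = nn.Parameter(torch.tensor(1.0))

def entity_weights(hidden: torch.Tensor, cand_emb: torch.Tensor) -> torch.Tensor:
    """hidden: (seq_len, dim); cand_emb: (n_candidates, dim) pre-computed
    embeddings of the retrieved candidate entities only."""
    acts = torch.log1p(torch.relu(hidden @ cand_emb.T))  # (seq_len, n_candidates)
    return lambda_ent * acts.max(dim=0).values
```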

3. The Joint Score

Finally, the system combines both worlds. The final similarity score is a summation of the word-piece overlap and the entity overlap.

\[
S(q, d) = s_q^{word} \cdot s_d^{word} + s_q^{ent} \cdot s_d^{ent}
\]

Equation 5: The final relevance score integrates both word and entity vocabularies by summing their respective dot products.

This joint representation allows the model to leverage exact keyword matching (via word pieces) while simultaneously using high-level conceptual matching (via entities).
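
In code, the combination is just an addition, reusing the sparse_dot helper from the earlier sketch and keeping the two vocabularies in disjoint ID spaces:

```python
def joint_score(q_words: dict, d_words: dict, q_ents: dict, d_ents: dict) -> float:
    """Equation 5: word-piece overlap plus entity overlap."""
    return sparse_dot(q_words, d_words) + sparse_dot(q_ents, d_ents)
```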

Where Do the Entities Come From?

A critical component of DyVo is the Entity Retriever (from Figure 2). The quality of the sparse representation depends entirely on finding the right candidates to score. The researchers explored several approaches:

1. Entity Linking (REL)

The traditional approach. An entity linker scans the text, identifies mentions (like “Paris”), and links them to Wikipedia IDs.

  • Pros: precise.
  • Cons: misses implicit entities. If a text discusses “The Big Apple,” a linker finds “New York City,” but might miss related concepts like “Wall Street” if they aren’t explicitly named.

2. Generative Retrieval (LLMs)

This is the state-of-the-art approach proposed in the paper. The researchers use Large Language Models (LLMs) like Mixtral or GPT-4 to “hallucinate” relevant entities.

They provide the LLM with the query and ask: “Identify Wikipedia entities that are helpful to retrieve documents relevant to this search query.”

This effectively uses the LLM’s vast internal knowledge to perform expansion. For a query about “NFTs,” an LLM can suggest “Non-fungible token,” “Ethereum,” and “Blockchain,” even if those words aren’t in the query.
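
As a rough illustration of this step (the model name, prompt framing, and output parsing here are assumptions, not the paper’s published setup), the generative retriever can be as simple as one call to a chat API:

```python
# Hypothetical sketch using the OpenAI Python client (pip install openai);
# assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def generate_candidate_entities(query: str) -> list:
    prompt = (
        "Identify Wikipedia entities that are helpful to retrieve documents "
        f"relevant to this search query: {query}\n"
        "Return one entity title per line."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

# generate_candidate_entities("what are NFTs")
# -> e.g. ["Non-fungible token", "Ethereum", "Blockchain"]
```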

Table 4: Qualitative comparison of entities retrieved by different systems. Note how GPT-4 identifies ‘Non-fungible token’ for the abbreviation ‘NFTs’, whereas BM25 and REL fail or return noise.

Table 4 highlights the difference. For the query about “NFTs,” the REL linker fails completely (linking “Next plc” and “Toronto”). BM25 finds random noise. GPT-4, however, correctly identifies “Non-fungible token,” “Cryptocurrency,” and “Bitcoin.”

Experimental Results

The researchers tested DyVo on three entity-rich datasets: TREC Robust04, TREC Core 2018, and CODEC.

Performance Gains

The results were clear: incorporating entities consistently improves retrieval performance.

Table 2: Results showing that DyVo (using REL, Mixtral, or GPT-4) outperforms the baseline LSR-w model across standard metrics like nDCG@10.

In Table 2, LSR-w represents the baseline model using only word pieces. DyVo consistently outperforms it. Notably, the generative approaches (Mixtral and GPT-4) yield the highest scores, indicating that using LLMs to infer relevant entities is more powerful than entity linking alone.

The Problem of Entity Collapse

Training these models wasn’t without challenges. One fascinating insight from the paper is the phenomenon of Entity Representation Collapse.

Because the entity embeddings come from a different source (e.g., Wikipedia2Vec or LaQue) than the trained BERT word embeddings, they live in different vector spaces. During the early stages of training, the model struggles to balance the two.

Figure 3: A graph showing entity representation collapse. The top graph shows the number of active entities dropping to zero early in training (around step 10 on the log-scaled axis).

As shown in Figure 3, without careful tuning, the model might decide to suppress the entities entirely, pushing their weights to zero (the sharp drop in the top graph). Once the weights hit zero, the ReLU activation kills the gradient, and the entity head “dies”—it stops learning forever.
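
A practical way to catch this failure mode early (a hypothetical monitoring hook, not something from the paper) is to track how many entity dimensions remain active across a batch:

```python
import torch

def active_entity_count(entity_weights: torch.Tensor, eps: float = 1e-6) -> int:
    """entity_weights: (batch, n_candidates) non-negative weights from the
    DyVo head. Returns how many candidate dimensions are non-zero anywhere
    in the batch; a steady slide toward zero signals collapse."""
    return int((entity_weights > eps).any(dim=0).sum())
```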

The researchers solved this by introducing the learnable scaling factor \(\lambda_{ent}\) mentioned in the math section, which helps stabilize the magnitudes of the two different vector spaces.

Does the Embedding Type Matter?

The researchers also asked: How should we represent the entities? They tested several embedding types:

  • Token Aggregation: Just averaging the word embeddings of the entity’s name.
  • Wikipedia2Vec: Pre-trained static embeddings.
  • LaQue / BLINK: Advanced transformer-based entity encoders.

While advanced encoders like BLINK provided the best performance, even the simple Wikipedia2Vec embeddings provided a significant boost over the baseline. This suggests that simply having unique dimensions for entities is the primary driver of success, rather than the specific nuance of the embedding vector.
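
The simplest of these options, token aggregation, is easy to sketch. This version averages the static input embeddings of the entity name’s word pieces; it is illustrative, and the paper’s exact aggregation may differ.

```python
# Requires: pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
embed = model.get_input_embeddings()  # static word-piece embedding table

def aggregate_entity_embedding(entity_name: str) -> torch.Tensor:
    ids = tokenizer(entity_name, add_special_tokens=False,
                    return_tensors="pt")["input_ids"][0]
    return embed(ids).mean(dim=0)  # (dim,) mean of the name's word pieces

who_vec = aggregate_entity_embedding("World Health Organization")
```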

Conclusion and Implications

DyVo represents a significant step forward in bridging the gap between lexical search (matching words) and semantic search (matching concepts). By treating Wikipedia entities as a dynamic extension of the model’s vocabulary, DyVo allows search engines to:

  1. Disambiguate complex terms (US vs. us).
  2. Update knowledge easily (by updating the entity database without re-training the whole model).
  3. Leverage LLMs for reasoning while keeping the efficiency of sparse retrieval (inverted indexes).

For students and researchers in Information Retrieval, DyVo illustrates a powerful design pattern: Decoupling the vocabulary from the model architecture. We are no longer limited to the tokens our model was pre-trained on; we can dynamically inject world knowledge precisely where it is needed.