Introduction

In the rapidly evolving world of Large Language Models (LLMs), we often focus on the sheer size of the models—billions of parameters trained on trillions of words. However, there is a fundamental component of these models that has remained surprisingly rigid: the vocabulary.

Think of a language model as a builder constructing a house (a sentence). The builder uses bricks (tokens) to create the structure. In the current paradigm, the size and shape of these bricks are determined before the builder even starts learning. Once the “static” vocabulary is defined by a tokenizer (like BPE or WordPiece), it is locked. The model must construct everything, from simple articles to complex technical terms, using this fixed set of bricks.

But what if the builder could instantly manufacture custom-sized blocks—entire walls or pillars—on demand?

This is the core proposition of the research paper “Generation with Dynamic Vocabulary” by Liu et al. The researchers propose a novel framework where the model’s vocabulary is not a static list, but a dynamic entity that can adapt based on the input text. By allowing the model to generate arbitrary text spans (phrases) as atomic units, they demonstrate significant improvements in generation quality, speed, and domain adaptability—all without the need for expensive retraining.

In this post, we will deconstruct how this dynamic vocabulary works, the clever architecture behind the “Dynamic Phrase Encoder,” and why this approach might change how we think about tokenization in the future.

Background: The Constraints of a Static Vocabulary

To appreciate the innovation of dynamic vocabulary, we first need to understand the limitations of the status quo. Modern LLMs (like GPT-4 or Llama) rely on a static vocabulary. Before training begins, a tokenizer analyzes a massive corpus to determine the most common sub-word units. These units become the model’s fixed vocabulary (usually between 30k and 100k tokens).

While effective, this approach has cracks in its foundation:

  1. Inefficiency: Generating a common phrase like “United States of America” might take multiple steps (one for each token), whereas a dynamic model could treat it as a single unit.
  2. Domain Rigidity: If a model trained on general text is moved to a medical or legal domain, its static vocabulary lacks the specific terminology. It has to break complex terms into tiny, meaningless sub-words, which degrades performance.
  3. Lack of Adaptability: Updating the vocabulary usually requires re-training the tokenizer and the model embeddings from scratch—a computationally prohibitive task.

The researchers argue that we need to relax these constraints. We need a system where the “bricks” can be defined on the fly.
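
To make the inefficiency point concrete, here is a quick check with an off-the-shelf BPE tokenizer (a minimal sketch using the Hugging Face transformers library; the exact split depends on the tokenizer, so treat the token list as illustrative):

```python
from transformers import AutoTokenizer

# GPT-2's BPE tokenizer stands in for a typical static vocabulary (~50k entries).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

phrase = "United States of America"
tokens = tokenizer.tokenize(phrase)

# A static-vocabulary model emits one token per decoding step,
# so this common phrase costs len(tokens) forward passes instead of one.
print(tokens)       # e.g. ['United', 'ĠStates', 'Ġof', 'ĠAmerica']
print(len(tokens))  # 4 steps for a single, extremely common phrase
```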

The Core Method: Generation with Dynamic Vocabulary

The researchers introduce a Dynamic Vocabulary (\(V'\)). This is composed of the original static vocabulary (\(V\)) plus a set of arbitrary phrases (\(P\)) that can change depending on the context.

\[V' = V \cup P\]

Crucially, these phrases act just like standard tokens. The model can look at its context and decide whether to output a standard token (like “the”) or a complex phrase (like “theatre production”) in a single time step.

1. The Architecture

The system operates on two kinds of input: the standard fixed token vocabulary and a dynamic set of phrases derived from the input context or from retrieved documents.

Figure 1: Generation with dynamic vocabulary. The model’s vocabulary dynamically changes based on the input text, with phrases serving as basic blocks both for input and output.

As shown in Figure 1, the process works as follows:

  • Left Side: We have a fixed token vocabulary (bottom) and a dynamic phrase vocabulary (top). In this example, phrases like “written by Mark Ravenhill” or “theatre production” are available.
  • Right Side: During generation, the model predicts the next unit, selecting the phrase “theatre production” as a single atomic step. The phrase is then fed back into the model as context for the following step.

2. The Dynamic Phrase Encoder

The biggest technical challenge is representing these variable phrases. You cannot have a lookup table for every possible phrase in the English language—it would be infinite.

The solution is the Dynamic Phrase Encoder. Instead of a static embedding table, the authors use a small neural network (initialized with a causal Transformer like GPT-2) to compute embeddings on the fly.

Here is the process for getting the vector representation of a phrase \(p\):

  1. Tokenize the phrase \(p\) using the standard static tokenizer.
  2. Feed it into the phrase encoder.
  3. Take the hidden state of the last token. This vector becomes the embedding for the entire phrase.

This creates a seamless bridge between the static and dynamic worlds. Because the phrase encoder uses the same tokenizer as the main Language Model (LM), there is no need for complex mapping or vocabulary switching during inference.
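
A minimal sketch of this forward pass, using an off-the-shelf GPT-2 from Hugging Face transformers to stand in for the (further trained) phrase encoder; the function name is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# The phrase encoder shares the static tokenizer with the main LM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
phrase_encoder = AutoModel.from_pretrained("gpt2")  # small causal Transformer
phrase_encoder.eval()

def embed_phrase(phrase: str) -> torch.Tensor:
    """Return a single vector representing the whole phrase."""
    # 1. Tokenize the phrase with the standard static tokenizer.
    enc = tokenizer(phrase, return_tensors="pt")
    # 2. Feed it through the phrase encoder.
    with torch.no_grad():
        hidden = phrase_encoder(**enc).last_hidden_state  # (1, seq_len, d)
    # 3. The hidden state of the last token becomes the phrase embedding.
    return hidden[0, -1]

vec = embed_phrase("theatre production")
print(vec.shape)  # torch.Size([768]) for GPT-2 small
```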

3. Inference: Expanding the Matrix

How does the LM know how to select these new phrases? In a standard Transformer, the output layer involves multiplying the hidden state by an output embedding matrix (\(\mathbf{W}_{\text{emb,out}}\)) to get logits for every token in the vocabulary.

With a dynamic vocabulary, the researchers simply expand this matrix.

Equation describing the expansion of input and output embedding matrices.

As seen in the equation above, the new input matrix \(\mathbf{W}'_{\text{emb,in}}\) is created by concatenating the original token embeddings \(\mathbf{W}_{\text{emb,in}}\) with the new phrase embeddings \(\mathbf{P}\) generated by the encoder. The same logic applies to the output matrix.
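
Spelled out in the notation above (row-wise stacking is assumed here, i.e. each row of an embedding matrix corresponds to one vocabulary entry; the paper’s exact formatting may differ):

\[\mathbf{W}'_{\text{emb,in}} = \begin{bmatrix} \mathbf{W}_{\text{emb,in}} \\ \mathbf{P} \end{bmatrix}, \qquad \mathbf{W}'_{\text{emb,out}} = \begin{bmatrix} \mathbf{W}_{\text{emb,out}} \\ \mathbf{P} \end{bmatrix}\]

where \(\mathbf{P}\) stacks the phrase embeddings produced by the dynamic phrase encoder, one row per phrase in \(P\).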

This allows the model to calculate probabilities for both standard tokens and dynamic phrases simultaneously.

Equation showing the probability calculation for the next token or phrase.

The probability equation above shows that the normalization term \(Z\) (the denominator of the softmax) now sums over both the static vocabulary \(V\) and the dynamic phrases \(P\). This means the model naturally weighs the option of generating a word vs. a phrase based on which fits the context best.
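
In symbols (again, the notation is assumed to parallel the matrix expansion above): if \(\mathbf{e}_w\) denotes the row of \(\mathbf{W}'_{\text{emb,out}}\) for token or phrase \(w\) and \(\mathbf{h}_t\) is the hidden state at step \(t\), then

\[p(w \mid y_{<t}) = \frac{\exp(\mathbf{h}_t^{\top}\mathbf{e}_w)}{Z}, \qquad Z = \sum_{w' \in V \cup P} \exp(\mathbf{h}_t^{\top}\mathbf{e}_{w'})\]

The following toy PyTorch sketch walks through one decoding step under these definitions (sizes and names are illustrative, not the paper’s implementation):

```python
import torch
import torch.nn.functional as F

d, vocab_size, num_phrases = 16, 100, 3       # toy sizes: hidden dim, |V|, |P|
W_emb_out = torch.randn(vocab_size, d)        # static output embedding matrix
P = torch.randn(num_phrases, d)               # phrase embeddings from the dynamic encoder

# Expanded output matrix: one row per token in V plus one row per phrase in P.
W_out_expanded = torch.cat([W_emb_out, P], dim=0)   # shape (|V| + |P|, d)

h_t = torch.randn(d)                          # LM hidden state at the current step
logits = W_out_expanded @ h_t                 # a score for every token AND every phrase
probs = F.softmax(logits, dim=0)              # Z normalizes over V ∪ P jointly

next_id = int(torch.argmax(probs))
if next_id >= vocab_size:
    # A phrase was selected: emit it as one atomic unit and feed its
    # tokens back into the context before the next step.
    print("selected phrase index:", next_id - vocab_size)
else:
    print("selected token id:", next_id)
```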

4. Training and Negative Sampling

Training the phrase encoder is where the “magic” happens, but it is also where the difficulty lies. If you simply add arbitrary phrases to the vocabulary and train with the standard objective, the model struggles to tell overlapping candidates apart.

Imagine the target phrase is “New York City”.

  • The model might predict just “New” (a token).
  • It might predict “New York” (a phrase prefix).
  • It might predict “New York City” (the correct phrase).

The model suffers from decoding ambiguity. It struggles to distinguish a phrase from its own prefix or a longer continuation. To solve this, the authors employ aggressive Negative Sampling. They need to teach the model to differentiate the correct phrase from “hard negatives”—phrases that look similar but are incorrect for the current step.

Figure 2: The overall architecture of our proposed dynamic vocabulary with negative sampling strategies.

Figure 2 illustrates the training pipeline. Notice the box labeled “Negative Phrases.” The authors utilize four sources of negatives:

  1. Pre-batch: Phrases from previous batches (standard practice).
  2. Corpus-Retrieval: Finding where the phrase appears in a massive corpus and taking the surrounding text as negatives.
  3. Self-Retrieval: Using other potential phrases found within the current sentence.
  4. Generation: Using a model to hallucinate plausible but incorrect continuations of a phrase, forcing the encoder to learn very specific boundaries.
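
To make the idea of hard negatives concrete, here is a schematic sketch in plain Python. The function names and span-extraction rules are my own, not the paper’s; corpus-retrieval and generation-based negatives are only noted in comments because they require a search index and a generator model:

```python
def prefix_negatives(target: str) -> list[str]:
    """Prefixes of the target phrase, e.g. 'New York' for 'New York City'.

    These mimic the decoding ambiguity described above: the model must learn
    that a prefix is not the same unit as the full phrase.
    """
    words = target.split()
    return [" ".join(words[:i]) for i in range(1, len(words))]


def self_retrieval_negatives(target: str, sentence: str, max_len: int = 4) -> list[str]:
    """Other word spans taken from the current sentence (self-retrieval)."""
    words = sentence.split()
    spans = []
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_len, len(words)) + 1):
            span = " ".join(words[i:j])
            if span != target:
                spans.append(span)
    return spans


def collect_hard_negatives(target: str, sentence: str, pre_batch: list[str]) -> list[str]:
    negatives = list(pre_batch)                               # pre-batch phrases
    negatives += self_retrieval_negatives(target, sentence)   # self-retrieval
    negatives += prefix_negatives(target)                     # prefix-style ambiguity
    # Corpus-retrieval and generation-based negatives would additionally need
    # an index over a large corpus and a model to write plausible continuations.
    return negatives


sentence = "He moved to New York City after the theatre production closed"
print(collect_hard_negatives("New York City", sentence, pre_batch=["Mark Ravenhill"]))
```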

The Loss Function

To ensure the model doesn’t become biased toward only generating phrases (or only tokens), the training loss combines standard language modeling loss with a specific “alignment” loss.

Equation for the KL divergence loss function.

This KL divergence loss (\(L_{kl}\)) aligns the distribution of the dynamic model with the standard model. Essentially, it ensures that even when the model uses dynamic phrases, its underlying probability distribution remains grounded in the fundamental language statistics of the base model.
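
In symbols, the combined objective plausibly takes the form

\[L = L_{lm} + \lambda \, L_{kl}\]

where \(L_{lm}\) is the usual next-unit cross-entropy over \(V \cup P\) and \(\lambda\) is a weighting coefficient; the weighting and the exact direction of the KL term are inferred from the description here rather than taken verbatim from the paper.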

Experiments and Results

The researchers put their dynamic vocabulary to the test across several dimensions: generation quality, efficiency, domain adaptation, and citation capability.

1. Basic Language Modeling Performance

They evaluated the model on the WikiText-103 dataset, a standard benchmark for language generation. They compared their method against standard Transformers as well as retrieval-augmented models like RETRO and kNN-LM.

Table 1: Automatic evaluation on WikiText-103 showing improvement in MAUVE, Diversity, and Latency.

Table 1 reveals impressive results:

  • Quality (MAUVE): The dynamic vocabulary model achieves a MAUVE score of 25.69, significantly higher than the standard Transformer (20.47). This metric correlates strongly with human judgment of text quality.
  • Diversity: The diversity score jumps to 47.44, indicating the model is less repetitive and more creative.
  • Latency: The dynamic model is also faster (0.99s vs 1.10s). By predicting multi-token phrases in a single step, it requires fewer forward passes to generate text of the same length.

2. Human Evaluation

Metrics are useful, but human preference is the gold standard.

Table 2: Human evaluation results showing preference for the dynamic model.

In Table 2, human annotators compared the outputs. The dynamic model was preferred (“Better”) in 57% of cases against the Transformer. It specifically excelled in informativeness and coherence, likely because generating whole phrases helps maintain the semantic thread of a sentence better than piecing it together word by word.

3. Sequence Compression

One of the most interesting theoretical advantages of this method is “Sequence Compression.” How much information can we pack into a given number of generation steps?

Table 3: Compression statistics on WikiText-103.

Table 3 shows that the dynamic model uses significantly fewer “atomic units” (101.38 vs 127.72) to represent the same sequence. It carries more bytes of information per token (5.54 vs 4.28). This efficiency is what drives the latency reduction mentioned earlier.

4. Domain Adaptation (Training-Free!)

This is perhaps the most practical application for real-world engineers. The authors took a model trained on general text (WikiText) and tested it on a completely different domain: legal documents (Law-MT).

Usually, a general model fails here because it doesn’t know legal jargon. However, the authors simply extracted legal phrases from the documents and added them to the dynamic vocabulary—without any fine-tuning of the model’s weights.

Table 6: Evaluation on Law-MT showing domain adaptation capabilities.

As shown in Table 6, the dynamic model (Ours) outperformed the standard Transformer—even one that was fine-tuned on the legal data! The MAUVE score hit 26.35 compared to the fine-tuned Transformer’s 23.06. This suggests that simply updating the vocabulary is a highly effective, lightweight strategy for domain adaptation.

5. Generating Citations

Finally, the authors explored using dynamic phrases for citations. In tasks like Question Answering, it is vital to cite sources. By associating specific phrases with specific source documents (e.g., “dynamic vocabulary[1]”), the model can generate the content and the citation simultaneously.

Table 7: Evaluation on ASQA for citation generation.

Table 7 demonstrates that this method drastically improves citation recall (9.76 vs 0.62) and precision compared to a standard TinyLlama model. The phrases act as “anchors” to the source material, ensuring that the generated answer remains faithful to the evidence.

Conclusion

The paper “Generation with Dynamic Vocabulary” presents a compelling shift in how we approach language modeling. By moving away from the rigid, static brick-laying of traditional tokenizers and embracing a dynamic, flexible system of phrases, we can unlock greater efficiency and adaptability.

Key Takeaways:

  1. Flexibility: Phrases can be added or removed from the vocabulary on the fly without retraining the model.
  2. Efficiency: Generating multi-token phrases in one step reduces latency.
  3. Adaptability: The model adapts to new domains (like Law) simply by updating its phrase list.
  4. Robustness: The Dynamic Phrase Encoder, trained with clever negative sampling, ensures the model seamlessly integrates these new phrases with standard tokens.

As LLMs continue to grow, efficient methods like this—which squeeze more performance and utility out of existing architectures—will be essential. Dynamic vocabulary turns the rigid walls of language models into modular, adaptable structures, ready for any domain or task.