Introduction

Large Language Models (LLMs) like GPT-4 and LLaMA are generalists. They can write a poem, solve a math problem, or summarize a history lesson with reasonable competence. However, when you drop these generalist models into a highly specialized environment—such as a law firm or a hospital—they often stumble. They lack the specific jargon and deep domain knowledge required to generate precise legal contracts or medical diagnoses.

To bridge this gap, researchers typically turn to Supervised Fine-Tuning (SFT) on domain-specific data. But there is an often-overlooked bottleneck: the vocabulary. General models use a vocabulary optimized for general text, so when they encounter specialized terms like “hemorrhoids” or complex legal statutes, they often break them into inefficient, fragmented sub-tokens.
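You can see this fragmentation for yourself with any Hugging Face tokenizer. A quick illustrative check (the model name and the exact split shown are assumptions; different tokenizers split differently):

```python
# Illustrative: how a general-purpose tokenizer fragments a medical term.
# The model name and the example split are assumptions, not from the paper.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tok.tokenize("hemorrhoids"))
# e.g. ['▁hem', 'orrh', 'oids'] -- several fragments instead of one token
```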

The standard solution has been to simply expand the vocabulary—stuffing the model with thousands of new domain-specific words. But is “more” always “better”?

A recent paper, “An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs,” suggests the answer is no. The researchers introduce a novel method called VEGAD (Vocabulary Expansion via GrADients). Instead of blindly adding words, VEGAD uses the model’s own gradients to identify which words are truly valuable, acting like a gold panner sifting through a river of text.

In this post, we will explore why vocabulary size isn’t everything, how VEGAD works under the hood, and why listening to a neural network’s gradients might be the key to building better domain-specific models.

The Problem: The “Goldilocks” Dilemma of Vocabulary

Before diving into the solution, we must understand the problem with current vocabulary expansion techniques.

When adapting an LLM to a new domain (e.g., Medicine), the standard workflow involves:

  1. Collecting a domain corpus.
  2. Using a tokenizer (like SentencePiece) to find frequent new words.
  3. Adding all these new words to the model.
  4. Fine-tuning.
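In code, the “add everything” version of this workflow is only a few lines with the Hugging Face transformers API. Here is a minimal sketch; the model name and candidate word list are placeholders:

```python
# Minimal sketch of the standard "add all frequent words" workflow.
# Model name and candidate words are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Step 2: pretend a SentencePiece/segmenter run produced these candidates.
new_words = ["hypertension", "hemorrhoids", "angioplasty"]

# Step 3: add all of them, then resize the embeddings (and LM head) to match.
num_added = tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))

# Step 4 would be ordinary supervised fine-tuning on the domain corpus.
print(f"Added {num_added} tokens; vocabulary is now {len(tokenizer)}")
```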

The assumption is that a larger vocabulary improves encoding efficiency (sequences get shorter) and semantic understanding. However, the researchers conducted a pilot study that challenges this assumption.

Figure 1: Pilot study: relative improvement over direct supervised fine-tuning when adding vocabularies of different sizes.

As shown in Figure 1 above, the researchers plotted the performance improvement of a model against the size of the added vocabulary.

  • The Orange Line represents the domain task performance. Notice how it peaks around a vocabulary size of 2,242 and then drops as more words are added.
  • The Yellow and Blue Lines represent general capabilities (Math and Average performance). These plummet as the vocabulary grows too large.

This phenomenon reveals a “Goldilocks” zone. If you add too few words, you miss out on efficiency. If you add too many, the model suffers from Catastrophic Forgetting—it gets so confused by the massive influx of new, potentially redundant tokens that it starts forgetting its basic pre-training (like how to do math).

The challenge, therefore, is not just expansion, but selection. How do we automatically find that optimal subset of words that boosts domain performance without breaking the model’s general intelligence?

Introducing VEGAD

To solve this, the authors propose VEGAD. The core philosophy of VEGAD is that not all frequent words are important. A word should only be added to the vocabulary if the model struggles to process it using its existing tokens.

Figure 2: Framework of VEGAD.

Figure 2 illustrates the VEGAD framework. It follows a logical pipeline:

  1. Text Segmentation: Break domain data into candidate words.
  2. Trie Construction: Organize these candidates into a data structure for efficient searching.
  3. Gradient Calculation: Pass domain data through the General LLM and measure the gradients.
  4. Selection: Filter words based on high gradient values (high “impact”).
  5. Resize & Fine-Tune: Update the model with this curated vocabulary.

The key innovation here is step 3: utilizing gradients as a metric for importance. In deep learning, a large gradient implies that the model’s parameters need to change significantly to accommodate a specific input. If a candidate word triggers large gradients, it means the model considers that specific sequence of characters crucial and difficult to handle with its current vocabulary.

The Core Method: How It Works

Let’s break down the technical mechanics of VEGAD, specifically focusing on how it calculates these all-important gradients.

1. Building the Trie

First, the system needs a list of “candidate words” that might be worth adding. The researchers use a segmentation tool (like Jieba for Chinese) to generate a massive list of potential words from the domain corpus.

They then organize these words into a Trie (or prefix tree). A Trie is a tree-like data structure that is extremely fast at checking if a sequence of tokens matches a known word.
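As a rough illustration, here is a minimal trie keyed by sub-token IDs (the token IDs below are made up; a real implementation would insert the tokenizer’s actual output for each candidate word):

```python
# A minimal trie over sub-token id sequences. Candidate words are inserted
# as lists of token ids so matches can be checked during a corpus scan.
class TrieNode:
    def __init__(self):
        self.children = {}   # token id -> TrieNode
        self.word = None     # set at the node that ends a candidate word

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids, word):
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, TrieNode())
        node.word = word     # mark the end of a complete candidate

    def match(self, token_ids):
        """Return the candidate word if token_ids exactly spell one."""
        node = self.root
        for t in token_ids:
            if t not in node.children:
                return None
            node = node.children[t]
        return node.word

# E.g., if "hypertension" tokenizes to [812, 4711] (illustrative ids):
trie = Trie()
trie.insert([812, 4711], "hypertension")
assert trie.match([812, 4711]) == "hypertension"
```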

2. Gradient Calculation

This is the heart of the algorithm. The goal is to calculate a “score” for every candidate word in the Trie.

The researchers feed domain sentences into the pre-trained LLM. As the data flows through the model, they track the gradients at two specific points: the Embedding Layer (input) and the Language Modeling (LM) Head Layer (output).

Figure 3: Gradient calculation for each candidate word. Given the Trie built from the candidate vocabulary, a pointer checks whether a sub-sequence of the input and output tokens matches a path from the root of the Trie to a leaf node; the pointer’s trace is illustrated by \(V_i\) and the “pseudo-leaf node.” Finally, the top K words with the largest gradients are selected to construct the new vocabulary and used to resize the embedding layer and language modeling head layer.

As seen in Figure 3, the process works as follows:

  1. Forward Pass: The input tokens (e.g., “In the hospital…”) pass through the Transformer layers.
  2. Backward Pass: The model calculates the loss (error) and propagates gradients back through the network.
  3. Mapping: The system uses the Trie to map sequences of tokens back to the “candidate words.” For example, if the tokens “hyper” and “tension” appear, the Trie recognizes them as the candidate word “hypertension.”
  4. Accumulation: The system sums up the gradients associated with the tokens that make up that candidate word.

The mathematical formulation for the gradient of a specific word \(w\) is:

\[ G_w \;=\; G^{embed} + G^{lmhead} \;=\; \sum_{t \in \mathcal{T}(w)} \left\lVert \nabla_{e_t}\mathcal{L} \right\rVert \;+\; \sum_{t \in \mathcal{T}(w)} \left\lVert \nabla_{h_t}\mathcal{L} \right\rVert \]

where \(\mathcal{T}(w)\) is the set of existing sub-tokens that make up \(w\), \(e_t\) and \(h_t\) are the embedding and LM head rows for token \(t\), and \(\mathcal{L}\) is the loss.

Here, the score \(G_w\) is the sum of two parts:

  1. Embedding Gradients (\(G^{embed}\)): How much the model wants to adjust the input representation of these tokens.
  2. LM Head Gradients (\(G^{lmhead}\)): How much the model wants to adjust its output predictions for these tokens.

By summing these up (using norms), VEGAD assigns a single scalar value to every candidate word representing its “importance.”
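To make the scoring concrete, here is a minimal PyTorch sketch of the accumulation step under the definitions above. The gradient tensors are faked with random values (and shrunk to toy sizes) so the snippet runs standalone; in a real run they come from the model’s weights after loss.backward():

```python
import torch

def word_gradient_score(sub_token_ids, embed_grad, lmhead_grad):
    """G_w: sum of embedding and LM-head gradient norms over a word's sub-tokens.

    embed_grad / lmhead_grad have shape (vocab_size, hidden_dim): the gradient
    of the loss w.r.t. the embedding matrix and the LM head weight matrix.
    """
    g_embed = sum(embed_grad[t].norm().item() for t in sub_token_ids)
    g_lmhead = sum(lmhead_grad[t].norm().item() for t in sub_token_ids)
    return g_embed + g_lmhead

# In practice, after loss.backward() you would read, e.g.:
#   embed_grad  = model.get_input_embeddings().weight.grad
#   lmhead_grad = model.get_output_embeddings().weight.grad
# Faked with toy shapes here so the demo is self-contained:
vocab_size, hidden_dim = 1000, 64
embed_grad = torch.randn(vocab_size, hidden_dim)
lmhead_grad = torch.randn(vocab_size, hidden_dim)

# Suppose "hypertension" tokenizes to sub-token ids [812, 471] (made up):
print(f"G_w = {word_gradient_score([812, 471], embed_grad, lmhead_grad):.2f}")
```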

3. Why the LM Head Matters

Most previous vocabulary expansion methods focused solely on the embedding layer. However, VEGAD explicitly includes the LM Head. The authors argue that for text generation tasks, the output layer is just as critical as the input. If the model knows what a word means (embedding) but doesn’t know how to predict it as the next token (LM Head), the expansion is incomplete.

4. Efficiency Optimization

Scanning a massive text corpus against a massive candidate vocabulary can be slow. To make this feasible, the authors utilize the Aho-Corasick Algorithm.

Figure 7: Aho-Corasick Algorithm. The fail pointers are highlighted in blue.

As shown in Figure 7, Aho-Corasick uses “fail pointers” (the blue arrows). This allows the algorithm to scan the text in a single pass. If a match fails (e.g., you match “b-c” but the next letter isn’t “a”), the pointer instantly jumps to the next longest possible valid prefix, rather than restarting the scan from scratch. This optimization significantly speeds up the gradient accumulation process.
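For a feel of this single-pass matching, here is a tiny example using the off-the-shelf pyahocorasick library over plain strings. This illustrates the idea only; VEGAD applies the equivalent scan over token sequences, and the word list is made up:

```python
# Single-pass multi-pattern matching with pyahocorasick
# (pip install pyahocorasick). Word list is illustrative.
import ahocorasick

automaton = ahocorasick.Automaton()
for word in ["hyper", "tension", "hypertension"]:
    automaton.add_word(word, word)
automaton.make_automaton()  # builds the fail pointers shown in Figure 7

text = "patients with hypertension require monitoring"
for end_idx, word in automaton.iter(text):  # one pass over the text
    print(f"matched {word!r} ending at index {end_idx}")
```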

Experiments and Results

The researchers validated VEGAD on two distinct domains: Law (using an Article QA dataset) and Medicine (using CMedQA and CMDD datasets). They compared VEGAD against standard baselines, including direct SFT (no vocab expansion), “Jieba” (adding all words found by the segmenter), and SentencePiece (SPM).

Performance on Domain Tasks

In the legal domain, VEGAD demonstrated a clear advantage.

Table 1: Results on Article QA of legal domain.

Looking at Table 1:

  • VEGAD achieves the highest scores in BLEU (28.58) and ROUGE-L (26.96).
  • Jieba (adding the full vocabulary) performs well but slightly worse than VEGAD.
  • SFT (no expansion) lags behind significantly.

This confirms that expanding the vocabulary helps, but expanding it selectively using VEGAD helps even more.

The results in the medical domain (specifically the CMDD dataset) were even more striking regarding the balance between learning and forgetting.

Table 4: Results on CMDD of medical domain.

In Table 4, VEGAD achieves the highest BLEU (5.84) and ROUGE scores. It clearly outperforms the “Jieba” baseline, which blindly adds medical terms.

Mitigating Catastrophic Forgetting

Perhaps the most important finding of this paper is not just that VEGAD learns the domain better, but that it remembers everything else better.

When you fine-tune a model on medical data, it often becomes “dumber” at general tasks like math or safety compliance.

  • GSM8K (Math): In Table 4’s GSM8K column, the General LLM starts with a score of 23.55.
  • SFT drops this to 22.37.
  • VEGAD maintains a score of 23.31 (almost no loss).
  • Jieba drops slightly, while other methods like SPM often cause far larger drops on other metrics (see Table 5 in the paper for relative drops).

This implies that because VEGAD selects only the most gradient-intensive (necessary) words, it disturbs the model’s original weights less than methods that flood the model with thousands of new tokens.

Ablation Study: The Importance of the LM Head

Was adding the gradient calculation from the LM Head actually necessary? The authors performed an ablation study to find out.

Figure 6: Ablation study on the gradient of the LM Head layer.

Figure 6 shows the improvement percentage when including the LM Head (Green) vs. excluding it (Brown).

  • For Domain-ROUGE (text generation quality), including the LM Head yields a 6.86% improvement, compared to just 1.01% without it.
  • This validates the hypothesis that for generative tasks, we must pay attention to how the model outputs the new vocabulary, not just how it reads it.

The “Sweet Spot” of Vocabulary Size

Finally, the researchers revisited the vocabulary size question using their new method.

Figure 4: Relative improvement of VEGAD over direct SFT when adding vocabularies of different sizes.

Figure 4 plots the relative improvement of VEGAD as the number of added words increases.

  • The Orange Line (Domain) peaks around 2,500 words.
  • Critically, look at the right side of the graph, where the “Jieba” label sits (representing adding ~4,600 words). Performance actually decreases or flattens out compared to the peak.
  • The Yellow Line (Math/GSM8K) crashes hard as the vocabulary size approaches the full set (Jieba).

This visually confirms the core thesis: Expansion with only a subset of the vocabulary leads to superior performance.

Conclusion and Implications

The “more is better” approach to data and model size has dominated AI for years. However, “An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs” reminds us that precision often beats volume.

By treating vocabulary selection as an optimization problem—guided by the model’s own gradients—VEGAD achieves a “best of both worlds” scenario:

  1. Higher Domain Competence: The model learns the difficult, high-impact jargon necessary for the job.
  2. Lower Catastrophic Forgetting: The model retains its general reasoning abilities because it isn’t being flooded with unnecessary tokens.

For students and practitioners working on domain-specific LLMs, the takeaway is clear: Don’t just dump a dictionary into your tokenizer. Listen to your model—it knows which words it needs to learn.