Large Language Models (LLMs) are often treated as black boxes, but their foundation lies in a process that is surprisingly simple: tokenization. Before a model can understand “artificial intelligence,” it must break that text down into smaller chunks, or tokens. For years, the industry standard has been Byte-Pair Encoding (BPE), a reliable algorithm that merges frequent characters into subwords.
However, reliable doesn’t always mean efficient. Standard BPE has a hoarding problem. It creates and keeps “intermediate” tokens—fragments of words that are necessary to build larger words but are useless on their own. These “junk tokens” clutter the vocabulary, wasting valuable parameters and potentially degrading model performance.
In this post, we will explore a new method proposed by researchers called Picky BPE. This algorithm refines the vocabulary during training, identifying and removing useless tokens to make room for more meaningful ones. We will dive into how it works, the mathematics behind it, and why “being picky” might be the key to safer, more efficient language models.
The Problem: Vocabulary Bloat and “Junk” Tokens
To understand Picky BPE, we first need to look at how standard BPE works and where it fails. BPE starts with a vocabulary of individual characters (like a, b, c). It iteratively finds the most frequent pair of adjacent tokens in the text and merges them into a new token.
For example, if “t” and “h” appear together often, BPE merges them to create “th”. If “th” and “e” appear together often, it merges them to create “the”.
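To make this merge loop concrete, here is a minimal, self-contained sketch of one BPE training step on a toy corpus (an illustration of the idea, not any particular library's implementation):

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent token pairs over a corpus of (tokens, frequency) entries
    and return the most frequent one."""
    pair_counts = Counter()
    for tokens, freq in corpus:
        for a, b in zip(tokens, tokens[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of `pair` with the merged token."""
    merged = pair[0] + pair[1]
    new_corpus = []
    for tokens, freq in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        new_corpus.append((out, freq))
    return new_corpus

# Toy corpus: words split into characters, with their counts.
corpus = [(list("the"), 50), (list("then"), 20), (list("with"), 10)]
pair = most_frequent_pair(corpus)   # ('t', 'h')
corpus = merge_pair(corpus, pair)   # every adjacent "t", "h" becomes "th"
```

Standard BPE simply repeats these two steps until the vocabulary reaches its target size, keeping every token it ever creates.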
The problem arises because BPE keeps every token it creates. To build the word “Kentucky,” BPE might first merge “K” and “entucky” (assuming “entucky” was already formed). Once “Kentucky” is formed, the token “entucky” remains in the vocabulary. But when was the last time you used the word “entucky” in a sentence? Likely never. It is an intermediate token—useful for the merge, but redundant afterward.
This redundancy is illustrated perfectly in the figure below.

In a fixed-size vocabulary (e.g., 32,000 tokens), every slot taken by a junk token like “entucky” is a slot denied to a useful token, like a full word or a common prefix. This leads to vocabulary inefficiency.
Furthermore, these rare, junk tokens are dangerous. Because they appear so infrequently in the training data (outside of the words they construct), the model barely learns them. These under-trained tokens can become “glitch tokens”—inputs that cause the model to hallucinate or bypass safety guardrails because the model effectively has no idea what they mean.
The Solution: Picky BPE
The researchers propose Picky BPE to solve this by introducing a “garbage collection” mechanism into the training process. Unlike previous methods that try to prune the vocabulary after training (which can be messy and heuristic-heavy), Picky BPE decides whether to keep or discard a token the moment a merge happens.
The Metric: Intersection over Self (IoS)
How does the algorithm know if a token is junk? The researchers introduced a metric called Intersection over Self (IoS).
The intuition is simple: if a token \(x_1\) appears almost exclusively as part of a larger token \(x_1 + x_2\), then \(x_1\) is likely just a building block that is no longer needed on its own.
Mathematically, if we are merging two tokens, \(x_1\) and \(x_2\), we calculate the IoS for both. Here is the formula for the first token, \(x_1\):

\[ \text{IoS}(x_1 \mid x_1, x_2) = \frac{f_p(x_1, x_2)}{f_t(x_1)} \]
And for the second token, \(x_2\):

\[ \text{IoS}(x_2 \mid x_1, x_2) = \frac{f_p(x_1, x_2)}{f_t(x_2)} \]
In these equations:
- \(f_p(x_1, x_2)\) is the frequency of the pair (how often they appear together).
- \(f_t(x_1)\) is the total frequency of the token \(x_1\) anywhere in the text.
The result is a value between 0 and 1. An IoS of 1.0 means the token \(x_1\) never appears in the text except as part of the pair with \(x_2\). An IoS of 0.9 means 90% of its occurrences are inside that pair.
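In code, the metric is just a ratio of two counts. A minimal sketch (the function name and the example counts are hypothetical, not from the paper):

```python
def ios(pair_freq, token_freq):
    """Intersection over Self: the fraction of a token's occurrences
    that happen inside the pair currently being merged (f_p / f_t)."""
    return pair_freq / token_freq if token_freq else 0.0

# Hypothetical counts: "entucky" occurs 1,000 times overall,
# 990 of them immediately after "K".
print(ios(pair_freq=990, token_freq=1_000))  # 0.99 -> almost never on its own
```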
The Algorithm in Action
The Picky BPE algorithm introduces a hyperparameter called the threshold (\(\tau\)). This controls how “picky” the algorithm is.
- If \(\text{IoS} \ge \tau\), the token is removed from the vocabulary.
- A higher \(\tau\) (e.g., 0.9) is conservative, only removing clearly useless tokens.
- A lower \(\tau\) (e.g., 0.6) is aggressive, removing tokens that might still have some independent use.
Here is the high-level view of the training algorithm:

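The paper's figure lays the procedure out step by step. As a rough, simplified sketch in code (reusing the `merge_pair` helper from the BPE sketch above and glossing over several details of the full procedure), the loop might look like this:

```python
from collections import Counter

def picky_bpe_train(corpus, vocab_size, tau):
    """Schematic Picky BPE training loop (a simplified sketch, not the authors' code).

    corpus:     list of (token_list, frequency) pairs, initially split into characters.
    vocab_size: target vocabulary size.
    tau:        IoS threshold; a parent token with IoS >= tau at merge time is removed.
    """
    vocab = {ch for tokens, _ in corpus for ch in tokens}
    events = []  # chronological log of ("merge", pair) and ("remove", token) events

    while len(vocab) < vocab_size:
        # 1. Count individual tokens and adjacent pairs on the current segmentation.
        pair_freq, token_freq = Counter(), Counter()
        for tokens, freq in corpus:
            for t in tokens:
                token_freq[t] += freq
            for a, b in zip(tokens, tokens[1:]):
                pair_freq[(a, b)] += freq
        if not pair_freq:
            break

        # 2. Merge the most frequent pair and record the event.
        (x1, x2), fp = pair_freq.most_common(1)[0]
        vocab.add(x1 + x2)
        events.append(("merge", (x1, x2)))
        corpus = merge_pair(corpus, (x1, x2))  # same replacement step as in plain BPE

        # 3. Picky step: IoS = pair frequency / parent frequency,
        #    computed on the counts gathered just before the merge.
        for parent in (x1, x2):
            if parent in vocab and fp / token_freq[parent] >= tau:
                vocab.discard(parent)
                events.append(("remove", parent))
        # (Further bookkeeping from the full procedure is glossed over in this sketch.)

    return vocab, events
```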
Let’s visualize this with a concrete example involving the word fragments around “would,” “could,” and “should.”

In the diagram above:
- We have the token “ould”.
- It merges with “w” to form “would”. But “ould” is also used in “could” and “should,” so its IoS relative to “would” is low (0.4). It stays.
- It merges with “c” to form “could”. Now “ould” is mostly accounted for, but still used in “should.” IoS is 0.5. It stays.
- Finally, it merges with “sh” to form “should”. At this point, “ould” appears almost nowhere else in the text except in these three words. Its IoS spikes to 0.9.
- Since \(0.9 \ge \tau\) (threshold), “ould” is removed from the vocabulary.
This dynamic process ensures that a token is only deleted once it has served its purpose across the entire dataset.
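Plugging hypothetical counts into the `ios` helper from earlier reproduces that last step:

```python
# Hypothetical counts at the time of the "sh" + "ould" merge:
# "ould" occurs 1,000 times in total, 900 of them right after "sh".
print(ios(pair_freq=900, token_freq=1_000))   # 0.9 >= tau  ->  "ould" is removed
```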
Crucial Detail: Inference Order
One of the cleverest aspects of Picky BPE is how it handles reading text (inference) after training. Standard pruning methods often create inconsistencies. If you delete a token after training, you might accidentally change how a sentence is split, hurting text compression.
Picky BPE records every event—every Merge and every Remove—in a strict chronological order. When the tokenizer processes new text, it replays these events. It merges pairs, but if it encounters a “Remove” event for a token it just created, it knows that token is no longer valid for future merges in that specific sequence. This strictly adheres to the training logic, ensuring high-quality compression.
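A rough sketch of that replay, assuming an event log like the one produced in the training sketch above (details such as what happens to a removed token that survives to the end of the word are glossed over):

```python
def tokenize(word, events):
    """Replay training events in order on a character-split word.

    `events` holds ("merge", (x1, x2)) and ("remove", token) entries
    in the exact order they happened during training."""
    tokens = list(word)
    invalid = set()
    for kind, payload in events:
        if kind == "remove":
            invalid.add(payload)      # this token may not join future merges
            continue
        x1, x2 = payload
        if x1 in invalid or x2 in invalid:
            continue
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == x1 and tokens[i + 1] == x2:
                out.append(x1 + x2)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens
```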
Analyzing the Impact
So, Picky BPE cleans up the vocabulary. But does this actually help the model? The researchers analyzed this from three angles: “glitch” tokens, translation performance, and text compression.
1. Eliminating “Glitch Tokens”
Under-trained tokens are a known security risk in LLMs. These are tokens that appear so rarely in training data that their embedding vectors (the mathematical representation of the token’s meaning) are not updated enough. They often end up with very small “L2 norms” (magnitude of the vector) and can trigger erratic behavior.
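As a rough illustration of the kind of check involved (a sketch with made-up names and toy data, not the authors' analysis code), one can rank embedding rows by their L2 norm and inspect the smallest ones:

```python
import numpy as np

def flag_suspect_tokens(embedding_matrix, id_to_token, k=20):
    """Rank tokens by the L2 norm of their embedding row;
    unusually small norms hint that a token was rarely updated."""
    norms = np.linalg.norm(embedding_matrix, axis=1)
    suspect_ids = np.argsort(norms)[:k]          # k smallest norms
    return [(id_to_token[i], float(norms[i])) for i in suspect_ids]

# Usage with toy data: 6 tokens, 4-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))
emb[3] *= 0.01                                   # simulate an under-trained row
print(flag_suspect_tokens(emb, ["the", "ing", "K", "entucky", "a", "ould"], k=2))
```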
The researchers compared the token embeddings of a standard BPE model against a Picky BPE model.
First, let’s look at the distribution for standard BPE (\(\tau = 1.0\), meaning no removals).

In the plot above (Figure a), the x-axis is the L2 norm (how well-trained the token is) and the y-axis is frequency. The orange dots represent the tokens that Picky BPE identifies as junk. Notice where they are clustered: bottom left. They are low frequency and low L2 norm. These are exactly the dangerous, under-trained tokens we want to get rid of.
Now, let’s see what happens when we enable Picky BPE with a threshold of \(\tau = 0.9\).

In this plot (Figure b), the pink region represents new tokens that were added to fill the space left by the removed junk. These new tokens are higher frequency and have higher L2 norms.
The takeaway: Picky BPE effectively swaps out dangerous, useless tokens for frequently used, well-trained ones. This likely reduces hallucinations and improves model safety.
2. Text Compression
A common fear with reducing vocabulary is that text compression will suffer. If you remove the token “entucky,” you might have to represent “Kentucky” as ["K", "e", "n", "t", "u", "c", "k", "y"] (8 tokens) instead of ["K", "entucky"] (2 tokens). If text requires more tokens to represent the same sentence, the model becomes slower and less efficient.
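That fear is easy to quantify: compare token counts for the same text under both segmentations (a toy illustration of the arithmetic, not the paper's evaluation code):

```python
def relative_sequence_length(segmented_new, segmented_baseline):
    """Ratio of token counts; values above 1.0 mean the new tokenizer
    needs more tokens for the same text (worse compression)."""
    return len(segmented_new) / len(segmented_baseline)

baseline = ["K", "entucky"]                               # 2 tokens
worst_case = ["K", "e", "n", "t", "u", "c", "k", "y"]     # 8 tokens, fully split
print(relative_sequence_length(worst_case, baseline))     # 4.0 -> much worse
```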
However, because Picky BPE is dynamic, it doesn’t just delete; it makes room for new merges. The researchers found that Picky BPE maintains excellent compression rates.

In the table above, a compression score of 1.000 is the baseline. Scores below 1.000 indicate better compression (fewer tokens needed). For English-German (EN-DE) vocabularies, Picky BPE (\(\tau < 1.0\)) achieved scores very close to, or even better than, the baseline. This is a significant improvement over other vocabulary trimming methods, which typically increase sequence length by over 10%.
3. Downstream Performance: Machine Translation
Finally, does the model translate better? The researchers tested Picky BPE on English-German, German-Estonian, and Ukrainian-Estonian translation tasks.
They found that Picky BPE models performed on par with or better than standard BPE models across the board. In difficult translation pairs (like German-Estonian), the improved vocabulary efficiency led to higher COMET scores (a metric for translation quality).
Specifically, by removing intermediate junk, the tokenizer could afford to add more word-initial tokens (tokens that start a word, usually marked with a leading underscore, as in _word).

As shown in Table 17, as the threshold \(\tau\) is lowered (making removal more aggressive), the percentage of word-initial tokens increases. Word-initial tokens are generally more semantically meaningful than mid-word fragments, suggesting the vocabulary is becoming higher quality.
Conclusion
Tokenization is the silent workhorse of modern AI, yet it remains riddled with inefficiencies. The Picky BPE algorithm demonstrates that we don’t need to accept “junk” tokens as a necessary evil of the training process.
By utilizing the Intersection over Self (IoS) metric, Picky BPE provides a mathematically sound way to distinguish between a useful subword and a redundant artifact. The results are clear:
- Cleaner Vocabularies: Redundant tokens are removed.
- Higher Efficiency: Freed-up slots are filled with meaningful, high-frequency tokens.
- Improved Safety: The “glitch tokens” responsible for model instability are largely eliminated.
- No Downside: Compression and translation performance are maintained or improved.
As Language Models continue to grow in size and cost, every parameter counts. Methods like Picky BPE offer a “free lunch”—a way to improve model quality and safety simply by being a little more selective about what goes into the dictionary.