If you have ever trained a modern Natural Language Processing (NLP) model, you have likely used a subword tokenizer. Whether it is Byte-Pair Encoding (BPE), WordPiece, or UnigramLM, tokenization is the invisible foundation on which our massive language models are built. We often treat tokenization as a solved preprocessing step: a static lookup that turns text into IDs.
But what if the way we feed tokens to a model during training is limiting its potential?
In the quest to make models more robust, researchers have turned to subword regularization—a technique where we intentionally “break” the canonical tokenization of a word to show the model different representations of the same text. While effective, new research suggests that the most popular methods for doing this (like BPE-Dropout) are mathematically flawed. They are heavily biased, restricting the model from seeing the full diversity of language structure.
In this post, we are doing a deep dive into the research paper “Distributional Properties of Subword Regularization.” We will explore why current stochastic tokenizers are biased, the graph-theory-based solution proposed by the authors (Uniform Sampling), and why this simple switch can immediately improve Machine Translation performance.
The Hidden Bias in Subword Regularization
To understand the solution, we first have to understand the problem.
How Deterministic Tokenization Works
Standard algorithms like BPE are deterministic. They maximize compression. If you feed the word tokenization into a trained BPE tokenizer, it will always output the same split, perhaps:
to ken ization
This is efficient, but it creates a dependency. The model over-relies on this specific sequence. If the model encounters a typo or a rare morphological variation in the wild, it might fail because it never learned the sub-components of the word in different contexts.
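To make the greedy process concrete, here is a minimal, deliberately simplified Python sketch of BPE encoding (it applies each rule exhaustively in priority order; it is not the paper's implementation, and the merge rules are invented purely so that the example word splits into to ken ization):

```python
# Minimal sketch of greedy BPE encoding (illustrative, not the paper's code).
def bpe_encode(word, merges):
    """Apply the ordered merge rules exhaustively, highest priority first."""
    tokens = list(word)  # start from individual characters
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]  # merge the adjacent pair
            else:
                i += 1
    return tokens

# Hypothetical merge rules, ordered by priority, chosen only for this example.
merges = [("t", "o"), ("k", "e"), ("ke", "n"), ("i", "z"), ("a", "t"),
          ("i", "o"), ("io", "n"), ("at", "ion"), ("iz", "ation")]

print(bpe_encode("tokenization", merges))  # ['to', 'ken', 'ization'], every time
```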
The Rise of Stochastic Tokenization (Dropout)
To fix this, researchers introduced randomness, known as subword regularization. The most common method is BPE-Dropout.
In BPE-Dropout, during the tokenization process, the algorithm randomly skips each merge operation with probability \(p\). This forces the tokenizer to fall back to smaller subwords.
- Canonical: to ken ization
- Dropout version 1, for example: to ken iz ation
- Dropout version 2, for example: to ke n iz at ion
This acts as data augmentation and regularization. It makes the model robust to noise and helps it learn the compositionality of words.
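In code, BPE-Dropout is essentially a one-line change to the sketch above: before each individual merge application, flip a biased coin and skip the merge with probability \(p\). A hedged sketch, reusing the hypothetical merges list from before:

```python
import random

# Sketch of BPE-dropout: identical to bpe_encode above, except that every
# individual merge application is skipped with probability p (illustrative).
def bpe_dropout_encode(word, merges, p=0.1):
    tokens = list(word)
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right and random.random() >= p:
                tokens[i:i + 2] = [left + right]  # merge survives the coin flip
            else:
                i += 1  # either no match, or the merge was dropped
    return tokens

# Repeated calls yield different segmentations of the same word:
# bpe_dropout_encode("tokenization", merges, p=0.3)
```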
The Problem: It’s Not Random Enough
Here is the core insight of the paper: Just because a process is random doesn’t mean it covers the search space evenly.
BPE-Dropout injects noise post-hoc into the greedy merge algorithm. It does not look at all possible ways to segment a word and pick one. It just randomly breaks the “best” way. The authors found that this results in a heavily biased distribution. Even with high dropout rates, the tokenizer heavily prefers a small set of segmentations close to the canonical one, while completely ignoring thousands of other valid segmentations.

As shown in the table above, for the word “tokenization,” BPE-Dropout assigns a massive 97.77% probability to the canonical split. The next most common split gets only 1.89%. The vast majority of valid ways to break down this word get near-zero probability.
If the goal of regularization is to expose the model to unique contexts and augment the data, this bias is artificially limiting the effectiveness of the training.
The Solution: Uniform Sampling via Lattices
The authors propose a theoretically rigorous alternative: Uniform Sampling.
Instead of randomly breaking the BPE algorithm, we should define the “search space” of all possible tokenizations for a given word using the available vocabulary, and then pick one path from that space with equal probability.
Step 1: Visualizing Words as Graphs
To achieve this, we move from simple string matching to graph theory. We can represent the tokenization process using Finite State Transducers (FSTs).
- The Vocabulary Transducer (\(\mathcal{T}\)): We create a graph that represents every subword in our vocabulary.
- The Word Automaton (\(\mathcal{A}\)): We represent the input word (e.g., “ababc”) as a linear sequence of characters.
- The Composition (\(\mathcal{A} \circ \mathcal{T}\)): By combining these, we create a Lattice.
The Lattice is a Directed Acyclic Graph (DAG) where the start node is the beginning of the word, the end node is the end of the word, and every path from start to end represents a valid tokenization using the vocabulary.

In Figure 1 above:
- (a) shows the input word “ababc”.
- (b) shows the transducer for the vocabulary (all known subwords).
- (c) is the resulting Lattice. Every path through this graph is a valid way to tokenize “ababc”.
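The paper builds this lattice by composing finite-state machines; as a rough approximation, the pure-Python sketch below constructs an equivalent DAG directly with substring lookups. The word and vocabulary are illustrative, not taken from the paper's experiments:

```python
# Build a tokenization lattice as a DAG over character positions 0..len(word).
# An edge (i -> j) labelled word[i:j] exists whenever word[i:j] is in the
# vocabulary; every path from node 0 to node len(word) is a valid tokenization.
def build_lattice(word, vocab):
    n = len(word)
    edges = {i: [] for i in range(n + 1)}  # edges[i] = list of (j, subword)
    for i in range(n):
        for j in range(i + 1, n + 1):
            piece = word[i:j]
            if piece in vocab:
                edges[i].append((j, piece))
    return edges

vocab = {"a", "b", "c", "ab", "bc", "abc"}  # hypothetical subword vocabulary
lattice = build_lattice("ababc", vocab)
```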
Step 2: Sampling Without Bias
Once we have this lattice, the problem changes. We don’t need to “drop” merges. We simply need to select a random path from the Start Node (0) to the End Node (6).
However, a naive random walk won’t work. If we just flip a coin at every fork in the road, paths that are shorter or have fewer branches would be over-represented. To ensure Uniform Sampling—where every unique full path has the exact same probability of being chosen—the authors utilize a specific sampling algorithm (adapted from Lavrov, 2018).
This algorithm calculates the number of possible downstream paths from each node. It then weights the decision at each intersection so that the final probability of any complete path is exactly \(1/N\), where \(N\) is the total number of valid tokenizations.
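Here is a sketch of that idea, building on the build_lattice helper above: count the paths from every node to the end, then sample each outgoing edge in proportion to how many complete tokenizations pass through it. It mirrors the path-counting scheme the paper describes, but the code itself is an illustrative reconstruction, not the authors' implementation:

```python
import random

def count_paths(lattice, n):
    """counts[i] = number of paths from node i to the end node n."""
    counts = [0] * (n + 1)
    counts[n] = 1  # the empty path from the end node to itself
    for i in range(n - 1, -1, -1):
        counts[i] = sum(counts[j] for j, _ in lattice[i])
    return counts

def sample_uniform(word, lattice):
    """Return one tokenization, each valid path having probability 1/N."""
    n = len(word)
    counts = count_paths(lattice, n)  # counts[0] == N, the total path count
    i, tokens = 0, []
    while i < n:
        # Weight each edge by the number of completions reachable through it,
        # so every full path ends up with probability exactly 1 / counts[0].
        weights = [counts[j] for j, _ in lattice[i]]
        j, piece = random.choices(lattice[i], weights=weights)[0]
        tokens.append(piece)
        i = j
    return tokens
```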
Comparing the Distributions
The difference between the standard Dropout methods and this new Uniform Sampling is stark.

Figure 3 illustrates the Shannon Efficiency, a metric of how uniform a distribution is (higher is better/more uniform).
- Red Circles (BPE Dropout): Even as you increase the dropout probability \(p\), the efficiency barely climbs. The distribution remains lumpy and biased.
- Black Triangles (Uniform): This method guarantees maximum entropy. It explores the segmentation space perfectly evenly.
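For reference, Shannon efficiency is usually defined as the entropy of the observed tokenization distribution normalized by the maximum entropy achievable over the same \(N\) segmentations (this is the standard definition of normalized entropy, paraphrased rather than quoted from the paper):

\[
\eta(P) \;=\; \frac{H(P)}{\log N} \;=\; \frac{-\sum_{i=1}^{N} P(t_i)\,\log P(t_i)}{\log N},
\]

so a perfectly uniform sampler attains \(\eta = 1\), while a distribution that concentrates almost all of its mass on the canonical split stays far below 1.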
Why This Matters: Coverage
Why do we care about entropy? Because of Data Augmentation. We want our model to see as many different morphological breakdowns as possible to learn the true underlying language structure.

Figure 2 is perhaps the most compelling visualization in the paper. It shows heatmaps of “coverage”—how many unique tokenizations are actually observed during training.
- Top Row (Dropout): Notice the vast dark areas. Even with millions of samples, BPE-Dropout (top-left) fails to produce most valid tokenizations. It keeps spitting out the same few versions.
- Bottom Row (Uniform): The Uniform Sampling method lights up the board. It exposes the model to a significantly more diverse set of inputs using the same vocabulary.
The Training Algorithm
Implementing this is surprisingly straightforward as a “drop-in” replacement for existing tokenizers. The authors propose a mixed strategy. We don’t want pure chaos; we still want the model to learn the canonical tokenization, as that is what it will likely see during inference.
The training strategy works as follows:
- Set a probability \(p\) (e.g., 0.1 or 0.25).
- For every word in the training corpus:
  - With probability \(p\), use the Uniform Sampling algorithm to generate a novel tokenization.
  - Otherwise, use the standard Deterministic tokenization.

This hybrid approach ensures the model has a stable target (the deterministic path) while constantly being regularized by the unbiased uniform samples (the augmented paths).
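Putting the pieces together, a minimal sketch of this training-time tokenizer, assuming the hypothetical bpe_encode, build_lattice, and sample_uniform helpers from the earlier sketches:

```python
import random

def tokenize_for_training(word, vocab, merges, p=0.1):
    """Mixed strategy: uniform lattice sample with probability p, else canonical."""
    if random.random() < p:
        lattice = build_lattice(word, vocab)   # regularization branch
        return sample_uniform(word, lattice)   # unbiased sample over all splits
    return bpe_encode(word, merges)            # deterministic canonical split
```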
Experiments and Results
The theory is sound, but does it actually translate (pun intended) to better performance? The authors tested this on Machine Translation tasks for English \(\leftrightarrow\) German, English \(\leftrightarrow\) Romanian, and English \(\leftrightarrow\) French.
They compared baseline BPE/MaxMatch against their Dropout variants and the new Uniform Sampling variant.
Key Findings
The results were consistent across almost all metrics (BLEU scores, CHRF, and COMET).

As seen in the results table above (referencing the broader results in the paper’s appendix):
- Uniform Wins: In nearly every language pair and metric, Uniform Sampling (p=0.1 or p=0.25) outperformed standard BPE-Dropout and MaxMatch-Dropout.
- Beyond BLEU: The improvements were not just in raw translation quality (BLEU) but also in semantic evaluation (COMET). For example, in English \(\rightarrow\) German, the Uniform model achieved a COMET score of 78.12 compared to the Dropout score of 77.51.
- Consistency: The authors noted that in the English \(\rightarrow\) Romanian task, Uniform Sampling was the best performing model across all metrics and underlying tokenizers.
The hypothesis holds: by removing the bias from the subword regularization, the model receives higher-quality data augmentation, leading to better generalization.
Conclusion
We often obsess over model architecture—adding more layers, attention heads, or experts—while overlooking the data pipeline. This paper highlights that how we tokenize is just as critical as what we tokenize.
Standard subword regularization methods like BPE-Dropout act as a “band-aid,” injecting noise that is statistically biased and limited in scope. By strictly modeling the tokenization space as a lattice and sampling uniformly from it, we can unlock:
- True Regularization: Breaking the dependency on canonical splits.
- Maximal Augmentation: Seeing more unique views of the same data.
- Better Performance: Consistent gains in downstream tasks like translation.
For students and practitioners, the takeaway is clear: randomness is not always uniform. When designing stochastic processes for AI, ensuring that your distribution actually covers the search space can be the difference between a good model and a state-of-the-art one.
All images and data cited in this post are derived from the research paper “Distributional Properties of Subword Regularization” by Cognetta, Zouhar, and Okazaki.