In the world of Natural Language Processing (NLP), we often marvel at the sophisticated architectures of Large Language Models (LLMs) like the Transformer. We analyze attention mechanisms, feed-forward networks, and massive parameter counts. Yet, we frequently overlook the humble “front door” of these models: Tokenization.
Subword tokenization methods such as Byte-Pair Encoding (BPE) and the Unigram model (as implemented in SentencePiece) are the industry standard. They are statistical powerhouses, designed to compress text efficiently and limit vocabulary size. However, they have a major flaw: they don’t actually understand the words they are breaking apart. They split words based on frequency, not meaning.
For example, a standard tokenizer might split the word “unhappiness” into un, happi, and ness—which looks great. But it might just as easily split a rare word into nonsense chunks that obscure its root meaning. This creates a disconnect between the statistical representation of a word and its linguistic reality (morphology).
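To see this behaviour concretely, the snippet below inspects how an off-the-shelf BPE tokenizer splits a few words. It assumes the Hugging Face `transformers` library and the `gpt2` checkpoint, neither of which the paper itself uses, and the exact splits depend on the tokenizer's training data.

```python
# Inspect how a frequency-trained BPE tokenizer splits words.
# Assumes the Hugging Face `transformers` package is installed and the
# `gpt2` checkpoint can be downloaded (an illustration, not the paper's setup).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["unhappiness", "rethinking", "antidisestablishmentarianism"]:
    # `tokenize` returns the subword pieces the model would actually see.
    pieces = tokenizer.tokenize(word)
    print(f"{word:>30s} -> {pieces}")
```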
In this post, we are diving deep into a fascinating paper titled “Lexically Grounded Subword Segmentation” by Jindřich Libovický and Jindřich Helcl. These researchers propose a novel framework to force tokenizers to respect the meaning of words—aligning subwords with actual morphemes—without sacrificing the efficiency we need for modern neural networks.
The Problem: When Statistics Ignore Meaning
Current state-of-the-art tokenizers are “morphologically ignorant.” They operate on the principle that the most frequent character sequences should be merged. While this works well for high-resource languages like English, it often fails for morphologically rich languages (like Czech, Turkish, or Finnish) or low-resource scenarios.
When a tokenizer breaks a word into random fragments, the model has to work much harder to learn that those fragments constitute a single concept. The authors of this paper argue that a strong segmentation should retain the efficiency of statistical approaches (splitting rare words more than frequent ones) but ensure that subword boundaries match actual morpheme boundaries.
The Three-Step Framework
To solve this, the researchers reconceptualize tokenization not as a single black-box operation, but as a three-step process.

As shown in Figure 1 above, the pipeline allows for intervention at different stages:
- Pre-tokenization: How we initially split the raw text (usually by whitespace or punctuation).
- Vocabulary Construction: Building the list of available tokens (using BPE or Unigram).
- Segmentation: The actual algorithm that decides how to slice a specific word using the vocabulary.
The authors introduce innovations at every level, which we will break down in detail.
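Seen as code, the framework is just three pluggable stages. The sketch below only illustrates the interface implied by this decomposition; the function names and type aliases are hypothetical.

```python
from typing import Callable, List

# Hypothetical type aliases for the three stages of the pipeline.
PreTokenizer = Callable[[str], List[str]]          # raw text -> word-like units
VocabBuilder = Callable[[List[str]], List[str]]    # corpus units -> subword vocabulary (built offline)
Segmenter = Callable[[str, List[str]], List[str]]  # one unit + vocabulary -> subword pieces


def tokenize(text: str,
             pre_tokenize: PreTokenizer,
             vocabulary: List[str],
             segment: Segmenter) -> List[str]:
    """Run the pipeline: pre-tokenize, then segment each unit against a
    vocabulary that was constructed in a separate, offline step."""
    pieces: List[str] = []
    for unit in pre_tokenize(text):
        pieces.extend(segment(unit, vocabulary))
    return pieces
```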
Step 1: Smarter Pre-tokenization
Standard pipelines usually pre-tokenize text into “word-like” units—essentially splitting by spaces and punctuation. This assumes that the “space” character is the ultimate boundary of meaning.
The authors propose using Morfessor, an unsupervised morphological segmentation tool, for this step. Instead of feeding the tokenizer raw words, they feed it text that has already been linguistically analyzed and split into morphs (the surface form of morphemes).
By using Morfessor as a pre-processing step, the subsequent vocabulary construction (BPE or Unigram) starts with linguistically meaningful units rather than arbitrary character strings. This ensures that even before the neural network sees a single token, the data has been grounded in morphology.
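As a rough illustration of this pre-tokenization step, the sketch below trains an unsupervised Morfessor Baseline model and uses it to split words into morphs before any vocabulary construction. It assumes the `morfessor` 2.0 Python package and a corpus file named `corpus.txt`; the paper's exact training setup may differ.

```python
# Minimal sketch: Morfessor as a pre-tokenizer (assumes the `morfessor` 2.0
# package and a whitespace-tokenized corpus file `corpus.txt`).
import morfessor

io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file("corpus.txt"))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()


def pre_tokenize(words):
    """Replace each word by its unsupervised morphs, so the subsequent
    BPE/Unigram vocabulary is built over morphs rather than raw words."""
    morphs = []
    for word in words:
        segments, _cost = model.viterbi_segment(word)
        morphs.extend(segments)
    return morphs


print(pre_tokenize(["unhappiness", "tokenization"]))
```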
Step 2: Lexically Grounded Segmentation (The Core Method)
This is the heart of the paper. Once we have a vocabulary, how do we decide how to split a word? Standard methods maximize the probability of the sequence. The authors instead propose maximizing semantic similarity.
The hypothesis is simple: A word and its subwords should share the same meaning.
To achieve this, the authors developed a mathematical method to compute subword embeddings that live in the same vector space as word embeddings.
Deriving Subword Embeddings
The researchers use the classic Skip-gram model (Word2Vec) as a foundation. In Skip-gram, word embeddings are trained to predict context words. The authors observed that the training objective of a Skip-gram model essentially tries to approximate the normalized co-occurrence matrix of words.
Mathematically, if \(E\) is the input embedding matrix and \(W\) is the output matrix, the model tries to satisfy:

$$\operatorname{softmax}(E\,W) \approx \operatorname{norm}(C)$$

Here, \(C\) is the word co-occurrence matrix and \(\operatorname{norm}\) denotes row-wise normalization, so each row of the right-hand side is the context distribution of a single word.
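Before moving to subwords, it may help to see this quantity concretely. The toy snippet below builds a word-word co-occurrence matrix from a tiny corpus and row-normalizes it; the corpus, window size, and variable names are arbitrary choices of mine for illustration.

```python
# Toy illustration: the row-normalized co-occurrence matrix that the
# skip-gram objective implicitly approximates.
import numpy as np

corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
window = 2

C = np.zeros((len(vocab), len(vocab)))
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            C[index[word], index[corpus[j]]] += 1

# Row-normalize so each row is the context distribution of one word:
# this is the target that softmax(E W) is trained to match.
C_norm = C / C.sum(axis=1, keepdims=True)
print(np.round(C_norm, 2))
```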
The researchers extend this logic to subwords. They keep the output matrix \(W\) fixed (from a pre-trained word model) and try to find a new matrix \(E_s\) (subword embeddings). They define a “Segmentation Matrix” \(A\), which records which subwords belong to which words:

$$A_{x,s} = \begin{cases} 1 & \text{if subword } s \text{ is part of word } x \\ 0 & \text{otherwise} \end{cases}$$
If a subword \(s\) is part of word \(x\), the value is 1. Using this, they look for a subword embedding matrix \(E_s\) that satisfies the relationship between subwords and the original word contexts:

$$\operatorname{softmax}\big((A\,E_s)\,W\big) \approx \operatorname{norm}(C)$$
By solving this using least-squares approximation, they derive a closed-form solution to generate embeddings for any subword:

$$E_s = A^{+} E$$

where \(A^{+}\) is the Moore-Penrose pseudoinverse of the segmentation matrix, i.e. \(E_s\) is the least-squares solution of \(A\,E_s \approx E\).
In plain English: This formula allows the researchers to generate a vector for any subword (like “ing” or “un”) that mathematically represents its contribution to the meaning of the full words it appears in.
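A compact way to picture the least-squares step is the NumPy sketch below: given pre-trained word embeddings and a binary segmentation matrix, it solves for subword embeddings in the least-squares sense. The toy words, dimensions, and the use of `numpy.linalg.pinv` are my own illustration of the closed-form idea, not the authors' released code.

```python
# Minimal sketch of the least-squares idea: find subword embeddings E_s such
# that summing the subword vectors of each word (A @ E_s) best approximates
# the pre-trained word embeddings E. Toy data only.
import numpy as np

words = ["unhappiness", "happiness", "unkind"]
subwords = ["un", "happi", "ness", "kind"]

# Binary segmentation matrix A: A[x, s] = 1 if subword s occurs in word x.
A = np.array([
    [1, 1, 1, 0],   # unhappiness = un + happi + ness
    [0, 1, 1, 0],   # happiness   = happi + ness
    [1, 0, 0, 1],   # unkind      = un + kind
], dtype=float)

rng = np.random.default_rng(0)
E = rng.normal(size=(len(words), 8))   # stand-in for pre-trained word embeddings

# Least-squares solution via the Moore-Penrose pseudoinverse:
# E_s = argmin_X ||A X - E||_F^2  =>  E_s = pinv(A) @ E
E_s = np.linalg.pinv(A) @ E
print(E_s.shape)   # (num_subwords, embedding_dim)
```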
The Similarity-Based Segmentation Algorithm
With these meaningful subword embeddings in hand, the segmentation algorithm changes. Instead of asking “What is the most frequent split?”, the algorithm asks “Which split preserves the meaning of the word best?”
The algorithm looks for a sequence of subwords \(s_1, s_2, \dots, s_n\) that maximizes the Cosine Similarity between the full word’s embedding \(E(x)\) and the subword embeddings \(E_s(s_i)\).
The scoring formula is:

$$\operatorname{score}(s_1, \dots, s_n \mid x) = \sum_{i=1}^{n} \cos\big(E(x),\, E_s(s_i)\big) \;-\; \alpha \cdot n$$
The \(\alpha\) parameter acts as a length penalty. Without it, the model might prefer splitting a word into tiny pieces if those pieces happen to be very similar to the root. The penalty forces the model to be efficient—using as few subwords as possible while maintaining high semantic similarity.
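To make the search concrete, here is a brute-force sketch of such a scorer: it enumerates every split of a word into in-vocabulary subwords and keeps the one with the highest total cosine similarity to the word vector, minus \(\alpha\) times the number of pieces. The exact scoring form, the vocabulary, and the toy vectors are my interpretation of the description above, not the authors' implementation.

```python
# Sketch of similarity-based segmentation: among all ways to split a word into
# in-vocabulary subwords, pick the split whose pieces are most similar to the
# whole word's embedding, with a penalty on the number of pieces.
import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def all_segmentations(word, vocab):
    """Recursively enumerate every split of `word` into vocabulary items."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in all_segmentations(word[end:], vocab):
                yield [piece] + rest


def best_segmentation(word, word_vec, subword_vecs, alpha=0.1):
    best, best_score = None, -np.inf
    for seg in all_segmentations(word, subword_vecs):
        # Total similarity of the pieces to the whole word, length-penalized.
        score = sum(cosine(word_vec, subword_vecs[s]) for s in seg) - alpha * len(seg)
        if score > best_score:
            best, best_score = seg, score
    return best


# Toy vectors: in practice these come from the derivation in the previous section.
rng = np.random.default_rng(0)
subword_vecs = {s: rng.normal(size=8) for s in ["un", "happi", "ness", "unhappi", "s"]}
word_vec = subword_vecs["un"] + subword_vecs["happi"] + subword_vecs["ness"]
print(best_segmentation("unhappiness", word_vec, subword_vecs))
```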
Step 3: Distillation into a Bigram Model
The method described above is linguistically beautiful, but computationally heavy. It requires pre-trained word embeddings, Morfessor processing, and complex matrix operations during inference. This is not ideal for high-speed production environments.
To bridge the gap between “Linguistically Pure” and “Computationally Fast,” the authors propose a distillation step (Step 3 in Figure 1).
- They run their heavy, embedding-based segmentation on a large corpus.
- They calculate the statistics of how subwords follow one another (Bigrams) in this high-quality segmented data.
- They build a Subword Bigram Model.
At inference time, this lightweight model simply predicts the most likely segmentation based on the bigram statistics learned from the “smart” model. This retains much of the morphological quality without the heavy computational cost.
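Under my reading of this distillation step, a minimal version could look like the sketch below: count subword bigrams in text segmented by the heavy method, then score candidate splits with a smoothed bigram model at inference time. The smoothing, special boundary tokens, and scoring function are simplifications of mine.

```python
# Sketch of distilling the heavy segmenter into a subword bigram model:
# 1) count subword bigrams in text segmented by the embedding-based method,
# 2) at inference time, score candidate segmentations with those counts.
import math
from collections import Counter
from itertools import pairwise  # Python 3.10+

BOS, EOS = "<w>", "</w>"


def train_bigram_model(segmented_corpus, alpha=1.0, vocab_size=10_000):
    """segmented_corpus: iterable of subword lists produced by the slow segmenter."""
    bigrams, unigrams = Counter(), Counter()
    for pieces in segmented_corpus:
        seq = [BOS] + pieces + [EOS]
        unigrams.update(seq[:-1])
        bigrams.update(pairwise(seq))

    def logp(prev, cur):
        # Add-alpha smoothed conditional log-probability P(cur | prev).
        return math.log((bigrams[(prev, cur)] + alpha) /
                        (unigrams[prev] + alpha * vocab_size))

    return logp


def score(pieces, logp):
    seq = [BOS] + pieces + [EOS]
    return sum(logp(a, b) for a, b in pairwise(seq))


# Usage: compare two candidate splits of the same word under the distilled model.
logp = train_bigram_model([["un", "happi", "ness"], ["happi", "ness"], ["un", "kind"]])
print(score(["un", "happi", "ness"], logp), score(["unh", "appiness"], logp))
```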
Experiments and Results
The researchers evaluated their new tokenizer, dubbed LEGROS (Lexically Grounded Subword Segmentation), against standard BPE and Unigram models. They tested it on 9 languages, including Czech, English, French, and Russian.
1. Intrinsic Evaluation: Morpheme Boundaries
The most direct test was checking if the tokenizer actually splits words at morpheme boundaries. They used the SIGMORPHON 2022 shared task dataset for this.

Figure 2 shows the results for Czech. The solid lines represent standard methods, while the dashed lines represent the new embedding-based method.
- Precision (Left Graph): The embedding-based methods (and particularly Morfessor pre-tokenization) consistently achieve higher precision. This means when the model makes a cut, it is highly likely to be a real linguistic boundary.
- Recall (Middle Graph): As vocabulary size grows (x-axis), recall drops for everyone (because larger vocabularies mean fewer splits overall). However, the LEGROS method remains competitive.
The takeaway is clear: Lexically grounded segmentation significantly improves the morphological plausibility of the tokens.
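For readers who want to run this kind of intrinsic check themselves, here is a small sketch that computes boundary precision and recall between a predicted split and a gold morphological split of the same word; the official shared-task evaluation may differ in its details.

```python
# Sketch: boundary precision/recall between a predicted subword split and a
# gold morpheme split of the same word.
def boundaries(pieces):
    """Return the set of character offsets where a cut occurs inside the word."""
    cuts, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        cuts.add(pos)
    return cuts


def boundary_pr(predicted, gold):
    pred, ref = boundaries(predicted), boundaries(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(ref) if ref else 1.0
    return precision, recall


# Example: a predicted split vs. the gold morphemes of "unhappiness".
print(boundary_pr(["unh", "appi", "ness"], ["un", "happi", "ness"]))
```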
2. Extrinsic Evaluation: POS Tagging
To see if this “linguistic correctness” helps neural networks, they tested the tokenizer on Part-of-Speech (POS) tagging. This task relies heavily on understanding morphology (e.g., recognizing that “-ed” implies past tense).

Table 2 shows the results across various languages. The “Aggr.” column (normalized accuracy) tells the story. The proposed methods (labeled “Ours”) consistently outperform the “Orig.” (original BPE/Unigram) baselines.
- Morfessor + Bigram (Ours) achieved the highest aggregated score (0.745 and 0.712).
- This confirms that providing the model with morphologically accurate tokens helps it understand grammatical structure better.
3. Extrinsic Evaluation: Machine Translation
Finally, they applied the tokenizer to a Machine Translation (MT) task using the IWSLT 2017 dataset. This is a much more complex task than POS tagging.

The results here (Table 3) are more nuanced. The values represent deviations from the average chrF score (a character-based metric).
- There is no massive jump in performance. The proposed methods (“Ours”) perform comparably to standard BPE and Unigram.
- In some cases, Morfessor pre-tokenization actually helps significantly (see the green cells in the bottom rows), but in others, the impact is negligible.
The authors honestly conclude that while their method improves morphological correctness, this does not automatically translate to better translation scores. However, the improved segmentation offers better interpretability and potentially better performance in low-resource scenarios where every token counts.
Conclusion
The paper “Lexically Grounded Subword Segmentation” offers a refreshing perspective on a component of NLP that is often taken for granted. By injecting linguistic knowledge (via Morfessor) and semantic grounding (via Subword Embeddings) into the tokenization process, the authors successfully created a system that respects the structure of language.
While it may not yet overthrow BPE for general-purpose Machine Translation, its success in POS tagging and morphological boundary detection proves that meaning matters.
For students and researchers, this work highlights several promising future directions:
- Interpretability: Models trained on meaningful subwords are easier for humans to analyze.
- Efficiency: The “Distillation” technique demonstrates that we can use complex, slow methods to train fast, lightweight inference models.
- Low-Resource Languages: For languages where data is scarce, relying on morphology rather than just frequency statistics is likely the path forward.
This research reminds us that statistical correlation is not the same as understanding—and that bringing a little bit of linguistics back into deep learning can yield structurally superior models.