Large Language Models (LLMs) like GPT-4 are often treated as magic boxes that ingest “text” and output “text.” But strictly speaking, they don’t read words the way humans do. Before a model ever sees your input, a tokenizer chops your sentences into smaller chunks called subwords.
Usually, this segmentation is driven by frequency statistics: common words are kept whole, while rare words are shattered into fragments. But does this statistical chopping preserve the structure that gives a word its meaning? If a model breaks the word “unhappiness” into un, hap, pi, and ness, does it actually understand the linguistic rules that built that word?
In the paper “Subword Segmentation in LLMs: Looking at Inflection and Consistency,” researchers Marion Di Marco and Alexander Fraser investigate this crucial intersection of linguistics and engineering. They examine whether the way a model slices words affects its ability to understand morphology—the internal structure of words. Their findings offer a fascinating glimpse into how LLMs handle grammar and why “consistency” might matter more than linguistic perfection.
The Problem: When Statistics Ignore Grammar
Morphology is the study of how words are formed. In English, we know that a baker works in a bakery, and a botanist studies botany. We understand these connections because we recognize the root words and the suffixes.
LLMs use tokenization algorithms (like BPE or WordPiece) that care about compression, not linguistics. They want to represent the text using the fewest tokens possible. This often leads to messy splits. Ideally, a model would split the German verb einpflanzen (to plant in) into its logical parts: ein (particle), pflanz (root), en (suffix). However, a frequency-based tokenizer might mangle it into e, in, p, fl, a, n, z, e, n.
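You can probe this behavior directly. The sketch below uses the tiktoken library, which ships the o200k_base vocabulary used by GPT-4o; the exact pieces you get for any given word may differ from the illustrative splits quoted above.

```python
# Inspect how GPT-4o's BPE tokenizer (o200k_base) splits individual words.
# A minimal sketch using tiktoken; actual splits depend on the vocabulary.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

for word in ["einpflanzen", "unhappiness", "walking"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{word!r} -> {pieces}")
```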
The researchers pose a critical question: Does this messy segmentation hurt the model’s performance on linguistic tasks?
To answer this, they evaluated GPT-4o across 10 different languages, ranging from English and French to Finnish and Hungarian. They proposed two specific criteria to measure the quality of segmentation and tested how words that met (or failed) these criteria performed in downstream tasks.
Criterion 1: Adherence to Morpheme Boundaries
The first hypothesis is intuitive: a “good” segmentation should respect the linguistic boundaries of the word. It should cleanly separate the stem (which carries the lexical meaning) from the inflection (the suffix that indicates tense, number, or person).
The researchers categorized how GPT-4o segments verbs into five groups based on how well they matched “Gold Standard” morphological data (from MorphyNet):
- EXACT: The word is split exactly into stem and suffix.
- SINGLE: The suffix is clean, but the stem is split further.
- CONCAT: The suffix is split into pieces, but the boundary between stem and suffix is respected.
- OVERLAP: The messy category. The split happens inside the stem or suffix, blurring the boundary.
- UNSPLIT: The word is a single token.
The following table illustrates these categories using French verbs as examples. Note how the OVERLAP category (e.g., comm + anda + ient for commandaient) destroys the boundary between the root command and the ending -aient.
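To make the categories concrete, here is a minimal sketch of how a tokenizer split could be assigned to one of the five groups, given a gold stem + suffix analysis. The category names come from the paper; the matching logic and function name are a simplified reconstruction, not the authors' code.

```python
def categorize(pieces: list[str], stem: str, suffix: str) -> str:
    """Assign a subword split to one of the five segmentation categories.

    `pieces` is the tokenizer's split of the word; `stem` and `suffix`
    come from a gold morphological analysis (word == stem + suffix).
    Simplified reconstruction for illustration only.
    """
    word = stem + suffix
    assert "".join(pieces) == word, "pieces must reassemble the word"

    if len(pieces) == 1:
        return "UNSPLIT"
    if pieces == [stem, suffix]:
        return "EXACT"

    # Character positions where token boundaries fall inside the word.
    boundaries, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        boundaries.add(pos)

    if len(stem) not in boundaries:
        return "OVERLAP"   # a token straddles the stem/suffix boundary
    if pieces[-1] == suffix:
        return "SINGLE"    # suffix is one clean token, stem is split further
    return "CONCAT"        # boundary respected, but suffix is split into pieces


# French commandaient = command + aient (imperfect, 3rd person plural)
print(categorize(["comm", "anda", "ient"], "command", "aient"))  # OVERLAP
print(categorize(["command", "aient"], "command", "aient"))      # EXACT
```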

How Bad is the “Overlap”?
When the authors analyzed GPT-4o’s vocabulary across 10 languages, they found something stark: The messy “OVERLAP” category is dominant.
As shown in the chart below, for almost every language (except Portuguese and Italian), the vast majority of verbs fall into the dark blue “OVERLAP” category. Even for English, a language with simple morphology, the model rarely finds the clean linguistic boundary.

This prevalence of messy segmentation makes the research question even more urgent: If the model is almost always ignoring linguistic boundaries, is it failing to learn the grammar?
Criterion 2: Segmentation Consistency
The second criterion moves away from strict linguistics and focuses on consistency. Even if the segmentation isn’t linguistically perfect, does the model at least segment the same word in the same way across its different forms?
Consider the German verb dramatisieren (to dramatize). If the model segments the present tense as dram + atis + ieren, but the past tense as dra + ma + tisi + erte, the model has to learn two completely different representations for the same underlying concept. Ideally, the “stem” portion of the tokens should remain constant throughout the inflection paradigm.
To measure this, the authors used the Overlap Coefficient. This metric calculates the similarity between the token sets of two word forms.
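For the token set A of the lemma and the token set B of the inflected form, the overlap coefficient is defined as:

$$\text{overlap}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$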

Unlike other metrics (like Jaccard), this formula allows for a perfect score (1.0) if the tokens of the lemma (the dictionary form) are a subset of the inflected form’s tokens. This is desirable because inflected forms naturally add suffixes; we just want the stem tokens to stay the same.
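A tiny sketch makes the difference from Jaccard visible. The token splits below are illustrative, not necessarily what GPT-4o’s tokenizer actually produces.

```python
def overlap_coefficient(a: set[str], b: set[str]) -> float:
    return len(a & b) / min(len(a), len(b))

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

# Subset case: the lemma's tokens all reappear in the inflected form.
lemma = {"walk"}
form  = {"walk", "ing"}
print(overlap_coefficient(lemma, form))  # 1.0 -- subset earns a perfect score
print(jaccard(lemma, form))              # 0.5 -- Jaccard penalizes the added suffix

# Inconsistent case: the stem is re-chopped, so no tokens are shared.
lemma_de = {"dram", "atis", "ieren"}     # dramatisieren
form_de  = {"dra", "ma", "tisi", "erte"} # dramatisierte
print(overlap_coefficient(lemma_de, form_de))  # 0.0
```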
The table below shows examples of high and low consistency. Notice the middle column (Italian): vincere (to win) is chopped up completely differently depending on the conjugation, a clear case of inconsistent segmentation.

Across the 10 languages studied, the distribution of these overlap scores varied, but many languages showed a “hump” around 0.5 to 0.7, indicating that perfect consistency is rare in current LLMs.

The Experiments: Does Segmentation Actually Matter?
To test if these criteria affect the model’s “brain,” the researchers set up two linguistic tasks for GPT-4o:
- Lemma Prediction: Given an inflected verb (e.g., složeny), predict the dictionary form (složit).
- Inflection Generation: Given a lemma and a set of grammatical tags (e.g., walk + Past Tense), generate the form (walked).
They compared the performance of words with “Good” segmentation (Clean boundaries / High consistency) against those with “Bad” segmentation (Overlap / Low consistency). They also separated words by frequency (Common vs. Rare) to see if the model relies on memorization for common words.
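As a rough illustration of the setup, a zero-shot lemma-prediction query to GPT-4o through the OpenAI Python SDK could look like the sketch below. The prompt wording and the helper function are hypothetical; the paper’s exact prompts may differ.

```python
# Hypothetical zero-shot lemma-prediction query; not the authors' exact prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def predict_lemma(inflected_form: str, language: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Give the dictionary form (lemma) of the {language} "
                f"verb form '{inflected_form}'. Answer with the lemma only."
            ),
        }],
    )
    return response.choices[0].message.content.strip()

print(predict_lemma("složeny", "Czech"))  # expected answer: složit
```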
Result 1: Linguistic Boundaries Don’t Matter Much
Surprisingly, Criterion 1 (Adherence to Morpheme Boundaries) had very little impact.
When comparing the “OVERLAP” group (messy splits) to the “NO OVERLAP” group (clean splits), the performance on lemma prediction was nearly identical for most languages.
This suggests that splitting exactly where a human linguist would is not necessary for an LLM to understand a word. The model seems capable of learning the meaning of messy tokens, provided it has seen them enough times.
Result 2: Consistency is Key
Criterion 2 (Consistency) told a different story. The researchers found a clear performance gap between words that are segmented consistently and those that are not.
In the table below for the Lemma Prediction task, look at the rows marked with asterisks (*). These indicate a statistically significant drop in performance for the lowOverlap group. This effect is particularly brutal for low-frequency words (freq ≤ 10). For example, in Hungarian (HU), accuracy for rare, inconsistent words dropped significantly compared to their consistent counterparts.

This trend continued in the Generation Task (creating the inflected form). Generating a word is harder than recognizing one, and here the penalty for inconsistent segmentation was even more severe.
In the zero-shot setting (where the model is given no examples), the lowOverlap (inconsistent) words performed significantly worse across almost all languages. Even when provided with a “one-shot” example, the consistent words generally outperformed the inconsistent ones.

Key Takeaway: If the model chops up “walk”, “walking”, and “walked” into completely different-looking token soup, it struggles to recognize that they are forms of the same word, especially if that word is rare.
Does Position Matter?
The researchers dug deeper. If consistency is important, where is it most important? At the start of the word (the root) or the end (the suffix)?
They created a “Positional” experiment: they took lemma/form pairs with low token overlap and split them into two groups (a minimal sketch of the grouping criterion follows the list):
- Same 1st: The very first token is the same.
- Diff 1st: The first token is different.
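The grouping criterion itself is simple; the token splits below are illustrative rather than actual tokenizer output.

```python
def positional_group(lemma_tokens: list[str], form_tokens: list[str]) -> str:
    """Assign a low-overlap lemma/form pair to one of the two positional groups
    (simplified reconstruction of the criterion, not the authors' code)."""
    return "Same 1st" if lemma_tokens[0] == form_tokens[0] else "Diff 1st"

# Illustrative splits for dramatisieren / dramatisierte:
lemma = ["dram", "atis", "ieren"]
print(positional_group(lemma, ["dram", "ati", "sierte"]))     # Same 1st: anchor token kept
print(positional_group(lemma, ["dra", "ma", "tisi", "erte"])) # Diff 1st: anchor token lost
```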
The results were striking.

Look at the columns for languages like Czech (CS) and Hungarian (HU). The Diff 1st group (where the first token changes) sees a massive drop in performance compared to Same 1st.
In fact, the Same 1st group often performed just as well as the “High Similarity” group. This suggests that as long as the beginning of the word remains consistent—anchoring the lexical meaning—the model can tolerate a lot of messiness in the rest of the word. If the first token changes, the model loses the semantic thread.
Conclusion: The Case for Consistent Tokenization
This research highlights a hidden inefficiency in how we build Large Language Models. We rely on data-driven tokenizers like BPE because they are easy to train and compress text well. However, this convenience comes at a cost.
- Linguistic purity isn’t required: We don’t need tokenizers to act like professional linguists. The model doesn’t care if the split happens exactly at the suffix boundary.
- Consistency is non-negotiable: The model does care if the same word looks different in different contexts. When the segmentation of a word’s stem drifts across its inflected forms, the model struggles to generalize, particularly for languages with rich morphology (like Finnish or Hungarian) and for rare words.
The implications are significant for the future of multilingual LLMs. As we try to include less-resourced languages, we can’t rely on massive frequency data to brute-force the model into memorizing every irregular token split.
The authors suggest that future segmentation strategies shouldn’t just look at compression or frequency. They should optimize for consistency within paradigms. If we can ensure that “write,” “writes,” and “writing” all share a consistent subword anchor, we can make models that learn faster, generalize better, and understand the structure of language more deeply.