When a child learns a language, they don’t start by speaking in complex, grammatically correct sentences. They start with words. A baby learns to recognize “doggie” or “ball” as distinct, meaningful units long before they understand how to use them in a sentence like “The doggie plays with the ball.” In developmental psychology, word learning precedes syntax.
But do Large Language Models (LLMs) learn the same way?
We often treat LLMs as proxies for understanding human language acquisition, yet a recent paper titled “Subword models struggle with word learning, but surprisal hides it” by Bastian Bunzeck and Sina Zarrieß suggests we might be making a mistake. The researchers found that the most common way LLMs process text—subword tokenization—fundamentally changes how and when they learn words, making their learning process drastically different from that of humans.
In this deep dive, we will explore how standard models struggle to distinguish real words from nonsense, how they hide this incompetence behind context, and why character-level models might actually be better at mimicking the human learning curve.

The Problem: Learning Words vs. Learning Patterns
To understand the paper’s contribution, we first need to look at the architecture of modern AI. Most state-of-the-art models (like GPT-4 or Llama) do not read text letter-by-letter, nor do they read whole words. They use subword tokenization, typically Byte-Pair Encoding (BPE).
BPE breaks text into chunks based on frequency. Common words like “the” might be single tokens, while rare words or names might be split into multiple chunks (e.g., “Moggie” might become Mog + gie). This is computationally efficient, but the authors argue it is cognitively implausible. It splits words into units that don’t necessarily carry linguistic meaning.
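To make the contrast concrete, here is a minimal sketch using the off-the-shelf GPT-2 BPE tokenizer from Hugging Face transformers as a stand-in. The paper trains its own, much smaller tokenizers, so the exact splits will differ; the point is only how subword and character views of the same word diverge.

```python
from transformers import AutoTokenizer

# A pretrained BPE tokenizer stands in for the paper's subword models.
# (The BabyLM tokenizers have only ~8,000 entries, so their splits will differ.)
bpe = AutoTokenizer.from_pretrained("gpt2")

for word in ["the", "sending", "monding"]:
    # Subword view: frequency-based chunks learned by BPE.
    subword_tokens = bpe.tokenize(" " + word)  # leading space marks a word boundary
    # Character view: every letter is its own token.
    char_tokens = list(word)
    print(f"{word!r}: BPE={subword_tokens}, chars={char_tokens}")
```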
The core question of this research is: Does a subword-based model actually learn what a “word” is? Or does it just learn which tokens statistically follow other tokens?
The Human Standard: Lexical Decision
Psycholinguists test human word knowledge using a Lexical Decision Task. A participant is shown a string of letters and must decide: Is this a real word?
- Stimulus: Dog -> Result: Yes.
- Stimulus: Mog -> Result: No (unless you know British slang, but in standard English, it’s a non-word).
Humans are excellent at this. We have a “mental lexicon”—a dictionary in our heads. We don’t need to see the word “dog” in a sentence to know it exists.
The AI Standard: Surprisal
AI researchers usually measure word learning differently, using Surprisal. Surprisal is simply the negative log-probability of a word given a specific context, -log P(word | context): it measures how “surprised” a model is to see that word there. If the model predicts the word with high probability, surprisal is low (meaning it has “learned” the word).
- Context: “The fuzzy animal barked at the ____”
- Target: Dog.
The authors hypothesize that Surprisal is a “cheat code.” It allows the model to guess the word based on syntactic context without actually knowing if the word is a valid linguistic unit on its own.
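For readers who want to see the metric rather than just read about it, here is a minimal sketch of how surprisal is typically computed with a causal language model. The gpt2 checkpoint is only a placeholder for the paper’s BabyLMs, and the exact aggregation the authors use may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model: gpt2 stands in here for the paper's much smaller BabyLMs.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def surprisal(model, context: str, target: str) -> float:
    """Summed -log P(target tokens | context) under a causal LM, in nats."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    tgt_ids = tok(" " + target, return_tensors="pt").input_ids  # leading space = word boundary
    ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    # Add up the log-probability the model assigned to each target token, given its prefix.
    positions = range(ctx_ids.shape[1] - 1, ids.shape[1] - 1)
    return -sum(log_probs[pos, ids[0, pos + 1]].item() for pos in positions)

print(surprisal(lm, "The fuzzy animal barked at the", "dog"))  # expected: lower surprisal
print(surprisal(lm, "The fuzzy animal barked at the", "mog"))  # expected: higher surprisal
```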
The Methodology: Stress-Testing the Models
To test this hypothesis, the researchers set up a comparison between two types of models and three types of tests.
1. The Models
They trained “BabyLMs”, small versions of the Llama architecture, on just 10 million words (a strictly limited dataset meant to mimic the amount of language a human child hears).
- Subword Models (BPE): Standard tokenization (vocabulary size ~8,000).
- Character Models: Tokenization at the letter level (vocabulary size ~100). These models must reconstruct words character by character.
Each tokenizer type was also trained at several model sizes (Small, Medium, Large) to see whether adding more parameters solved the problem.
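For orientation, a setup in this spirit can be sketched with transformers’ LlamaConfig. The width and depth values below are illustrative placeholders, not the paper’s hyperparameters; only the two vocabulary sizes follow the description above.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative "BabyLM"-scale settings; hidden sizes and layer counts are placeholders.
subword_config = LlamaConfig(
    vocab_size=8_000,      # BPE vocabulary (~8k entries)
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
)
char_config = LlamaConfig(
    vocab_size=100,        # roughly one token per character
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
)

subword_model = LlamaForCausalLM(subword_config)
char_model = LlamaForCausalLM(char_config)
print(sum(p.numel() for p in subword_model.parameters()),
      sum(p.numel() for p in char_model.parameters()))
```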

2. The Stimuli: Wuggy
To generate fair test cases, the authors used a tool called “Wuggy.” Wuggy generates “pseudowords” that look and sound like English but aren’t.
- Real word: sending
- Pseudoword: monding
These pairs are matched for length and syllable structure. This ensures the model isn’t just rejecting non-words because they look “weird” (like xkqz).

As shown in the pairplot above, the researchers analyzed how these words were tokenized. Ideally, the complexity of the tokenization shouldn’t bias the result. The character models (orange) naturally have higher token counts per word, while BPE (blue) compresses them.
3. The Experiments
The researchers ran three distinct experiments to probe the models’ understanding (a code sketch of all three scoring setups follows the list):
- Lexical Decision (The “Real” Test): The model is shown the word and the non-word in isolation (preceded only by a whitespace). Which one does it assign a higher probability to?
- Logic: If the model knows the word sending exists in its vocabulary, it should have a higher probability than monding, even without context.
- Surprisal (The “Easy” Test): The words are placed in a valid sentence.
- Context: “I am sending a letter.” vs “I am monding a letter.”
- Anti-Surprisal (The “Confusing” Test): The words are placed in a sentence that doesn’t fit semantically/syntactically.
- Context: “The sky is sending blue.” (Nonsense meaning, but sending is still a real word).
- Logic: Does the model prefer a real word over a fake word even when the context is garbage?
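Under the hood, the three conditions differ only in what prefix precedes the target word. Building on the surprisal() helper from the earlier sketch (still with the placeholder gpt2 model), here is one way the comparisons could look. Note that the BOS-token prefix for the isolated condition is an assumption on my part; the paper itself uses a bare whitespace.

```python
# Reuses the surprisal() helper, tokenizer tok, and placeholder model lm from above.
# Lower surprisal = higher probability; in every condition the desired outcome
# is that the real word beats its matched pseudoword.

pairs = [("sending", "monding")]   # Wuggy-style matched pair

for real, pseudo in pairs:
    # 1. Lexical decision: no sentence context. The paper prefixes only a whitespace;
    #    here the tokenizer's BOS token gives the causal LM something to condition on.
    bos = tok.bos_token
    lexical = surprisal(lm, bos, real) < surprisal(lm, bos, pseudo)

    # 2. Surprisal: a well-formed sentence frame.
    good = "I am"
    in_context = surprisal(lm, good, real) < surprisal(lm, good, pseudo)

    # 3. Anti-surprisal: a frame in which even the real word is semantically odd.
    bad = "The sky is"
    anti = surprisal(lm, bad, real) < surprisal(lm, bad, pseudo)

    print(f"{real} vs {pseudo}: lexical={lexical}, surprisal={in_context}, anti={anti}")
```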
Results: The Subword Struggle
The results revealed a massive gap in competence between the two model architectures.
Subword Models Fail in Isolation
In the Lexical Decision task (identifying words without context), subword models performed poorly.
- Small subword models were barely better than random guessing on high-frequency words.
- Even the largest subword models struggled to reach the high reliability of character models.
- Why? Because subword models don’t really have a concept of a “word.” They have a concept of tokens. If sending is tokenized as send + ing, and monding as mond + ing, the model might recognize the -ing suffix in both. Without context to guide the root, it’s unsure which combination is “legal.”
Character Models Excel
Character models, however, achieved near-perfect accuracy (97-99%) on identifying high-frequency words, regardless of model size. Because they construct every word letter-by-letter, they seem to build a robust internal representation of which character sequences constitute valid English words and which do not.
Context Masks the Incompetence
When the researchers switched to the Surprisal task (putting words in sentences), the subword models suddenly performed well, matching the character models.
This confirms the “hiding” effect mentioned in the title. Subword models rely heavily on syntax (the structure of the sentence) to predict what comes next. If the sentence structure predicts a verb, the model can guess sending over monding because it fits the predictive pattern, not because it knows sending is a lexical item and monding is not.
This is further proven by the Anti-Surprisal task. When placed in bad contexts (“The sky is sending…”), subword models’ performance dropped significantly. They were confused by the bad context and lost their ability to distinguish the real word from the fake one. Character models, however, remained robust—they knew sending was a word, regardless of the silly sentence it was in.
The “When” of Learning: Disentangling Syntax and Words
Perhaps the most fascinating finding of the paper concerns the trajectory of learning. The researchers took checkpoints of the models throughout the training process to see when they learned different skills.
They compared the learning curves of:
- Word Learning (Lexical Decision accuracy).
- Syntax Learning (Performance on BLiMP, a benchmark for grammatical rules like Subject-Verb Agreement).

The charts above illustrate a fundamental difference in “cognitive development” between the architectures:
- Character Models (Top Row): Look at the separation of the lines. The word learning curves (blue/purple) shoot up early and fast. The syntactic curves (green/orange) rise later and more slowly.
- Interpretation: Like human children, character models learn words first, and then figure out the grammar rules later. The processes are separable.
- Subword Models (Bottom Row): The lines are “entangled.” The curves for word learning and syntax learning rise together, almost simultaneously.
- Interpretation: Subword models learn words and syntax as a single, muddled statistical soup. They don’t acquire a vocabulary and then learn how to use it; they learn the usage patterns and the tokens at the same time.
To visualize this statistical relationship, the researchers produced a correlation heatmap.

In this heatmap, red/yellow indicates high correlation. You can see that for BPE (Subword) models, the performance on Lexical Decision is highly correlated with syntactic tasks (like Anaphor Agreement or Control Raising). For character models, the correlations are weaker or negative, indicating that these are distinct skills being learned at different times.
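A heatmap like this can be produced from per-checkpoint scores with a few lines of pandas and seaborn. The numbers below are made-up placeholders purely to show the mechanics; in the paper, the columns correspond to lexical decision accuracy and individual BLiMP categories measured at each checkpoint.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# One row per training checkpoint; the scores here are hypothetical placeholders.
scores = pd.DataFrame({
    "lexical_decision":  [0.52, 0.61, 0.74, 0.85, 0.91],
    "anaphor_agreement": [0.50, 0.55, 0.63, 0.78, 0.88],
    "control_raising":   [0.49, 0.53, 0.60, 0.72, 0.81],
})

# Pearson correlation of the learning curves across checkpoints:
# high values mean two skills improve in lockstep (the "entangled" BPE pattern).
corr = scores.corr(method="pearson")

sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.title("Correlation of task performance across checkpoints")
plt.tight_layout()
plt.show()
```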
Deep Dive: The Evolution of “Wordness”
The researchers also looked at the raw probability differences between real words and non-words over the course of training.

In Figure 7, the Y-axis represents the preference for the real word over the non-word.
- BPE Models (Blue lines): Notice how they start near zero or even negative? At the very beginning of training, subword models actually prefer the non-words! It takes significant training time (steps 10^3 to 10^4) for them to start reliably preferring real words.
- Character Models (Orange lines): They show a preference for real words almost immediately.
This suggests that a priori tokenization (splitting words up before the model even sees them) prevents the model from discovering words naturally. It forces the model to skip the “word discovery” phase and jump straight to predicting token sequences.
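Tracing such a curve amounts to scoring the same word/pseudoword pairs at successive training checkpoints and plotting the surprisal gap. A hedged sketch, reusing the surprisal() helper and tokenizer from the earlier snippets (the checkpoint paths below are hypothetical, and the tokenizer is assumed to match the checkpoints):

```python
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint directories saved during training (paths are placeholders).
checkpoints = ["ckpt-1000", "ckpt-10000", "ckpt-100000"]
pairs = [("sending", "monding")]

gaps = []
for path in checkpoints:
    model = AutoModelForCausalLM.from_pretrained(path).eval()
    # Positive gap = the real word is more probable (lower surprisal) than its pseudoword.
    gap = sum(surprisal(model, tok.bos_token, pseudo) - surprisal(model, tok.bos_token, real)
              for real, pseudo in pairs) / len(pairs)
    gaps.append(gap)

plt.plot(checkpoints, gaps, marker="o")
plt.xlabel("training checkpoint")
plt.ylabel("log-prob preference for the real word")
plt.show()
```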
Conclusion: Why This Matters for AI and Cognitive Science
This paper sheds light on a blind spot in current Natural Language Processing. By optimizing for efficiency and perplexity (predicting the next token), we have adopted subword tokenizers that fundamentally alter the nature of language learning.
- For Cognitive Modeling: If we want to use LLMs to simulate how children learn languages, subword models are likely flawed. They merge two distinct developmental stages (lexical acquisition and syntactic acquisition) into one. Character-level models, despite being computationally heavier, offer a more human-like developmental path.
- For AI Robustness: The reliance of subword models on context makes them brittle. As shown in the Anti-Surprisal experiments, if the context gets weird, the model “forgets” what a word is. Character models possess a more robust, context-independent definition of valid language.
As we continue to build larger and “smarter” models, we must ask ourselves: Do we want models that just predict patterns well, or models that understand the fundamental building blocks of language? This research suggests that to get the latter, we might need to rethink how we chop up the text before the model ever sees it.