Introduction
We often assume that as Large Language Models (LLMs) grow larger and train on more multilingual data, their grasp of grammar naturally improves across all languages. We look at benchmarks like MMLU and see impressive reasoning scores, leading us to believe the basics are solved. But are they?
It turns out that even state-of-the-art models like GPT-4o and Llama 3 struggle with specific, nuanced grammatical rules in languages other than English. This isn’t just about hallucinating facts; it is about failing to adhere to the fundamental structural rules of a language.
In a fascinating paper titled “Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar,” researchers from the University of Tokyo uncover a surprising phenomenon. They found that massive, sophisticated models often prefer ungrammatical Japanese sentences over correct ones. Even more surprisingly, a smaller, seemingly “worse” model actually gets the grammar right—but for the wrong reason.
The culprit? Tokenization.
In this post, we will break down the “First Person Psych Predicate Restriction” in Japanese, explore why top-tier LLMs fail to model it, and discover how inconsistent tokenization acts as a hidden bias that confuses even the smartest AI.

As shown above in Figure 1, when asked to translate “My mother is cold,” a sophisticated model like GPT-4o produces a sentence that, while understandable, sounds deeply unnatural or even ungrammatical to a native Japanese speaker. Let’s explore why.
Background: The “Mind-Reading” Rule in Japanese
To understand why the models are failing, we first need a crash course in a specific quirk of Japanese linguistics: the First Person Psych Predicate Restriction.
The Rule
In English, adjectives that describe internal states (feelings, sensations) work the same way regardless of who you are talking about.
- “I am cold.” (Grammatical)
- “My mother is cold.” (Grammatical)
In Japanese, however, there is a strict epistemological rule: You cannot directly assert the internal state of another person. You can feel your own coldness, but you cannot know your mother feels cold; you can only observe that she appears to be cold.
The Examples
Let’s look at the “minimal pairs” used in this research. The word for “feeling cold” is samui.
1. First Person (The Speaker):
Watashi wa samui. (“I feel cold.”) Status: Grammatical. It is acceptable to state your own feelings.
2. Third Person (The Direct Error):
Haha wa samui. (“My mother feels cold.”) Status: Ungrammatical. This implies you have direct access to your mother’s nervous system. It sounds like you are mind-reading.
3. Third Person (The Evidential Correction):
Haha wa samu-soo da. (“My mother appears to feel cold.”) Status: Grammatical. By adding the evidential suffix -soo, you are describing an observation, which is permitted.
For an LLM to “know” Japanese, it should realize that sentence #2 is highly unlikely (high perplexity) compared to sentence #3.
The Experiment: Testing Perplexity
The researchers set out to test whether open-source models (7B to 10B parameters) respect this rule. They tested several popular models, including:
- Multilingual Models: Llama 2, Llama 3, Mistral.
- Japanese-Tuned Models: Swallow, Weblab-10B.
The Metric: Perplexity
They used perplexity as the yardstick. In simple terms, perplexity measures how “surprised” a model is by a sequence of text.
- Low Perplexity: The model thinks this sentence is natural and likely.
- High Perplexity: The model thinks this sentence is weird, wrong, or unexpected.
The Hypothesis: An ideal model should assign lower perplexity to the grammatical evidential sentence (samu-soo) and higher perplexity to the ungrammatical direct sentence (samui) when talking about a third person.
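Here is a minimal sketch of how such a perplexity comparison can be run with Hugging Face `transformers`. The checkpoint name is only an example, and the sentences mirror the minimal pair above; the paper's exact evaluation pipeline may differ in details.

```python
# Minimal sketch: compare the perplexity a causal LM assigns to the
# ungrammatical vs. grammatical sentence. Checkpoint name is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(sentence: str) -> float:
    """Perplexity = exp of the mean negative log-likelihood over tokens."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean
        # cross-entropy per predicted token.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

sentences = {
    "ungrammatical (direct 3rd person)": "母は寒い。",      # "Haha wa samui."
    "grammatical (evidential)":          "母は寒そうだ。",  # "Haha wa samu-soo da."
}
for label, text in sentences.items():
    print(f"{label}: perplexity = {perplexity(text):.1f}")
```

If the hypothesis holds, the evidential sentence should come out with the lower perplexity.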
Results: The Strong Models Fail
The results, visualized below in Table 1, were shocking.

Take a look at the column for Llama 3.
- Ungrammatical (Row #): Perplexity is 6.9e+03.
- Grammatical Evidential (Row c): Perplexity is 3.7e+04.
Llama 3 is more perplexed by the correct Japanese grammar than the wrong one! It prefers the ungrammatical “mind-reading” sentence. This trend holds true for Mistral and Llama 2 as well.
However, look at Weblab. It is the only model highlighted in green. It assigns lower perplexity to the grammatical sentence (Row c) than the ungrammatical one (Row #).
Is Weblab a genius model with superior linguistic understanding? Not exactly. The reason for its success is much stranger.
Core Method: The Tokenization Trap
The researchers discovered that the root cause isn’t the model’s “brain” (the weights and attention mechanisms); it’s the model’s “eyes” (the tokenizer).
Understanding Tokenization
Before an LLM processes text, it chops it up into tokens. A token can be a whole word, part of a word, or even a single byte.
- Efficient Tokenization: Common words are single tokens. (e.g., “apple” = 1 token).
- Inefficient Tokenization: Rare or complex words are split into many small pieces. (e.g., “unprecedented” might be “un-pre-ced-ent-ed”).
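You can see this split behavior directly by inspecting a tokenizer. The sketch below uses the GPT-2 tokenizer purely as an example; the exact splits depend on each model's vocabulary.

```python
# Sketch: inspect how a tokenizer splits common vs. rare words.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any pretrained tokenizer works

for word in ["apple", "unprecedented"]:
    tokens = tokenizer.tokenize(word)
    print(f"{word!r} -> {len(tokens)} token(s): {tokens}")

# A common word typically maps to a single token, while a rarer or
# morphologically complex word is split into several sub-word pieces.
```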
The Byte Fallback Problem
Modern tokenizers use a mechanism called byte fallback. If a character (like a specific Japanese Kanji or Hiragana) isn’t in the model’s vocabulary, the tokenizer panics and breaks it down into raw bytes. A single Japanese character can thus become three or more tokens, one per UTF-8 byte.
Here is the critical finding: Models like Llama 3 are optimized for multilingual performance, but their Japanese vocabulary is inconsistent.
- Common Adjectives: Words like samui (cold) might be tokenized efficiently.
- Grammatical Suffixes: The specific conjugation required to make the sentence grammatical (-soo or suffixes for adjectives ending in -shii) often triggers byte fallback.
Because the grammatical version (samusoo) relies on these suffixes, the tokenizer breaks it into a long string of fragmented byte tokens.
- More tokens = Lower joint probability.
- Lower probability = Higher perplexity.
The model assigns a bad score to the correct sentence not because the grammar is wrong, but because the tokenization is messy.
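A quick way to see this for yourself is to tokenize both sentence forms and count the pieces. The checkpoint names below are illustrative (some require access approval), and the sentences again follow the minimal pair from earlier.

```python
# Sketch: compare how different tokenizers segment the ungrammatical vs.
# grammatical sentence. More pieces means more log-probability terms to pay for.
from transformers import AutoTokenizer

sentences = {
    "ungrammatical": "母は寒い。",     # direct psych predicate about a 3rd person
    "grammatical":   "母は寒そうだ。", # evidential -soo form
}

for name in ["meta-llama/Meta-Llama-3-8B", "tokyotech-llm/Swallow-7b-hf"]:
    tok = AutoTokenizer.from_pretrained(name)
    for label, text in sentences.items():
        ids = tok(text, add_special_tokens=False)["input_ids"]
        pieces = tok.convert_ids_to_tokens(ids)
        print(f"{name:40s} {label:13s} {len(ids):2d} tokens: {pieces}")

# If the evidential suffix triggers byte fallback, the grammatical sentence is
# carved into many more tokens, dragging down its joint probability.
```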
Why Weblab “Succeeds”
Now, why did Weblab pass the test? Weblab was trained using an unmodified English tokenizer. It has almost no Japanese vocabulary.
- It tokenizes the ungrammatical sentence poorly (lots of byte fallback).
- It tokenizes the grammatical sentence poorly (lots of byte fallback).
Because Weblab handles everything uniformly badly, the “tokenization penalty” is equal across the board. The only remaining signal is the actual patterns the model learned during training. On a level playing field, the model correctly identifies that the evidential form is better.
Llama 3 fails because it has an inconsistent tokenizer: it handles the ungrammatical root word efficiently but chokes on the grammatical suffix.

Table 2 above confirms this. “Fertility” refers to how many tokens are needed per word.
- Llama 3 has a very low byte fallback rate (0.08), meaning it is usually efficient. But when it hits that specific edge case (the grammatical suffix), the sudden spike in tokens destroys the perplexity score.
- Weblab has a massive byte fallback rate (0.66). It is struggling constantly, but consistently.
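Both statistics are easy to approximate yourself. In the sketch below, fertility is tokens per word over a word list, and the byte-fallback check is a heuristic based on SentencePiece-style `<0xNN>` byte-token names; both the heuristic and the tiny word list are my assumptions, not the paper's measurement code.

```python
# Sketch: approximate fertility and byte-fallback rate for a tokenizer.
from transformers import AutoTokenizer

def tokenizer_stats(model_name: str, words: list[str]) -> tuple[float, float]:
    tok = AutoTokenizer.from_pretrained(model_name)
    n_tokens, n_byte_tokens = 0, 0
    for word in words:
        pieces = tok.tokenize(word)
        n_tokens += len(pieces)
        # Heuristic: SentencePiece byte-fallback tokens look like "<0xE5>".
        n_byte_tokens += sum(1 for p in pieces if p.startswith("<0x") and p.endswith(">"))
    fertility = n_tokens / len(words)              # average tokens per word
    byte_fallback_rate = n_byte_tokens / n_tokens  # share of tokens that are raw bytes
    return fertility, byte_fallback_rate

# Tiny illustrative word list; a real measurement would use a large Japanese corpus.
words = ["寒い", "寒そうだ", "恥ずかしい", "恥ずかしそうだ", "母", "私"]
f, b = tokenizer_stats("tokyotech-llm/Swallow-7b-hf", words)
print(f"fertility={f:.2f}, byte_fallback_rate={b:.2f}")
```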
Implications: Generating Ungrammatical Translations
Does this only matter for perplexity scores? Unfortunately, no. It affects text generation too.
When an LLM generates text, it looks for the path of least resistance (highest probability). If the grammatical suffix requires an “expensive” sequence of byte tokens, the model will avoid it.
The researchers tested this by asking the models to translate English sentences like “My mother is embarrassed” into Japanese.

As shown in Table 3:
- Weblab (the “weaker” model) successfully used the correct evidential expression (marked with a checkmark) in 90 out of 100 cases for “embarrassed.”
- Llama 3 struggled significantly. For “lonely” and “pain,” it never used the evidential form. It defaulted to the ungrammatical “no evidential” form 100% of the time.
Llama 3 is so biased against the tokenization of the correct suffix that it prefers to output broken Japanese rather than “pay the cost” of generating those tokens.
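A simplified version of this generation check is sketched below: prompt the model for a translation and test whether the output contains the evidential そう. The prompt wording and the substring test are my simplifications, not necessarily the scoring the authors used.

```python
# Sketch: sample a translation and check for the evidential そう form.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Translate into Japanese: My mother is embarrassed.\nJapanese:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30,
                                do_sample=True, temperature=0.7)
# Keep only the newly generated tokens, not the prompt.
translation = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                               skip_special_tokens=True)

uses_evidential = "そう" in translation  # crude check for the evidential suffix
print(translation, "-> evidential:", uses_evidential)
```

Repeating this sampling many times per sentence gives counts comparable to the ones reported in Table 3.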
Conclusion
This research highlights a critical blind spot in current LLM development. We often assume that scaling up parameters and data will solve everything. However, tokenization is the foundation of language modeling. If the foundation is cracked—or inconsistent—the model builds a distorted view of the language.
The key takeaways are:
- Grammar is bound to tokenization: A model’s ability to learn grammar is capped by how consistently it tokenizes the necessary morphemes.
- Efficiency isn’t always effectiveness: A tokenizer that is 99% efficient but fails on the 1% of suffixes required for high-level grammar will cause the model to sound like a beginner.
- Consistency matters: Paradoxically, a uniformly bad tokenizer (like Weblab’s) allowed the underlying statistical learning to shine through, whereas Llama 3’s inconsistent capability masked its own knowledge.
For future models to truly master languages like Japanese, we don’t just need more data—we need tokenizers that treat grammatical structures with the respect they deserve. Until then, even the smartest AIs might continue to be baffled by what feels unnatural to them, but perfectly normal to us.