Introduction

We often assume that as Large Language Models (LLMs) grow larger and train on more multilingual data, their grasp of grammar naturally improves across all languages. We look at benchmarks like MMLU and see impressive reasoning scores, leading us to believe the basics are solved. But are they?

It turns out that even state-of-the-art models like GPT-4o and Llama 3 struggle with specific, nuanced grammatical rules in languages other than English. This isn’t just about hallucinating facts; it is about failing to adhere to the fundamental structural rules of a language.

In a fascinating paper titled “Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar,” researchers from the University of Tokyo uncover a surprising phenomenon. They found that massive, sophisticated models often prefer ungrammatical Japanese sentences over correct ones. Even more surprisingly, a smaller, seemingly “worse” model actually gets the grammar right—but for the wrong reason.

The culprit? Tokenization.

In this post, we will break down the “First Person Psych Predicate Restriction” in Japanese, explore why top-tier LLMs fail to model it, and discover how inconsistent tokenization acts as a hidden bias that confuses even the smartest AI.

Figure 1: State-of-the-art language models frequently fail to respect nuanced aspects of Japanese grammar, such as the first person psych predicate restriction. Here, GPT-4o produces a sentence that is functionally identical to the ungrammatical Example (2) in Section 2.1.

As shown above in Figure 1, when asked to translate “My mother is cold,” a sophisticated model like GPT-4o produces a sentence that, while understandable, sounds deeply unnatural or even ungrammatical to a native Japanese speaker. Let’s explore why.

Background: The “Mind-Reading” Rule in Japanese

To understand why the models are failing, we first need a crash course in a specific quirk of Japanese linguistics: the First Person Psych Predicate Restriction.

The Rule

In English, adjectives that describe internal states (feelings, sensations) work the same way regardless of who you are talking about.

  • “I am cold.” (Grammatical)
  • “My mother is cold.” (Grammatical)

In Japanese, however, there is a strict epistemological rule: You cannot directly assert the internal state of another person. You can feel your own coldness, but you cannot know your mother feels cold; you can only observe that she appears to be cold.

The Examples

Let’s look at the “minimal pairs” used in this research. The word for “feeling cold” is samui.

1. First Person (The Speaker):

Watashi wa samui. (“I feel cold.”) Status: Grammatical. It is acceptable to state your own feelings.

2. Third Person (The Direct Error):

Haha wa samui. (“My mother feels cold.”) Status: Ungrammatical. This implies you have direct access to your mother’s nervous system. It sounds like you are mind-reading.

3. Third Person (The Evidential Correction):

Haha wa samu-soo da. (“My mother appears to feel cold.”) Status: Grammatical. By adding the evidential suffix -soo, you are describing an observation, which is permitted.

For an LLM to “know” Japanese, it should realize that sentence #2 is highly unlikely (high perplexity) compared to sentence #3.

The Experiment: Testing Perplexity

The researchers set out to test whether open-source models (7B to 10B parameters) respect this rule. They tested several popular models, including:

  • Multilingual Models: Llama 2, Llama 3, Mistral.
  • Japanese-Tuned Models: Swallow, Weblab-10B.

The Metric: Perplexity

They used perplexity as the yardstick. In simple terms, perplexity measures how “surprised” a model is by a sequence of text.

  • Low Perplexity: The model thinks this sentence is natural and likely.
  • High Perplexity: The model thinks this sentence is weird, wrong, or unexpected.

The Hypothesis: An ideal model should assign lower perplexity to the grammatical evidential sentence (samu-soo) and higher perplexity to the ungrammatical direct sentence (samui) when talking about a third person.
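
To make the comparison concrete, here is a minimal sketch of how one could score the two sentences with a Hugging Face causal language model. The model name and the single-sentence scoring are illustrative assumptions; the paper reports median perplexity over many templated sentences.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # assumption: any causal LM can be scored this way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(sentence: str) -> float:
    """Exponentiated mean negative log-likelihood of the tokenized sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

ungrammatical = "母は寒い。"    # "Haha wa samui."       (direct third-person psych predicate)
grammatical = "母は寒そうだ。"  # "Haha wa samu-soo da." (evidential form)

print("ungrammatical:", perplexity(ungrammatical))
print("grammatical:  ", perplexity(grammatical))
# An ideal model would print the higher number for the ungrammatical sentence.
```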

Results: The Strong Models Fail

The results, visualized below in Table 1, were shocking.

Table 1: Median perplexity across language models for sentences corresponding to those introduced in Section 2.1. Weblab is the only model with lower perplexities for all grammatical constructions (labeled a, b, c) than for the ungrammatical direct third-person psych predicate (labeled #), which we believe is due to its uniformly bad tokenization. Green, yellow, and red indicate perplexity for grammatical constructions that is respectively lower than, equal to, and higher than that of the ungrammatical construction.

Take a look at the column for Llama 3.

  • Ungrammatical (Row #): Perplexity is 6.9e+03.
  • Grammatical Evidential (Row c): Perplexity is 3.7e+04.

Llama 3 is more perplexed by the correct Japanese grammar than by the incorrect grammar! It prefers the ungrammatical “mind-reading” sentence. This trend holds for Mistral and Llama 2 as well.

However, look at Weblab. It is the only model highlighted in green. It assigns lower perplexity to the grammatical sentence (Row c) than the ungrammatical one (Row #).

Is Weblab a genius model with superior linguistic understanding? Not exactly. The reason for its success is much stranger.

Core Method: The Tokenization Trap

The researchers discovered that the root cause isn’t the model’s “brain” (the weights and attention mechanisms); it’s the model’s “eyes” (the tokenizer).

Understanding Tokenization

Before an LLM processes text, it chops it up into tokens. A token can be a whole word, part of a word, or even a single byte.

  • Efficient Tokenization: Common words are single tokens. (e.g., “apple” = 1 token).
  • Inefficient Tokenization: Rare or complex words are split into many small pieces. (e.g., “unprecedented” might be “un-pre-ced-ent-ed”).
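
You can inspect this directly. The sketch below uses a GPT-2 tokenizer from Hugging Face purely as an illustration; the exact splits depend on the tokenizer.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative stand-in tokenizer

for word in ["apple", "unprecedented"]:
    pieces = tok.tokenize(word)
    print(f"{word!r} -> {len(pieces)} token(s): {pieces}")
```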

The Byte Fallback Problem

Modern tokenizers use a mechanism called byte fallback. If a character (like a specific Japanese Kanji or Hiragana) isn’t in the model’s vocabulary, the tokenizer panics and breaks it down into raw bytes. This results in a sequence of many tokens for a single character.
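
With SentencePiece-style tokenizers (such as Llama 2's or Mistral's), byte-fallback pieces show up as tokens of the form <0xNN>, so you can count them directly. The model choice and the detection heuristic below are assumptions for illustration, not the paper's exact tooling.

```python
import re
from transformers import AutoTokenizer

# Assumption: any SentencePiece tokenizer with byte fallback behaves similarly.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
BYTE_TOKEN = re.compile(r"^<0x[0-9A-Fa-f]{2}>$")  # raw-byte pieces look like <0xE3>

for sentence in ["母は寒い。", "母は寒そうだ。"]:
    pieces = tok.tokenize(sentence)
    n_bytes = sum(bool(BYTE_TOKEN.match(p)) for p in pieces)
    print(f"{sentence}: {len(pieces)} tokens, {n_bytes} byte-fallback tokens")
```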

Here is the critical finding: Models like Llama 3 are optimized for multilingual performance, but their Japanese vocabulary is inconsistent.

  1. Common Adjectives: Words like samui (cold) might be tokenized efficiently.
  2. Grammatical Suffixes: The specific conjugation required to make the sentence grammatical (-soo or suffixes for adjectives ending in -shii) often triggers byte fallback.

Because the grammatical version (samusoo) relies on these suffixes, the tokenizer breaks it into a long string of fragmented byte tokens.

  • More tokens = Lower joint probability.
  • Lower probability = Higher perplexity.

The model assigns a bad score to the correct sentence not because the grammar is wrong, but because the tokenization is messy.
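
A toy calculation makes the effect obvious. The per-token probabilities below are made up; the point is only that shattering one suffix into several low-probability byte tokens drives perplexity up.

```python
import math

def perplexity(token_probs):
    # exp of the average negative log-probability over the token sequence
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# Hypothetical per-token probabilities for the same sentence under two tokenizations.
compact = [0.2, 0.3, 0.25]            # suffix kept as a few in-vocabulary tokens
fragmented = [0.2, 0.3] + [0.05] * 4  # suffix shattered into low-probability byte tokens

print(perplexity(compact))     # ~4.1  (lower)
print(perplexity(fragmented))  # ~11.8 (higher, despite being the grammatical sentence)
```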

Why Weblab “Succeeds”

Now, why did Weblab pass the test? Weblab was trained using an unmodified English tokenizer. It has almost no Japanese vocabulary.

  • It tokenizes the ungrammatical sentence poorly (lots of byte fallback).
  • It tokenizes the grammatical sentence poorly (lots of byte fallback).

Because Weblab handles everything uniformly badly, the “tokenization penalty” is equal across the board. The only remaining signal is the actual patterns the model learned during training. On a level playing field, the model correctly identifies that the evidential form is better.

Llama 3 fails because it has an inconsistent tokenizer: it handles the ungrammatical root word efficiently but chokes on the grammatical suffix.

Table 2: Japanese fertility scores and byte fallback rates across studied models, over sentences produced by the templates given in Appendix B. Due to its use of an unmodified English tokenizer, the majority of tokens produced by Weblab are out-of-vocabulary.

Table 2 above confirms this. “Fertility” refers to how many tokens are needed per word.

  • Llama 3 has a very low byte fallback rate (0.08), meaning it is usually efficient. But when it hits that specific edge case (the grammatical suffix), the sudden spike in tokens destroys the perplexity score.
  • Weblab has a massive byte fallback rate (0.66). It is struggling constantly, but consistently.
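
Both statistics are straightforward to approximate yourself. In the sketch below, the tokenizer name is an assumption and the Japanese word boundaries are supplied by hand; the paper's segmentation and sentence templates may differ.

```python
import re
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # assumption
BYTE_TOKEN = re.compile(r"^<0x[0-9A-Fa-f]{2}>$")

# Hand-segmented words of 母は寒そうだ ("My mother appears to feel cold").
words = ["母", "は", "寒そう", "だ"]

tokens = [piece for word in words for piece in tok.tokenize(word)]
fertility = len(tokens) / len(words)  # tokens per word
byte_fallback_rate = sum(bool(BYTE_TOKEN.match(t)) for t in tokens) / len(tokens)

print(f"fertility: {fertility:.2f}, byte fallback rate: {byte_fallback_rate:.2f}")
```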

Implications: Generating Ungrammatical Translations

Does this only matter for perplexity scores? Unfortunately, no. It affects text generation too.

When an LLM generates text, it looks for the path of least resistance (highest probability). If the grammatical suffix requires an “expensive” sequence of byte tokens, the model will avoid it.

The researchers tested this by asking the models to translate English sentences like “My mother is embarrassed” into Japanese.

Table 3: Weblab and Llama 3 outputs when asked to translate the English sentence “My mother is {psych predicate}” into Japanese. While Llama 3 struggled to output evidential expressions at all, Weblab was able to consistently output evidential expressions with a third-person subject feeling “cold” or “embarrassed.” Here “grammatical” indicates alternative phrasings that are grammatical translations of the sentence but do not require the use of evidential expressions.

As shown in Table 3:

  • Weblab (the “weaker” model) successfully used the correct evidential expression (marked with a checkmark) in 90 out of 100 cases for “embarrassed.”
  • Llama 3 struggled significantly. For “lonely” and “pain,” it never used the evidential form. It defaulted to the ungrammatical “no evidential” form 100% of the time.

Llama 3 is so biased against the tokenization of the correct suffix that it prefers to output broken Japanese rather than “pay the cost” of generating those tokens.
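
If you want to reproduce the spirit of this generation test, a rough sketch looks like the following. The model name, prompt wording, sample count, and the crude substring check for the evidential そう are all assumptions rather than the paper's protocol.

```python
from transformers import pipeline

# Assumed model and prompt; any instruction-tuned causal LM could stand in here.
generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

prompt = "Translate into natural Japanese: My mother is cold.\nJapanese:"
n_samples = 10          # the paper samples far more outputs per psych predicate
evidential_count = 0

for _ in range(n_samples):
    out = generator(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
    completion = out[len(prompt):]
    if "そう" in completion:  # crude check for the evidential suffix -soo
        evidential_count += 1

print(f"evidential form in {evidential_count}/{n_samples} samples")
```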

Conclusion

This research highlights a critical blind spot in current LLM development. We often assume that scaling up parameters and data will solve everything. However, tokenization is the foundation of language modeling. If the foundation is cracked—or inconsistent—the model builds a distorted view of the language.

The key takeaways are:

  1. Grammar is bound to tokenization: A model’s ability to learn grammar is capped by how consistently it tokenizes the necessary morphemes.
  2. Efficiency isn’t always effectiveness: A tokenizer that is 99% efficient but fails on the 1% of suffixes required for high-level grammar will cause the model to sound like a beginner.
  3. Consistency matters: Paradoxically, a uniformly bad tokenizer (like Weblab’s) allowed the underlying statistical learning to shine through, whereas Llama 3’s inconsistent capability masked its own knowledge.

For future models to truly master languages like Japanese, we don’t just need more data—we need tokenizers that treat grammatical structures with the respect they deserve. Until then, even the smartest AIs might continue to be baffled by what feels unnatural to them, but perfectly normal to us.