Do LLMs Actually Know How to Spell? Inside the CUTE Benchmark
When we interact with Large Language Models (LLMs) like GPT-4 or Llama 3, we often attribute human-like literacy to them. We assume that because a model can write a sonnet or debug Python code, it understands text the same way we do: letter by letter, word by word.
However, this is a fundamental misconception. LLMs do not “read” text character by character. Instead, they process text through a tokenizer, which chunks characters into tokens. A common word like “the” is processed as a single atomic unit, not as the sequence t-h-e. To the model, the token the is just an integer ID in a massive list, distinct from The, theoretical, or lathe.
This architectural reality raises a fascinating question: If LLMs don’t natively see characters, do they actually understand orthography (spelling)? Can they manipulate text at the character level, or are they simply relying on memorized statistical patterns?
In this post, we will dive deep into a research paper titled “CUTE: Measuring LLMs’ Understanding of Their Tokens” by researchers from LMU Munich and TU Munich. This paper introduces a benchmark designed to expose the “blind spots” in modern LLMs, revealing that while these models can act like they know how to spell, their internal understanding of word composition is surprisingly brittle.
The Problem with Tokens
Before dissecting the paper, let’s establish why this matters. Most modern LLMs use Byte-Pair Encoding (BPE) or similar subword tokenization methods.
- Efficiency: It’s faster to process one token for “apple” than five tokens for a-p-p-l-e.
- Context: Packing more characters into each token lets the model fit more text into its context window.
However, this efficiency comes at a cost. The model loses direct access to the character composition of the token. It only knows that Token A usually follows Token B. If you ask an LLM to “output the third letter of the word ‘apple’,” it can’t just look at the string. It has to rely on associations learned during training—essentially, it has to have “memorized” that the concept of “apple” contains the concept of “p” in the third slot.
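You can see this directly with a BPE tokenizer. Below is a minimal sketch using the tiktoken library with the cl100k_base vocabulary (the one used by GPT-4-era models); the exact token IDs and splits depend on the vocabulary, so treat the printed values as illustrative.

```python
import tiktoken

# Load a BPE vocabulary (cl100k_base is used by GPT-4-era models).
enc = tiktoken.get_encoding("cl100k_base")

for text in ["the", "The", "theoretical", "lathe", "apple"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    # Each word maps to one or more opaque integer IDs, not to its letters.
    print(f"{text!r:14} -> ids={ids} pieces={pieces}")
```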
The researchers behind CUTE set out to test the limits of this indirect knowledge. They wanted to answer three specific questions:
- Do LLMs know which characters make up their tokens?
- Do LLMs understand the difference between looking similar (orthography) and meaning similar things (semantics)?
- Can LLMs manipulate text at the character level (e.g., inserting or deleting letters)?
Introducing CUTE: Character-level Understanding of Tokens Evaluation
To answer these questions, the authors developed CUTE, a benchmark composed of tasks that are trivial for literate humans but potentially difficult for token-based models.
The genius of CUTE lies in its comparative structure. For many tasks, the researchers created two versions: a Character-level version and a Word-level version.
- Character-level: Requires breaking open a token (e.g., “Remove the ’e’ from ’there’”).
- Word-level: Operates on whole tokens (e.g., “Remove the word ’the’ from ’the sky is blue’”).
This comparison is vital. If a model fails at the character task but succeeds at the word task, we know the model isn’t confused by the instruction—it is specifically confused by the composition of the token.
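To make the pairing concrete, here is a hypothetical character-level/word-level pair in the spirit of the benchmark; the exact prompt wording used in CUTE may differ.

```python
# Hypothetical paired tasks; the authors' actual prompts may be worded differently.
deletion_pair = {
    "character_level": {
        "instruction": "Remove every 'e' from 'there'",
        "answer": "thr",
    },
    "word_level": {
        "instruction": "Remove the word 'the' from 'the sky is blue'",
        "answer": "sky is blue",
    },
}
```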
Figure 1: A breakdown of the tasks in the CUTE benchmark. Note the distinction between operations on single words (character level) and sentences (word level).
Let’s break down the three main categories of the benchmark as shown in Figure 1.
1. Composition Tasks
These tasks test basic knowledge of what is inside a token.
- Spelling: The model is given a word (usually a single token) and must output it with spaces between letters (e.g., there → t h e r e).
- Inverse Spelling: The reverse operation. The model gets spaced letters and must combine them back into the word.
- Contains Char/Word: A simple boolean check. “Is there a ‘c’ in ’there’?” vs. “Is there a ’the’ in ’the sky is blue’?”
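As a rough sketch of how such composition items can be generated and scored, consider the helpers below. The function names and prompt wording are my own, not the paper’s.

```python
def spelling_example(word: str) -> tuple[str, str]:
    """Spelling task: the word must be rewritten with spaces between its letters."""
    return f"Spell out the word '{word}'.", " ".join(word)

def contains_char_example(word: str, char: str) -> tuple[str, str]:
    """Contains Char task: a simple yes/no membership check."""
    return f"Is there a '{char}' in '{word}'?", "Yes" if char in word else "No"

print(spelling_example("there"))            # ("Spell out the word 'there'.", "t h e r e")
print(contains_char_example("there", "c"))  # ("Is there a 'c' in 'there'?", "No")
```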
2. Similarity Tasks
This is where things get interesting. Language models are trained on semantic context. In the vector space of an LLM, “happy” is mathematically close to “glad.” But is “happy” close to “apply”?
- Orthographic Similarity: The model is given a word (e.g., “happy”) and two options. One is spelled similarly (low Levenshtein distance, like “apply”), and one is not. Can the model pick the look-alike?
- Semantic Similarity: A control task. Can the model pick the synonym?
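Orthographic similarity here is grounded in Levenshtein (edit) distance. The paper does not mandate a particular implementation, but the standard dynamic-programming version below shows why “apply” counts as close to “happy” while “glad” does not.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

print(levenshtein("happy", "apply"))  # 2 -> orthographically close
print(levenshtein("happy", "glad"))   # 5 -> orthographically distant
```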
3. Manipulation Tasks
These are the stress tests. They require the model to act on its knowledge.
- Insertion: Add a character after every instance of another character (e.g., add ‘b’ after every ’e’).
- Deletion: Remove specific characters.
- Substitution: Replace characters (e.g., change ’e’ to ‘a’).
- Swapping: Swap the positions of two specific characters.
Each of these has a corresponding word-level task (e.g., swapping words in a sentence).
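For anything that can actually see individual characters, these manipulations are near one-liners. Here is a sketch of reference implementations (my own, not the authors’ code) to make that clear:

```python
def insert_after(text: str, target: str, new: str) -> str:
    """Add `new` after every occurrence of `target` (e.g., 'b' after every 'e')."""
    return text.replace(target, target + new)

def delete_char(text: str, target: str) -> str:
    """Remove every occurrence of `target`."""
    return text.replace(target, "")

def substitute(text: str, old: str, new: str) -> str:
    """Replace every `old` with `new` (e.g., 'e' -> 'a')."""
    return text.replace(old, new)

def swap(text: str, a: str, b: str) -> str:
    """Swap the positions of characters `a` and `b` wherever they occur."""
    return text.translate(str.maketrans(a + b, b + a))

print(insert_after("there", "e", "b"))  # thebreb
print(swap("there", "t", "h"))          # htere
```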
Experimental Setup
The researchers tested a variety of popular open-source models, ranging from 7 billion to 132 billion parameters. The lineup included Llama 2, Llama 3, Mistral, Gemma, Command-R, and DBRX.
To ensure the test was fair, they used few-shot prompting. This means they didn’t just ask the question; they provided four examples of the task being performed correctly before asking the test question. This teaches the model the format required without fine-tuning its weights.
Figure 3: An example of the few-shot prompt used for the spelling task. The model sees four examples of “Spelling out” before attempting the target word “cow”.
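A rough sketch of how such a 4-shot prompt can be assembled is shown below. The demonstration words and exact phrasing are placeholders, not the authors’ prompt; only the structure (four solved examples, then the target) reflects the setup.

```python
# Placeholder demonstration words; the authors' actual few-shot examples differ.
shots = [("sure", "s u r e"), ("plane", "p l a n e"),
         ("bread", "b r e a d"), ("stick", "s t i c k")]

def build_spelling_prompt(target: str) -> str:
    """Assemble a 4-shot prompt: four solved examples, then the target word."""
    demo = "\n\n".join(f"Spell out: {w}\nAnswer: {a}" for w, a in shots)
    return f"{demo}\n\nSpell out: {target}\nAnswer:"

print(build_spelling_prompt("cow"))
```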
The Results: A Tale of Two Granularities
The results, visualized in Figure 2 below, paint a complex picture of LLM capabilities. The models are not universally bad at orthography, but they are highly inconsistent.
Figure 2: Accuracy across all tasks. Pay close attention to the gap between Word tasks (blue bars) and Character tasks (yellow/orange bars).
1. The Illusion of Competence: Spelling
Look at the top two rows of Figure 2. Most models, especially the larger ones like Llama-3-70B and Command-R+, score incredibly high on Spelling and Inverse Spelling.
You might think, “Great! They know how to spell.” But the authors suggest caution here. Spelling is a very common task in training data. It is highly likely that the models have simply memorized the mapping between the token there and the sequence t, h, e, r, e. It doesn’t necessarily mean they understand the structure; they just know the answer key.
2. The Semantic Bias: Similarity
The Similarity results (middle of Figure 2) reveal a fascinating blind spot.
- Semantic (Purple stripes): Models are great at this. They know “happy” means “glad.” This is what they were built for.
- Orthographic (Red): Performance here is abysmal. Many models perform near or even below random chance (50%).
This confirms a major hypothesis: LLMs understand meaning, not shape. They struggle to recognize that “apply” looks like “happy” because, in their internal embedding space, those two concepts are galaxies apart. Only the very largest models (like Command-R+) show significant ability here, likely due to sheer volume of data exposure.
3. The Collapse: Manipulation
The most damning evidence comes from the Manipulation tasks (bottom half of Figure 2).
- Word-level (Blue bars): Models are generally competent. They can insert, delete, and swap words in a sentence. They understand the logic of the instruction.
- Character-level (Yellow/Orange bars): Performance falls off a cliff.
Take Insertion as an example. Even powerful models struggle to “add ‘b’ after every ’e’”. Why? Because to do this, the model has to:
- Deconstruct the token there into t-h-e-r-e.
- Locate each ‘e’.
- Insert a ‘b’ after it.
- Reconstruct the string.
Since the token there is an atomic unit to the model, it can’t “see” inside it to perform the logic. It has to rely on rote memorization of letter positions, which fails when applying a rule dynamically.
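Written out as code, from the character-level vantage point the model never has, those four steps are trivial. This is only a sketch of the procedure, not the benchmark’s scoring code:

```python
def add_after_every(word: str, target: str, new: str) -> str:
    chars = list(word)        # 1. Deconstruct: 'there' -> ['t', 'h', 'e', 'r', 'e']
    out = []
    for c in chars:
        out.append(c)
        if c == target:       # 2. Locate each 'e'...
            out.append(new)   # 3. ...and insert a 'b' right after it.
    return "".join(out)       # 4. Reconstruct the string.

print(add_after_every("there", "e", "b"))  # thebreb
```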
The Random String “Smoking Gun”
To prove that tokenization is the culprit, the researchers performed a clever control experiment. They ran the same tasks on random strings (like fxqg).
Because these strings are nonsense, the tokenizer cannot match them to a single pre-learned token. Instead, the tokenizer is forced to break them down into individual characters or very small chunks.
Figure 5: Performance on standard vocabulary words (Vocab) vs. Random strings (Rand). Ideally, standard words should be easier, but notice the pink bars.
As shown in Figure 5, models often performed better on random strings (Pink bars) than on real words (Gray bars) for manipulation tasks.
This is counter-intuitive but makes perfect sense architecturally. When the model processes a random string, it sees the individual characters as separate inputs. It doesn’t have to “unpack” a token. It can simply look at the sequence f, x, q, g and apply the manipulation rule. When dealing with a real word like there, the tokenization hides the characters, making the task harder.
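You can reproduce the intuition with any BPE tokenizer. A sketch using tiktoken’s cl100k_base vocabulary follows; token boundaries will differ for other tokenizers, but the pattern is the same.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["there", "fxqg"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r}: {len(ids)} token(s) -> {pieces}")

# A common word typically collapses into a single opaque token, while a random
# string is forced into several small pieces, each much closer to raw characters.
```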
Multilingual Capabilities: No Magic Bullet
The study also extended to Russian (CUTE-Rus) to see if multilingual models—which encounter different scripts and tokenization patterns—fare better.
Figure 4: Results on the Russian version of the benchmark.
The trends in Russian largely mirror those in English. Interestingly, Figure 4 shows that models struggled significantly with basic spelling in Russian, even when provided with examples. This suggests that the “memorization” of spelling tables is heavily skewed toward English in the training data. Even multilingual models like Aya didn’t show a massive advantage derived purely from their multilingual nature; their performance gains were mostly correlated with their general size and quality.
Does Scaling Solve It?
A common refrain in AI research is “Scale is all you need.” Does making the model bigger fix its inability to see characters?
The data suggests: Yes and No.
Looking at Figure 2, larger models (like Llama-3-70B) significantly outperform smaller ones (like Llama-2-7B). They are better at following instructions and have memorized more spelling patterns.
However, the gap remains. Even the largest models struggle with character swapping and insertion compared to their word-level counterparts. Scaling improves the ability to simulate understanding, but it doesn’t grant the model direct access to the characters. The fundamental architectural limitation persists.
Conclusion and Implications
The CUTE benchmark provides a sobering look at the limitations of current LLMs. While these models can write poetry and code, their understanding of the very text they generate is abstracted. They are like master painters who can create photorealistic images but don’t understand the chemistry of the paint they are using.
Key Takeaways:
- Superficial Knowledge: LLMs know that “there” is spelled t-h-e-r-e, but they don’t use that information to process text. It’s a retrieved fact, not a structural reality.
- Orthographic Blindness: Models struggle to identify words that look alike, which has implications for tasks like rhyme generation, pun explanation, or correcting typos based on visual similarity.
- The Tokenization Bottleneck: The primary hurdle is BPE tokenization. As long as models treat words as atomic integers, character-level logic will be a simulation rather than a native capability.
The authors conclude that for tasks requiring true character-level understanding (like cryptography, complex word games, or rigorous morphological analysis), the current architecture of LLMs is suboptimal. Future research may need to look toward character-level models (which process text byte-by-byte) or hybrid architectures to bridge the gap between semantic fluency and orthographic precision.
For now, the next time an LLM fails to reverse a string or solve a Wordle puzzle, you’ll know why: it’s not trying to be difficult; it literally cannot see the letters you’re talking about.