The “Wug” Test for AI: Do LLMs Learn Like Humans?
If you have ever taken an introductory linguistics class, you are likely familiar with the “Wug Test.” In 1958, Jean Berko Gleason showed children a picture of a bird-like creature and said, “This is a wug.” She then showed them a picture of two of the creatures and said, “Now there is another one. There are two of them. There are two…?” The children correctly answered “wugs.”
This test demonstrated that children don’t just memorize words; they internalize abstract grammatical rules (like adding an “s” for plurals) and apply them to words they have never heard before. They do this with remarkable efficiency, often inferring rules from indirect evidence—context clues that aren’t explicit instructions.
Today, we face a new question: Do Large Language Models (LLMs) possess this same ability?
Recent models like GPT-4 and Llama are trained on trillions of tokens—thousands of times more data than a human child hears. Yet, their data efficiency is surprisingly poor compared to humans. A recent paper, “Can Language Models Induce Grammatical Knowledge from Indirect Evidence?”, investigates this discrepancy. By subjecting language models to a modernized, digital version of the Wug Test, the researchers explore whether AI can truly generalize from indirect clues or if it merely relies on rote memorization.
The Core Problem: Direct vs. Indirect Evidence
To understand the gap between human and machine learning, we first need to categorize how we learn. In language acquisition theory, evidence is often split into two categories:
- Direct Evidence: You see a specific sentence structure with specific words, and you learn that specific combination is valid.
- Indirect Evidence: You see a sentence structure in one context and infer that the grammatical rule applies to other contexts, even if the words are different.
Humans are masters of indirect evidence. If you learn that “The wug is eating,” you can infer that “The wug eats” is also grammatical, even if you’ve never heard the latter sentence. You infer the properties of the noun “wug” and the verb “eat” separately and recombine them.
The researchers hypothesize that this ability to leverage indirect evidence is a key “inductive bias” that allows humans to learn languages so efficiently. To test if Language Models share this trait, they designed the Wug In-Direct Evidence Test (WIDET).

As illustrated in Figure 1 above, the study distinguishes between three levels of evidence:
- Direct Evidence (DE): The model is trained on the exact sentence it will be tested on (e.g., “wug loves himself”).
- Lexically Indirect Evidence (LexIE): The model sees the same grammatical structure but with different surrounding words (e.g., “wug is helping himself”). The syntax is the same, but the vocabulary changes.
- Syntactically Indirect Evidence (SynIE): The model sees the target word (wug) in a completely different sentence structure (e.g., “wug helped his friend”). The model must infer properties like gender or transitivity from a different context entirely.
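To make the three conditions concrete, here is a minimal sketch in Python of how a training sentence relates to the evaluation sentence in each case. The sentences are illustrative stand-ins for the paper’s templates, not the exact items it uses.

```python
# Illustrative (not the paper's exact templates): how the three evidence
# conditions pair a training sentence with the evaluation sentence.
EVALUATION_SENTENCE = "<wug#1> loves himself"

evidence_conditions = {
    # DE: the training sentence is identical to the evaluation sentence.
    "DE": "<wug#1> loves himself",
    # LexIE: same syntactic frame (subject + verb + reflexive),
    # but different lexical material around the wug token.
    "LexIE": "<wug#1> is helping himself",
    # SynIE: a different syntactic frame entirely; the model must infer the
    # wug's properties from this context and transfer them to the reflexive.
    "SynIE": "<wug#1> helped his friend",
}

for condition, training_sentence in evidence_conditions.items():
    status = "exact match" if training_sentence == EVALUATION_SENTENCE else "generalization required"
    print(f"{condition}: train on '{training_sentence}' -> {status}")
```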
Methodology: Injecting “Wugs” into the Matrix
How do you test if a model can learn a word it has never seen before? You invent one.
The researchers took a standard pretraining dataset (English Wikipedia) and injected synthetic sentences containing a newly coined token: <wug#n>. This ensures the model has zero prior knowledge of the word. They then trained a BabyBERTa model (a smaller, developmentally plausible version of RoBERTa) on this modified data.
The crucial variable was frequency. They injected these “wug” sentences at varying rates—from 0 to 100 observations—to see how much data the model needed to “get it.”
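A minimal sketch of that injection step is shown below, treating the corpus as a plain list of sentences. The function name, the random placement, and the example sentences are my own simplifications, not the paper’s actual pipeline.

```python
import random

def inject_wug_sentences(corpus, wug_sentence, n_observations, seed=0):
    """Scatter `n_observations` copies of a synthetic wug sentence into a
    pretraining corpus (a list of sentence strings). This mirrors the idea of
    controlling how often the model observes the novel token during training;
    the placement strategy here is an assumption, not the paper's code."""
    rng = random.Random(seed)
    augmented = list(corpus)
    for _ in range(n_observations):
        position = rng.randrange(len(augmented) + 1)
        augmented.insert(position, wug_sentence)
    return augmented

# Example: vary the amount of indirect evidence from 0 to 100 observations.
wikipedia_sentences = ["The cat sat on the mat.", "Paris is the capital of France."]
for n in (0, 1, 10, 100):
    corpus_n = inject_wug_sentences(wikipedia_sentences, "<wug#1> is helping himself .", n)
    print(n, len(corpus_n))
```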
The Linguistic Phenomena
The team tested the models on seven different linguistic phenomena to see if the type of grammatical rule mattered. These included:
- Anaphor Agreement: Matching pronouns to subjects (e.g., she… herself).
- Transitivity: Knowing if a verb requires an object.
- Subject-Verb Agreement: Singular vs. Plural matching (e.g., The wugs run vs. The wug runs).
The table below details the specific sentence structures used for training (the input) and evaluation (the test). Notice how the LexIE and SynIE training instances differ from the evaluation instances, requiring the model to bridge that gap through generalization rather than recall.

The evaluation used a “minimal pair” approach. The model is presented with two sentences: a grammatical one and an ungrammatical one (marked with an asterisk *). The model “passes” the test if it assigns a higher probability to the grammatical sentence.
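The post does not spell out the scoring function, but a common way to compare a minimal pair with a masked language model like BabyBERTa is pseudo-log-likelihood scoring. The sketch below uses the off-the-shelf roberta-base checkpoint and an illustrative minimal pair as stand-ins for the paper’s trained models and test items.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Stand-in checkpoint; the paper instead trains BabyBERTa from scratch on the
# modified corpus, which is not reproduced here.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def pseudo_log_likelihood(sentence):
    """Score a sentence by masking each token in turn and summing the
    log-probability of the original token (a common proxy for sentence
    probability with bidirectional models)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

grammatical = "The wug loves himself."
ungrammatical = "The wug loves herself."   # the starred member of the minimal pair
passed = pseudo_log_likelihood(grammatical) > pseudo_log_likelihood(ungrammatical)
print("model prefers the grammatical sentence:", passed)
```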
The Results: A Story of Inefficiency
The results of the experiments reveal a stark contrast between direct memorization and true generalization.
The researchers tracked the model’s accuracy as the number of “wug” observations increased from 0 to 100. If models learned like humans, we would expect them to pick up rules quickly from indirect evidence (LexIE and SynIE).
However, the data tells a different story.

Let’s break down the graphs in Figure 2:
- Direct Evidence (Blue Line): Unsurprisingly, this works best. When the model sees the exact sentence during training, its accuracy shoots up quickly. This is rote learning.
- Lexically Indirect Evidence (Orange Line): This is where cracks begin to show. Even though the grammatical structure is identical to the test sentence (just with different words), the model learns significantly slower than with direct evidence. In some cases, like Transitive verbs, it struggles to reach high accuracy even after 100 observations.
- Syntactically Indirect Evidence (Green Line): This is the most concerning result. In almost every category, generalization from syntactically different contexts is abysmal. For Transitive verbs (top row, middle), the performance actually decreases as the model sees more examples.
The “Transitive” Anomaly
Why did the performance drop for Transitive verbs in the SynIE setting? The authors suggest this reveals a flaw in how LLMs generalize.
In the SynIE training data for transitivity (refer back to Table 1), the sentence structure was: “every lion hunts what no prey can [wug]”. Here, wug is a transitive verb, but its object (the fronted “what”) comes earlier in the clause, leaving wug in sentence-final position. In the evaluation, the correct structure is “some trees [wug]ed the car”, where wug is directly followed by an object.
The model likely adopted a “linear heuristic”—it learned that wug appears at the end of sentences and assumed it cannot be followed by an object. It failed to learn the hierarchical grammatical rule (that wug is a transitive verb) and instead overfit to the surface-level word order. This is a classic failure mode known as “linear generalization,” distinct from the “hierarchical generalization” humans use.
Deep Dive: The Problem of Distance and Noise
One hypothesis for why the models struggled is “distance.” In sentences like Anaphor Gender Agreement (“The wug… has devoted herself”), the subject and the reflexive pronoun are separated by other words.
The researchers wanted to know: Does adding more words (distance) or confusing words (attractors) between the subject and the pronoun break the model’s learning process?
They designed a deeper analysis using Interference Types:
- Attractors: Inserting nouns of the opposite gender between the wug and herself (e.g., “The wug helping the man loves herself”).
- Distance: Simply adding more neutral words to increase the gap.
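The templates below are an illustrative sketch of the two interference types; the actual frames and the AT2/DT1/DT2 items in the paper differ in their details.

```python
# Illustrative templates for the interference analysis (the real items differ).
BASE = "<wug#1> loves herself"                                        # no interference

ATTRACTOR = "<wug#1> helping the man loves herself"                   # AT-style: opposite-gender noun intervenes
ATTRACTOR_PRONOUN = "<wug#1> who helped him yesterday loves herself"  # AT2-style: intervening pronoun "him"

DISTANCE = "<wug#1> in the old house by the river loves herself"      # DT-style: neutral words add distance only

for label, sentence in [("base", BASE), ("attractor", ATTRACTOR),
                        ("attractor+pronoun", ATTRACTOR_PRONOUN), ("distance", DISTANCE)]:
    words = sentence.split()
    gap = words.index("loves") - words.index("<wug#1>")
    print(f"{label:20s} gap of {gap} tokens between subject and reflexive")
```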

The results of this stress test were illuminating.

As shown in Figure 3, the presence of an “Attractor” (specifically AT2, which includes an opposite-gender pronoun like “him”) caused significant instability in learning. The model gets confused by the intervening pronoun, struggling to link “herself” back to the main subject “wug.”
Interestingly, pure distance (DT1, DT2) wasn’t as destructive as active attractors. This suggests that the models aren’t just forgetting the subject because it’s far away; they are getting distracted by other potential (but incorrect) subjects along the way.
Does the “Wug” Method Matter?
A valid critique of this study is that sticking a token like <wug#n> into a sentence isn’t quite natural. In the original human Wug Test, the words sounded like real English words (phonologically plausible).
To address this, the authors ran a comparison between their “Tag” method and a “Wug” method where they used a pseudo-word generator to create pronounceable fake words (like blick or nad).

Table 4 shows something fascinating: In the “Wug” method (using realistic pseudo-words), the model achieved high accuracy (over 80%) with zero observations.
Why? Because the tokenizer broke the pseudo-words down into sub-words that carried meaning or looked like English pluralizations (e.g., ending in ’s’). The model was cheating by using existing knowledge of English morphology. By using the abstract <wug#n> tag, the researchers successfully isolated the learning process, proving that their method was a stricter, more accurate test of learning from scratch.
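One way to see why the two methods diverge is to inspect how a subword tokenizer treats each kind of novel word. The sketch below uses the roberta-base tokenizer as a stand-in; add_tokens is how a tag like <wug#1> can be registered as a single vocabulary item with no prior associations (in a real training run the model’s embedding matrix would also be resized).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# "Wug" method: a pronounceable pseudo-word gets split into existing subword
# pieces the model already has associations for, so it can lean on prior
# knowledge of English morphology.
print(tokenizer.tokenize("wugs"))        # splits into multiple familiar subword pieces

# "Tag" method: register the tag as one atomic token with no prior meaning.
tokenizer.add_tokens(["<wug#1>"])
print(tokenizer.tokenize("<wug#1>"))     # ['<wug#1>'], a single previously unseen symbol
```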
Conclusion: The Data Efficiency Gap
The findings of the WIDET experiments paint a complex picture of modern AI. While Large Language Models are incredibly capable, their learning mechanisms fundamentally differ from human cognition in specific ways:
- Inefficiency with Indirect Evidence: Unlike humans, who rapidly infer rules from varied contexts, LMs struggle to generalize unless the training data closely resembles the test data (Direct Evidence).
- Surface-Level Heuristics: The failure in the Transitive experiments suggests models often rely on simple word-order patterns rather than deep grammatical structures.
- Vulnerability to Distraction: Intervening words (attractors) easily disrupt the model’s ability to link subjects and verbs/pronouns, hindering learning in complex sentences.
This paper highlights a major frontier for future research. To build AI that learns as efficiently as a human child, we cannot simply scale up the data. We need architectures and training objectives that encourage the use of indirect evidence—allowing models to connect the dots between disparate contexts, rather than just memorizing the dots themselves.
Until then, LLMs remain powerful, but perhaps not as linguistically clever as a preschooler passing the Wug Test.