How AI Learns the Unseen: The Mystery of the “Beautiful Five Days”
Imagine you are reading a book and you come across the phrase: “a beautiful five days.”
To a native English speaker, this sounds perfectly natural. You might say, “We spent a beautiful five days in Rome.” But if you pause and look at the grammar, something strange is happening. The word “a” is a singular article (used for one thing, like “a dog”). The phrase “five days” is plural. In strict grammatical terms, combining a singular article with a plural noun phrase should be a disaster. We don’t say “a days” or “a five dogs.” Yet, the “Article + Adjective + Numeral + Noun” (AANN) construction is perfectly acceptable English.
How do we learn this? And more importantly for modern technology, how do Large Language Models (LLMs) learn this?
A recent paper titled “Language Models Learn Rare Phenomena from Less Rare Phenomena” investigates this fascinating linguistic puzzle. The researchers tackle a massive question in AI: Do models simply memorize what they see, or do they actually generalize complex rules from sparse data? To find out, they trained language models from scratch, systematically hiding specific grammatical structures to see if the AI could “invent” them on its own.
The Debate: Memorization vs. Generalization
There is a prevalent criticism in the field of AI that Large Language Models are merely “stochastic parrots”—systems that stitch together memorized snippets of text without understanding the underlying rules. Because models like GPT-4 are trained on trillions of words, they have seen almost every possible sentence structure millions of times. If a model generates a rare sentence type, it’s hard to tell if it understands the grammar or if it’s just recalling a training example.
This makes studying human-like learning in AI difficult. Humans can determine if a sentence is grammatical even if they have never heard it before. Can LLMs do the same?
To solve this, the authors of this paper utilized a technique called “controlled rearing.” Instead of using a massive, pre-trained model, they trained smaller models (based on the OPT architecture) on a “human-scale” dataset called BabyLM (about 100 million words, roughly what a human child hears). This allowed the researchers to perform surgery on the training data—removing or altering specific sentences to test exactly how the model learns.
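To make this concrete, here is a minimal sketch of what "training a small model from scratch on a human-scale corpus" can look like, using Hugging Face's transformers and datasets libraries. The model size, hyperparameters, tokenizer choice, and the corpus file name are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of "controlled rearing": train a small OPT-style model
# from scratch on a (possibly ablated) text corpus. Hyperparameters, paths,
# and tokenizer choice are illustrative assumptions, not the paper's setup.
from datasets import load_dataset
from transformers import (AutoTokenizer, OPTConfig, OPTForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # reuse an existing tokenizer

# "babylm_no_aann.txt" is a hypothetical corpus file with AANNs already removed.
dataset = load_dataset("text", data_files={"train": "babylm_no_aann.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

config = OPTConfig(  # small model sized for a ~100M-word corpus (illustrative numbers)
    vocab_size=len(tokenizer),
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    ffn_dim=2048,
    max_position_embeddings=256,
)
model = OPTForCausalLM(config)  # randomly initialized: no pretrained knowledge at all

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="opt-babylm-no-aann",
                           per_device_train_batch_size=32,
                           num_train_epochs=1,
                           learning_rate=3e-4),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```

Because the weights start from random initialization, anything such a model ends up knowing about AANNs must come from the (manipulated) training text itself.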
The Case Study: The AANN Construction
The researchers chose the AANN construction (“a beautiful five days”) as their target because it is a “rare phenomenon.” In the BabyLM corpus, AANNs make up only about 0.02% of the sentences. It is an edge case—a linguistic “long tail” event that defies standard agreement rules.
If an AI can learn this rare rule, how does it do it?
- Memorization: Does it need to see “a beautiful five days” to know it’s valid?
- Generalization: Can it learn the rule for AANNs by seeing other, more common phrases that share similar structural DNA?
The Core Method: Surgical Data Manipulation
The methodology is the heart of this paper. The researchers didn’t just train one model; they trained 114 different models from scratch so their results would be statistically robust rather than artifacts of a single training run. They created several “counterfactual” versions of the training dataset.
As illustrated in the diagram below, the researchers systematically manipulated the input corpora.

Here is a breakdown of the different “worlds” they created for these AI models:
- Unablated (The Control): The model sees the normal text, including the rare AANN examples.
- No AANN: The researchers used complex regex patterns (text-matching rules) to find and delete every single instance of an AANN from the training data. This model lives in a world where “a beautiful five days” effectively doesn’t exist. (A simplified sketch of this pattern-based filtering, and of the reorderings described next, appears after this list.)
- Counterfactual Corruptions (ANAN / NAAN): In these datasets, they didn’t just remove the AANNs; they replaced them with grammatically incorrect nonsense using the same words.
- ANAN: “a five beautiful days” (Article-Numeral-Adjective-Noun)
- NAAN: “five beautiful a days” (Numeral-Adjective-Article-Noun)
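The paper's actual extraction pipeline is more careful than a single regular expression, but a simplified sketch helps make the manipulation concrete. The pattern, word lists, and function names below are illustrative assumptions; real AANN detection has to handle far more adjectives, numerals, and noun forms.

```python
import re

# Simplified, illustrative pattern: "a/an" + one adjective + a number word + a plural noun.
# The paper's actual extraction is more careful (broader numeral/noun coverage, tagging, etc.).
NUMERALS = r"(?:two|three|four|five|six|seven|eight|nine|ten|dozen|hundred|\d+)"
AANN_RE = re.compile(rf"\b(a|an)\s+(\w+)\s+({NUMERALS})\s+(\w+s)\b", re.IGNORECASE)

def ablate_aann(sentences):
    """'No AANN' corpus: drop every sentence that matches the pattern."""
    return [s for s in sentences if not AANN_RE.search(s)]

def corrupt_to_anan(sentence):
    """'ANAN' corpus: reorder matched spans to Article-Numeral-Adjective-Noun."""
    return AANN_RE.sub(lambda m: f"{m.group(1)} {m.group(3)} {m.group(2)} {m.group(4)}", sentence)

def corrupt_to_naan(sentence):
    """'NAAN' corpus: reorder matched spans to Numeral-Adjective-Article-Noun."""
    return AANN_RE.sub(lambda m: f"{m.group(3)} {m.group(2)} {m.group(1)} {m.group(4)}", sentence)

if __name__ == "__main__":
    text = "We spent a beautiful five days in Rome."
    print(corrupt_to_anan(text))  # -> "We spent a five beautiful days in Rome."
    print(corrupt_to_naan(text))  # -> "We spent five beautiful a days in Rome."
```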
Measuring Success with SLOR
How do you measure if a model “knows” a grammatical rule? You can’t just ask it. Instead, the researchers used a metric called SLOR (Syntactic Log-Odds Ratio).
Standard probability scores from an AI are biased by word frequency. A model might rate “a happy dog” as highly probable just because “happy” and “dog” are common words. SLOR adjusts for this: it subtracts a unigram (word-frequency) baseline from the model’s log-probability for the sentence and divides by sentence length, so the score reflects how plausible the structure is rather than how common the individual words are.

By using SLOR, the researchers could test if the model finds a sentence like “a whopping ninety LMs” acceptable because of its structure, not just because it recognizes the words.
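Here is a minimal sketch of that computation with a Hugging Face causal language model. The checkpoint name and the way the unigram baseline is supplied are illustrative assumptions; in the paper's setting the model would be one trained from scratch on the (ablated) BabyLM corpus, and the unigram probabilities would be estimated from that same corpus.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative checkpoint; substitute a model trained on the (ablated) corpus.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

def slor(sentence, unigram_logprobs):
    """SLOR(s) = (log p_model(s) - log p_unigram(s)) / |s|."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss        # mean cross-entropy per predicted token
    n_pred = ids.size(1) - 1
    lp_model = -loss.item() * n_pred              # total model log-probability
    tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())[1:]   # skip BOS token
    # unigram_logprobs: hypothetical dict of log relative token frequencies
    # estimated from the training corpus; unseen tokens get a small floor value.
    lp_unigram = sum(unigram_logprobs.get(t, math.log(1e-8)) for t in tokens)
    return (lp_model - lp_unigram) / n_pred

# A model "prefers" the AANN order if, e.g.:
# slor("a beautiful five days", unigram_logprobs) > slor("a five beautiful days", unigram_logprobs)
```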
Experiment 1: Learning Without Seeing
The first major finding was startling. The researchers took the model trained on the No AANN corpus—the model that had ostensibly never seen an AANN structure in its “life”—and tested it.
The results showed that LMs can learn the acceptability of AANNs without having seen a single positive instance.
While the model trained on normal data (Unablated) performed best (around 70% accuracy), the model that never saw an AANN still achieved 47% accuracy—far above the chance level of 20%.
Furthermore, the researchers compared this to the models trained on the “corrupted” data (ANAN and NAAN).

The figure above is crucial. It shows that even when the model was explicitly trained on “wrong” orders (like “a five beautiful days”), it didn’t learn those wrong orders as well as it learned the correct AANN structure zero-shot.
This implies the model isn’t a blank slate just memorizing word orders. It has learned a deeper grammar from the rest of the English language that makes “a beautiful five days” plausible, even if it hasn’t seen it before.
Lexical Constraints (The “Stubborn” Adjectives)
The generalization goes even deeper. In English, you can say “a beautiful five days” but you cannot say “a blue five pencils.” Certain adjectives (like colors) are “stubbornly distributive” and don’t work in this construction.
Did the models know this distinction?

As shown in Figure 3, even the model trained with No AANNs (the orange squares) patterned similarly to humans (black triangles). They both disliked the “stubborn” adjectives (right side of the graph) while accepting quantitative and qualitative adjectives. This confirms the model isn’t just guessing; it has inferred semantic constraints about what kind of words fit into this slot, purely from indirect evidence.
Experiment 2: The “Keys” to Generalization
If the model didn’t see the AANN, how did it learn it? The authors hypothesized that the model generalizes from related, more frequent phenomena.
They identified several “neighboring” grammatical structures that might serve as clues:
- “The” ANN: Phrases like “the beautiful five days.” The definite article “the” works with plurals easily. Maybe the model transfers this knowledge to “a”.
- “A few” / “A couple”: Phrases like “a few days” or “a couple bottles.” These are very common. They teach the model that [Article + Quantifier + Plural Noun] is valid.
- Singular Measure Phrases: Sentences like “Five miles is a long way.” Here, a plural phrase (“five miles”) takes a singular verb (“is”), treating the plural phrase as a single semantic unit.
To test this, the researchers performed “double ablations.” They removed the AANNs and one of these hypothesized “keys” to see if the model’s ability to guess the AANN would collapse.
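A double ablation can be pictured as stacking corpus filters: one for AANNs and one for the hypothesized "key" construction. The patterns below are simplified stand-ins for the paper's actual extraction, sketched only to show the shape of the manipulation.

```python
import re

# Simplified stand-in patterns for the related constructions; the paper's
# identification of these "keys" is more careful than plain regexes.
AANN_RE    = re.compile(r"\ban?\s+\w+\s+(?:two|three|four|five|\d+)\s+\w+s\b", re.I)
A_FEW_RE   = re.compile(r"\ba\s+(?:few|couple)\b", re.I)                 # "a few days", "a couple bottles"
MEASURE_RE = re.compile(r"\b(?:\d+|two|three|four|five)\s+\w+s\s+(?:is|was|takes)\b", re.I)  # "five miles is ..."

def ablate(sentences, patterns):
    """Drop every sentence matching any of the given patterns."""
    return [s for s in sentences if not any(p.search(s) for p in patterns)]

# Single ablation: the "No AANN" world.
# no_aann = ablate(corpus, [AANN_RE])

# Double ablations: remove AANNs *and* one hypothesized "key".
# no_aann_no_a_few   = ablate(corpus, [AANN_RE, A_FEW_RE])
# no_aann_no_measure = ablate(corpus, [AANN_RE, MEASURE_RE])
```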

The results (Figure 4, left side) confirmed the hypothesis.
- “Unablated” (top) has the highest score.
- “No AANNs” drops, but stays robust.
- However, when they additionally removed the singular measure phrases (“No Measure”) or the “a few”/“a couple” phrases (“No A few/couple”), performance dropped significantly further.
This provides strong evidence that these related constructions really are the “keys.” The model learns the rare “a beautiful five days” by triangulating between “a few days” and “five days is enough.” It builds a bridge from common structures to rare ones.
Experiment 3: The Role of Variability
Finally, the researchers asked: Does the diversity of the training data matter?
If a model sees “a beautiful five days” ten times, does it learn as much as if it sees ten different examples, like “a lovely three weeks,” “a strange two years,” etc.?
Theory suggests that high variability in the “slots” of a construction helps a learner understand that the construction is productive—meaning it’s a general rule, not a fixed phrase.
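One way to picture the manipulation is as generating the same number of AANN training examples from either a single adjective/noun pairing or from many different ones. The word lists, counts, and function below are illustrative, not the paper's actual materials.

```python
import random

# Illustrative vocabulary; the paper's actual AANN instances come from the corpus.
ADJECTIVES = ["beautiful", "lovely", "strange", "gruelling", "pleasant", "miserable"]
NUMERALS   = ["two", "three", "five", "ten"]
NOUNS      = ["days", "weeks", "years", "months", "miles", "chapters"]

def make_aanns(n, adj_types, noun_types, seed=0):
    """Generate n AANN strings drawing on few adjective/noun types (low
    variability) or many (high variability)."""
    rng = random.Random(seed)
    adjs, nouns = ADJECTIVES[:adj_types], NOUNS[:noun_types]
    return [f"a {rng.choice(adjs)} {rng.choice(NUMERALS)} {rng.choice(nouns)}"
            for _ in range(n)]

low_variability  = make_aanns(50, adj_types=1, noun_types=1)  # 50 near-identical examples
high_variability = make_aanns(50, adj_types=6, noun_types=6)  # 50 varied examples

# Both sets contain the same number of AANN tokens, but only the
# high-variability set signals that the construction is an open, productive slot.
```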

The results in Figure 5 support this. Models trained on a diet of High Variability AANNs (purple triangles) consistently outperformed those trained on Low Variability examples. When the model sees many different adjectives and nouns fitting into that sentence structure, it becomes more confident in generalizing that structure to new words.
Conclusion
This research offers a profound insight into the “black box” of Large Language Models. It suggests that these models are not merely “stochastic parrots” reciting memorized lines. Instead, they possess a sophisticated statistical learning mechanism capable of syntactic generalization.
The key takeaways are:
- Generalization is Real: Models can accept valid grammatical structures they have never seen.
- Indirect Learning: This learning relies on “bridges”—more common, related linguistic structures (like “a few days”).
- Data Quality Matters: Variability in the input data acts as a signal to the model that a grammatical pattern is flexible and productive.
By using “controlled rearing”—carefully manipulating the training data—the researchers provided an existence proof that LMs can navigate the “long tail” of language. They learn the rare by mastering the common. This brings us one step closer to understanding not just how machines learn, but perhaps also something about the statistical learning powers of the human mind.