Do LLMs Follow Rules or Just Statistics? Investigating Binomial Ordering
Have you ever stopped to wonder why you say “bread and butter” rather than “butter and bread”? Or why “ladies and gentlemen” sounds natural, while “gentlemen and ladies” feels slightly jarring?
In linguistics, these pairings are called binomials. They consist of two nouns joined by a conjunction, usually “and.” While the meaning of “salt and pepper” is identical to “pepper and salt,” native English speakers have strong, often rigid preferences for the ordering of these words.
For decades, linguists have debated why we prefer one order over the other. The consensus is that human language processing relies on two distinct mechanisms: observed preferences (we say it that way because we’ve heard it that way a million times) and abstract representations (we follow invisible rules, such as placing shorter words before longer words, or animate objects before inanimate ones).
With the meteoric rise of Large Language Models (LLMs) like GPT-4 and Llama, a new question arises: Do these models learn language the way we do? Do they actually learn the abstract rules of English, or are they simply parroting the frequency statistics of their massive training datasets?
A recent paper by researchers from the University of California, Davis—Zachary Houghton, Kenji Sagae, and Emily Morgan—dives deep into this question. By analyzing how LLMs handle binomials, they provide fascinating insights into the “black box” of neural language processing. Their findings suggest a fundamental divergence between human cognition and artificial intelligence.
The Human Context: Rules vs. Experience
To understand the significance of this study, we first need to understand how humans process these word pairs.
When we encounter a common phrase like “fish and chips,” our brains retrieve it largely based on observed preference. We have heard this specific sequence so frequently that it is stored as a chunk.
However, humans are also incredible generalization machines. If you were presented with two words you had never heard combined before—let’s say “alibis” and “excuses”—you would likely still have a preference for the order “alibis and excuses.” Why? Because humans utilize abstract ordering preferences. We implicitly know a set of phonological and semantic constraints:
- Length: We prefer short words before long words.
- Rhythm: We prefer specific stress patterns.
- Semantics: We often put the more “powerful” or “animate” word first.
Psycholinguistic research shows that for low-frequency items (phrases we rarely hear), humans rely heavily on these abstract rules. We don’t need to have memorized the statistics of a phrase to know how it should sound.
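To make one of these constraints concrete, here is a toy Python sketch (purely illustrative, not the paper’s scoring procedure) that checks whether an ordering places the shorter word first:

```python
# Toy illustration of one abstract constraint: shorter word first.
# This is NOT the scoring method used in the paper; it only shows what a
# rule-based (rather than frequency-based) preference looks like.

def short_before_long(word_a: str, word_b: str) -> bool:
    """True if the ordering 'word_a and word_b' satisfies the shorter-first rule."""
    return len(word_a) <= len(word_b)

print(short_before_long("bread", "butter"))    # True  -> "bread and butter"
print(short_before_long("alibis", "excuses"))  # True  -> "alibis and excuses"
```

A full abstract-preference score would combine several such constraints (length, stress pattern, animacy, and so on) into a single number between 0 and 1.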
The core question of this research paper is whether LLMs have developed this same capability. Have models like Llama-3 or GPT-2 learned these abstract constraints? Or is their ability to order words driven entirely by how often they saw those specific words in their training data?
The Methodology
To test this, the researchers set up a rigorous experiment involving eight different Large Language Models ranging significantly in size, from the small GPT-2 (124 million parameters) to the massive Llama-3 (70 billion parameters).
The Dataset
They utilized a specialized corpus of 594 binomial expressions. For each expression, they had three crucial pieces of data:
- Observed Preference: How often the alphabetical ordering appears relative to the reverse ordering in the massive Google N-grams corpus (representing the “memorization” factor).
- Abstract Ordering Preference: A calculated score (from 0 to 1) based on phonological and semantic rules, predicting how a human should order the words if they were following linguistic rules, independent of frequency.
- Overall Frequency: How common the phrase is in general English usage.
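Schematically, one entry in such a corpus might look like the following (the field names and values here are invented for illustration; they are not taken from the paper’s dataset):

```python
# Hypothetical structure for a single binomial item; all values are invented.
binomial_item = {
    "word_a": "bread",          # alphabetically first word
    "word_b": "butter",
    "observed_pref": 0.92,      # share of alphabetical orderings in Google N-grams
    "abstract_pref": 0.85,      # rule-based score between 0 and 1
    "frequency": 48_000,        # overall corpus frequency of the phrase
}
```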
Calculating Model Preferences
The researchers didn’t simply ask the models “Which do you prefer?” Instead, they looked at the raw probabilities the models assigned to the words.
They calculated the probability of the model generating the binomial in alphabetical order (Word A and Word B). To ensure the context was neutral, they used the prefix “Next item: “.
$$P_{alphabetical} = P(\text{Word A} \mid \text{prefix}) \times P(\text{and} \mid \text{prefix, Word A}) \times P(\text{Word B} \mid \text{prefix, Word A, and})$$
As shown in the equation above, the probability of the alphabetical form (\(P_{alphabetical}\)) is the product of three probabilities:
- The probability of Word A appearing after the prefix.
- The probability of “and” appearing after Word A.
- The probability of Word B appearing after “and”.
They then did the same thing for the non-alphabetical order (Word B and Word A):
$$P_{nonalphabetical} = P(\text{Word B} \mid \text{prefix}) \times P(\text{and} \mid \text{prefix, Word B}) \times P(\text{Word A} \mid \text{prefix, Word B, and})$$
With these two probabilities in hand, they could determine the model’s preference. They calculated the Log Odds Ratio, which serves as a single number representing how strongly the model prefers the alphabetical order over the reverse.
$$\text{Log Odds} = \log\left(\frac{P_{alphabetical}}{P_{nonalphabetical}}\right)$$
If the result is positive, the model prefers the alphabetical order. If negative, it prefers the non-alphabetical order. The magnitude of the number indicates the strength of the preference.
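To make this scoring procedure concrete, here is a minimal sketch of how such probabilities can be read off a model with the Hugging Face transformers library, using GPT-2 as a stand-in (the paper’s exact extraction code is not shown here and may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of log P(token | preceding tokens) over the whole string."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 2..n
    token_logprobs = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logprobs.sum().item()

prefix = "Next item: "
word_a, word_b = "bread", "butter"   # alphabetical order puts word_a first

logp_alpha = sequence_logprob(f"{prefix}{word_a} and {word_b}")
logp_rev   = sequence_logprob(f"{prefix}{word_b} and {word_a}")

# The prefix probabilities are identical in both strings, so they cancel:
# this difference equals the log odds of the alphabetical ordering given the prefix.
log_odds = logp_alpha - logp_rev
print(log_odds)   # positive -> the model prefers "bread and butter"
```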
The Statistical Model
This is where the analysis gets clever. The researchers didn’t just look at accuracy; they wanted to disentangle why the model made its choice. They used a Bayesian linear regression model to predict the Log Odds (the model’s choice).
$$\text{LogOdds(AandB)} \sim \text{AbsPref} + \text{ObservedPref} + \text{Freq} + \text{Freq:AbsPref} + \text{Freq:ObservedPref}$$
Let’s break down this formula, as it is the heart of the experiment:
- LogOdds(AandB): This is what we are trying to predict (the LLM’s behavior).
- AbsPref: This variable represents the abstract rules (human-like generalization).
- ObservedPref: This variable represents the statistics from the training data (memorization).
- Freq: The total frequency of the phrase.
- Freq:AbsPref and Freq:ObservedPref: These are interaction terms. They allow the researchers to ask: “Does the model rely on rules more when the phrase is rare?” (which is what humans do).
If LLMs are like humans, we would expect to see a significant weight on AbsPref, especially for low-frequency items. If LLMs are just statistical parrots, we would expect ObservedPref to dominate the equation.
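For a concrete sense of what fitting such a model looks like, here is a minimal sketch using the Python library bambi on synthetic data (the column names and data below are placeholders, not the authors’ actual analysis code, which may differ):

```python
import arviz as az
import bambi as bmb
import numpy as np
import pandas as pd

# Synthetic stand-in data so the sketch runs end to end; a real analysis would
# load the 594 binomials with their measured predictors and the LLM's log odds.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "abs_pref": rng.uniform(0, 1, n),        # abstract, rule-based preference
    "observed_pref": rng.uniform(0, 1, n),   # ordering preference from corpus counts
    "freq": rng.normal(0, 1, n),             # standardized (log) phrase frequency
})
# Fake outcome dominated by observed preference, mimicking the paper's finding.
df["log_odds"] = 4.0 * df["observed_pref"] + rng.normal(0, 1, n)

model = bmb.Model(
    "log_odds ~ abs_pref + observed_pref + freq + freq:abs_pref + freq:observed_pref",
    data=df,
)
idata = model.fit(draws=1000)    # MCMC sampling of the posterior
print(az.summary(idata))         # posterior means and credible intervals
```

The coefficients reported by az.summary play the role of the beta coefficients discussed in the results below.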
Experiments & Results
The results were strikingly consistent across every model tested, regardless of size or architecture.
The researchers analyzed the beta coefficients for each variable. A beta coefficient far from zero means that the variable was a strong predictor of the model’s behavior; a coefficient near zero means it had almost no influence on the model’s decision.
The Domination of Observed Preferences
The visual summary of the results is stark. In the figure below, look at the position of the dots for each model. The pink dot represents ObservedPref (Observed Preference), and the yellow dot represents AbsPref (Abstract Preference).
Figure: Beta coefficient estimates for each predictor, shown for each of the eight models, with ObservedPref in pink and AbsPref in yellow.
In every single subplot—from the tiny GPT-2 on the left to the massive Llama-3 70B on the right—the pink dot (ObservedPref) is high up on the Y-axis (between 3.0 and 6.0). This indicates a massive reliance on the specific ordering statistics found in the training data.
Conversely, look at the yellow dot (AbsPref). It hovers consistently around the zero line. This suggests that the abstract linguistic rules that guide human speech have almost zero explanatory power for how these models order words.
The Frequency Interaction
Another critical finding lies in the interaction terms. The researchers found a positive interaction between frequency and observed preference (Freq:ObservedPref).
This means that the models rely on observed statistics even more strongly when the item is high-frequency. This makes sense: if a model has seen “bread and butter” billions of times, the statistical signal is incredibly strong.
However, critically, they found no interaction effect between frequency and abstract preference (Freq:AbsPref). Recall that humans rely more on abstract rules when a phrase is rare. The models did not show this behavior. Even when a phrase was low-frequency (meaning the model hadn’t memorized it as strongly), the model still did not revert to using abstract rules. It just had a weaker preference overall.
The Data in Detail
The table below provides the specific numerical values for these findings.
Table: Posterior estimates (“Est.”) and credible intervals for each predictor, listed per model.
If you look at the “Est.” (Estimate) column for AbsPref across the models, you see values like -0.52, 0.69, or 0.23, often with credible intervals that cross zero. This indicates that the effect is statistically negligible.
Compare this to ObservedPref (the observed preference), where estimates range from 3.07 to 5.64, with tight credible intervals far from zero. The statistical evidence is overwhelming: LLMs are driven by what they have seen, not by internalized rules of “what sounds good.”
Discussion: What This Means for AI
This study provides a sobering look at the nature of “intelligence” in Large Language Models.
The Human-AI Divergence
The most significant takeaway is the qualitative difference between human and machine language processing. Humans are efficient learners. We don’t need to hear every possible combination of words to know how to order them; we learn the underlying pattern (the abstract representation) and apply it to new situations.
LLMs, on the other hand, appear to be brute-force learners. They achieve fluent, human-like output not by learning the “rules of the game,” but by memorizing the game’s history. When they produce “bread and butter,” they aren’t applying a phonological rule about short words coming first; they are simply completing a statistical pattern based on the trillions of tokens they have processed.
The “Alibis and Excuses” Problem
The authors highlight an interesting example: “alibis and excuses.” This is a low-frequency binomial. A typical college-aged human might have heard this phrase only once or twice in their life, yet they would likely agree on this ordering due to abstract preferences.
The study shows that LLMs rely exclusively on observed preferences even for these rare items. If the specific n-gram count in the training data doesn’t favor one order, the LLM is essentially guessing, whereas a human would use intuition (abstract rules).
Scale Didn’t Fix It
Perhaps the most surprising aspect of the results is that scale didn’t matter. One might hypothesize that abstract reasoning is an emergent property—that once a model gets big enough (like Llama-3 70B), it “groks” the rules.
The data refutes this. The Llama-3 70B model showed the same reliance on observed preference and the same ignorance of abstract preference as the tiny GPT-2. This suggests that simply adding more parameters and more data does not necessarily lead to the emergence of human-like abstract linguistic representations, at least not in the domain of binomial ordering.
Conclusion
The paper “The Role of Abstract Representations and Observed Preferences in the Ordering of Binomials in Large Language Models” offers a crucial piece of the puzzle in understanding how LLMs work.
While LLMs can generate text that is indistinguishable from human writing, the process by which they arrive at that text is fundamentally different. Humans trade off between memory and rules. LLMs appear to rely almost exclusively on memory (observed statistics).
This distinction is vital for researchers and developers. It highlights a limitation in current architectures: the inability to generalize to novel inputs using abstract rules in the same way humans do. While LLMs are impressive statistical mimics, they have not yet mastered the abstract “instincts” that govern human language.
As we move toward even larger models, studies like this remind us to look under the hood. We must ask not just if the model got the right answer, but how it got there. In the case of “bread and butter,” the AI gets it right—but for entirely the wrong reasons.