The debate over Artificial Intelligence and language is often framed as a battle between “nature” and “nurture.” On one side, you have the nativist view, championed historically by linguists like Noam Chomsky. This view argues that human beings are born with an innate “Universal Grammar”—a set of hard-wired constraints that allow children to learn complex languages from relatively little data. On the other side, you have the empiricist view, currently dominating the field of Deep Learning. This view posits that general-purpose learning algorithms (like Transformers), given enough data, can learn anything, including the complex rules of syntax, without any pre-wired grammatical knowledge.

If Large Language Models (LLMs) like Gemini or GPT-4 can master human language purely through statistical pattern matching, it deals a heavy blow to the argument for Universal Grammar. It suggests that syntax isn’t a biological hardware feature, but a statistical pattern that can be learned from scratch.

But have LLMs actually mastered the rules? Or have they just memorized the most frequent patterns of the most popular languages?

A recent paper by researchers at Google DeepMind and Johns Hopkins University, “Do LLMs learn a true syntactic universal?”, puts this question to the test. They investigate whether modern LLMs respect a specific, abstract linguistic law known as the Final-over-Final Condition (FOFC).

The results are fascinating and reveal a significant crack in the “AI learns everything” narrative. While models like Gemini Pro perform beautifully on high-resource languages like German and Russian, they fail spectacularly when tested on Basque. This failure suggests that while AI can mimic grammar given billions of examples, it hasn’t actually learned the universal principles that govern human language.

Background: The Architecture of Sentences

To understand the experiment, we first need to understand the linguistic rule being tested. Language isn’t just a string of words; it’s a hierarchical structure of nested phrases.

In linguistics, phrases have a “head”—the main word that determines the phrase’s nature (like the verb in a Verb Phrase). Languages generally fall into two categories regarding where they put this head:

  1. Head-Initial: The head comes before its complements.
  • Example (English): “eat an apple.” The verb (head) comes before the object.
  2. Head-Final: The head comes after its complements.
  • Example (Japanese): “ringo-o taberu” (apple eat). The verb comes last.

Many languages are consistent. English is consistently head-initial; Japanese is consistently head-final. However, many languages are “mixed.” They might use one order for Verb Phrases (VP) and a different order for Tense Phrases (TP) or other structures.
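To make the head-direction distinction concrete, here is a toy sketch (mine, not the paper's) of a phrase as a head plus its complements, where a single flag decides whether the head is linearized first or last:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Phrase:
    """A toy phrase: a head word, its complements, and a head-direction flag."""
    head: str
    complements: List[str]
    head_initial: bool

    def linearize(self) -> List[str]:
        # Head-initial: head precedes its complements; head-final: head follows them.
        if self.head_initial:
            return [self.head] + self.complements
        return self.complements + [self.head]

# English VP is head-initial: "eat an apple"
english_vp = Phrase(head="eat", complements=["an apple"], head_initial=True)

# Japanese VP is head-final: "ringo-o taberu" (apple-ACC eat)
japanese_vp = Phrase(head="taberu", complements=["ringo-o"], head_initial=False)

print(" ".join(english_vp.linearize()))   # eat an apple
print(" ".join(japanese_vp.linearize()))  # ringo-o taberu
```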

The Final-over-Final Condition (FOFC)

The Final-over-Final Condition is a proposed language universal. It doesn’t say a language must be purely head-initial or head-final. It allows for mixing, but it places a specific constraint on how you can mix them.

The rule essentially says: You cannot have a Head-Final phrase sticking out of the top of a Head-Initial phrase.

If we visualize sentences as trees, we can look at the “spine” of the tree.

  • It is okay to have a Head-Initial structure on top of a Head-Final one.
  • It is okay to be consistent (Initial over Initial, or Final over Final).
  • It is NOT okay to have a Head-Final structure sitting on top of a Head-Initial one.

This is an abstract constraint. It requires the speaker (or the model) to understand the hierarchical relationship between a “superphrase” (the parent) and a “subphrase” (the child).

Figure 1: The Final-Over-Final Condition bans head-final superphrases from having head-initial subphrases.

Look at Figure 1 above. This grid shows the four possible combinations of stacking two phrases (labeled \(\alpha\) and \(\beta\)).

  • 1a and 1b are “harmonic” (consistent ordering). These are safe.
  • 1c (Initial-over-Final) is “disharmonic” but allowed. This happens in languages like Finnish.
  • 1d (Final-over-Initial) is the grayed-out box. This is the forbidden structure. According to the FOFC hypothesis, human languages simply do not generate this structure.
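Stated over head-direction flags, the constraint reduces to a one-line check. The sketch below (mine, not the authors') walks through the four configurations from Figure 1:

```python
def violates_fofc(parent_head_final: bool, child_head_final: bool) -> bool:
    """True only for the banned 1d case: a head-final superphrase (parent)
    directly dominating a head-initial subphrase (child)."""
    return parent_head_final and not child_head_final

# The four stackings from Figure 1, as (parent_head_final, child_head_final):
configs = {
    "harmonic, both head-initial": (False, False),
    "harmonic, both head-final":   (True,  True),
    "1c initial-over-final":       (False, True),
    "1d final-over-initial":       (True,  False),
}
for name, (parent_final, child_final) in configs.items():
    status = "forbidden" if violates_fofc(parent_final, child_final) else "allowed"
    print(f"{name}: {status}")
```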

The researchers set out to answer two questions:

  1. Is FOFC actually true in human languages? (A Corpus Study)
  2. Do LLMs respect this rule even in languages for which they have not seen billions of examples? (An LLM Evaluation)

Part 1: Proving the Rule (The Corpus Study)

Before testing the AI, the authors had to verify that the FOFC is indeed a universal tendency in natural language data. If human corpora are full of FOFC violations, it’s not a valid test for the AI.

The researchers analyzed massive amounts of text from the C4 (Colossal Clean Crawled Corpus) dataset. They focused on “mixed head direction” languages where violations were theoretically possible: Hungarian, Basque, Russian, Serbian, and German.

They used dependency parsing (converting sentences into structural trees) to look for the specific forbidden configuration: a head-final Auxiliary phrase dominating a head-initial Verb Phrase.
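As an illustration of the kind of query involved, here is a rough sketch using a Universal Dependencies parser (stanza, on German); the authors' actual extraction pipeline and label inventory may well differ:

```python
import stanza

# Load a German pipeline with a dependency parser (download once with stanza.download("de")).
nlp = stanza.Pipeline("de")

def has_fofc_violation(sent) -> bool:
    """Approximate check for the banned order: a verb with its object to the
    right (head-initial VP) and its auxiliary after the whole verb-object span
    (head-final auxiliary phrase), i.e. V ... O ... Aux."""
    for verb in sent.words:
        if verb.upos != "VERB":
            continue
        deps = [w for w in sent.words if w.head == verb.id]
        objects = [w for w in deps if w.deprel == "obj"]
        auxiliaries = [w for w in deps if w.deprel.startswith("aux")]
        if any(verb.id < o.id < a.id for o in objects for a in auxiliaries):
            return True
    return False

doc = nlp("Ich glaube, dass er den Apfel gegessen hat.")
print([has_fofc_violation(s) for s in doc.sentences])  # harmonic O-V-Aux order, so we expect no hits
```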

The results were overwhelming.

Table 1: Corpus study results showing significant adherence to FOFC.

As shown in Table 1, the Chi-squared (\(\chi^2\)) values are massive. This statistical test measures how far the observed counts deviate from what we would expect if the order of the auxiliary phrase and the order of the verb phrase were statistically independent. The extreme values indicate that the scarcity of FOFC violations is no coincidence: there is strong pressure in these languages to avoid the forbidden structure.
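For readers who want to see what such a test looks like in practice, here is a sketch of a chi-squared test on a 2×2 contingency table; the counts are made up for illustration, and the paper's exact statistical setup may differ:

```python
from scipy.stats import chi2_contingency

# Made-up clause counts (not the paper's numbers).
# Rows: order of the auxiliary phrase; columns: order inside the VP.
#            O < V (head-final VP)   V < O (head-initial VP)
counts = [
    [700_000,                80_000],   # Aux < VP (head-initial AuxP)
    [500_000,                 1_500],   # VP < Aux (head-final AuxP); right cell is the banned 1d config
]

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi-squared = {chi2:,.0f}, p = {p:.3g}")
print("expected counts under independence:")
print(expected.round())
```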

Let’s look closer at specific languages.

The Hungarian Evidence

Hungarian is a complex language that allows for significant word order flexibility, making it a prime candidate for testing.

Table 2: Hungarian two-phrase configurations.

In Table 2, we see the counts for different structures. The column V < O represents head-initial verb phrases, while O < V represents head-final. The rows represent the order of the Auxiliary and the VP.

  • The Forbidden Cell (Top Right): a Head-Initial VP (V < O) inside a Head-Final Aux phrase (VP < Aux). There are only 320 attestations of this structure out of millions of sentences.

When native-speaking experts analyzed those 320 “violations,” they found that nearly all of them were parser errors (mistakes made by the software analyzing the sentences) rather than genuine grammatical structures. The FOFC holds.

The Basque Evidence

Basque is a language isolate—it is not related to any other known living language. It is also the critical test case for this paper because it has significantly less training data available on the internet compared to English or Russian.

Table 3: Basque two-phrase configurations.

Table 3 shows the Basque data. The forbidden configuration (Top Right) has only 1,632 instances, compared to nearly 7.1 million instances of the harmonic structure. Here too, expert review confirmed that these “violations” were mostly errors in sentence segmentation or tagging.

The corpus study confirms that humans, regardless of whether they speak Slavic, Uralic, or isolate languages, implicitly obey the Final-over-Final Condition.

Part 2: Testing the Machines

Now that the rule is established, the researchers turned to the Large Language Models: Gemini Pro and PaLM.

The methodology here is clever. You can’t simply ask an LLM, “Is this sentence grammatical?” because models often hallucinate or struggle with linguistic jargon. Instead, the researchers used a Targeted Syntactic Evaluation using “minimal pairs.”

Creating Synthetic Violations

To test if the model “feels” the violation, the researchers took real, grammatical sentences from the dataset and applied a tree transformation script. This script mechanically rotates the branches of the sentence tree to force it into the forbidden “Final-over-Initial” (1d) configuration.

Figure 2 and 3: Effect of FOFC on German acceptability and the transformation process.

Figure 3 (bottom) visualizes this transformation. They take a valid sentence structure (like 1a or 1c) and twist the dependency tree to create the forbidden 1d structure.
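A minimal sketch of that idea, using flat role labels instead of a full dependency tree (so it is only a caricature of the authors' transformation script):

```python
def to_forbidden_1d(tokens):
    """Re-linearize a clause into the banned Final-over-Initial (1d) order:
    everything else, then the verb, then its object, then the auxiliary last."""
    aux   = [w for w, role in tokens if role == "aux"]
    verb  = [w for w, role in tokens if role == "verb"]
    obj   = [w for w, role in tokens if role == "obj"]
    other = [w for w, role in tokens if role == "other"]
    return other + verb + obj + aux  # ... V O Aux

# A harmonic German subordinate clause: "dass er den Apfel gegessen hat" (O-V-Aux).
valid = [("dass", "other"), ("er", "other"), ("den Apfel", "obj"),
         ("gegessen", "verb"), ("hat", "aux")]

print(" ".join(to_forbidden_1d(valid)))
# -> "dass er gegessen den Apfel hat": a head-initial VP under a final Aux, the forbidden structure
```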

They then feed both the original (grammatical) sentence and the twisted (ungrammatical) sentence to the LLM. They measure the log-probability assigned to each sentence.

  • If the LLM has learned the universal, it should assign a higher probability (lower “perplexity”) to the valid sentence.
  • It should assign a lower probability (higher penalty) to the forbidden 1d structure.

The metric used is the Penalty:

\[ \text{Penalty} = \log P(\text{Valid Sentence}) - \log P(\text{Forbidden Sentence}) \]

If the Penalty is greater than 0, the model correctly prefers the grammatical structure. If it is near 0 or negative, the model has failed to learn the constraint.
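In code, the evaluation loop is straightforward. The sketch below scores a minimal pair with an open causal language model via Hugging Face transformers; GPT-2 is only a stand-in here, since the paper scores sentences with Gemini Pro and PaLM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token,
    # so multiply by the number of predicted tokens to recover the total.
    return -out.loss.item() * (ids.shape[1] - 1)

def fofc_penalty(valid: str, forbidden: str) -> float:
    """log P(valid) - log P(forbidden): positive means the model prefers the grammatical sentence."""
    return sentence_logprob(valid) - sentence_logprob(forbidden)

pair = ("dass er den Apfel gegessen hat",   # grammatical, harmonic
        "dass er gegessen den Apfel hat")   # synthetic FOFC violation
print(fofc_penalty(*pair))
```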

Results: The Basque Gap

The results reveal a stark divide between high-resource languages and lower-resource languages.

The researchers plotted the distribution of penalties for thousands of sentence pairs. In the graphs below, a “pass” is a bell curve shifted to the right of the yellow dashed line (zero).

Figure 4: Results for Gemini Pro and PaLM 8B.

Look closely at Figure 4.

  • German, Hungarian, Russian, Serbian: For both Gemini Pro and PaLM 8B, the red mean line is solidly to the right. The distributions are clearly positive. The models “know” that the FOFC-violating sentences are wrong. They have learned the syntax.
  • Basque: Look at the top-left graph for Gemini Pro. The distribution straddles the yellow line, with a slightly negative mean (-2.3); the model is essentially indifferent, and it often prefers the ungrammatical, forbidden structure. PaLM 8B (bottom-left) performs slightly better but still has the majority of its distribution in the negative or near-zero region.

This is the smoking gun. The models have not learned the FOFC as a universal rule. If they had, they would apply it to Basque just as they apply it to German. Instead, they have learned the specific statistical patterns of German, Russian, and Hungarian because they have seen billions of examples of those languages.

Basque, having less data, did not provide enough statistical signal for the model to derive the rule from scratch.

Why Does Basque Fail?

The authors explore two main hypotheses for why the models failed on Basque: Model Size and Data Size.

Is the Model Too Small?

One theory in Deep Learning is “scaling laws”—the idea that emergent abilities (like logic or complex syntax) only appear when a model gets big enough.

The researchers tested this by running the experiment on PaLM models of increasing size, from 8 billion parameters up to 540 billion.

Figure 5: The results with different sizes of PaLM model.

As Figure 5 shows, scaling up the model does help. The 540B parameter model (right column) pushes the Basque distribution slightly more to the right compared to the 8B model. However, it doesn’t solve the problem. Even the massive 540B model struggles with Basque compared to how easily the smaller 8B model handles German. Brute force computing power isn’t a replacement for understanding the rule.

Is It the Training Data?

This seems to be the deciding factor. The authors listed the training data sizes for the languages in question.

Table 6: Size of PaLM training data.

Table 6 puts the numbers in perspective:

  • German: ~26 billion tokens.
  • Russian: ~4 billion tokens.
  • Basque: 153 million tokens.

There appears to be a threshold of data required for a neural network to “induce” a complex syntactic rule like FOFC from raw text. German is well above that threshold. Basque, with 153 million tokens, is below it.

Interestingly, Serbian has roughly 373 million tokens—not that much more than Basque—yet the models learned FOFC in Serbian quite well. Why? The authors suggest this is due to transfer learning. Serbian is very similar to Croatian and Bosnian. When combined, the South Slavic languages provide a much larger pool of syntactically similar data (over 1 billion tokens).

Basque, being a language isolate, has no neighbors to help it. The model is on its own, and 153 million tokens isn’t enough for the “nurture-only” approach of current LLMs to derive the Final-over-Final Condition.

Conclusion: The Case for Inductive Bias

This research provides a nuanced answer to the question “Do LLMs learn language universals?”

The answer is: No, they learn data distributions.

If a language universal (like FOFC) is abundantly present in the training data (as in German), the LLM will simulate it perfectly. But if the data is scarce (as in Basque), the LLM fails to generalize the rule, even if it is a rule that applies to all human languages.

This finding is critical because human children do not need 26 billion words to learn Basque syntax. A child learns their native language with a “budget” of roughly 100 million words over several years. The fact that Gemini Pro fails on a dataset (153M tokens) that is roughly the size of human developmental experience suggests that the “blank slate” architecture of Transformers is missing something.

The authors conclude that for AI to truly achieve human-like language competence—especially in low-resource languages—we can’t just rely on feeding them more text. We may need to re-evaluate the nativist argument: perhaps our models, like human children, need inductive biases. They might need architectural constraints that predispose them to learn tree structures and hierarchical rules, rather than just predicting the next word based on flat statistics.

Until then, LLMs remain impressive statistical mimics, but they are not yet universal grammarians.