If you have ever tried to learn a second language (L2) as an adult, you know the struggle. You may know the vocabulary, yet still find yourself instinctively arranging words according to the grammar rules of your native language (L1). This phenomenon is known as L1 transfer. For example, a native Spanish speaker might say “the car red” instead of “the red car” because adjectives typically follow nouns in Spanish.
In the world of Natural Language Processing (NLP), researchers are increasingly asking: Can we simulate this cognitive process in machines? Can we build “L2 Language Models” that mimic how non-native speakers process English?
A recent paper, Modeling Nonnative Sentence Processing with L2 Language Models, tackles this exact question. The researchers investigated whether a Generative Pre-trained Transformer (GPT-2) trained sequentially on two languages would exhibit the same processing “quirks” as human second-language learners. Their findings shed light not just on how AI learns, but potentially on how we do, too.
The Cognitive Gap in AI
Most Large Language Models (LLMs) today are trained on massive, multilingual datasets simultaneously. While they are impressive polyglots, they don’t necessarily learn like humans. Humans learn sequentially: we master our L1 first, then layer an L2 on top of that foundation.
The authors of this paper wanted to bridge the gap between computational linguistics and psycholinguistics. They aimed to test two main hypotheses:
- Grammar Transfer: Does the model’s “native” language affect how well it learns English grammar? (e.g., does a Spanish-pretrained model learn English better than a Japanese-pretrained one?)
- Processing Similarity: Does the model’s processing difficulty (measured by “surprisal”) match the reading times of humans with the same native language background?
The Method: Freezing the “Brain”
To simulate a human learner who already has a grammatical system in place, the researchers utilized a technique called TILT (Test of Inductive Bias via Language Transfer).
The setup is a fascinating approximation of the human brain’s plasticity—or lack thereof—during adulthood. Here is how they constructed their “L2 Language Models” (L2LMs):
- L1 Training (Childhood): First, they trained a GPT-2 model from scratch on one of six specific First Languages (L1s): Arabic, Chinese, English, Japanese, Portuguese, and Spanish.
- The “Freeze” (Adulthood): Once the model learned its L1, the researchers froze the internal Transformer layers (the decoder blocks). These layers contain the abstract grammatical and structural rules of the language.
- L2 Training (Second Language Acquisition): They then continued training the model on English (L2). However, because the middle layers were frozen, the model could only update its embeddings (vocabulary) and its output layer.
This setup forces the model to process English words using the “grammatical wiring” of its original language.

As shown in Figure 1, the “Decoder Layers” remain frozen during the L2 stage. This simulates an L2 learner who has acquired new vocabulary (embeddings) but is still relying on their native language’s structural logic to stick sentences together.
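To make this concrete, here is a minimal sketch of the freezing step using the Hugging Face transformers API. The checkpoint path is hypothetical, and details such as how the final layer norm is handled are assumptions rather than the paper’s published configuration.

```python
# A rough sketch of the "freeze" stage, assuming a GPT-2 checkpoint already
# pretrained on the L1. The path is hypothetical; the treatment of the final
# layer norm is an assumption, not necessarily the paper's exact setup.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("path/to/l1-pretrained-gpt2")  # hypothetical path

# Freeze every Transformer (decoder) block: the "L1 grammar" stays fixed.
for block in model.transformer.h:
    for param in block.parameters():
        param.requires_grad = False

# Keep token and position embeddings trainable so the model can pick up L2 vocabulary.
for emb in (model.transformer.wte, model.transformer.wpe):
    for param in emb.parameters():
        param.requires_grad = True

# GPT-2 ties the output head (lm_head) to the token-embedding matrix, so updating
# the embeddings also updates the "output layer" during L2 training on English.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters during the L2 stage: {trainable:,}")
```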
The Experiment: Humans vs. Machines
To benchmark these models, the researchers needed human data. They used the CELER corpus, which contains eye-tracking data from 365 participants reading English sentences. Crucially, these participants came from the same six linguistic backgrounds as the models (Arabic, Chinese, English, Japanese, Portuguese, Spanish).
The comparison relied on a concept called Surprisal Theory. In psycholinguistics, “surprisal” measures how unexpected a word is given the context. The theory posits that the more surprised a brain is by a word, the longer it takes to process it (resulting in longer gaze duration).
The model calculates surprisal (\(S(w_i)\)) as the negative log probability of the current word (\(w_i\)) given the previous context (\(w_{<i}\)):

\[S(w_i) = -\log P(w_i \mid w_{<i})\]

If the L2LM is a good model of human processing, its surprisal values should correlate with human reading times. If the model is “surprised” by a word, the human should be too.
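As a concrete illustration, the sketch below computes per-word surprisal with an off-the-shelf GPT-2 standing in for the L2LMs. Summing subword surprisals into word-level scores is a common simplification, and the values here are in nats; none of this is the paper’s exact pipeline.

```python
# A sketch of per-word surprisal with a causal LM. "gpt2" is a stand-in for an
# L2LM checkpoint; multi-token words are scored by summing subword surprisals.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisals(sentence):
    """Return (word, surprisal-in-nats) pairs for a whitespace-tokenized sentence."""
    words = sentence.split()
    context_ids = tokenizer.encode(tokenizer.bos_token)  # start from the BOS token
    results = []
    for i, word in enumerate(words):
        piece = (" " + word) if i > 0 else word
        total = 0.0
        for tok in tokenizer.encode(piece):
            with torch.no_grad():
                logits = model(torch.tensor([context_ids])).logits[0, -1]
            total += -torch.log_softmax(logits, dim=-1)[tok].item()
            context_ids.append(tok)
        results.append((word, total))
    return results

for word, s in word_surprisals("the number of occupied homes"):
    print(f"{word:>10s}  {s:5.2f}")
```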
To quantify this, the researchers measured the Delta Log-Likelihood (\(\Delta LL\)). This metric represents how much better we can predict human reading times when we add the model’s surprisal data to a baseline regression model (which just looks at word length and position).

A higher \(\Delta LL\) means the AI model is effectively explaining the variations in human reading speed.
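As a rough illustration of what \(\Delta LL\) captures, the sketch below fits a baseline regression on word length and position, then a second regression that adds surprisal, and compares their log-likelihoods. The data frame is synthetic toy data with made-up column names, not CELER, so the resulting number is meaningless except as a demonstration of the computation.

```python
# A minimal illustration of delta log-likelihood (ΔLL) using toy data; the real
# evaluation uses human gaze durations from CELER rather than random numbers.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "gaze_duration": rng.normal(250, 50, n),   # ms, toy values
    "word_length":   rng.integers(1, 12, n),
    "word_position": rng.integers(1, 20, n),
    "surprisal":     rng.gamma(2.0, 2.0, n),
})

baseline  = smf.ols("gaze_duration ~ word_length + word_position", data=df).fit()
with_surp = smf.ols("gaze_duration ~ word_length + word_position + surprisal", data=df).fit()

# ΔLL: how much adding the model's surprisal improves the fit, per word.
delta_ll = (with_surp.llf - baseline.llf) / n
print(f"Delta log-likelihood per word: {delta_ll:.4f}")
```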
Results: Does Native Language Affect Proficiency?
First, the researchers looked at general proficiency. Did the “native” language of the AI make it better or worse at learning English?
They measured Perplexity (how confused the model is by English text; lower is better) and Grammatical Accuracy (using a benchmark called BLiMP).
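Both measures can be read off a causal language model’s token probabilities. The sketch below uses an off-the-shelf GPT-2 as a stand-in and an illustrative minimal pair rather than an actual BLiMP item: perplexity is the exponential of the average per-token negative log-likelihood, and BLiMP-style accuracy checks whether the model assigns higher probability to the grammatical member of each sentence pair.

```python
# A quick sketch of the two proficiency measures. "gpt2" and the example
# sentences are placeholders, not the paper's models or BLiMP items.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def total_nll(text):
    """Total and mean negative log-likelihood of a sentence."""
    ids = torch.tensor([tokenizer.encode(text)])
    with torch.no_grad():
        mean_nll = model(ids, labels=ids).loss.item()  # mean per-token cross-entropy
    return mean_nll * (ids.size(1) - 1), mean_nll

# Perplexity over a sample sentence (lower = less "confused").
_, mean_nll = total_nll("The red car stopped at the light.")
print(f"Perplexity: {math.exp(mean_nll):.2f}")

# BLiMP-style minimal pair: the model should prefer the grammatical sentence.
good, _ = total_nll("The cats are sleeping.")
bad, _  = total_nll("The cats is sleeping.")
print("Prefers grammatical sentence:", good < bad)
```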
The results confirmed a widely held linguistic theory: Typological Distance matters. Languages that are linguistically closer to English helped the model learn English better.

In Figure 2 above, look at the bottom-right of the graph (30M training tokens). The English-pretrained model (blue line) obviously performs best. But notice the order of the others:
- Spanish & Portuguese: Perform relatively well (low perplexity).
- Japanese: Performs the worst (highest perplexity).
This mirrors human Second Language Acquisition (SLA); it is generally easier for a Spanish speaker to learn English than it is for a Japanese speaker, due to shared alphabets and similar sentence structures.
This trend held true for grammatical accuracy as well:

Figure 3 shows the accuracy on the BLiMP benchmark. In the “4B \(\rightarrow\) 30M” block (fully trained L1, substantial L2 training), the models pretrained on Spanish and Portuguese scored higher than those trained on Chinese or Japanese. This confirms that the frozen internal layers—the “L1 grammar”—were indeed transferring positive or negative inductive biases to the English learning process.
The Plot Twist: Predicting Human Reading Times
Here is where the study produced a surprising result.
The researchers hypothesized that a Japanese-pretrained AI would be the best predictor of a Japanese human’s English reading times. The logic is sound: if both “systems” are processing English through a Japanese “lens,” they should stumble on the same complex phrases.
However, the data told a different story.

Take a close look at Figure 4 (top chart). The x-axis represents the Human’s Native Language. The colored bars represent the AI’s Native Language.
If the hypothesis were true, the matching color should always be the highest bar for each group (e.g., the brown Japanese bar should be highest in the “Japanese” column). It isn’t.
Instead, we see two things:
- No “Match” Effect: The choice of the Model’s L1 had very little effect on prediction accuracy.
- Human L1 Dominance: The prediction accuracy depends heavily on who the human is. It is much harder to predict the reading times of English natives (far left) than Japanese natives (far right).
This suggests that while the L2LMs are capturing some aspect of reading difficulty (since \(\Delta LL\) is positive), they aren’t capturing the specific “transfer effects” that distinguish a Spanish speaker’s processing from a Chinese speaker’s processing.
The Learning Curve: A Model of Proficiency?
While the “L1 matching” hypothesis failed, the researchers found a fascinating insight regarding proficiency.
They took a standard, monolingual English model and tracked how well it predicted reading times at different stages of its training. Does a “smarter” (fully trained) model always predict human behavior better?

Figure 5 reveals a distinct trajectory.
- Predicting Native Speakers (First plot, “English”): The model becomes a better predictor as it trains, peaking around 2 billion tokens.
- Predicting L2 Speakers (Other plots): The model peaks much earlier (around 800M - 1.2B tokens) and then degrades in predictive power.
Why? The authors propose that L2 English speakers (in this dataset) possess a proficiency level roughly equivalent to a model trained on ~1 billion words. As the model trains further and becomes “native-like” (learning complex, low-frequency patterns), it actually becomes less like the non-native human readers.
This suggests that to model L2 processing, we might not need “L2-specific” architectures, but rather “proficiency-matched” models—standard models halted at specific stages of development.
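Operationally, that would mean scoring each intermediate training checkpoint by its \(\Delta LL\) against a given reader group and keeping the best-fitting one. The checkpoint names and \(\Delta LL\) values below are placeholders, not numbers from the paper; in practice they would come from a regression comparison like the one sketched earlier.

```python
# A sketch of "proficiency matching": pick the checkpoint whose surprisals best
# predict one L1 group's reading times. All values here are illustrative placeholders.
delta_ll_by_checkpoint = {
    "ckpt-400M-tokens":  0.012,
    "ckpt-800M-tokens":  0.019,
    "ckpt-1200M-tokens": 0.021,
    "ckpt-2000M-tokens": 0.017,
}

best = max(delta_ll_by_checkpoint, key=delta_ll_by_checkpoint.get)
print("Proficiency-matched checkpoint for this reader group:", best)
```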
A Qualitative Look: Where Models Diverge
Even though the statistical correlation wasn’t strong, looking at specific sentences reveals that the models did acquire different biases.
Consider the sentence fragment: "…the number of occupied homes…"

In Figure 7, look at the spike at the word “occupied”.
- The models pretrained on Spanish (orange) and Portuguese (green) show huge surprisal spikes.
- The models pretrained on Chinese and Japanese show much lower surprisal.
The authors speculate this is due to word order. In Spanish and Portuguese, relative clauses usually come after the noun. In Chinese and Japanese, modifiers come before. Because the models are frozen in their L1 “ways,” the Romance-language models might have been expecting a noun or determiner, not a past-participle adjective like “occupied.” This shows that the TILT method does encode structural expectations, even if they don’t perfectly map to human eye-tracking data yet.
Conclusion
This research highlights the complexity of modeling the human brain with neural networks. The study successfully demonstrated that:
- L1 matters for AI proficiency: Pretraining on a linguistically similar language helps an AI learn a second language faster and more accurately.
- Proficiency curves exist: Standard English models mimic non-native reading patterns best when they are only partially trained.
However, the “Holy Grail” of L2 simulation—creating a model that perfectly mirrors the specific processing struggles of a specific L1 population—remains out of reach. The frozen L2LMs, while conceptually sound, didn’t align perfectly with human gaze data.
This work suggests that “surprisal” (mathematical probability) is a powerful tool, but human sentence processing involves complex resource allocation strategies that current architectures—even those designed to mimic L1 transfer—haven’t fully captured. As we continue to bridge the gap between AI and cognitive science, these “negative” results are just as valuable as the positive ones, guiding us toward more realistic models of the bilingual mind.