If you have been following the explosion of Natural Language Processing (NLP) in recent years, you know that the Transformer architecture is the engine behind the revolution. From GPT-4 to Claude, Transformers seem capable of mastering complex reasoning, coding, and creative writing. But in the research world, a fundamental question remains: Do we actually understand how they learn?
There is a significant body of theoretical work exploring what Transformers can represent. For example, we know mathematically that a Transformer is capable of mimicking an n-gram language model (a simple model that predicts the next word based on the previous \(n-1\) words). But just because a neural network can represent a function doesn’t mean it will actually learn that function from data using gradient descent.
This brings us to a fascinating paper titled “Can Transformers Learn n-gram Language Models?” The researchers Svete, Borenstein, et al. decided to strip away the complexity of natural language and test Transformers on a fundamental task: learning synthetic n-gram distributions. Their results reveal surprising insights about the “inductive biases” of Transformers—essentially, what kind of patterns they prefer to learn.
In this post, we will break down their methodology, the different types of n-gram models they tested, and why Transformers sometimes struggle to beat 1990s-era statistical techniques, while dominating in other areas.
The Fundamentals: What are we actually testing?
Before we dive into the experiments, we need to establish the ground rules. The paper compares modern neural networks against classical statistical models on the task of learning a probability distribution.
The Language Model (LM)
At its core, a language model is simply a probability distribution over strings of text. As shown in the equation below, we usually define this autoregressively: the probability of a whole sequence is the product of the probabilities of each token given its history.

\[ p(\mathbf{y}) = \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{y}_{<t}) \]

Here, \(\mathbf{y}_{<t}\) denotes the history \(y_1, \dots, y_{t-1}\) of tokens that precede \(y_t\).

The n-gram Assumption
Modern Large Language Models (LLMs) look at a massive context window. However, an n-gram model makes a simplifying assumption: the probability of the next token depends only on the previous \(n-1\) tokens. If \(n=3\) (a trigram model), the model only cares about the last two words to predict the third. This limitation makes n-gram models excellent “toy problems” for probing neural networks because we can mathematically define the “ground truth” distribution perfectly.
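
To make the assumption concrete, here is a minimal Python sketch (my illustration, not code from the paper) that scores a sequence under a trigram model: each conditional probability is looked up using only the last two tokens, and the toy table plays the role of the ground-truth distribution.

```python
import math

# Toy trigram conditionals (hypothetical numbers): each 2-token history maps to a
# distribution over the next token. In the paper's setup, a table like this is the
# synthetic "ground truth" the models must recover.
TRIGRAM = {
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("cat", "sat"): {"down": 0.6, "quietly": 0.4},
}

def log_prob(tokens, n=3, fallback=1e-9):
    """Autoregressive log-probability, truncating every history to its last n-1 tokens.
    The first n-1 tokens are treated as given context (a real model would pad with BOS)."""
    total = 0.0
    for t in range(n - 1, len(tokens)):
        history = tuple(tokens[t - (n - 1):t])   # only the last n-1 tokens matter
        p = TRIGRAM.get(history, {}).get(tokens[t], fallback)
        total += math.log(p)
    return total

print(log_prob(["the", "cat", "sat", "down"]))  # log(0.7) + log(0.6) ≈ -0.87
```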

The Core Method: Two Ways to Build an n-gram
The authors introduce a crucial distinction that drives the entire paper. There are two very different ways to define an n-gram model, and Transformers react to them differently.

1. General n-gram LMs (The “Lookup Table”)
Imagine a model where the probability of the next word is completely arbitrary for every possible history. There is no relationship between “the cat” and “a dog”. They are just separate entries in a massive database. In a General n-gram LM, the parameters are distinct for every single context. There is no parameter sharing. To learn this, a model simply has to count how many times “sat” follows “the cat.”
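
To see why this is pure memorization, here is a minimal sketch (my own construction, not the authors' code; the paper's exact sampling scheme may differ) of a General n-gram LM: every length-\((n-1)\) context gets its own independently drawn next-token distribution, and nothing ties one context to another.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def make_general_ngram_lm(vocab, n):
    """One independent categorical distribution per context: a pure lookup table.
    The table grows as |vocab|**(n-1), and no parameters are shared between entries."""
    table = {}
    for context in itertools.product(vocab, repeat=n - 1):
        probs = rng.dirichlet(np.ones(len(vocab)))   # arbitrary distribution for this context
        table[context] = dict(zip(vocab, probs))
    return table

lm = make_general_ngram_lm(["a", "b", "c"], n=3)
print(lm[("a", "b")])   # unrelated to lm[("a", "c")] or any other entry
```

The only way to learn such a table is to observe each context in training and count what follows it.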
2. Representation-Based n-gram LMs (The “Neural” Approach)
Now, imagine a model where the history is converted into a vector (a representation). This is called a Representation-Based n-gram LM. It uses a shared set of parameters (an output matrix \(\mathbf{E}\) and a representation function \(h\)) to calculate probabilities. In this framework, the probability isn’t a raw count; it’s a softmax function applied to a dot product of weights and a hidden state. The authors suspect that because Transformers are neural networks that rely on dense vector representations, they will struggle with the “General” type (pure memorization/counting) but excel at the “Representation-Based” type.
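
And here is the contrasting sketch for the representation-based case, again a toy construction of mine: a shared embedding table, a made-up representation function \(h\) (the paper defines its own families of \(h\)), and a shared output matrix \(\mathbf{E}\) jointly produce every conditional distribution via a softmax.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["a", "b", "c"]
d = 4                                              # representation dimension
E = rng.normal(size=(len(vocab), d))               # shared output matrix (one row per symbol)
emb = {tok: rng.normal(size=d) for tok in vocab}   # shared token embeddings

def h(context, n=3):
    # Toy representation function: sum the embeddings of the last n-1 tokens.
    # This only illustrates parameter sharing; it is not the paper's construction.
    return sum(emb[tok] for tok in context[-(n - 1):])

def next_token_probs(context):
    logits = E @ h(context)                        # dot products of shared weights and hidden state
    exps = np.exp(logits - logits.max())
    return dict(zip(vocab, exps / exps.sum()))     # softmax

print(next_token_probs(["a", "b"]))
```

Because \(\mathbf{E}\) and the embeddings are reused everywhere, similar histories automatically get similar predictions, which is exactly the structure a Transformer is built around.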


The Contestants
To test this hypothesis, the researchers set up a “bake-off” between several types of models. They generated synthetic datasets where the ground truth was a specific n-gram model, and then tried to train various architectures to recover that distribution.

The Baselines: Classic Smoothing
In the era before deep learning, NLP relied on counting. Maximum Likelihood Estimation (MLE) is the simplest approach: just count the frequencies. However, MLE fails when it sees a sequence in the test set that it never saw in training (assigning it zero probability). To fix this, researchers developed smoothing techniques. These models are “non-parametric” in the deep learning sense—they don’t learn weights via gradient descent; they just estimate statistics directly from data.
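
Here is a minimal sketch of the counting approach with simple add-\(\lambda\) smoothing. The paper evaluates more sophisticated smoothing estimators; the point is only that unseen continuations get a small, non-zero probability instead of zero.

```python
from collections import Counter, defaultdict

def fit_counts(corpus, n=3):
    """Count (history, next-token) pairs across a list of token sequences."""
    counts = defaultdict(Counter)
    for tokens in corpus:
        for t in range(n - 1, len(tokens)):
            counts[tuple(tokens[t - (n - 1):t])][tokens[t]] += 1
    return counts

def smoothed_prob(counts, history, token, vocab_size, lam=0.1):
    """Add-lambda smoothing: redistribute a little mass to unseen continuations."""
    c = counts[history]
    return (c[token] + lam) / (sum(c.values()) + lam * vocab_size)

corpus = [["the", "cat", "sat", "down"], ["the", "cat", "ran", "away"]]
counts = fit_counts(corpus)
print(smoothed_prob(counts, ("the", "cat"), "sat", vocab_size=6))   # seen continuation: high
print(smoothed_prob(counts, ("the", "cat"), "away", vocab_size=6))  # unseen: small but non-zero
```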

The Neural Challengers
The main neural challenger is a standard Transformer language model trained with gradient descent on the synthetic data. Alongside it, the researchers also introduced a variation on the Transformer using Sparsemax attention. Unlike standard Softmax, which assigns a non-zero probability to everything, Sparsemax can assign exactly zero attention to irrelevant words. Theoretical work suggests this should help Transformers learn n-gram structures more strictly.

How Do We Judge Success?
Since the researchers generated the data synthetically, they possess the “Ground Truth” probability distribution, denoted as \(p_n\). They want to measure how close the trained Transformer (\(p_\mathcal{T}\)) gets to this truth. The metric of choice is the Kullback–Leibler (KL) Divergence.
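
For reference, the textbook form of the KL divergence from the ground truth \(p_n\) to the trained model \(p_\mathcal{T}\) is shown below; exactly how the paper estimates it (per context, or over sampled strings) is a detail I am glossing over here.

\[ D_{\mathrm{KL}}\left(p_n \,\|\, p_\mathcal{T}\right) = \sum_{\mathbf{y}} p_n(\mathbf{y}) \log \frac{p_n(\mathbf{y})}{p_\mathcal{T}(\mathbf{y})} \]

It equals zero exactly when the two distributions agree, and it grows as \(p_\mathcal{T}\) misallocates probability mass relative to \(p_n\).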

Experiment 1: The Inductive Bias of Transformers
The first major result comes from comparing how well these models learn “General” (arbitrary) n-gram LMs versus “Dense Representation-based” LMs. In a General n-gram LM, every context is independent. The optimal strategy is just to count. In a Dense Representation-based LM, symbols share parameters (conceptually similar to how word embeddings work). Let’s look at the results for models trained on 6-gram data (\(n=6\)). The paper reports the error (KL divergence) for each model, both with and without parameter sharing in the ground-truth LM.

The “General” Column (No Parameter Sharing)
Look at the columns labeled “No” (under Parameter Sharing). This is the pure memorization setting, and here the simple counting methods come out ahead of the Transformers.

The “Representation-Based” Column (Yes Parameter Sharing)
Now look at the columns labeled “Yes”. With shared structure in the ground truth to exploit, the Transformers pull ahead.

Key Takeaway: Transformers have a strong inductive bias toward representation-based languages. They assume that similar inputs should yield similar outputs. When the data essentially says “memorize these random numbers,” Transformers fail compared to simple counting methods. But when the data says “learn the hidden features,” Transformers shine.
Experiment 2: Complexity and Scale
The researchers then pushed the complexity of the tasks. They varied the order \(n\) (length of history), the vocabulary size \(|\Sigma|\), and the rank \(R\) of the matrix used to generate the data. They ran a regression analysis to see which factors make learning harder for Transformers. Looking at Table 4, the fitted coefficients (\(\hat{\beta}\)) show how strongly each factor predicts the error (KL divergence): a larger positive coefficient means that increasing that factor pushes the learned model further from the ground truth.
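
As a rough sketch of what such an analysis looks like (my reconstruction, not the authors' code; ordinary least squares, the file name, and the column names are all placeholders), one can regress the measured KL divergence on the complexity factors:

```python
import csv
import numpy as np

# Hypothetical results file with one row per training run:
# columns: n, vocab_size, rank, kl_divergence
rows = list(csv.DictReader(open("ngram_runs.csv")))

X = np.array([[1.0, float(r["n"]), float(r["vocab_size"]), float(r["rank"])] for r in rows])
y = np.array([float(r["kl_divergence"]) for r in rows])

# Ordinary least squares: beta_hat predicts KL from the complexity factors.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, b in zip(["intercept", "n", "|Sigma|", "R"], beta_hat):
    print(f"{name}: {b:+.3f}")   # positive coefficient -> factor associated with higher error
```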


Experiment 3: Softmax vs. Sparsemax
Standard Transformers use Softmax attention. This means that when the model looks back at the history, it attends to every previous token with at least some tiny probability. However, an n-gram model is strict: it cares only about the specific tokens in its window of the previous \(n-1\) positions, and nothing else. Theoretically, an attention mechanism that can set weights to exactly zero should be better at this. The researchers tested this by swapping Softmax for Sparsemax.

The results in Table 9 are striking. Across almost every configuration of vocabulary size (\(|\Sigma|\)) and order (\(n\)), the Sparsemax Transformers (bottom row) achieve lower KL divergence (lower error) than standard Softmax Transformers. This suggests that for learning formal languages and strict logical structures, our current standard Transformer architecture (using Softmax) might be suboptimal. Being able to completely ignore irrelevant information is a powerful capability that Sparsemax provides.
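
To see the difference concretely, here is a small self-contained sketch (not from the paper) comparing the two transformations on a toy vector of attention scores: softmax keeps every position strictly positive, while sparsemax projects the scores onto the probability simplex and zeroes out the low-scoring ones.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection of scores z onto the probability simplex.
    Low-scoring entries get exactly zero weight."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum          # entries that stay in the support
    k_max = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_max      # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.5, -1.0, -3.0])        # toy attention scores over 4 history positions
print(softmax(scores))    # every position keeps some probability mass
print(sparsemax(scores))  # e.g. [0.75, 0.25, 0.0, 0.0]: irrelevant positions are exactly 0
```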

Conclusion: What Does This Mean for AI?
This research paper provides a reality check for how we view Large Language Models. By studying these “toy” models, we gain a clearer picture of the massive engines driving modern AI. They aren’t magic; they are statistical learners with very specific biases—biases that happen to align beautifully with the structure of human language.