If you have been following the explosion of Natural Language Processing (NLP) in recent years, you know that the Transformer architecture is the engine behind the revolution. From GPT-4 to Claude, Transformers seem capable of mastering complex reasoning, coding, and creative writing. But in the research world, a fundamental question remains: Do we actually understand how they learn?

There is a significant body of theoretical work exploring what Transformers can represent. For example, we know mathematically that a Transformer is capable of mimicking an n-gram language model (a simple model that predicts the next word based on the previous \(n-1\) words). But just because a neural network can represent a function doesn’t mean it will actually learn that function from data using gradient descent.

This brings us to a fascinating paper titled “Can Transformers Learn n-gram Language Models?” Its authors, Svete, Borenstein, and colleagues, decided to strip away the complexity of natural language and test Transformers on a fundamental task: learning synthetic n-gram distributions. Their results reveal surprising insights about the “inductive biases” of Transformers—essentially, what kind of patterns they prefer to learn.

In this post, we will break down their methodology, the different types of n-gram models they tested, and why Transformers sometimes struggle to beat 1990s-era statistical techniques, while dominating in other areas.

The Fundamentals: What are we actually testing?

Before we dive into the experiments, we need to establish the ground rules. The paper compares modern neural networks against classical statistical models on the task of learning a probability distribution.

The Language Model (LM)

At its core, a language model is simply a probability distribution over strings of text. As shown in the equation below, we usually define this autoregressively: the probability of a whole sequence is the product of the probabilities of each token given its history.

Equation 1: Definition of an autoregressive language model.
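In standard notation (a sketch of the idea; the paper’s exact formulation may also include an end-of-sequence term), the factorization reads:

\[
p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid \mathbf{y}_{<t})
\]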

Here, \(\mathbf{y}_{<t}\) denotes the history: all of the tokens that precede position \(t\).

The n-gram Assumption

Modern Large Language Models (LLMs) look at a massive context window. However, an n-gram model makes a simplifying assumption: the probability of the next token depends only on the previous \(n-1\) tokens.

Equation 2: The n-gram assumption.
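Concretely, the conditioning context is truncated to the most recent \(n-1\) tokens (glossing over padding at the start of the string):

\[
p(y_t \mid \mathbf{y}_{<t}) = p(y_t \mid y_{t-n+1}, \ldots, y_{t-1})
\]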

If \(n=3\) (a trigram model), the model only cares about the last two words to predict the third. This limitation makes n-gram models excellent “toy problems” for probing neural networks because we can mathematically define the “ground truth” distribution perfectly.

The Core Method: Two Ways to Build an n-gram

The authors introduce a crucial distinction that drives the entire paper. There are two very different ways to define an n-gram model, and Transformers react to them differently.

1. General n-gram LMs (The “Lookup Table”)

Imagine a model where the probability of the next word is completely arbitrary for every possible history.

  • If the history is “the cat”, the next word “sat” has probability 0.5.
  • If the history is “a dog”, the next word “sat” has probability 0.01.

There is no relationship between “the cat” and “a dog”. They are just separate entries in a massive database. In a General n-gram LM, the parameters are distinct for every single context. There is no parameter sharing. To learn this, a model simply has to count how many times “sat” follows “the cat.”

2. Representation-Based n-gram LMs (The “Neural” Approach)

Now, imagine a model where the history is converted into a vector (a representation).

  • “the cat” might result in a vector \(v_1\).
  • “a dog” might result in a vector \(v_2\).
  • If \(v_1\) and \(v_2\) are similar, the predicted probabilities for the next word will also be similar.

This is called a Representation-Based n-gram LM. It uses a shared set of parameters (an output matrix \(\mathbf{E}\) and a representation function \(h\)) to calculate probabilities.

Equation 20: Definition of a representation-based n-gram LM.
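Sketching the definition (the paper’s exact notation may differ, e.g. in whether a bias term is included): the history is encoded by a representation function \(h\), multiplied by a shared output matrix \(\mathbf{E}\), and pushed through a softmax:

\[
p(y_t \mid y_{t-n+1}, \ldots, y_{t-1}) = \mathrm{softmax}\bigl(\mathbf{E}\, h(y_{t-n+1}, \ldots, y_{t-1})\bigr)_{y_t}
\]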

In this framework, the probability isn’t a raw count; it’s a softmax function applied to a dot product of weights and a hidden state.

Equation 4: The softmax function definition.
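For a vector of scores \(\mathbf{x} \in \mathbb{R}^{|\Sigma|}\), the softmax turns those scores into a probability distribution:

\[
\mathrm{softmax}(\mathbf{x})_i = \frac{\exp(x_i)}{\sum_{j=1}^{|\Sigma|} \exp(x_j)}
\]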

The authors suspect that because Transformers are neural networks that rely on dense vector representations, they will struggle with the “General” type (pure memorization/counting) but excel at the “Representation-Based” type.

The Contestants

To test this hypothesis, the researchers set up a “bake-off” between several types of models. They generated synthetic datasets where the ground truth was a specific n-gram model, and then tried to train various architectures to recover that distribution.

The Baselines: Classic Smoothing

In the era before deep learning, NLP relied on counting. Maximum Likelihood Estimation (MLE) is the simplest approach: just count the frequencies and normalize.

Equation 10: Maximum Likelihood Estimation formula.
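For an n-gram model, maximum likelihood boils down to normalized counts. Writing \(C(\cdot)\) for the number of times a sequence occurs in the training data:

\[
p_{\mathrm{MLE}}(y_t \mid y_{t-n+1}, \ldots, y_{t-1}) = \frac{C(y_{t-n+1}, \ldots, y_{t-1}, y_t)}{C(y_{t-n+1}, \ldots, y_{t-1})}
\]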

However, MLE fails when it sees a sequence in the test set that it never saw in training (assigning it zero probability). To fix this, researchers developed smoothing techniques.

  • Add-\(\lambda\) Smoothing: Adds a small pseudo-count (\(\lambda\)) to everything so nothing has zero probability.
  • Witten-Bell & Absolute Discounting: More complex methods that interpolate between higher-order (e.g., trigram) and lower-order (e.g., bigram) statistics.

These models are “non-parametric” in the deep learning sense—they don’t learn weights via gradient descent; they just estimate statistics directly from data.
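To make the counting baselines concrete, here is a minimal Python sketch (an illustration, not the authors’ implementation; the function names, the “<bos>” padding convention, and the toy corpus are assumptions made for the example) of an n-gram estimator with optional add-\(\lambda\) smoothing:

```python
from collections import Counter

def train_ngram(corpus, n, vocab, lam=0.0):
    """Estimate n-gram probabilities by counting, with optional add-lambda smoothing.
    `corpus` is a list of token sequences; `vocab` is the set of possible tokens."""
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in corpus:
        padded = ["<bos>"] * (n - 1) + list(sentence)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1 : i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1

    def prob(context, token):
        # MLE when lam == 0; add-lambda smoothing otherwise.
        num = ngram_counts[tuple(context) + (token,)] + lam
        den = context_counts[tuple(context)] + lam * len(vocab)
        # Unseen context with lam == 0: fall back to a uniform guess.
        return num / den if den > 0 else 1.0 / len(vocab)

    return prob

# Usage: a trigram estimator over a toy corpus.
corpus = [["the", "cat", "sat"], ["a", "dog", "sat"]]
vocab = {"the", "a", "cat", "dog", "sat", "<bos>"}
p = train_ngram(corpus, n=3, vocab=vocab, lam=0.1)
print(p(("the", "cat"), "sat"))
```

With \(\lambda = 0\) this reduces to plain MLE; any \(\lambda > 0\) guarantees that unseen continuations still receive non-zero probability.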

The Neural Challengers

  1. Log-Linear Model: A simple regression model that tries to learn the weights for a “sparse” representation (one-hot encoding) of the history.
  2. Neural n-gram LM: A standard feed-forward neural network that learns embeddings for words, concatenates them, and predicts the next word (a minimal sketch of this architecture follows the list).
  3. Transformers: The star of the show. The researchers used GPT-2 style architectures of varying sizes (layers and heads).
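Here is the minimal sketch promised above for challenger 2, a Bengio-style feed-forward n-gram LM written in PyTorch; the layer sizes and hyperparameters are illustrative choices, not the configuration used in the paper:

```python
import torch
import torch.nn as nn

class NeuralNGramLM(nn.Module):
    """Feed-forward n-gram LM: embed the n-1 history tokens, concatenate,
    and predict a distribution over the next token."""
    def __init__(self, vocab_size, n, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear((n - 1) * emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, history):          # history: (batch, n-1) token ids
        e = self.embed(history)          # (batch, n-1, emb_dim)
        h = e.flatten(start_dim=1)       # concatenate the embeddings
        return self.ff(h)                # logits over the next token

# Usage: a trigram-style model (n=3) over a 100-token vocabulary.
model = NeuralNGramLM(vocab_size=100, n=3)
logits = model(torch.randint(0, 100, (8, 2)))   # batch of 8 two-token histories
probs = torch.softmax(logits, dim=-1)
```

The key point is that the history is compressed into a learned dense representation before the output layer, which is exactly the kind of parameter sharing discussed above.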

They also introduced a variation on the Transformer using Sparsemax attention. Unlike standard Softmax, which assigns a non-zero probability to everything, Sparsemax can assign exactly zero attention to irrelevant words. Theoretical work suggests this should help Transformers learn n-gram structures more strictly.
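The difference between the two attention normalizers is easiest to see numerically. Below is a small NumPy sketch (illustrative only, not the authors’ code) of softmax next to the sparsemax projection of Martins and Astudillo (2016):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex
    (Martins & Astudillo, 2016); can return exact zeros."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum  # which sorted entries stay in the support
    k_z = k[support][-1]                 # size of the support
    tau = (cumsum[k_z - 1] - 1) / k_z    # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.0, -1.0, -2.0])
print(softmax(scores))    # every entry is strictly positive
print(sparsemax(scores))  # trailing entries are exactly zero
```

Sparsemax assigns exactly zero weight to the low-scoring entries, which is the behavior a hard \(n-1\)-token context window calls for.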

How Do We Judge Success?

Since the researchers generated the data synthetically, they possess the “Ground Truth” probability distribution, denoted as \(p_n\). They want to measure how close the trained Transformer (\(p_\mathcal{T}\)) gets to this truth.

The metric of choice is the Kullback–Leibler (KL) Divergence.

Equation 5: KL Divergence formula.
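For two distributions over strings, it is defined as

\[
D_{\mathrm{KL}}(p_n \,\|\, p_{\mathcal{T}}) = \sum_{\mathbf{y}} p_n(\mathbf{y}) \log \frac{p_n(\mathbf{y})}{p_{\mathcal{T}}(\mathbf{y})}
\]

(written here over whole strings; the paper may aggregate it per context instead, but either way it is zero exactly when the learned model matches the ground truth).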

  • Low KL Divergence: The trained model has successfully learned the underlying rules of the language.
  • High KL Divergence: The trained model is failing to capture the true distribution.

Experiment 1: The Inductive Bias of Transformers

The first major result comes from comparing how well these models learn “General” (arbitrary) n-gram LMs versus “Dense Representation-based” LMs.

In a General n-gram LM, every context is independent. The optimal strategy is just to count. In a Dense Representation-based LM, contexts share parameters through their vector representations (conceptually similar to how word embeddings work).

Let’s look at the results for models trained on 6-gram data (\(n=6\)):

Table 2: Comparison of Classic, Log-Linear, Neural, and Transformer models.

The “General” Column (No Parameter Sharing)

Look at the columns labeled “No” (under Parameter Sharing).

  • Classic techniques (top row) achieve very low KL scores (e.g., 3.04). They are excellent at this task.
  • Transformers (TF) perform terribly here (e.g., 67.06).
  • Interpretation: Transformers struggle to approximate arbitrary distributions where there is no underlying structure or shared features between contexts. They are not efficient look-up tables.

The “Representation-Based” Column (Yes Parameter Sharing)

Now look at the columns labeled “Yes”.

  • Transformers drop from errors of ~67.0 down to ~4.6. They become highly effective.
  • Classic techniques do okay, but they cannot exploit the shared structure, so they hit a performance ceiling.
  • Neural models (Neural and TF) outperform the Log-Linear baseline significantly.

Key Takeaway: Transformers have a strong inductive bias toward representation-based languages. They assume that similar inputs should yield similar outputs. When the data essentially says “memorize these random numbers,” Transformers fail compared to simple counting methods. But when the data says “learn the hidden features,” Transformers shine.

Experiment 2: Complexity and Scale

The researchers then pushed the complexity of the tasks. They varied the order \(n\) (length of history), the vocabulary size \(|\Sigma|\), and the rank \(R\) of the matrix used to generate the data.

They ran a regression analysis to see which factors make learning harder for Transformers.

Table 1: Factors predicting KL Divergence.

Table 4: Statistical significance of predictors.

Looking at Table 4, we see the coefficients (\(\hat{\beta}\)) which predict the error (KL divergence):

  1. Dense (-1.28): This is the strongest predictor. If the dataset is Dense (representation-based), the error drops massively. This statistically confirms the result from Experiment 1.
  2. \(n\) (0.63): As the \(n\)-gram order increases, the task gets harder for the Transformer.
  3. Layers (L) and Heads (H): Notice the negative coefficients for \(L\) (-0.33) and \(H\) (-0.15). This means adding more layers and heads reduces the error. This aligns with theory suggesting that Transformers need specific depth/width to simulate specific \(n\)-gram orders.

Experiment 3: Softmax vs. Sparsemax

Standard Transformers use Softmax attention. This means that when the model looks back at the history, it attends to every previous token with at least some tiny probability.

However, an n-gram model is strict: it cares only about the specific tokens inside its \(n-1\)-token window, and nothing else. Theoretically, an attention mechanism that can set weights to exactly zero should be better suited to this.

The researchers tested this by swapping Softmax for Sparsemax.

Table 9: Comparing Softmax and Sparsemax performance.

The results in Table 9 are striking. Across almost every configuration of vocabulary size (\(|\Sigma|\)) and order (\(n\)), the Sparsemax Transformers (bottom row) achieve lower KL divergence (lower error) than standard Softmax Transformers.

This suggests that for learning formal languages and strict logical structures, our current standard Transformer architecture (using Softmax) might be suboptimal. Being able to completely ignore irrelevant information is a powerful capability that Sparsemax provides.

Conclusion: What Does This Mean for AI?

This research paper provides a reality check for how we view Large Language Models.

  1. Transformers can represent n-grams, but they don’t learn to “count”. While a Transformer can theoretically represent any n-gram model, it is not a good “counter.” If you need a model to strictly memorize frequency tables of unrelated events, a simple smoothed 3-gram model from the 1990s might actually beat a small Transformer.
  2. The Secret Sauce is Parameter Sharing. Transformers excel because they assume the world is structured. They assume that the “features” of the history define the probability of the future. This is why they generalize so well to natural language, which is full of such structures, unlike the random “General” n-gram distributions generated in this study.
  3. Architecture Matters. The superiority of Sparsemax attention in these experiments hints that we haven’t finished optimizing the Transformer block. For tasks requiring precision and strict dependency tracking (like formal logic or specific linguistic rules), sparse attention mechanisms might hold the key to better performance.

By studying these “toy” models, we gain a clearer picture of the massive engines driving modern AI. They aren’t magic; they are statistical learners with very specific biases—biases that happen to align beautifully with the structure of human language.