Introduction

Imagine reading the sentence: “The Golden Gate Bridge has been obnebulated every morning this week, limiting visibility.”

Unless you are an avid reader of 19th-century literature, you probably haven’t encountered the word obnebulated before. Yet, you likely understood the sentence perfectly. You know it’s a verb (thanks to the “-ed” suffix and its position after “has been”), and context clues about “visibility” suggest it means something like “clouded” or “fogged.”

Humans are excellent at this. We use syntactic and semantic cues to generalize our knowledge to words we rarely see. Large Language Models (LLMs), however, struggle with this. Despite their impressive capabilities, models like GPT-4 or RoBERTa rely heavily on frequency statistics: they learn frequent words disproportionately better than infrequent ones.

This reliance creates two major issues:

  1. Frequency Bias: Models often prefer a grammatically incorrect sentence containing common words over a grammatically correct sentence containing rare words.
  2. Anisotropy: The mathematical representations of words inside the model cluster into a narrow cone rather than spreading out, making it hard to distinguish between different rare words.

In this post, we are diving deep into a recent paper by Diehl Martinez et al., “Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing.” The researchers propose a clever tweak to the training objective—called Syntactic Smoothing—that forces the model to share knowledge between frequent and rare words based on their grammatical roles.

The Background: Why LLMs Struggle with Rare Words

To understand the solution, we first need to understand the “Zipfian” nature of language. Zipf’s law states that a small number of words are used very frequently (like “the,” “is,” “and”), while the vast majority of words are used very rarely (like “obnebulated”).

When an LLM is pre-trained using a Maximum Likelihood objective (predicting the next word), it sees frequent tokens millions of times and rare tokens only a handful of times. Consequently, the model optimizes the representations of frequent words aggressively, pushing rare words into a “degenerate” state where they don’t carry much useful information.

The Anisotropy Problem

This degeneration leads to a geometric problem in the model’s vector space called Anisotropy.

Ideally, word embeddings (the vector representations of words) should be isotropic—meaning they point in all directions in the vector space, utilizing the full capacity of the dimensions available. Anisotropy is the opposite: the vectors cluster together in a narrow cone.

When vectors are jammed into a narrow cone, the cosine similarity between any two random words becomes very high. This makes it difficult for the model to distinguish between semantically different but infrequent words.

The researchers use a mathematical definition to quantify this. The formal definition of isotropy involves a partition function \(Z(c)\):

\[
I(\mathbf{W}) = \frac{\min_{\lVert c \rVert = 1} Z(c)}{\max_{\lVert c \rVert = 1} Z(c)}
\]

Where \(Z(c)\) is defined as:

\[
Z(c) = \sum_{i=1}^{|V|} \exp\left(c^{\top} w_i\right)
\]

In practice, this is too expensive to compute, since it requires optimizing \(Z(c)\) over every possible unit vector \(c\). Instead, the authors use an empirical approximation proposed by Ethayarajh: they measure anisotropy by taking the average cosine similarity between random pairs of words:

\[
\hat{A} = \frac{1}{n} \sum_{k=1}^{n} \cos\!\big(w_{x_k}, w_{y_k}\big)
\]

If this value is high (close to 1), it means all words are pointing in roughly the same direction—high anisotropy. If it is low, the embeddings are well-distributed.
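
To make this concrete, here is a minimal sketch of how such an estimate could be computed with NumPy. The embedding matrix, sample size, and function name are placeholders rather than the paper's actual evaluation code.

```python
import numpy as np

def empirical_anisotropy(embeddings: np.ndarray, n_pairs: int = 1000, seed: int = 0) -> float:
    """Estimate anisotropy as the mean cosine similarity of randomly sampled embedding pairs."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    idx_a = rng.integers(0, n, size=n_pairs)
    idx_b = rng.integers(0, n, size=n_pairs)
    mask = idx_a != idx_b                      # drop the rare i == j draws
    a, b = embeddings[idx_a[mask]], embeddings[idx_b[mask]]
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(cos.mean())

# Toy check: isotropic Gaussian vectors give a value near 0.
vectors = np.random.default_rng(1).normal(size=(5000, 768))
print(empirical_anisotropy(vectors))           # ~0.0 for well-spread embeddings
```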

Measuring Frequency Bias with BLiMP

How do we know if a model is biased by frequency? The researchers devised a clever metric using BLiMP (The Benchmark of Linguistic Minimal Pairs).

BLiMP consists of pairs of sentences: one grammatical and one ungrammatical. For example:

  • Grammatical: Grace’s piano teachers are known.
  • Ungrammatical: Grace’s piano teachers are replied.

The model should assign a higher probability to the grammatical sentence. However, the authors hypothesized that if the ungrammatical sentence contains more frequent words than the grammatical one, the model might get confused.

They developed a pipeline to quantify this:

  1. Calculate the frequency difference between the tokens in the pair.
  2. Isolate pairs where the grammatical sentence uses rare words and the ungrammatical sentence uses frequent words.
  3. Compare the model’s accuracy on these “hard” pairs vs. pairs where the frequency helps the model.

Figure 1: Illustration of the BLiMP frequency bias calculation. Step 1 shows calculating frequency difference. Step 2 shows the distribution. Step 3 shows the bias score calculation.

As shown in Figure 1, the difference in accuracy between the top third (frequency helps) and bottom third (frequency hurts) is the Frequency Bias. A high bias score means the model is cheating: it’s looking at word frequency rather than grammar.
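
Here is a rough sketch of that pipeline in Python. The use of mean token frequency per sentence and quantile-based thirds are my assumptions about the bucketing; the frequency table and correctness flags would come from your own corpus counts and model evaluations.

```python
import numpy as np

def frequency_bias(pairs, freq, model_correct):
    """pairs: list of (grammatical_tokens, ungrammatical_tokens) BLiMP-style pairs.
    freq: dict mapping token -> corpus frequency.
    model_correct: list of bools, one per pair (did the model prefer the grammatical sentence?)."""
    # Step 1: frequency difference between the grammatical and ungrammatical sentence.
    diffs = np.array([
        np.mean([freq.get(t, 0) for t in good]) - np.mean([freq.get(t, 0) for t in bad])
        for good, bad in pairs
    ])
    correct = np.array(model_correct, dtype=float)

    # Step 2: split the pairs into thirds by frequency difference.
    lo, hi = np.quantile(diffs, [1 / 3, 2 / 3])
    helps = correct[diffs >= hi]   # grammatical sentence uses the more frequent words
    hurts = correct[diffs <= lo]   # grammatical sentence uses the rarer words

    # Step 3: bias = accuracy gap between the "frequency helps" and "frequency hurts" buckets.
    return helps.mean() - hurts.mean()
```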

The Core Method: Syntactic Smoothing

The standard way to train a Language Model is to use a “one-hot” target distribution. If the correct next word is “cat,” the target vector has a 1.0 for “cat” and 0.0 for every other word in the dictionary. The model is penalized if it predicts anything other than “cat.”

The problem? If the correct word is a rare word (e.g., “ocelot”), the model barely gets any chances to learn it.

The authors propose Syntactic Smoothing. Instead of putting 100% of the target probability on the correct word, we distribute some of that probability to other words that play the same syntactic role.

If the target is “ocelot,” we shouldn’t just tell the model “it’s ocelot.” We should tell it: “It’s ocelot, but it’s also similar to ‘cat’, ’tiger’, and ‘animal’.” This allows the rare word “ocelot” to benefit from the learning signals of frequent words like “cat.”

Step 1: Defining Syntactic Similarity

To do this, we need to know which words are syntactically similar. The authors use a Part-of-Speech (POS) proxy.

They run a POS tagger over the training data and build a distribution for every token. For example, the word “blind” can be a Noun, Verb, or Adjective. The word “the” is almost exclusively a Determiner.

We can visualize this difference in distributions:

Figure 2: POS distributions for ‘blind’ vs ’the’. ‘Blind’ has a diverse POS distribution, while ’the’ is almost purely a Determiner.

They then calculate the Syntactic Similarity between any two tokens \(i\) and \(j\) by taking the cosine similarity of their POS distribution vectors \(M_i\) and \(M_j\):

\[
\mathrm{sim}(i, j) = \cos\!\left(M_i, M_j\right) = \frac{M_i \cdot M_j}{\lVert M_i \rVert \, \lVert M_j \rVert}
\]

This creates a static matrix telling us how grammatically similar every word is to every other word.
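
Here is a minimal sketch of how such a matrix could be built, assuming the corpus has already been run through a POS tagger and flattened into (token, tag) pairs; the tag set, helper names, and normalization details are illustrative, not the paper's exact recipe.

```python
from collections import Counter, defaultdict
import numpy as np

def pos_distributions(tagged_corpus, tagset):
    """tagged_corpus: iterable of (token, pos_tag) pairs from any POS tagger.
    Returns the vocabulary and a |V| x |tagset| matrix M, where M[i] is token i's POS distribution."""
    counts = defaultdict(Counter)
    for token, tag in tagged_corpus:
        counts[token][tag] += 1
    vocab = sorted(counts)
    M = np.zeros((len(vocab), len(tagset)))
    for i, token in enumerate(vocab):
        for j, tag in enumerate(tagset):
            M[i, j] = counts[token][tag]
        total = M[i].sum()
        if total > 0:
            M[i] /= total                      # normalize counts into a distribution
    return vocab, M

def syntactic_similarity(M):
    """Cosine similarity between every pair of POS distribution vectors."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / np.maximum(norms, 1e-12)
    return unit @ unit.T                       # S[i, j] = cos(M_i, M_j)

# Toy usage with a hypothetical tag set.
tags = ["DET", "NOUN", "VERB", "ADJ"]
corpus = [("the", "DET"), ("blind", "ADJ"), ("blind", "NOUN"), ("blind", "VERB"), ("cat", "NOUN")]
vocab, M = pos_distributions(corpus, tags)
S = syntactic_similarity(M)                    # "blind" and "cat" share some similarity; "the" does not
```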

Step 2: Smoothing the Backpropagation Signal

Now, the authors modify the loss function. They introduce a smoothing parameter \(\alpha\).

  • If \(\alpha = 0\), we do standard training (100% signal to the correct word).
  • If \(\alpha > 0\), we reserve \((1 - \alpha)\) for the correct word, and distribute the remaining \(\alpha\) to all other words based on their syntactic similarity.

The new target distribution \(t_i\) for a vocabulary token \(i\) (given the correct target is \(j\)) is:

\[
t_i =
\begin{cases}
1 - \alpha & \text{if } i = j \\[4pt]
\alpha \, \tilde{s}_{ij} & \text{if } i \neq j
\end{cases}
\]

where \(\tilde{s}_{ij}\) is a normalized, sharpened version of the syntactic similarity between token \(i\) and the target \(j\).

To make the distribution sharper (so we don’t smear probability over too many words), they apply temperature scaling to the similarity scores:

\[
\tilde{s}_{ij} = \frac{\exp\!\big(\mathrm{sim}(i, j) / \tau\big)}{\sum_{k \neq j} \exp\!\big(\mathrm{sim}(k, j) / \tau\big)}
\]

where a small temperature \(\tau\) concentrates the probability mass on the most syntactically similar tokens.
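
Putting the two formulas together, here is a minimal sketch of how one smoothed target vector could be built from the similarity matrix. The softmax-with-temperature form, the default values, and the function name are my assumptions about the scheme described above, not code from the paper.

```python
import numpy as np

def smoothed_target(S, target_idx, alpha=0.3, temperature=0.1):
    """Build a target distribution over the vocabulary for one training position.
    S: |V| x |V| syntactic similarity matrix; target_idx: index of the correct token."""
    sims = S[target_idx].astype(float).copy()
    sims[target_idx] = -np.inf              # the correct token gets its mass separately
    scaled = np.exp(sims / temperature)     # temperature-scaled softmax sharpens the scores
    scaled /= scaled.sum()
    t = alpha * scaled                      # alpha is spread over syntactically similar tokens
    t[target_idx] = 1.0 - alpha             # (1 - alpha) stays on the correct token
    return t

# The training loss is then the cross-entropy between the model's predicted
# distribution and this smoothed target, instead of a one-hot vector.
```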

Step 3: Pacing

Giving “partial credit” to syntactically similar words is great for learning representations, but eventually, the model needs to learn to predict the specific correct word to be useful.

The authors introduce Pacing: they start training with a high value of \(\alpha\) (lots of smoothing) and linearly decrease it to 0 by the end of training. This serves as a “syntactic scaffold”—supporting the model early on and removing the support as the model matures.
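
A pacing schedule like this takes only a couple of lines; the linear decay matches the description above, while the starting value and step counts below are placeholders.

```python
def paced_alpha(step, total_steps, alpha_start=0.5):
    """Linearly decay the smoothing weight from alpha_start to 0 over training."""
    return alpha_start * max(0.0, 1.0 - step / total_steps)

# Example: the scaffold is strongest at the start and fully removed by the end.
print([round(paced_alpha(s, 100), 3) for s in (0, 50, 100)])  # [0.5, 0.25, 0.0]
```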

Experiments and Results

The researchers trained small RoBERTa models (approx. 125M parameters) on the BabyLM dataset (10 million tokens). They compared their Syntactic Smoothing (SyS) models against:

  1. Base Model: Standard RoBERTa training.
  2. Label Smoothing (LS): A standard technique where the signal is distributed uniformly across all other tokens, not based on syntax.
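
For comparison, a Label Smoothing target can be sketched in a few lines: the \(\alpha\) mass is spread evenly over the rest of the vocabulary instead of being weighted by syntactic similarity as in the earlier sketch (the value of \(\alpha\) here is a placeholder).

```python
import numpy as np

def uniform_label_smoothing_target(vocab_size, target_idx, alpha=0.1):
    """Label Smoothing baseline: the alpha mass is spread uniformly, ignoring syntax."""
    t = np.full(vocab_size, alpha / (vocab_size - 1))
    t[target_idx] = 1.0 - alpha
    return t
```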

1. Reducing Frequency Bias

The primary goal was to stop the model from blindly trusting frequent words.

Figure 3: Frequency bias plotted for various models. Syntactic Smoothing (SyS) dramatically reduces bias compared to baselines.

As seen in Figure 3, standard models (OPT, RoBERTa, T5) have high frequency bias (gray bars). The authors’ Base Model (brown) also has high bias (~9.8). However, the Syntactic Smoothing (SyS) variants (red bars) drastically reduce this bias, with the “Mid” smoothing level bringing it almost to zero. This confirms that the model is learning to rely less on raw frequency statistics.

2. Reducing Anisotropy

Did the geometric “cone” problem improve?

Figure 4: Anisotropy learning dynamics. SyS models maintain lower anisotropy throughout training.

Figure 4 plots anisotropy over the course of training. The Base Model (dashed brown line) ends up with high anisotropy (~0.5). The Syntactic Smoothing models (solid lines) consistently maintain lower anisotropy.

Ideally, we want representations to be spread out. The Paced (SyS-P) models, which slowly turn off the smoothing, showed interesting behavior: even as the smoothing was removed, the anisotropy remained lower than the baseline.

We can zoom in on specific layers to see this effect more clearly:

Figure 5: Anisotropy dynamics across layers. The final layer shows a significant gap between the Base Model and SyS model.

Figure 5 compares the Base Model with the Paced Syntactic Smoothing model. Notice Layer 7 (the solid lines): in the Base Model (black), anisotropy skyrockets, while in the SyS model (orange) it stays much lower. This suggests the final representations are richer and more distinct from one another.

3. Linking Frequency Bias and Anisotropy

One of the most interesting findings in the paper is the correlation between these two phenomena.

Figure 6: Scatter plot showing correlation between Frequency Bias and Anisotropy.

Figure 6 shows that models with high frequency bias tend to have high anisotropy. The Syntactic Smoothing models (red/orange markers) cluster in the bottom-left, indicating they have successfully mitigated both issues simultaneously. This supports the theory that “degenerate” representations of rare words are a root cause of frequency bias.

4. Alternative Similarity Metrics

You might wonder: is Cosine Similarity the best way to compare POS distributions? The authors also tested Jensen-Shannon (JS) Divergence as an alternative distance metric.

\[
\mathrm{sim}_{\mathrm{JS}}(i, j) = 1 - \mathrm{JSD}\!\left(M_i \,\big\|\, M_j\right),
\qquad
\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}\!\left(P \,\big\|\, m\right) + \tfrac{1}{2} D_{\mathrm{KL}}\!\left(Q \,\big\|\, m\right),
\quad m = \tfrac{1}{2}(P + Q)
\]
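
A minimal sketch of such a similarity, assuming the common "1 minus the (base-2) Jensen-Shannon divergence" formulation; the paper's exact normalization may differ.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_similarity(p, q):
    """Similarity between two POS distributions as 1 - JS divergence.
    SciPy's jensenshannon returns the square root of the divergence (base 2 here)."""
    return 1.0 - jensenshannon(p, q, base=2) ** 2

print(js_similarity(np.array([0.9, 0.1]), np.array([0.8, 0.2])))  # close to 1.0
```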

The results were robust regardless of the metric used. As shown in Table 2 below, using JS divergence (SyS [JS]) achieved similar reductions in bias and anisotropy compared to the cosine version.

Table 2: Results comparing Bias, Anisotropy, and BLiMP scores for JS-based smoothing.

Conclusion

The “rich get richer” dynamic of token frequency in Large Language Models is a significant hurdle for true language understanding. When models ignore rare words in favor of common ones, they fail to capture the long tail of human language.

This paper provides a compelling solution: Syntactic Smoothing. By baking a “syntactic prior” into the loss function, the researchers forced the model to group rare words with their frequent syntactic cousins.

Key Takeaways:

  1. Rare words matter: Standard training leaves rare words in a “degenerate” state (Anisotropy).
  2. Smoothing helps: Distributing the learning signal based on POS tags rescues these rare words.
  3. Bias and Geometry are linked: Reducing the geometric clustering of vectors (Anisotropy) directly reduces the model’s tendency to cheat using frequency (Frequency Bias).

While this study focused on smaller models and English data, the implications are broad. As we strive for models that don’t just memorize statistics but actually understand structure, linguistically informed training objectives like Syntactic Smoothing may become essential tools in the deep learning toolbox.