Introduction
Imagine reading the sentence: “The Golden Gate Bridge has been obnebulated every morning this week, limiting visibility.”
Unless you are an avid reader of 19th-century literature, you probably haven’t encountered the word obnebulated before. Yet, you likely understood the sentence perfectly. You know it’s a verb (thanks to the “-ed” suffix and its position after “has been”), and context clues about “visibility” suggest it means something like “clouded” or “fogged.”
Humans are excellent at this. We use syntactic and semantic cues to generalize our knowledge to words we rarely see. Large Language Models (LLMs), however, struggle with this. Despite their impressive capabilities, models like GPT-4 or RoBERTa rely heavily on frequency statistics, learning frequent words far more thoroughly than infrequent ones.
This reliance creates two major issues:
- Frequency Bias: Models often prefer a grammatically incorrect sentence containing common words over a grammatically correct sentence containing rare words.
- Anisotropy: The mathematical representations of words inside the model cluster into a narrow cone rather than spreading out, making it hard to distinguish between different rare words.
In this post, we are diving deep into a recent paper by Diehl Martinez et al., “Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing.” The researchers propose a clever tweak to the training objective—called Syntactic Smoothing—that forces the model to share knowledge between frequent and rare words based on their grammatical roles.
The Background: Why LLMs Struggle with Rare Words
To understand the solution, we first need to understand the “Zipfian” nature of language. Zipf’s law states that a small number of words are used very frequently (like “the,” “is,” “and”), while the vast majority of words are used very rarely (like “obnebulated”).
When an LLM is pre-trained using a Maximum Likelihood objective (predicting the next word), it sees frequent tokens millions of times and rare tokens only a handful of times. Consequently, the model optimizes the representations of frequent words aggressively, pushing rare words into a “degenerate” state where they don’t carry much useful information.
The Anisotropy Problem
This degeneration leads to a geometric problem in the model’s vector space called Anisotropy.
Ideally, word embeddings (the vector representations of words) should be isotropic—meaning they point in all directions in the vector space, utilizing the full capacity of the dimensions available. Anisotropy is the opposite: the vectors cluster together in a narrow cone.
When vectors are jammed into a narrow cone, the cosine similarity between any two random words becomes very high. This makes it difficult for the model to distinguish between semantically different but infrequent words.
The researchers use a mathematical definition to quantify this. The formal definition of isotropy compares a partition function \(Z(c)\) over unit vectors \(c\):

\[
I(\mathbf{W}) = \frac{\min_{\lVert c \rVert = 1} Z(c)}{\max_{\lVert c \rVert = 1} Z(c)}
\]

where \(Z(c)\) is defined over the embedding matrix \(\mathbf{W}\) with rows \(w_i\) as:

\[
Z(c) = \sum_{i=1}^{n} \exp\left(c^{\top} w_i\right)
\]
In practice, this is too hard to calculate. Instead, the authors use an empirical approximation proposed by Ethayarajh: they measure anisotropy as the average cosine similarity between random pairs of word representations:

\[
\text{Anisotropy} = \frac{1}{|P|} \sum_{(i, j) \in P} \cos(h_i, h_j)
\]

where \(P\) is a set of randomly sampled token pairs with \(i \neq j\), and \(h_i\) is the model's representation of token \(i\).
If this value is high (close to 1), it means all words are pointing in roughly the same direction—high anisotropy. If it is low, the embeddings are well-distributed.
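To make this concrete, here is a minimal sketch of how the empirical measure can be computed over a set of token embeddings. The function name and the random-sampling scheme are ours, not the paper's:

```python
import numpy as np

def empirical_anisotropy(embeddings: np.ndarray, n_pairs: int = 10_000, seed: int = 0) -> float:
    """Estimate anisotropy as the average cosine similarity between
    randomly sampled pairs of token embeddings."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    # Sample random (i, j) index pairs and drop the cases where i == j.
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    mask = i != j
    a, b = embeddings[i[mask]], embeddings[j[mask]]
    # Cosine similarity for each sampled pair.
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(cos.mean())

# Example: random Gaussian vectors are nearly isotropic, so the value is close to 0.
vectors = np.random.randn(5_000, 768)
print(round(empirical_anisotropy(vectors), 3))
```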
Measuring Frequency Bias with BLiMP
How do we know if a model is biased by frequency? The researchers devised a clever metric using BLiMP (The Benchmark of Linguistic Minimal Pairs).
BLiMP consists of pairs of sentences: one grammatical and one ungrammatical. For example:
- Grammatical: Grace’s piano teachers are known.
- Ungrammatical: Grace’s piano teachers are replied.
The model should assign a higher probability to the grammatical sentence. However, the authors hypothesized that if the ungrammatical sentence contains more frequent words than the grammatical one, the model might get confused.
They developed a pipeline to quantify this:
- Calculate the frequency difference between the tokens in the pair.
- Isolate pairs where the grammatical sentence uses rare words and the ungrammatical sentence uses frequent words.
- Compare the model’s accuracy on these “hard” pairs vs. pairs where the frequency helps the model.

As shown in Figure 1, the difference in accuracy between the top third (frequency helps) and bottom third (frequency hurts) is the Frequency Bias. A high bias score means the model is cheating: it’s looking at word frequency rather than grammar.
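A rough sketch of how such a pipeline could be implemented is shown below. The helper names and the per-sentence frequency aggregation are illustrative assumptions, not the paper's exact code:

```python
import numpy as np

def frequency_bias(pairs, token_freq, model_log_prob):
    """Estimate the frequency-bias metric described above.

    pairs          : list of (grammatical_tokens, ungrammatical_tokens) minimal pairs
    token_freq     : dict mapping token -> corpus frequency
    model_log_prob : callable that scores a token sequence with the model
    """
    records = []
    for good, bad in pairs:
        # Positive when the grammatical sentence uses the more frequent tokens.
        freq_diff = np.mean([token_freq.get(t, 1) for t in good]) - \
                    np.mean([token_freq.get(t, 1) for t in bad])
        correct = model_log_prob(good) > model_log_prob(bad)
        records.append((freq_diff, correct))

    # Sort pairs by how much frequency favours the grammatical sentence.
    records.sort(key=lambda r: r[0])
    third = len(records) // 3
    acc_bottom = np.mean([c for _, c in records[:third]])   # frequency works against the model
    acc_top = np.mean([c for _, c in records[-third:]])     # frequency works in the model's favour
    return acc_top - acc_bottom
```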
The Core Method: Syntactic Smoothing
The standard way to train a Language Model is to use a “one-hot” target distribution. If the correct next word is “cat,” the target vector has a 1.0 for “cat” and 0.0 for every other word in the dictionary. The model is penalized if it predicts anything other than “cat.”
The problem? If the correct word is a rare word (e.g., “ocelot”), the model barely gets any chances to learn it.
The authors propose Syntactic Smoothing. Instead of putting 100% of the target probability on the correct word, we distribute some of that probability to other words that play the same syntactic role.
If the target is “ocelot,” we shouldn’t just tell the model “it’s ocelot.” We should tell it: “It’s ocelot, but it’s also similar to ‘cat’, ’tiger’, and ‘animal’.” This allows the rare word “ocelot” to benefit from the learning signals of frequent words like “cat.”
Step 1: Defining Syntactic Similarity
To do this, we need to know which words are syntactically similar. The authors use a Part-of-Speech (POS) proxy.
They run a POS tagger over the training data and build a distribution for every token. For example, the word “blind” can be a Noun, Verb, or Adjective. The word “the” is almost exclusively a Determiner.
We can visualize this difference in distributions:

They then calculate the Syntactic Similarity between any two tokens \(i\) and \(j\) by taking the cosine similarity of their POS distribution vectors \(M\):

\[
\text{sim}(i, j) = \frac{M_i \cdot M_j}{\lVert M_i \rVert \, \lVert M_j \rVert}
\]

where \(M_i\) is the POS distribution vector of token \(i\).
This creates a static matrix telling us how grammatically similar every word is to every other word.
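A minimal sketch of this step might look as follows, assuming we already have a POS-tagged corpus. The function and variable names are ours:

```python
import numpy as np
from collections import Counter, defaultdict

def pos_similarity_matrix(tagged_corpus, vocab, pos_tags):
    """Build per-token POS distributions and their pairwise cosine similarities.

    tagged_corpus : iterable of (token, pos_tag) pairs, e.g. output of a POS tagger
    vocab         : list of tokens
    pos_tags      : list of possible POS tags
    """
    counts = defaultdict(Counter)
    for token, tag in tagged_corpus:
        counts[token][tag] += 1

    # M[i] is the POS distribution of vocab[i] over the tag set.
    M = np.zeros((len(vocab), len(pos_tags)))
    for i, token in enumerate(vocab):
        total = sum(counts[token].values())
        if total:
            M[i] = [counts[token][t] / total for t in pos_tags]

    # Pairwise cosine similarity between the POS distribution vectors.
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    M_hat = M / np.maximum(norms, 1e-12)
    return M_hat @ M_hat.T
```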
Step 2: Smoothing the Backpropagation Signal
Now, the authors modify the loss function. They introduce a smoothing parameter \(\alpha\).
- If \(\alpha = 0\), we do standard training (100% signal to the correct word).
- If \(\alpha > 0\), we reserve \((1 - \alpha)\) for the correct word, and distribute the remaining \(\alpha\) to all other words based on their syntactic similarity.
The new target distribution \(t_i\) for a vocabulary token \(i\) (given the correct target is \(j\)) is:

\[
t_i = \begin{cases} 1 - \alpha & \text{if } i = j \\[4pt] \alpha \cdot \dfrac{\text{sim}(i, j)}{\sum_{k \neq j} \text{sim}(k, j)} & \text{otherwise} \end{cases}
\]
To make the distribution sharper (so probability isn't smeared over too many words), they apply temperature scaling to the similarity scores before normalizing them.
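Below is a hedged sketch of how a smoothed target could be assembled for a single position. The softmax-with-temperature sharpening is one plausible reading of the temperature step; the paper's exact scaling may differ:

```python
import torch
import torch.nn.functional as F

def syntactic_smoothing_targets(sim_row, target_idx, alpha, temperature=0.1):
    """Build a smoothed target distribution for one position (illustrative sketch).

    sim_row : 1-D tensor of syntactic similarities between every vocab token
              and the correct token `target_idx`
    alpha   : fraction of probability mass shared with syntactic neighbours
    """
    sim = sim_row.clone()
    sim[target_idx] = float("-inf")            # exclude the correct token itself
    neighbours = F.softmax(sim / temperature, dim=-1)

    target = alpha * neighbours
    target[target_idx] = 1.0 - alpha           # reserve (1 - alpha) for the correct token
    return target

def smoothed_loss(logits, targets):
    """Cross-entropy against the smoothed (soft) target distribution."""
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```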
Step 3: Pacing
Giving “partial credit” to syntactically similar words is great for learning representations, but eventually, the model needs to learn to predict the specific correct word to be useful.
The authors introduce Pacing: they start training with a high value of \(\alpha\) (lots of smoothing) and linearly decrease it to 0 by the end of training. This serves as a “syntactic scaffold”—supporting the model early on and removing the support as the model matures.
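A linear pacing schedule is straightforward to sketch; the starting value below is a placeholder, not the paper's setting:

```python
def paced_alpha(step: int, total_steps: int, alpha_start: float = 0.3) -> float:
    """Linearly decay the smoothing weight from alpha_start to 0 over training."""
    progress = min(step / total_steps, 1.0)
    return alpha_start * (1.0 - progress)

# Example: alpha shrinks as training progresses and reaches 0 at the end.
for step in (0, 5_000, 10_000):
    print(step, round(paced_alpha(step, total_steps=10_000), 3))
```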
Experiments and Results
The researchers trained small RoBERTa models (approx. 125M parameters) on the BabyLM dataset (10 million tokens). They compared their Syntactic Smoothing (SyS) models against:
- Base Model: Standard RoBERTa training.
- Label Smoothing (LS): A standard technique where the signal is distributed uniformly across all other tokens, not based on syntax.
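For reference, uniform label smoothing replaces the one-hot target with a fixed mixture over the vocabulary of size \(V\); unlike Syntactic Smoothing, every incorrect token receives the same share:

\[
t_i = \begin{cases} 1 - \alpha & \text{if } i = j \\[4pt] \dfrac{\alpha}{V - 1} & \text{otherwise} \end{cases}
\]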
1. Reducing Frequency Bias
The primary goal was to stop the model from blindly trusting frequent words.

As seen in Figure 3, standard models (OPT, RoBERTa, T5) have high frequency bias (gray bars). The authors’ Base Model (brown) also has high bias (~9.8). However, the Syntactic Smoothing (SyS) variants (red bars) drastically reduce this bias, with the “Mid” smoothing level bringing it almost to zero. This confirms that the model is learning to rely less on raw frequency statistics.
2. Reducing Anisotropy
Did the geometric “cone” problem improve?

Figure 4 plots anisotropy over the course of training. The Base Model (dashed brown line) ends up with high anisotropy (~0.5). The Syntactic Smoothing models (solid lines) consistently maintain lower anisotropy.
Ideally, we want representations to be spread out. The Paced (SyS-P) models, which slowly turn off the smoothing, showed interesting behavior: even as the smoothing was removed, the anisotropy remained lower than the baseline.
We can zoom in on specific layers to see this effect more clearly:

Figure 5 compares the Base Model vs. the Paced Syntactic Smoothing model. Notice Layer 7 (solid lines): in the Base Model (black), anisotropy skyrockets, while in the SyS model (orange) it stays much lower. This suggests the final representations are richer and more distinct from one another.
3. The Link Between Bias and Anisotropy
One of the most interesting findings in the paper is the correlation between these two phenomena.

Figure 6 shows that models with high frequency bias tend to have high anisotropy. The Syntactic Smoothing models (red/orange markers) cluster in the bottom-left, indicating they have successfully mitigated both issues simultaneously. This supports the theory that “degenerate” representations of rare words are a root cause of frequency bias.
4. Alternative Similarity Metrics
You might wonder: is cosine similarity the best way to compare POS distributions? The authors also tested Jensen-Shannon (JS) divergence as an alternative measure of how similar two distributions are.

The results were robust regardless of the metric used. As shown in Table 2 below, using JS divergence (SyS [JS]) achieved similar reductions in bias and anisotropy compared to the cosine version.

Conclusion
The “rich get richer” dynamic of token frequency in Large Language Models is a significant hurdle for true language understanding. When models ignore rare words in favor of common ones, they fail to capture the long tail of human language.
This paper provides a compelling solution: Syntactic Smoothing. By baking a “syntactic prior” into the loss function, the researchers forced the model to group rare words with their frequent syntactic cousins.
Key Takeaways:
- Rare words matter: Standard training leaves rare words in a “degenerate” state (Anisotropy).
- Smoothing helps: Distributing the learning signal based on POS tags rescues these rare words.
- Bias and Geometry are linked: Reducing the geometric clustering of vectors (Anisotropy) directly reduces the model’s tendency to cheat using frequency (Frequency Bias).
While this study focused on smaller models and English data, the implications are broad. As we strive for models that don’t just memorize statistics but actually understand structure, linguistically informed training objectives like Syntactic Smoothing may become essential tools in the deep learning toolbox.