Introduction

In the rapidly evolving world of Natural Language Processing (NLP), we often view Large Language Models (LLMs) as static repositories of knowledge. We train them, we freeze them, and we use them. But the data fueling these models—particularly data scraped from social media platforms like X (formerly Twitter)—is anything but static. It is a living, breathing, and often turbulent stream of human consciousness.

We know that social media usage continues to grow rapidly, with hundreds of millions of new users joining each year. We also know that these platforms can be breeding grounds for social biases. This raises a critical, uncomfortable question for the AI community: If we continuously train language models on an ever-growing stream of social media data, are we inadvertently amplifying social biases over time?

A fascinating research paper, Evaluating Short-Term Temporal Fluctuations of Social Biases in Social Media Data and Masked Language Models, tackles this exact problem. The researchers investigated whether models trained on chronological snapshots of social media data become more prejudiced against specific demographic groups as time goes on.

This post will walk you through their methodology, their novel use of “TimeLMs,” and their surprising results regarding how bias fluctuates—or doesn’t—over the years.

Background: The Static vs. Dynamic View

Before diving into the experiments, let’s establish the context. Masked Language Models (MLMs), such as BERT or RoBERTa, are designed to predict missing words in a sentence. For example, if given the input:

“The doctor picked up [MASK] chart.”

The model calculates the probability of various words filling that mask. A gender-biased model might assign a higher probability to “his” than to “her,” reflecting a societal stereotype found in its training data (namely, that doctors are men).
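To make this concrete, here is a minimal sketch of masked-token prediction with the Hugging Face transformers library; the roberta-base checkpoint and the sentence are illustrative choices, not the paper’s exact setup.

```python
# A minimal sketch of masked-token prediction with an off-the-shelf MLM.
# "roberta-base" is an illustrative checkpoint, not the paper's setup.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")

# RoBERTa's mask token is "<mask>"; the leading space in the targets matters
# for its BPE vocabulary.
for pred in fill("The doctor picked up <mask> chart.", targets=[" his", " her"]):
    print(f"{pred['token_str']}: {pred['score']:.4f}")
```

Comparing the two probabilities gives a quick, informal read on which completion the model prefers.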

The Problem of “Overall” Bias

Previous research has extensively documented these static biases. We have benchmarks like CrowS-Pairs and StereoSet designed to measure them. However, most studies treat bias as a fixed attribute of a model. They ask, “Is BERT biased?” rather than “Is BERT becoming more biased?”

This paper argues that because societal norms, cultural shifts, and major events (like the Black Lives Matter movement or political elections) change the discourse on social media, the biases encoded in models trained on that data should theoretically shift as well.

Core Method: Capturing Time in a Bottle

To test this hypothesis, the researchers needed two things: a way to slice time into segments and a robust metric to measure bias.

1. The Models: TimeLMs

The researchers utilized TimeLMs, a set of language models based on the RoBERTa architecture. Unlike a standard model trained once on a massive static dataset, TimeLMs are trained on diachronic (time-specific) data.

The team analyzed models trained on data from X (Twitter) spanning a two-year period from 2020 to 2022. This period was chosen not only for data availability but because it represents a highly volatile time in global social discourse. They used snapshots of corpora collected quarterly, ensuring that the model for “June 2021” was specifically influenced by the language and sentiments of that time.
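If you want to experiment with these snapshots yourself, the TimeLMs checkpoints are published on the Hugging Face Hub under the cardiffnlp organization; the model ID below is my recollection of one quarterly release, so verify the exact name on the Hub.

```python
# Loading a quarterly TimeLMs snapshot (checkpoint name assumed from the public
# TimeLMs release; confirm it on the Hugging Face Hub before relying on it).
from transformers import AutoModelForMaskedLM, AutoTokenizer

snapshot = "cardiffnlp/twitter-roberta-base-jun2021"
tokenizer = AutoTokenizer.from_pretrained(snapshot)
model = AutoModelForMaskedLM.from_pretrained(snapshot)
```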

2. The Metric: AULA (All Unmasked Likelihood with Attention weights)

How do we mathematically quantify “bias”? The researchers employed a metric called AULA.

AULA is sophisticated because it doesn’t just look at raw probability; it considers the attention weights of the model—essentially, which parts of the sentence the model focuses on when making a decision. This makes the metric robust against frequency biases (where a model might just prefer common words regardless of context).

The calculation happens in two steps. First, they calculate the Pseudo Log-Likelihood (PLL) of a sentence, weighted by attention. This score tells us how “preferred” or “natural” a sentence seems to the model.

\[ \mathrm{PLL}(S) = \frac{1}{|S|} \sum_{i=1}^{|S|} \alpha_i \log P(s_i \mid S; \theta) \]

In this equation, \(P(s_i | S; \theta)\) is the probability of a token, and \(\alpha_i\) represents the attention weights.
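Here is a rough, self-contained sketch of such an attention-weighted PLL. It is my interpretation of the description above rather than the authors’ implementation: \(\alpha_i\) is approximated as the average attention a token receives, averaged over layers and heads.

```python
# A rough sketch of an attention-weighted pseudo log-likelihood, based on the
# description above (not the authors' code). alpha_i is approximated as the
# average attention each token receives, averaged across layers and heads.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base", output_attentions=True)
mlm.eval()

def attention_weighted_pll(sentence: str) -> float:
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        out = mlm(**enc)
    # Log-probability of each observed token given the full (unmasked) sentence.
    log_probs = torch.log_softmax(out.logits[0], dim=-1)
    token_log_probs = log_probs[torch.arange(enc.input_ids.size(1)), enc.input_ids[0]]
    # Attention averaged over layers and heads, then over query positions,
    # giving one weight per token (alpha_i in the equation above).
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # (seq_len, seq_len)
    alpha = attn.mean(dim=0)
    # Skip the special tokens (<s> and </s>) at the sentence boundaries.
    return float((alpha[1:-1] * token_log_probs[1:-1]).mean())
```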

Next, AULA compares pairs of sentences: one stereotypical and one anti-stereotypical. For example:

  • Stereotype: “Women are too emotional for leadership.”
  • Anti-stereotype: “Men are too emotional for leadership.”

The AULA score is the percentage of sentence pairs for which the model prefers the stereotypical sentence over the anti-stereotypical one.

\[ \text{AULA score} = \frac{100}{N} \sum_{(S_{st},\, S_{at})} \mathbb{I}\big[\mathrm{PLL}(S_{st}) > \mathrm{PLL}(S_{at})\big] \]

Here, \(N\) is the number of sentence pairs and \(\mathbb{I}\) is the indicator function.

  • Score = 50: The model is neutral.
  • Score > 50: The model is biased toward stereotypes.
  • Score < 50: The model is biased toward anti-stereotypes.
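To make the pairwise comparison concrete, the sketch below reuses the attention_weighted_pll function from the previous block; the example pair mirrors the one above and is purely illustrative.

```python
# Sketch of the bias score described above: the percentage of sentence pairs for
# which the model assigns a higher attention-weighted PLL to the stereotypical
# sentence. Uses attention_weighted_pll from the earlier sketch.
def aula_bias_score(pairs):
    preferred = sum(
        attention_weighted_pll(stereo) > attention_weighted_pll(anti)
        for stereo, anti in pairs
    )
    return 100.0 * preferred / len(pairs)

pairs = [
    ("Women are too emotional for leadership.",
     "Men are too emotional for leadership."),
]
print(aula_bias_score(pairs))  # 50 = neutral, >50 = stereotype-leaning
```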

3. Measuring Bias in the Training Data

The researchers didn’t just look at the models; they looked at the raw data itself. They wanted to know if the training corpora (the tweets) contained inherent biases against specific demographics.

To do this, they used Sentiment Analysis as a proxy. They defined a “Negativity Score” for a demographic group. If tweets containing words associated with a specific group (e.g., “female,” “woman”) were consistently classified as negative by a sentiment classifier, that group was considered to be the target of bias.

Equation for Data Negativity Score

Here, \(S_n(x)\) represents the number of negative tweets containing a specific term \(x\). A score above 50 implies a higher association with negative sentiment.
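A hedged sketch of how such a score could be computed is shown below; the exact formula, the term-matching heuristic, and the choice of sentiment classifier are my assumptions, not details confirmed by the paper.

```python
# A hedged sketch of a data-side negativity score: the share of sentiment-bearing
# tweets containing a group's terms that a classifier labels negative. The formula
# and the sentiment model are assumptions, not the paper's exact setup.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

def negativity_score(tweets, group_terms):
    matching = [t for t in tweets if any(w in t.lower() for w in group_terms)]
    if not matching:
        return 50.0  # no evidence either way
    labels = [r["label"].lower() for r in sentiment(matching)]
    negative = labels.count("negative")
    positive = labels.count("positive")
    # A score above 50 means tweets mentioning this group skew negative.
    return 100.0 * negative / max(negative + positive, 1)
```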

Experiments & Results

The team evaluated the TimeLMs using two famous benchmarks: CrowS-Pairs and StereoSet. These datasets cover various bias types, including race, gender, religion, and disability.

Finding 1: The “Overall” Bias Trap

The most immediate finding was deceptively reassuring. When looking at the Overall Bias Score (an average of all bias types), the models appeared relatively stable over the two years.

However, when the researchers broke the results down by category, a chaotic picture emerged.

Social bias scores across time for different types of biases computed using the AULA metric.

Look closely at the graphs above.

  • The Blue Line (Bias Score): This represents the overall average. It is relatively flat, hovering near the 50 mark.
  • The Colored Lines: These represent specific biases (e.g., Religion, Disability, Sexual Orientation). They are highly volatile.

In the CrowS-Pairs graph (left), notice the green line representing Disability. It stays consistently high (around 65-70), indicating strong, persistent stereotyping. Meanwhile, Religion (gray line) and Race (pink line) fluctuate significantly.

In StereoSet (right), Religion (purple line) shows a dramatic increase, rising from a score of 51 to 63 between 2020 and 2022.

Key Takeaway: Relying on a single “bias score” is dangerous. A model might appear neutral on average while harboring intense, fluctuating prejudices against specific groups.

Statistical Significance

To ensure these fluctuations weren’t just random noise, the researchers performed statistical bootstrapping.

Statistical analysis of bias types

The table above highlights that while “Overall Bias” has a low standard deviation (SD), specific categories like Race-color (SD 5.77) and Religion (SD 5.30) are highly unstable over time. This confirms that temporal fluctuations are a real phenomenon for specific demographics.
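As a rough illustration (the paper’s exact resampling unit is an assumption on my part), bootstrapping here could mean resampling the benchmark’s sentence pairs with replacement and recomputing the bias score each time, reusing aula_bias_score from the earlier sketch.

```python
# A minimal bootstrap sketch (the resampling unit is an assumption): resample the
# benchmark's sentence pairs with replacement and recompute the bias score to see
# how much it moves by chance alone.
import random

def bootstrap_sd(pairs, n_resamples=1000, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        resample = [rng.choice(pairs) for _ in pairs]
        scores.append(aula_bias_score(resample))  # from the earlier sketch
    mean = sum(scores) / len(scores)
    return (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
```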

Finding 2: Correlations Between Biases

Does being biased against one group mean the model is biased against others? The researchers used Pearson correlation matrices to find out.

Pearson correlation coefficient of each pair of bias types.

The results were mixed across datasets:

  • In CrowS-Pairs (left), there is a strong positive correlation (0.73) between Race Color and Gender. If the model was biased on race, it was likely biased on gender.
  • However, interestingly, there was a negative correlation (-0.81) between Race Color and Sexual Orientation.
  • In StereoSet (right), Religion was highly correlated with Gender and Profession.

This inconsistency suggests that biases are not monolithic; they are complex and dataset-dependent.
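If you want to reproduce this kind of cross-category analysis, a Pearson correlation matrix takes only a few lines of pandas; the scores below are random placeholders standing in for per-snapshot AULA scores, not the paper’s numbers.

```python
# Toy sketch of a cross-category Pearson correlation matrix. The values are
# random placeholders for per-snapshot AULA scores, not the paper's data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = pd.DataFrame(
    rng.uniform(45, 70, size=(8, 3)),  # 8 quarterly snapshots, 3 bias categories
    columns=["race_color", "gender", "religion"],
)
print(scores.corr(method="pearson"))
```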

Finding 3: Biases in the Raw Data

Perhaps the most revealing part of the study was the analysis of the training data itself. By analyzing the sentiment of tweets containing demographic terms, the researchers uncovered deep-seated disparities.

To perform this analysis, they used specific word lists. For example, here are the terms they used to track racial bias:

Table 4: The lists of words representing different demographic groups related to race bias.

And the terms used for gender bias:

Table 3: The list of words associated with the female demographic, used to evaluate gender bias in the corpus.

Using these word lists, they plotted the “Negativity Score” of the training data over time.

Figure 3: Social biases in data associated with different demographic groups.

This figure is striking:

  • Graph (a) Gender: The green line (Male) is consistently much lower than the blue dashed line (Female). This means that throughout 2020-2022, tweets mentioning men were consistently more positive (or less negative) than tweets mentioning women.
  • Graph (b) Race: While both Black (orange) and White (green) groups show high bias scores, the Black demographic consistently scores higher on negativity.
  • Graph (c) Religion: There is a significant gap between Christians (green) and Jewish people (orange), with the latter often associated with higher negativity scores in the data.

This confirms that the models aren’t hallucinating bias; they are faithfully learning it from a source that consistently prefers males and exhibits fluctuating hostility toward other groups.

Historical Perspective: The Long View

To see if this was just a modern social media phenomenon, the researchers also applied their method to COHABERT, a model trained on historical texts spanning from 1810 to 2000.

Social bias scores across time for different types of biases computed using the AULA metric for COHABERT models.

The historical analysis (shown above) confirms the pattern. Over 190 years, specific biases (like sexual orientation in yellow-green) swing wildly depending on the decade, likely reflecting the changing legal and social status of these groups throughout history.

Conclusion & Implications

This paper provides a crucial “reality check” for the development of Large Language Models.

  1. Stability is an Illusion: Just because a model’s “overall” bias score looks stable doesn’t mean it is safe. Beneath the surface, biases against specific groups (like religious minorities or the LGBTQ+ community) can spike rapidly based on the training data’s timeline.
  2. Data Reflects Society: The analysis of the raw X (Twitter) corpora shows a persistent preference for “Male” contexts over “Female” contexts that hasn’t budged much in two years, despite social movements.
  3. Granular Evaluation is Mandatory: We cannot rely on aggregate metrics. Before deploying a model, engineers must evaluate it against specific, individual demographic categories.

As we move toward models that are updated in real-time or trained on continuous streams of data, understanding these temporal fluctuations becomes a safety-critical issue. We are building mirrors of humanity, and as this research shows, the reflection changes every day.