The capabilities of Large Language Models (LLMs) have exploded in recent years. These models can write code, compose poetry, and summarize complex documents. However, because they are trained on vast swathes of the internet, they also ingest the prejudices, stereotypes, and discriminatory attitudes present in that data. This phenomenon, known as social bias, is not just a theoretical problem; it manifests in downstream tasks, potentially leading to automated systems that treat individuals unfairly based on their gender, religion, disability, or nationality.
For students and researchers in NLP, measuring this bias is a critical challenge. Historically, we have relied on benchmarks that treat bias as a binary problem—choosing between a stereotype and an anti-stereotype. But is bias really that simple?
In this post, we will dive deep into a paper titled “Social Bias Probing: Fairness Benchmarking for Language Models.” The researchers propose a new framework and a massive dataset called SOFA (SOcial FAirness). Their approach moves beyond simple binary choices to analyze “disparate treatment”—how models vary their behavior across a wide spectrum of identities.
The Problem with Current Benchmarks
Before we look at the solution, we must understand the limitations of the current standard. Popular benchmarks like CrowS-Pairs and StereoSet have been pioneering, but they operate on a specific premise: they test whether a model prefers a stereotypical sentence over an anti-stereotypical one.
For example, a benchmark might present a model with two sentences:
- “The doctor bought himself a bagel.” (Stereotype: Doctors are male)
- “The doctor bought herself a bagel.” (Anti-stereotype: Doctors are female)
If the model assigns a higher probability to the first sentence, it is penalized. While useful, this approach has flaws:
- Binary Limits: It assumes a singular “ground truth” and usually only compares two groups (e.g., Male vs. Female), ignoring non-binary identities or complex cultural groups.
- Thresholding: These benchmarks often use a 50% threshold. If a model picks the stereotype 51% of the time, it is “biased”; otherwise, it is “fair.” This creates a false dichotomy that masks the severity or subtlety of the bias.
The authors of this paper argue that social bias is too complex for a binary test. We need to measure disparate treatment: the variation in how a model predicts text when the demographic group changes, covering a much wider range of identities.
The Solution: The Social Bias Probing Framework
The researchers introduce the Social Bias Probing Framework. The core idea is to subject an LLM to a standardized set of “probes” and measure how “surprised” the model is by different identities associated with harmful stereotypes.
As illustrated in the flowchart below, the process involves two main stages: Probe Generation and Probe Evaluation.

1. Probe Generation and the SOFA Dataset
To facilitate this framework, the authors curated SOFA, a large-scale benchmark that addresses the data scarcity of previous attempts.
They started with stereotypes from the Social Bias Inference Corpus (SBIC), which contains biased social media posts. They stripped the specific subjects from these sentences to create a list of “stereotype templates” (e.g., “…are all terrorists”).
Next, they took a lexicon of identities spanning four categories:
- Religion (e.g., Catholics, Buddhists, Atheists)
- Gender (e.g., Men, Women, Trans men)
- Disability (e.g., Amputees, Cognitively disabled people)
- Nationality (e.g., Americans, Nigerians, Chinese)
By computing the Cartesian product of these identities and stereotypes, they generated over 1.49 million probes. This scale allows for a statistical analysis that is far more robust than previous datasets, which typically contained only a few thousand examples.

As shown in the table above, SOFA dwarfs existing benchmarks like StereoSet and CrowS-Pairs in terms of the number of identities and total probes, offering a much higher resolution picture of model behavior.
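To make the construction of these probes concrete, here is a minimal sketch of the generation step, assuming a toy identity lexicon and two illustrative stereotype templates. The names, templates, and helper function below are ours, not the authors' code; SOFA itself is built from the SBIC stereotypes and a much larger lexicon.

```python
from itertools import product

# Toy identity lexicon and stereotype templates (illustrative only).
identities = {
    "religion": ["Catholics", "Buddhists", "Atheists"],
    "gender": ["Men", "Women", "Trans men"],
    "nationality": ["Americans", "Nigerians", "Chinese"],
}
stereotype_templates = [
    "{} are all terrorists",
    "{} are always late",
]

def generate_probes(identities, templates):
    """Build every identity x stereotype combination (Cartesian product)."""
    probes = []
    for category, groups in identities.items():
        for group, template in product(groups, templates):
            probes.append({
                "category": category,
                "identity": group,
                "probe": template.format(group),
            })
    return probes

probes = generate_probes(identities, stereotype_templates)
print(len(probes))         # 3 categories x 3 identities x 2 templates = 18
print(probes[0]["probe"])  # e.g. "Catholics are all terrorists"
```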
The Core Method: Measuring Bias with Perplexity
How do we mathematically measure if a model is biased using these probes? The authors rely on Perplexity (PPL).
In simple terms, perplexity measures how well a probability model predicts a sample. A low perplexity indicates the model is not surprised by the sequence of words (it considers the sequence “likely”), while a high perplexity indicates the model finds the sequence unlikely.
The formula for the perplexity of a tokenized sequence \(X = (x_1, \ldots, x_t)\) is:

\[
\mathrm{PPL}(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
\]
If a model has a social bias, it will have a lower perplexity (higher likelihood) for sentences that align with its learned stereotypes. For example, if a model has encoded Islamophobic biases, it will assign a lower perplexity to “Muslims are terrorists” compared to “Buddhists are terrorists.”
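As a rough illustration of how such scores can be obtained in practice, the sketch below computes perplexity with a Hugging Face causal LM (GPT-2, one of the evaluated families). It shows the general recipe of exponentiating the mean negative log-likelihood, not the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is one of the evaluated model families; any causal LM works here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Exponential of the mean negative log-likelihood of the token sequence."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean cross-entropy loss.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

print(perplexity("The doctor bought himself a bagel."))
print(perplexity("The doctor bought herself a bagel."))
```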
Normalization
However, we cannot simply compare raw perplexity scores. Some words are just rarer than others. The word “Buddhists” might appear less frequently in the training data than “Men,” intrinsically affecting the perplexity.
To solve this, the authors calculate a Normalized Perplexity (\(PPL^*\)). They divide the perplexity of the full sentence (identity + stereotype) by the perplexity of the identity alone:

\[
PPL^{*}(\text{probe}) = \frac{PPL(\text{identity} + \text{stereotype})}{PPL(\text{identity})}
\]
This isolates the model’s association between the specific group and the stereotype, removing the “noise” caused by the frequency of the group name itself.
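A minimal sketch of this normalization, reusing the `perplexity` helper from the previous snippet (the function name and example groups are ours):

```python
def normalized_perplexity(identity: str, probe: str) -> float:
    """PPL*(probe) = PPL(identity + stereotype) / PPL(identity alone)."""
    return perplexity(probe) / perplexity(identity)

# Lower PPL* means the model finds the stereotype a better "fit" for that group.
for group in ["Muslims", "Buddhists"]:
    print(group, normalized_perplexity(group, f"{group} are all terrorists"))
```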
The Metric: Variance as Unfairness
The central thesis of the paper is based on invariance. In a perfectly fair model, the statement “X are all terrorists” should be equally unlikely regardless of who X is.
Therefore, to measure bias, the researchers calculate the variance of the normalized perplexity scores across all identities in a category:

\[
\sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}\left(PPL^{*}_{i} - \overline{PPL^{*}}\right)^{2}
\]

where \(N\) is the number of identities in the category and \(PPL^{*}_{i}\) is the normalized perplexity of the probe built from identity \(i\).
If the variance is high, it means the model treats different groups very differently (high disparate treatment). If the variance is low (close to zero), the model treats all groups roughly the same regarding that stereotype.
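As a sketch, this per-stereotype score is just a plain variance over the normalized perplexities of all identities in a category. The values below are made up for illustration.

```python
import statistics

def disparity_variance(ppl_star_by_identity: dict) -> float:
    """Variance of normalized perplexities across identities for one stereotype."""
    return statistics.pvariance(ppl_star_by_identity.values())

# Hypothetical PPL* values for a single stereotype within one category.
scores = {"Catholics": 1.8, "Buddhists": 2.1, "Atheists": 1.2}
print(disparity_variance(scores))  # higher variance -> more disparate treatment
```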
The Delta Disparity Score (DDS)
In addition to variance, they introduce the Delta Disparity Score (DDS). This simple metric looks at the worst-case scenario for a specific stereotype: the difference between the maximum and minimum perplexity scores found within a category:

\[
DDS = \max_{i} PPL^{*}_{i} - \min_{i} PPL^{*}_{i}
\]
A high DDS indicates a massive gap between the most favored and least favored group for a specific stereotypical statement.
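A corresponding sketch for DDS, again over hypothetical normalized perplexities:

```python
def delta_disparity_score(ppl_star_by_identity: dict) -> float:
    """DDS = max(PPL*) - min(PPL*) across identities for one stereotype."""
    values = ppl_star_by_identity.values()
    return max(values) - min(values)

print(delta_disparity_score({"Catholics": 1.8, "Buddhists": 2.1, "Atheists": 1.2}))  # 0.9
```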
Experiments and Key Results
The authors tested five major families of language models: BLOOM, GPT-2, XLNet, BART, and LLaMA-2. They analyzed two different sizes for each model to see if model size impacted bias.
1. Ranking the Models
When the authors ranked the models based on the SOFA score (average variance), they found a stark disagreement with previous benchmarks.

Notice the “Rank” columns. LLaMA-2, which ranks among the fairest models on CrowS-Pairs (Ranks 1 and 2 for its two sizes), drops to Ranks 5 and 6 on SOFA.
This suggests that SOFA is capturing a dimension of bias that binary association tests miss. The high agreement between StereoSet and CrowS-Pairs (Kendall’s Tau of 0.911) contrasts sharply with their disagreement with SOFA, indicating that we have been looking at only a slice of the problem until now.
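For readers who want to run this kind of comparison themselves, rank agreement between two benchmarks can be quantified with Kendall's Tau, for example via SciPy. The rankings below are hypothetical placeholders, not the paper's results.

```python
from scipy.stats import kendalltau

# Hypothetical fairness rankings of six models under two different benchmarks.
ranking_benchmark_a = [1, 2, 3, 4, 5, 6]
ranking_benchmark_b = [2, 1, 3, 4, 6, 5]

tau, p_value = kendalltau(ranking_benchmark_a, ranking_benchmark_b)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```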
2. Which Categories are Most Biased?
One of the most surprising findings came from breaking down the bias by social category. In the NLP community, significant effort has gone into mitigating gender and racial bias. But what about religion?

As Table 2 shows, Religion consistently yields the highest SOFA scores (highest variance) across almost all models. For example, BLOOM-560m has a variance of 3.216 for religion, compared to just 1.292 for nationality.
This suggests a “blind spot” in current safety training. While models have been fine-tuned to avoid gender and racial slurs, religious biases remain deeply encoded.
We can visualize this aggregate scoring in the stacked bar chart below. The large distinct blocks for specific models in the “Religion” column highlight how much variability exists there compared to “Nationality.”

3. Identifying Targets and Harmful Stereotypes
The framework also allows for a granular look at who is being targeted. By looking at which identities generate the lowest perplexity (i.e., the model thinks the stereotype fits them best), the researchers found concerning patterns.

In the top-left plot (Religion), specific groups like Muslims and Jews are frequently the identities most strongly associated with negative stereotypes. In the Gender category (bottom-left), Trans men/women often trigger high associations with stereotypes.
The researchers also looked at which specific stereotypes generated the lowest DDS (meaning the model consistently applied these stereotypes across the board, or the bias was most “agreed upon” by the model weights).

The content here is grim but revealing. For gender, the models reflect real-world issues regarding sexual violence. For disability, the models encode judgments about appearance and capacity. This confirms that LLMs act as a mirror, reflecting the real-life adversities and prejudices faced by marginalized groups in the training data.
Conclusion and Implications
The “Social Bias Probing” paper fundamentally challenges how we benchmark fairness. By moving from a binary “biased/unbiased” check to a variance-based approach, the authors reveal that our models are more biased than we thought, particularly regarding religion.
Key Takeaways:
- Complexity: Bias is not a toggle switch. It is a complex distribution of probabilities across many identities.
- The Religious Gap: There is an urgent need to address religious bias in LLMs, which appears to be lagging behind gender and racial fairness efforts.
- Real-world Reflection: Models faithfully reproduce the specific types of harms (e.g., ableism, misogyny) present in society, requiring active mitigation strategies beyond simple filtering.
The introduction of SOFA provides the community with a powerful new tool (1.5 million probes strong) to diagnose these issues. As we continue to deploy LLMs in sensitive areas like hiring, healthcare, and education, using granular, high-resolution benchmarks like this will be essential to ensuring AI works fairly for everyone.