As Large Language Models (LLMs) become central to our digital interactions, the question of “safety” has moved from a theoretical concern to a practical necessity. We rely on these models not just to chat, but increasingly to evaluate the safety of other systems. This creates a recursive loop: AI is being used to police AI.

But this raises a fundamental question: Do LLMs actually understand safety the way humans do?

Safety is not an objective truth like mathematics; it is a social construct deeply influenced by culture, demographics, and personal experience. A conversation that seems benign to one group might be deeply offensive to another. If we use LLMs to automate safety labelling, are we baking in a specific worldview? Are we erasing minority perspectives?

In this post, we will dive into a recent paper titled “Annotation alignment: Comparing LLM and human annotations of conversational safety” by researchers at Cornell Tech and the University of Washington. We will explore how well LLMs align with human perceptions of safety, whether they exhibit demographic biases in their agreement, and if they possess the “social intelligence” to predict when different human groups will disagree.

The Problem of Subjectivity

To understand the core challenge, we first need to look at how safety data is created. Usually, human annotators read conversations and label them as “safe” or “unsafe.” However, humans rarely agree 100% of the time.

As illustrated in the infographic below, the researchers utilized the DICES dataset. This dataset is unique because it doesn’t just provide a single “gold standard” label. Instead, it captures the messiness of human consensus. 350 conversations were rated by 112 diverse annotators.

Figure 1: Human annotators disagree about what constitutes a safe chatbot (left). We study three questions around whether LLM annotators capture human annotation diversity (right): we measure safety annotation alignment with the average of 112 humans (RQ1) and with different annotator demographic groups (RQ2), and we evaluate whether GPT-4 can predict when one group rates a conversation more unsafe than another (RQ3).

As shown in Figure 1, the study breaks this problem down into three Research Questions (RQ):

  1. RQ1: Does the LLM agree with the “average” human?
  2. RQ2: Does the LLM agree more with specific demographic groups (e.g., White men vs. Black women)?
  3. RQ3: Can the LLM predict when these groups will disagree with each other?

Background: The DICES Dataset and Methodology

Before analyzing the results, let’s understand the data. The DICES-350 dataset consists of multi-turn conversations where crowdworkers deliberately tried to provoke the LaMDA chatbot into discussing sensitive topics (like race, gender, or politics).

Each conversation was then rated by 112 distinct annotators. These annotators checked for five specific criteria:

  1. Harm
  2. Unfair bias
  3. Misinformation
  4. Political affiliation
  5. Increased polarization

If an annotator flagged any of these, the conversation was marked as “Unsafe” (1). Otherwise, it was “Safe” (0).

The AI Annotators

The researchers tested five leading models: GPT-3.5, GPT-4, GPT-4o, Gemini 1.5 Pro, and Llama 3.1 405B.

To get the best possible performance out of these models, the researchers didn’t just ask “Is this safe?” They used a prompting strategy called Analyze-Rate. This is a Chain-of-Thought (CoT) technique where the model is instructed to first analyze the conversation according to the safety criteria and then provide a rating on a Likert scale (1-5). This forces the model to “show its work,” which typically improves the quality of the final judgment.

RQ1: The “Average” Human vs. The Machine

The first test was straightforward: How well does an LLM’s safety rating correlate with the consensus of the 112 human annotators?

The metric used here is the Pearson correlation (\(r\)). A score of 1.0 means perfect alignment, 0 means no correlation, and negative scores imply opposite judgments.

The results, presented in the table below, show a surprising victory for the models.

Table 1: Pearson correlations between LLM safety ratings and the average of the 112 annotators’ safety ratings.

As we can see in Table 1, GPT-4 achieves a correlation of \(r = 0.61\) (using the Analyze-Rate prompt). Llama 3.1 follows closely behind.

To put this number in perspective, the researchers calculated the correlation of the median individual human annotator against the group average. The median human only achieved an \(r = 0.51\).

Key Takeaway: GPT-4 is actually a better proxy for the “average human view” than a random single human is. If you had to pick one judge to represent the group consensus, you would statistically be better off choosing GPT-4 than picking one person at random from the crowd.

The Nature of Disagreement

While the correlation is strong, it isn’t perfect. The researchers performed a qualitative analysis to see why the models and humans disagreed.

  1. Where Humans are Stricter: Humans tended to rate conversations as “Unsafe” when the chatbot gave advice (medical, legal, or relationship). Even benign advice like “you should communicate with your partner” was often flagged by humans. GPT-4, conversely, often rated these as safe.
  2. Where Models are Stricter: GPT-4 was much more sensitive to hate speech and bias. In cases where a chatbot responded neutrally to a hateful user comment (e.g., “I see you feel that way”), humans often let it slide. GPT-4 marked these as unsafe, likely enforcing a higher standard that requires the chatbot to actively denounce hate speech.

RQ2: The Demographic Alignment Question

This is perhaps the most critical section of the paper regarding ethics. We know that different demographic groups perceive safety differently. For example, prior work on the DICES dataset showed that White male annotators were generally more likely to rate conversations as “safe” compared to other groups.

The fear is that LLMs might be “aligned” specifically with the majority demographic (often White, Western perspectives), thereby failing to recognize harms against marginalized groups.

To test this, the researchers calculated the correlation between GPT-4’s ratings and specific race-gender subgroups (e.g., Black Female, Asian Male, White Female).

They used a statistical technique involving null distributions (shown in grey in the chart below) to see if the correlation with any specific group was statistically significant, or just random noise.

Figure 2: GPT-4 does not align significantly more or less with different race-gender groups.

Figure 2 visualizes these results. The green dots represent the actual correlation between GPT-4 and a specific demographic group. The grey bars represent the “expected” range of correlations if you were to just grab a random group of people of the same size.

The Result: All the green dots fall within the grey bars (the 99% Confidence Intervals).

The Interpretation: This finding is nuanced. It does not prove that GPT-4 is perfectly neutral. Rather, it shows that the dataset is underpowered to detect the difference. The variations in agreement are not statistically significant enough to claim that GPT-4 prefers White annotators over Black annotators, or vice versa.

Furthermore, the researchers found that idiosyncratic variation (differences between individuals) was massive. The variation within a demographic group was often larger than the variation between groups. This suggests that simply knowing someone is an “Asian Male” or “Latinx Female” tells you very little about how well they will agree with an AI. Personal values and context matter much more than broad demographic labels.

RQ3: Can AI Predict Group Disagreements?

Even if the AI doesn’t perfectly align with one group, can it understand that groups differ? This concept relates to “Pluralism”—the ability of a model to reflect diverse viewpoints.

The researchers set up a “paired disagreement” task. They specifically looked for conversations where one demographic group (Group A) rated the conversation as significantly more unsafe than another group (Group B).

Mathematically, they defined a set of “high disagreement” conversations (\(\mathcal{D}\)) where the difference in mean ratings between two groups was greater than 0.2:

Equation defining the set of high disagreement conversations where the difference in group means is greater than 0.2.

Conversely, they defined a set of “low disagreement” or neutral conversations (\(\mathcal{A}\)) where the groups largely agreed (difference less than 0.03):

Equation defining the set of low disagreement conversations where the difference in group means is less than 0.03.

The test was simple: Can GPT-4 look at a conversation and predict which group would be more offended? If the model understands demographic nuances, it should predict higher disagreement scores for the conversations in set \(\mathcal{D}\) compared to set \(\mathcal{A}\).

The researchers calculated the difference in the model’s predictions for these two sets:

Equation comparing the mean GPT predictions for the high disagreement set versus the low disagreement set.

The Verdict: The result of this calculation was effectively zero.

GPT-4 failed to predict when demographic groups would disagree. It could not identify conversations that marginalized groups found unsafe but White annotators found safe. This indicates a significant gap in the model’s “Theory of Mind” regarding cultural safety norms. It treats safety as a monolith rather than a pluralistic concept.

Conclusion and Implications

This research provides a mixed report card for current LLMs.

On the positive side, models like GPT-4 and Llama 3.1 have become highly capable annotators. They align with the “average” human consensus better than most individual humans do. This validates the current industry practice of using LLMs to evaluate other LLMs (like the “Constitutional AI” approach used by Anthropic).

However, the limitations are stark:

  1. Blindness to Disagreement: The model cannot predict when safety is contested. It acts as if there is only one “correct” safety score, failing to capture the reality that different communities have different boundaries.
  2. Data Limitations: We still don’t have enough high-quality data to firmly prove or disprove demographic alignment bias. The researchers emphasize that future datasets need to be larger and capture attributes beyond just race and gender—perhaps focusing on political values or past experiences with online harm.

For students and practitioners entering the field of AI safety, this paper highlights a crucial lesson: High correlation with the “average” is not the end of the story. A model can be statistically accurate on average while still completely failing to understand the specific safety needs of minority groups. As we build the next generation of conversational agents, moving from “average alignment” to “pluralistic alignment” will be the next great challenge.