Imagine you are scrolling through a social media feed and you encounter a comment about a sensitive political topic. You might shrug it off as a harmless opinion. Your friend, however, might find it deeply offensive. Now, imagine a third person reading that same comment from a cafe in Cairo, a subway in Tokyo, or a living room in São Paulo.

Would they agree on whether that sentence is “toxic”?

For years, the field of Natural Language Processing (NLP) has treated tasks like offensive language detection as objective problems with a single “ground truth.” We collect labels from annotators, take the majority vote, and train an AI to predict that label. But what if the disagreement between annotators isn’t noise to be filtered out? What if it’s the most important signal we have?

In the paper “D3CODE: Disentangling Disagreements in Data across Cultures on Offensiveness Detection and Evaluation,” researchers from Google Research and the DAIR Institute challenge the assumption that there is a universal standard for offensiveness. They introduce a massive new dataset that maps how our cultural backgrounds and internal moral compasses dictate what we find unacceptable.

For students of AI and data science, this paper is a masterclass in why “who” annotates your data matters just as much as “what” they are annotating.

Figure 1: The distribution of labels provided by annotators from different countries. Annotators from China, Brazil, and Egypt provided significantly different labels.

The Subjectivity Problem in NLP

Before diving into the new method, let’s establish the context. Content moderation systems—the AI filters that hide toxic comments on platforms like YouTube or Instagram—rely on supervised learning. This means humans read thousands of comments and label them (e.g., “Toxic” or “Not Toxic”).

Historically, researchers have ignored the identity of these human annotators. If three people say a comment is toxic and two say it isn’t, the comment is labeled “toxic,” and the two dissenting voices are treated as errors.
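To see what this standard pipeline looks like in practice, here is a minimal sketch of majority-vote aggregation (the comment ID and the labels are invented for illustration, not taken from any real dataset):

```python
from collections import Counter

# Hypothetical per-comment annotations: 1 = "Toxic", 0 = "Not Toxic".
annotations = {
    "comment_42": [1, 1, 1, 0, 0],  # three "toxic" votes, two dissenters
}

def majority_vote(labels):
    """Collapse several annotator labels into a single 'ground truth' label."""
    most_common_label, _ = Counter(labels).most_common(1)[0]
    return most_common_label

gold_labels = {cid: majority_vote(votes) for cid, votes in annotations.items()}
print(gold_labels)  # {'comment_42': 1} -- the two dissenting votes simply vanish
```

By the time a model sees the training data, the two dissenting annotators have left no trace at all.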

However, recent studies have shown that these “errors” often fall along demographic lines. What a 50-year-old man finds acceptable might be deeply hurtful to a 20-year-old woman. Furthermore, most prior research has been “WEIRD”—focused on Western, Educated, Industrialized, Rich, and Democratic populations.

The D3CODE paper argues that to build truly global AI, we need to look beyond simple demographics like age and gender. We need to understand Culture (the social norms of where we live) and Morality (our internal values).

The D3CODE Dataset: A Global Undertaking

To capture this complexity, the authors undertook a massive data collection effort. They didn’t just recruit random people from the internet; they curated a pool of 4,309 annotators across 21 countries and 8 geo-cultural regions.

A Diverse Annotator Pool

The researchers moved beyond the standard “North America vs. Europe” comparison. They actively recruited from the Arab Culture region, the Indian Cultural Sphere, the Sinosphere (East Asia), Latin America, Sub-Saharan Africa, and Oceania.

As shown in the table below, they ensured a robust sample size (~500+ people) for each region, while also balancing for gender and age. This level of granularity allows us to ask specific questions: Does a young person in the “Sinosphere” view authority differently than an older person in “Western Europe”?

Table 1: Demographic distribution of annotators from each region.

Defining the Regions

Defining culture is difficult. To make the data manageable, the authors grouped countries into broader regions based on cultural similarities (loosely based on UN groupings). This provides a framework to analyze macro-level cultural trends.

Table 5: List of regions and countries within them in our dataset.

The Annotation Task

The participants were asked to annotate social media comments selected from the well-known Jigsaw Toxicity dataset. The comments weren’t chosen arbitrarily; the researchers deliberately favored sentences known to be difficult or controversial, split into three types of content:

  1. Random: A baseline sample of items known to cause disagreement.
  2. Moral Sentiment: Sentences that trigger moral reasoning.
  3. Social Group Mentions: Comments mentioning race, religion, gender, or sexual orientation.

Crucially, the annotators didn’t just rate the sentences. They also took a psychological survey.

The “Why”: Integrating Moral Foundations Theory

This is where the paper innovates significantly. Collecting demographic data (age, country) is standard. Collecting psychological profiles is rare.

The authors used the Moral Foundations Questionnaire (MFQ-2) to profile every annotator. Moral Foundations Theory suggests that human morality isn’t one-dimensional; it rests on several pillars:

  • Care: Protecting others from harm.
  • Equality: Treating people equally.
  • Proportionality: Rewarding people based on merit.
  • Authority: Respecting tradition and hierarchy.
  • Loyalty: Standing with your group/family.
  • Purity: Avoiding spiritual or physical degradation.

By asking annotators questions like the one below, the researchers could calculate a “moral profile” for each person.

Figure 5: Sample of MFQ-2 questions in our survey

This allows the analysis to move from “Annotators in India disagreed with annotators in the US” to “Annotators who value Purity disagreed with annotators who value Care.”
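As a rough illustration of what such a “moral profile” could look like computationally, the sketch below averages hypothetical MFQ-2 ratings per foundation for each annotator. The table layout, column names, and scores are invented assumptions; this is not the authors’ pipeline.

```python
import pandas as pd

# Hypothetical long-format MFQ-2 responses: one row per annotator per item.
# 'foundation' is one of Care, Equality, Proportionality, Authority, Loyalty,
# Purity; 'rating' is a Likert-style agreement score (assumed here to be 1-5).
responses = pd.DataFrame({
    "annotator_id": ["a1", "a1", "a1", "a2", "a2", "a2"],
    "foundation":   ["Care", "Purity", "Authority", "Care", "Purity", "Authority"],
    "rating":       [5, 2, 3, 4, 5, 5],
})

# A simple "moral profile": the mean rating per foundation for each annotator,
# giving one row per annotator and one column per foundation.
moral_profiles = (
    responses
    .groupby(["annotator_id", "foundation"])["rating"]
    .mean()
    .unstack("foundation")
)
print(moral_profiles)
```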

Analysis: What Drives Disagreement?

The results of the study expose just how varied human perception is. The researchers found that disagreement isn’t random—it’s systematic and driven by identity.

1. The “I Don’t Understand” Phenomenon

One of the most overlooked aspects of data annotation is confusion. Usually, if an annotator skips a question, it’s discarded. In D3CODE, the researchers analyzed who was skipping questions.

They found distinct patterns in who selected “I do not understand this message.”

  • Gender: Women and non-binary individuals were more likely than men to report that they did not understand a message.
  • Age: People over 50 were significantly more likely to report not understanding the social media comments (which often contain slang or internet-specific context).
  • Region: Interestingly, native English speakers (in Oceania, North America, and the UK) were more likely to mark messages as confusing than participants from regions such as Arab Culture or Sub-Saharan Africa.

Figure 2: The likelihood of an annotator not understanding the message, grouped based on their sociodemographic information.

This suggests that uncertainty is not evenly distributed. If we simply delete “confused” responses, we might be systematically silencing the perspectives of older populations or specific genders, biasing the model toward the understanding of young men.
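A simple way to surface this pattern, assuming a hypothetical annotation log with demographic columns and a flag for the “I do not understand” option, is to compute the rate of that response within each group:

```python
import pandas as pd

# Hypothetical annotation log: one row per (annotator, item) judgment, with a
# flag for whether the annotator chose "I do not understand this message".
annotations = pd.DataFrame({
    "annotator_id":       ["a1", "a1", "a2", "a3", "a3", "a3"],
    "gender":             ["woman", "woman", "man", "non-binary", "non-binary", "non-binary"],
    "age_group":          ["50+", "50+", "18-29", "30-49", "30-49", "30-49"],
    "did_not_understand": [True, False, False, True, False, False],
})

# Share of "I do not understand" responses within each demographic group.
for column in ["gender", "age_group"]:
    rates = annotations.groupby(column)["did_not_understand"].mean()
    print(f"\nConfusion rate by {column}:")
    print(rates.sort_values(ascending=False))
```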

2. Moral Values Cross Borders

When the researchers clustered annotators based on their MFQ-2 scores, they found that moral values don’t perfectly align with borders. While certain regions lean toward specific values, you can find people with similar moral profiles all over the world.

The chart below shows how participants from different regions fall into different “Moral Clusters.” For example, Cluster 0 (the bottom bar in each stack) consists of people who reported very high agreement with all moral foundations. This cluster is heavily populated by participants from the Indian Cultural Sphere (red) and Arab Culture (blue).

In contrast, other clusters with different moral priorities have higher representation from Western Europe or North America. This suggests that while region is a reasonable proxy for culture, looking at values gives us a much higher-resolution picture of each annotator.

Figure 3b: Distribution of participants from different regions across different moral clusters.
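To give a sense of how such moral clusters might be derived, here is a sketch that standardizes per-annotator foundation scores and applies k-means. The synthetic data and the choice of four clusters are arbitrary assumptions for illustration, not necessarily what the authors used:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-annotator moral profiles:
# 300 annotators x 6 foundations, each score in [1, 5].
rng = np.random.default_rng(0)
moral_profiles = rng.uniform(1, 5, size=(300, 6))

# Standardize the foundation scores, then cluster annotators by moral profile.
# The number of clusters (4) is an arbitrary choice for this sketch.
X = StandardScaler().fit_transform(moral_profiles)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Cross-tabulating these cluster labels against each annotator's region would
# yield the kind of stacked-bar breakdown shown in Figure 3b.
print(np.bincount(clusters))  # number of annotators in each moral cluster
```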

3. The Content Matters: Social Groups Spark Conflict

The study found that we don’t disagree equally on all topics. The researchers compared disagreement levels across the three content types: Random, Moral, and Social Group Mentions.

The data reveals that Social Group Mentions (comments about religion, race, sexuality, etc.) create the highest level of disagreement between regions.

In the visualization below, look at plot (a). The red line represents items mentioning social groups. It is shifted to the right, indicating higher cross-regional disagreement compared to random items (blue).

Plot (b) breaks this down further. You can see that topics like Muslim, LGB, and Christian generate significantly higher disagreement between regions (measured as the standard deviation of regional scores) than abstract concepts like “Fairness” or “Care.”

Figure 4: Disagreement between regions on items from each category.

This is a critical insight for AI safety. It means that models trained on Western data regarding LGBTQ+ issues or religious commentary may completely fail to align with the values of users in the Middle East or Latin America, and vice versa.
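One simple way to quantify this kind of cross-regional disagreement, assuming a hypothetical table of per-region mean scores for each item, is the standard deviation across regions per item (all column names and numbers below are invented):

```python
import pandas as pd

# Hypothetical table of per-region mean offensiveness scores for each item,
# along with the item's content category. All values are invented.
ratings = pd.DataFrame({
    "item_id":            ["i1", "i1", "i1", "i2", "i2", "i2"],
    "region":             ["Arab Culture", "Sinosphere", "Western Europe"] * 2,
    "category":           ["social_group"] * 3 + ["random"] * 3,
    "mean_offensiveness": [0.90, 0.20, 0.40, 0.30, 0.35, 0.30],
})

# Cross-regional disagreement per item: the standard deviation of the
# per-region mean scores. A higher value means regions disagree more.
disagreement = (
    ratings.groupby(["item_id", "category"])["mean_offensiveness"]
    .std()
    .rename("cross_region_std")
    .reset_index()
)

# Comparing the distribution of disagreement across content categories
# mirrors the comparison in Figure 4(a).
print(disagreement.groupby("category")["cross_region_std"].describe())
```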

4. Global Heatmap of Offensiveness

Finally, the researchers looked at the raw “offensiveness” scores by country. Do some countries simply find everything more offensive?

The answer appears to be yes—or at least, the threshold for what counts as “offensive” varies wildly. As shown in the graph below, countries like Egypt, Brazil, and India (top of the list) tended to rate comments as highly offensive much more frequently. Conversely, annotators from China, Singapore, and Australia (bottom of the list) had much lower average offensiveness scores.

Figure 6: Distribution of the different labels provided by annotators of different countries.

This variance could be due to translation nuances, cultural communication styles (direct vs. indirect), or differing social norms regarding what language is considered taboo.
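Producing this kind of country-level ranking is straightforward in principle; the sketch below averages invented offensiveness ratings per country (the numbers and the column layout are illustrative assumptions only):

```python
import pandas as pd

# Hypothetical raw labels: one row per rating, with the annotator's country
# and a numeric offensiveness score (assumed here to be scaled to [0, 1]).
labels = pd.DataFrame({
    "country":       ["Egypt", "Egypt", "Brazil", "China", "Australia", "Singapore"],
    "offensiveness": [0.90, 0.70, 0.80, 0.10, 0.20, 0.15],
})

# Ranking countries by their average offensiveness rating gives the kind
# of ordering described for Figure 6.
country_means = (
    labels.groupby("country")["offensiveness"]
    .mean()
    .sort_values(ascending=False)
)
print(country_means)
```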

5. Concrete Examples of Cultural Clash

To make these statistics real, let’s look at the actual sentences that caused the most disagreement. The table below shows instances where regions were completely split.

Take the second example: “Does pointing out that a growing majority of Americans support adultery… change God’s law an iota?”

  • Rated Offensive by: Arab Culture, Indian Cultural Sphere, Latin America, North America.
  • Rated Not Offensive by: Oceania, Sinosphere, Sub-Saharan Africa, Western Europe.

This highlights how a statement invoking religious law and moral judgment is perceived as a valid point in some cultures and an offensive attack in others. A binary “Toxic/Not Toxic” label fails to capture this reality.

Table 3: Instances with highest disagreement across regions.
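As a sketch of how one might surface such “split” items, the snippet below uses the region-level verdicts from the adultery example above as made-up input and scores how evenly the regions divide:

```python
import pandas as pd

# Hypothetical region-level verdicts for the adultery example above:
# 1 = "Offensive", 0 = "Not Offensive" (e.g., each region's majority label).
region_labels = pd.DataFrame({
    "item_id": ["example_2"] * 8,
    "region":  ["Arab Culture", "Indian Cultural Sphere", "Latin America",
                "North America", "Oceania", "Sinosphere",
                "Sub-Saharan Africa", "Western Europe"],
    "label":   [1, 1, 1, 1, 0, 0, 0, 0],
})

# Items whose mean region-level label sits near 0.5 are the ones where
# regions split most evenly -- the "cultural clash" cases of Table 3.
split = region_labels.groupby("item_id")["label"].agg(["mean", "count"])
split["evenness"] = 1 - (split["mean"] - 0.5).abs() * 2  # 1.0 = perfect 50/50 split
print(split.sort_values("evenness", ascending=False))
```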

Conclusion and Implications

The D3CODE paper serves as a wake-up call for the NLP community. As Large Language Models (LLMs) are deployed globally, the “one-size-fits-all” approach to safety and offensiveness is no longer viable.

Key Takeaways for Students:

  1. Data is Subjective: In tasks like toxicity detection, the annotator’s background is a feature, not a bug.
  2. Demographics \(\neq\) Destiny: While where you live matters, your internal moral values (Care, Purity, Authority) are strong predictors of how you perceive language.
  3. Context is King: Disagreement peaks when discussing specific social groups (religion, gender). This is where models are most at risk of failing specific user bases.

The researchers conclude that we need to move toward pluralistic AI models—systems that can understand and respect diverse cultural perspectives rather than enforcing a single, dominant worldview. The D3CODE dataset provides the necessary testbed to start building these culturally aware systems. By disentangling the disagreements in our data, we can start to see the human beings behind the labels.