Introduction

Imagine you are scrolling through Twitter (now X). You see a thread where User A makes a comment about diet and exercise. It seems harmless enough. But then, User B replies angrily, claiming User A is body-shaming them. User A, confused, replies, “I didn’t mean to offend you; I was just sharing what my doctor told me.”

In the world of Natural Language Processing (NLP), detecting the offense in User A’s original post is incredibly difficult. It doesn’t contain swear words, racial slurs, or explicit threats. The offense is unintended and implicit, relying entirely on the context and the receiver’s interpretation.

Most current AI models are trained on “intended affective datasets.” These are collections of texts gathered using specific keywords like “#hate” or explicit profanity. While effective at catching trolls and bullies, these models often fail to capture the subtle misunderstandings that fuel everyday online conflict.

In a recent paper titled “Leveraging Conflicts in Social Media Posts: Unintended Offense Dataset”, researchers from National Tsing Hua University propose a novel solution. Instead of searching for hate speech, they searched for conflict resolution. By finding tweets where people say “I didn’t mean to offend,” they worked backward to identify the text that caused the problem. This blog post explores their methodology, the creation of the Unintended Offense Dataset, and why understanding human conflict is the next frontier for safe online spaces.

The Problem: Intention vs. Perception

To understand why this research is necessary, we first need to look at the limitations of existing offensive language detection.

Traditionally, datasets are built by focusing on the sender’s intention. Researchers scrape social media for known hate symbols, slurs, or hashtags like #bully. This assumes that offense is objective and explicitly stated. However, communication is a two-way street.

  1. Subjectivity: Offensiveness is often in the eye of the beholder. A statement might be neutral to the sender but deeply offensive to the receiver based on their lived experience.
  2. Context: A sentence like “You don’t know one thing about me” might be neutral in isolation but aggressive in a specific conversation.
  3. Unintended Offense: Often, people offend others without meaning to. These instances are rarely captured in datasets like OLID (Offensive Language Identification Dataset) or HateEval because they lack the explicit markers of hate speech.

The researchers argue that we need to shift our focus from “intra-conflict” (internal inconsistency) to “inter-conflict”—the disagreements that happen between people.

The Core Method: Mining Inter-Conflict

The genius of this paper lies in its data collection strategy. Rather than looking for the offense itself, the researchers looked for the aftermath of the offense.

The Anatomy of a Conflict

The team identified a specific conversational pattern that signals an unintended offense. They call this “Response Cue” collection.

Figure 1: Inter-conflict in the conversation.

As shown in Figure 1, a typical conflict thread involves four distinct components:

  1. Context Post: The conversation starter (e.g., User A complaining about knee pain).
  2. Target Post: The potentially offensive reply (e.g., User B suggesting weight loss). Note that the intention here is labeled “Not Offensive” by the sender.
  3. Follow-up Post: The reaction from the offended party (e.g., User A feeling judged). This signals the “Possible Perceived” offense.
  4. Response Cue Post: The sender clarifies their intent (e.g., “Didn’t mean to offend you”).

The Response Cue Post is the golden ticket. It acts as a retrospective label. If someone says “I didn’t mean to offend,” it implies two things:

  • The previous message was interpreted as offensive.
  • The offense was unintended (implicit).

By querying for phrases like “didn’t mean to offend,” “no offense intended,” or “sorry if that sounded rude,” the researchers could identify the Target Post—the implicit offense—without needing to search for slurs.
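As a rough sketch of what that querying step might involve (the phrase list and helper function below are illustrative assumptions, not the authors' exact templates):

```python
import re

# Illustrative response-cue templates; the paper's exact query list may differ.
RESPONSE_CUES = [
    r"didn[’']?t mean to offend",
    r"no offense (was )?intended",
    r"sorry if (that|this) sounded rude",
]
CUE_PATTERN = re.compile("|".join(RESPONSE_CUES), re.IGNORECASE)

def is_response_cue(text: str) -> bool:
    """Return True if the tweet contains a response-cue phrase."""
    return CUE_PATTERN.search(text) is not None

print(is_response_cue("I didn't mean to offend you, honestly."))  # True
print(is_response_cue("Great weather today."))                    # False
```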

The Data Collection Framework

Collecting these threads isn’t as simple as a keyword search. Social media data is noisy. The researchers developed a rigorous pipeline to ensure they were capturing genuine interpersonal conflicts rather than random noise.

Figure 2: Overall framework

Figure 2 outlines the complete framework, which operates in several stages:

1. Querying and Phase 1 Filtering

The process begins with the Twitter API, using sentence templates (the response cues) to find potential conflicts. However, simply finding the phrase “didn’t mean to offend” isn’t enough. They applied two critical filters immediately:

  • Quotation Filter: Removes tweets that are quoting someone else’s apology.
  • Ambiguous Pronoun Filter: Removes tweets clarifying someone else’s intent (e.g., “He didn’t mean to offend you”). The goal is to capture the speaker’s own conflict.
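A minimal sketch of how these two filters could be implemented; the regexes and function names are my assumptions rather than the paper's actual rules:

```python
import re

CUE = re.compile(r"didn[’']?t mean to offend|no offense intended", re.IGNORECASE)

def passes_quotation_filter(text: str) -> bool:
    """Reject tweets where the cue appears inside quotation marks,
    i.e. the author is quoting someone else's apology."""
    quoted_spans = re.findall(r'["“].*?["”]', text)
    if any(CUE.search(span) for span in quoted_spans):
        return False
    return CUE.search(text) is not None

THIRD_PERSON = re.compile(r"\b(he|she|they)\s+didn[’']?t\s+mean\s+to", re.IGNORECASE)

def passes_pronoun_filter(text: str) -> bool:
    """Reject tweets clarifying someone else's intent,
    e.g. 'He didn't mean to offend you.'"""
    return THIRD_PERSON.search(text) is None

print(passes_pronoun_filter("He didn’t mean to offend you, relax."))   # False
print(passes_quotation_filter('She wrote “no offense intended” lol'))  # False
```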

2. Conversation Reconstruction

Once a valid cue is found, the system reconstructs the thread. It traces back from the apology (Response Cue) to find the Target Post and the Context Post. This provides the full picture of the interaction.
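A rough sketch of this reconstruction step, assuming each post record carries an id, an author, and the id of the post it replies to (the Post fields and helper below are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Post:
    post_id: str
    author: str
    text: str
    reply_to: Optional[str] = None  # id of the parent post; None for the thread root

def reconstruct_thread(cue_post: Post, index: dict) -> list:
    """Walk the reply chain upward from the Response Cue post to the root,
    returning the thread in chronological order
    (Context -> Target -> Follow-up -> Response Cue)."""
    thread = [cue_post]
    current = cue_post
    while current.reply_to is not None and current.reply_to in index:
        current = index[current.reply_to]
        thread.append(current)
    return list(reversed(thread))

posts = [
    Post("1", "A", "My knees hurt all the time."),
    Post("2", "B", "Maybe try losing a bit of weight?", reply_to="1"),
    Post("3", "A", "Wow, you don't know one thing about me.", reply_to="2"),
    Post("4", "B", "Didn't mean to offend you!", reply_to="3"),
]
index = {p.post_id: p for p in posts}
print([p.author for p in reconstruct_thread(posts[-1], index)])  # ['A', 'B', 'A', 'B']
```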

3. Conversation Dynamic Filter

This is a sophisticated step designed to handle the chaos of multi-turn threads. In a long thread, if someone says “I didn’t mean it,” it can be unclear which previous post they are referring to.

The researchers defined a regex-based filter to enforce a strict dialogue structure: Y+(X)Y+X$.

  • X is the author of the apology (and the offensive post).
  • Y is the offended party.

This filter ensures the conversation strictly alternates or follows a pattern where the offender (X) speaks, the offended party (Y) reacts, and the offender (X) apologizes. Threads that didn’t fit this clean “conflict-reaction-resolution” structure were discarded to maintain data quality.
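The filter translates almost directly into code. The sketch below maps each post's author to X or Y and applies the paper's pattern; the encoding helper itself is my assumption:

```python
import re

DIALOGUE_PATTERN = re.compile(r"Y+(X)Y+X$")  # pattern from the paper

def passes_dynamic_filter(authors: list, apologizer: str) -> bool:
    """Encode the author sequence as X/Y and check the conflict structure:
    offended party (Y) -> offender (X, the Target Post) ->
    offended party (Y) -> offender's apology (X, the Response Cue)."""
    sequence = "".join("X" if a == apologizer else "Y" for a in authors)
    return DIALOGUE_PATTERN.match(sequence) is not None

# Example thread: A (context) -> B (target) -> A (follow-up) -> B (apology)
print(passes_dynamic_filter(["A", "B", "A", "B"], apologizer="B"))  # True
print(passes_dynamic_filter(["A", "B", "C", "B"], apologizer="B"))  # True (C is also treated as Y)
print(passes_dynamic_filter(["B", "A", "B"], apologizer="B"))       # False -> discarded
```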

4. Phase 2 Filtering

Finally, the data goes through a hygiene check:

  • Language Filter: English only.
  • URL Filter: Posts with links are removed. This ensures the offense is contained within the text, not in a linked video or image.
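A quick sketch of this hygiene check, assuming the langdetect package for language identification and a simple URL regex (the paper's actual tooling may differ):

```python
import re
from langdetect import detect  # third-party package, used here as an example tool

URL_RE = re.compile(r"https?://\S+")

def passes_phase2(text: str) -> bool:
    """Keep only English posts that contain no links, so the offense
    must live in the text itself rather than in a linked image or video."""
    if URL_RE.search(text):
        return False
    try:
        return detect(text) == "en"
    except Exception:  # detection can fail on very short or empty posts
        return False
```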

Human Annotation

After filtering, the researchers were left with 4,027 high-quality threads. They randomly sampled 2,401 conversations and used Amazon Mechanical Turk (AMT) for annotation.

Crucially, they asked annotators to:

  1. Roleplay as the Receiver: Assume the perspective of the person being replied to (User A).
  2. Rate Offensiveness: On a scale of 0 to 100.
  3. Context-Awareness: Rate the post both with and without the surrounding context.

This yielded a dataset richer in subtle, implicit offense than anything currently available.

Analyzing the Dataset

How does this “Unintended Offense Dataset” compare to standard benchmarks like OLID, Founta, or Waseem? The analysis reveals three major advantages: higher implicitness, lower topic bias, and the crucial role of context.

1. Capturing Implicit Offense

Implicit offense is subtle. It’s “micro-aggression” rather than “aggression.” The researchers measured the percentage of offensive messages that were implicit (i.e., lacking explicit profanity or slurs).

Table 2: The results of the implicitness measurement.

As shown in Table 2, the new dataset (“Ours”) achieves an implicitness score of 74.40%. Compare this to the Founta dataset (22.13%) or OLID (37.90%).

  • Why this matters: Datasets like Waseem or OLID rely on biased sampling (searching for specific bad words). This naturally results in datasets full of explicit offense. By searching for conflict cues instead, the new method captures the subtle digs that fly under the radar of keyword filters.
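Setting the paper's exact operationalization aside, the intuition behind the measurement can be sketched in a few lines: count the share of offensive posts that contain no term from an explicit-offense lexicon (the tiny lexicon below is a placeholder assumption):

```python
import re

# Placeholder lexicon; a real measurement would use a proper profanity/slur wordlist.
EXPLICIT_LEXICON = {"idiot", "stupid", "hate"}

def is_implicit(text: str) -> bool:
    """True if the post contains no explicit marker from the lexicon."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return tokens.isdisjoint(EXPLICIT_LEXICON)

def implicitness(offensive_posts: list) -> float:
    """Share of offensive posts that are implicit (no profanity or slurs)."""
    if not offensive_posts:
        return 0.0
    return sum(is_implicit(p) for p in offensive_posts) / len(offensive_posts)

posts = ["You clearly never exercise.", "You are an idiot."]
print(f"{implicitness(posts):.2%}")  # 50.00%
```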

2. Reducing Topic Selection Bias

One of the biggest flaws in offensive language datasets is topic bias. If you build a dataset by searching for racial slurs, your dataset will be heavily skewed toward discussions about race. If you search for political insults, it will be skewed toward politics.

Because the proposed method searches for apologies (“I didn’t mean to…”), it is topic-agnostic. People apologize for misunderstandings about sports, gaming, cooking, and coding just as often as they do about politics.

To prove this, the researchers compared the topic distribution of their dataset against a reference dataset (Founta) constructed via random sampling (which represents a “natural” distribution of topics). They calculated the Topic Selection Bias using Cosine Distance.

First, they aggregated the topic vectors of the reference dataset into a single average vector \(v^{ref}\) (Equation 1):

\[
v^{ref} = \frac{\sum_{i=1}^{|D^{ref}|} v_i^{ref}}{|D^{ref}|} \tag{1}
\]

Then, they did the same for each target dataset to obtain \(v^{tar}\) (Equation 2):

\[
v^{tar} = \frac{\sum_{i=1}^{|D^{tar}|} v_i^{tar}}{|D^{tar}|} \tag{2}
\]

Finally, they calculated the bias as the cosine distance between the two aggregated vectors (Equation 3):

\[
Bias(D^{tar}) = 1 - \frac{v^{ref} \cdot v^{tar}}{||v^{ref}|| \times ||v^{tar}||} \tag{3}
\]

A lower score means the dataset’s topics look more like “real life” and less like a curated list of controversial subjects.
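In code, Equations 1–3 reduce to averaging per-document topic vectors and taking a cosine distance. A minimal NumPy sketch (how the topic vectors are obtained, e.g. from a topic model, is outside this snippet):

```python
import numpy as np

def topic_bias(ref_vectors: np.ndarray, tar_vectors: np.ndarray) -> float:
    """Topic selection bias = cosine distance between the mean topic vectors
    of the reference dataset and the target dataset (Equations 1-3)."""
    v_ref = ref_vectors.mean(axis=0)   # Equation 1
    v_tar = tar_vectors.mean(axis=0)   # Equation 2
    cosine_sim = v_ref @ v_tar / (np.linalg.norm(v_ref) * np.linalg.norm(v_tar))
    return 1.0 - cosine_sim            # Equation 3

rng = np.random.default_rng(0)
ref = rng.dirichlet(np.ones(10), size=500)   # toy per-document topic distributions
tar = rng.dirichlet(np.ones(10), size=300)
print(round(topic_bias(ref, tar), 3))        # close to 0 for similar topic mixes
```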

Table 3: Topic selection bias comparison.

Table 3 confirms the hypothesis. The new dataset has a bias score of 0.063, significantly lower than Kumar (0.280) or Waseem (0.151). This means models trained on this data won’t learn to associate specific innocent topics (like “Muslim” or “feminism”) with offense merely because they appear frequently in the training data.

3. The Influence of Context

Does context make things worse or better? The researchers compared offensiveness ratings when annotators saw only the target tweet versus when they saw the full conversation.

They calculated the difference \(\delta\) between the rating with context (\(R_{con}\)) and without (\(R_{uncon}\)):

\[
\delta = R_{con} - R_{uncon}
\]

Figure 3: The proportion of positive and negative influence on implicitly offensive tweets with offensiveness \(\geq T\).

Figure 3 illustrates a fascinating trend. The blue bars represent “positive influence” (context makes it more offensive), and orange represents “negative influence” (context makes it less offensive).

As the offensiveness threshold (\(T\)) rises—meaning we look at more severely offensive tweets—the blue bars dominate. This suggests that for unintended/implicit offense, context usually makes the offense clearer and more severe. Without context, a rude remark might just look like a statement of fact.
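The analysis behind Figure 3 can be sketched as follows, assuming each annotated tweet stores a with-context and a without-context rating (the field names are hypothetical):

```python
def context_influence(ratings: list, threshold: float) -> tuple:
    """For tweets rated >= threshold (with context), return the share where
    context raised the rating vs. the share where it lowered it."""
    deltas = [r["with_context"] - r["without_context"]
              for r in ratings if r["with_context"] >= threshold]
    if not deltas:
        return 0.0, 0.0
    positive = sum(d > 0 for d in deltas) / len(deltas)   # context made it more offensive
    negative = sum(d < 0 for d in deltas) / len(deltas)   # context softened it
    return positive, negative

ratings = [
    {"with_context": 80, "without_context": 30},
    {"with_context": 60, "without_context": 70},
    {"with_context": 90, "without_context": 55},
]
print(context_influence(ratings, threshold=50))  # (0.667, 0.333) roughly
```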

Experiments and Results

The researchers validated their dataset by testing whether current AI models could learn from it. They ran experiments using BERT (fine-tuning) and GPT-4 (zero-shot and few-shot prompting).

RQ1: Can existing models detect Unintended Offense?

First, they trained BERT models on existing datasets (OLID and Founta) and tested them on the new Unintended Offense dataset.
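A hedged sketch of this cross-dataset setup using Hugging Face transformers; the model checkpoint, hyperparameters, and data format below are assumptions, not the paper's exact configuration:

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

def cross_dataset_f1(train_data, test_data):
    """Fine-tune on one dataset (e.g. OLID) and evaluate Macro F1 on another
    (e.g. the unintended-offense threads). Items are {"text": ..., "label": 0/1}."""
    train_ds = Dataset.from_list(train_data).map(tokenize, batched=True)
    test_ds = Dataset.from_list(test_data).map(tokenize, batched=True)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_ds,
    )
    trainer.train()
    preds = np.argmax(trainer.predict(test_ds).predictions, axis=-1)
    return f1_score(test_ds["label"], preds, average="macro")
```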

Table 4: The BERT offense classification results, tested on Founta (-) and Ours (50+), \(N = 262/262\).

The top rows of Table 4 show the results.

  • OLID-trained model: Achieved a Macro F1 score of only 0.501.
  • Founta-trained model: Achieved a Macro F1 score of 0.458.
  • Recall is the problem: The recall for the “Offensive” class is only 0.260 and 0.141, respectively. The models missed the vast majority of unintended offenses, classifying them as safe.

Conclusion: Models trained on traditional, keyword-heavy datasets are practically blind to unintended offense.

RQ2: Does the new data improve detection?

Next, they included the new data in the training process.

  • BERT: When fine-tuned on the new dataset (“Ours”), the Macro F1 score jumped to 0.802 (labeled data) and even 0.843 when using a larger set of unlabeled data.
  • GPT-4: They tested GPT-4 with “Few-Shot” prompting (giving the AI a few examples before asking it to classify).

Table 5: The GPT offense classification results, tested on Founta (-) and Ours (50+), \(N = 262/262\).

Table 5 shows the GPT-4 results:

  • Zero-Shot: F1 score of 0.468 (performs poorly without help).
  • 5-Shot (using Founta examples): F1 score of 0.565.
  • 5-Shot (using “Ours” examples): F1 score of 0.728.

The jump from 0.565 to 0.728 is massive. It proves that showing GPT-4 just five examples of “unintended offense” (inter-conflict) significantly helps it understand the concept, far more than showing it examples of standard hate speech.
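A sketch of what such few-shot prompting looks like with the openai Python client; the prompt wording and the demonstration examples are illustrative, not the ones used in the paper:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = [
    # Illustrative (context, reply, label) demonstrations drawn from the training split.
    ("My knees ache every morning.", "Maybe lose some weight first?", "Offensive"),
    ("What editor do you use?", "I mostly use Vim.", "Not offensive"),
]

def classify(context: str, target: str, model: str = "gpt-4") -> str:
    """Few-shot prompt: show labeled conflict examples, then ask about a new reply."""
    messages = [{
        "role": "system",
        "content": ("Decide whether the reply would be perceived as offensive by the "
                    "person it responds to. Answer 'Offensive' or 'Not offensive'."),
    }]
    for ctx, reply, label in FEW_SHOT:  # the few-shot demonstrations
        messages.append({"role": "user", "content": f"Context: {ctx}\nReply: {reply}"})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f"Context: {context}\nReply: {target}"})
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    return response.choices[0].message.content.strip()
```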

Conclusion and Implications

The paper “Leveraging Conflicts in Social Media Posts” highlights a blind spot in current AI safety measures. By focusing so heavily on what is said (explicit slurs), we have neglected how it is received (implicit offense).

The researchers successfully demonstrated that:

  1. Conflict Cues work: Searching for “I didn’t mean to offend” is a viable way to mine high-quality, subtle offensive data.
  2. Context is King: Unintended offense relies heavily on the conversation history, which traditional datasets ignore.
  3. Topic Neutrality: This method produces data that represents a wider slice of life, reducing the bias that plagues current AI models.

For students of NLP and Data Science, this paper offers a valuable lesson: Data collection strategies define model behavior. If you train a model only on slurs, it will only detect slurs. To build AI that truly understands human communication, we must look beyond the words themselves and analyze the complex dynamics of conflict, perception, and resolution.

The Unintended Offense Dataset is a significant step toward AI that understands not just what we say, but how it makes others feel.