Sentiment analysis has come a long way from simply classifying a movie review as “positive” or “negative.” In the era of granular data analytics, we are interested in Aspect-Based Sentiment Analysis (ABSA). We don’t just want to know if a user is happy; we want to know what they are happy about, which specific feature they like, and what opinion words they used.
This brings us to the Sentiment Quadruple, a structured set of four elements:
- Target: The object (e.g., “iPhone 13”).
- Aspect: The component (e.g., “battery life”).
- Opinion: The expression (e.g., “lasts all day”).
- Sentiment: The polarity (e.g., “Positive”).
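In code, a quadruple is just a small record. A minimal sketch (the field names and `frozen` dataclass are my own choices, not from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SentimentQuadruple:
    target: str     # the object, e.g. "iPhone 13"
    aspect: str     # the component, e.g. "battery life"
    opinion: str    # the expression, e.g. "lasts all day"
    sentiment: str  # the polarity: "POS", "NEG", or "NEU"

quad = SentimentQuadruple("iPhone 13", "battery life", "lasts all day", "POS")
```

Freezing the dataclass makes instances hashable, which is convenient later when we want to treat a group of quadruples as an unordered set.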
Extracting these quadruples from clean, short sentences is largely a solved problem. But extracting them from dialogues—messy, multi-turn conversations with interruptions, slang, and topic shifts—is a massive challenge.
In this post, we are doing a deep dive into the research paper “Overcome Noise and Bias: Segmentation-Aided Multi-Granularity Denoising and Debiasing for Enhanced Quadruples Extraction in Dialogue.” This paper proposes a sophisticated framework called SADD to handle the two biggest enemies of dialogue analysis: Noise and Order Bias.
The Core Problems: Noise and Bias
Before we look at the solution, we have to understand why existing “Generative Methods” (models that generate the output text sequence) fail in dialogue scenarios.
1. The Noise Problem
Dialogues are chatty. Unlike a formal review, people in a conversation might say, “Oh, I see,” or digress into irrelevant details. These extraneous words are “noise.”
In a deep learning model, attention mechanisms look at everything. If a user says, “I didn’t buy the Samsung because the screen was dim, but I bought the Xiaomi because the price was low,” a naive model might get confused by the proximity of words. It might hallucinate that the “Xiaomi” has a “dim screen.”
2. The Order Bias Problem
This is a more subtle, mathematical problem. When we train a model to generate a list of quadruples, we have to force the quadruples into a sequence.
Imagine a sentence contains two facts:
- Fact A: (iPhone, screen, good, POS)
- Fact B: (Samsung, price, high, NEG)
The model must output them one by one. But which one first? A or B? In reality, the order doesn’t matter—it’s a set of facts. However, training data usually has a fixed order (often just the order they appear in the sentence).
If the model is always trained to output A then B, it starts to learn a false causal relationship. It thinks the generation of Fact B depends on Fact A, or that there is a semantic reason for the order. This is Order Bias, and it hurts the model’s ability to generalize.
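The set-versus-sequence mismatch is easy to see concretely. In this illustrative sketch, both orderings of the two facts are equally correct linearizations, yet conventional training rewards only one:

```python
from itertools import permutations

facts = {
    ("iPhone", "screen", "good", "POS"),
    ("Samsung", "price", "high", "NEG"),
}

# Every permutation is an equally valid linearization of the same set...
valid_targets = list(permutations(sorted(facts)))

# ...but conventional training fixes a single canonical order,
# so the model is punished for producing any of the others.
canonical = valid_targets[0]
```

With two facts there are only 2 valid orders; with n facts there are n!, and penalizing all but one of them is exactly the Order Bias described above.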

As shown in Figure 1, noise (orange text) distracts the model, while order bias (yellow boxes) forces the model to learn dependencies that don’t exist.
The Solution: SADD
The researchers propose SADD (Segmentation-Aided Multi-Granularity Denoising and Debiasing). It’s a mouthful, but the architecture is elegant. It attacks the problem on two fronts:
- Denoising via a module called MGDG (Multi-Granularity Denoising Generation).
- Debiasing via a module called SOBM (Segmentation-aided Order Bias Mitigation).
Let’s break down the architecture.

As seen in Figure 2, the model processes the dialogue through an Encoder, passes it through labeling and segmentation modules to clean it up, and then uses a Decoder to generate the final quadruples.
Part 1: Killing the Noise (MGDG)
The “Multi-Granularity” in MGDG means the model cleans noise at two levels: the Word Level and the Utterance (Sentence) Level.
Word-Level Denoising: Sequence Labeling
First, the model looks at every single word in the dialogue. It assigns a label to each word: Is this a Target? An Aspect? An Opinion? Or is it None?
This is a classic sequence labeling task. By calculating the probability that a specific word is “None,” the model learns which words are likely irrelevant noise. This probability map (\(P\)) is saved for later use in the decoder.
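A minimal sketch of the word-level idea (the tag set and plain softmax here are my simplifications, not the paper's exact implementation): per-token tag logits are softmaxed, and the “None” column gives the noise probability map \(P\):

```python
import numpy as np

TAGS = ["Target", "Aspect", "Opinion", "None"]

def none_probability_map(tag_logits: np.ndarray) -> np.ndarray:
    """tag_logits: (n_tokens, n_tags). Returns P(tag == "None") per token."""
    shifted = tag_logits - tag_logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    return probs[:, TAGS.index("None")]
```

A token whose logits favor "Target" gets a low None-probability; a filler word like “Oh” gets a high one, and that score is what the decoder later uses to discount it.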
Utterance-Level Denoising: Topic-Aware Dialogue Segmentation (TADS)
This is where the method gets innovative. In a long conversation, people switch topics. If we are currently extracting sentiment about “Battery Life,” sentences discussing “Customer Service” are effectively noise.
Existing methods try to compare every sentence to every other sentence to find topics, which is computationally heavy and often inaccurate in complex chats. The SADD method uses Topics as a Bridge.
- Identify Topics: The model treats words identified as “Targets” (from the word-level step) as potential Topics.
- Cross-Attention: It uses a cross-attention mechanism where the Topics act as Queries and the Utterances act as Keys/Values.
- Clustering: It predicts which utterances relate to which topic.
This creates Topic Masks. If Utterance 1 and Utterance 3 are about “iPhone,” they get grouped. If Utterance 2 is about “Samsung,” it gets its own group. When the model tries to extract info about the iPhone, it applies the mask to completely hide Utterance 2.
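The topics-as-a-bridge step can be sketched as dot-product cross-attention followed by a hard assignment (a simplification: the paper's module is learned, and the hard argmax clustering here is my assumption):

```python
import numpy as np

def topic_masks(topic_emb: np.ndarray, utt_emb: np.ndarray) -> np.ndarray:
    """topic_emb: (n_topics, d) queries; utt_emb: (n_utts, d) keys.
    Returns one boolean mask per topic over the utterances."""
    scores = topic_emb @ utt_emb.T           # cross-attention scores (n_topics, n_utts)
    assignment = scores.argmax(axis=0)       # best-matching topic id per utterance
    return np.stack([assignment == k for k in range(topic_emb.shape[0])])
```

With two topic vectors and three utterances, utterances whose embeddings align with topic 0 are grouped under its mask while the off-topic utterance is excluded, mirroring the iPhone/Samsung grouping above.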
Denoised Attention: Putting it Together
How does the model use this info? It modifies the Cross-Attention mechanism in the final Decoder. Standard attention allows the model to look anywhere. Denoised Attention restricts the view.
The researchers introduce a mathematical adjustment to the attention weights: each key position is gated, and the surviving weights are renormalized.

\[
r_j = \hat{P}_j \cdot m_j^{(k)}, \qquad
w'_i = \frac{r_i \, w_i}{\sum_{j} r_j \, w_j}
\]

In this equation:
- \(\hat{P}_j\) is the probability that a word is meaningful (derived from word-level labeling).
- \(m_j^{(k)}\) is the Topic Mask (derived from utterance-level segmentation).
- \(r_j\) becomes a “gate.” If a word is noisy OR if the sentence is off-topic, \(r_j\) goes to 0.
- \(w'_i\) is the new attention weight.
Effectively, the model is blindfolded to noise at both the word and sentence levels, forcing it to focus only on relevant data.
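The gating logic above can be sketched in a few lines (my own simplification; in the paper this sits inside the decoder's cross-attention):

```python
import numpy as np

def denoised_attention(weights: np.ndarray,
                       p_meaningful: np.ndarray,
                       topic_mask: np.ndarray) -> np.ndarray:
    """weights: standard attention weights over key positions.
    p_meaningful: per-word probability of being meaningful (from labeling).
    topic_mask: 1.0 for on-topic utterances, 0.0 otherwise.
    The gate r zeroes out noisy or off-topic positions, then renormalizes."""
    r = p_meaningful * topic_mask
    gated = weights * r
    return gated / gated.sum()
```

An off-topic position receives exactly zero attention no matter how high its raw score was, and the remaining mass is redistributed over the relevant positions.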
Part 2: Eliminating Order Bias (SOBM)
Now that the input is clean, we must address the output generation. The researchers provide a fascinating theoretical proof that Order Bias comes from a gap between the Ideal training objective and the Actual training objective.
The Mathematical Gap
In an ideal world, we would train the model using Maximum Likelihood Estimation (MLE) on all possible valid orders of the quadruples.

Ideally, for a set of quadruples \(S\), the loss function should consider the sum of probabilities of all permutations \(\Pi(S)\):

\[
\mathcal{L}_{\text{ideal}} = -\log \sum_{\pi \in \Pi(S)} p(\pi \mid x)
\]
However, in practice, we usually pick one specific order \(y \in \Pi(S)\) (e.g., appearance order) and train on that alone. This is the Actual objective:

\[
\mathcal{L}_{\text{actual}} = -\log p(y \mid x)
\]
The paper proves that the gap \(\Delta = \mathcal{L}_{\text{actual}} - \mathcal{L}_{\text{ideal}}\) between these two objectives is not zero. This gap represents the bias: the model is penalized for outputting correct quadruples in the “wrong” order, which confuses it.
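A toy numeric check makes the gap tangible. Suppose the model assigns probability 0.5 to the canonical order A→B and 0.3 to the reversed order B→A (numbers invented purely for illustration):

```python
import math

p_ab, p_ba = 0.5, 0.3            # model probability of each linearization

ideal = -math.log(p_ab + p_ba)   # credit any valid order of the set
actual = -math.log(p_ab)         # credit only the canonical order

gap = actual - ideal             # strictly positive whenever p_ba > 0
```

The gap vanishes only if the model puts zero mass on every non-canonical order, which is precisely the false dependency the training is teaching it.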

The Solution: Segmentation-Aided Augmentation
Since we cannot easily change the loss function to handle every permutation (that creates a “one-to-many” mapping problem that makes training hard to converge), the researchers decided to change the Data instead.
They propose SOBM (Segmentation-aided Order Bias Mitigation). The idea is to create an Augmented Dataset where the input dialogues are rearranged, and the target labels are reordered.
But you can’t just shuffle words in a sentence; that destroys meaning. You can, however, shuffle Reply Threads.

As shown in Figure 5, dialogues are often tree-structured. If Thread A (Utterances 1, 2, 3) and Thread B (Utterances 1, 4, 5) branch off from the same root but discuss different things, they are semantically independent.
The Strategy:
- Use the dialogue segmentation (from Part 1) to find these independent threads.
- Rearrange the threads in the input to create new, semantically equivalent dialogues (\(\hat{x}\)).
- Pair these new inputs with different permutations of the output quadruples (\(y\)).
By flooding the training data with these variations, the data distribution (\(p_{aug}\)) changes. The paper proves mathematically that under this augmented distribution, the gap between Ideal and Actual training objectives shrinks to approximately zero.
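The strategy above can be sketched with plain itertools (this ignores the shared root utterance and any sampling the paper may apply, and it enumerates every combination, which explodes combinatorially on real data):

```python
from itertools import permutations

def augment(threads, quads):
    """threads: independent reply threads (each a list of utterances, in order).
    quads: the gold quadruples for the dialogue.
    Returns (rearranged dialogue, reordered labels) pairs: each thread keeps
    its internal order, but threads and labels are freely permuted."""
    examples = []
    for thread_order in permutations(threads):
        dialogue = [utt for thread in thread_order for utt in thread]
        for quad_order in permutations(quads):
            examples.append((dialogue, list(quad_order)))
    return examples
```

Because utterances never move within a thread, every augmented dialogue stays semantically equivalent to the original while the model sees the same quadruple set in many output orders.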

This means the model finally understands that order doesn’t matter, solving the bias problem without needing a complex new loss function.
Experiments and Results
Does this complex architecture actually work? The researchers tested SADD on the DiaASQ benchmark (containing both Chinese and English dialogues).
Dataset Stats
The dataset is substantial, containing thousands of dialogues with intricate reply structures.

Main Performance
The results are impressive. SADD was compared against several baselines, including discriminative models (like SpERT) and generative models (like ParaPhrase).

Looking at Table 1:
- SADD (Ours) achieves the highest scores across the board.
- In the English Quadruple Extraction (Identification F1), SADD scores 41.05%, significantly beating the previous best (MvI) at 37.51%.
- This improvement is consistent in both English (EN) and Chinese (ZH) datasets.
Ablation Studies: Do we need both parts?
The researchers turned off MGDG (Denoising) and SOBM (Debiasing) separately to see what happened.

Table 2 shows clearly:
- Removing MGDG causes a massive drop (e.g., from 41.05% to 38.39% in EN Iden F1).
- Removing SOBM also hurts performance significantly.
- The combination of both yields the best results.
Fighting the Noise
One of the most compelling results is the reduction in errors caused specifically by noise.

Table 3 shows that in the previous state-of-the-art model (MvI), nearly 80% of errors were due to noise. SADD reduced this to 48.67%. This is a drastic improvement in robustness.
Case Study
Visualizing the output helps understand the improvement.

In Figure 4, look at the first example regarding the “Meizu 18”.
- The user says “Meizu 18 is just a backup machine.”
- The Baseline (MvI) gets confused and extracts (Meizu 18, machine, backup, NEG). This is incorrect; “machine” isn’t a feature/aspect, it’s just a noun referring to the phone.
- SADD correctly ignores this noise and does not extract a false quadruple, thanks to the word-level denoising.
Why This Matters
The SADD model represents a significant step forward because it treats the structure of the data as part of the solution.
- It respects the nature of dialogue: By acknowledging that conversations have topics and threads, it uses segmentation (TADS) to filter context intelligently.
- It fixes the math of generation: By proving the “Ideal-Actual Gap,” it provides a theoretical justification for data augmentation, rather than just doing it blindly.
Limitations
The authors note that while effective, the augmentation strategy increases training time because the dataset becomes larger. Also, the current implementation uses BART, and the cost of attention scales quadratically with sequence length, a known issue for Large Language Models (LLMs) processing long contexts.
Conclusion
Extracting sentiment from dialogue is like trying to listen to a specific conversation in a crowded room. You need to tune out the background noise (Denoising) and you need to organize the snippets of information you hear without assuming a false sequence of events (Debiasing).
The SADD method provides a robust blueprint for doing exactly this. By combining Sequence Labeling, Topic-Aware Segmentation, and clever Data Augmentation, it sets a new standard for how AI understands our messy, human conversations. For students and researchers in NLP, this paper is a masterclass in diagnosing specific model failures (noise and bias) and designing targeted architectural fixes to solve them.