Imagine you are taking a multiple-choice history test. You don’t actually know the history, but you notice a pattern: every time the answer contains the word “never,” it’s the correct choice. You ace the test, scoring 100%. But have you learned history? No. You’ve just learned a statistical shortcut.
This scenario describes a massive problem in current Artificial Intelligence, specifically in Natural Language Inference (NLI). Models like BERT and RoBERTa achieve superhuman scores on benchmark datasets, but they often fail when faced with real-world, nuanced language. Why? Because the datasets they are tested on are full of “spurious correlations”—linguistic shortcuts that allow models to guess the right answer without understanding the logic.
In this post, we are doing a deep dive into the paper “How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics.” The researchers propose a fascinating, automated method to audit these datasets. By analyzing how a model struggles or succeeds during the learning process, they can categorize test questions into “Easy,” “Ambiguous,” and “Hard.”
The result? A reality check for NLP models and a roadmap for building more robust AI.
The Problem: When 90% Accuracy is a Lie
Natural Language Inference (NLI) is a fundamental task in understanding human language. The goal is simple: given two sentences—a Premise and a Hypothesis—the model must determine their logical relationship:
- Entailment: The hypothesis is true if the premise is true.
- Contradiction: The hypothesis is false if the premise is true.
- Neutral: The truth of the hypothesis cannot be determined from the premise.
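To make the three labels concrete, here is a minimal sketch of running an off-the-shelf NLI model with the Hugging Face `transformers` library. The checkpoint name (`roberta-large-mnli`) and the example sentences are illustrative choices, not anything prescribed by the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Off-the-shelf NLI checkpoint (illustrative choice, not from the paper).
model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "A man is playing a guitar on stage."
hypothesis = "A musician is performing."

# NLI models consume the premise and hypothesis as a sentence pair.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
for label_id, label in model.config.id2label.items():
    print(f"{label}: {probs[label_id]:.3f}")  # expect entailment to win here
```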
Popular datasets like SNLI (Stanford NLI) and MultiNLI have been the gold standard for years. Modern Transformers regularly score above 90% on these. However, researchers have long suspected these scores are inflated.
The issue lies in annotation artifacts. When humans create these datasets, they often fall into repetitive habits. For example, to create a “contradiction,” an annotator might simply negate the premise, writing a hypothesis that contains the word “not.” The model picks up on this. It learns: if “not” appears in the hypothesis, predict Contradiction. It ignores the actual meaning of the sentences.
The “Hypothesis-Only” Baseline
To prove this, researchers run a “Hypothesis-Only” test. They train a model using only the hypothesis sentence, hiding the premise entirely. Logically, this should be impossible—you can’t know if a hypothesis follows from a premise you haven’t seen. The accuracy should be random chance (33% for 3 classes).
However, look at the results from the paper below:

As shown in Table 1, a RoBERTa model trained only on the hypothesis (column Accuracy (H)) achieves 71.7% accuracy on SNLI. This is shockingly high. It proves the model is solving the dataset using shortcuts in the hypothesis text rather than performing actual inference.
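To get a feel for how such a baseline is built (a minimal sketch, not the authors’ training code), the only change is in how the input is encoded: the premise is simply dropped before tokenization.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # illustrative backbone

def encode_pair(premise: str, hypothesis: str):
    # Standard P+H model: the classifier sees both sentences.
    return tokenizer(premise, hypothesis, truncation=True)

def encode_hypothesis_only(premise: str, hypothesis: str):
    # Hypothesis-only baseline: the premise is discarded entirely.
    # Any accuracy well above ~33% must come from artifacts in the
    # hypothesis text itself, not from actual inference.
    return tokenizer(hypothesis, truncation=True)
```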
The Solution: Characterizing Difficulty via Training Dynamics
The authors of this paper argue that we need a way to separate the “shortcut” examples from the ones that require true reasoning. Manually filtering thousands of examples is impossible. Instead, they propose using Training Dynamics.
The core idea is simple: Watch how the model learns.
- Easy examples (shortcuts) are learned quickly and consistently.
- Hard examples (requiring logic) take longer to learn, and the model might flip-flop on its prediction during training.
The researchers developed an automated pipeline to categorize the test set into three levels: Easy, Ambiguous, and Hard.

As illustrated in Figure 1, the method has three phases. Let’s break them down.
Phase 1: Capturing Training Dynamics
To characterize the test set, the researchers train a model on the test data itself for a few epochs (typically 5). Note: they aren’t training this model in order to use it; they train it purely to observe how it learns. At the end of every epoch, they record statistics for every single example.
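What does “observing” look like in code? A minimal sketch, assuming a standard PyTorch/Hugging Face setup; the function below is hypothetical, not the authors’ implementation. After each epoch it runs the current model over every example and stores the probability of the gold label along with the raw logits.

```python
import torch

@torch.no_grad()
def record_epoch_stats(model, dataloader, device):
    """Collect per-example gold-label probabilities and logits for one epoch."""
    model.eval()
    gold_probs, all_logits = [], []
    for batch in dataloader:  # the dataloader must keep a fixed example order
        labels = batch.pop("labels").to(device)
        logits = model(**{k: v.to(device) for k, v in batch.items()}).logits
        probs = torch.softmax(logits, dim=-1)
        gold_probs.append(probs.gather(1, labels.unsqueeze(1)).squeeze(1))
        all_logits.append(logits)
    return torch.cat(gold_probs).cpu().numpy(), torch.cat(all_logits).cpu().numpy()

# Call this once after every training epoch, then stack the results into arrays
# of shape [num_epochs, num_examples] and [num_epochs, num_examples, num_classes].
```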
They collect four specific metrics for every data point \(x_i\) (a short code sketch of the computation follows the list):
1. Confidence (\(\hat{\mu}_i\))
This measures how confident the model is in the correct label on average across the epochs.
\[
\hat{\mu}_i = \frac{1}{E} \sum_{e=1}^{E} p_{\theta^{(e)}}\left(y_i^{*} \mid x_i\right)
\]
where \(E\) is the number of epochs, \(\theta^{(e)}\) are the model parameters after epoch \(e\), and \(y_i^{*}\) is the gold label.
If a model consistently assigns high probability to the correct class, the example is likely easy.
2. Variability (\(\hat{\sigma}_i\))
This measures how much the model’s prediction fluctuates across epochs.
\[
\hat{\sigma}_i = \sqrt{\frac{1}{E} \sum_{e=1}^{E} \left( p_{\theta^{(e)}}\left(y_i^{*} \mid x_i\right) - \hat{\mu}_i \right)^{2}}
\]
High variability usually means the example is “Ambiguous” or confusing to the model. It learns it, forgets it, then learns it again.
3. Correctness (\(\hat{c}_i\))
This is the fraction of epochs where the model predicted the correct label.
\[
\hat{c}_i = \frac{1}{E} \sum_{e=1}^{E} \mathbb{1}\left[ \arg\max_{y} \, p_{\theta^{(e)}}\left(y \mid x_i\right) = y_i^{*} \right]
\]
4. Area Under Margin (AUM)
This metric looks at the difference (margin) between the logit of the correct class and the logit of the second-best class.
\[
\mathrm{AUM}_i = \frac{1}{E} \sum_{e=1}^{E} \left( z^{(e)}_{y_i^{*}}(x_i) - \max_{y \neq y_i^{*}} z^{(e)}_{y}(x_i) \right)
\]
where \(z^{(e)}_{y}(x_i)\) is the logit assigned to class \(y\) after epoch \(e\).
A high AUM means the model is not just correct, but correct by a wide margin (it “knows” it’s right). A negative AUM implies the model consistently prefers the wrong class.
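Given per-epoch gold-label probabilities and logits (for example, the outputs of the hypothetical `record_epoch_stats` above, stacked over epochs), all four statistics reduce to a few lines of NumPy. This is a sketch that follows the definitions, not the authors’ code.

```python
import numpy as np

def training_dynamics(gold_probs, logits, labels):
    """
    gold_probs: [E, N]    probability of the gold label, per epoch and example
    logits:     [E, N, C] raw logits, per epoch, example, and class
    labels:     [N]       gold label ids
    """
    confidence = gold_probs.mean(axis=0)                      # mu_hat
    variability = gold_probs.std(axis=0)                      # sigma_hat
    correctness = (logits.argmax(-1) == labels).mean(axis=0)  # c_hat

    # Area Under Margin: gold logit minus the largest *other* logit, averaged over epochs.
    gold_logit = np.take_along_axis(logits, labels[None, :, None], axis=-1).squeeze(-1)
    others = logits.copy()
    np.put_along_axis(others, labels[None, :, None], -np.inf, axis=-1)
    aum = (gold_logit - others.max(axis=-1)).mean(axis=0)

    return confidence, variability, correctness, aum
```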
Phase 2: The Double-Check (P+H vs. H-only)
Here is the clever twist. If we only looked at the standard model (Premise + Hypothesis), we might mistake a spurious correlation for an “easy” valid inference.
To catch the artifacts, the authors calculate these 4 metrics twice for every example:
- P+H Model: The standard model seeing both sentences.
- H-Only Model: A model seeing only the hypothesis.
They concatenate these into a single feature vector representing the “learnability profile” of that test question.

By including the H-only dynamics, the system can distinguish between “Easy because it’s logical” and “Easy because it has a cheat code.”
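Conceptually, the “learnability profile” is just the two sets of statistics placed side by side, giving one 8-dimensional vector per test example. A sketch, where the placeholder arrays stand in for the stacked outputs of the hypothetical `training_dynamics` helper above, computed once per model:

```python
import numpy as np

num_examples = 10_000  # illustrative size
rng = np.random.default_rng(0)

# Four statistics per example from each model (placeholder values here;
# in practice these are the outputs of training_dynamics for each model).
stats_ph = rng.random((num_examples, 4))  # confidence, variability, correctness, AUM (P+H)
stats_h = rng.random((num_examples, 4))   # the same four, hypothesis-only

# One "learnability profile" per test example.
features = np.concatenate([stats_ph, stats_h], axis=1)  # shape [num_examples, 8]
```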
Phase 3: Clustering
Finally, they feed these feature vectors into a Gaussian Mixture Model (GMM). Unlike K-Means, which forces hard cluster boundaries, a GMM models the data as a mixture of probability distributions and assigns soft, probabilistic memberships. They ask the GMM to find three clusters in the data (a minimal code sketch follows the list below).
They rank the resulting clusters by average confidence:
- Easy: High confidence, low variability.
- Ambiguous: High variability, medium confidence.
- Hard: Low confidence, low margin.
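Here is a minimal scikit-learn sketch of this clustering step, assuming the `features` matrix from the previous snippet; using three components follows the paper, everything else is an illustrative default.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# features: [N, 8] learnability profiles; column 0 is the P+H confidence.
gmm = GaussianMixture(n_components=3, random_state=0)
cluster_ids = gmm.fit_predict(features)

# Rank clusters by mean P+H confidence: highest -> Easy, lowest -> Hard.
mean_conf = [features[cluster_ids == k, 0].mean() for k in range(3)]
order = np.argsort(mean_conf)
label_map = {order[2]: "easy", order[1]: "ambiguous", order[0]: "hard"}
difficulty = np.array([label_map[k] for k in cluster_ids])
```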
Analyzing the “Hard” Test Set
So, what happens when we separate the test set using this method? The results validate the “illusion of competence” in standard NLI evaluations.
Performance Collapse
When the researchers took a standard RoBERTa model (trained on the full training set) and evaluated it on these new splits, the performance dropped dramatically on the Hard set.

Looking at Table 3:
- Easy Split: The model scores 97% on SNLI.
- Hard Split: The model scores 56%.
This 56% is a much more realistic estimate of the model’s ability to perform logical reasoning without shortcuts. Notice also the “Accuracy (H)” column. On the Easy split, the hypothesis-only model scores 82% (huge cheating). On the Hard split, it drops to 38% (almost random guessing). This confirms that the “Hard” split successfully filtered out the artifacts.
Visualizing the Dynamics
We can visualize why these splits are different. Figure 2 below shows the distribution of the metrics across the three clusters.

In the top row (SNLI), look at the Avg Margin (P+H). The “Easy” group (left) has a very high positive margin. The “Hard” group (right) dips below zero, indicating the model is struggling to distinguish the right answer from the wrong ones.
Where did the shortcuts go?
The authors also verified their method by checking for known spurious correlations (sketched in code after the list), such as:
- Word Overlap: Sentences with many shared words are usually “Entailment.”
- Negation: Sentences with “not” are usually “Contradiction.”
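These heuristics are easy to measure directly. A rough sketch of the kind of check involved, using simple whitespace tokenization and a hand-picked negation list rather than the authors’ exact criteria:

```python
NEGATION_WORDS = {"no", "not", "never", "nothing", "nobody", "n't"}

def word_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens that also appear in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    return sum(tok in premise_tokens for tok in hypothesis_tokens) / max(len(hypothesis_tokens), 1)

def contains_negation(hypothesis: str) -> bool:
    """True if the hypothesis contains an obvious negation cue."""
    return any(tok in NEGATION_WORDS for tok in hypothesis.lower().split())
```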

The algorithm wasn’t told about these heuristics—it only looked at training dynamics. Yet, it automatically sorted them out.

Figure 3 shows the breakdown. Look at the Contains Negation column (far right). In the Easy row (top), there is a massive spike for the black bar (Contradiction). This means the Easy set is full of the “Contradiction = Negation” shortcut. Now look at the Hard row: the distribution is flat, and the correlation is gone. In the Hard set, a sentence can contain “not” and still be Entailment or Neutral. The model actually has to read the text.
The Class Imbalance
An interesting side effect of this characterization is the class distribution.

As shown in Figure 4, the Easy splits (top row) are dominated by Contradiction and Entailment. This makes sense, as these classes are easier to “fake” with keyword matching. The Hard splits (bottom row) have a much higher proportion of Neutral examples. Distinguishing “Neutral” (it could be true) from “Contradiction” (it is impossible) requires subtle reasoning that simple pattern matching cannot solve.
Is this Model-Specific?
A valid critique might be: “Maybe you just found examples that are hard for RoBERTa. Maybe DeBERTa finds them easy?”
The authors tested this with a cross-model experiment: they used RoBERTa to create the difficulty splits and then evaluated DeBERTa on them (and vice versa).

Table 4 shows that the difficulty levels transfer almost perfectly. An example that is “Hard” for RoBERTa is almost always “Hard” for DeBERTa. This suggests that the difficulty is intrinsic to the linguistic properties of the example, not just a quirk of one specific neural network architecture.
Furthermore, Figure 5 (below) confirms that the “Negation” shortcut is identified and removed by both models similarly.

Implications: Building Better Models with Less Data
The most practical application of this research isn’t just grading test sets—it’s improving training.
If “Easy” examples are just noise and shortcuts, do we even need them? The authors experimented with filtering the training set. They removed the “Easy” examples and trained models only on the Ambiguous and Hard data.
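Once each training example carries a difficulty label from the characterization pipeline, the filtering itself is trivial. A sketch, where the `difficulty` field is a hypothetical column name:

```python
# train_set: list of dicts with a hypothetical "difficulty" field added by the pipeline.
train_set = [
    {"premise": "...", "hypothesis": "...", "label": "contradiction", "difficulty": "easy"},
    {"premise": "...", "hypothesis": "...", "label": "neutral", "difficulty": "hard"},
]

# Keep only the Ambiguous and Hard examples; drop the shortcut-heavy Easy ones.
filtered_train = [ex for ex in train_set if ex["difficulty"] in ("ambiguous", "hard")]
```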

In Table 5, look at the row Ours Amb+Hard.
- This model used only 59% of the training data.
- It achieved better or comparable results on stress tests compared to the model trained on 100% of the data (“All” row).
Conversely, training only on the “Easy” data results in a broken model that fails on stress tests. This is strong evidence that a smaller, high-quality dataset (one largely free of artifacts) is more valuable than a massive, noisy one.
Conclusion
The paper “How Hard is this Test Set?” provides a sobering but necessary look at the state of Natural Language Inference. It reminds us that high accuracy numbers on leaderboards can be deceptive. If a model is getting the right answer for the wrong reason, it hasn’t learned anything useful for the real world.
By exploiting Training Dynamics, the authors offer a robust, automated way to:
- Expose Spurious Correlations: Identifying where models are cheating.
- Benchmark Reality: Providing a “Hard” test set that reflects true NLU capabilities (often ~50-60% accuracy, not 90%).
- Optimize Training: Showing that we can train efficient models by discarding “junk” easy data.
As we move toward even larger models (LLMs), techniques like this will be essential. We cannot rely on massive scale alone to solve reasoning; we need to ensure our models are being challenged by the data, not just memorizing its statistical flaws. The path to true NLU lies in the “Hard” cluster.