Confidence Check: Can Data Augmentation Fix Overconfidence in NER Models?
Imagine a doctor using an AI assistant to scan medical records for patient allergies. The AI flags “Penicillin” with 99% confidence. The doctor trusts it. But what if the AI misses a rare drug name, or worse, identifies a vitamin as a dangerous allergen with that same 99% confidence?
This scenario highlights a critical flaw in modern Deep Neural Networks (DNNs): miscalibration. Modern models are often “overconfident,” assigning high probability scores to predictions even when they are wrong. In safety-critical fields like healthcare, finance, or autonomous driving, accurate predictions aren’t enough—we need to know how much to trust those predictions.
In Natural Language Processing (NLP), specifically Named Entity Recognition (NER)—the task of identifying names, dates, and locations in text—this problem is rampant. While researchers have developed methods to estimate uncertainty (like Monte-Carlo Dropout), these solutions usually come with a heavy price: they drastically slow down the model during inference.
But what if the solution wasn’t a complex new algorithm, but simply… more data?
In a fascinating paper titled “Are Data Augmentation Methods in Named Entity Recognition Applicable for Uncertainty Estimation?”, researchers from the Nara Institute of Science and Technology investigate whether standard Data Augmentation (DA) techniques—typically used just to boost accuracy—can also teach models to be more honest about their uncertainty, without slowing them down.
The Problem: When 99% Isn’t 99%
To understand the solution, we first need to understand “Calibration.”
In a perfectly calibrated model, if we look at all the predictions made with 70% confidence, the model should be correct exactly 70% of the time. If the model is correct only 40% of the time on those predictions, it is miscalibrated (specifically, overconfident).
Pre-trained Language Models (PLMs) like BERT and DeBERTa are powerful, but they are notorious for this overconfidence.
Measuring the Gap
How do we measure this? The researchers rely on two primary metrics. The first is Expected Calibration Error (ECE). Think of this as the weighted average difference between the model’s confidence and its actual accuracy across different “bins” of probability.

\[ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right| \]

Here, \(n\) is the total number of samples, \(B_m\) is the set of predictions falling into bin \(m\), and the formula sums up the gap between accuracy (\(acc\)) and confidence (\(conf\)) for each of the \(M\) bins, weighted by bin size.
However, averages can hide outliers. For high-stakes applications, we might care more about the worst-case scenario. For this, they use Maximum Calibration Error (MCE), which looks for the single bin where the gap between confidence and accuracy is the largest.

\[ \mathrm{MCE} = \max_{m \in \{1, \dots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right| \]
A lower score for both ECE and MCE means the model knows its own limits better.
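To make this concrete, here is a minimal NumPy sketch of how ECE and MCE are typically computed from per-prediction confidences. This is an illustration rather than the authors’ code; the bin count and equal-width binning scheme are assumptions.

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=10):
    """Compute Expected and Maximum Calibration Error.

    confidences: array of max softmax probabilities, shape (n,)
    correct:     boolean array, True where the prediction was right
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()       # accuracy within this bin
        conf = confidences[in_bin].mean()  # average confidence within this bin
        gap = abs(acc - conf)
        ece += (in_bin.sum() / n) * gap    # ECE: gap weighted by bin size
        mce = max(mce, gap)                # MCE: worst single bin
    return ece, mce
```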
The Contenders: Standard Fixes vs. Data Augmentation
The “Expensive” Incumbents
The paper compares new ideas against established methods for fixing calibration.
Temperature Scaling (TS): A post-processing step that softens the model’s output distribution using a parameter \(T\). It’s fast but requires a separate validation set to tune.
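As a rough illustration (not the paper’s implementation), temperature scaling simply divides the logits by \(T\) before the softmax, and \(T\) is usually fit on held-out validation logits by minimizing the negative log-likelihood:

```python
import torch
import torch.nn.functional as F

def temperature_scale(logits, T):
    """Soften the output distribution by dividing logits by T before the softmax."""
    return F.softmax(logits / T, dim=-1)

def fit_temperature(val_logits, val_labels):
    """Tune T on a validation set (val_logits are detached model outputs)."""
    T = torch.nn.Parameter(torch.ones(1))
    optimizer = torch.optim.LBFGS([T], lr=0.01, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / T, val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return T.item()
```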

Label Smoothing (LS): This technique tells the model during training, “Don’t be 100% sure about the correct label; save a little probability for the others.”
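In recent PyTorch versions this is a one-liner; the snippet below assumes a smoothing factor of 0.1, which is a common default rather than the paper’s exact setting:

```python
import torch.nn as nn

# With label_smoothing=0.1, the true class gets a target of 0.9 and the
# remaining 0.1 of probability mass is spread evenly over the other classes.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```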

Monte-Carlo (MC) Dropout: This is the “gold standard” for uncertainty. It involves keeping Dropout (randomly disabling neurons) active during inference and running the model multiple times (e.g., 20 times) for the same input, then averaging the results.

The Catch: While MC Dropout works well, running a model 20 times for every single sentence makes it 20 times slower. In real-time applications, this is often a dealbreaker.
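A minimal sketch of MC Dropout inference, assuming a Hugging Face-style token-classification model that exposes `.logits` (the model interface and the pass count are illustrative assumptions):

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, input_ids, attention_mask, n_passes=20):
    """Average softmax outputs over several stochastic forward passes."""
    model.train()  # keep dropout layers active at inference time
    probs = []
    for _ in range(n_passes):
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        probs.append(torch.softmax(logits, dim=-1))
    model.eval()
    # n_passes forward passes -> roughly n_passes times slower than a single pass
    return torch.stack(probs).mean(dim=0)
```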
The Challenger: Data Augmentation (DA)
The core hypothesis of this paper is that Data Augmentation—creating variations of training data to prevent overfitting—might naturally improve calibration. If the model sees more diverse examples during training, it might learn a smoother decision boundary, leading to better uncertainty estimates.
The researchers tested four specific DA methods for NER:
- Label-wise Token Replacement (LwTR): Randomly replacing tokens with other tokens from the training data that carry the same label.
- Synonym Replacement (SR): Replacing words with their synonyms from a database (WordNet).
- Mention Replacement (MR): This is specific to NER. It replaces an identified entity (like “New York”) with another entity of the same type (like “London”) found in the training data (see the sketch after this list).
- Masked Entity Language Modeling (MELM): A more advanced method that uses a BERT-like model to predict contextually appropriate replacements for entities.
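To give a flavor of how Mention Replacement operates on BIO-tagged data, here is an illustrative sketch. The `mention_pool` (a dictionary of mentions collected from the training set) and the function itself are assumptions for illustration, not the paper’s code.

```python
import random

def mention_replacement(tokens, labels, mention_pool):
    """Swap each entity mention for another mention of the same type.

    tokens, labels: parallel lists in BIO format, e.g.
        ["I", "visited", "New", "York"], ["O", "O", "B-LOC", "I-LOC"]
    mention_pool: dict mapping entity type -> list of mentions (token lists)
        collected from the training set, e.g. {"LOC": [["London"], ["New", "Delhi"]]}
    """
    new_tokens, new_labels, i = [], [], 0
    while i < len(tokens):
        if labels[i].startswith("B-"):
            ent_type = labels[i][2:]
            # find the end of the current mention
            j = i + 1
            while j < len(tokens) and labels[j] == f"I-{ent_type}":
                j += 1
            # replace it with a randomly chosen mention of the same type
            replacement = random.choice(mention_pool[ent_type])
            new_tokens.extend(replacement)
            new_labels.extend(["B-" + ent_type] + ["I-" + ent_type] * (len(replacement) - 1))
            i = j
        else:
            new_tokens.append(tokens[i])
            new_labels.append(labels[i])
            i += 1
    return new_tokens, new_labels
```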
The Strategic Advantage: Unlike MC Dropout, Data Augmentation happens offline during training. The final model is just a standard model. Inference time remains unchanged.
Experimental Setup
The researchers conducted a comprehensive evaluation using two massive datasets to test different scenarios:
- OntoNotes 5.0: Used for Cross-Genre evaluation. They trained on one genre (e.g., Broadcast News) and tested on another (e.g., Telephone Conversation) to see how the model handles slightly different writing styles.
- MultiCoNER: Used for Cross-Lingual evaluation. Training in English and testing in German, Spanish, Hindi, etc.

As shown in the table above, these datasets provide a robust testing ground with varied sizes and domains.
Key Findings
1. In-Domain: Data Augmentation is a Winner
When the test data comes from the same domain as the training data (In-Domain or ID), Data Augmentation methods showed remarkable results.

Look at the table above (Table 3). The rows for MR (Mention Replacement) and MELM frequently outperform the Baseline, Temperature Scaling (TS), and even the computationally expensive MC Dropout.
- Result: MELM achieved up to a 6.01% improvement in ECE compared to the baseline in certain domains.
- Significance: This confirms that simply showing the model more variations of entities helps it calibrate its confidence scores without needing complex inference techniques.
2. The Perplexity Connection: Why “MR” Works Best
Not all data augmentation is created equal. The researchers analyzed the perplexity of the generated sentences. Perplexity effectively measures how “surprised” a language model is by a sentence; low perplexity means the sentence sounds natural.
They found a strong correlation: Lower perplexity leads to better calibration.
- Mention Replacement (MR) consistently produced the lowest perplexity because it swaps whole entities (e.g., “President [Obama]” \(\rightarrow\) “President [Biden]”) rather than random tokens. This preserves the grammatical structure.
- LwTR, which swaps individual tokens, often created “noisier,” less natural sentences. Consequently, LwTR often performed worse for uncertainty estimation.
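As a rough illustration of the perplexity measurement above (the paper does not necessarily use GPT-2), an off-the-shelf language model can score augmented sentences; a natural-sounding swap should come out with lower perplexity than a noisy one:

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(sentence):
    """Perplexity = exp(average negative log-likelihood per token)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

# An entity-level swap (MR-style) should stay fluent; a random token swap
# (LwTR-style) tends to raise perplexity.
print(perplexity("President Biden met reporters in Washington."))
print(perplexity("President table met reporters in Washington."))
```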
3. The “More is Better” Effect (Usually)
Does throwing more augmented data at the model help? It depends on the method.

In Figure 2 (top row), we see the trends for the “Telephone Conversation” domain.
- MR (Orange line): As the augmentation size increases (x-axis), the ECE (error) generally decreases or stays low.
- LwTR (Blue line): In some cases, adding more LwTR data actually increased the error. Because LwTR generates noisier data, adding too much of it can confuse the model rather than smooth it out.
4. The Out-Of-Domain (OOD) Limit
Here is the crucial limitation. While DA works wonders when the test data is similar to the training data, it struggles when the test data is significantly different (Out-Of-Domain).

The visualization above (Figure 1) explains why, using t-SNE (a technique for mapping high-dimensional data into 2D).
- Red dots: Original Training Data.
- Blue dots: Augmented Data (MELM).
- Purple dots: Out-of-Domain Test Data (Web Data).
Notice how the Blue dots (Augmented) cluster tightly around the Red dots (Training). The augmentation methods generate variations similar to what the model already knows. However, the Purple dots (OOD test data) form clusters far away from both.
Conclusion: Data augmentation fills in the gaps within the known domain, improving calibration there. But it does not magically teach the model about completely new domains or writing styles that it has never seen.
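For readers who want to reproduce this kind of picture, here is a hedged sketch using scikit-learn’s t-SNE on sentence embeddings; the pooling strategy, colors, and plotting details are assumptions rather than the paper’s exact setup.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(emb_train, emb_aug, emb_ood):
    """Project training, augmented, and OOD sentence embeddings into 2D together.

    Each argument is a 2-D array (n_sentences, hidden_dim), e.g. mean-pooled
    encoder states from the NER model.
    """
    all_emb = np.vstack([emb_train, emb_aug, emb_ood])
    coords = TSNE(n_components=2, random_state=0).fit_transform(all_emb)
    n1, n2 = len(emb_train), len(emb_train) + len(emb_aug)
    plt.scatter(*coords[:n1].T, c="red", s=5, label="training")
    plt.scatter(*coords[n1:n2].T, c="blue", s=5, label="augmented")
    plt.scatter(*coords[n2:].T, c="purple", s=5, label="OOD test")
    plt.legend()
    plt.show()
```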
5. Cross-Lingual Uncertainty
The team also tested “Zero-Shot Cross-Lingual Transfer”—training in English and testing in Spanish or Hindi without seeing any target language data.

Similar to the genre transfer, Figure 4 shows a massive gap between English training data (Red) and Hindi test data (Purple). While DA helped slightly in languages that are linguistically close to English (like German or Spanish), it couldn’t bridge the gap for distant languages.
However, for low-resource languages, DA still showed promise.

As seen in Table 13 (Bangla), Synonym Replacement (SR) and LwTR actually provided better ECE scores than the baseline, suggesting that even simple noise injection can prevent the model from being recklessly overconfident in difficult, low-resource languages.
Conclusion: A Free Lunch?
So, is Data Augmentation the silver bullet for uncertainty?
The Pros:
- Efficiency: It improves uncertainty estimation with zero increase in inference time.
- Performance: In In-Domain settings, methods like Mention Replacement (MR) often beat expensive methods like MC Dropout.
- Simplicity: It does not require changing the model architecture or loss functions.
The Cons:
- OOD Limitations: It doesn’t fix overconfidence on data that is vastly different from the training set.
- Quality Control: The quality of the augmented data matters. “Noisy” augmentation (high perplexity) can hurt calibration.
For students and practitioners, the takeaway is clear: If you are building an NER system for a safety-critical application and you know your deployment data will be similar to your training data, Data Augmentation—specifically Mention Replacement—is a powerful, computationally “free” tool to ensure your model isn’t just accurate, but also trustworthy.
It turns out that teaching an AI to know when it might be wrong simply requires teaching it more ways to be right.