Introduction
Imagine teaching a child to read. You wouldn’t start by handing them a complex legal contract or a page from Shakespeare. Instead, you begin with simple sentences: “The cat sat on the mat.” Once they master the basics, you gradually introduce more complex grammar, vocabulary, and sentence structures. This intuitive progression—learning the easy stuff before the hard stuff—is the foundation of Curriculum Learning (CL) in artificial intelligence.
In the field of Natural Language Processing (NLP), however, we often ignore this intuition. We tend to train models by feeding them data in random batches, mixing simple phrases with incredibly complex, ambiguous sentences.
This approach becomes particularly problematic in Sequence Labeling tasks, such as Part-of-Speech (POS) tagging or Named Entity Recognition (NER). Modern models try to improve accuracy by incorporating “heterogeneous knowledge”—external data like lexicons, syntax graphs, or n-grams. While this extra knowledge helps, it makes the input data messy and the models heavy. The result? Training becomes computationally expensive and slow.
In this post, we will take a deep dive into a research paper that proposes a solution: a Dual-stage Curriculum Learning (DCL) framework. The researchers demonstrate that by strategically ranking data difficulty and adapting the curriculum as the model learns, we can train complex models faster and achieve better performance.
The Challenge: Heterogeneity and Complexity
Before understanding the solution, we need to understand the bottleneck. Sequence labeling involves assigning a label (like “Noun” or “Person”) to every element in a text sequence.
To achieve state-of-the-art results, researchers often enhance basic neural networks with external knowledge. For example, a Chinese Word Segmentation (CWS) model might look up words in a dictionary or analyze the syntactic structure of a sentence. While beneficial, this adds two layers of difficulty:
- Data Heterogeneity: The input isn’t just text anymore; it’s text plus graph data, lexicon matches, and syntactic trees.
- Model Complexity: To process this extra data, models need extra modules (like Graph Neural Networks or complex Attention mechanisms), increasing the parameter count.
This combination creates a “slow training” issue. The model struggles to converge because it is trying to learn complex relationships from the very first epoch.
The Solution: Dual-Stage Curriculum Learning (DCL)
The core contribution of this paper is a framework that splits curriculum learning into two distinct stages: Data-level and Model-level.

As illustrated in Figure 1, the framework operates as follows:
- Data-level CL (The Teacher): Before the main model (the Student) starts learning, a simpler “Teacher” model scans the entire dataset. It provides an initial difficulty ranking. This solves the “cold start” problem—the Student doesn’t know what is hard or easy yet, so the Teacher provides a preliminary roadmap.
- Model-level CL (The Student): As the Student begins training on the easy data, it starts to form its own “opinions” on what is difficult. The framework dynamically re-ranks the remaining data based on the Student’s current state.
This dual approach ensures that the curriculum is not static. It evolves as the Student becomes smarter.
The Training Process
The training follows a specific algorithm (Algorithm 1 in the paper). Here is the simplified flow, with a code sketch after the list:
- Teacher Phase: Train a basic model for a few epochs. Use it to sort the dataset from Easy \(\rightarrow\) Hard.
- Initialization: Give the Student model a small slice of the easiest data (defined by a ratio \(\lambda_0\)).
- The Loop:
- Train the Student on the current “easy” subset.
- Use the Student model to evaluate the difficulty of the remaining (unseen) data.
- Re-rank the remaining data.
- Increase \(\lambda\) (the amount of data allowed) using a scheduler.
- Add the next batch of “easiest” remaining samples to the training set.
- Repeat until all data is used.
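To make the loop concrete, here is a minimal Python sketch of the procedure. All names are illustrative placeholders: the `teacher`/`student` objects and their `fit()`/`difficulty()` methods are not the paper's actual interface, and `scheduler_lambda` is the pacing function sketched in the next section.

```python
# Minimal sketch of the dual-stage curriculum loop (paraphrasing Algorithm 1).
# `teacher` and `student` are hypothetical model objects exposing fit() and
# difficulty(); this is not the paper's implementation.

def dcl_train(teacher, student, dataset, lambda_0=0.3, e_grow=10, epochs=20):
    # Data-level CL: a lightweight teacher provides the initial easy -> hard ranking.
    teacher.fit(dataset, epochs=2)
    ranked = sorted(dataset, key=teacher.difficulty)

    n = len(ranked)
    cutoff = int(lambda_0 * n)
    used, remaining = ranked[:cutoff], ranked[cutoff:]

    for t in range(1, epochs + 1):
        # Train the student on the data released so far.
        student.fit(used, epochs=1)

        if remaining:
            # Model-level CL: the student re-ranks the data it has not seen yet.
            remaining.sort(key=student.difficulty)

            # The scheduler (see next section) decides how much data is allowed now.
            allowed = int(scheduler_lambda(t, lambda_0, e_grow) * n)

            # Release the easiest remaining samples up to the allowed budget.
            new = max(0, allowed - len(used))
            used += remaining[:new]
            remaining = remaining[new:]

    return student
```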
The Scheduler
How fast should we introduce harder data? If we go too fast, the model gets overwhelmed. If we go too slow, training takes forever. The authors use a Root Function to control the pace, represented by \(\lambda\) (the proportion of data used).

\[ \lambda_t = \min\left(1,\ \sqrt{\frac{1 - \lambda_0^2}{E_{grow}} \cdot t + \lambda_0^2}\right) \]
In this equation:
- \(\lambda_0\) is the starting percentage of data (e.g., the easiest 30%).
- \(t\) is the current epoch.
- \(E_{grow}\) is the number of epochs it takes to reach full dataset usage.
Because of the square-root shape, the data budget grows quickly while the new examples are still easy and then slows down, so the model gets progressively more time to digest each batch of harder examples before the next one arrives.
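As a quick sanity check, here is the root scheduler in plain Python. The default values of `lambda_0` and `e_grow` are illustrative, not the paper's settings.

```python
import math

def scheduler_lambda(t, lambda_0=0.3, e_grow=10):
    """Fraction of the training data available at epoch t (root-function pacing)."""
    return min(1.0, math.sqrt(t * (1 - lambda_0 ** 2) / e_grow + lambda_0 ** 2))

# With lambda_0 = 0.3 and e_grow = 10, the budget grows quickly at first
# (0.30 at epoch 0, ~0.52 by epoch 2) and reaches 1.0 at epoch 10.
for epoch in range(11):
    print(epoch, round(scheduler_lambda(epoch), 2))
```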
Measuring Difficulty: What Makes a Sentence “Hard”?
The success of Curriculum Learning hangs on one question: How do you define difficulty?
If you sort the data incorrectly (e.g., giving the hardest examples first), the model might fail completely. The authors explored several metrics:
1. Simple Baselines
- Sentence Length: The assumption is that longer sentences are harder. While often true, a short sentence with ambiguous words can be harder than a long, simple one.
- Random: No curriculum (the standard approach).
2. Confidence-Based Metrics
- TLC (Top-N Least Confidence): Looks at the tokens where the model is least confident.
- MNLP (Maximum Normalized Log-Probability): Uses the probability of the predicted labels to gauge confidence.
3. The Winner: Bayesian Uncertainty (BU)
The authors propose that the best way to measure difficulty is Uncertainty. If the model is unsure about a prediction, that sample is “hard” for the current model state.
To measure this, they use Monte Carlo Dropout. In a standard neural network, you make one pass to get a prediction. With Monte Carlo Dropout, you run the input through the model \(K\) times, randomly dropping out (turning off) different neurons each time.
If the model is confident, it will give roughly the same prediction every time, regardless of which neurons are off. If it is uncertain, the predictions will vary wildly.
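Keeping dropout active at prediction time is straightforward in most deep-learning frameworks. Below is a short PyTorch-style sketch, not the paper's code: `model` stands for any module containing dropout layers, and the output shape in the comment is an assumption.

```python
import torch

def mc_dropout_passes(model, x, k=10):
    """Run K stochastic forward passes with dropout kept active (Monte Carlo Dropout)."""
    model.train()                      # keep dropout layers stochastic at inference time
    with torch.no_grad():              # but do not track gradients
        probs = [torch.softmax(model(x), dim=-1) for _ in range(k)]
    return torch.stack(probs)          # e.g. shape (K, seq_len, num_labels)
```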
Step 1: The Expectation. First, they approximate the average probability of label \(y_i\) for input \(x_i\) over the \(K\) stochastic passes:

\[ \mathbb{E}\big[p(y_i \mid x_i)\big] \approx \frac{1}{K} \sum_{k=1}^{K} p\big(y_i \mid x_i, \hat{W}_k\big) \]

where \(\hat{W}_k\) denotes the network weights with the \(k\)-th random dropout mask applied.
Step 2: The Variance. Next, they calculate the variance (disagreement) among those \(K\) predictions. High variance means high uncertainty.

\[ \operatorname{Var}\big[p(y_i \mid x_i)\big] \approx \frac{1}{K} \sum_{k=1}^{K} \Big( p\big(y_i \mid x_i, \hat{W}_k\big) - \mathbb{E}\big[p(y_i \mid x_i)\big] \Big)^2 \]
Step 3: Sequence Scoring. Sequence labeling is about the whole sentence, not just one word. The authors combine two views of the sentence's variance:
- Average Variance: How uncertain is the model about the sentence on average?

\[ S_{avg} = \frac{1}{n} \sum_{i=1}^{n} \operatorname{Var}\big[p(y_i \mid x_i)\big] \]
- Max Variance: What was the most confusing part of the sentence? (Even one very hard word can make a sentence difficult.)

\[ S_{max} = \max_{1 \le i \le n} \operatorname{Var}\big[p(y_i \mid x_i)\big] \]
Step 4: The Final Score. The final difficulty score \(S_{BU}\) for a sentence under the Bayesian Uncertainty (BU) metric is the sum of these two:

\[ S_{BU} = S_{avg} + S_{max} \]
This metric allows the Student model to say, “I am very confused by this specific sentence right now, so save it for later.”
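Putting the four steps together, here is a small NumPy sketch of the BU score for one sentence. It consumes a stack of per-pass probabilities like the one produced by the Monte Carlo sketch above (converted to a NumPy array); how the label dimension is collapsed into a per-token variance is my assumption, not a detail taken from the paper.

```python
import numpy as np

def bu_score(probs):
    """Bayesian Uncertainty difficulty score for one sentence.

    `probs` has shape (K, n_tokens, n_labels): label probabilities from K
    Monte Carlo Dropout passes. Sketch of the avg + max variance score above.
    """
    mean = probs.mean(axis=0)                 # E[p(y_i | x_i)], shape (n_tokens, n_labels)
    var = ((probs - mean) ** 2).mean(axis=0)  # variance across the K passes
    token_uncertainty = var.sum(axis=-1)      # collapse the label dimension (assumption)
    s_avg = token_uncertainty.mean()          # average variance over the sentence
    s_max = token_uncertainty.max()           # most confusing token
    return s_avg + s_max

# Example: 10 passes, 6 tokens, 4 labels of random "predictions".
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=(10, 6))
print(round(float(bu_score(probs)), 4))
```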
Note: For completeness, here are the standard formulations of the baseline metrics TLC and MNLP, which the authors also tested, though both proved less effective than BU.
TLC (Least Confidence):

\[ S_{TLC} = \sum_{i \in \text{Top-}N} \Big( 1 - \max_{y} p(y_i = y \mid x) \Big) \]

where the sum runs over the \(N\) tokens the model is least confident about.
MNLP (Log-Probability):

\[ S_{MNLP} = \frac{1}{n} \sum_{i=1}^{n} \max_{y} \log p(y_i = y \mid x) \]
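For comparison, here are NumPy sketches of the two confidence baselines. They follow the common formulations from the sequence-labeling literature and may differ in detail from the paper's exact definitions.

```python
import numpy as np

def tlc_score(probs, n=3):
    """Top-N Least Confidence: sum of (1 - best label probability) over the
    N least confident tokens. `probs` has shape (n_tokens, n_labels)."""
    uncertainty = 1.0 - probs.max(axis=-1)
    return np.sort(uncertainty)[-n:].sum()

def mnlp_score(probs):
    """Maximum Normalized Log-Probability: length-normalized log-probability
    of the greedy label sequence (higher = more confident)."""
    return np.log(probs.max(axis=-1)).mean()
```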
Experiments and Results
Does this actually work? The researchers tested the DCL framework on three datasets: CTB5, CTB6, and PKU, covering Chinese Word Segmentation (CWS) and Part-of-Speech (POS) tagging. They applied DCL to two different complex state-of-the-art models (McASP and SynSemGCN).
Performance Gains
The results in Table 1 show a clear trend. The models trained with DCL (specifically using the BU metric) consistently outperformed the baselines.

Key Takeaways from the Data:
- DCL vs. No CL: The standard models (rows with “-”) are consistently beaten by the DCL variants.
- BU vs. Others: The Bayesian Uncertainty (BU) metric achieves the highest F1 scores in almost every category (highlighted in bold). It outperforms simple metrics like “Length” and other confidence metrics like “TLC.”
Speed and Efficiency
One of the main goals was to reduce training costs. The ablation study (Table 2 in the paper) revealed that using DCL reduced training time significantly.
- Standard Training: 393 minutes.
- DCL Training: 287 minutes.
That is a ~27% reduction in training time while achieving higher accuracy. This happens because the model learns the basics quickly and doesn’t waste time struggling with hard examples before it’s ready.
Learning Curve Analysis
To visualize why BU works better, the authors plotted the F1 scores on the development set during the first 10 epochs.

In Figure 2, look at the purple line (BU). It rises the fastest and stays at the top.
- Random (Blue) is the slowest.
- MNLP (Green) and Length (Orange) plateau early.
- BU (Purple) selects the most beneficial samples for the model at exactly the right time, leading to a steeper learning curve.
Generalization to Other Tasks
Is this method specific only to Chinese Word Segmentation? The authors tested it on Named Entity Recognition (NER) using Chinese (Weibo, Note4) and English (CoNLL-2003) datasets.

As shown above, the DCL framework with the BU metric (bottom row) outperforms both the standard BERT model and BERT with simple Curriculum Learning (Length). This indicates that the framework is robust across different sequence labeling tasks and languages.
Conclusion
The research presented in “An Effective Curriculum Learning for Sequence Labeling Incorporating Heterogeneous Knowledge” offers a compelling argument for treating AI training more like human education.
By implementing a Dual-stage Curriculum Learning framework, we can mitigate the difficulties caused by heterogeneous data and complex architectures. The combination of a “Teacher” for initialization and a “Student” for dynamic adjustment creates a powerful feedback loop. Furthermore, using Bayesian Uncertainty as a difficulty metric proves to be far superior to simple heuristics like sentence length.
For students and practitioners in NLP, the takeaways are clear:
- Order Matters: Random batching is not always optimal.
- Dynamic is Better: What is “hard” for a model changes as it learns; your curriculum should reflect that.
- Uncertainty is Useful: Measuring what a model doesn’t know is a powerful signal for guiding its training.
This approach not only yields higher F1 scores but also saves significant computational resources—a win-win for modern deep learning.