In the world of Machine Learning, there is a pervasive mantra: “More data is better.” If your model isn’t performing well, the standard advice is often to throw more training samples at it. But in specialized fields like Natural Language Processing (NLP), acquiring high-quality data is neither easy nor cheap.
This is particularly true for Topic-Dependent Argument Mining (TDAM). Teaching a machine to recognize whether a specific sentence is an argument for or against a complex topic (like “nuclear energy” or “minimum wage”) requires nuance. You cannot simply scrape the web and hope for the best; you usually need human experts to label the data. This process is expensive and time-consuming.
So, what if we have been building datasets the wrong way?
A fascinating research paper titled “Diversity Over Size” challenges the “more is better” status quo. The researchers investigate whether the composition of a dataset—specifically the diversity of topics—matters more than the sheer number of samples. Their findings are surprising: by prioritizing topic diversity, we might be able to reduce dataset sizes by nearly 90% while retaining 95% of the performance.
In this post, we will break down their methodology, the creation of a new benchmark dataset, and what this means for the future of efficient model training.
The Problem: The High Cost of Argument Mining
To understand the researchers’ motivation, we first need to define the task. Topic-Dependent Argument Mining (TDAM) involves searching through documents to find arguments relevant to a specific query or topic.
For example, if the topic is “electronic cigarettes,” the model needs to look at a sentence like “Currently, there is no scientific evidence confirming that electronic cigarettes help smokers quit smoking” and classify it. Is it a “Pro” argument? A “Con” argument? Or is it “None” (not an argument)?
This is harder than standard sentiment analysis because it requires context. A sentence that sounds negative might actually support a specific policy, and vice versa. Consequently, creating datasets for TDAM is costly. The researchers note that a previous study spent over $2,700 just to annotate about 25,000 samples.
As transformer models (like BERT and GPT) have evolved, the hunger for larger datasets has grown. But relying on massive datasets has three major downsides:
- Impracticality: It is almost impossible for experts to label hundreds of thousands of samples.
- Cost: Crowdsourcing is expensive and requires strict quality control.
- Training Time: More data means longer, more expensive training cycles.
The researchers propose a different path: Diversity Sampling. Instead of collecting thousands of examples for a handful of topics, what if we collected a few examples for hundreds of topics?
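To make that trade-off concrete, here is a back-of-the-envelope comparison. The per-label cost is extrapolated from the ~$2,700 for ~25,000 samples figure mentioned above, and the two layouts use illustrative round numbers rather than the real corpus statistics.

```python
# Same annotation budget, two very different dataset shapes.
# Cost per label is extrapolated from the ~$2,700 / ~25,000 samples figure above;
# the topic/sample splits below are illustrative assumptions, not corpus statistics.
cost_per_label = 2700 / 25000                     # ≈ $0.11 per labeled sentence

layouts = {
    "few topics, many samples": (8, 2_125),       # 8 topics x 2,125 samples
    "many topics, few samples": (150, 113),       # 150 topics x 113 samples
}
for name, (num_topics, per_topic) in layouts.items():
    total = num_topics * per_topic
    print(f"{name}: {total:,} samples, ≈ ${total * cost_per_label:,.0f} to annotate")
```

The annotation bill is essentially identical in both cases; the only thing that changes is how the budget is spread across topics.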
The Data: Introducing the FS150T-Corpus
To test their hypothesis, the researchers needed a controlled environment. They couldn’t just use existing datasets because those datasets often vary in too many ways (different sources, different labeling guidelines).
So, they built a new dataset called the FS150T-Corpus (Few-Shot 150 Topics Corpus).
The goal was to create a dataset that was comparable to a standard benchmark—the UKP Corpus—but with a radically different structure.
- UKP Corpus (Standard): Contains 8 topics with thousands of samples per topic.
- FS150T-Corpus (New): Contains 150 topics, but only 144 samples per topic.
Both datasets have roughly the same total number of training samples (~17,000), but the distribution is completely different. The FS150T-Corpus is designed for “few-shot” learning scenarios, where a model must learn to generalize from a small amount of data.
The researchers also utilized two other datasets for validation: the IAM-Corpus and the IBM-Corpus. You can see the breakdown of these datasets below. Note how the FS150T-Corpus has a massive number of topics compared to the UKP Corpus, despite similar total sizes.

To give you a better sense of what this data actually looks like, let’s examine a few samples. The task requires the model to read a Topic and a Sentence, and then assign a Class (label).

As shown in the table above, the arguments are complex. For the topic “nuclear energy,” the sentence discussing the expense of mining uranium is labeled as “contra.” The model must understand the economic implication to make that classification.
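In code, each training instance boils down to a (topic, sentence, label) triple. Here is a minimal sketch of that structure; the sentence below is paraphrased for illustration, not copied verbatim from the corpus.

```python
from dataclasses import dataclass

@dataclass
class ArgumentSample:
    topic: str      # e.g. "nuclear energy"
    sentence: str   # the candidate argument, judged relative to the topic
    label: str      # "pro", "contra", or "none"

example = ArgumentSample(
    topic="nuclear energy",
    sentence="Mining and enriching uranium is an expensive process.",  # paraphrased illustration
    label="contra",
)
```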
The Core Method: Experimenting with Size and Diversity
The researchers set up a series of experiments to answer three questions:
- Sample Experiments: How many samples per topic do we actually need?
- Topic Experiments: How much does adding new topics help the model generalize?
- Dataset Experiments: Which dataset structure creates a better model overall?
They employed four different models to test these scenarios:
- ERNIE 2.0: A medium-sized language model (110M parameters) pre-trained on tasks similar to argument mining. This represents a standard, efficient supervised learning approach.
- FLAN-T5 XL: A large language model (LLM) with about 3B parameters, fine-tuned on instructions.
- Llama2-70B & ChatGPT: Massive state-of-the-art LLMs used in a “zero-shot” setting (asking the model to classify without training it on the specific data).
1. Sample Experiments
In this phase, the researchers incrementally increased the number of training samples to see how quickly the models improved.
The results on the new FS150T-Corpus are visualized below.

Key Takeaways from Figure 1:
- The Rise of ERNIE (Blue Line): The medium-sized model, ERNIE 2.0, learns incredibly fast. With very few samples, it shoots up in accuracy (F1 score).
- The Plateau: Notice that after a certain point, adding more data yields diminishing returns. The line flattens out.
- Fine-Tuning vs. Zero-Shot: The green and red lines represent Llama2 and ChatGPT (zero-shot). While they start strong (since they don’t need training), the fine-tuned smaller models (ERNIE and FLAN-T5) eventually surpass them. This suggests that for specialized tasks like TDAM, you still need some training data to beat the generic giants.
The researchers found a similar pattern on the other datasets, such as the IBM-Corpus (below).

On the IBM-Corpus, the gap is even more pronounced. ERNIE 2.0 (blue) reaches high performance very quickly.
The Efficiency Discovery: The researchers calculated how many samples were needed to reach “acceptable” performance (defined as 95% of the model’s maximum potential).
- For the FS150T-Corpus, ERNIE 2.0 only needed 11% of the data (about 1,920 samples total).
- This implies that nearly 90% of the annotation effort for a traditional dataset layout might be wasted money.
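The cut-off itself is simple to compute: take the best F1 a model ever reaches on its learning curve and find the smallest training size that gets within 95% of it. A sketch with made-up numbers:

```python
# Learning-curve points: training-set size -> F1 score (made-up illustration).
learning_curve = {500: 0.52, 1_000: 0.61, 1_920: 0.68, 5_000: 0.70, 17_000: 0.71}

threshold = 0.95 * max(learning_curve.values())
acceptable_size = min(size for size, f1 in learning_curve.items() if f1 >= threshold)
print(f"95%-of-max threshold: {threshold:.3f}, first reached at {acceptable_size:,} samples")
```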
2. Topic Experiments
This is the heart of the “Diversity Over Size” argument. The researchers fixed the number of samples and steadily increased the number of topics those samples were drawn from.
If “topics” didn’t matter, the performance should stay the same regardless of whether the 1,000 samples came from 5 topics or 50 topics. But that is not what happened.

Look at the top row of Figure 4 (ERNIE 2.0). There is a consistent upward trend.
- Interpretation: As the number of topics (x-axis) increases, the model’s F1 score (y-axis) improves, even though the total amount of training data is fixed.
- Robustness: By seeing arguments from many different domains (politics, tech, economy), the model learns a better, more general representation of what an “argument” looks like. It stops memorizing specific keywords related to a single topic and starts understanding argumentative structure.
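The setup behind these curves is easy to reproduce in spirit: hold the total number of training samples fixed and sweep the number of topics they are drawn from. Below is a rough sketch under that assumption, not the authors' actual sampling code.

```python
import random

def sample_fixed_budget(pool, total_samples, num_topics, seed=42):
    """Draw `total_samples` sentences evenly from `num_topics` randomly chosen topics.

    `pool` maps topic -> list of labeled samples; the function name, arguments,
    and defaults are illustrative, not taken from the paper's implementation.
    """
    rng = random.Random(seed)
    chosen_topics = rng.sample(sorted(pool), k=num_topics)
    per_topic = total_samples // num_topics
    batch = []
    for topic in chosen_topics:
        batch.extend(rng.sample(pool[topic], k=min(per_topic, len(pool[topic]))))
    return batch

# Same budget, more diversity: e.g. 1,000 samples drawn from 5, 25, 50, or 150 topics.
# for k in (5, 25, 50, 150):
#     train_set = sample_fixed_budget(pool, total_samples=1_000, num_topics=k)
```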
3. Dataset Experiments
Finally, the researchers pitted the two dataset philosophies against each other.
- Model A: Trained on the UKP Corpus (Few topics, many samples).
- Model B: Trained on the FS150T-Corpus (Many topics, few samples).
They then tested both models on both datasets. This is a “cross-dataset” evaluation, which is the ultimate test of generalization.

The Verdict (Table 4): Look at the results for FLAN-T5 XL.
- When trained on the UKP Corpus and tested on itself, it scores .7881.
- When trained on the FS150T-Corpus and tested on the UKP Corpus (a dataset it has never seen), it scores .8270.
This is a stunning result. Training on the diverse, shallow dataset (FS150T) actually produced a better model for the UKP task than training on the UKP data itself. The diverse dataset taught the model to generalize so well that it outperformed a model trained on the specific target distribution.
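The evaluation logic behind this comparison is straightforward: train on one corpus, predict on the other, and compare macro-F1 scores. Here is a minimal sketch using scikit-learn's metric; the training and prediction functions are placeholders for whatever fine-tuning pipeline you use, not the paper's code.

```python
from sklearn.metrics import f1_score

def cross_dataset_eval(train_fn, predict_fn, corpora):
    """Train on each corpus, evaluate on every corpus (including unseen ones).

    `train_fn(train_split)` and `predict_fn(model, test_split)` are placeholders
    for your own fine-tuning pipeline; only the evaluation loop and the
    macro-F1 metric are shown here.
    """
    scores = {}
    for train_name, corpus in corpora.items():
        model = train_fn(corpus["train"])
        for test_name, other in corpora.items():
            predictions = predict_fn(model, other["test"])
            scores[(train_name, test_name)] = f1_score(
                other["test"]["labels"], predictions, average="macro")
    return scores

# scores[("FS150T", "UKP")] then answers: how well does a model trained on the
# diverse corpus transfer to the standard benchmark it has never seen?
```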
Detailed Analysis: Stability and Baselines
It is worth noting that while the medium-sized model (ERNIE) benefited consistently from diversity, the larger model (FLAN-T5 XL) was a bit more volatile with small sample sizes.
The researchers found that FLAN-T5 XL struggled initially with very low data but skyrocketed once it had enough samples. However, purely in terms of stability, the smaller ERNIE model was surprisingly robust.
The researchers also compared their results against strong baselines. Below is a detailed look at the sample experiments including standard deviations.

In Figure 5, the blue line (ERNIE) is consistently high and stable (narrow shaded area). The orange line (FLAN-T5) starts lower but climbs high. The flat lines (Green/Red) are the zero-shot models. They are consistent, but they hit a “performance ceiling” that they cannot break through without fine-tuning.
This reinforces the economic argument: If you want state-of-the-art performance, you must fine-tune. And if you must fine-tune, you should use a diverse dataset.
Conclusion and Implications
This research paper offers a crucial pivot point for how we think about data collection in NLP.
The “Diversity Over Size” principle suggests:
- Stop over-annotating: You do not need 3,000 examples of arguments about “gun control.” After a few hundred, the model has likely learned all it can about that specific topic.
- Spread the budget: Instead of paying annotators to label more of the same, spend that budget on finding new, distinct topics.
- Efficiency: By following this method, you can build a dataset with 10-15% of the usual sample count (saving thousands of dollars) while achieving 95% of the maximum performance.
For Students and Practitioners: If you are designing a machine learning project, especially one involving high-level reasoning like argument mining or stance detection, do not obsess over the volume of data. Focus on the variety of your data.
A model trained on a “shallow but wide” ocean of topics will be smarter, more robust, and more adaptable than one trained on a “deep but narrow” well of information. The researchers have not only provided the code and the FS150T dataset to the community, but they have also provided a blueprint for low-budget, high-performance AI development.