If you look at the recent history of Large Language Models (LLMs)—from GPT-3 to Llama 3 to Mistral—you will notice something interesting. The model architectures haven’t changed all that much. They are mostly variants of the Transformer decoder. What has changed is the scale and, more importantly, the data.
We have entered an era where “data engineering” is as critical as “model engineering.” Yet, the recipes for constructing the massive, multi-trillion token datasets required to train these models are often guarded as trade secrets. How exactly do you filter 100 terabytes of web text? How do you balance Python code against English literature?
In the paper “Data, Data Everywhere: A Guide for Pretraining Dataset Construction,” researchers from NVIDIA lift the veil on this process. They conducted a systematic study across the entire data pipeline—curation, selection, and sampling—to determine what actually improves model performance.
This post will walk you through their findings, effectively giving you a blueprint for building a modern pretraining dataset.
The Pipeline: From Raw Crawl to Refined Fuel
Building a dataset isn’t just about downloading the internet. It is a manufacturing pipeline. The authors define this process in four distinct stages, as illustrated below.

- Raw Data Sources: Collecting massive dumps of text (Web Crawl, Books, News, Code).
- Data Curation: Cleaning the data by removing “garbage,” duplicates, and ill-formed text.
- Data Selection: Identifying and keeping only the highest-value documents.
- Data Sampling: Deciding the ratios (weights) in which the model sees different data sources during training.
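Before breaking the stages down, here is a minimal, runnable skeleton of how they chain together. The stage bodies are deliberately empty placeholders; the techniques that fill them in are the subject of the rest of this post.

```python
# A minimal sketch of the four-stage pipeline; stage bodies are placeholders.
def curate(docs):
    """Stage 2: deduplicate and quality-filter a single source."""
    return docs

def select(docs):
    """Stage 3: keep only the highest-value documents (e.g. via DSIR)."""
    return docs

def compute_weights(piles):
    """Stage 4: decide per-source sampling ratios (uniform placeholder)."""
    return {name: 1 / len(piles) for name in piles}

def build_dataset(raw_sources):
    """Stage 1 input: {"web": [...], "books": [...], "code": [...]}."""
    piles = {name: select(curate(docs)) for name, docs in raw_sources.items()}
    return piles, compute_weights(piles)

piles, weights = build_dataset({"web": ["doc a", "doc b"], "books": ["doc c"]})
```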
Let’s break down each stage and look at the ablation studies (experiments where they test one variable at a time) to see what works.
1. Data Curation: Cleaning the Noise
The internet is messy. It contains navigation menus, error messages, and massive amounts of duplicated content. If you feed this raw sludge to a Transformer, the model wastes compute learning noise.
The researchers applied two major filters:
- Deduplication: Removing identical or near-identical documents.
- Quality Filtering: Using heuristics (rule-based) and classifiers (model-based) to remove low-quality text.
The “KenLM” Filter
A popular technique for quality filtering involves training a lightweight language model (KenLM) on “gold standard” data (like Wikipedia or Books). You then measure the perplexity of your web documents against this model. If a web document has high perplexity (it looks very different from Wikipedia), it’s likely low quality.
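As a rough illustration, a perplexity gate built on the `kenlm` Python bindings might look like the sketch below. It assumes you have already trained an n-gram model on your gold-standard corpus; the model path and the threshold of 100 are placeholders, not values from the paper.

```python
import kenlm

# Hypothetical path to a KenLM model trained on gold-standard text (e.g. Wikipedia).
gold_lm = kenlm.Model("wikipedia_5gram.binary")

def passes_perplexity_gate(text: str, max_ppl: float = 100.0) -> bool:
    """Keep documents whose perplexity under the gold LM is low enough."""
    # KenLM scores one sentence/line at a time; averaging per-line perplexity
    # is a crude but serviceable document-level proxy.
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    avg_ppl = sum(gold_lm.perplexity(line) for line in lines) / len(lines)
    return avg_ppl <= max_ppl

docs = ["The Eiffel Tower was completed in 1889.", "cLiCk HeRe!! buy now $$$"]
kept = [d for d in docs if passes_perplexity_gate(d)]
```

In practice the text is usually normalized before scoring and thresholds are tuned per language, but the gating logic is the same idea.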
Does this actually work? The researchers compared their quality classifier against KenLM perplexity scores:

As shown in Figure 12, there is a strong correlation. Documents with low perplexity (0-100) are overwhelmingly classified as “Medium” or “High” quality. As perplexity rises (the text becomes more random or chaotic), the quality drops.
The “Old vs. New” Dilemma in Deduplication
When you have two identical documents—one from a 2015 snapshot and one from 2023—which one do you keep?
Intuition suggests keeping the newer one to stay current. However, the researchers found the opposite: prioritizing older documents resulted in better model performance. The hypothesis is that older content has had more time to be curated and linked to, or that the oldest copy is closer to the original source, predating the explosion of SEO spam and machine-generated content.
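For exact duplicates, "prefer the older copy" can be as simple as keying documents by a content hash and keeping the earliest crawl date. A minimal sketch follows; the document dicts and field names are assumptions, and real pipelines also do fuzzy, near-duplicate matching, which this does not cover.

```python
import hashlib

def dedup_keep_oldest(docs):
    """docs: iterable of dicts with 'text' and 'timestamp' (e.g. crawl snapshot date)."""
    oldest = {}
    for doc in docs:
        # Hash lightly normalized text so identical documents collide on the same key.
        key = hashlib.sha256(doc["text"].strip().lower().encode("utf-8")).hexdigest()
        if key not in oldest or doc["timestamp"] < oldest[key]["timestamp"]:
            oldest[key] = doc
    return list(oldest.values())

docs = [
    {"text": "Same article.", "timestamp": "2023-01-05"},
    {"text": "Same article.", "timestamp": "2015-06-12"},
]
print(dedup_keep_oldest(docs))  # keeps the 2015 copy
```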
Key Takeaway: Always deduplicate and filter. When deduplicating, prefer older sources.
2. Data Selection: Finding the Needle in the Haystack
After cleaning, you still have too much data. You want to select the documents that will teach the model the most. The researchers analyzed DSIR (Data Selection via Importance Resampling).
DSIR is a statistical method. You give it a “target” distribution (e.g., “I want my data to look like Wikipedia and Books”) and a raw dataset (Common Crawl). DSIR assigns an importance weight to every document in the raw set based on how similar it is to the target.
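A simplified sketch of that core idea: fit bag-of-hashed-n-gram models on the target and on the raw pool, then score each raw document by the log-ratio of the two. The bucket count, tokenization, and smoothing below are placeholders; the actual algorithm then resamples documents in proportion to these weights rather than just scoring them.

```python
import math
from collections import Counter

NUM_BUCKETS = 10_000  # size of the hashed feature space, chosen for illustration

def hashed_bigrams(text):
    tokens = text.lower().split()
    return [hash((a, b)) % NUM_BUCKETS for a, b in zip(tokens, tokens[1:])]

def bucket_distribution(docs):
    # Categorical distribution over hash buckets with add-one smoothing,
    # so unseen buckets never get zero probability.
    counts = Counter(b for d in docs for b in hashed_bigrams(d))
    total = sum(counts.values())
    return [(counts[b] + 1) / (total + NUM_BUCKETS) for b in range(NUM_BUCKETS)]

def importance_score(doc, p_target, p_raw):
    # log p_target(doc) - log p_raw(doc) under the bag-of-hashed-bigrams models.
    return sum(math.log(p_target[b]) - math.log(p_raw[b]) for b in hashed_bigrams(doc))

target_docs = ["an encyclopedic article about roman history"]
raw_docs = ["buy cheap watches online now click here to win",
            "a short article about roman history"]
p_target = bucket_distribution(target_docs)
p_raw = bucket_distribution(raw_docs)
scores = {d: importance_score(d, p_target, p_raw) for d in raw_docs}
# The on-target document gets the higher (less negative) score; a full pipeline
# would resample with probability proportional to exp(score).
```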
Does it work?
Yes, but with caveats.
- Scope Matters: Running DSIR on each data source individually (e.g., filtering the “News” pile separately from the “Blogs” pile) works better than running it on the whole corpus at once.
- Target Sensitivity: The definition of “good data” changes everything. When researchers added arXiv papers to their target set, performance on general reasoning actually fluctuated.
This implies that while algorithmic selection is powerful, it is highly sensitive to what you define as your “ideal” data profile.
3. Data Sampling: The Art of Mixing
You have cleaned, filtered piles of data: English text, Multilingual text, and Code. How do you mix them? If you simply combine them by file size, the sheer volume of the Web Crawl will drown out everything else.
The paper compares four mixing strategies:
- User Preference: Manually tuning weights based on “vibes” or intuition.
- DoReMi: A method that uses a small proxy model to learn optimal weights.
- UniMax: A heuristic that caps the number of times the model sees any single data source (epochs) to ensure diversity.
- Alpha Sampling: A smoothing technique where each dataset is sampled in proportion to its size raised to a power \(\alpha < 1\), which gives smaller datasets a boost.
The Results
For English Text: UniMax was the clear winner. By capping the epochs (limiting repeats of small datasets), it prevents overfitting while ensuring the model sees enough volume.
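Roughly, UniMax spreads a fixed token budget as evenly as possible across sources while capping how many epochs any single source may contribute; small sources hit the cap and the leftover budget flows to the larger ones. A hedged sketch of that allocation logic (the budget, cap, and source sizes are made up):

```python
def unimax_weights(source_sizes, token_budget, max_epochs=4):
    """source_sizes: dict of source name -> tokens available in that source."""
    allocation = {}
    remaining_budget = token_budget
    # Visit the smallest sources first: they are the ones that hit the epoch cap.
    remaining = sorted(source_sizes, key=source_sizes.get)
    while remaining:
        name = remaining.pop(0)
        uniform_share = remaining_budget / (len(remaining) + 1)
        allocation[name] = min(uniform_share, source_sizes[name] * max_epochs)
        remaining_budget -= allocation[name]
    # Turn the per-source token counts into sampling weights.
    total = sum(allocation.values())
    return {name: tokens / total for name, tokens in allocation.items()}

sizes = {"web": 5_000e9, "books": 30e9, "news": 80e9, "wikipedia": 5e9}
print(unimax_weights(sizes, token_budget=1_000e9))
# With this budget, wikipedia, books, and news all hit the 4-epoch cap;
# the freed-up budget flows to the web crawl.
```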
For Code: Alpha Sampling (with \(\alpha = 0.3\)) won. Code datasets are highly imbalanced (billions of lines of JavaScript vs. very little Haskell). Alpha sampling boosts the low-resource languages enough to help the model generalize without overfitting.
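Alpha sampling, by contrast, is just a power-law re-weighting of dataset sizes. A minimal sketch with \(\alpha = 0.3\) (the per-language token counts are invented for illustration):

```python
def alpha_sampling_weights(sizes, alpha=0.3):
    # Raise each dataset's size to the power alpha, then renormalize; alpha < 1
    # flattens the distribution and boosts low-resource datasets.
    scaled = {name: size ** alpha for name, size in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}

code_sizes = {"javascript": 500e9, "python": 200e9, "haskell": 1e9}
print(alpha_sampling_weights(code_sizes))
# Proportional sampling would give Haskell ~0.1% of the mix; with alpha = 0.3 it
# gets roughly 8%, while JavaScript falls from about 71% to about 52%.
```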
The DoReMi Failure: Interestingly, the learned method, DoReMi, performed poorly. Look at the weight distributions below to see why:

In Figure 6 (English), notice the brown line (DoReMi). It assigns massive weight to specific datasets (like OpenWebText2) while ignoring others. It overfits to the proxy model’s biases. In contrast, UniMax (red lines) distributes weights more evenly.
The same pattern appears in Multilingual data:

In Figure 8, DoReMi (black dotted line) behaves erratically, while UniMax and Alpha sampling provide a smoother distribution across languages.
For Code (Figure 9 below), DoReMi put over 60% of the training weight on Markdown alone, which is obviously suboptimal for learning Python or C++.

Key Takeaway: Don’t overcomplicate sampling with learned models like DoReMi. Use UniMax for general text and Alpha Sampling for domain-heavy data like code.
4. Deep Dive: What is the Internet Made Of?
One of the paper’s most valuable contributions is a granular analysis of Common Crawl, the backbone of almost every LLM dataset. The researchers trained classifiers to tag millions of documents by Type of Speech, Domain, Quality, and Toxicity.
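The paper trains its own classifiers for these attributes; as a rough sketch of what the tagging step looks like in practice, suppose you have fine-tuned one small text classifier per attribute and saved each locally (the model paths and label values below are hypothetical).

```python
from transformers import pipeline

# Hypothetical locally saved classifiers, one per attribute.
classifiers = {
    "domain": pipeline("text-classification", model="./domain-classifier"),
    "quality": pipeline("text-classification", model="./quality-classifier"),
    "toxicity": pipeline("text-classification", model="./toxicity-classifier"),
}

def tag_document(text: str, max_chars: int = 2000) -> dict:
    # Classify a prefix only: attributes are document-level signals and full web
    # pages can exceed a small classifier's context window.
    snippet = text[:max_chars]
    return {attr: clf(snippet)[0]["label"] for attr, clf in classifiers.items()}

tags = tag_document("The Federal Reserve raised interest rates by 25 basis points...")
# e.g. {"domain": "finance", "quality": "high", "toxicity": "low"}
```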
Types of Speech
What are we actually reading on the web?

As Figure 2 shows, the web is dominated by Websites (homepages, commercial info), News, and Blogs. Surprisingly, Conversational text is rare (under 1%). This explains why base models (before instruction tuning) are good at completing articles but bad at chatting—they haven’t seen much dialogue during pretraining.
Domains
What topics are covered?

Figure 3 reveals a bias toward Arts & Entertainment and Sports. Crucial technical domains like Science, Law, and Finance are rare. If you want an LLM to be a legal expert, you cannot rely on the web crawl alone; you must curate specific legal datasets.
Quality and Toxicity
The researchers found that the vast majority of the web is “Medium” quality. High-quality text is rare.

However, the news regarding toxicity is better than expected. Most web documents fall into the lowest toxicity buckets:

The “High Quality” Paradox
Here is a fascinating insight: Where do we find the highest quality data?

Figure 13 shows that Explanatory Articles and News have the highest density of high-quality text. “Online Comments,” unsurprisingly, are often low quality.
But there is a catch. The researchers discovered a correlation between High Quality and Toxicity in certain domains.

While “Adult” content is toxic (as expected), the domain “Sensitive Subjects” also flags as toxic. However, these sensitive subjects often contain high-quality news reporting on war, politics, or crime. If you blindly filter out all “toxic” scores, you accidentally strip your dataset of high-quality journalism and historical records.
The Texture of the Web
Finally, we can view the web as a heatmap of Domain vs. Type of Speech.

Figure 14 confirms our suspicions: “Sports” is mostly News. “Science” is mostly Explanatory Articles. “Adult” content is effectively its own isolated cluster.
Putting It All Together: Attribute-Based Construction
The authors didn’t just analyze these attributes for fun; they used them to build a better dataset.
By bucketing data not just by source (e.g., “Common Crawl”), but by attribute (e.g., “Common Crawl - Science - High Quality”), they could refine the sampling process.
They found that setting sampling weights per attribute bucket significantly improved performance. Instead of treating “The Web” as one block, they upsampled the “High Quality” and “Science/Finance” buckets and downsampled the “Low Quality/Boilerplate” buckets.
They also used these attributes for Data Selection. Instead of a generic target, they defined a target set of “Low Toxicity AND High Quality” documents. This allowed them to filter out toxic sludge without losing the high-quality “Sensitive Subjects” discussed earlier.
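Putting those two uses together, a hedged sketch might look like the following; documents are assumed to already carry classifier labels, and the bucket multipliers are illustrative rather than the paper's actual weights.

```python
# Sampling: upweight scarce, valuable attribute buckets and downweight dominant,
# low-value ones. Multipliers here are invented for illustration.
BUCKET_MULTIPLIERS = {
    ("science", "high"): 3.0,
    ("finance", "high"): 3.0,
    ("arts_entertainment", "medium"): 0.5,
}

def bucket_weight(doc):
    return BUCKET_MULTIPLIERS.get((doc["domain"], doc["quality"]), 1.0)

# Selection: define the target profile for importance resampling as documents
# that are both low-toxicity and high-quality. Note this is a target profile,
# not a hard filter over the corpus, so high-quality "sensitive subjects"
# documents can still be selected if they resemble the target.
def in_target_set(doc):
    return doc["toxicity"] == "low" and doc["quality"] == "high"

docs = [
    {"domain": "science", "quality": "high", "toxicity": "low"},
    {"domain": "arts_entertainment", "quality": "medium", "toxicity": "low"},
    {"domain": "sensitive_subjects", "quality": "high", "toxicity": "high"},
]
target = [d for d in docs if in_target_set(d)]
weights = [bucket_weight(d) for d in docs]
```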
Conclusion
The “secret sauce” of LLMs is less about magic and more about rigorous engineering. This paper provides a clear set of actionable steps for practitioners:
- Curate aggressively: Deduplicate (preferring older copies) and filter using perplexity and quality classifiers.
- Don’t trust learned sampling: Simple heuristics like UniMax and Alpha Sampling usually generalize better than optimization-based methods like DoReMi.
- Know your data: You cannot optimize what you cannot measure. Classifying your massive dataset by domain and quality allows you to surgically upsample the knowledge your model lacks (like Law or Science) and downsample the noise.
- Chat data is scarce: The web is not conversational. If you want a chatbot, you need to find or synthesize conversational data, because the web crawl won’t provide it.
By following this “Guide for Pretraining Dataset Construction,” we move away from alchemy and toward a reproducible science of data.