If you look at the recent history of Large Language Models (LLMs)—from GPT-3 to Llama 3 to Mistral—you will notice something interesting. The model architectures haven’t changed all that much. They are mostly variants of the Transformer decoder. What has changed is the scale and, more importantly, the data.

We have entered an era where “data engineering” is as critical as “model engineering.” Yet, the recipes for constructing the massive, multi-trillion token datasets required to train these models are often guarded as trade secrets. How exactly do you filter 100 terabytes of web text? How do you balance Python code against English literature?

In the paper “Data, Data Everywhere: A Guide for Pretraining Dataset Construction,” researchers from NVIDIA lift the veil on this process. They conducted a systematic study across the entire data pipeline—curation, selection, and sampling—to determine what actually improves model performance.

This post will walk you through their findings, effectively giving you a blueprint for building a modern pretraining dataset.

The Pipeline: From Raw Crawl to Refined Fuel

Building a dataset isn’t just about downloading the internet. It is a manufacturing pipeline. The authors define this process in four distinct stages, as illustrated below.

Figure 1: Each step in the development process to go from a collection of data sources into a final pretraining set that produces a highly capable LM.

  1. Raw Data Sources: Collecting massive dumps of text (Web Crawl, Books, News, Code).
  2. Data Curation: Cleaning the data by removing “garbage,” duplicates, and ill-formed text.
  3. Data Selection: Identifying and keeping only the highest-value documents.
  4. Data Sampling: Deciding the ratios (weights) in which the model sees different data sources during training.

Let’s break down each stage and look at the ablation studies (experiments where they test one variable at a time) to see what works.


1. Data Curation: Cleaning the Noise

The internet is messy. It contains navigation menus, error messages, and massive amounts of duplicated content. If you feed this raw sludge to a Transformer, the model wastes compute learning noise.

The researchers applied two major filters:

  1. Deduplication: Removing identical or near-identical documents (a minimal sketch of near-duplicate detection follows this list).
  2. Quality Filtering: Using heuristics (rule-based) and classifiers (model-based) to remove low-quality text.
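
The paper doesn't tie the reader to a specific deduplication toolkit, but near-duplicate detection at web scale is commonly done with MinHash plus locality-sensitive hashing. Here is a minimal sketch using the `datasketch` library; the shingle size and Jaccard threshold are illustrative assumptions, not values from the paper.

```python
# Near-duplicate detection with MinHash + LSH (illustrative sketch, not the paper's pipeline).
# Requires: pip install datasketch
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128           # number of hash permutations (assumption)
JACCARD_THRESHOLD = 0.8  # similarity above which two docs count as duplicates (assumption)

def shingles(text: str, n: int = 5):
    """Yield overlapping n-word shingles of a document."""
    words = text.split()
    for i in range(max(len(words) - n + 1, 1)):
        yield " ".join(words[i:i + n])

def minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for sh in shingles(text):
        m.update(sh.encode("utf-8"))
    return m

def deduplicate(docs: dict[str, str]) -> list[str]:
    """Return the ids of documents to keep, dropping near-duplicates of earlier ones."""
    lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash(text)
        if lsh.query(sig):        # a similar document was already kept
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```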

The “KenLM” Filter

A popular technique for quality filtering involves training a lightweight language model (KenLM) on “gold standard” data (like Wikipedia or Books). You then measure the perplexity of your web documents against this model. If a web document has high perplexity (it looks very different from Wikipedia), it’s likely low quality.
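
To make the mechanics concrete, here is a minimal sketch of perplexity-based filtering with the `kenlm` Python bindings. The model file name and the perplexity cutoff are placeholders; the paper does not publish a single universal threshold.

```python
# Perplexity-based quality filtering (minimal sketch, not the paper's exact setup).
# Assumes a KenLM n-gram model trained on "gold standard" text such as Wikipedia.
import kenlm

model = kenlm.Model("wikipedia_5gram.binary")  # hypothetical model file
PERPLEXITY_CUTOFF = 1000.0                     # illustrative threshold; tune on held-out data

def perplexity(text: str) -> float:
    # model.score() returns the total log10 probability; +1 accounts for the </s> token.
    words = len(text.split()) + 1
    return 10.0 ** (-model.score(text) / words)

def keep(document: str) -> bool:
    """Keep documents that 'look like' the gold-standard corpus, i.e. have low perplexity."""
    return perplexity(document) < PERPLEXITY_CUTOFF
```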

Does this actually work? The researchers compared their quality classifier against KenLM perplexity scores:

Figure 12: There is high correlation between the quality classifier and the perplexity of a KenLM model used for quality filtering during data curation.

As shown in Figure 12, there is a strong correlation. Documents with low perplexity (0-100) are overwhelmingly classified as “Medium” or “High” quality. As perplexity rises (the text becomes more random or chaotic), the quality drops.

The “Old vs. New” Dilemma in Deduplication

When you have two identical documents—one from a 2015 snapshot and one from 2023—which one do you keep?

Intuition suggests keeping the newer one to stay current. However, the researchers found the opposite: prioritizing older documents resulted in better model performance. The hypothesis is that older content has had more time to be curated or linked to, or perhaps the golden age of distinctive web content predates the explosion of SEO spam and machine-generated text.
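
As a concrete illustration (my own rendition, not the paper's code), resolving a cluster of near-duplicates then becomes a one-line policy choice: keep the copy from the earliest crawl snapshot.

```python
# Resolving a duplicate cluster: prefer the oldest snapshot (illustrative sketch).
def resolve_duplicates(cluster: list[dict]) -> dict:
    """cluster: documents judged near-identical, each carrying a 'snapshot_date' field."""
    return min(cluster, key=lambda doc: doc["snapshot_date"])

docs = [
    {"id": "a", "snapshot_date": "2023-06", "text": "..."},
    {"id": "b", "snapshot_date": "2015-02", "text": "..."},
]
assert resolve_duplicates(docs)["id"] == "b"  # the 2015 copy wins
```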

Key Takeaway: Always deduplicate and filter. When deduplicating, prefer older sources.


2. Data Selection: Finding the Needle in the Haystack

After cleaning, you still have too much data. You want to select the documents that will teach the model the most. The researchers analyzed DSIR (Data Selection via Importance Resampling).

DSIR is a statistical method. You give it a “target” distribution (e.g., “I want my data to look like Wikipedia and Books”) and a raw dataset (Common Crawl). DSIR assigns an importance weight to every document in the raw set based on how similar it is to the target.
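
The sketch below shows the core idea in simplified form: fit cheap bag-of-hashed-n-gram distributions for the target and the raw corpus, and score each document by its log importance ratio. This is my own rendition of the technique, not the authors' implementation, and the bucket count is an assumption.

```python
# Simplified DSIR-style importance weighting over hashed bigram features (sketch).
import math
from collections import Counter

NUM_BUCKETS = 10_000  # size of the hashed feature space (assumption)

def features(text: str) -> Counter:
    """Hash word bigrams into a fixed number of buckets."""
    words = text.lower().split()
    buckets = Counter()
    for i in range(len(words) - 1):
        buckets[hash((words[i], words[i + 1])) % NUM_BUCKETS] += 1
    return buckets

def fit_log_probs(docs: list[str]) -> list[float]:
    """Fit a smoothed categorical distribution over feature buckets."""
    counts = Counter()
    for doc in docs:
        counts.update(features(doc))
    total = sum(counts.values()) + NUM_BUCKETS  # add-one smoothing
    return [math.log((counts[b] + 1) / total) for b in range(NUM_BUCKETS)]

def importance_logweight(doc: str, log_p_target: list[float], log_p_raw: list[float]) -> float:
    """log p_target(doc) - log p_raw(doc) under the bag-of-features model."""
    return sum(c * (log_p_target[b] - log_p_raw[b]) for b, c in features(doc).items())

# Usage: score every raw document, then resample with probability proportional to
# exp(importance_logweight); the DSIR paper uses Gumbel top-k to sample without replacement.
```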

Does it work?

Yes, but with caveats.

  1. Scope Matters: Running DSIR on each data source individually (e.g., filtering the “News” pile separately from the “Blogs” pile) works better than running it on the whole corpus at once.
  2. Target Sensitivity: The definition of “good data” changes everything. When the researchers added arXiv papers to their target set, performance on general reasoning tasks fluctuated rather than improving across the board.

This implies that while algorithmic selection is powerful, it is highly sensitive to what you define as your “ideal” data profile.


3. Data Sampling: The Art of Mixing

You have cleaned, filtered piles of data: English text, Multilingual text, and Code. How do you mix them? If you simply combine them by file size, the sheer volume of the Web Crawl will drown out everything else.

The paper compares four mixing strategies:

  1. User Preference: Manually tuning weights based on “vibes” or intuition.
  2. DoReMi: A method that uses a small proxy model to learn optimal weights.
  3. UniMax: A heuristic that caps the number of times the model sees any single data source (epochs) to ensure diversity.
  4. Alpha Sampling: A smoothing technique where you raise each dataset’s size to a power \(\alpha < 1\) (e.g., a square root) to give smaller datasets a boost (see the sketch after this list).
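
The two heuristics are simple enough to write down directly. Below is a minimal sketch of both; it is my own rendition of the ideas rather than the paper’s code, and the token counts, budget, and epoch cap in the example are made up for illustration.

```python
# Two heuristic mixing strategies (illustrative sketch, not the paper's code).

def alpha_sampling_weights(token_counts: dict[str, int], alpha: float = 0.3) -> dict[str, float]:
    """Weight each source proportionally to its size raised to the power alpha (< 1),
    which boosts small sources relative to proportional-by-size sampling."""
    scaled = {name: n ** alpha for name, n in token_counts.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

def unimax_weights(token_counts: dict[str, int], budget: int, epoch_cap: float = 4.0) -> dict[str, float]:
    """Simplified UniMax-style allocation: split the token budget as uniformly as possible,
    but never ask any source for more than `epoch_cap` epochs of its data."""
    remaining_budget = float(budget)
    allocation: dict[str, float] = {}
    # Visit sources from smallest to largest so caps bind before the rest is split.
    for i, (name, n) in enumerate(sorted(token_counts.items(), key=lambda kv: kv[1])):
        uniform_share = remaining_budget / (len(token_counts) - i)
        allocation[name] = min(uniform_share, epoch_cap * n)
        remaining_budget -= allocation[name]
    total = sum(allocation.values())
    return {name: a / total for name, a in allocation.items()}

# Example with made-up token counts (in billions):
counts = {"CommonCrawl": 1500, "Books": 30, "Wikipedia": 5, "StackExchange": 20}
print(alpha_sampling_weights(counts, alpha=0.3))
print(unimax_weights(counts, budget=300, epoch_cap=4.0))
```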

The Results

For English Text: UniMax was the clear winner. By capping the epochs (limiting repeats of small datasets), it prevents overfitting while ensuring the model sees enough volume.

For Code: Alpha Sampling (with \(\alpha = 0.3\)) won. Code datasets are highly imbalanced (billions of lines of JavaScript vs. very little Haskell). Alpha sampling boosts the low-resource languages enough to help the model generalize without overfitting.

The DoReMi Failure: Interestingly, the learned method, DoReMi, performed poorly. Look at the weight distributions below to see why:

Figure 6: Returned sampling weights for the English dataset.

In Figure 6 (English), notice the brown line (DoReMi). It assigns massive weight to specific datasets (like OpenWebText2) while ignoring others. It overfits to the proxy model’s biases. In contrast, UniMax (red lines) distributes weights more evenly.

The same pattern appears in Multilingual data:

Figure 8: Returned sampling weights for the Multilingual dataset.

In Figure 8, DoReMi (black dotted line) behaves erratically, while UniMax and Alpha sampling provide a smoother distribution across languages.

For Code (Figure 9 below), DoReMi put over 60% of the training weight on Markdown alone, which is obviously suboptimal for learning Python or C++.

Figure 9: Returned sampling weights for the Code dataset.

Key Takeaway: Don’t overcomplicate sampling with learned models like DoReMi. Use UniMax for general text and Alpha Sampling for domain-heavy data like code.


4. Deep Dive: What is the Internet Made Of?

One of the paper’s most valuable contributions is a granular analysis of Common Crawl, the backbone of almost every LLM dataset. The researchers trained classifiers to tag millions of documents by Type of Speech, Domain, Quality, and Toxicity.
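
The paper trains its own attribute classifiers; to show the shape of this step, here is a sketch using a lightweight fastText-style classifier, which is a common choice when you need to tag billions of documents cheaply. The training file name and labels are placeholders.

```python
# Tagging documents with a lightweight attribute classifier (illustrative sketch).
# Requires: pip install fasttext
import fasttext

# Training data format (one labeled document per line), e.g. for quality:
#   __label__high  A well-structured explanatory article about protein folding...
#   __label__low   click here click here best deals best deals...
model = fasttext.train_supervised(
    input="quality_train.txt",  # hypothetical labeled file
    epoch=5,
    wordNgrams=2,
)

def tag_quality(document: str) -> tuple[str, float]:
    """Return (label, confidence) for a single document."""
    text = document.replace("\n", " ")      # fastText's predict() rejects newlines
    labels, probs = model.predict(text, k=1)
    return labels[0].replace("__label__", ""), float(probs[0])
```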

Types of Speech

What are we actually reading on the web?

Figure 2: Distribution of document types in web crawl.

As Figure 2 shows, the web is dominated by Websites (homepages, commercial info), News, and Blogs. Surprisingly, Conversational text is rare (under 1%). This explains why base models (before instruction tuning) are good at completing articles but bad at chatting—they haven’t seen much dialogue during pretraining.

Domains

What topics are covered?

Figure 3: Distribution of content domains in web crawl.

Figure 3 reveals a bias toward Arts & Entertainment and Sports. Crucial technical domains like Science, Law, and Finance are rare. If you want an LLM to be a legal expert, you cannot rely on the web crawl alone; you must curate specific legal datasets.

Quality and Toxicity

The researchers found that the vast majority of the web is “Medium” quality. High-quality text is rare.

Figure 10: Breakdown of document quality across web crawl snapshots.

However, the news regarding toxicity is better than expected. Most web documents fall into the lowest toxicity buckets:

Figure 11: Breakdown of document toxicity across web crawl snapshots.

The “High Quality” Paradox

Here is a fascinating insight: Where do we find the highest quality data?

Figure 13: Types of speech sorted by descending order of percentage of high quality documents.

Figure 13 shows that Explanatory Articles and News have the highest density of high-quality text. “Online Comments,” unsurprisingly, are often low quality.

But there is a catch. The researchers discovered a correlation between High Quality and Toxicity in certain domains.

Figure 5: Heatmap of domains by probability of toxic content. Adult and online communities contain the highest percentage of toxic content.

While “Adult” content is toxic (as expected), the domain “Sensitive Subjects” also flags as toxic. However, these sensitive subjects often contain high-quality news reporting on war, politics, or crime. If you blindly filter out all “toxic” scores, you accidentally strip your dataset of high-quality journalism and historical records.

The Texture of the Web

Finally, we can view the web as a heatmap of Domain vs. Type of Speech.

Figure 14: Heatmap of domains by types of speech.

Figure 14 confirms our suspicions: “Sports” is mostly News. “Science” is mostly Explanatory Articles. “Adult” content is effectively its own isolated cluster.


Putting It All Together: Attribute-Based Construction

The authors didn’t just analyze these attributes for fun; they used them to build a better dataset.

By bucketing data not just by source (e.g., “Common Crawl”), but by attribute (e.g., “Common Crawl - Science - High Quality”), they could refine the sampling process.

They found that setting sampling weights based on attribute buckets significantly improved performance. Instead of treating “The Web” as one block, they upsampled the “High Quality” and “Science/Finance” buckets and downsampled the “Low Quality/Boilerplate” buckets.

They also used these attributes for Data Selection. Instead of a generic target, they defined a target set of “Low Toxicity AND High Quality” documents. This allowed them to filter out toxic sludge without losing the high-quality “Sensitive Subjects” discussed earlier.
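
Putting the two ideas together, a sketch of attribute-based construction might look like the following. The bucket keys and weight multipliers are illustrative assumptions, not values published in the paper.

```python
# Attribute-based bucketing, selection, and reweighting (illustrative sketch).

def bucket_key(doc: dict) -> tuple[str, str, str]:
    """Bucket by (source, domain, quality) rather than by source alone."""
    return (doc["source"], doc["domain"], doc["quality"])

def select(doc: dict) -> bool:
    """Attribute-based target: keep low-toxicity, non-low-quality documents,
    which preserves high-quality 'Sensitive Subjects' reporting."""
    return doc["toxicity"] == "low" and doc["quality"] != "low"

# Hand-tuned multipliers applied on top of the base sampling weights (assumed values).
BUCKET_MULTIPLIERS = {
    ("CommonCrawl", "Science", "high"): 3.0,   # upsample rare, valuable buckets
    ("CommonCrawl", "Finance", "high"): 3.0,
    ("CommonCrawl", "Arts_&_Entertainment", "medium"): 0.5,  # downsample the glut
}

def bucket_weight(base_weight: float, key: tuple[str, str, str]) -> float:
    """Scale a bucket's base weight; renormalize all weights afterwards."""
    return base_weight * BUCKET_MULTIPLIERS.get(key, 1.0)
```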

Conclusion

The “secret sauce” of LLMs is less about magic and more about rigorous engineering. This paper provides a clear set of actionable steps for practitioners:

  1. Curate aggressively: Deduplicate (preferring older documents) and filter low-quality text with heuristics, classifiers, or perplexity.
  2. Don’t trust learned sampling: Simple heuristics like UniMax and Alpha Sampling usually generalize better than optimization-based methods like DoReMi.
  3. Know your data: You cannot optimize what you cannot measure. Classifying your massive dataset by domain and quality allows you to surgically upsample the knowledge your model lacks (like Law or Science) and downsample the noise.
  4. Chat data is scarce: The web is not conversational. If you want a chatbot, you need to find or synthesize conversational data, because the web crawl won’t provide it.

By following this “Guide for Pretraining Dataset Construction,” we move away from alchemy and toward a reproducible science of data.