The race to build better Large Language Models (LLMs) is often viewed as a race for more data. This view is driven by the “scaling laws,” which suggest that model performance improves predictably with the size of the training corpus and the number of model parameters. However, recent developments in the field have added a crucial nuance to this rule: it’s not just about the quantity of data; it is equally, if not more, about its quality.
Imagine trying to learn a new language by reading every scrap of paper found on the street—from classic novels to discarded candy wrappers and scam flyers. You would learn the language, but you would also learn a lot of noise. This is the state of the raw web. To train state-of-the-art models like Llama 3 or GPT-4, engineers must filter trillions of tokens of web data to separate the educational content from the “junk.”
Today, we are doing a deep dive into a fascinating research paper titled “Rethinking KenLM” by researchers at Upstage AI. They propose a clever, efficient solution to this filtering problem. Instead of just looking for good data, they train a model specifically to recognize bad data, and then pit the two against each other.
The Bottleneck: Quality vs. Efficiency
Before understanding the solution, we must understand the problem. Filtering a dataset the size of the internet is a logistical nightmare.
Modern filtering pipelines often use heavy, GPU-dependent methods. For example, you might use a BERT-based classifier or finetune a small LLM to score documents. While effective, these methods are computationally expensive. When you are processing petabytes of text, firing up a cluster of GPUs just for data cleaning is often cost-prohibitive.
Enter KenLM
To solve the efficiency problem, the industry standard has been KenLM. KenLM is a library that implements n-gram language models.
- What is an n-gram model? It’s a statistical model that predicts the next word in a sequence based on the previous \(n-1\) words. If \(n=3\) (a trigram), and the previous words are “the cat,” the model looks at the probability of “sat,” “ate,” or “ran” coming next.
- Why use it? It is incredibly lightweight and runs entirely on CPUs. It requires a fraction of the compute power of a neural network.
The standard approach (used by major datasets like CCNet and RefinedWeb) is to train a KenLM model on high-quality data, typically Wikipedia. When filtering new web data, you calculate each document’s Perplexity (PPL) under this model, as sketched in code after the list below.
- Low Perplexity: The text looks like Wikipedia (likely high quality).
- High Perplexity: The text looks strange or unfamiliar (likely low quality).
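To make this concrete, here is a minimal sketch of perplexity-based scoring using the kenlm Python bindings. The model path, example documents, and missing preprocessing (tokenization, normalization) are illustrative assumptions, not the paper’s exact pipeline.

```python
import kenlm  # Python bindings to the KenLM C++ library

# Illustrative path: a 5-gram model trained on Wikipedia text.
model = kenlm.Model("wiki_5gram.binary")

def doc_perplexity(text: str) -> float:
    """Perplexity of a whitespace-tokenized document under the n-gram model."""
    # Real pipelines (e.g., CCNet) normalize and tokenize text first;
    # that preprocessing is omitted here for brevity.
    return model.perplexity(text)

for doc in [
    "The cat sat on the mat because it was warm .",
    "FREE $$$ CLICK NOW !!! winner winner claim your prize",
]:
    print(f"{doc_perplexity(doc):10.1f}  {doc}")
# Lower perplexity -> closer to the Wikipedia distribution -> kept by the filter.
```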
The Flaw in the Standard Approach
The researchers identify a critical weakness in this “Wikipedia-only” approach. Standard KenLM filtering is one-sided. It explicitly learns what “good” text looks like, but it does not explicitly learn what “bad” text looks like.
As a result, it suffers from false positives. Text that is grammatically correct but semantically garbage—like a well-written spam email or a polite advertisement—might get a low perplexity score simply because it uses standard sentence structures. The model thinks, “Well, the grammar is fine, so it must be good,” failing to realize that the content itself is undesirable for training an LLM.
The Core Method: The Good, The Bad, and The Ensemble
The authors propose a “Good and Bad” ensemble approach. Instead of relying on a single perspective, they use two contrasting models to evaluate text.
1. The Good KenLM
First, they upgraded the standard “Good” model. While Wikipedia is great, it doesn’t cover everything. Scientific papers and textbooks are arguably better sources for reasoning and logic.
The authors trained their Good KenLM on a combination of:
- S2ORC: A massive corpus of science and academic papers.
- Textbooks: Specifically, the “Textbooks-are-all-you-need-lite” dataset.
The goal of this model is to assign low perplexity scores to high-quality, well-reasoned text.
2. The Bad KenLM
This is the novel contribution. The researchers asked: Why not train a model specifically on the data we want to remove?
They created a Bad KenLM trained on distinct “noise” datasets:
- Spam: Emails and automated messages.
- SNS (Social Networking Services): Informal text from platforms like Twitter (now X), containing hashtags, slang, and fragmented sentences.
- Hate Speech/Toxic Comments: (Though, as we will see in the experiments, this had mixed results).
The goal of this model is to assign low perplexity scores to garbage. If a document gets a low score from the Bad KenLM, it means “I recognize this! This looks like the spam I was trained on.”
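Training the two contrastive models uses the same recipe; only the corpora differ. Below is a rough sketch that shells out to KenLM’s command-line tools (`lmplz`, `build_binary`); the corpus file names, the 5-gram order, and the skipped preprocessing are assumptions, not the authors’ exact setup.

```python
import subprocess
from pathlib import Path

def train_kenlm(corpus: Path, arpa_out: Path, order: int = 5) -> Path:
    """Estimate an n-gram LM with KenLM's lmplz, then binarize it for fast loading.
    Assumes the KenLM binaries (lmplz, build_binary) are on PATH."""
    with corpus.open("rb") as src, arpa_out.open("wb") as dst:
        subprocess.run(["lmplz", "-o", str(order)], stdin=src, stdout=dst, check=True)
    binary_out = arpa_out.with_suffix(".binary")
    subprocess.run(["build_binary", str(arpa_out), str(binary_out)], check=True)
    return binary_out

# Illustrative corpora: a "good" mixture (academic papers + textbooks)
# and a "bad" mixture (spam + social-media posts).
good_model = train_kenlm(Path("good_mix.txt"), Path("good_5gram.arpa"))
bad_model = train_kenlm(Path("bad_mix.txt"), Path("bad_5gram.arpa"))
```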
3. The Ensemble Strategy
Now we have two scores for every document:
- \(PPL_{good}\): How much does this look like a textbook? (Lower means more textbook-like, i.e., likely high quality.)
- \(PPL_{bad}\): How much does this look like spam? (Lower means more spam-like, i.e., likely low quality.)
You cannot simply combine these raw perplexities, because the two models are trained on different data distributions. A “low” score for one model might be numerically very different from a “low” score for the other.
To solve this, the authors use Z-score standardization. They calculate the mean (\(\mu\)) and standard deviation (\(\sigma\)) for both models to normalize the scores.
The final Ensembled Score for a document \(d\) is calculated as:

\[
\text{Score}(d) = \alpha \cdot \frac{PPL_{good}(d) - \mu_{good}}{\sigma_{good}} \;-\; (1 - \alpha) \cdot \frac{PPL_{bad}(d) - \mu_{bad}}{\sigma_{bad}}
\]
Let’s break down this math (Equation 1 from the paper):
- Term 1 (Good): We take the normalized Good PPL. If the text is high quality, this Z-score is low (negative).
- Term 2 (Bad): We take the normalized Bad PPL. If the text is high quality, it should look unfamiliar to the Bad model, resulting in a high (positive) Z-score.
- The Subtraction: The equation subtracts the Bad term from the Good term. Setting the \(\alpha\) weights aside for a moment:
- If text is High Quality: Good Score is Low (e.g., -2) and Bad Score is High (e.g., +2).
- Result: \((-2) - (+2) = -4\). (Very Low Score)
- If text is Spam: Good Score might be average (e.g., 0) because spam can be grammatical, but Bad Score is Low (e.g., -2).
- Result: \((0) - (-2) = +2\). (High Score)
The parameter \(\alpha\) (alpha) allows the researchers to weight the importance of the Good model versus the Bad model.
By combining these views, the metric pushes high-quality data to the bottom of the scale and noisy data to the top, creating a much clearer separation than either model could achieve alone.
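As a rough sketch of this scoring step (the normalization statistics, the default \(\alpha\), and the keep-threshold below are illustrative, not the paper’s exact values), the ensemble fits in a few lines of NumPy:

```python
import numpy as np

def ensemble_scores(ppl_good: np.ndarray, ppl_bad: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    """Combine the two perplexity streams after z-score standardization.
    Lower ensemble score = more likely to be high-quality text."""
    z_good = (ppl_good - ppl_good.mean()) / ppl_good.std()
    z_bad = (ppl_bad - ppl_bad.mean()) / ppl_bad.std()
    return alpha * z_good - (1.0 - alpha) * z_bad

# Toy example: doc 0 reads like a textbook, doc 1 reads like spam, doc 2 is middling.
ppl_good = np.array([12.0, 310.0, 95.0])
ppl_bad = np.array([800.0, 25.0, 400.0])
scores = ensemble_scores(ppl_good, ppl_bad)
keep = scores <= np.quantile(scores, 0.30)  # keep the best-scoring 30%
print(scores.round(2), keep)                # doc 0 is kept, doc 1 (spam-like) is dropped
```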
Experiments and Results
To prove this works, the authors used a massive web dump (CC-MAIN-2024-10 from Fineweb-edu) containing 211 million samples. They used “educational scores” (generated by a high-quality, computationally expensive classifier) as the ground truth. If a text had a high educational score, it was considered a “true positive.”
RQ1: Does it beat the baseline?
The results were impressive. They compared their method against the standard Wikipedia KenLM and even FastText, a popular classifier-based filtering method.

Looking at Table 1:
- Wiki KenLM (The Standard): achieved a Recall@30 of 0.5530. This means it missed nearly half of the high-quality data in the top 30% of filtered results.
- Good KenLM: Simply changing the training data from Wikipedia to Textbooks/Science improved performance to 0.7059. This confirms that Wikipedia isn’t the “be-all and end-all” of good data.
- Ens(Good, Bad): The proposed ensemble reached a Recall@30 of 0.8190.
The most striking result here is that the Ensemble (Good, Bad) outperformed FastText. FastText is a supervised classifier that usually performs better than n-gram models but is slightly heavier. Beating FastText with simple n-gram models on CPUs is a significant efficiency win.
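For readers who want to reproduce this kind of evaluation, here is one plausible way to compute Recall@k (the paper’s exact protocol may differ): keep the fraction \(k\) of documents with the best filter scores and measure what share of the ground-truth high-quality documents survives.

```python
import numpy as np

def recall_at_k(filter_scores: np.ndarray, is_high_quality: np.ndarray, k: float = 0.30) -> float:
    """Share of ground-truth high-quality documents that survive when we keep
    the k fraction of documents with the lowest (best) filter scores."""
    n_keep = int(len(filter_scores) * k)
    kept = np.argsort(filter_scores)[:n_keep]  # lowest score = highest predicted quality
    return float(is_high_quality[kept].sum() / is_high_quality.sum())
```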
RQ2: What constitutes “Bad” Data?
You might assume that training the “Bad” model on the worst possible text (hate speech, toxic comments) would yield the best results. However, the experiments showed something different.

Table 2 reveals an interesting insight:
- Toxic data performed poorly. Adding toxic data actually hurt performance compared to using just Spam or Twitter data.
- Why? Toxic datasets often contain extreme profanity or very specific types of hate speech. This distribution is an outlier. It is so different from standard web text that the model learns a very narrow definition of “bad.”
- Twitter & Spam are King. Social media and spam represent the general noise of the internet—informal grammar, ads, fragments, and irrelevant chatter. This is exactly the kind of “average” noise that pollutes web corpora.
RQ3: The Balancing Act (\(\alpha\))
How much should we trust the Good model versus the Bad model? This is controlled by the \(\alpha\) parameter in the equation; a small tuning sketch follows the list below.

Figure 1 shows the performance curve.
- If \(\alpha\) is too low (trusting the Bad model too much), performance drops. The Bad model might be too aggressive.
- If \(\alpha\) is too high (trusting the Good model too much), you lose the benefit of the ensemble.
- The Sweet Spot: The performance peaks around \(\alpha = 0.7\). This suggests that the “Good” signal is the primary driver, but the “Bad” signal provides a critical 30% correction to filter out the tricky cases.
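To find such a balance empirically, one could run a brute-force sweep like the sketch below, reusing the `ensemble_scores` and `recall_at_k` helpers defined earlier; the grid of candidate values and the input arrays are assumptions.

```python
import numpy as np

def sweep_alpha(ppl_good, ppl_bad, is_high_quality, alphas=np.linspace(0.0, 1.0, 11)):
    """Recall@30 for each candidate alpha; expect a peak between the two extremes."""
    return {
        round(float(a), 1): recall_at_k(ensemble_scores(ppl_good, ppl_bad, alpha=a), is_high_quality)
        for a in alphas
    }
```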
RQ4: Is it efficient?
The main argument for using KenLM over neural networks is speed and cost. Does running two models ruin this advantage?

Table 3 breaks down the cost.
- Time: The ensemble takes about 1.76x as long as a single model (3,928s vs 2,234s).
- Cost: Processing this dump cost **$2.50** with the ensemble, compared to **$1.42** with the single model.
While the cost rises by roughly 75%, it is still negligible. We are talking about processing hundreds of millions of documents for the price of a cup of coffee. Compared to the thousands of dollars required to run GPU-based classifiers on this volume of data, the ensemble approach remains incredibly cost-effective.
RQ5: What does it actually catch?
Statistics are great, but seeing is believing. What kind of text does the Good model miss that the Ensemble catches?

Figure 2 provides qualitative examples.
- The “Polite” Spam: Look at the first example about “Online roulette.” The sentence structure is fine: “Online roulette for real money Hungary withdrawals can take up to 3 days…”
- A model trained only on “Good” data might see the grammar and think, “This looks like a valid sentence.”
- The Bad KenLM, trained on spam, recognizes the vocabulary (“roulette,” “withdrawals,” “real money”) and the persuasive tone, flagging it immediately.
- Communication Style: The second example represents conversational filler often found on forums. It’s not educational, but it’s not grammatically “wrong” enough for a standard model to reject. The Twitter-trained Bad KenLM, however, knows exactly what this looks like.
Conclusion and Implications
The paper “Rethinking KenLM” offers a compelling reminder that in machine learning, defining the negative space is often as important as defining the positive.
By explicitly modeling “bad” data using noise from social media and spam folders, the researchers achieved state-of-the-art filtering performance on CPU hardware. This democratizes high-quality data curation. You don’t need a massive cluster of H100 GPUs to clean your dataset; you just need a smart ensemble of lightweight models.
Key Takeaways for Students:
- Don’t ignore the CPU: In an era of massive GPUs, efficient CPU algorithms like KenLM (n-grams) still have a vital role to play in data infrastructure.
- Explicitly model the negative: If you want a model to reject something, train it to recognize that specific thing. Relying on “it doesn’t look like the good stuff” is often insufficient.
- Data Selection Matters: The choice of training data (Twitter vs. Toxic) fundamentally changes model behavior. Understanding the distribution of your data is more important than just hoarding more of it.
As we continue to scale LLMs, techniques like this—which maximize quality while minimizing compute—will be essential for the next generation of AI development.