In the rapidly evolving world of Large Language Models (LLMs), there is a saying that has become gospel: “Data is the new oil.” But anyone who works with engines knows that you cannot just pour crude oil into a Ferrari and expect it to win a race. The oil needs to be refined.

For years, the strategy for training LLMs was simply “bigger is better.” Researchers scraped the entire internet—billions upon billions of words—and fed it into massive neural networks. But as models have grown, a bottleneck has emerged. The internet is noisy, unstructured, and full of low-quality information. Training on “crude” data leads to hallucinations, poor reasoning, and inefficient learning.

Today, we are diving into a fascinating research paper titled “DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models.” The researchers propose a comprehensive system to transform raw, messy data into a pristine, high-octane fuel for AI. They call this process Decorating, and it involves three key steps: Rating, Tagging, and Editing.

The Problem: The “Garbage In, Garbage Out” Dilemma

The performance of an LLM is heavily influenced by its pretraining corpus. If you train a model on high-quality textbooks, it becomes smart and articulate. If you train it on random forum comments, it might pick up bad habits.

The challenge is scale. We are dealing with datasets containing trillions of tokens. Humans cannot manually read and label this data. Previous automated methods used simple heuristics (like filtering out short sentences) or basic classifiers (like “is this toxic?”). These methods are too coarse. They treat data as binary: either keep it or delete it. They don’t tell the model why a piece of text is good, nor do they fix the text if it’s “okay but messy.”

Enter DecorateLM

The authors introduce DecorateLM, a data engineering methodology designed to refine the pretraining corpus.

The core idea addresses a computational constraint elegantly. Since we cannot afford to use a massive model like GPT-4 to process the entire internet (it would be prohibitively expensive and slow), the researchers use a Teacher-Student distillation approach.

  1. The Teacher (GPT-4): Used to annotate a small, carefully selected set of data with high precision.
  2. The Student (DecorateLM): A smaller, 1.2 billion parameter model trained to mimic GPT-4’s data engineering capabilities.
  3. The Application: This efficient “Student” model then processes the massive raw corpus, “decorating” it with metadata and improvements.

Figure 1: We utilize GPT-4 to assemble an annotated training corpus and integrate data engineering expertise into DecorateLM. DecorateLM is then used to process 100 billion tokens from the raw corpus, sampling 45 billion tokens using its rating and tagging capabilities to create what we refer to as the Decorated corpus. We further enhance the Decorated corpus by applying DecorateLM’s editing features, making it more suitable for LLM training.

As shown in Figure 1 above, the process flows from a small annotated dataset to the training of the DecorateLM model, which then processes the raw corpus into a Decorated Corpus. This enhanced data is then used to train the final target LLM.
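To make the data flow concrete, here is a toy sketch of the teacher-student loop from Figure 1. Every function below is a stand-in stub of my own; none of it reflects the paper's actual prompts, model, or training code, only the shape of the pipeline.

```python
# Toy sketch of the teacher-student loop: annotate a small seed set with the
# "Teacher", train the "Student" on those labels, then run the cheap Student
# over the full raw corpus. All components here are illustrative stubs.
import random

def annotate_with_teacher(doc):
    """Stand-in for the GPT-4 "Teacher": rate, tag, and edit one document."""
    return {"text": doc,
            "ratings": {"educational_value": random.randint(0, 100)},
            "tags": ("Natural Sciences",),      # illustrative tag
            "edited": doc.strip()}

def train_student(annotated_seed):
    """Stand-in for fine-tuning the 1.2B "Student" on the Teacher's labels."""
    return annotate_with_teacher                # pretend the student now mimics the teacher

raw_corpus = ["noisy scraped page ...", "another web document ..."] * 1000
seed = raw_corpus[:10]                          # small, carefully chosen subset
decorate_lm = train_student([annotate_with_teacher(d) for d in seed])
decorated_corpus = [decorate_lm(doc) for doc in raw_corpus]   # cheap pass over everything
```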

Let’s break down the three distinct “decorations” applied to the data: Rating, Tagging, and Editing.


Phase 1: Rating (Quantifying Quality)

How do you define “good” data? The researchers moved beyond simple “good vs. bad” and established eight specific criteria to score every piece of text:

  1. Educational Value: Is this suitable for a textbook?
  2. Expertise: Does it reflect deep, specialized knowledge?
  3. Fact & Trivia: Does it contain accurate factual information?
  4. Reasoning Level: Does it require logic or chain-of-thought to understand?
  5. Scarcity: Is this information niche or rare?
  6. Structural Format: Is the data well-organized (lists, markdown, etc.)?
  7. Story-likeness: Is it a narrative?
  8. Subjectivity: Is it opinion-based?
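Each document ultimately carries a score in every one of these dimensions. A minimal way to represent that record might look like the sketch below; the field names and example values are mine, and only the 0-100 scale comes from the paper.

```python
# One possible record layout for the eight per-document rating scores
# (a sketch; field names are a snake_case rendering of the criteria above).
from dataclasses import dataclass

@dataclass
class RatingRecord:
    educational_value: float   # each score lies on a 0-100 scale
    expertise: float
    fact_and_trivia: float
    reasoning_level: float
    scarcity: float
    structural_format: float
    story_likeness: float
    subjectivity: float

doc_scores = RatingRecord(82, 64, 71, 77, 40, 58, 12, 9)   # illustrative values
```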

The “Pairwise” Trick

Asking an LLM to “score this text from 0 to 100” often results in inconsistent numbers. To solve this, the researchers used a pairwise comparison method. They showed GPT-4 two texts and asked, “Which one is better regarding Educational Value?” By repeating this thousands of times and using the Bradley-Terry model, they converted these wins/losses into a robust numerical score (0-100).
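To make the Bradley-Terry step concrete, here is a minimal sketch that turns a win/loss matrix of pairwise judgments into 0-100 scores. The iterative update is the standard MM fit for the Bradley-Terry model; the final log-rescaling onto a 0-100 scale is an assumption for illustration, not necessarily the paper's exact procedure.

```python
# Minimal Bradley-Terry fit: convert pairwise "which text is better?" outcomes
# into per-document scores on a 0-100 scale.
import numpy as np

def bradley_terry_scores(wins, n_iter=100, eps=1e-9):
    """wins[i, j] = number of times text i beat text j on a given criterion."""
    n = wins.shape[0]
    p = np.ones(n) / n                       # latent strengths, uniform start
    for _ in range(n_iter):
        total_wins = wins.sum(axis=1)
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j:
                    n_ij = wins[i, j] + wins[j, i]     # comparisons between i and j
                    denom[i] += n_ij / (p[i] + p[j] + eps)
        p = total_wins / (denom + eps)
        p = p / p.sum()                      # fix the overall scale
    # Map latent strengths onto 0-100 for use as training targets (an assumption).
    log_p = np.log(p + eps)
    return 100 * (log_p - log_p.min()) / (log_p.max() - log_p.min() + eps)

# Example: 3 texts, each pair judged several times on "Educational Value".
wins = np.array([[0, 4, 5],
                 [1, 0, 3],
                 [0, 2, 0]], dtype=float)
print(bradley_terry_scores(wins))            # text 0 gets the highest score
```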

The DecorateLM model was trained to predict these scores. Surprisingly, the specialized DecorateLM model eventually became more consistent than GPT-4 itself on the validation set.

Figure 2: The Spearman correlations between model ratings and ground truth on the validation set. Specifically, the x-axis represents the ground-truth rating scores of the data. The y-axis represents the predicted rating scores of GPT-4 and DecorateLM on the validation set. Rating scores generated by GPT-4 are more discrete and less accurate than those of DecorateLM.

In Figure 2, we see the correlation between predicted ratings and ground truth. The bottom row (DecorateLM) shows a tighter, more linear correlation compared to the top row (GPT-4’s raw predictions), proving that the small, specialized model learned the scoring task effectively.

Furthermore, these ratings are not isolated. High-quality texts often share characteristics.

Figure 3: Spearman correlation coefficients between various rating criteria. The correlations align with intuitive expectations. For instance, data with higher educational value often exhibits enhanced reasoning levels, which, in turn, enhances their comprehensibility.

As Figure 3 illustrates, there is a strong correlation (0.72) between Educational Value and Reasoning Level, which makes intuitive sense. However, Subjectivity (opinions) has a low or negative correlation with Educational Value, suggesting that objective, factual texts are generally preferred for knowledge transfer.


Phase 2: Tagging (Ensuring Diversity)

If you simply select the “highest rated” data, you might end up with a dataset entirely composed of Physics textbooks, ignoring History or Pop Culture. To prevent this, the researchers implemented a hierarchical tagging system.

They designed a taxonomy with 3 levels:

  • Level 1: 21 major categories (e.g., Natural Sciences, Arts & Culture).
  • Level 2: 255 sub-categories.
  • Level 3: 793 specific topics.

This granular tagging allows for precise control over the domain distribution of the training data.

Figure 4: Word cloud of tags. The size of each tag is proportional to its frequency in the annotated dataset. Tags are color-coded based on their levels: first-level tags in dark blue, second-level tags in medium blue, and third-level tags in light blue.

The word cloud above visualizes the breadth of topics covered. By training DecorateLM to predict these tags, the researchers could scan the raw corpus and identify exactly what topics were present.

Table 1: Comparison of tagging accuracy between DecorateLM and GPT-4 across three hierarchical levels on the validation set. GPT-4, lacking prior knowledge of the designed tagging hierarchy, is provided with the relevant labels for each level through prompts in successive rounds of interaction.

As shown in Table 1, DecorateLM achieves tagging accuracy comparable to GPT-4, even outperforming it slightly at the first level. This efficiency is crucial because it allows the system to tag billions of tokens rapidly.

Assessing the Raw Data

Using these ratings and tags, the researchers analyzed several popular open-source datasets (like Dolma, C4, and The Pile).

Figure 5: Evaluation of dataset rating and tagging quality using DecorateLM. The x-axis denotes the average rating of each dataset across specified dimensions, whereas the y-axis represents the cross-entropy of tags from the predefined tagging system. The circle size correlates with the dataset size.

Figure 5 reveals a stark truth about open datasets. The English datasets (Dolma, The Pile) generally sit further to the right (higher ratings) and lower on the Y-axis (better tag consistency). Chinese datasets (like BD Wiki) showed lower quality ratings, highlighting the urgent need for data engineering in non-English corpora.


Phase 3: Editing (Transforming the Content)

This is perhaps the most innovative part of DecorateLM. Rating and Tagging allow you to select data, but Editing allows you to improve it.

The web is full of “noise”: messy HTML, weird formatting, informal slang, and disjointed logic. Even if the facts are good, the presentation might be bad for a model trying to learn language patterns.

The researchers trained an “Editing” version of DecorateLM. This model takes a raw text and rephrases it to be:

  • Clearer and more concise.
  • More logical.
  • Better formatted (using markdown, lists, etc.).
  • Formal and textbook-like.
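For a sense of what invoking such an editor looks like, here is a hypothetical sketch of rewriting a noisy passage with an instruction-following model via Hugging Face transformers. The prompt wording, the model identifier, and the helper function are all illustrative assumptions; the paper's actual editing prompts and checkpoints are not shown here.

```python
# Hypothetical invocation of an instruction-tuned "editor" model. The model
# name and prompt are placeholders, not the paper's actual artifacts.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/decorate-lm-edit"   # placeholder identifier
EDIT_PROMPT = (
    "Rewrite the following text so that it is clear, concise, logically "
    "ordered, and formatted like a textbook passage. Preserve all facts.\n\n"
    "Text:\n{raw}\n\nRewritten text:\n"
)

def edit(raw_text, tokenizer, model, max_new_tokens=512):
    inputs = tokenizer(EDIT_PROMPT.format(raw=raw_text), return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens and return only the newly generated continuation.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# print(edit("sum raw html-ish text w/ weird formatting!!", tokenizer, model))
```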

Figure 8: Human Preference for Edited Texts on Validation Set: DecorateLM vs. GPT-4.

Does it work? Figure 8 shows human evaluation results. The editing model (DecorateLM) significantly improves text clarity, fluency, and logical coherence. In fact, for metrics like “Information Precision,” the edited text often wins against the original raw text.

The impact of editing is measurable mathematically via perplexity (a measure of how “surprised” a model is by text). Lower perplexity usually means the text is more natural and easier to learn.
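Since perplexity is just the exponential of the average token-level cross-entropy, it can be measured with any off-the-shelf causal language model. The sketch below uses GPT-2 from Hugging Face transformers purely as a stand-in scorer; the paper's choice of scoring model may differ.

```python
# Minimal perplexity scorer: exp of the mean token-level cross-entropy
# under a small causal LM (GPT-2 used here as a stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
print(perplexity("fox dog the lazy quick jumps brown over the"))   # usually higher
```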

Figure 7: Perplexity distribution of the corpus.

Figure 7 demonstrates that the edited corpus (orange) has a significantly lower perplexity distribution than the original raw corpus (blue). This suggests the data has become more regular and predictable—ideal for the learning process.


The Experiments: Does it Actually Work?

To prove that “decorating” data leads to better models, the researchers conducted a rigorous experiment. They took a baseline model (MiniCPM-1.2B) and continued its training using different versions of the data:

  1. Baseline: Raw data.
  2. Rated: Data sampled based on quality scores.
  3. Tagged: Data sampled to balance domains.
  4. Edited: Data rephrased for quality.
  5. Combined: All of the above.

Sampling Strategies

The researchers didn’t just throw the data in. They used mathematical sampling strategies based on the scores.

For Ratings, they used an exponential weighting formula. Data with a higher quality score (\(score_{i,t}\)) receives a larger sampling weight (\(W_{i,t}\)), and therefore a higher probability of being selected for training:

\[
W_{i,t} = e^{\frac{\mathrm{score}_{i,t} - \lambda}{\tau}},
\]
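Read as code, this is a softmax-style weighting over quality scores. In the sketch below, \(\lambda\) is treated as an offset and \(\tau\) as a temperature, with placeholder values rather than the paper's actual hyperparameters.

```python
# Sketch of rating-based sampling: W_{i,t} = exp((score_{i,t} - lambda) / tau),
# normalised into sampling probabilities. lam/tau values are placeholders.
import numpy as np

def rating_weights(scores, lam=50.0, tau=10.0):
    scores = np.asarray(scores, dtype=float)
    w = np.exp((scores - lam) / tau)
    return w / w.sum()

scores = [92, 75, 75, 30]        # e.g. "Educational Value" ratings of four documents
probs = rating_weights(scores)
sampled = np.random.choice(len(scores), size=2, replace=False, p=probs)
print(probs, sampled)            # the 92-rated document is chosen most often
```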

For Tags, they inversely weighted domains. If a domain (like “Sports”) was rare in the raw data, they increased its weight to ensure the model didn’t ignore it:

\[
W_{a,b,c} = \frac{N_{\mathrm{I}=a}^{\alpha}}{\sum_{i=1}^{N_{\mathrm{I}}} N_{\mathrm{I}=i}^{\alpha}} \cdot \frac{N_{\mathrm{I}=a,\mathrm{II}=b}^{\beta}}{\sum_{i=1}^{N_{\mathrm{I}=a,\mathrm{II}}} N_{\mathrm{I}=a,\mathrm{II}=i}^{\beta}} \cdot \frac{N_{\mathrm{I}=a,\mathrm{II}=b,\mathrm{III}=c}^{\gamma}}{\sum_{i=1}^{N_{\mathrm{I}=a,\mathrm{II}=b,\mathrm{III}}} N_{\mathrm{I}=a,\mathrm{II}=b,\mathrm{III}=i}^{\gamma}},
\]
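In code, the weight of a (level-1, level-2, level-3) tag path is the product of three normalised, exponent-damped frequency terms; with the exponents set below 1, as in this sketch, frequent domains are damped and rare ones gain relative weight. The toy counts and exponent values are illustrative assumptions, not the paper's settings.

```python
# Sketch of hierarchical tag weighting: counts at each level are raised to an
# exponent < 1 and normalised, flattening the distribution toward rare domains.
from collections import Counter

def level_weights(counts, exponent):
    total = sum(c ** exponent for c in counts.values())
    return {tag: (c ** exponent) / total for tag, c in counts.items()}

# Toy corpus of (level-1, level-2, level-3) tag paths.
paths = (
    [("Natural Sciences", "Physics", "Optics")] * 80
    + [("Natural Sciences", "Physics", "Quantum Mechanics")] * 15
    + [("Arts & Culture", "Music", "Jazz History")] * 5
)

alpha = beta = gamma = 0.5
w1 = level_weights(Counter(p[0] for p in paths), alpha)

def path_weight(path):
    a, b, c = path
    # Level-2 weights are computed within the chosen level-1 category,
    # level-3 weights within the chosen (level-1, level-2) pair.
    w2 = level_weights(Counter(p[1] for p in paths if p[0] == a), beta)
    w3 = level_weights(Counter(p[2] for p in paths if p[:2] == (a, b)), gamma)
    return w1[a] * w2[b] * w3[c]

print(path_weight(("Arts & Culture", "Music", "Jazz History")))   # up-weighted vs. raw 5%
print(path_weight(("Natural Sciences", "Physics", "Optics")))     # down-weighted vs. raw 80%
```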

The Results

The results were compelling. The models trained on decorated data outperformed the baseline across almost all benchmarks.

Table 2: Comparison of benchmark performance across different strategies.

In Table 2 (above), look at the Avg. column.

  • Base: 36.1
  • Rat. (Agg.) & Edit: 38.4
  • Rat. (Agg.) & Tag. & Edit: 38.3 (but with much higher domain coverage).

The method Rat. (Agg.) & Edit (Rating and Editing) showed the highest jump in general reasoning tasks like MMLU and GSM8K (Math).

The tagging system specifically helped with Domain Coverage—ensuring the model knew about niche topics.

Figure 9: The performance of the MMLU-Tag. model across the various subtasks of MMLU. Tasks where the sampling weights of the corresponding tags were increased under the Tag. method are highlighted in red.

Figure 9 shows the impact of tagging on specific subjects. When the researchers up-sampled specific tags (like “Anatomy” or “Astronomy”), the model’s performance on those specific MMLU subtasks (orange bars) often exceeded the baseline (blue bars).

Finally, looking at rare domains specifically:

Table 4: Comparison of rare domain benchmark performance across different strategies.

Table 4 shows that for specialized fields like Law (JECQA) or Medicine (MedQA), the combined approach of Rating, Tagging, and Editing yielded the best results, significantly boosting the Average Domain Coverage (Avg. DC) from 37.5 to 45.0.

Conclusion: Refined Data for Refined Models

The DecorateLM paper provides strong evidence for a shift in how we approach LLM training. We are moving away from the era of “Big Data” and into the era of “Smart Data.”

By synthesizing the intuition of advanced models (like GPT-4) into efficient “Data Engineers” (DecorateLM), we can process vast oceans of information.

  • Rating ensures we learn from the experts, not the novices.
  • Tagging ensures we are well-read across all subjects, not just the popular ones.
  • Editing ensures the study material is clear, concise, and structured.

This three-pronged approach allows smaller models to punch above their weight and makes the training of massive models more efficient. As the researchers noted, the decorated corpus is not just smaller; it is denser with knowledge. In the race to build Artificial General Intelligence, it seems that quality control is the ultimate accelerator.