Introduction

Imagine you are a chef tasked with creating the perfect soup. You have a pantry full of ingredients: carrots, potatoes, beef, spices, and dozens of other items. You know that the composition of your soup—the exact ratio of carrots to potatoes—will determine whether it tastes delicious or bland.

In the culinary world, you might taste-test a spoonful. But in the world of Large Language Models (LLMs), “cooking” a single recipe (training a model on a specific mixture of data) costs hundreds of thousands of dollars and weeks of computation time. If you want to test 50 different data combinations, you need 50 separate training runs. This is the data ablation bottleneck.

As LLMs grow larger, deciding what data to train them on is just as important as the architecture itself. Yet, because testing every possible combination of data is prohibitively expensive, researchers often rely on intuition or limited experiments, potentially leaving performance gains on the table.

In a recent paper, Scalable Data Ablation Approximations for Language Models through Modular Training and Merging, researchers from the Allen Institute for AI, Carnegie Mellon, and the University of Washington propose a brilliant workaround. Instead of cooking 50 different soups, what if you could cook each ingredient separately and then mathematically “mix” the cooked versions to predict exactly how the final soup would taste?

This blog post explores their method of Modular Training and Merging, a technique that allows researchers to simulate comprehensive data ablation studies at a fraction of the computational cost.

Figure 1: Traditional approaches require training a new model for every data mixture. The proposed modular strategy reuses models trained on data partitions, merging their parameters to approximate the performance of a full training run.

The Problem: The Combinatorial Explosion

To understand why this research is necessary, we have to look at the math of data mixtures.

Let’s say your training corpus \(\mathcal{C}\) is divided into \(n\) different partitions (e.g., Wikipedia, GitHub code, arXiv papers, Reddit threads). You want to know which combination of these partitions yields the best model. If you simply want to include or exclude partitions, the total number of possible combinations is \(2^n\).

  • If \(n=10\), you have 1,024 combinations.
  • If \(n=30\), you have over a billion combinations.

Even if you limit yourself to mixtures of a fixed size \(k\), the complexity scales polynomially (\(O(n^k)\)). In the traditional “naive” approach, you must train a fresh model from scratch for every single combination you wish to evaluate. This is computationally infeasible for anyone without infinite resources.

The researchers propose a method that scales linearly, \(O(n)\). You train on each partition once, and then you can approximate the results of any combination of those partitions almost instantly.
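
To make the gap concrete, here is a quick back-of-the-envelope comparison of how many training runs each strategy needs for \(n = 30\) partitions; the numbers follow directly from the combinatorics above.

```python
from math import comb

n = 30   # number of data partitions
k = 3    # size of a fixed-size mixture

exhaustive_runs = 2 ** n    # include/exclude every partition: 1,073,741,824 runs
fixed_k_runs = comb(n, k)   # all mixtures of exactly k partitions: 4,060 runs
modular_runs = n            # modular approach: one training run per partition

print(exhaustive_runs, fixed_k_runs, modular_runs)
```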

Figure 2: The computational complexity comparison. The green line shows the linear runtime of the modular approximation method, while the blue and orange lines show the exponential and polynomial costs of traditional training.

The Solution: Modular Training and Merging

The core innovation relies on a concept called parameter averaging (sometimes referred to as model merging or model soups). The hypothesis is simple but powerful:

The perplexity (a measure of how well the model predicts held-out text; lower is better) of a single model trained on a mixture of Data A and Data B is strongly correlated with the perplexity of a parameter average of two separate models: one trained only on Data A, and one trained only on Data B.

Here is the step-by-step recipe proposed by the authors:

1. Partitioning the Data

First, the training corpus is split into “base units” or partitions. These could be defined by topic (e.g., “Biology,” “Computer Science”) or time (e.g., “2020,” “2021”).

The researchers formalized this setup as a union of partitions, \(\mathcal{C} = D_1 \cup D_2 \cup \dots \cup D_n\), where each partition \(D_i\) is a base unit drawn from S2ORC or Wikipedia.

For their experiments, they used two primary datasets:

  1. S2ORC: The Semantic Scholar Open Research Corpus (academic papers).
  2. M2D2 Wikipedia: A multi-domain Wikipedia dataset.

They decontaminated the data (checking it against the evaluation sets) to keep the testing fair.

Table 1: Statistics of the datasets used, including token counts for pre-training and evaluation.

2. Modular Training

Instead of training one giant model on all the data, the researchers train many small “expert” models. Each model is trained on just one of the base unit partitions.

Critically, all these models share a “seed” optimization trajectory. They are all initialized from the same pre-trained checkpoint (trained on a general corpus of Gutenberg books and Wikipedia for 42B tokens) before branching off to train on their specific partitions. This shared history ensures the models remain in the same “basin” of the loss landscape, which is essential for the next step to work.
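
A rough structural sketch of this branching setup is shown below. It assumes PyTorch-style models; `build_model`, `pretrain_on_seed_corpus`, `train_on_partition`, and `partitions` are illustrative placeholders, not functions from the paper's code release.

```python
import copy

# Placeholders for illustration only: build_model, pretrain_on_seed_corpus,
# train_on_partition, and partitions are not APIs from the paper.

seed_model = build_model()
pretrain_on_seed_corpus(seed_model)        # shared seed phase (Gutenberg + Wikipedia)
seed_state = copy.deepcopy(seed_model.state_dict())

expert_models = {}
for name, partition in partitions.items():
    expert = build_model()
    expert.load_state_dict(seed_state)     # every expert starts from the same point
    train_on_partition(expert, partition)  # then branches onto its own partition
    expert_models[name] = expert           # cached for merging later
```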

3. Merging (The Approximation)

To simulate a model trained on a mixture of Partition A and Partition B, the researchers do not train a new model. Instead, they take the weights (parameters) of Model A and Model B and average them.

If the partitions are of different sizes, they perform a weighted average based on the number of tokens or partitions used.
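
In code, the merge itself is just a (weighted) average over matching parameter tensors. Here is a minimal, self-contained PyTorch sketch; the helper name `merge_state_dicts` and the token-count weights are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

def merge_state_dicts(state_dicts, weights=None):
    """Average model parameters, optionally weighting them (e.g. by token counts).

    A minimal sketch of parameter averaging; it assumes every model shares the
    same architecture and the same seed checkpoint, as in the paper's setup.
    """
    if weights is None:
        weights = [1.0] * len(state_dicts)
    total = sum(weights)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum((w / total) * sd[key].float()
                          for w, sd in zip(weights, state_dicts))
    return merged

# Toy demonstration with two tiny "expert" models of identical architecture.
model_a, model_b = nn.Linear(8, 8), nn.Linear(8, 8)
merged_model = nn.Linear(8, 8)
merged_model.load_state_dict(
    merge_state_dicts([model_a.state_dict(), model_b.state_dict()],
                      weights=[10e9, 2e9])   # illustrative token counts
)
```

Because all the experts share an architecture and a seed checkpoint, the averaged state dict can be loaded straight into a fresh model of the same shape; no additional training is involved.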

Does It Actually Work? The Experiments

The researchers validated this approach by comparing three specific metrics (the two cheap proxies are sketched in code after the list):

  1. SEQ (Sequential): The “Gold Standard.” A model actually trained on the combined data mixture. This is expensive but represents the “truth” we want to predict.
  2. IND (Individual): Simply taking the average of the perplexity scores of the individual models. (i.e., Model A gets score X, Model B gets score Y, the prediction is \((X+Y)/2\)).
  3. MERGED: The proposed method. Averaging the weights of Model A and Model B into a new model, and then measuring that new model’s perplexity.
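
To make the difference between the two proxies concrete, here is an illustrative sketch; `evaluate_perplexity`, `build_model`, and `eval_set` are hypothetical placeholders, and `merge_state_dicts` is the helper sketched above.

```python
# Placeholders: evaluate_perplexity, build_model, and eval_set are hypothetical;
# merge_state_dicts is the averaging helper sketched earlier.

ppl_a = evaluate_perplexity(model_a, eval_set)
ppl_b = evaluate_perplexity(model_b, eval_set)

# IND: average the individual models' *scores*.
ind_proxy = (ppl_a + ppl_b) / 2

# MERGED: average the models' *weights*, then evaluate the merged model once.
merged_model = build_model()
merged_model.load_state_dict(
    merge_state_dicts([model_a.state_dict(), model_b.state_dict()])
)
merged_proxy = evaluate_perplexity(merged_model, eval_set)
```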

Experiment 1: Simple Pairs and Triples

In the first set of experiments, the authors mixed pairs (\(k=2\)) and triples (\(k=3\)) of data partitions. They trained 130 million parameter models (small by today’s standards, but sufficient for proof of concept) and compared the SEQ results against their proxy metrics.

The results, visualized below, show a striking correlation.

Figure 3: Scatter plots showing the correlation between SEQ (true) perplexity and proxy metrics. The bottom row shows OOD evaluation on the Paloma dataset. The green triangles (MERGED) show a strong correlation with the actual training performance.

Key Takeaway: Look at the bottom row of Figure 3 (evaluating on Out-Of-Distribution data, specifically the Paloma benchmark). The green triangles (MERGED) form a tight diagonal line with the SEQ scores. This means that if the merged model rates a data mixture as good (low perplexity), the actual sequentially trained model tends to agree.

The correlation scores confirm this visual intuition:

Table 2: Pearson’s correlation scores. MERGED models achieve high correlation (0.95+) with SEQ models on OOD data.

On the Paloma evaluation set, the MERGED approach achieved a Pearson correlation of 0.961 for S2ORC pairs and 0.993 for Wikipedia pairs. This suggests that you can reliably rank data mixtures using the cheap merging method.
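
If you want to run the same sanity check on your own proxy, the reported numbers are ordinary Pearson coefficients between the proxy perplexities and the SEQ perplexities. A tiny example with made-up values (not the paper's data):

```python
from scipy.stats import pearsonr

# Illustrative perplexities for a handful of candidate mixtures; these values
# are made up for the example, not taken from the paper.
seq_ppl    = [18.2, 16.9, 20.1, 17.5, 19.3]   # "gold standard" sequential training
merged_ppl = [19.0, 17.4, 21.2, 18.1, 20.0]   # cheap merged-model proxy

r, p_value = pearsonr(merged_ppl, seq_ppl)
print(f"Pearson r = {r:.3f}")   # a high r means the proxy ranks mixtures reliably
```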

Experiment 2: Uneven Data Sizes

Real-world data is rarely perfectly balanced. You might have 10GB of biology text and only 100MB of poetry. How do you handle merging models trained on vastly different amounts of data?

The authors explored two merging strategies for uneven partitions (sketched in code after the list):

  1. Macro-Merging: Train a model on the large partition and a model on the small partition, then average them.
  2. Micro-Merging: Break the large partition into many small “base units” equal in size to the small partition. Train models on all units. Average all the resulting base models together uniformly.
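
In terms of the earlier merging helper, the two strategies differ only in what you average and how you weight it. The sketch below is illustrative; the model variables and token counts are placeholders, and `merge_state_dicts` is the hypothetical helper from before.

```python
# Placeholders throughout: model variables and token counts are illustrative.

# Macro-merging: one model per uneven partition, weighted by partition size.
macro_merged = merge_state_dicts(
    [large_partition_model.state_dict(), small_partition_model.state_dict()],
    weights=[10e9, 1e9],   # e.g. a 10B-token partition mixed with a 1B-token one
)

# Micro-merging: split the large partition into base units the size of the small
# one, train one model per unit, then average all resulting models uniformly.
micro_merged = merge_state_dicts(
    [m.state_dict() for m in base_unit_models]   # e.g. 10 units + the small model
)
```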

Figure 4: SEQ vs. proxy scores for uneven partition sizes. Macro-MERGED scores (red triangles) proved to be highly reliable proxies.

The results in Figure 4 show that Macro-MERGED scores (red inverted triangles) are highly predictive (\(r=0.984\)) of the true model performance.

However, when mixing distinct high-level sources (e.g., combining academic papers with Wikipedia articles), the strategy shifts slightly.

Figure 5: Results for mixed-source data mixtures. Here, Micro-MERGED models (purple inverted triangles) offered the best correlation (0.989).

In Figure 5, we see that Micro-Merging (breaking everything down into small, equal units and averaging them) is superior when mixing different domains. This suggests that maintaining a “pool” of models trained on equal-sized chunks of data is a robust strategy for simulation.

Scaling Up: Does this apply to “Real” Models?

A common criticism of academic LLM research is that methods working on 130M parameter models might fall apart at the multi-billion parameter scale. The authors addressed this by replicating their experiments with 1.1 billion parameter models.

They found two promising results:

  1. Self-Prediction: 1.1B parameter merged models successfully predict the performance of 1.1B parameter sequentially trained models.
  2. Cross-Scale Prediction: Even more excitingly, 130M parameter merged models can predict the ranking of 1.1B parameter sequentially trained models.

Figure 6: Cross-scale prediction. The proxy scores of small 130M models (x-axis) effectively predict the performance of larger 1.1B SEQ models (y-axis).

This cross-scale capability is a massive efficiency win. It implies you don’t even need to perform the modular training at the full target scale. You can run your data ablation search using small, cheap models, find the best data recipe, and then commit your heavy compute resources to training the large model on that specific recipe.

Table 4: Correlation scores showing that 130M macro-merged models (0.926) are highly correlated with 1.1B SEQ model performance.

The data in Table 4 supports this, showing a 0.926 correlation between the small proxy models and the large sequential models.

Validation on Domain-Specific Fields

The researchers also zoomed in on specific fields of study within the S2ORC corpus to ensure the trend held across different topics like Biology, Physics, and Computer Science.

Figure 7: Scatter plots for individual fields of study. The relationship holds across various domains, reinforcing the robustness of the method.

Figure 7 illustrates that whether you are evaluating on Mathematics or Biology, the relationship between the proxy (Merged) and the truth (SEQ) remains positive and linear.

Why does this work?

The paper touches on Linear Mode Connectivity. Because all the modular models start from the same pre-trained “seed,” they remain in the same optimization basin. When you average their weights, you aren’t jumping to a random point in high-dimensional space; you are finding a point that effectively captures the capabilities of both parents. This is distinct from ensembling (averaging outputs); this is averaging the actual “brains” of the models.
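
One cheap way to probe this connectivity yourself is to walk along the straight line between two checkpoints in parameter space and evaluate the loss at a few points; if the models share a basin, the loss stays low along the whole path rather than spiking in the middle. The sketch below assumes two same-architecture models branched from the same seed, plus placeholder `build_model` and `evaluate_loss` helpers.

```python
# Placeholders: model_a / model_b share an architecture and a seed checkpoint;
# build_model, evaluate_loss, and eval_set are hypothetical helpers.

sd_a, sd_b = model_a.state_dict(), model_b.state_dict()

for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    interpolated = {k: (1 - alpha) * sd_a[k].float() + alpha * sd_b[k].float()
                    for k in sd_a}
    probe = build_model()
    probe.load_state_dict(interpolated)
    # In the same basin, the loss stays low across the whole path instead of
    # spiking at intermediate values of alpha.
    print(alpha, evaluate_loss(probe, eval_set))
```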

Other concurrent work has also noted this phenomenon, suggesting a broader “law” of data mixing.

Figure 10: Comparison with concurrent work by Ye et al. (2024), showing compatibility with efficiency gains from small-to-large model prediction.

Implications and Conclusion

The “Modular Training and Merging” approach essentially turns data ablation from an exponential search problem into a manageable engineering task.

The Recipe for Efficient Data Selection:

  1. Decompose your massive training corpus into equal-sized chunks.
  2. Train a small, efficient model on each chunk (in parallel).
  3. Cache these models.
  4. Simulate billions of possible data mixtures by averaging the weights of these cached models (see the sketch after this list).
  5. Select the best performing mixture based on the simulation.
  6. Train your final, massive LLM on that specific mixture.
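
Put together, the search loop is little more than iterating over candidate mixtures, merging cached checkpoints, and scoring them. The sketch below is a hedged illustration reusing the hypothetical helpers from earlier (`merge_state_dicts`, `build_model`, `evaluate_perplexity`); `cached_state_dicts` maps partition names to the checkpoints cached in step 3, and `validation_set` is your held-out evaluation data.

```python
from itertools import combinations

# Hypothetical helpers from earlier sketches: merge_state_dicts, build_model,
# evaluate_perplexity. cached_state_dicts maps partition names to cached
# checkpoints; validation_set is the held-out evaluation data.

best_mixture, best_ppl = None, float("inf")

for mixture in combinations(cached_state_dicts, 3):        # e.g. all triples
    merged = merge_state_dicts([cached_state_dicts[p] for p in mixture])
    proxy = build_model()
    proxy.load_state_dict(merged)
    ppl = evaluate_perplexity(proxy, validation_set)        # cheap: no training
    if ppl < best_ppl:
        best_mixture, best_ppl = mixture, ppl

print("Best simulated recipe:", best_mixture)
# Final step: spend the big compute budget training the full-scale model
# on exactly this mixture.
```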

This method scales linearly with the amount of new data. If you get a new dataset next month, you don’t need to retrain everything to see how it fits; you just train one module on the new data and “merge” it into your existing simulations to see if it improves the recipe.

Figure 11: Even with weaker correlations in some setups, the method remains useful for selecting the most performant data mixtures.

While there are limitations—the correlations aren’t perfect (1.0), and the method relies on a shared “seed” training phase—the efficiency gains are undeniable. For researchers and companies with limited compute budgets, this technique provides a map through the dark forest of data selection, allowing them to find the optimal path without walking down every single trail.

By treating models as modular blocks that can be stacked and blended, we move closer to a science of data composition, rather than relying on the alchemy of intuition.