Introduction
In the current landscape of Large Language Model (LLM) development, data is the new gold. But not just any data. We are specifically obsessed with Instruction Tuning (IT) data: the examples that turn a raw, text-predicting base model into a helpful chatbot that can answer questions, summarize emails, and write code.
A prevailing trend in recent research is “Less is More.” Studies like LIMA (Less Is More for Alignment) have suggested that you don’t need millions of instructions to train a great model; you might only need 1,000 highly curated, high-quality examples. This has triggered a gold rush to find the perfect “high-quality” dataset. Every week, a new paper claims that “Dataset A is better than Dataset B” or that a specific filtering method selects the best data.
But here is the uncomfortable question: How do we actually know Dataset A is better?
Usually, we use a proxy metric. We take a model, train it on Dataset A, and measure how well it performs. Then we train the same model on Dataset B and compare. If the model trained on A wins, we declare Dataset A the winner.
It sounds logical, but a new paper titled “Call for Rigor in Reporting Quality of Instruction Tuning Data” reveals a critical flaw in this process. The researchers demonstrate that the conclusions we draw about data quality are heavily dependent on hyperparameters—settings like learning rate, batch size, and epoch count—that researchers often pick arbitrarily.
In this post, we will deep dive into this paper to understand how arbitrary training choices can flip scientific conclusions, why “standard” settings might be failing us, and how we can bring more rigor to the science of LLM alignment.
Background: The Data Quality Proxy
Before we dissect the problem, let’s establish the context. Instruction Tuning is the process of fine-tuning a pre-trained model (like Llama-2 or Mistral) on pairs of instructions and responses.
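To make this concrete, here is a minimal sketch of a single instruction-tuning step using the Hugging Face `transformers` library. The prompt template, example pair, and model checkpoint are illustrative assumptions rather than the paper’s exact recipe (and the Llama-2 weights are gated, so you would need access to run it):

```python
# Minimal sketch of one instruction-tuning step for a causal LM.
# The prompt template and example are illustrative, not the paper's exact recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # base (not chat) model; access is gated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# One instruction/response pair; real IT datasets contain hundreds or thousands of these.
example = {
    "instruction": "Summarize the following email in one sentence.",
    "response": "The sender is asking to reschedule Friday's meeting to Monday.",
}

# Concatenate instruction and response into a single training sequence.
text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
batch = tokenizer(text, return_tensors="pt")

# Standard causal-LM objective: the labels are the input ids themselves.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()  # one gradient step of instruction tuning
```

In practice this step is wrapped in a full training loop (or a trainer) over the whole dataset, which is exactly where the hyperparameters discussed below come in.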
Because we cannot inspect the “quality” of a dataset directly in a mathematical sense (quality is subjective), the community relies on empirical results. The logic follows this syllogism:
- Premise: A good dataset produces a good model.
- Experiment: We trained a model on this dataset, and the model obtained a high score on a benchmark (like MT-Bench or AlpacaEval).
- Conclusion: Therefore, this dataset is high quality.
This approach puts immense pressure on the training process. If you are comparing two datasets, you must ensure that the training process is fair to both.
The Hidden Variables
When training a neural network, you don’t just feed it data. You have to configure the “oven” in which the model is cooked. These configurations are called hyperparameters. The most common ones include:
- Epochs: How many times the model sees the entire dataset.
- Learning Rate (LR): How drastically the model updates its internal weights after seeing an error.
- Batch Size: How many examples the model looks at before updating its weights.
- Scheduler: How the learning rate changes over time (e.g., does it stay constant, or does it decay?).
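These knobs map directly onto the arguments of a standard training setup. As a rough illustration, here is what they look like as Hugging Face `TrainingArguments`; the specific values are placeholders, not a recommendation:

```python
from transformers import TrainingArguments

# Illustrative hyperparameters only; the whole point of the paper is that
# these choices are rarely justified and can change your conclusions.
args = TrainingArguments(
    output_dir="out/lima-run",
    num_train_epochs=3,              # how many passes over the dataset
    learning_rate=2e-5,              # step size for weight updates
    per_device_train_batch_size=8,   # combined with gradient accumulation,
    gradient_accumulation_steps=8,   #   this gives an effective batch size of 64
    lr_scheduler_type="cosine",      # how the learning rate decays over training
    warmup_ratio=0.03,
)
```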
In traditional machine learning, finding the best hyperparameters (Hyperparameter Optimization or HPO) is a standard step. However, in the era of LLMs, training is expensive. Doing a “grid search” (trying every combination) for a 7-billion parameter model is often too costly for academic labs.
As a result, many researchers adopt “standard” settings used in previous papers, or they make an educated guess and stick with it. The authors of our focus paper noticed that these choices are shockingly inconsistent across the field.

Take a look at Table 1 above. This table lists various recent studies that all trained models using roughly 1,000 samples of instruction data.
Notice the variety:
- Epochs range from 3 to 15.
- Learning Rates range from 2e-6 to 1e-4 (a 50x spread, nearly two orders of magnitude).
- Batch Sizes vary from 8 to 128.
- Schedulers differ too: some papers use Cosine, others use Linear or none at all.

Yet all of these papers publish definitive conclusions about data quality based on these widely different setups. The authors of the paper asked a simple, provocative question: Can we change the conclusion of a paper just by changing these settings?
The Core Method: Stress-Testing Conclusions
To test this, the authors set up a duel between two famous datasets in the “1,000 sample” size category.
The Contenders
- LIMA (Less Is More for Alignment): A dataset of 1,000 prompts and responses carefully curated by humans. It is famous for proving that a small amount of human-quality data can beat massive datasets.
- Alpaca-Longest: A subset of 1,000 samples selected from the larger Alpaca dataset. The selection criterion was simply to take the samples with the longest token length. A previous study (Zhao et al., 2024a) claimed that Alpaca-Longest represents better-quality data than LIMA.
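The length-based selection behind Alpaca-Longest is simple to reproduce in spirit. Here is a minimal sketch assuming the `tatsu-lab/alpaca` dataset on the Hugging Face Hub and a Llama-2 tokenizer; the exact tokenizer and length definition used by Zhao et al. may differ:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: "longest" is measured in response tokens; the original study's
# exact counting scheme may differ.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
alpaca = load_dataset("tatsu-lab/alpaca", split="train")  # ~52k samples

def token_length(example):
    example["n_tokens"] = len(tokenizer(example["output"])["input_ids"])
    return example

alpaca = alpaca.map(token_length)
alpaca_longest = alpaca.sort("n_tokens", reverse=True).select(range(1000))
```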
The Setup
The researchers decided to train a Llama-2-7B model on both datasets. However, instead of picking one “standard” set of hyperparameters, they created a grid of 16 different settings (Setting 1 through Setting 16).
They varied:
- Scheduler: Cosine vs. Linear.
- Learning Rate: 1e-5 vs. 2e-5.
- Batch Size: 64 vs. 256.
- Epochs: 3 vs. 15.
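Since each of the four factors takes two values, the grid contains 2^4 = 16 combinations. Enumerating them is straightforward; note that the numbering below is our own and may not match the paper’s Setting 1 through Setting 16 ordering:

```python
from itertools import product

schedulers = ["cosine", "linear"]
learning_rates = [1e-5, 2e-5]
batch_sizes = [64, 256]
epochs = [3, 15]

settings = [
    {"scheduler": s, "lr": lr, "batch_size": bs, "epochs": ep}
    for s, lr, bs, ep in product(schedulers, learning_rates, batch_sizes, epochs)
]
assert len(settings) == 16  # Setting 1 through Setting 16

for i, cfg in enumerate(settings, start=1):
    print(f"Setting {i}: {cfg}")
    # train Llama-2-7B on LIMA and on Alpaca-Longest with cfg, then compare
```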
They then evaluated the resulting models using GPT-4 as a judge (a common practice known as “LLM-as-a-judge”) across three benchmarks: Koala, MT-Bench, and Self-Instruct.
The goal was to see if LIMA always lost to Alpaca-Longest, or if the “winner” shifted depending on the hyperparameters.
Experiments & Results
The results of this investigation are visually summarized in Figure 1. This is the most critical visualization in the paper, so let’s break it down carefully.

In Figure 1, each row represents one of the 16 hyperparameter settings. The columns show the win/tie/loss rates on three different benchmarks (Koala, MT-Bench, Self-Instruct). The numbers indicate how often the model trained on LIMA won against the model trained on Alpaca-Longest, and vice versa.
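Mechanically, each cell in such a figure is just a tally of per-prompt verdicts from the judge. A toy sketch of that tally (the verdict labels are our own assumption, not the paper’s exact output format):

```python
from collections import Counter

# Hypothetical judge verdicts for one setting on one benchmark:
# "A" = the LIMA-trained model wins, "B" = the Alpaca-Longest model wins, "tie" otherwise.
verdicts = ["A", "B", "tie", "A", "A", "B", "tie", "A"]

counts = Counter(verdicts)
total = len(verdicts)
print(f"LIMA wins: {counts['A']}, ties: {counts['tie']}, Alpaca-Longest wins: {counts['B']}")
print(f"LIMA win rate: {counts['A'] / total:.0%}")
```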
The “Flip” Phenomenon
Look closely at the data.
- Case A (Alpaca Wins): Look at Setting 4 (Cosine, 1e-5 LR, 256 Batch, 3 Epochs). On the Koala benchmark, LIMA gets 47 wins but loses to Alpaca-Longest far more often. On MT-Bench, the score is 57 (lower than in other settings). If you were a researcher who arbitrarily chose Setting 4, you would likely conclude: “Alpaca-Longest is superior to LIMA.”
- Case B (LIMA Wins): Now look at Setting 8 (Cosine, 2e-5 LR, 256 Batch, 3 Epochs). Suddenly, on the Koala benchmark, LIMA scores 95 wins compared to Alpaca-Longest’s 22. On MT-Bench, the score jumps to 63. If you chose Setting 8, you would write a paper concluding: “LIMA is far superior to Alpaca-Longest.”
This is the “arbitrary conclusion” problem. Both statements are scientifically reproducible, yet they contradict each other. The only difference is a slight tweak in the Learning Rate or Batch Size.
The Problem with “Standard” Settings
The authors dug deeper. They wanted to know if the “standard” settings people use are even effective. They specifically looked at the number of epochs.
In the literature (refer back to Table 1), it is very common to train for 3 epochs. This is likely a habit carried over from training on massive datasets (like the full Alpaca 52k dataset), where 3 epochs take a long time and provide sufficient updates.
However, when you only have 1,000 data points (as is the case with LIMA or Alpaca-Longest), 3 epochs might not be enough for the model to learn effectively.
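A quick back-of-the-envelope calculation shows why. With only 1,000 samples, the number of optimizer updates is tiny (this sketch ignores the partial final batch and any gradient accumulation details):

```python
samples = 1_000

for batch_size in (64, 256):
    for epochs in (3, 15):
        steps = (samples // batch_size) * epochs  # optimizer updates, ignoring the partial last batch
        print(f"batch={batch_size:>3}, epochs={epochs:>2} -> ~{steps} updates")

# batch= 64, epochs= 3 -> ~45 updates
# batch= 64, epochs=15 -> ~225 updates
# batch=256, epochs= 3 -> ~9 updates
# batch=256, epochs=15 -> ~45 updates
```

At a batch size of 256 and 3 epochs, the model gets only about nine weight updates, which makes the under-training concern very concrete.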

Table 2 provides a detailed breakdown of head-to-head battles on the Koala dataset. It compares a baseline (Setting 1) against other settings.
A key trend emerges regarding Epochs:
- Look at Setting 1 (15 Epochs) vs Setting 2 (3 Epochs).
- Look at Setting 7 (15 Epochs) vs Setting 8 (3 Epochs).
The authors found that configurations with 15 Epochs (Settings 7 and 15) generally maximized model performance. This suggests that the community’s reliance on “3 Epochs” for small datasets is causing us to report results from under-trained models. We are judging data quality based on models that haven’t finished learning yet.
Is This Just a Llama Problem?
Scientific rigor demands we check if this is a fluke specific to the Llama-2 architecture. To address this, the authors repeated the entire suite of experiments using Mistral-7B-v0.3, a more modern and powerful model.

As shown in Figure 2, the chaos remains.
- Setting 12 strongly favors LIMA (116 wins vs 21 losses on Koala).
- Setting 1 is much more balanced or even favors Alpaca in different splits.
The “Flip” is present here too. The authors also provided the detailed win/loss breakdown for Mistral in Table 3.

The consistency of inconsistency is striking. Whether you use Llama-2 or Mistral, if you pick your hyperparameters blindly, your conclusions about data quality are effectively random.
How Was This Judged?
To ensure the judging itself wasn’t the source of the error, the authors utilized a strict “LLM-as-a-judge” prompt, fed to GPT-4.

The prompt (shown in the image above) explicitly asks the judge to avoid position bias (favoring the first answer) and length bias (favoring the longer answer), ensuring the evaluation focuses on helpfulness and relevance. While LLM-based judging has its own flaws, the relative differences between hyperparameter settings remain valid because the judge is constant across all experiments.
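For reference, a pairwise LLM-as-a-judge call boils down to sending both answers to the judge with a fixed rubric. Here is a minimal sketch using the OpenAI Python client; the prompt wording and model string are illustrative paraphrases, not the paper’s exact prompt:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Paraphrased rubric in the spirit of the paper's prompt, not a verbatim copy.
JUDGE_PROMPT = (
    "You are an impartial judge. Compare the two responses to the question below. "
    "Do not favor a response because of its position or its length; judge only "
    "helpfulness and relevance. Answer with exactly one of: 'A', 'B', or 'tie'.\n\n"
    "Question: {question}\n\nResponse A: {answer_a}\n\nResponse B: {answer_b}"
)

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'tie' for one pairwise comparison."""
    completion = client.chat.completions.create(
        model="gpt-4",  # the paper uses GPT-4 as the judge
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
```

In practice, evaluations often run each comparison twice with the answer order swapped to further reduce position bias.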
Discussion: The Need for Local Optimality
So, what is the solution? We can’t expect every researcher to run a massive grid search costing thousands of dollars for every experiment.
The authors argue for the concept of Local Optimality.
When claiming “Dataset A is better than Dataset B,” it is not fair to just pick one setting (e.g., Learning Rate 1e-5, 3 Epochs) and run both datasets through it. Why? Because Dataset A might converge faster than Dataset B. Dataset A might need a higher learning rate to escape a local minimum, while Dataset B might overfit with that same rate.
If you judge a fish by its ability to climb a tree, it will fail. Similarly, if you judge a dataset by a hyperparameter setting that doesn’t suit it, you aren’t measuring the data quality; you’re measuring the compatibility of the data with those arbitrary settings.
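In code terms, the fair comparison the authors call for is a per-dataset search rather than one shared config. A rough sketch, where `train_and_evaluate` is a hypothetical callable standing in for a full fine-tuning run plus benchmark evaluation:

```python
def compare_datasets_fairly(dataset_a, dataset_b, settings, train_and_evaluate):
    """Compare each dataset at its own locally optimal hyperparameters.

    `settings` is a list of hyperparameter dicts (e.g., the 16-setting grid above);
    `train_and_evaluate(dataset, cfg)` is a hypothetical callable that fine-tunes a
    model on `dataset` with `cfg` and returns a benchmark score.
    """
    best_a = max(train_and_evaluate(dataset_a, cfg) for cfg in settings)
    best_b = max(train_and_evaluate(dataset_b, cfg) for cfg in settings)

    # Report peak-vs-peak, not the outcome of one arbitrary shared setting.
    if best_a > best_b:
        return "Dataset A looks better at its local optimum"
    if best_b > best_a:
        return "Dataset B looks better at its local optimum"
    return "No clear difference at the respective local optima"
```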
The Authors’ Recommendations
- Justify Your Choices: If you pick a learning rate of 2e-5, explain why. Don’t just say “Previous Paper X did it,” especially if Paper X used a different model or dataset size.
- Find the Peak: You should attempt to find the “locally optimal” hyperparameters for each dataset you are testing. Report the performance of Dataset A at its best against Dataset B at its best.
- Acknowledge Under-training: For small datasets (like the popular 1k subsets), the standard “3 epochs” is likely insufficient. The experiments showed that 15 epochs often yielded significantly better alignment.
Conclusion & Implications
This research serves as a necessary wake-up call for the NLP and LLM community. We are currently building the foundations of AI alignment on the assumption that we know which data is “good.”
The paper “Call for Rigor in Reporting Quality of Instruction Tuning Data” proves that this assumption is shaky. By simply toggling a learning rate or extending training by a few epochs, we can completely reverse the leaderboard of “Top Datasets.”
Key Takeaways for Students and Researchers:
- Skepticism: When reading a paper that claims “Our data selection method beats the baseline,” check the Hyperparameters. Did they use the same settings? Were those settings optimal for the baseline?
- Sensitivity: Be aware that LLM training is highly sensitive. “Data Quality” is not a static property; it is an interaction between the data and the training recipe.
- Rigor: If you are running your own experiments, do not simply copy-paste config files. Run small ablations and verify that your model has actually converged (finished learning).
The future of open-source AI depends on high-quality data. To find that data, we first need to ensure our compass—the way we train and evaluate—is pointing True North, not just pointing wherever the random learning rate takes us.