If you have been following the explosion of Large Language Models (LLMs) specialized for coding, you have likely seen the leaderboards. Every week, a new open-source model claims to rival GPT-4 on benchmarks like HumanEval. It seems we are in a golden age of automated programming.

But there is a catch. If you take these high-flying models and test them on newer, fresher problems from competitive programming sites, their performance often collapses. Why?

The answer, according to a fascinating new paper by researchers from Beijing University of Posts and Telecommunications and Meituan, is data leakage. Many models are essentially “memorizing the test answers” because their training data is contaminated with benchmark problems.

In this deep dive, we will explore the paper “How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with Really Good Data.” We will uncover how the researchers detected this leakage, how they cleaned the data, and most importantly, the novel XCoder method they developed to select only the highest-quality, most diverse, and complex data for training.

The Problem: When 100% Accuracy is a Bad Sign

To understand the contribution of this paper, we first need to look at the current state of Code Instruction Tuning. This is the process of taking a pre-trained base model (like LLaMA or StarCoder) and fine-tuning it on pairs of instructions and code solutions. The goal is to align the model to follow user requests.

Datasets for this task usually come from two sources:

  1. Distillation: Asking a strong model (like GPT-4) to generate coding problems and solutions (e.g., Code Alpaca, WizardCoder).
  2. Real-world Data: Mining GitHub commits or StackOverflow (e.g., OctoPack).

When researchers trained models on popular open-source datasets (like Magicoder-Evol-Instruct or Code-Feedback), they noticed something suspicious.

Figure 1: The left panel shows a performance comparison across benchmarks; the right shows how results change after data decontamination. Magicoder-Evol-Instruct and Code-Feedback may have data leakage on HumanEval.

As shown in Figure 1 (left), models like Magicoder-Evol-Instruct achieve incredible scores on HumanEval (a standard, older benchmark). However, their performance drops significantly on LiveCodeBench, a benchmark that continuously collects new problems from contests to prevent contamination.

The discrepancy suggests that these models aren’t necessarily “smart”; they have just seen the HumanEval questions during training.

Measuring the Leakage: The TLI Metric

To prove this, the authors introduced a metric called the Test Leakage Index (TLI).

TLI quantifies how much the training data overlaps with the test data. It works by generating n-grams (short, contiguous sequences of tokens) from both datasets and computing similarity scores between them. A high TLI means the model is effectively training on the test set.
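To make the idea concrete, here is a minimal sketch of an n-gram leakage check in Python. This is not the paper’s exact TLI formula: the 5-gram size, the max-over-training-samples aggregation, and the function names are illustrative assumptions.

```python
from collections import Counter

def ngrams(text: str, n: int = 5) -> Counter:
    """Bag of word-level n-grams for one instruction or test problem."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(test_g: Counter, train_g: Counter) -> float:
    """Fraction of the test problem's n-grams that also occur in the training sample."""
    if not test_g:
        return 0.0
    return sum((test_g & train_g).values()) / sum(test_g.values())

def leakage_index(train_instructions, test_problems, n: int = 5) -> float:
    """For each test problem, keep the overlap with its closest training sample,
    then average over the test set (scaled by 100 for readability)."""
    train_grams = [ngrams(t, n) for t in train_instructions]
    scores = []
    for problem in test_problems:
        test_g = ngrams(problem, n)
        scores.append(max((overlap(test_g, tg) for tg in train_grams), default=0.0))
    return 100 * sum(scores) / len(scores)
```

Decontamination then amounts to dropping any training sample whose overlap with some test problem exceeds a threshold, which is what produces the “cleaned” dataset versions discussed next.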

Table 2: Comparison of performance across three datasets with data leakage and their cleaned versions on HumanEval and LiveCodeBench. TLI measures the extent of data leakage in the training set on HumanEval. Size and performance changes after cleaning are highlighted in red.

Table 2 reveals the extent of the problem. Datasets like Magicoder-Evol-Instruct had a TLI of 43.2, far above the roughly 5.0 of a “clean” baseline.

When the researchers cleaned the data (removing the leaked samples) and retrained, performance on HumanEval dropped (e.g., from 68.3% to 65.9% or lower), while performance on the unseen LiveCodeBench remained stable. This confirmed that part of the headline “state-of-the-art” HumanEval scores was an artifact of data leakage rather than genuine capability.

The Solution: Defining “Good” Data

Once the leaked data is removed, we are left with a massive pool of potential training data—over 2.5 million samples from various open-source collections. Training on all of this is computationally expensive and inefficient.

The core contribution of this paper is a systematic way to prune this data. The authors propose that “good” code instruction data must satisfy three dimensions:

  1. Complexity: The problem shouldn’t be trivial.
  2. Quality: The code solution must be correct and robust.
  3. Diversity: The dataset shouldn’t just be 10,000 variations of “print Hello World.”

They combined these three into a pipeline to create XCoder, a model trained on a highly curated subset of data.

Figure 2: Illustration of our data selection approach.

Figure 2 illustrates the workflow. Let’s break down how they quantified each dimension.

Dimension 1: Instruction Complexity

To measure complexity, the researchers didn’t rely on simple heuristics like length. Instead, they trained a specific Complexity Scorer.

They utilized the concept of “Evol-Instruct,” where a simple instruction is iteratively made more complex using ChatGPT. By tracking how many rounds of evolution a sample had gone through, they created a labeled dataset to train a model that can predict the complexity score of any given instruction.
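A rough sketch of how such a scorer’s training data could be assembled, assuming access to Evol-Instruct evolution chains; the dictionary format and the use of the round index directly as a regression label are assumptions for illustration, not the paper’s exact recipe.

```python
def build_complexity_labels(evolution_chains):
    """evolution_chains: list of chains, where chain[i] is the instruction after
    i rounds of Evol-Instruct rewriting (chain[0] is the seed instruction).
    Returns (instruction, score) pairs for fine-tuning a regression-style scorer."""
    examples = []
    for chain in evolution_chains:
        for round_idx, instruction in enumerate(chain):
            examples.append({"text": instruction, "label": float(round_idx)})
    return examples

# A scorer fine-tuned on these pairs can then assign a complexity estimate to any
# unseen instruction, which is later used to rank candidate training samples.
```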

Dimension 2: Response Quality (The Unit Test Approach)

This is perhaps the most innovative part of their method. In general text generation, “quality” is subjective. In coding, code either runs or it doesn’t.

However, just checking if code compiles isn’t enough. It needs to solve the specific problem described. The researchers developed a Unit Test Model.

Figure 8: Input and output case of unit test model.

As seen in Figure 8, they trained a model (a fine-tuned LLaMA3-70B) specifically to read a problem description and generate a suite of Unit Tests (assertions that verify the code’s logic).

During the data selection process, for every training pair (instruction, code), the system:

  1. Feeds the instruction to the Unit Test Model to generate test cases.
  2. Executes the provided code against these test cases.
  3. Calculates a quality score based on the pass rate.

This serves as a rigorous filter. If a training sample contains code that doesn’t actually solve the instruction, it gets a low score and is discarded.
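A simplified sketch of that scoring loop is shown below. Here `generate_unit_tests` is a stand-in for the fine-tuned unit test model, the tests are assumed to arrive as plain assertion strings, and the bare `exec` calls gloss over the sandboxing and timeouts a real pipeline would need.

```python
def pass_rate(instruction: str, solution_code: str, generate_unit_tests) -> float:
    """Illustrative quality score: run model-generated assertions against the
    candidate solution and return the fraction that pass.

    generate_unit_tests(instruction) stands in for the fine-tuned unit test model
    and is assumed to return assertion strings such as 'assert add(2, 3) == 5'."""
    tests = generate_unit_tests(instruction)
    if not tests:
        return 0.0
    namespace = {}
    try:
        exec(solution_code, namespace)   # load the candidate solution
    except Exception:
        return 0.0                       # code that doesn't even run scores zero
    passed = 0
    for test in tests:
        try:
            exec(test, namespace)        # each test is a single assertion
            passed += 1
        except Exception:
            pass
    return passed / len(tests)
```

Samples whose solutions fail most of the generated tests receive a low quality score and are filtered out of the training pool.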

Why use a model for unit tests? You might wonder if an LLM is reliable enough to write tests. The authors analyzed this and found that scaling the model size significantly improves test generation accuracy.

Figure 4: Comparison of the accuracy of Unit Test Models of different sizes when generating test cases. The ability of GPT-4 to generate test cases was also evaluated.

Figure 4 shows that their 70B Unit Test model achieves nearly 79% accuracy, rivaling GPT-4. This reliability allows them to trust the automated scoring of millions of data samples.

Dimension 3: Instruction Diversity

Finally, having complex and correct code isn’t useful if all the samples are semantically identical. To ensure breadth, the researchers employed Diversity-based Sampling.

They embedded the instructions into a vector space. When selecting data, they iteratively added a sample to the pool only if it was sufficiently “distant” (dissimilar) from the examples already selected. This ensures the model is exposed to a wide range of coding concepts and problem types.
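Here is a minimal sketch of such a greedy diversity filter, assuming instruction embeddings are already computed and candidates arrive pre-ranked (e.g., by the complexity and quality scores above). The cosine-distance threshold and the ranking order are illustrative assumptions, not the paper’s exact settings.

```python
import numpy as np

def select_diverse(embeddings: np.ndarray, ranked_indices, k: int, min_dist: float = 0.15):
    """Greedy diversity filter: walk candidates in ranked order and keep a sample
    only if its nearest neighbour in the selected pool is farther than min_dist
    (cosine distance). embeddings: (N, d) array; ranked_indices: candidate order."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = []
    for idx in ranked_indices:
        if len(selected) >= k:
            break
        if not selected:
            selected.append(idx)
            continue
        sims = unit[selected] @ unit[idx]     # cosine similarity to everything kept so far
        if 1.0 - sims.max() > min_dist:       # far enough from the closest selected sample
            selected.append(idx)
    return selected
```

Because candidates are visited in ranked order, the filter tends to keep the most complex, highest-quality representative of each region of the embedding space while discarding near-duplicates.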

Experiments: Does XCoder Work?

The researchers applied this selection strategy to create the XCoder dataset and trained LLaMA3-based models on it. The results were impressive, particularly in terms of data efficiency.

Table 3: Comparison of the performance using XCoder data and other mainstream data on HumanEval and LiveCodeBench. All models are trained based on LLaMA3-8B-Base and use greedy decoding.

Table 3 highlights the key findings:

  • Efficiency: XCoder trained on just 40k samples outperforms Magicoder-Evol-Instruct (100k samples) on the difficult, contamination-free LiveCodeBench.
  • Generalization: It achieves state-of-the-art results on BigCodeBench as well.
  • Leakage-Free: While it doesn’t hit the artificially inflated scores of contaminated models on HumanEval, it performs honestly and robustly across the board.

The data scaling experiments (reported in Table 7 of the paper) further reinforced this. XCoder with 10k samples achieved performance comparable to random sampling with 160k samples, confirming the “quality over quantity” hypothesis.

Figure 3: Comparison of the performance of XCoder and other mainstream models on LiveCodeBench.

When scaled up to 70B parameters, XCoder-70B (shown in Figure 3) positions itself as one of the best open-source code models available, competing closely with proprietary giants like GPT-4 and Gemini 1.5 Pro on the Easy-Pass@1 metric of LiveCodeBench.

The Anatomy of High-Quality Data

One of the most educational aspects of this paper is the retrospective analysis. After letting their algorithm select the best 160k samples, the researchers looked back to see where those samples came from. This tells us what kind of data construction methods are actually valuable.

Figure 5: The contribution ratio of different data sources to XCoder…

Figure 5 offers a breakdown of the selected data:

  1. Complexity (Chart a): The Code-Feedback dataset (which involves multi-turn refinement) was a major source of complex instructions. This suggests that data involving “debugging” or “improving” code is more cognitively demanding than simple one-shot generation.
  2. Quality (Chart b): OctoPack (derived from real Git commits) and StarCoder2-Self-Align contributed significantly to quality. This makes sense: code committed to GitHub usually works, and Self-Align datasets often include execution filters.
  3. Diversity (Charts c & d): This is the most striking finding. Before diversity sampling, synthetic datasets dominated. After diversity sampling (Chart d), the share of OctoPack (real-world data) surged from 2% to near the top tier.

The Insight: Synthetic data (generated by GPT-4) is great for complexity and quality, but it tends to be repetitive. Real-world data (from humans) is messy and sometimes simple, but it provides the necessary diversity to cover the long tail of programming scenarios.

Conclusion

The paper “How Do Your Code LLMs Perform?” serves as a crucial reality check for the field of AI programming. It exposes the rampant data leakage that inflates benchmark scores and offers a robust methodology for fixing it.

For students and practitioners, the takeaways are clear:

  • Trust, but Verify: Be skeptical of HumanEval scores. Look at LiveCodeBench or recent, contamination-free benchmarks.
  • Data Pruning is Key: You don’t need millions of rows of data. You need clean, complex, and verified data.
  • The Power of Execution: Using a model to write unit tests to verify training data is a powerful technique that moves us beyond simple text similarity.
  • Mix Your Sources: The best results come from a blend of complex synthetic instructions and diverse real-world examples.

XCoder demonstrates that with the right data strategy, we can train smaller, more efficient models that actually understand code, rather than just memorizing it.