Introduction
In the current landscape of Artificial Intelligence, proprietary Large Language Models (LLMs) like GPT-4, Gemini, and Claude dominate the leaderboard, particularly in code generation. Their ability to write complex Python scripts or debug software is impressive. However, their closed-source nature raises concerns regarding data privacy, cost, and accessibility.
This has fueled a race to develop open-source alternatives (like CodeLlama, DeepSeek-Coder, or StarCoder) that can match the performance of these proprietary giants. The primary method for doing this is Knowledge Distillation. In this process, a powerful “teacher” model (like GPT-4) generates synthetic training data—specifically, instruction-response pairs—which are then used to train a smaller “student” model.
But there is a flaw in this pipeline. What happens if the teacher is “lazy” or incorrect? When faced with complex coding tasks, even models like GPT-4 can produce buggy, inefficient, or monolithic code. If a student model learns from these low-quality responses, its performance ceiling is artificially lowered.
In this post, we will explore a new framework proposed by researchers from Hong Kong Baptist University and Alibaba DAMO Academy called AMR-Evol (Adaptive Modular Response Evolution). This method rethinks how we generate training data, moving away from simple question-answering to a sophisticated process of modular decomposition and evolutionary refinement.
The Problem: The Limits of Direct Distillation
To understand why AMR-Evol is necessary, we first need to look at the standard approach, known as Direct Response Distillation.
In a typical setup, researchers collect a list of coding problems (instructions) and feed them to a teacher model. The teacher outputs a code snippet, and this pair (Instruction, Response) becomes a training example for the student.
For simple tasks, this works fine. But for complex requirements—such as “write a function to calculate a matrix determinant using Laplace expansion with specific constraints”—the teacher model often struggles. It might generate code that is logically convoluted, fails on edge cases, or simply ignores specific constraints (like “use nested loops”).

As illustrated in Figure 1, when the task is complex, the “Direct Distillation” path leads to a “Low Quality Response.” The code might look correct at a glance but fails when scrutinized (represented by the red X). When the student model trains on this data, it essentially “memorizes” the teacher’s mistakes and confusion.
The researchers identified that simply asking the teacher to “try again” or “repair” its own code (a method called Self-Repair) often fails because the model lacks external guidance or a structured way to simplify the problem.
The Solution: The AMR-Evol Framework
The core insight of AMR-Evol is borrowed from software engineering principles: Modular Programming. Instead of trying to write a complex program in one breath, it is better to break it down into smaller, reusable functions.
The AMR-Evol framework turns the data generation process into a two-stage pipeline:
- Modular Decomposition (MD): Breaking the problem down.
- Adaptive Response Evolution (ARE): Refining the solution using a database of verified knowledge.

Figure 2 provides a high-level overview of this workflow. Let’s break down exactly what is happening in each stage.
Stage 1: Modular Decomposition (MD)
In the first stage, the system takes the potentially flawed “Direct Response” and the original instruction, and asks the teacher model to decompose the solution into sub-modules.
Mathematically, this step can be written as:

\[
\{F_1^m, F_2^m, \dots\} = M_t(I, R_d)
\]
Here, the teacher model (\(M_t\)) analyzes the instruction (\(I\)) and the initial response (\(R_d\)) to produce a set of function modules (\(F_1^m, F_2^m, \dots\)).
Why does this help? By forcing the model to define distinct functions (e.g., calculate_determinant, get_submatrix, validate_input), the complexity of the task is reduced. It shifts the model’s focus from generating a massive block of code to solving specific, smaller problems.
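To make this concrete, here is a minimal sketch (written for this post, not taken from the paper) of what a modular response to the determinant task from earlier might look like, using the kinds of function names mentioned above:

```python
# Hypothetical decomposed response for the "matrix determinant via
# Laplace expansion" task: each sub-problem becomes its own function.

def validate_input(matrix):
    """Check that the input is a non-empty square matrix."""
    if not matrix or any(len(row) != len(matrix) for row in matrix):
        raise ValueError("Input must be a non-empty square matrix")


def get_submatrix(matrix, skip_row, skip_col):
    """Return the minor obtained by removing one row and one column."""
    return [
        [value for c, value in enumerate(row) if c != skip_col]
        for r, row in enumerate(matrix)
        if r != skip_row
    ]


def calculate_determinant(matrix):
    """Recursively compute the determinant via Laplace expansion along row 0."""
    validate_input(matrix)
    if len(matrix) == 1:
        return matrix[0][0]
    return sum(
        ((-1) ** col) * matrix[0][col]
        * calculate_determinant(get_submatrix(matrix, 0, col))
        for col in range(len(matrix))
    )
```

Each function is small enough for the teacher to get right on its own, which is exactly the point of the decomposition step.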
Stage 2: Adaptive Response Evolution (ARE)
The second stage is where the “Evolution” happens. The researchers recognized that while coding tasks vary, the building blocks (sub-modules) often recur. A function to “check if a number is prime” or “transpose a matrix” is the same regardless of the larger program it belongs to.
AMR-Evol maintains a Functional Module Database—a repository of high-quality, verified code snippets.
The Retrieval Process
When the system decomposes a new problem, it doesn’t just blindly rewrite the functions. It checks its database to see if it already knows how to solve these sub-problems.
First, the system converts the newly decomposed modules (\(F_i^m\)) into dense vector representations (embeddings) using a representation model (\(M_r\)):

\[
e_i^m = M_r(F_i^m)
\]
Next, it calculates the similarity between these new modules and the validated modules (\(F_j^v\)) already stored in the database:

\[
s_{ij} = \mathrm{sim}\big(e_i^m, e_j^v\big), \qquad e_j^v = M_r(F_j^v)
\]
If a similar, high-quality module is found in the database, it is retrieved.
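Here is a rough sketch of how that retrieval step could be implemented. The `embed` function is a stand-in for the representation model \(M_r\), and the cosine-similarity metric, threshold, and top-k values are illustrative assumptions, not the paper’s exact settings:

```python
import numpy as np


def embed(code_snippet: str) -> np.ndarray:
    """Stand-in for M_r: map a code string to a dense vector.

    A real pipeline would call a code-embedding model here.
    """
    rng = np.random.default_rng(abs(hash(code_snippet)) % (2**32))
    return rng.standard_normal(256)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve_similar_modules(new_modules, module_db, threshold=0.8, top_k=1):
    """For each decomposed module, return the most similar verified modules."""
    retrieved = []
    for module in new_modules:
        query = embed(module)
        scored = sorted(
            ((cosine_similarity(query, embed(candidate)), candidate)
             for candidate in module_db),
            key=lambda pair: pair[0],
            reverse=True,
        )
        retrieved.extend(
            candidate for score, candidate in scored[:top_k] if score >= threshold
        )
    return retrieved
```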
The Refinement (Evolution)
Finally, the teacher model is asked to generate the final response (\(R_{amr}\)). However, this time, it is provided with the original instruction and the retrieved high-quality modules as “in-context” examples:

\[
R_{amr} = M_t\big(I, \{F_{r_1}^v, F_{r_2}^v, \dots\}\big)
\]
This acts as a reference guide. Instead of hallucinating a solution, the teacher can look at the retrieved modules and say, “Ah, this is the correct way to implement this specific logic,” and integrate it into the final solution.
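One way to picture this is as prompt assembly: the retrieved modules are placed before the instruction as reference material. The template below is purely illustrative; the paper’s actual prompts are not reproduced here:

```python
def build_evolution_prompt(instruction: str, retrieved_modules: list) -> str:
    """Give the teacher verified modules as in-context references (illustrative template)."""
    references = "\n\n".join(retrieved_modules)
    return (
        "You are given reference implementations of relevant sub-modules.\n"
        "Reuse or adapt them where appropriate.\n\n"
        f"### Reference modules\n{references}\n\n"
        f"### Task\n{instruction}\n\n"
        "### Final solution\n"
    )
```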
The Evolution Loop: If the system generates a new module that is not in the database (and is sufficiently different from existing ones), it generates unit tests for it. If the new module passes the tests, it is added to the database. This allows the system’s “knowledge base” to grow and evolve over time.
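A sketch of that gating logic is shown below. Running the generated tests with pytest in a subprocess, and the `is_novel` callback for the “sufficiently different” check, are implementation assumptions for this post rather than details taken from the paper:

```python
import subprocess
import tempfile


def passes_unit_tests(module_code: str, test_code: str, timeout: int = 30) -> bool:
    """Run generated unit tests against a candidate module in isolation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(module_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", "-q", path],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def maybe_add_to_database(module_code, test_code, module_db, is_novel):
    """Add a verified, sufficiently novel module to the functional module database.

    `is_novel` stands in for the similarity check against existing entries.
    """
    if is_novel(module_code) and passes_unit_tests(module_code, test_code):
        module_db.append(module_code)
```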
Training the Student
Once the high-quality dataset \(D_{amr}\) is generated using the pipeline above, the student model is trained using standard supervised fine-tuning. The loss function ensures the student learns to predict the high-quality code tokens:

\[
\mathcal{L}(\theta) = -\sum_{(I,\, R_{amr}) \in D_{amr}} \sum_{t=1}^{|R_{amr}|} \log P_\theta\big(r_t \mid I, r_{<t}\big)
\]
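In code, this is the usual next-token cross-entropy restricted to response tokens. The sketch below follows common Hugging Face conventions (shifted labels, `-100` as the ignore index for instruction positions); it is not the paper’s training code:

```python
import torch
import torch.nn.functional as F


def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Token-level SFT loss.

    logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len), with instruction positions set to -100 so that
            only the response (code) tokens contribute to the loss.
    """
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```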
Experimental Results
To prove the effectiveness of AMR-Evol, the researchers tested it against three popular coding benchmarks:
- HumanEval (HE): A classic Python coding benchmark.
- MBPP: Mostly Basic Python Problems.
- EvalPlus (HE-Plus / MBPP-Plus): Harder versions of the above benchmarks with significantly more test cases to catch subtle bugs.
They compared AMR-Evol against:
- Direct: Standard distillation.
- CoT: Chain-of-Thought distillation (asking the teacher to “think step-by-step”).
- AnsRepair: Generating unit tests and asking the teacher to fix its own code.
Quantitative Performance
The results were compelling. Table 1 shows the performance when using deepseek-coder-6.7b-base as the student model.

Across three different levels of instruction complexity, AMR-Evol consistently outperformed the baselines.
- HumanEval-Plus: Improvement of +3.1% at complexity level 1.
- MBPP-Plus: Improvement of +4.0% at complexity level 1.
- Even at higher complexity levels (2 and 3), AMR-Evol maintained its lead, indicating that the modular approach scales well with problem difficulty.
Similar results were observed when using CodeLlama-7b-hf as the student model, confirming that the improvements are agnostic to the specific student architecture used.

Qualitative Analysis: Does the code look better?
Benchmarks are essential, but manual inspection helps verify if the code is genuinely better or just “gaming” the tests. The researchers employed human programmers to evaluate 120 randomly selected samples.

As shown in Figure 3, AMR-Evol (the rightmost bar in each cluster) consistently achieved higher accuracy ratings from human evaluators compared to Direct, CoT, and AnsRepair methods.
A Concrete Example
It is helpful to look at an actual code comparison to see why AMR-Evol wins.
In the example below (Table 11 from the paper), the task is to simulate a custom coin flip game with specific requirements: “efficiently manage a substantial number of players,” “monitor scores in real-time,” and prioritize “minimal memory usage.”
Direct Distillation (Top row) produces a script that uses basic dictionaries and loops. It works, but it’s a procedural script that might not scale well or handle state efficiently.
AMR-Evol (Bottom row) produces an Object-Oriented solution. It defines a CoinFlipGame class. It encapsulates the state (self.scores) and provides methods (coin_flip, update_scores, get_scores). This structural difference reflects the “Modular” nature of the training data. The model learned that complex simulation tasks are better solved with modular classes rather than flat scripts.
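An illustrative reconstruction of that object-oriented structure (written for this post, using the class and method names mentioned in the comparison) might look like this:

```python
import random


class CoinFlipGame:
    """Track scores for many players with a single dictionary of state."""

    def __init__(self, players):
        # One integer per player keeps memory usage minimal.
        self.scores = {player: 0 for player in players}

    def coin_flip(self):
        """Simulate one fair coin flip."""
        return random.choice(["heads", "tails"])

    def update_scores(self, player, guess):
        """Flip a coin and award a point if the player's guess is correct."""
        if self.coin_flip() == guess:
            self.scores[player] += 1

    def get_scores(self):
        """Return a snapshot of the current scores."""
        return dict(self.scores)


game = CoinFlipGame(["alice", "bob"])
game.update_scores("alice", "heads")
print(game.get_scores())
```

The state lives in one place (`self.scores`), and each behavior has its own method, which is the structural contrast with the flat, procedural script produced by direct distillation.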

Ablation Study: Do we need both stages?
The researchers performed an ablation study to check if they could skip the decomposition step or the evolution step.

- w/o MD (Without Modular Decomposition): They tried retrieving modules based solely on the full response. Performance dropped because a full, complex script is too “noisy” to find good matches in the database. Granularity matters.
- w/o ARE (Without Adaptive Response Evolution): They decomposed the code but didn’t use the database to retrieve improved modules. Performance dropped significantly. This shows that the external knowledge (the verified database) is crucial for correcting the teacher’s mistakes.
Conclusion and Implications
The AMR-Evol paper highlights a critical nuance in the development of Large Language Models: Data quality matters more than data quantity, and specific types of data quality matter for specific tasks.
For code generation, “quality” isn’t just about syntax correctness; it’s about structure. By enforcing a modular workflow during the data creation phase, the researchers achieved two things:
- They fixed logical errors by breaking complex problems into manageable sub-problems.
- They injected “best practices” into the training data by retrieving verified modules from a database.
This approach allows open-source models, which are often orders of magnitude smaller than GPT-4, to punch above their weight class. As the demand for specialized coding assistants grows, frameworks like AMR-Evol offer a blueprint for how to distill the “wisdom” of giant models into efficient, accessible open-source tools without copying their mistakes.