In the current landscape of Artificial Intelligence, Large Language Models (LLMs) like GPT-4 or Claude are the undisputed heavyweights. They possess an “emergent” ability known as Chain-of-Thought (CoT) reasoning—the capability to break down complex problems into step-by-step logical progressions to arrive at a correct answer.
However, there is a catch. These reasoning abilities typically only emerge in models with hundreds of billions of parameters. Running these models requires massive computational resources or expensive API calls, making them impractical for deployment on local devices or in low-resource environments.
So, how do we make small language models (SLMs) smart enough to reason?
The standard answer is Knowledge Distillation (KD)—having a small “student” model learn from a large “teacher.” But a new paper titled “Mentor-KD: Making Small Language Models Better Multi-step Reasoners” identifies a critical flaw in the standard distillation process: when the student relies solely on a distant, black-box teacher, it struggles to learn effectively.
The researchers propose a novel solution: introducing a Mentor. This intermediate model acts as a bridge, augmenting the training data and providing the “soft” knowledge that black-box APIs hide. In this deep dive, we will explore how Mentor-KD works and why it might be the key to democratizing AI reasoning.
The Problem with Traditional Distillation
To understand Mentor-KD, we first need to understand the limitations of the current standard, often called Reasoning Distillation.
In a typical setup, you take a massive Teacher LLM (like GPT-3.5-Turbo) and ask it to solve problems using Chain-of-Thought prompting (“Let’s think step by step”). You record these reasoning steps (rationales) and the final answers. Then, you fine-tune a small Student model on this generated text.
While this helps, it faces two major hurdles:
- Data Scarcity & Quality: Teacher LLMs are generalists. They might not produce enough diverse, high-quality reasoning paths for a specific task. Furthermore, because we usually access them via APIs, generating massive datasets is expensive.
- The “Black Box” Issue: Effective distillation usually involves transferring soft labels—the probability distributions over the vocabulary (e.g., the model being 80% sure of word A and 20% sure of word B). This “dark knowledge” tells the student a lot about the teacher’s internal thinking. However, commercial APIs often only return the final text, not the probability distributions. The student loses this rich signal.
The researchers illustrate this difference in Figure 1 below.

In approach (a), the student learns directly from the black-box teacher’s text. In approach (b), Mentor-KD, an intermediate “Mentor” model is inserted into the loop. This mentor is fully accessible (white-box), allowing it to provide both more data and those crucial soft labels.
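To make the hard-label versus soft-label distinction concrete, here is a toy sketch with made-up numbers (not from the paper): a black-box API effectively hands back only the chosen token, while a white-box model exposes the full probability distribution.

```python
import torch
import torch.nn.functional as F

# Toy logits for two candidate next tokens, "A" and "B" (illustrative values only).
logits = torch.tensor([2.0, 0.6])

# A white-box model exposes the full distribution -- the "dark knowledge".
soft_labels = F.softmax(logits, dim=-1)   # tensor([0.8022, 0.1978]) -> ~80% A, ~20% B

# A black-box API effectively returns only the chosen token -- a hard label.
hard_label = logits.argmax()              # tensor(0) -> just "A"; the nuance is lost

print(soft_labels, hard_label)
```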
The Mentor-KD Methodology
The core philosophy of Mentor-KD is similar to the academic hierarchy in a university. You have the Professor (Teacher LLM), who is brilliant but busy and somewhat inaccessible. You have the Undergrad (Student SLM), who needs to learn.
Mentor-KD introduces the Teaching Assistant (Mentor). The Mentor is an intermediate-sized model (e.g., FlanT5-Large) that learns from the Professor first. Because the Mentor is dedicated to the specific subject (task-specific fine-tuning), it can explain things more consistently, generate extra practice problems, and, most importantly, remain available for deep interrogation (soft labels).
The framework operates in three distinct steps, as outlined in Figure 2.

Step 1: Chain-of-Thought Annotation
First, the system prompts the Teacher LLM with a question and the trigger phrase “Let’s think step by step.” The Teacher generates a rationale and a final answer.
Crucially, the system filters this data. If the Teacher gets the final answer wrong, that rationale is discarded. Only correct reasoning paths are kept to create the initial dataset, \(\mathcal{D}_{\text{teacher}}\).
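A minimal sketch of this annotation-and-filtering loop is shown below. The `query_teacher` and `extract_answer` helpers are placeholders standing in for the LLM API call and the answer-parsing step, which the paper does not specify at this level of detail.

```python
COT_TRIGGER = "Let's think step by step."

def build_teacher_dataset(questions, gold_answers, query_teacher, extract_answer):
    """Collect teacher rationales, keeping only those whose final answer is correct."""
    d_teacher = []
    for question, gold in zip(questions, gold_answers):
        rationale = query_teacher(f"Q: {question}\nA: {COT_TRIGGER}")
        predicted = extract_answer(rationale)   # parse the final answer from the rationale
        if predicted == gold:                   # filter: discard wrong reasoning paths
            d_teacher.append({"question": question,
                              "rationale": rationale,
                              "answer": gold})
    return d_teacher
```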
Step 2: Training the Mentor and Data Augmentation
This is where the magic happens. An intermediate model (the Mentor) is fine-tuned on \(\mathcal{D}_{\text{teacher}}\). Once trained, the Mentor becomes a domain expert for that specific task (e.g., arithmetic or commonsense reasoning).
The Mentor is then asked to generate new rationales for the training questions. Because the Mentor is a generative model, it can produce valid reasoning paths that are different from the Teacher’s. This creates a new, augmented dataset, \(\mathcal{D}_{\text{mentor}}\).
The final training set for the student combines both sources:
\[ \mathcal{D} = \mathcal{D}_{\text{teacher}} \cup \mathcal{D}_{\text{mentor}} \]
This addresses the Data Scarcity problem. The Mentor effectively multiplies the amount of high-quality training data available for the student.
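The sketch below outlines this augmentation loop under the same assumptions as before: a seq2seq `mentor` already fine-tuned on \(\mathcal{D}_{\text{teacher}}\), a matching `tokenizer`, and the placeholder `extract_answer` helper. The sampling settings, and the choice to filter mentor rationales by answer correctness, are illustrative simplifications rather than the paper's exact configuration.

```python
def augment_with_mentor(d_teacher, mentor, tokenizer, extract_answer, num_paths=3):
    """Generate extra rationales with the mentor and keep only the correct ones."""
    d_mentor = []
    for example in d_teacher:
        inputs = tokenizer(example["question"], return_tensors="pt")
        # Sampling (rather than greedy decoding) yields diverse reasoning paths.
        outputs = mentor.generate(**inputs, do_sample=True, top_p=0.95,
                                  num_return_sequences=num_paths, max_new_tokens=256)
        for sequence in outputs:
            rationale = tokenizer.decode(sequence, skip_special_tokens=True)
            if extract_answer(rationale) == example["answer"]:
                d_mentor.append({**example, "rationale": rationale})
    # The student trains on the union of teacher- and mentor-generated rationales.
    return d_teacher + d_mentor
```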
Step 3: Reasoning Distillation
Finally, the Student model is trained. Unlike standard methods that just use text fine-tuning, Mentor-KD employs a dual-objective approach.
1. Rationale Distillation (RD)
The student learns to generate the reasoning steps as text. This is a standard language modeling objective where the model maximizes the likelihood of the correct tokens given the question.
\[ \mathcal{L}_{\text{RD}} = -\sum_{(q,\, r,\, y) \in \mathcal{D}} \log f(r, y \mid q) \]
Here, the model \(f\) tries to predict the reasoning path \(r\) and answer \(y\) given the question \(q\).
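In HuggingFace-style code for a seq2seq student such as FlanT5, this objective is simply the token-level cross-entropy returned when labels are supplied; the target formatting below is illustrative rather than the paper's exact template.

```python
def rationale_distillation_loss(student, tokenizer, question, rationale, answer):
    """Standard LM loss: maximize the likelihood of (rationale, answer) given the question."""
    enc = tokenizer(question, return_tensors="pt")
    target = tokenizer(f"{rationale} The answer is {answer}.", return_tensors="pt")
    out = student(input_ids=enc.input_ids,
                  attention_mask=enc.attention_mask,
                  labels=target.input_ids)
    return out.loss   # cross-entropy over the target tokens
```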
2. Soft Label Distillation (SLD)
This is the component that addresses the Black Box problem. Since the Mentor is a local model, we have access to its logits (raw prediction scores). We can convert these logits into a probability distribution using a softmax function with a temperature parameter \(\tau\):
\[ p_i = \frac{\exp(z_i / \tau)}{\sum_{j} \exp(z_j / \tau)} \]
The Student tries to match its own probability distribution (\(p^s\)) to the Mentor’s distribution (\(p^m\)). This is done by minimizing the Kullback-Leibler (KL) Divergence between them:
\[ \mathcal{L}_{\text{SLD}} = D_{\mathrm{KL}}\!\left(p^{m} \,\|\, p^{s}\right) \]
By doing this, the Student isn’t just learning what the answer is; it’s learning the Mentor’s confidence and the nuances of its decision-making process.
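A minimal PyTorch sketch of this objective is given below, assuming the mentor and student share a vocabulary so their per-token logits are comparable; the \(\tau^2\) rescaling is a common distillation convention rather than a detail taken from the paper.

```python
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, mentor_logits, tau=2.0):
    """KL divergence between temperature-softened mentor and student distributions."""
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)   # student log-probabilities
    p_m = F.softmax(mentor_logits / tau, dim=-1)            # mentor "soft labels"
    # F.kl_div(input, target) computes KL(target || input), i.e. KL(p_m || p_s) here.
    return F.kl_div(log_p_s, p_m, reduction="batchmean") * tau ** 2
```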
The Joint Loss
The final loss function combines these two goals, balanced by a hyperparameter \(\lambda\):
\[ \mathcal{L} = \mathcal{L}_{\text{RD}} + \lambda\, \mathcal{L}_{\text{SLD}} \]
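Tying the two objectives together is then a one-liner; the exact weighting scheme and the value of \(\lambda\) are design choices of the paper, so treat this as a schematic sketch.

```python
def mentor_kd_loss(loss_rationale, loss_soft_label, lam=0.3):
    """Joint objective: rationale distillation plus lambda-weighted soft-label distillation."""
    return loss_rationale + lam * loss_soft_label
```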
Experimental Results
The researchers tested Mentor-KD across a variety of complex reasoning tasks, including:
- Arithmetic: GSM8K, ASDiv, SVAMP.
- Commonsense: StrategyQA, CommonsenseQA.
- Logical: Tracking Shuffled Objects, Date Understanding.
- Symbolic: Last Letter Concatenation.
The Teacher was GPT-3.5-Turbo. The Mentor was generally FlanT5-XXL (11B) or FlanT5-Large, and the Student was a much smaller model, FlanT5-XL (3B) or below.
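As a rough picture of this setup, the checkpoints below are the public FlanT5 releases on the HuggingFace Hub; the paper's exact training hyperparameters are not reproduced here.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# FlanT5 models share one tokenizer, which is what makes token-level
# soft-label distillation between mentor and student straightforward.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
mentor = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")  # ~11B-parameter mentor
student = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")  # ~3B-parameter student
```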
Main Performance
The results were impressive. As shown in Table 1, Mentor-KD consistently outperformed standard Knowledge Distillation (Vanilla-KD) and other state-of-the-art methods like MCC-KD.

Notice the “CommonsenseQA” column. The Mentor-KD student (87.14%) actually outperforms its own Teacher (GPT-3.5 at 74.35%)! This suggests that the Mentor didn’t just pass down knowledge; through task-specific fine-tuning, it refined and concentrated the reasoning ability, which was then successfully transferred to the Student.
To prove this wasn’t a fluke limited to one architecture, the authors also tested on different backbone models (T5 vs. FlanT5) and sizes. Table 3 shows that Mentor-KD (bottom rows of each block) consistently yields the highest accuracy.

Why Does It Work? A Deeper Analysis
The paper goes beyond just showing high scores; it investigates why the Mentor is so effective.
1. The Power of “More” (Data Augmentation)
Does the Mentor simply help because it generates more data? The researchers varied the “degree” of augmentation—how many reasoning paths the Mentor generated per question.
Figure 3 shows the trend. Generally, as the Mentor generates more rationales (moving right on the x-axis), the Student’s accuracy improves, though it eventually saturates. This confirms that the Mentor is providing useful, diverse reasoning examples that the Student can learn from.

2. The Quality of the Mentor
One might worry: “Isn’t the Mentor smaller than the Teacher? Won’t it generate worse data?”
Surprisingly, no. Because the Mentor is fine-tuned specifically for the task, it often becomes more accurate than the generalist Teacher on that specific domain. Figure 4 illustrates this well.

In chart (a), we see the Mentor (blue bar) achieving higher accuracy on “Teacher-incorrect samples” compared to other potential LLMs. This means the Mentor is correctly solving problems that stumped the original Teacher. In chart (b), students trained on Mentor-generated data outperform those trained on data from larger models like Llama-3 or Vicuna.
3. Efficiency in Low-Resource Scenarios
A major motivation for this work is cost. Querying GPT-4 for thousands of reasoning paths is expensive. Can Mentor-KD help if we can only afford a small amount of Teacher data?
Figure 5 compares Mentor-KD to Vanilla-KD as the training set size decreases.

The difference is stark. In the “Tracking Shuffled Objects” task (left graph), Vanilla-KD (red line) collapses almost immediately if it doesn’t have the full dataset. Mentor-KD (blue line) maintains high performance even with only 40-60% of the original data. The Mentor effectively “fills in the gaps” via augmentation.
4. Does Mentor Size Matter?
Finally, does the Mentor need to be huge? The researchers tested this by varying the Mentor size from “Small” to “XL.”

As expected (Figure 6), a larger Mentor (XL) yields a better Student. However, even smaller Mentors often provide a boost over the baseline (dotted gray line), proving that the mechanism of mentorship—providing soft labels and augmentation—is valuable regardless of size.
Conclusion: The Future of Efficient AI
The Mentor-KD framework highlights a crucial insight in AI development: bigger isn’t always better for every part of the pipeline. While we rely on massive LLMs for general intelligence, distilling that intelligence into deployable, small models requires more than just copying text.
By inserting a task-specific Mentor into the loop, we gain three distinct advantages:
- Augmentation: We turn a small amount of expensive Teacher data into a large amount of synthetic training data.
- Access: We unlock “soft labels” that reveal the internal confidence of the model, which is impossible with black-box APIs.
- Specialization: We create a training pipeline where the “Teaching Assistant” can actually surpass the “Professor” in specific tasks, lifting the Student model to new heights.
For students and researchers working with limited resources, Mentor-KD offers a blueprint for building high-performance reasoners without breaking the bank on API costs or server clusters. As AI moves from cloud servers to edge devices, these “mentorship” strategies will likely become the standard for training the next generation of efficient models.