The landscape of Natural Language Processing (NLP) is currently dominated by a “bigger is better” mentality. We marvel at the emergent abilities of massive Large Language Models (LLMs) like GPT-4 or Claude 3—specifically their ability to perform Chain-of-Thought (CoT) reasoning. This is the ability to break down a complex problem into intermediate steps to arrive at a solution, much like a human does with a math problem.
However, this capability usually fades when we shrink the model size to something that can run on consumer hardware or edge devices. Small Language Models (SLMs) typically struggle to “think” through problems.
The standard solution has been Supervised Fine-Tuning (SFT): we take a massive, smart teacher (like GPT-4), generate thousands of step-by-step reasoning examples, and train the small student to mimic them. While this works to an extent, it often leads to mere imitation. The student memorizes the specific reasoning paths of the teacher but fails to generalize to new, unseen problems.
In a fascinating paper titled “Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models,” researchers Leonardo Ranaldi and André Freitas propose a new paradigm. Instead of just forcing the student to mimic the teacher, they allow the student to “self-refine” its understanding using preference optimization.
In this post, we will deconstruct this method, explore the mathematics behind their self-refinement technique, and analyze why this might be the key to democratizing AI reasoning.
The Problem: The Imitation Trap
Before diving into the solution, we need to understand the limitation of current methods. When we try to transfer reasoning skills from a Teacher (LLM) to a Student (SLM), we usually rely on Instruction Tuning.
The teacher generates a dataset of tuples: (Instruction, Question, Answer). The Answer here isn’t just “42”; it is a detailed Chain-of-Thought explanation. The student model is then trained to predict this exact sequence of words.
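To make this concrete, here is a toy sketch of how one such demonstration might be flattened into a prompt/target pair for supervised fine-tuning. The field names and the prompt template are illustrative assumptions, not the paper's exact schema:

```python
# One teacher-generated demonstration: (Instruction, Question, CoT Answer).
teacher_demo = {
    "instruction": "Solve the problem and explain your reasoning step by step.",
    "question": "A farmer has 12 apples and gives away 5. How many remain?",
    "cot_answer": "The farmer starts with 12 apples. Giving away 5 leaves 12 - 5 = 7. The answer is 7.",
}

def to_sft_example(demo: dict) -> dict:
    """Concatenate instruction and question into the prompt; the teacher's CoT is the target."""
    prompt = f"{demo['instruction']}\n\nQuestion: {demo['question']}\nAnswer:"
    return {"prompt": prompt, "target": " " + demo["cot_answer"]}

print(to_sft_example(teacher_demo))
```

During training, the student is penalized token by token whenever it deviates from the target string, which is exactly what makes it prone to parroting.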
The issue? Generalization. There are often multiple valid ways to solve a problem. By forcing the student to minimize the difference between its output and one specific teacher output, we limit the student’s flexibility. It learns to parrot the teacher’s specific phrasing rather than learning the underlying reasoning process.
The Solution: Self-Refine Instruction-Tuning
The researchers introduce a two-stage pipeline designed to bridge this gap.

As illustrated in Figure 1, the process flows from left to right:
- Phase 1: Instruction-Tuning (The Warm Up): The Student (SLM) is initially trained on demonstrations provided by the Teacher (LLM). This gives the student a baseline capability.
- Phase 2: Self-Refinement (The Core Contribution): The Student then enters a loop of self-improvement using Direct Preference Optimization (DPO). It generates its own reasoning paths, evaluates them, and updates its internal policy to prefer reasoning that leads to correct answers.
Let’s break these phases down mathematically.
Phase 1: Instruction-Tuning
In this first phase, the goal is knowledge transfer. We have a dataset generated by a teacher, consisting of instructions (\(i\)), questions (\(q\)), and chain-of-thought answers (\(a_i\)).
The answer \(a_i\) is a sequence of tokens:

\[ a_i = (w_1, w_2, \ldots, w_{|a_i|}) \]
The model operates in states (\(s_t\)), where the state at time \(t\) depends on the input instruction/question and all previously generated tokens:

\[ s_t = (i, q, w_1, \ldots, w_{t-1}) \]
The training objective is standard for language models: maximize the likelihood of generating the teacher’s exact tokens. This is calculated via the instruction loss function:

\[ \mathcal{L}_{\text{inst}}(\theta) = -\sum_{t=1}^{|a_i|} \log p_\theta(w_t \mid s_t) \]
This equation essentially says: “Adjust the model parameters \(\theta\) so that the probability of producing the teacher’s words (\(w_t\)), given the context, is maximized.”
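A minimal PyTorch sketch of that objective is below: minimizing the cross-entropy over the teacher's tokens is the same as maximizing \(\sum_t \log p_\theta(w_t \mid s_t)\). The tensor shapes and vocabulary size are placeholders, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def instruction_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """
    Negative log-likelihood of the teacher's tokens w_t given each state s_t.
    logits:     (seq_len, vocab_size) -- one predicted distribution per state s_t
    target_ids: (seq_len,)            -- the teacher's token ids w_t
    """
    return F.cross_entropy(logits, target_ids)

# Toy shapes only; in practice the logits come from the student conditioned on (i, q, w_<t).
logits = torch.randn(6, 32_000)             # 6 answer tokens, 32k-token vocabulary
target_ids = torch.randint(0, 32_000, (6,))
print(instruction_loss(logits, target_ids).item())
```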
After this phase, we have a “decent” student model. It can follow instructions and reason a little bit, but it is fragile. This is where most previous research stops. This paper, however, is just getting started.
Phase 2: The Self-Refinement via DPO
To make the student robust, the researchers employ Direct Preference Optimization (DPO).
DPO is a technique usually used to align models with human preferences (e.g., “don’t be toxic,” “be helpful”). It typically requires a dataset where a human (or a strong AI) looks at two answers and says, “Answer A is better than Answer B.”
The genius of Self-Refine Instruction-Tuning is how they adapt DPO for reasoning without needing a separate reward model or human annotators.
The Setup
For every question in the dataset, the researchers ask the Student Model (which was already warmed up in Phase 1) to generate answers in two ways:
- Standard Answer: Just the answer.
- CoT Answer: The answer generated with the prompt “Let’s think step by step.”
They then define a preference pair (\(y_w, y_l\))—a winner and a loser.
- The Winner (\(y_w\)): A generated response that includes Chain-of-Thought reasoning AND arrives at the correct ground-truth answer.
- The Loser (\(y_l\)): A response that is either incorrect or fails to use reasoning.
The goal is to update the model so that it inherently prefers generating the “Winning” style (correct CoT) over the “Losing” style.
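Below is a toy sketch of how such a pair might be assembled from the warmed-up student's own outputs. The answer-extraction regex and the function names are illustrative assumptions; the paper does not prescribe this plumbing.

```python
import re

def extract_final_answer(text: str):
    """Pull the final numeric answer out of a response (a simplifying assumption)."""
    match = re.search(r"answer is\s*(-?\d+)", text.lower())
    return match.group(1) if match else None

def build_preference_pair(cot_response: str, direct_response: str, ground_truth: str):
    """Prefer the CoT response only if it both reasons and reaches the correct answer."""
    if extract_final_answer(cot_response) == ground_truth:
        return {"chosen": cot_response, "rejected": direct_response}
    return None  # no usable self-generated winner; see the teacher fallback below

cot = "We compute 12 - 5 = 7, so the answer is 7."   # reasoning + correct -> winner
direct = "The answer is 6."                          # no reasoning, wrong -> loser
print(build_preference_pair(cot, direct, "7"))
```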
The Math of Refinement
The standard DPO loss function looks like this:

\[ \mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(q,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma(M) \right] \]
Where \(\sigma\) is the sigmoid function and \(M\) represents the margin between the winning and losing policies:

\[ M = \beta \log \frac{\pi_\theta(y_w \mid q)}{\pi_{sft}(y_w \mid q)} - \beta \log \frac{\pi_\theta(y_l \mid q)}{\pi_{sft}(y_l \mid q)} \]
This formula looks intimidating, but it essentially penalizes the model if the probability it assigns to the “loser” answer is higher than the probability it assigns to the “winner” answer, scaled by a reference model (\(\pi_{sft}\)).
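For readers who prefer code to notation, here is a minimal PyTorch sketch of that loss. It assumes the per-sequence log-probabilities have already been computed under both the current policy \(\pi_\theta\) and the frozen reference model \(\pi_{sft}\) produced by Phase 1.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """
    policy_logp_* : log pi_theta(y | q) summed over the tokens of the winner / loser
    ref_logp_*    : log pi_sft(y | q) under the frozen reference model
    The margin M is the scaled difference of log-ratios; the loss is -log(sigmoid(M)).
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy numbers: the policy slightly prefers the winner relative to the reference,
# so the margin is positive and the loss is small.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(round(loss.item(), 3))
```

The \(\beta\) hyperparameter controls how strongly the refined policy is kept close to the reference model.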
The researchers customized this specifically for Chain-of-Thought (\(DPO_{CoT}\)). They explicitly define the “Winner” (\(y_w\)) as the self-generated correct reasoning path:

\[ y_w = a_i^{CoT}, \quad a_i^{CoT} \sim \pi_{sft}(\cdot \mid i, q) \]
And the selection criterion for the winner is strictly defined by correctness (\(t_i\) is the target/ground truth):

\[ y_w = \begin{cases} a_i^{CoT} & \text{if } a_i^{CoT} \text{ yields the correct answer } t_i \\ \hat{a}_i & \text{otherwise} \end{cases} \]
If the student generates a correct CoT, that becomes the training target. If it doesn’t, they fall back to the teacher’s demonstration (\(\hat{a}_i\)). This creates a self-reinforcing loop: the model practices on its own successful reasoning attempts, which are inherently more aligned with its own internal distribution than the teacher’s demonstrations.
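In code, that selection rule reduces to a simple fallback. The names below are illustrative, not the paper's implementation:

```python
def select_winner(student_cot: str, teacher_cot: str, prediction: str, target: str) -> str:
    """Use the student's own CoT as y_w when its prediction matches the ground truth t_i;
    otherwise fall back to the teacher demonstration (a-hat_i)."""
    return student_cot if prediction.strip() == target.strip() else teacher_cot

# The student's correct attempt becomes the preferred training target.
print(select_winner("12 - 5 = 7. The answer is 7.",
                    "Teacher: subtract 5 from 12 to get 7.",
                    "7", "7"))
```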
Experiments & Results
The researchers tested this method on varied benchmarks, including common sense reasoning (OpenBookQA, PIQA) and math problems (GSM8K, MultiArith). They used Llama-2 (7B and 13B) and Mistral-7B as students, and giant models like Llama-2-70B and GPT-3.5 as teachers.
1. In-Family vs. Out-Family Alignment
One of the biggest challenges in distillation is “Out-Family” alignment. It is easier for a small Llama model to learn from a big Llama model because they share the same underlying architecture and tokenizer (In-Family). It is much harder for a Llama model to learn from GPT-3.5 (Out-Family).
In-Family Performance (Llama-2-70B \(\to\) Llama-2-7/13B):

In Figure 2, look at the transition from the blue bars (Instruction Tuning) to the red/orange bars (Self-Refine). The Self-Refine step consistently boosts accuracy, often closing the gap significantly between the student and the massive teacher model (represented by the horizontal lines).
Out-Family Performance (GPT-3.5 \(\to\) Llama/Mistral):

Figure 3 is perhaps even more impressive. Here, the students are learning from GPT-3.5 (a completely different model). The “Instruction-Tuned” models (blue) struggle to match the teacher. However, the Self-Refine models (orange/red) make massive jumps in performance.
This shows that DPO allows the student to internalize the teacher’s reasoning logic, even when the phrasing and tokenization are completely different.
2. Can It Generalize?
A major criticism of fine-tuning is that models become “one-trick ponies.” If you train them on science questions, they forget how to do math.
The researchers tested Cross-Domain Self-Refinement. They instructed a model on one task (e.g., OpenBookQA) and then applied the Self-Refine phase using data from a different task (e.g., CommonSenseQA).

Table 1 highlights the “Cross Self-refine” rows. Notice that models refined on different tasks (out-domain) often performed nearly as well as, and sometimes better than, baselines. This suggests that the reasoning capability itself is being improved, not just the memorization of a specific dataset.
3. Efficiency with Less Data
Finally, does this require massive amounts of data? The researchers reduced the number of demonstrations to 25%, 50%, and 75% of the original dataset.

Figure 4 shows that even with only 25% or 50% of the data, the Self-Refined models (solid lines) consistently outperform the standard Instruction-Tuned models (dotted/dashed lines). This implies that quality (via refinement) beats quantity. A model that reflects on its own correct reasoning learns faster than one that blindly consumes teacher data.
Conclusion and Implications
The paper “Self-Refine Instruction-Tuning” offers a compelling path forward for the open-source AI community. It addresses the “reasoning gap” between proprietary giants like GPT-4 and accessible local models.
The key takeaways are:
- Don’t just imitate: Standard Instruction Tuning is a good start, but it’s not enough for deep reasoning.
- Self-Correction is powerful: Using DPO to reinforce the student’s own correct reasoning paths leads to better generalization than standard supervised learning.
- Cross-compatibility: This method works exceptionally well even when the teacher and student are from different model families.
By allowing small models to “practice” and refine their own thoughts, we move closer to a future where high-level reasoning isn’t locked behind massive API paywalls, but available on the devices we use every day.