Introduction
In the current landscape of Large Language Models (LLMs), “Chain of Thought” (CoT) prompting has become a dominant paradigm. We have all seen the magic: if you ask a model like GPT-4 to “think step-by-step,” its ability to solve complex math word problems or commonsense reasoning tasks improves dramatically.
Naturally, researchers asked the next logical question: Can we use these reasoning chains to teach smaller models?
This process is known as CoT-Augmented Distillation. The idea is simple: you take a massive “teacher” model (like GPT-4 or Mistral), have it generate step-by-step rationales for a set of questions, and then fine-tune a tiny “student” model (like GPT-2 or a 2B-parameter model) on that data. The hope is that the student won’t just learn the answer; it will learn how to think.
And it works. Empirical studies show that small models trained this way significantly outperform those trained on just questions and answers.
But here is the catch: We don’t actually know why.
Are these small models truly learning to reason? Are they internalizing the logic of the teacher? Or are they just learning statistical shortcuts that look like reasoning?
In the paper “Investigating Mysteries of CoT-Augmented Distillation,” researchers from Northeastern University decided to stress-test this methodology. They performed a series of fascinating ablations (scrambling words, masking text, and moving the reasoning after the answer) to understand the mechanism behind the performance boost.
Their findings are surprising and challenge the assumption that small models are learning to “think.” In this post, we will tear down the paper, explain the experiments in detail, and look at what this means for the future of model training.
Background: Distillation and Reasoning
Before diving into the experiments, let’s establish the baseline.
Knowledge Distillation is a technique where a small model (Student) learns to mimic the behavior of a large model (Teacher). Classically, this involved the student trying to match the output probabilities (logits) of the teacher.
CoT-Augmented Distillation changes the data rather than just the loss function.
- Input: A question (e.g., “Why did the car slide?”).
- Teacher Output: A rationale (“Ice has less friction than pavement…”) + The Answer (“The ice”).
- Student Training: The student is fine-tuned to generate the Rationale followed by the Answer given the Input.
The prevailing theory is that by forcing the student to generate the reasoning steps before the answer, the model attends to the relevant features and derives the answer logically, just like the teacher did. This is referred to as Pre-CoT (Rationale \(\rightarrow\) Label).
The authors of this paper challenge this theory by introducing Post-CoT (Label \(\rightarrow\) Rationale). The logic is simple: if we put the answer first and the reasoning second, the reasoning cannot possibly help the model find the answer during inference (the answer has already been generated). So, if Post-CoT works just as well, the “reasoning” hypothesis might be wrong.
Core Method: The Experimental Setup
The researchers set up a controlled environment to test three specific questions regarding CoT distillation.
The Architecture
They utilized a standard distillation setup:
- Teacher: Mistral-7B-Instruct (a competent open-source model).
- Students: Small decoder-only models: GPT-2, Phi-1.5, and Gemma-2B.
- Datasets: Commonsense reasoning datasets (CommonsenseQA, OpenBookQA, and QuaRel).
The core of their methodology revolves around manipulating the training targets. As illustrated below, they contrast the standard approach with a new, inverted approach.

Figure 1 outlines the two main strategies:
- Pre-CoT (The Standard): The model is trained to output [Rationale] [Label]. At inference time, the student generates the reasoning, which supposedly guides it to the correct label.
- Post-CoT (The Intervention): The model is trained to output [Label] [Rationale].
Crucial Note: In the Post-CoT setting, when the model is used in the real world (inference), it generates the label immediately. It does not need to generate the rationale to get the answer. The rationale is only there as a training signal to update the weights during the fine-tuning process.
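To make the two target formats concrete, here is a minimal sketch of how the fine-tuning strings could be assembled. The templates and separators are illustrative assumptions, not the paper’s exact formatting.

```python
# Minimal sketch: building Pre-CoT vs. Post-CoT fine-tuning targets.
# The exact separators and templates are assumptions, not the paper's formatting.

def build_example(question: str, rationale: str, label: str, mode: str) -> str:
    """Return a single training string for the student model."""
    if mode == "pre_cot":       # Rationale -> Label (standard CoT distillation)
        target = f"{rationale} So the answer is: {label}"
    elif mode == "post_cot":    # Label -> Rationale (the paper's intervention)
        target = f"The answer is: {label} Because: {rationale}"
    else:                       # No-CoT baseline: label only
        target = f"The answer is: {label}"
    return f"Question: {question}\n{target}"

example = build_example(
    question="Why did the car slide?",
    rationale="Ice has less friction than pavement.",
    label="The ice",
    mode="post_cot",
)
print(example)
```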
If the “learning to reason” hypothesis is true, Pre-CoT should vastly outperform Post-CoT, because Pre-CoT allows the model to “think” before answering. Let’s see what happened.
Experiment 1: Does Placement Matter?
The first “mystery” the authors investigated was the positioning of the rationale.
The Results
The researchers trained the student models on both configurations and compared them against a baseline (No CoT, just Question \(\rightarrow\) Answer).

Table 1 reveals a striking result. Look at the rows for CoT after Label (Post-CoT). Across almost every dataset and model (GPT-2, Phi-1.5, Gemma-2B), placing the reasoning after the label results in superior performance compared to placing it before.
Why is this happening?
This result suggests that the student model does not need to perform “inference-time reasoning” to benefit from the data. The mere presence of the rationale tokens during the backward pass (training updates) allows the model to learn better representations of the input.
To understand how the model learns differently, the authors used a technique called the Tuned Lens. This technique allows us to peek inside the layers of the Transformer to see what the model is “thinking” at different depths.

Figure 2 shows the confidence of the model in predicting the correct label (y-axis) as we move deeper through the layers (x-axis) for three different setups:
- Left (No CoT): The model is unsure until the very last layers.
- Middle (Pre-CoT): The model gains confidence earlier.
- Right (Post-CoT): The model “locks in” on the correct answer extremely early in the network (around layer 20-30).
This indicates that training with the rationale after the label creates a stronger gradient signal that forces the earlier layers of the network to recognize the correct answer. The rationale acts as a powerful regularizer or feature-highlighter during training, even if it isn’t generated during testing.
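The Tuned Lens itself learns a small translator per layer, but even a rough logit-lens-style probe conveys the idea: decode each intermediate hidden state with the model’s own final norm and unembedding, and watch when the correct answer token becomes likely. A minimal sketch with Hugging Face transformers, where the model, prompt, and answer token are placeholders:

```python
# Rough logit-lens-style probe (the paper uses the more precise Tuned Lens).
# Model name, prompt, and answer token are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the distilled student
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "Question: Why did the car slide? The answer is:"
answer_token_id = tok(" ice", add_special_tokens=False)["input_ids"][0]

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"))

# Decode each layer's last-position hidden state with the final norm + unembedding.
# (For simplicity, ln_f is applied to every layer, including the final one.)
for layer, hidden in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1, :]))
    prob = torch.softmax(logits, dim=-1)[0, answer_token_id].item()
    print(f"layer {layer:2d}: P(answer token) = {prob:.4f}")
```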
We can see this even more clearly in the individual heatmap breakdowns for a specific example about electric cars:

Figure 6 (No-CoT) shows the baseline model only becoming confident (red color) at the very final layer (Layer 40+).

Figure 8 (Post-CoT), however, shows the model figuring out the answer is “C” as early as Layer 27. This suggests that the training signal from the future tokens (the rationale) has backpropagated to organize the early layers more efficiently.
Is it just “More Compute”?
A common counter-argument in LLM research is that CoT works simply because generating more tokens gives the model more “time” (computational depth) to process the answer.
The authors tested this by padding the input with <unk> (unknown) tokens instead of reasoning words.
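A minimal sketch of such a control, replacing the rationale with length-matched filler tokens (the template and whitespace-based token counting are simplifying assumptions):

```python
# Sketch of the "dummy token" control: match the rationale's length,
# but strip its content. Filler token and template are illustrative assumptions.

def dummy_padded_target(label: str, rationale: str, filler: str = "<unk>") -> str:
    n_tokens = len(rationale.split())         # crude whitespace token count
    padding = " ".join([filler] * n_tokens)   # same length, no semantics
    return f"The answer is: {label} {padding}"

print(dummy_padded_target("The ice", "Ice has less friction than pavement."))
# -> "The answer is: The ice <unk> <unk> <unk> <unk> <unk> <unk>"
```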

Figure 3 shows that while adding some dummy tokens helps slightly (the “compute” argument has some merit), it plateaus quickly and never reaches the performance of true CoT. This indicates that the content of the rationale matters, not just its length.
Experiment 2: The Coherence Check
If the content matters, does the logic matter?
In the Pre-CoT paradigm, we assume the model learns semantic dependencies: “A implies B, therefore C.” If this is true, the grammar and order of the rationale words should be vital.
The researchers performed a “Shuffling” ablation. They took the coherent rationales generated by the teacher and randomized the word order.
- Original: “The answer is B because ice has less friction.”
- Shuffled: “Friction less ice because B answer is the.”
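This ablation is easy to reproduce; a minimal sketch, assuming simple whitespace tokenization of the teacher’s rationale:

```python
import random

def shuffle_rationale(rationale: str, seed: int = 0) -> str:
    """Destroy word order (and thus syntax and logic) while keeping the bag of words."""
    words = rationale.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

print(shuffle_rationale("The answer is B because ice has less friction."))
```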
The Results

Table 2 presents perhaps the most counter-intuitive finding of the paper:
- Pre-CoT (Shuffled): Performance crashes. This makes sense—if the model has to generate garbage text before the answer, it confuses itself.
- Post-CoT (Shuffled): Performance remains almost identical to the coherent CoT.
Takeaway: When the rationale is placed after the label, the model does not care about grammar or logic. It essentially treats the rationale as a “bag of words.” It learns that when the answer is “Ice,” the future tokens will likely contain “friction,” “smooth,” and “slippery.” It associates these semantic concepts with the answer label, regardless of the sentence structure.
Masking the Rationale
To push this further, they tried deleting parts of the rationale. How much of the “explanation” can we delete before the benefit disappears?
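A sketch of random rationale masking under the same assumptions (the paper’s exact masking scheme may differ; here words are simply dropped at a fixed rate):

```python
import random

def mask_rationale(rationale: str, mask_rate: float = 0.6, seed: int = 0) -> str:
    """Randomly drop a fraction of rationale words before building the training target."""
    rng = random.Random(seed)
    kept = [w for w in rationale.split() if rng.random() > mask_rate]
    return " ".join(kept)

print(mask_rationale("Ice has less friction than pavement, so tires lose grip."))
```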

Figure 4 illustrates that for Post-CoT (the orange line), you can mask (delete) nearly 60% of the rationale tokens without significantly hurting performance. The model only needs a few key words to maintain the performance boost.
Experiment 3: Finding the “Golden” Tokens
If the model only needs a “bag of words,” and it can survive 60% deletion, which words are actually doing the heavy lifting?
The authors used a method called Integrated Gradients (IG) to identify the specific tokens in the rationale that contributed most to the prediction of the label.
The Math
Integrated Gradients approximates the integral of the gradients of the model’s output with respect to the input.

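In its standard form, the Integrated Gradients attribution for input dimension \(i\), with actual input \(x\), a baseline input \(x'\), and model output \(F\) (here, the probability assigned to the correct label), is:

\[
\mathrm{IG}_i(x) \;=\; (x_i - x'_i) \int_{\alpha=0}^{1} \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\, d\alpha
\]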
In simple terms, this equation asks: “If I change this specific word in the rationale, how much does the probability of the correct answer change?”
They extracted the top 15 most important tokens according to this metric and threw away the rest of the rationale. They also compared this to words selected by humans as being important.
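Once per-token attribution scores are available, keeping only the highest-scoring rationale tokens is simple; a minimal sketch with made-up scores for illustration:

```python
def top_k_tokens(tokens: list[str], scores: list[float], k: int = 15) -> list[str]:
    """Keep the k rationale tokens with the highest attribution scores, in original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(ranked)]

# Toy usage with fabricated attribution scores:
toks = ["ice", "has", "less", "friction", "than", "pavement"]
attr = [0.9, 0.1, 0.3, 0.8, 0.05, 0.6]
print(top_k_tokens(toks, attr, k=3))  # -> ['ice', 'friction', 'pavement']
```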

Figure 5 visualizes the difference. The algorithm (Left) often picks tokens that humans might ignore, or statistical correlations that are specific to the dataset, whereas humans (Right) pick words that semantically explain the answer.
The Results

Table 3 shows the results of training the student model using only these 15 attributed tokens appended to the label (Post-CoT style).
- Grad Attr (Integrated Gradients): The performance is on par with the full, coherent CoT.
- Human Labels: Performance drops significantly.
- Word2Vec (Just similar words): Performance drops to baseline.
What does this mean?
This confirms that the benefit of CoT distillation comes from specific key tokens that provide rich semantic signals.
Crucially, the tokens that help the model are not necessarily the ones humans think are important. They are the tokens that have the highest gradient leverage within the high-dimensional space of the LLM. Simply finding words that are “similar” to the answer (Word2Vec) isn’t enough; the model needs the specific contextual tokens that the teacher model produced.
Conclusion and Implications
This paper peels back the curtain on one of the most popular techniques in modern NLP. The findings are sobering but incredibly useful for practitioners.
Key Takeaways
- Post-CoT is Superior: You do not need to force small models to generate long chains of reasoning at inference time. You can get better accuracy by training them with the reasoning after the answer. This makes your model faster and cheaper to run while keeping the accuracy gains.
- Logic is Optional (for Distillation): The student model isn’t learning the step-by-step logic of the teacher. It is learning a probabilistic association between the “Question + Answer” and a specific “Bag of Reasoning Words.”
- Efficiency: You can strip down your training data. You don’t need full paragraphs of explanation. A small set of high-gradient keywords appended to your training labels can yield the same results as full CoT training.
The “Why”
The authors conclude that CoT-augmented distillation works not by teaching “thinking,” but by feature enrichment.
When we force the model to predict the rationale (even after the label), we force the internal representations (the hidden states) to contain information about those rationale words. Because those rationale words are semantically related to the correct answer, the model’s representation of the answer becomes richer and more distinct from the incorrect options.
This paper is a reminder that in deep learning, anthropomorphizing our models—assuming they “think” like us—can lead us astray. Sometimes, what looks like reasoning is just really, really good statistical correlation. And knowing that allows us to build more efficient, robust systems.