Bridging the Gap: How Dual-Space Knowledge Distillation Unifies Teacher and Student LLMs
The current era of Artificial Intelligence is defined by the “Scaling Law.” We have seen that increasing the parameter count of Large Language Models (LLMs) consistently yields better generalization and reasoning capabilities. However, this pursuit of intelligence comes with a hefty price tag. Models like LLaMA2-70B or GPT-4 are massive, making them expensive to deploy and slow to run in real-world scenarios.
This has led to a surge in interest in Knowledge Distillation (KD). The premise of KD is elegant: take a massive, intelligent “Teacher” model and use it to train a smaller, faster “Student” model. The goal is to compress the teacher’s knowledge into a lightweight package without losing too much performance.
However, researchers from Beijing Jiaotong University have identified a critical flaw in how we currently perform this distillation. In their paper, Dual-Space Knowledge Distillation for Large Language Models, they argue that the standard approach ignores a fundamental “space discrepancy” between the teacher and the student.
In this post, we will dive deep into this research. We will explore why current methods fail to transfer knowledge effectively, how the proposed Dual-Space Knowledge Distillation (DSKD) framework fixes this by unifying the models’ output spaces, and how a novel attention mechanism allows this to work even when models speak different “languages” (vocabularies).
The Status Quo: White-Box Knowledge Distillation
To understand the innovation, we must first understand the baseline. In standard “White-Box” KD, the student model tries to mimic the probability distribution of the teacher.
When an LLM processes a sequence of text, it calculates a probability for every possible next token. The student is usually trained on two objectives simultaneously:
- Standard Learning: Predicting the correct next word from the dataset (Cross-Entropy Loss).
- Distillation: Minimizing the difference between its predicted probability distribution and the teacher’s distribution.
Mathematically, the standard language modeling loss looks like this:
\[
\mathcal{L}_{\text{ce}} = -\sum_{i=1}^{n} \log q_\theta\!\left(x_i^* \mid x_{<i}^*\right)
\]
Here, the student maximizes the probability of the correct token \(x_i^*\) at each position.
Simultaneously, the KD loss forces the student \(q_\theta\) to match the teacher \(p\):
\[
\mathcal{L}_{\text{kd}} = \sum_{i=1}^{n} \mathcal{D}\!\left(p(\cdot \mid x_{<i}^*) \,\big\|\, q_\theta(\cdot \mid x_{<i}^*)\right)
\]
\(\mathcal{D}\) represents a distance function, commonly Kullback-Leibler (KL) divergence. The idea is simple: if the Teacher thinks “dog” is 80% likely and “cat” is 20% likely, the Student should learn that distribution, capturing the nuances of the Teacher’s reasoning.
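To make the baseline concrete, here is a minimal PyTorch sketch of this objective (my own illustration, not the paper's code). It assumes the two models share a vocabulary, instantiates \(\mathcal{D}\) as forward KL divergence, and introduces a hypothetical `kd_weight` coefficient to balance the two terms:

```python
import torch
import torch.nn.functional as F

def white_box_kd_loss(student_logits, teacher_logits, target_ids, kd_weight=0.5):
    """student_logits, teacher_logits: [batch, seq, vocab]; target_ids: [batch, seq]."""
    # Standard language-modeling loss on the gold next tokens.
    ce = F.cross_entropy(student_logits.flatten(0, 1), target_ids.flatten())
    # Distillation loss D(p || q), instantiated here as forward KL divergence.
    p = F.softmax(teacher_logits, dim=-1).flatten(0, 1)          # teacher p
    log_q = F.log_softmax(student_logits, dim=-1).flatten(0, 1)  # student q_theta (log)
    kd = F.kl_div(log_q, p, reduction="batchmean")
    return (1 - kd_weight) * ce + kd_weight * kd

# Example shapes: batch of 2, sequence length 5, vocabulary of 100.
loss = white_box_kd_loss(torch.randn(2, 5, 100), torch.randn(2, 5, 100),
                         torch.randint(0, 100, (2, 5)))
```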
The Hidden Problem: Space Discrepancy
The researchers argue that the equation above hides a significant issue. The Student and the Teacher are different models. They have different architectures, different hidden dimension sizes, and crucially, different prediction heads.
The prediction head is the final layer of the neural network that transforms internal hidden states into the probability distribution over the vocabulary. Because the Student and Teacher use their own prediction heads, their output distributions live in different “spaces.”
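To see what “different spaces” means in practice, here is a tiny illustration (the shapes are chosen arbitrarily and are not taken from any real model):

```python
import torch
import torch.nn.functional as F

vocab_size = 32000
student_hidden = torch.randn(1, 5, 768)    # student hidden states, size d = 768
teacher_hidden = torch.randn(1, 5, 4096)   # teacher hidden states, size D = 4096

student_head = torch.nn.Linear(768, vocab_size, bias=False)   # student's W^s
teacher_head = torch.nn.Linear(4096, vocab_size, bias=False)  # teacher's W^t

q = F.softmax(student_head(student_hidden), dim=-1)  # student distribution
p = F.softmax(teacher_head(teacher_hidden), dim=-1)  # teacher distribution
# Standard KD compares p and q directly, even though they were produced by two
# different heads acting on hidden states of different dimensionality.
```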
The authors hypothesized that minimizing the distance between distributions from different spaces is inefficient. It forces the distributions to look alike, but it doesn’t necessarily force the internal representations (the model’s “thoughts”) to align.
Visualizing the Failure
To prove this, the authors conducted a simulation. They created a synthetic Teacher and Student with 2D hidden states. They ran standard KD (minimizing KL divergence) and visualized the internal hidden states before and after training.

Look closely at Figure 1 above:
- (b) After KD (different heads): This represents the current standard method. Even after training, the Student’s representations (red triangles) and Teacher’s representations (blue stars) are structurally distinct. They don’t overlap. The knowledge transfer is incomplete.
- (c) After KD (shared head): When the authors forced the models to share the same prediction head (unifying the output space), the representations aligned perfectly.
- (d) Loss Curves: The orange line shows that sharing the output space leads to much faster convergence and a lower final loss compared to the standard approach (blue line).
This simulation confirmed the hypothesis: The space discrepancy limits the similarity between the teacher and student.
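If you want to poke at this effect yourself, the following self-contained sketch reproduces the spirit of the simulation. It is not the authors' code; the 2-D hidden states mirror their setup, but the vocabulary size, step count, and optimizer settings are arbitrary:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, n = 8, 64                           # toy vocabulary size and token count

teacher_h = torch.randn(n, 2)              # frozen 2-D teacher hidden states
teacher_head = torch.randn(2, vocab)       # teacher's prediction head
student_head_own = torch.randn(2, vocab)   # a separate student prediction head

def distill(shared_head, steps=2000, lr=0.1):
    """Optimize 2-D student hidden states to match the teacher's distribution."""
    student_h = torch.randn(n, 2, requires_grad=True)
    opt = torch.optim.Adam([student_h], lr=lr)
    head = teacher_head if shared_head else student_head_own
    p = F.softmax(teacher_h @ teacher_head, dim=-1)    # teacher distribution
    for _ in range(steps):
        loss = F.kl_div(F.log_softmax(student_h @ head, dim=-1), p,
                        reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        final = F.kl_div(F.log_softmax(student_h @ head, dim=-1), p,
                         reduction="batchmean").item()
    return student_h.detach(), final

_, loss_diff = distill(shared_head=False)
_, loss_shared = distill(shared_head=True)
print(f"final KD loss, different heads: {loss_diff:.4f}")
print(f"final KD loss, shared head:     {loss_shared:.4f}")
# With a shared head the student's 2-D states can move onto the teacher's; with
# separate heads the distributions may get close while the states stay apart.
```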
The Solution: Dual-Space Knowledge Distillation (DSKD)
The authors propose a new framework: Dual-Space Knowledge Distillation (DSKD). The core idea is to project the hidden states of one model into the representation space of the other, effectively allowing them to share prediction heads.
Instead of just comparing outputs in their separate spaces, DSKD performs distillation in two unified spaces: the Student Space and the Teacher Space.
1. Distillation in the Student Space
First, we want to bring the Teacher’s knowledge into the Student’s world.
The Teacher’s hidden states (\(\mathbf{h}^t\)) usually have a larger dimension than the Student’s. We use a trainable linear projector \(\mathcal{P}^{t \to s}\) to shrink the Teacher’s hidden states down to the Student’s dimension (\(d\)).
\[
\mathbf{h}^{t \to s} = \mathcal{P}^{t \to s}\!\left(\mathbf{h}^{t}\right) \in \mathbb{R}^{n \times d}
\]
Once projected, we pass these “student-sized” Teacher states through the Student’s prediction head (\(\mathbf{W}^s\)). This generates a probability distribution \(\mathbf{p}^{t \to s}\).
\[
\mathbf{p}^{t \to s} = \mathrm{softmax}\!\left(\mathbf{h}^{t \to s}\,\mathbf{W}^{s}\right)
\]
Because the projector is initialized randomly, we need to train it to make sure it preserves the Teacher’s knowledge. We add a loss term to ensure the projected Teacher states can still predict the correct ground-truth tokens:
\[
\mathcal{L}_{\text{ce}}^{t \to s} = -\sum_{i=1}^{n} \log \mathbf{p}^{t \to s}\!\left(x_i^* \mid x_{<i}^*\right)
\]
Now, both the Student (naturally) and the Teacher (via projection) are producing distributions using the same prediction head (\(\mathbf{W}^s\)). They are in the same output space. We can now calculate the KD loss in the Student Space:
\[
\mathcal{L}_{\text{kd}}^{t \to s} = \sum_{i=1}^{n} \mathcal{D}\!\left(\mathbf{p}^{t \to s}_i \,\big\|\, q_\theta(\cdot \mid x_{<i}^*)\right)
\]
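Putting the student-space pass together, here is a hedged PyTorch sketch (same-vocabulary setting; names such as `proj_t2s` and the choice of forward KL for \(\mathcal{D}\) are my own, not the authors'):

```python
import torch
import torch.nn.functional as F

d, D, vocab = 768, 4096, 32000                        # student dim, teacher dim, vocab
proj_t2s = torch.nn.Linear(D, d)                      # trainable projector P^{t->s}
student_head = torch.nn.Linear(d, vocab, bias=False)  # student's own head W^s

def student_space_losses(student_hidden, teacher_hidden, target_ids):
    """student_hidden: [n, d]; teacher_hidden: [n, D]; target_ids: [n]."""
    # 1) Shrink the teacher's states into the student's representation space.
    h_t2s = proj_t2s(teacher_hidden)                  # [n, d]
    # 2) Both sides now go through the same head -> same output space.
    logits_t2s = student_head(h_t2s)                  # p^{t->s} logits
    logits_s = student_head(student_hidden)           # q_theta logits
    # 3) Auxiliary CE so the randomly initialized projector preserves the
    #    teacher's ability to predict the ground-truth tokens.
    aux_ce = F.cross_entropy(logits_t2s, target_ids)
    # 4) KD loss in the student space; the projected teacher distribution is
    #    treated as a fixed target in this sketch.
    p_t2s = F.softmax(logits_t2s, dim=-1).detach()
    log_q = F.log_softmax(logits_s, dim=-1)
    kd = F.kl_div(log_q, p_t2s, reduction="batchmean")
    return aux_ce, kd
```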
2. Distillation in the Teacher Space
DSKD is symmetric. We also want to push the Student to understand the Teacher’s world.
We use another projector \(\mathcal{P}^{s \to t}\) to expand the Student’s hidden states up to the Teacher’s dimension (\(D\)).
\[
\mathbf{h}^{s \to t} = \mathcal{P}^{s \to t}\!\left(\mathbf{h}^{s}\right) \in \mathbb{R}^{n \times D}
\]
We then pass these projected Student states through the Teacher’s prediction head (\(\mathbf{W}^t\)) to get a distribution \(\mathbf{q}^{s \to t}\).
\[
\mathbf{q}^{s \to t} = \mathrm{softmax}\!\left(\mathbf{h}^{s \to t}\,\mathbf{W}^{t}\right)
\]
Since the Teacher’s head is already fixed and pre-trained, we don’t need an auxiliary loss here. We simply calculate the KD loss (specifically KL divergence) in the Teacher Space:
\[
\mathcal{L}_{\text{kd}}^{s \to t} = \sum_{i=1}^{n} \mathrm{KL}\!\left(p(\cdot \mid x_{<i}^*) \,\big\|\, \mathbf{q}^{s \to t}_i\right)
\]
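The mirror-image pass in the teacher space looks like this in the same sketchy style (again my own illustration; `proj_s2t` and the shapes are assumptions):

```python
import torch
import torch.nn.functional as F

d, D, vocab = 768, 4096, 32000
proj_s2t = torch.nn.Linear(d, D)                      # trainable projector P^{s->t}
teacher_head = torch.nn.Linear(D, vocab, bias=False)  # teacher's head W^t (frozen)
for param in teacher_head.parameters():
    param.requires_grad_(False)

def teacher_space_loss(student_hidden, teacher_hidden):
    """student_hidden: [n, d]; teacher_hidden: [n, D]."""
    logits_s2t = teacher_head(proj_s2t(student_hidden))              # q^{s->t} logits
    with torch.no_grad():
        p_teacher = F.softmax(teacher_head(teacher_hidden), dim=-1)  # teacher's own p
    log_q_s2t = F.log_softmax(logits_s2t, dim=-1)
    # KL divergence in the teacher's output space; no auxiliary CE is needed
    # because the teacher's head is already trained.
    return F.kl_div(log_q_s2t, p_teacher, reduction="batchmean")
```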
3. The Unified Objective
The final training objective combines the distillation losses from both spaces, along with the auxiliary loss used to train the projector.
\[
\mathcal{L}_{\text{dskd}} = \mathcal{L}_{\text{kd}}^{t \to s} + \mathcal{L}_{\text{ce}}^{t \to s} + \mathcal{L}_{\text{kd}}^{s \to t}
\]
By minimizing this combined loss, DSKD ensures that the Student mimics the Teacher not just in final probability, but in the internal representation structure, maximizing the similarity between the two models.
Solving the Vocabulary Mismatch: Cross-Model Attention
There is one major hurdle left. The method described above assumes the Student and Teacher share the same vocabulary (and thus the same tokenizer). Because they tokenize any input identically, their token sequences always have the same length.
But in the modern LLM landscape, this is rarely true. You might want to distill knowledge from Qwen (which uses a different vocabulary) into GPT-2, or from Mistral into TinyLLaMA. If the tokenizers differ, the sequence “I like AI” might be 3 tokens for the Student but 4 tokens for the Teacher. You cannot simply project the hidden states, because the sequence lengths (\(n\) and \(m\)) no longer match.
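You can see the mismatch with a few lines of Hugging Face `transformers` (the checkpoint names below are just plausible picks, and the exact token counts depend on the tokenizers you load):

```python
from transformers import AutoTokenizer

student_tok = AutoTokenizer.from_pretrained("gpt2")
teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")

text = "I like AI"
print(len(student_tok.tokenize(text)))   # e.g. 3 tokens for the student
print(len(teacher_tok.tokenize(text)))   # possibly a different count for the teacher
```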
To solve this, the authors introduce a Cross-Model Attention (CMA) mechanism.
Aligning Tokens with Attention
Instead of a simple linear projection, the authors treat the alignment problem as an attention task.
- Query (Student): The Student’s embeddings and target tokens act as the Query (\(Q\)).
- Key & Value (Teacher): The Teacher’s embeddings and hidden states act as the Key (\(K\)) and Value (\(V\)).

By computing the attention between the Student’s query and the Teacher’s key, the model learns an Attention Matrix (\(\mathbf{a}^{t \to s}\)). This matrix represents the alignment relationship—it tells us which tokens in the Teacher’s sequence correspond to a specific token in the Student’s sequence.
\[
\mathbf{a}^{t \to s} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)
\]
We then use this matrix to compute a weighted sum of the Teacher’s values. The result is a sequence of Teacher hidden states that is perfectly aligned—token by token—with the Student’s sequence length.
\[
\tilde{\mathbf{h}}^{t \to s} = \mathbf{a}^{t \to s}\,\mathbf{V}
\]
This aligned representation \(\tilde{\mathbf{h}}\) can now be plugged directly into the DSKD framework described in the previous section. The authors also perform the reverse operation (Student-to-Teacher alignment) to support dual-space distillation.
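Here is a hedged sketch of the alignment step: standard scaled dot-product attention in which the student side supplies queries and the teacher side supplies keys and values. The paper builds its queries and keys from token embeddings; the simple linear maps `to_q` and `to_k` below merely stand in for that construction:

```python
import math
import torch
import torch.nn.functional as F

d, D = 768, 4096                        # student / teacher hidden sizes
n, m = 3, 4                             # student / teacher sequence lengths
to_q = torch.nn.Linear(d, d)            # stand-in query projection (assumption)
to_k = torch.nn.Linear(D, d)            # stand-in key projection (assumption)

def align_teacher_to_student(student_states, teacher_states):
    """student_states: [n, d]; teacher_states: [m, D] -> aligned teacher [n, D]."""
    q = to_q(student_states)                            # queries from the student
    k = to_k(teacher_states)                            # keys from the teacher
    attn = F.softmax(q @ k.t() / math.sqrt(d), dim=-1)  # a^{t->s}: [n, m] alignment
    return attn @ teacher_states                        # weighted sum of teacher values

aligned = align_teacher_to_student(torch.randn(n, d), torch.randn(m, D))
print(aligned.shape)                                    # torch.Size([3, 4096])
```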

This mechanism makes DSKD a universal framework, applicable to any pair of LLMs regardless of their vocabularies.
Experimental Results
The authors evaluated DSKD on instruction-following benchmarks (like Dolly, Self-Instruct, and Vicuna-Evaluation). They tested two scenarios: models with the same vocabulary and models with different vocabularies.
Scenario 1: Same Vocabulary (GPT-2)
In the first set of experiments, they distilled a GPT2-1.5B (Teacher) into a GPT2-120M (Student). They compared DSKD against standard KD methods like vanilla KL, Reverse KL (RKL), and Jensen-Shannon (JS) divergence.

As shown in Table 1, adding the “w/ DSKD” component consistently improved performance across all distance functions. For example, using DSKD with JS divergence boosted the average ROUGE-L score from 16.98 to 18.37, a significant jump.
Scenario 2: Same Vocabulary (LLaMA)
They scaled up the experiment using LLaMA2-7B as the teacher and TinyLLaMA-1.1B as the student.

Table 2 confirms the trend. DSKD achieved substantial gains. Notably, using Adaptive KL (AKL) with DSKD resulted in a massive performance increase, raising the average score from 23.30 to 26.65.
Scenario 3: Different Vocabularies
The bottom sections of Table 1 and Table 2 (labeled “Different Vocabularies”) show the power of the Cross-Model Attention (CMA) mechanism.
- Qwen \(\to\) GPT-2: DSKD-CMA outperformed existing cross-tokenizer methods like MinED and ULD.
- Mistral \(\to\) TinyLLaMA: DSKD-CMA-AKL achieved an average score of 26.99, beating the best baseline (MinED) by over 5 points. Interestingly, this result was even better than the LLaMA2 \(\to\) TinyLLaMA distillation, likely because Mistral is a stronger teacher than LLaMA2.
Analysis: Why does it work?
To verify that the “Dual-Space” concept was the driver of success, the authors performed an ablation study. They compared:
- Diff. Space: Standard KD (different heads).
- Student Space: Only projecting Teacher \(\to\) Student.
- DSKD: Both Student and Teacher spaces.

Table 3 shows that simply moving the distillation into the Student Space provides a major boost. Adding the Teacher Space (DSKD) refines the performance further.
Furthermore, the authors measured the actual geometric distance between the representation structures of the Student and Teacher after training.

Figure 3 validates the initial hypothesis: DSKD (orange box) results in a much smaller distance between the representation structures compared to Vanilla KD (green box). The student is truly “thinking” more like the teacher.
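For intuition, one simple way to compare “representation structures” is to build each model's token-to-token similarity matrix and measure how far the two matrices are apart; this is an illustrative metric, not necessarily the one used in the paper:

```python
import torch
import torch.nn.functional as F

def structure_distance(student_hidden, teacher_hidden):
    """student_hidden: [n, d]; teacher_hidden: [n, D], already token-aligned."""
    s = F.normalize(student_hidden, dim=-1)
    t = F.normalize(teacher_hidden, dim=-1)
    sim_s = s @ s.t()                   # [n, n] pairwise cosine structure (student)
    sim_t = t @ t.t()                   # [n, n] pairwise cosine structure (teacher)
    return (sim_s - sim_t).norm() / sim_s.numel() ** 0.5   # RMS gap between structures

print(structure_distance(torch.randn(16, 768), torch.randn(16, 4096)).item())
```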
Finally, they asked GPT-4 to judge the quality of the generated responses.

As seen in Figure 2, the DSKD-trained model was preferred by GPT-4 in the majority of cases compared to the baseline model.
Conclusion
The paper Dual-Space Knowledge Distillation highlights a subtle but critical inefficiency in how we compress Large Language Models. By treating the Teacher and Student as inhabiting separate vector spaces, standard Knowledge Distillation fails to fully align their representations.
DSKD solves this by explicitly projecting hidden states to unify the output spaces. When combined with Cross-Model Attention, it provides a robust framework that works even when models use completely different tokenizers.
As we continue to rely on massive LLMs, techniques like DSKD will be essential. They allow us to extract the high-level reasoning of massive models and inject it into compact, deployable students, making AI more accessible and efficient for everyone.
Experimental Details Appendix
For those interested in the specific hyperparameters used to achieve these results, the authors provided their training configurations, noting that the distillation temperature \(\tau\) was set to 2.0 based on validation performance.

The overall loss function balanced the task learning and distillation equally:
\[
\mathcal{L} = \tfrac{1}{2}\,\mathcal{L}_{\text{ce}} + \tfrac{1}{2}\,\mathcal{L}_{\text{dskd}}
\]
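In code, the temperature enters by softening both distributions before the KL term. The equal 0.5/0.5 weighting follows the appendix text, while the conventional \(\tau^2\) rescaling and the single KD term are simplifying assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, target_ids, tau=2.0):
    """student_logits, teacher_logits: [n, vocab]; target_ids: [n]."""
    ce = F.cross_entropy(student_logits, target_ids)            # task learning
    p = F.softmax(teacher_logits / tau, dim=-1)                 # softened teacher
    log_q = F.log_softmax(student_logits / tau, dim=-1)         # softened student
    kd = F.kl_div(log_q, p, reduction="batchmean") * tau ** 2   # distillation
    return 0.5 * ce + 0.5 * kd
```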