Bridging the Gap: How Dual-Space Knowledge Distillation Unifies Teacher and Student LLMs

The current era of Artificial Intelligence is defined by the “Scaling Law.” We have seen that increasing the parameter count of Large Language Models (LLMs) consistently yields better generalization and reasoning capabilities. However, this pursuit of intelligence comes with a hefty price tag. Models like LLaMA-70B or GPT-4 are massive, making them incredibly expensive to deploy and slow to run in real-world scenarios.

This has led to a surge in interest in Knowledge Distillation (KD). The premise of KD is elegant: take a massive, intelligent “Teacher” model and use it to train a smaller, faster “Student” model. The goal is to compress the teacher’s knowledge into a lightweight package without losing too much performance.

However, researchers from Beijing Jiaotong University have identified a critical flaw in how we currently perform this distillation. In their paper, Dual-Space Knowledge Distillation for Large Language Models, they argue that the standard approach ignores a fundamental “space discrepancy” between the teacher and the student.

In this post, we will dive deep into this research. We will explore why current methods fail to transfer knowledge effectively, how the proposed Dual-Space Knowledge Distillation (DSKD) framework fixes this by unifying the models’ output spaces, and how a novel attention mechanism allows this to work even when models speak different “languages” (vocabularies).

The Status Quo: White-Box Knowledge Distillation

To understand the innovation, we must first understand the baseline. In standard “White-Box” KD, the student model tries to mimic the probability distribution of the teacher.

When an LLM processes a sequence of text, it calculates a probability for every possible next token. The student is usually trained on two objectives simultaneously:

  1. Standard Learning: Predicting the correct next token from the dataset (Cross-Entropy Loss).
  2. Distillation: Minimizing the difference between its predicted probability distribution and the teacher’s distribution.

Mathematically, the standard language modeling loss looks like this:

\[
\mathcal{L}_{\mathrm{ce}} = -\sum_{i=1}^{n} \log q_{\theta}\big(x_i^{*} \mid \mathbf{x}_{<i}\big)
\]

Here, \(q_\theta\) denotes the model's predicted distribution, and minimizing this loss maximizes the probability of the ground-truth token \(x_i^*\) at each position.

Simultaneously, the KD loss forces the student \(q_\theta\) to match the teacher \(p\):

\[
\mathcal{L}_{\mathrm{kd}} = \sum_{i=1}^{n} \mathcal{D}\big(p(\cdot \mid \mathbf{x}_{<i}) \,\big\|\, q_{\theta}(\cdot \mid \mathbf{x}_{<i})\big)
\]

\(\mathcal{D}\) represents a distance function, commonly Kullback-Leibler (KL) divergence. The idea is simple: if the Teacher thinks “dog” is 80% likely and “cat” is 20% likely, the Student should learn that distribution, capturing the nuances of the Teacher’s reasoning.
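
To make these two objectives concrete, here is a minimal PyTorch sketch of the standard setup. It is an illustration rather than the paper's code: `student_logits` and `teacher_logits` are assumed to come from two models that share a vocabulary, and forward KL divergence plays the role of \(\mathcal{D}\).

```python
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, target_ids, kd_weight=0.5):
    """Standard white-box KD: task cross-entropy plus KL to the teacher.

    student_logits, teacher_logits: (batch, seq_len, vocab) over the SAME vocabulary.
    target_ids: (batch, seq_len) ground-truth next tokens.
    """
    vocab = student_logits.size(-1)

    # 1. Standard learning: predict the correct next token.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), target_ids.reshape(-1))

    # 2. Distillation: match the teacher's token-level distribution, D = KL(p || q).
    log_q = F.log_softmax(student_logits, dim=-1).reshape(-1, vocab)
    p = F.softmax(teacher_logits, dim=-1).reshape(-1, vocab)
    kd = F.kl_div(log_q, p, reduction="batchmean")

    return (1 - kd_weight) * ce + kd_weight * kd
```

Note that both distributions here are produced by each model's own prediction head. That detail is exactly what the paper calls into question.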

The Hidden Problem: Space Discrepancy

The researchers argue that the equation above hides a significant issue. The Student and the Teacher are different models. They have different architectures, different hidden dimension sizes, and crucially, different prediction heads.

The prediction head is the final layer of the neural network that transforms internal hidden states into the probability distribution over the vocabulary. Because the Student and Teacher use their own prediction heads, their output distributions live in different “spaces.”

The authors hypothesized that minimizing the distance between distributions from different spaces is inefficient. It forces the distributions to look alike, but it doesn’t necessarily force the internal representations (the model’s “thoughts”) to align.

Visualizing the Failure

To prove this, the authors conducted a simulation. They created a synthetic Teacher and Student with 2D hidden states. They ran standard KD (minimizing KL divergence) and visualized the internal hidden states before and after training.

Figure 1: Simulation results. (b) With different prediction heads, student and teacher representations remain distinct after KD; (c) sharing a prediction head aligns them perfectly; (d) the loss converges much faster with a shared head.

Look closely at Figure 1 above:

  • (b) After KD (different heads): This represents the current standard method. Even after training, the Student’s representations (red triangles) and Teacher’s representations (blue stars) are structurally distinct. They don’t overlap. The knowledge transfer is incomplete.
  • (c) After KD (shared head): When the authors forced the models to share the same prediction head (unifying the output space), the representations aligned perfectly.
  • (d) Loss Curves: The orange line shows that sharing the output space leads to much faster convergence and a lower final loss compared to the standard approach (blue line).

This simulation confirmed the hypothesis: The space discrepancy limits the similarity between the teacher and student.
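
The spirit of this simulation is easy to reproduce. Below is a small, self-contained re-creation (my own sketch, not the authors' script): two toy "models" are just sets of 2D hidden states with randomly initialized prediction heads, and we train the student states with KL divergence, with or without a shared head.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, n_tokens = 50, 2, 32

# Frozen "teacher": random 2D hidden states and a random prediction head.
h_teacher = torch.randn(n_tokens, dim)
W_teacher = torch.randn(dim, vocab)

# The "student" gets its own hidden states and its own (different) head.
h_student_init = torch.randn(n_tokens, dim)
W_student = torch.randn(dim, vocab)

def run_kd(share_head: bool, steps: int = 2000, lr: float = 0.1):
    """Minimize KL(teacher || student) over the student's hidden states only."""
    h = h_student_init.clone().requires_grad_(True)
    W = W_teacher if share_head else W_student
    opt = torch.optim.Adam([h], lr=lr)
    p = F.softmax(h_teacher @ W_teacher, dim=-1)      # fixed teacher distribution
    for _ in range(steps):
        log_q = F.log_softmax(h @ W, dim=-1)          # student distribution
        loss = F.kl_div(log_q, p, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return h.detach(), loss.item()

h_diff, loss_diff = run_kd(share_head=False)
h_shared, loss_shared = run_kd(share_head=True)
print(f"final KL with different heads: {loss_diff:.4f}")
print(f"final KL with a shared head:   {loss_shared:.4f}")
# Scatter-plotting h_diff and h_shared against h_teacher reproduces the gap in Figure 1:
# with a shared head the student's states migrate onto the teacher's; otherwise they don't.
```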

The Solution: Dual-Space Knowledge Distillation (DSKD)

The authors propose a new framework: Dual-Space Knowledge Distillation (DSKD). The core idea is to project the hidden states of one model into the representation space of the other, effectively allowing them to share prediction heads.

Instead of just comparing outputs in their separate spaces, DSKD performs distillation in two unified spaces: the Student Space and the Teacher Space.

1. Distillation in the Student Space

First, we want to bring the Teacher’s knowledge into the Student’s world.

The Teacher’s hidden states (\(\mathbf{h}^t\)) usually have a larger dimension than the Student’s. We use a trainable linear projector \(\mathcal{P}^{t \to s}\) to shrink the Teacher’s hidden states down to the Student’s dimension (\(d\)).

\[
\mathbf{h}^{t \to s} = \mathcal{P}^{t \to s}\big(\mathbf{h}^{t}\big) \in \mathbb{R}^{n \times d}
\]

Once projected, we pass these “student-sized” Teacher states through the Student’s prediction head (\(\mathbf{W}^s\)). This generates a probability distribution \(\mathbf{p}^{t \to s}\).

\[
\mathbf{p}^{t \to s} = \mathrm{softmax}\big(\mathbf{h}^{t \to s}\,\mathbf{W}^{s}\big)
\]

Because the projector is initialized randomly, we need to train it to make sure it preserves the Teacher’s knowledge. We add a loss term to ensure the projected Teacher states can still predict the correct ground-truth tokens:

\[
\mathcal{L}_{\mathrm{ce}}^{t \to s} = -\sum_{i=1}^{n} \log \mathbf{p}^{t \to s}\big(x_i^{*} \mid \mathbf{x}_{<i}\big)
\]

Now, both the Student (naturally) and the Teacher (via projection) are producing distributions using the same prediction head (\(\mathbf{W}^s\)). They are in the same output space. We can now calculate the KD loss in the Student Space:

\[
\mathcal{L}_{\mathrm{kd}}^{t \to s} = \sum_{i=1}^{n} \mathcal{D}\big(\mathbf{p}^{t \to s}(\cdot \mid \mathbf{x}_{<i}) \,\big\|\, q_{\theta}(\cdot \mid \mathbf{x}_{<i})\big)
\]
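
Putting the student-space half into code, the sketch below is my own illustration (assuming PyTorch and, for now, equal teacher and student sequence lengths): it projects the teacher's hidden states, pushes them through the student's prediction head, and returns both the auxiliary cross-entropy and the KD term.

```python
import torch.nn as nn
import torch.nn.functional as F

class StudentSpaceKD(nn.Module):
    """KD in the student space: teacher states re-expressed through the student's head."""

    def __init__(self, teacher_dim: int, student_dim: int):
        super().__init__()
        # Trainable projector P^{t->s}: teacher dimension D -> student dimension d.
        self.proj_t2s = nn.Linear(teacher_dim, student_dim)

    def forward(self, h_teacher, student_head, student_logits, target_ids):
        """h_teacher: (B, T, D) frozen teacher hidden states.
        student_head: the student's prediction head (a linear map d -> vocab).
        student_logits: (B, T, vocab) the student's own logits.
        target_ids: (B, T) ground-truth tokens."""
        vocab = student_logits.size(-1)

        # Teacher states, shrunk to the student's dimension and decoded by W^s.
        h_t2s = self.proj_t2s(h_teacher.detach())                      # (B, T, d)
        logits_t2s = student_head(h_t2s)                               # (B, T, vocab)

        # Auxiliary CE so the randomly initialized projector preserves the teacher's knowledge.
        ce_t2s = F.cross_entropy(logits_t2s.reshape(-1, vocab), target_ids.reshape(-1))

        # KD in the student space: both distributions now come from the SAME head W^s.
        # The projected teacher distribution is treated as a fixed target here
        # (a simplification); the projector itself learns through the CE term above.
        p_t2s = F.softmax(logits_t2s, dim=-1).detach().reshape(-1, vocab)
        log_q = F.log_softmax(student_logits, dim=-1).reshape(-1, vocab)
        kd_t2s = F.kl_div(log_q, p_t2s, reduction="batchmean")

        return kd_t2s, ce_t2s
```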

2. Distillation in the Teacher Space

DSKD is symmetric. We also want to push the Student to understand the Teacher’s world.

We use another projector \(\mathcal{P}^{s \to t}\) to expand the Student’s hidden states up to the Teacher’s dimension (\(D\)).

\[
\mathbf{h}^{s \to t} = \mathcal{P}^{s \to t}\big(\mathbf{h}^{s}\big) \in \mathbb{R}^{n \times D}
\]

We then pass these projected Student states through the Teacher’s prediction head (\(\mathbf{W}^t\)) to get a distribution \(\mathbf{q}^{s \to t}\).

\[
\mathbf{q}^{s \to t} = \mathrm{softmax}\big(\mathbf{h}^{s \to t}\,\mathbf{W}^{t}\big)
\]

Since the Teacher’s head is already fixed and pre-trained, we don’t need an auxiliary loss here. We simply calculate the KD loss (specifically KL divergence) in the Teacher Space:

\[
\mathcal{L}_{\mathrm{kd}}^{s \to t} = \sum_{i=1}^{n} \mathrm{KL}\big(p(\cdot \mid \mathbf{x}_{<i}) \,\big\|\, \mathbf{q}^{s \to t}(\cdot \mid \mathbf{x}_{<i})\big)
\]
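
The mirror image in the teacher space is even simpler, because the teacher's head is already trained and no auxiliary loss is needed. Again, this is a sketch under the same assumptions, not the released implementation; the teacher's parameters are assumed to be frozen (`requires_grad=False`).

```python
import torch.nn as nn
import torch.nn.functional as F

class TeacherSpaceKD(nn.Module):
    """KD in the teacher space: student states re-expressed through the teacher's head."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Trainable projector P^{s->t}: student dimension d -> teacher dimension D.
        self.proj_s2t = nn.Linear(student_dim, teacher_dim)

    def forward(self, h_student, teacher_head, teacher_logits):
        """h_student: (B, T, d) student hidden states (gradients flow through these).
        teacher_head: the teacher's frozen prediction head (a linear map D -> vocab).
        teacher_logits: (B, T, vocab) the teacher's own logits."""
        vocab = teacher_logits.size(-1)

        # Student states, expanded to the teacher's dimension and decoded by W^t.
        h_s2t = self.proj_s2t(h_student)                               # (B, T, D)
        logits_s2t = teacher_head(h_s2t)                               # (B, T, vocab)

        # KD in the teacher space: KL(p || q^{s->t}), both under the teacher's head.
        p = F.softmax(teacher_logits, dim=-1).reshape(-1, vocab)
        log_q_s2t = F.log_softmax(logits_s2t, dim=-1).reshape(-1, vocab)
        return F.kl_div(log_q_s2t, p, reduction="batchmean")
```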

3. The Unified Objective

The final training objective combines the distillation losses from both spaces, along with the auxiliary loss used to train the projector.

\[
\mathcal{L}_{\mathrm{dskd}} = \mathcal{L}_{\mathrm{kd}}^{t \to s} + \mathcal{L}_{\mathrm{kd}}^{s \to t} + \mathcal{L}_{\mathrm{ce}}^{t \to s}
\]

By minimizing this combined loss, DSKD ensures that the Student mimics the Teacher not just in final probability, but in the internal representation structure, maximizing the similarity between the two models.
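
Continuing the hypothetical `StudentSpaceKD` and `TeacherSpaceKD` sketches above, a single training step would simply sum the three terms (the hidden dimensions below are placeholders):

```python
# Continuing the sketches above; 4096 / 2048 are placeholder hidden sizes.
student_space = StudentSpaceKD(teacher_dim=4096, student_dim=2048)
teacher_space = TeacherSpaceKD(student_dim=2048, teacher_dim=4096)

def dskd_loss(h_teacher, h_student, student_head, teacher_head,
              student_logits, teacher_logits, target_ids):
    kd_t2s, ce_t2s = student_space(h_teacher, student_head, student_logits, target_ids)
    kd_s2t = teacher_space(h_student, teacher_head, teacher_logits)
    return kd_t2s + kd_s2t + ce_t2s   # L_dskd
```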

Solving the Vocabulary Mismatch: Cross-Model Attention

There is one major hurdle left. The method described above assumes the Student and Teacher share the same vocabulary (and thus the same tokenizer). If they share a vocabulary, their sequence lengths are identical.

But in the modern LLM landscape, this is rarely true. You might want to distill knowledge from Qwen (different vocabulary) into GPT-2, or Mistral into TinyLlama. If the tokenizers are different, the sequence I like AI might be 3 tokens for the Student and 4 tokens for the Teacher. You cannot simply project the hidden states because the sequence lengths (\(n\) and \(m\)) don’t match.
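
You can see the mismatch directly with Hugging Face tokenizers. The model names below are just examples for illustration, and the exact token counts depend on which tokenizers you load:

```python
from transformers import AutoTokenizer

# Two models with different tokenizers; any such pair will do.
student_tok = AutoTokenizer.from_pretrained("gpt2")
teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")

text = "I like AI"
print(len(student_tok(text)["input_ids"]))   # student's token count
print(len(teacher_tok(text)["input_ids"]))   # teacher's token count (often different)
```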

To solve this, the authors introduce a Cross-Model Attention (CMA) mechanism.

Aligning Tokens with Attention

Instead of a simple linear projection, the authors treat the alignment problem as an attention task.

  1. Query (Student): The Student’s embeddings and target tokens act as the Query (\(Q\)).
  2. Key & Value (Teacher): The Teacher’s embeddings and hidden states act as the Key (\(K\)) and Value (\(V\)).

\[
\mathbf{Q} = \big[\mathbf{e}^{s};\, \mathbf{e}^{s*}\big]\,\mathbf{W}^{q}, \qquad
\mathbf{K} = \mathbf{e}^{t}\,\mathbf{W}^{k}, \qquad
\mathbf{V} = \mathbf{h}^{t}\,\mathbf{W}^{v}
\]

Here \(\mathbf{e}^{s}\) and \(\mathbf{e}^{s*}\) are the Student's input and target-token embeddings, \(\mathbf{e}^{t}\) are the Teacher's embeddings, \([\cdot\,;\cdot]\) denotes concatenation, and the \(\mathbf{W}\) matrices are learnable projections.

By computing the attention between the Student’s query and the Teacher’s key, the model learns an Attention Matrix (\(\mathbf{a}^{t \to s}\)). This matrix represents the alignment relationship—it tells us which tokens in the Teacher’s sequence correspond to a specific token in the Student’s sequence.

\[
\mathbf{a}^{t \to s} = \mathrm{softmax}\!\Big(\frac{\mathbf{Q}\,\mathbf{K}^{\top}}{\sqrt{d_{k}}}\Big)
\]

We then use this matrix to compute a weighted sum of the Teacher’s values. The result is a sequence of Teacher hidden states that is perfectly aligned—token by token—with the Student’s sequence length.

\[
\tilde{\mathbf{h}}^{t \to s} = \mathbf{a}^{t \to s}\,\mathbf{V}
\]

This aligned representation \(\tilde{\mathbf{h}}\) can now be plugged directly into the DSKD framework described in the previous section. The authors also perform the reverse operation (Student-to-Teacher alignment) to support dual-space distillation.

\[
\mathbf{a}^{s \to t} = \mathrm{softmax}\!\Big(\frac{\mathbf{K}\,\mathbf{Q}^{\top}}{\sqrt{d_{k}}}\Big), \qquad
\tilde{\mathbf{h}}^{s \to t} = \mathbf{a}^{s \to t}\,\mathbf{V}^{s}
\]

Here the roles are mirrored: the Teacher side now queries the Student, and \(\mathbf{V}^{s}\) is the corresponding projection of the Student's hidden states.
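
Here is a compact sketch of the teacher-to-student direction of this idea. It is my own simplification in PyTorch: the separate projection matrices, the scaling factor, and the exact inputs used for the queries and keys are assumptions rather than the paper's precise design.

```python
import math
import torch.nn as nn
import torch.nn.functional as F

class CrossModelAttention(nn.Module):
    """Aligns teacher hidden states (length m) to the student's sequence (length n)."""

    def __init__(self, student_dim: int, teacher_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_q = nn.Linear(student_dim, attn_dim)      # queries from the student side
        self.w_k = nn.Linear(teacher_dim, attn_dim)      # keys from the teacher side
        self.w_v = nn.Linear(teacher_dim, teacher_dim)   # values from teacher hidden states

    def forward(self, student_emb, teacher_emb, teacher_hidden):
        """student_emb: (B, n, d) student-side embeddings (queries).
        teacher_emb: (B, m, D) teacher-side embeddings (keys).
        teacher_hidden: (B, m, D) teacher hidden states (values)."""
        q = self.w_q(student_emb)                        # (B, n, a)
        k = self.w_k(teacher_emb)                        # (B, m, a)
        v = self.w_v(teacher_hidden)                     # (B, m, D)

        # a^{t->s}: soft alignment from each student token to the teacher's tokens.
        attn = F.softmax(q @ k.transpose(-1, -2) / math.sqrt(q.size(-1)), dim=-1)  # (B, n, m)

        # Teacher hidden states re-aligned to the student's n positions.
        return attn @ v                                  # (B, n, D)
```

The aligned output has the student's sequence length but still lives in the teacher's hidden dimension, so it can be fed straight into the projectors and prediction heads of the DSKD losses above; the reverse direction swaps the roles of the two models.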

This mechanism makes DSKD a universal framework, applicable to any pair of LLMs regardless of their vocabularies.

Experimental Results

The authors evaluated DSKD on instruction-following benchmarks (like Dolly, Self-Instruct, and Vicuna-Evaluation). They tested two scenarios: models with the same vocabulary and models with different vocabularies.

Scenario 1: Same Vocabulary (GPT-2)

In the first set of experiments, they distilled a GPT2-1.5B (Teacher) into a GPT2-120M (Student). They compared DSKD against standard KD methods like vanilla KL, Reverse KL (RKL), and Jensen-Shannon (JS) divergence.

Table 1: Rouge-L scores for the GPT-2 experiments. DSKD outperforms the baselines across all metrics.

As shown in Table 1, adding the “w/ DSKD” component consistently improved performance across all distance functions. For example, using DSKD with JS divergence boosted the average Rouge-L score from 16.98 to 18.37—a significant jump.

Scenario 2: Same Vocabulary (LLaMA)

They scaled up the experiment using LLaMA2-7B as the teacher and TinyLLaMA-1.1B as the student.

Table 2: Rouge-L scores for the LLaMA experiments. DSKD shows significant gains over the baselines.

Table 2 confirms the trend. DSKD achieved substantial gains. Notably, using Adaptive KL (AKL) with DSKD resulted in a massive performance increase, raising the average score from 23.30 to 26.65.

Scenario 3: Different Vocabularies

The bottom sections of Table 1 and Table 2 (labeled “Different Vocabularies”) show the power of the Cross-Model Attention (CMA) mechanism.

  • Qwen \(\to\) GPT-2: DSKD-CMA outperformed existing cross-tokenizer methods like MinED and ULD.
  • Mistral \(\to\) TinyLLaMA: DSKD-CMA-AKL achieved an average score of 26.99, beating the best baseline (MinED) by over 5 points. Interestingly, this result was even better than the LLaMA2 \(\to\) TinyLLaMA distillation, likely because Mistral is a stronger teacher than LLaMA2.

Analysis: Why does it work?

To verify that the “Dual-Space” concept was the driver of success, the authors performed an ablation study. They compared:

  1. Diff. Space: Standard KD (different heads).
  2. Student Space: Only projecting Teacher \(\to\) Student.
  3. DSKD: Both Student and Teacher spaces.

Table 3: Ablation study. KD in the Student Space alone beats standard KD, but full DSKD is best.

Table 3 shows that simply moving the distillation into the Student Space provides a major boost. Adding the Teacher Space (DSKD) refines the performance further.

Furthermore, the authors measured the actual geometric distance between the representation structures of the Student and Teacher after training.

Figure 3: DSKD significantly reduces the distance between teacher and student representation structures compared to Vanilla KD.

Figure 3 validates the initial hypothesis: DSKD (orange box) results in a much smaller distance between the representation structures compared to Vanilla KD (green box). The student is truly “thinking” more like the teacher.

Finally, they asked GPT-4 to judge the quality of the generated responses.

Figure 2: GPT-4 win rates. DSKD wins over Vanilla KD in the majority of cases.

As seen in Figure 2, the DSKD-trained model was preferred by GPT-4 in the majority of cases compared to the baseline model.

Conclusion

The paper Dual-Space Knowledge Distillation highlights a subtle but critical inefficiency in how we compress Large Language Models. By treating the Teacher and Student as inhabiting separate vector spaces, standard Knowledge Distillation fails to fully align their representations.

DSKD solves this by explicitly projecting hidden states to unify the output spaces. When combined with Cross-Model Attention, it provides a robust framework that works even when models use completely different tokenizers.

As we continue to rely on massive LLMs, techniques like DSKD will be essential. They allow us to extract the high-level reasoning of massive models and inject it into compact, deployable students, making AI more accessible and efficient for everyone.

Experimental Details Appendix

For those interested in the specific hyperparameters used to achieve these results, the authors provided their training configurations, noting that the distillation temperature \(\tau\) was set to 2.0 based on validation performance.

Table 4: Training configurations (learning rate, batch size, and other hyperparameters).

The overall loss function balanced the task learning and distillation equally:

\[
\mathcal{L} = 0.5\,\mathcal{L}_{\mathrm{ce}} + 0.5\,\mathcal{L}_{\mathrm{kd}}
\]
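
As a last detail, the temperature and the 50/50 weighting translate into code like this (a sketch assuming PyTorch; the \(\tau^2\) rescaling of the KD term is the usual convention and an assumption on my part, not something stated above):

```python
import torch.nn.functional as F

def total_loss(student_logits, teacher_logits, target_ids, tau: float = 2.0):
    """Overall objective: equal weights on the task CE and the temperature-scaled KD loss."""
    vocab = student_logits.size(-1)

    ce = F.cross_entropy(student_logits.reshape(-1, vocab), target_ids.reshape(-1))

    log_q = F.log_softmax(student_logits / tau, dim=-1).reshape(-1, vocab)
    p = F.softmax(teacher_logits / tau, dim=-1).reshape(-1, vocab)
    kd = F.kl_div(log_q, p, reduction="batchmean") * tau ** 2

    return 0.5 * ce + 0.5 * kd
```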