The landscape of Large Language Models (LLMs) is currently defined by two conflicting forces: the drive for massive scale and the constraint of limited computational resources. We want models that know everything, but we don’t always have the hardware to train them.
This has led to the explosion of Parameter-Efficient Fine-Tuning (PEFT) techniques. Methods like LoRA (Low-Rank Adaptation) have become household names for students and practitioners, allowing us to adapt giant dense models like Llama-3 or Mistral on consumer-grade hardware.
But the architecture of LLMs is shifting. We are moving away from “dense” models—where every parameter is used for every token—toward Mixture-of-Experts (MoE) architectures, such as Mixtral or DeepSeek-V2. In an MoE model, only a fraction of the network is active at any given time.
Here lies the problem: Most existing PEFT methods treat sparse MoE models as if they were dense. They apply updates uniformly or add adapter layers everywhere, ignoring the unique structure of the model.
In this post, we are diving deep into a paper from DeepSeek AI and Northwestern University titled “Let the Expert Stick to His Last.” The researchers propose Expert-Specialized Fine-Tuning (ESFT), a method that respects the sparse nature of MoEs. By identifying and tuning only the experts relevant to a specific task, ESFT matches the performance of full fine-tuning with a fraction of the compute and storage cost.
Part 1: The Context
To understand why ESFT is such a clever innovation, we first need to establish how Mixture-of-Experts models work and why standard fine-tuning is inefficient for them.
The MoE Paradigm
In a standard Transformer (a dense model), an input token passes through a Feed-Forward Network (FFN) layer, and every neuron in that layer participates in the calculation.
In a Mixture-of-Experts (MoE) model, that single FFN layer is replaced by a bank of multiple smaller FFNs, called “experts.” For every token, a “router” (or gate) decides which experts should handle the workload.
The output hidden state \(\mathbf{h}_t^l\) of the \(t\)-th token in the \(l\)-th layer is calculated as follows, where \(\mathbf{u}_t^l\) is the token's input to the MoE layer:

$$
\mathbf{h}_t^l = \sum_{i=1}^{N} g_{i,t}\, \mathrm{FFN}_i\!\left(\mathbf{u}_t^l\right) + \mathbf{u}_t^l
$$
Here, \(N\) is the total number of experts, and \(g_{i,t}\) is the gate value (or weight) assigned to expert \(i\) for token \(t\). The magic happens in the gating function. Most gate values are zero, meaning most experts are ignored. The router selects only the Top-K experts based on an affinity score.
The gating decision logic looks like this:

$$
g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} \in \mathrm{TopK}\left(\{\, s_{j,t} \mid 1 \le j \le N \,\},\ K\right), \\
0, & \text{otherwise.}
\end{cases}
$$
The affinity score \(s_{i,t}\) is usually derived from a Softmax applied to the dot product between the token input and the expert’s centroid \(\mathbf{e}_i^l\) (a learnable vector representing the expert):

$$
s_{i,t} = \mathrm{Softmax}_i\!\left(\mathbf{u}_t^{l\top} \mathbf{e}_i^l\right)
$$
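To make the routing concrete, here is a minimal PyTorch sketch of a Top-K gated MoE layer. It is an illustration of the equations above, not DeepSeek-V2's actual implementation; the class name and the per-expert Python loop are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Minimal Top-K gated MoE layer (illustrative sketch, not DeepSeek-V2 code)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # Expert centroids e_i: one learnable vector per expert.
        self.centroids = nn.Parameter(torch.randn(n_experts, d_model) * 0.02)
        # Each expert is a small FFN.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (n_tokens, d_model) -- token hidden states entering the layer.
        # Affinity scores s_{i,t} = Softmax_i(u_t^T e_i).
        scores = F.softmax(u @ self.centroids.T, dim=-1)      # (n_tokens, n_experts)
        # Keep only the Top-K gate values per token; all other gates are zero.
        topk_vals, topk_idx = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter_(-1, topk_idx, topk_vals)
        # h_t = sum_i g_{i,t} * FFN_i(u_t) + u_t  (residual connection).
        out = u.clone()
        for i, expert in enumerate(self.experts):
            mask = gates[:, i] > 0                            # tokens routed to expert i
            if mask.any():
                out[mask] += gates[mask, i, None] * expert(u[mask])
        return out
```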
The DeepSeek-V2 Twist: Fine-Grained Experts
The paper utilizes the DeepSeek-V2 architecture, which introduces a “fine-grained” approach. Instead of having 8 massive experts (like Mixtral 8x7B), DeepSeek-V2 splits the FFNs into many smaller experts (e.g., 162 experts).
Furthermore, it uses Shared Expert Isolation. Some experts are designated as “shared” and are always active for every token, capturing common knowledge, while “non-shared” experts are routed dynamically.
The output in this advanced architecture looks like this:

$$
\mathbf{h}_t^l = \sum_{i=1}^{K_s} \mathrm{FFN}_i\!\left(\mathbf{u}_t^l\right) + \sum_{i=K_s+1}^{N} g_{i,t}\, \mathrm{FFN}_i\!\left(\mathbf{u}_t^l\right) + \mathbf{u}_t^l
$$
In this setup, the first \(K_s\) experts are shared and always fire, while the router dynamically selects the remaining active experts from the non-shared pool.
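As a rough sketch of how the forward pass changes with shared experts (the helper below and its argument names are hypothetical, written only to mirror the formula):

```python
import torch

def shared_plus_routed_forward(u, shared_experts, routed_experts, routed_gates):
    """h_t = sum_shared FFN_i(u_t) + sum_routed g_{i,t} FFN_i(u_t) + u_t.
    Hypothetical helper: shared_experts / routed_experts are lists of FFN modules,
    routed_gates is the sparse (n_tokens, n_routed) gate matrix from Top-K routing."""
    h = u.clone()                                # residual term u_t
    for ffn in shared_experts:                   # the K_s shared experts: always active
        h = h + ffn(u)
    for i, ffn in enumerate(routed_experts):     # non-shared experts: gated
        g = routed_gates[:, i:i + 1]             # (n_tokens, 1) gate values for expert i
        h = h + g * ffn(u)                       # g is zero for tokens not routed here
    return h
```

A real implementation dispatches only the routed tokens to each expert rather than running every expert on every token, but the math is the same.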
The Problem with Current PEFT
When we use a method like LoRA on an MoE model, we typically add low-rank matrices to all linear layers, or at least all the experts. This is somewhat wasteful. If a downstream task (say, solving math problems) only utilizes 10% of the experts, why are we updating the parameters (or learning adapters) for the other 90%?
Full-Parameter Fine-Tuning (FFT) is even worse. It updates every weight in the model. In an MoE, this means updating experts that might barely be active for the target task, potentially overwriting their specialized knowledge required for other tasks (a phenomenon known as catastrophic forgetting).
Part 2: The Intuition – Do Experts Actually Specialize?
The core hypothesis of this paper is simple: In an MoE model, different experts specialize in different tasks.
If this is true, we shouldn’t need to retrain the whole brain to teach the model a new skill; we should just send the new information to the relevant specialists.
The researchers conducted probing experiments to verify this. They fed different types of data (Math, Code, Law, Translation, etc.) into the model and tracked which experts were activated.
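A sketch of what that bookkeeping can look like. The `get_gate_values` callback here is a hypothetical hook that returns each layer's gate matrix for a batch; the actual experiments of course read the DeepSeek-V2 router internals directly.

```python
import torch

def track_expert_usage(model, dataloader, n_layers, n_experts, get_gate_values):
    """Accumulate per-layer expert usage statistics for one task's data.
    `get_gate_values(model, batch)` is a hypothetical helper that runs a forward
    pass and returns a list of (n_tokens, n_experts) gate matrices, one per layer."""
    selection_counts = torch.zeros(n_layers, n_experts)  # how often each expert is in the Top-K
    gate_mass = torch.zeros(n_layers, n_experts)         # total gate weight each expert receives
    with torch.no_grad():
        for batch in dataloader:
            for l, gates in enumerate(get_gate_values(model, batch)):
                selection_counts[l] += (gates > 0).sum(dim=0).float()
                gate_mass[l] += gates.sum(dim=0)
    return selection_counts, gate_mass
```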
Observation 1: Concentration
First, they looked at the routing distribution for specific tasks. They found that for any given task, the “attention” of the router is highly concentrated.

As shown in Figure 2, a small subset of experts captures the vast majority of the gate values (the y-axis). The curves drop off sharply. This implies that for a task like Math (purple line), only a handful of experts are doing the heavy lifting.
Observation 2: Distinctness
Second, they checked if different tasks use different experts. They visualized the overlap of active experts across tasks.

Figure 3 illustrates this perfectly. The diagonal (Self-Self) values are high, meaning a task consistently uses its own set of experts. However, the off-diagonal values are near zero. The experts used for “Code” have almost no overlap with the experts used for “Translation.”
This confirms the “Expert Specialization” hypothesis: The model has implicitly organized itself into specialized modules.
Part 3: The Method – Expert-Specialized Fine-Tuning (ESFT)
Based on the observations above, the authors propose ESFT. The workflow is straightforward yet powerful.
Step 1: Data Sampling
We take a small subset of the training data for our downstream task (e.g., a few hundred lines of Python code if we are tuning for coding). The authors found that a very small sample (32 sequences of 4096 tokens each) is enough to reliably estimate expert relevance.
Step 2: Calculate Expert Relevance
We need a metric to rank the experts: “How important is Expert X for this Task?” The paper proposes two scoring methods.
Method A: Average Gate Score (ESFT-Gate)

This measures the intensity of the router’s activation: if the router consistently assigns high gate values to an expert on the sampled data, that expert gets a high score.

$$
r_i^l = \frac{1}{N_s}\sum_{j=1}^{N_s} \frac{1}{L_j} \sum_{k=1}^{L_j} g_{i,k}^{\,l}
$$

Here \(N_s\) is the number of sampled sequences, \(L_j\) is the length of the \(j\)-th sequence, and \(g_{i,k}^{\,l}\) is the gate value expert \(i\) receives for the \(k\)-th token of that sequence.
Method B: Token Selection Ratio (ESFT-Token)

This simply counts how often an expert is selected (i.e., how often it lands in the Top-K), regardless of the gate magnitude:

$$
r_i^l = \frac{\sum_{j=1}^{N_s} \sum_{k=1}^{L_j} \mathbb{1}\left(g_{i,k}^{\,l} > 0\right)}{K \sum_{j=1}^{N_s} L_j}
$$

The denominator is the total number of routed-expert selections over the sample (each token picks \(K\) experts), so these ratios sum to one across experts.
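Both scores can be computed from the same routing statistics gathered on the sampled data. A minimal sketch for one layer, assuming `gates` stacks the sparse gate vectors of all sampled tokens (this simplifies the per-sequence averaging into a single average over tokens):

```python
import torch

def expert_relevance_scores(gates: torch.Tensor, top_k: int):
    """Relevance of each expert for the sampled task data, for a single layer.
    gates: (total_tokens, n_experts), where gates[t, i] > 0 only if expert i
    was in the Top-K for token t. Sketch of the two metrics described above."""
    n_tokens = gates.shape[0]
    # ESFT-Gate: average gate value assigned to each expert over the sampled tokens.
    avg_gate_score = gates.sum(dim=0) / n_tokens
    # ESFT-Token: fraction of all Top-K selections that went to each expert.
    token_selection_ratio = (gates > 0).float().sum(dim=0) / (n_tokens * top_k)
    return avg_gate_score, token_selection_ratio
```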
Step 3: Selection and Tuning
Once every expert has a score in every layer, we rank the experts by score and keep the top ones until their cumulative relevance surpasses a threshold \(p\) (a hyperparameter, e.g., covering the top 20% of total relevance). Formally, for layer \(l\) we pick the smallest expert set \(E^l\) such that

$$
\sum_{i \in E^l} r_i^l \ge p.
$$
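Ranking and cutting off at the cumulative threshold takes only a few lines. A sketch (the normalization assumes the layer's scores should sum to 1, which holds for the token selection ratio and approximately for the gate score):

```python
import torch

def select_experts(scores: torch.Tensor, p: float) -> list[int]:
    """Smallest set of top-scoring experts whose cumulative (normalized)
    relevance reaches the threshold p. Sketch of the selection rule above."""
    norm = scores / scores.sum()                     # normalize to a distribution
    vals, idx = norm.sort(descending=True)
    cumulative = vals.cumsum(dim=0)
    n_keep = int((cumulative < p).sum().item()) + 1  # first position crossing p
    return idx[:n_keep].tolist()                     # expert ids to unfreeze
```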
Finally, during training:
- Unfreeze the selected experts.
- Freeze all other experts.
- Freeze shared experts (usually).
- Freeze other modules (attention, routers, etc.).
We then fine-tune using standard backpropagation, but gradients only update the “specialist” experts.
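Operationally, this is just toggling `requires_grad`. A sketch, with placeholder attribute names (`model.layers`, `layer.mlp.experts`) standing in for whatever the real module structure exposes:

```python
def apply_esft_freezing(model, selected_experts: dict[int, list[int]]):
    """Freeze the whole model, then unfreeze only the chosen routed experts.
    selected_experts maps layer index -> expert indices picked in Step 3.
    Attribute names below are placeholders for the actual implementation."""
    for param in model.parameters():
        param.requires_grad = False                    # attention, routers, shared experts, ...
    for layer_idx, expert_ids in selected_experts.items():
        experts = model.layers[layer_idx].mlp.experts  # hypothetical module path
        for i in expert_ids:
            for param in experts[i].parameters():
                param.requires_grad = True             # only the task specialists get gradients
```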
Visual Comparison
Let’s look at how ESFT compares to Full-Parameter Fine-Tuning (FFT) and LoRA structurally.

- FFT (Left): Everything is green (trainable). This is computationally heavy and risks forgetting.
- LoRA (Middle): We freeze the blue block but attach green side-paths (low-rank matrices). This saves memory but modifies the path for all data.
- ESFT (Right): We look inside the MoE block. We see that only specific experts (e.g., Expert 2 and Expert 5) are green. The router and other experts remain frozen (blue). This is “surgery” rather than “supplementation.”
Part 4: Experimental Results
Does this precision surgery actually work? The researchers tested ESFT on DeepSeek-V2-Lite (about 16B total parameters, ~2.4B active per token) across two categories:
- Enhancement: Improving existing skills (Math, Code).
- Adaptation: Teaching new, narrow skills (Intent Recognition, Legal Judgment, Translation).
Performance vs. Baselines
The results were impressive. ESFT consistently outperformed LoRA and rivaled (or beat) Full Fine-Tuning.

In Table 1, look at the “Average” column.
- LoRA: 44.9
- FFT: 51.0
- ESFT-Gate: 50.2
ESFT matches the performance of tuning the entire model, despite updating only a fraction of the parameters. On specialized tasks like intent recognition, ESFT-Gate (78.6) effectively ties with FFT (78.8) and leaves LoRA (67.8) far behind.
The “General Ability” Bonus
One of the biggest risks in fine-tuning is that by optimizing for Task A, the model becomes stupid at Task B. This is where ESFT shines. Because it leaves the majority of the experts (those irrelevant to the current task) untouched, the model retains its general capabilities better.
The paper shows that on general benchmarks (MMLU, TriviaQA), ESFT suffers significantly less degradation than FFT.
Efficiency
The primary goal of PEFT is efficiency. ESFT delivers here as well.

Figure 5 visualizes the costs:
- Green Line (Storage): FFT requires massive storage (top right dot). ESFT (middle dots) requires vastly less, hovering near LoRA (bottom left).
- Blue Bars (Time): ESFT trains faster than FFT because fewer gradients need to be computed and synchronized.
The authors note that ESFT reduces storage by up to 90% and training time by 30% compared to full fine-tuning.
Expert Selection Visualization
It is fascinating to see how many experts ESFT actually selects. Is it tuning half the model? Or just 1%?

Figure 4 shows the count of trained experts per layer (X-axis) for different tasks (Y-axis). The model has 64 non-shared experts per layer.
- Notice the numbers: 2, 4, 9, 12… rarely exceeding 15.
- This means ESFT is typically training only 5% to 20% of the experts.
- Interestingly, specialized tasks like Translation (bottom row) require very few experts (dark blue cells), while broader tasks like Code activate a wider range.
Part 5: Why Does It Work? The Importance of Granularity
The researchers dug deeper to understand why ESFT works so well for DeepSeek-V2. They posit that the fine-grained nature of the experts is the key.
If experts are “coarse” (large and few), each expert acts like a generalist, handling many different topics. If you freeze a coarse expert, you might block necessary pathways. If you tune it, you might overwrite too much knowledge.
But with “fine-grained” experts (small and many), each expert can be highly specialized (e.g., one expert just for “punctuation in Python code”).
To prove this, they simulated coarse-grained experts by grouping the fine-grained ones together and forcing them to be selected as a unit.
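Conceptually, the simulation just merges relevance (and selection) to the group level, so a whole group of experts is either trained or frozen together. A rough sketch of that idea, not the paper's code:

```python
import torch

def grouped_relevance(scores: torch.Tensor, group_size: int) -> torch.Tensor:
    """Merge per-expert relevance scores into coarse 'virtual experts'.
    scores: (n_experts,) relevance for one layer; assumes n_experts is divisible
    by group_size. Selecting a group means tuning every expert inside it."""
    return scores.view(-1, group_size).sum(dim=1)  # one score per group
```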

Figure 7 shows the result. The X-axis represents “Group Size” (making experts effectively larger/coarser).
- Blue/Green Lines (Performance): As the group size increases (experts get coarser), the performance of ESFT drops significantly.
- Orange Lines (Parameter Count): Simultaneously, the number of parameters you have to tune increases.
This leads to a crucial insight: Sparse-Specific Fine-Tuning strategies like ESFT are the future, but they rely on model architectures evolving toward finer granularity.
Conclusion
The paper “Let the Expert Stick to His Last” offers a compelling narrative for the future of LLM customization. It moves us away from the brute-force approach of “train everything” and the generic approach of “slap a low-rank adapter on everything.”
Instead, it treats the MoE model as a collection of specialized tools.
- Identify the right tools for the job (using Gate or Token scores).
- Sharpen only those tools (Fine-tune selected experts).
- Leave the rest alone to preserve general competence.
For students and researchers, the takeaway is clear: As models become more sparse and modular (to handle scale), our training techniques must become more surgical. ESFT is a prime example of how understanding the underlying architecture—specifically the specialization of experts—can lead to methods that are both efficient and highly effective.
This post summarizes the research by Wang et al. from DeepSeek AI. All figures and equations are sourced from the original paper.