Large Language Models (LLMs) like GPT-4 and Llama 3 have become the de facto “experts” in natural language processing. Their ability to handle complex linguistic patterns is largely due to their massive scale. The prevailing wisdom, captured by scaling laws, suggests that to get smarter models, we simply need to make them bigger.

However, there is a catch. As models grow, the computational cost to train and fine-tune them skyrockets. This is particularly true for Instruction Tuning—the phase where a pre-trained model is refined to follow human instructions across various domains like math, coding, and biology.

To handle disparate tasks, a model needs a high capacity. But simply making a dense model bigger is often computationally prohibitive for many researchers and organizations. This leads to a fascinating problem: How can we expand a model’s capacity without exploding the computational budget?

In this post, we will dive deep into a paper that proposes a clever solution: Parameter-Efficient Sparsity Crafting (PESC). This method takes a standard, dense model (like Llama-2) and “upcycles” it into a Mixture-of-Experts (MoE) model. The genius of PESC lies in how it does this using a fraction of the parameters usually required, making high-performance sparse models accessible on limited hardware.

Figure 1: Camelidae-8x34B-pro achieves excellent performance across general tasks.

As shown above, the models resulting from this technique (dubbed Camelidae) exhibit exceptional performance, rivaling and even beating massive proprietary models on general benchmarks. Let’s explore how this works.

Background: The Need for Capacity

Before dissecting the solution, we need to understand the architectural bottleneck.

Dense Models vs. Mixture-of-Experts (MoE)

Most open-source LLMs (like the original Llama series) are dense. This means that for every single token you input (e.g., the word “cat”), every parameter in the network is activated and used for computation. While effective, this is inefficient. You don’t need the model’s knowledge of astrophysics to process a recipe for pancakes.

Mixture-of-Experts (MoE) models solve this by replacing the Feed-Forward Network (FFN) layers with a group of “experts.” For any given input, a router decides which experts are best suited to handle the data. This allows the model to have a huge number of parameters (high capacity) but only use a small fraction of them for inference (low compute).
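To make this concrete, here is a minimal PyTorch sketch of a sparse MoE layer (not the paper’s implementation): a linear router scores each token, only the top-2 experts actually run, and their outputs are combined using the routing weights. The class name and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Illustrative sparse MoE layer: each token is processed by only its top-k experts."""

    def __init__(self, d_model=512, d_ffn=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an ordinary FFN (a small 2-layer MLP here for brevity).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.SiLU(), nn.Linear(d_ffn, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)      # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # dispatch each token to its chosen experts
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Adding experts grows the total parameter count, but each token still only pays for `top_k` expert evaluations, which is exactly the capacity-versus-compute trade-off MoE exploits.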

The Challenge of “Sparse Upcycling”

A technique called Sparse Upcycling allows researchers to initialize an MoE model using weights from a pre-trained dense model. Typically, you copy the FFN weights multiple times to create the initial experts.

However, training this upcycled model is expensive. If you have 8 experts, you now have 8 times the parameters in those layers to update. This demands massive GPU memory and compute resources, often putting MoE training out of reach for standard instruction tuning setups.
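To get a feel for the numbers, here is a quick back-of-the-envelope calculation assuming Llama-2-7B-style dimensions (hidden size 4096, FFN intermediate size 11008, 32 layers, three projection matrices per SwiGLU FFN); the figures are illustrative, not taken from the paper.

```python
# Rough FFN parameter counts for a hypothetical Llama-2-7B-style model (assumed dimensions).
d_model, d_ffn, n_layers = 4096, 11008, 32
ffn_params_per_layer = 3 * d_model * d_ffn      # gate, up, and down projections (SwiGLU)

dense_ffn_total = n_layers * ffn_params_per_layer
upcycled_ffn_total = 8 * dense_ffn_total        # 8 fully independent expert copies per layer

print(f"dense FFN parameters:    {dense_ffn_total / 1e9:.1f}B")    # ~4.3B
print(f"upcycled FFN parameters: {upcycled_ffn_total / 1e9:.1f}B") # ~34.6B, all trainable
```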

The Core Method: Parameter-Efficient Sparsity Crafting (PESC)

The authors introduce PESC to solve the resource problem. The core idea is to combine the architecture of MoE with the efficiency of Adapters (a concept from Parameter-Efficient Fine-Tuning, or PEFT).

The Architecture: A High-Level View

In a traditional transformer block, you have Attention layers and FFN layers. PESC transforms the dense FFN layer into a Sparse MoE layer.

Figure 2: Overview of the parameter-efficient sparsity crafting with parameter-efficient experts.

As illustrated in Figure 2 (right side), the new MoE layer consists of a Gate Router and several Parameter-Efficient Experts.

The critical innovation is in how the experts are constructed. In standard upcycling, every expert is a distinct, fully trainable neural network. In PESC, the authors keep the massive “backbone” weights of the experts frozen and shared, and only train tiny “Adapters” unique to each expert.

Detailed Design: Shared Weights and Adapters

Let’s break down the mathematics and structure.

In a standard Adapter setup (Equation 1), a small bottleneck module is inserted into a layer. For an input \(U\), the adapter output is:

Equation 1:

\[
\text{Adapter}(U) = U + f(U W_{down})\, W_{up}
\]

Here, \(W_{down}\) and \(W_{up}\) are small matrices that project the data down to a lower dimension and back up, \(f(\cdot)\) is a non-linear activation, and the residual connection preserves the original input. This adds very few parameters.
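As a concrete reference, here is a minimal PyTorch sketch of such a bottleneck adapter, assuming the standard down-project / non-linearity / up-project pattern with a residual connection; the class name and sizes are invented for illustration.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Standard bottleneck adapter: project down, apply a non-linearity, project up, add residual."""

    def __init__(self, d_model=4096, d_bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)   # W_down
        self.act = nn.SiLU()                           # non-linearity f(.)
        self.up = nn.Linear(d_bottleneck, d_model)     # W_up

    def forward(self, u):
        # Equation 1: U + f(U W_down) W_up
        return u + self.up(self.act(self.down(u)))
```

With \(d_{model} = 4096\) and a bottleneck of 64, one adapter holds roughly \(2 \times 4096 \times 64 \approx 0.5\)M weights, a rounding error next to the roughly 135M weights of a single Llama-2-7B FFN layer.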

In the PESC MoE layer, the output \(y\) is calculated as the weighted sum of the selected experts (Equation 2):

Equation 2:

\[
y = \sum_{i=1}^{n} R(x)_i \, E_i(x)
\]

Here, \(R(x)_i\) is the routing score (how much the model “trusts” expert \(i\)), and \(E_i(x)\) is the output of expert \(i\).

The Efficiency Hack: In traditional sparsity crafting, we initialize experts by copying the dense FFN weights (\(\theta_o\)) to every expert (\(\theta_i\)). We then optimize the objective function \(\mathcal{F}\) for every single expert:

Equation 3:

\[
\theta_i^{+} = \underset{\theta_i}{\arg\max}\; \mathcal{F}(\theta_i), \qquad \theta_i \text{ initialized from } \theta_o, \quad i = 1, \dots, n
\]

This requires updating massive amounts of parameters (\(\theta_i^+\)).

PESC changes the game. Instead of training the heavy weights \(\theta\), the authors freeze the original FFN weights (\(\theta_o\)) and share them across all experts. To differentiate the experts, they insert a unique Adapter (\(\omega_i\)) into each one.

The optimization target changes to finding the best adapter weights \(\omega_i^+\):

Equation 4:

\[
\omega_i^{+} = \underset{\omega_i}{\arg\max}\; \mathcal{F}(\omega_i \,;\, \theta_o), \quad i = 1, \dots, n
\]

Why does this matter? Because the number of parameters in an adapter (\(\omega\)) is minuscule compared to the FFN weights (\(\theta\)). The total parameter count for the system remains very close to the original dense model, as shown mathematically here:

Equation 5:

\[
|\theta_o| + \sum_{i=1}^{n} |\omega_i| \;\ll\; \sum_{i=1}^{n} |\theta_i| \;=\; n\,|\theta_o|
\]

This inequality shows that the size of \(n\) adapters plus one shared FFN is significantly smaller than \(n\) independent FFNs.
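The sketch below spells the inequality out numerically, reusing the hypothetical Llama-2-7B-style dimensions from earlier and an assumed adapter bottleneck of 64.

```python
# Per-layer parameter counts under assumed dimensions (illustrative, not from the paper).
d_model, d_ffn = 4096, 11008
n_experts, d_bottleneck = 8, 64

ffn = 3 * d_model * d_ffn                 # one shared FFN (theta_o)
adapter = 2 * d_model * d_bottleneck      # one adapter (omega_i)

full_upcycling = n_experts * ffn          # n independent expert FFNs
pesc_total = ffn + n_experts * adapter    # 1 shared FFN + n adapters
pesc_trainable = n_experts * adapter      # theta_o stays frozen; only adapters are updated

print(f"full upcycling: {full_upcycling / 1e6:.0f}M params per layer")   # ~1082M
print(f"PESC total:     {pesc_total / 1e6:.0f}M params per layer")       # ~139M, close to the dense ~135M
print(f"PESC trainable: {pesc_trainable / 1e6:.1f}M params per layer")   # ~4.2M
```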

Visualizing the PESC Layer

Figure 3 provides a detailed look at this specific implementation.

Figure 3: Detailed design of the MoE layer for PESC utilizing parameter-efficient experts. All the FFN layers share the same weights.

Notice the distinct components:

  1. Shared Weights: The FFN layers (in the dashed box on the left) are identical for every expert. They are frozen.
  2. Adapters: Each “Expert” (1 through \(n\)) has its own unique Adapter.
  3. Weighted Sum: The outputs are combined based on the router’s decision.

Mathematically, the output of a PESC expert \(A_i\) is a combination of the shared backbone and the unique adapter:

Equation 8:

\[
A_i(x) = \mathrm{FFN}(x;\theta_o) + f\big(\mathrm{FFN}(x;\theta_o)\, W_{down}^{(i)}\big)\, W_{up}^{(i)}
\]

And the final layer output combines these adapter-enhanced experts:

Equation 7:

\[
y = \sum_{i=1}^{n} R(x)_i \, A_i(x)
\]

The authors argue that because Adapters are “universal approximators” (they can theoretically learn any function), optimizing these adapters allows the system to approximate the performance of a fully trained MoE without the memory cost. This approximation is expressed as keeping the error (\(\xi\)) low:

Equation 6:

\[
\big|\, \mathcal{F}(\omega_i^{+} \,;\, \theta_o) - \mathcal{F}(\theta_i^{+}) \,\big| \;\le\; \xi
\]
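Putting the pieces together, here is one plausible PyTorch wiring of a PESC-style layer based on the description above: a single frozen FFN shared by every expert, one small trainable adapter per expert, and a Top-2 router combining the results (Equations 7 and 8). It reuses the hypothetical BottleneckAdapter class sketched earlier; this is my reading of the design, not the authors’ code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PESCMoELayer(nn.Module):
    """Sketch of a PESC-style MoE layer: frozen shared FFN + one lightweight adapter per expert."""

    def __init__(self, d_model=4096, d_ffn=11008, n_experts=8, d_bottleneck=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # W_r
        # Shared backbone (theta_o): in practice copied from the dense model's FFN, then frozen.
        self.shared_ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.SiLU(), nn.Linear(d_ffn, d_model)
        )
        for p in self.shared_ffn.parameters():
            p.requires_grad = False
        # One trainable adapter (omega_i) per expert is what differentiates the experts.
        self.adapters = nn.ModuleList(
            BottleneckAdapter(d_model, d_bottleneck) for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        gate = F.softmax(top_vals, dim=-1)             # weights over the selected experts
        h = self.shared_ffn(x)                         # backbone computed once, shared by all experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for i, adapter in enumerate(self.adapters):
                mask = top_idx[:, slot] == i
                if mask.any():                         # expert i = adapter_i applied to the shared FFN
                    out[mask] += gate[mask, slot].unsqueeze(-1) * adapter(h[mask])
        return out
```

Because the heavy FFN is shared, it only needs to run once per token regardless of which experts are selected; all the per-expert work happens in the tiny adapters.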

The Routing Mechanism

How does the model know which expert to use? PESC employs a Top-K Gate Router.

For every token, the router calculates a score for each expert using a learnable weight matrix \(W_r\). It keeps only the top \(K\) scores (usually the Top-2), sets the rest to negative infinity, and then applies a Softmax, so non-selected experts receive zero probability.

Equation 9:

\[
R(x) = \mathrm{Softmax}\big(\mathrm{TopK}(x\,W_r)\big),
\qquad
\mathrm{TopK}(v)_i =
\begin{cases}
v_i & \text{if } v_i \text{ is among the top } K \text{ values}\\
-\infty & \text{otherwise}
\end{cases}
\]

This ensures that for any given token, only two experts are active, keeping inference fast.
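A compact sketch of that gate in PyTorch: the scores come from the learnable matrix \(W_r\), everything outside the top \(K\) is masked to negative infinity, and the softmax turns the surviving scores into routing weights. The function name is my own.

```python
import torch
import torch.nn.functional as F

def top_k_gate(x, W_r, k=2):
    """Equation 9 in code: R(x) = Softmax(TopK(x W_r)); non-selected experts get weight 0."""
    logits = x @ W_r                                    # (tokens, n_experts)
    top_vals, top_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))     # -inf becomes probability 0 under softmax
    masked.scatter_(-1, top_idx, top_vals)              # keep only the top-k logits
    return F.softmax(masked, dim=-1)                    # (tokens, n_experts), k non-zero weights per row

# Example: route 4 tokens of width 16 across 8 experts with Top-2 gating.
x = torch.randn(4, 16)
W_r = torch.randn(16, 8)
print(top_k_gate(x, W_r))    # each row sums to 1 and has exactly 2 non-zero entries
```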

Load Balancing: A common failure mode in MoE training is “expert collapse,” where the router sends everything to just one expert, ignoring the others. To prevent this, the authors add an auxiliary loss function that encourages equal usage of all experts:

Equation 10:

\[
\mathcal{L}_{\text{balance}} = \alpha \cdot n \cdot \sum_{i=1}^{n} f_i \cdot p_i
\]

Here, \(f_i\) is the fraction of tokens dispatched to expert \(i\), \(p_i\) is the average routing probability the gate assigns to expert \(i\), and \(\alpha\) is a small scaling coefficient. Minimizing this term pushes the router toward a balanced distribution of work.
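Below is a sketch of how such an auxiliary loss can be computed, assuming the familiar form where the loss is the dot product of the per-expert token fractions and mean routing probabilities, scaled by the number of experts and a coefficient \(\alpha\); variable names are illustrative.

```python
import torch

def load_balancing_loss(router_probs, expert_index, n_experts, alpha=1.0):
    """Auxiliary loss ~ alpha * n * sum_i f_i * p_i (Equation 10).

    router_probs: (tokens, n_experts) softmax routing probabilities
    expert_index: (tokens,) index of the expert each token was dispatched to
    """
    # f_i: fraction of tokens dispatched to expert i
    f = torch.bincount(expert_index, minlength=n_experts).float() / expert_index.numel()
    # p_i: mean routing probability assigned to expert i
    p = router_probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * p)   # equals alpha when routing is perfectly uniform (f_i = p_i = 1/n)

# Example: 1000 tokens, 8 experts.
probs = torch.softmax(torch.randn(1000, 8), dim=-1)
assignment = probs.argmax(dim=-1)
print(load_balancing_loss(probs, assignment, 8))  # a value near 1.0 indicates a balanced router
```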

Experiments & Results

To validate PESC, the authors created the Camelidae model family. They upcycled Llama-2 (7B and 13B) and Yi (34B) models into MoEs with 8 experts each, using the “Top-2” routing strategy.

They compared these against the original dense models (called “Camel”) and other state-of-the-art sparse models like Mixtral.

1. Superiority Over Dense Models

The most direct comparison is between the dense baselines (Camel) and the sparse PESC models (Camelidae), both trained on the same data (IDAE-500K).

Table 2: Overall performance on all the evaluation benchmarks of dense models (Camel) and sparse (Camelidae) models across different model sizes.

Table 2 shows a clear trend: Sparse models consistently outperform their dense counterparts.

  • Camelidae-8x34B scores 75.6 on MMLU compared to 75.3 for the dense Camel-34B.
  • The gap is even wider in complex tasks like coding (HumanEval) and math (GSM8K). For example, on GSM8K, the sparse 34B model achieves 78.3% vs 76.1% for the dense version.

2. Comparison with State-of-the-Art

The authors pushed their 34B model further by training on a larger dataset (IDAE-720K) to create Camelidae-8x34B-pro. They compared this against GPT-3.5, Llama-2-70B, and Mixtral-8x7B.

Table 1: Performance of Camelidae-8x34B-pro on academic benchmarks.

As seen in Table 1, Camelidae-8x34B-pro is a powerhouse.

  • MMLU: It achieves 75.7%, beating GPT-3.5 (70.0%) and Mixtral (68.7%).
  • Math & Code: It achieves 79.4% on GSM8K (Math) and 48.8% on HumanEval (Code), outperforming the massive Llama2-70B Chat model in these reasoning-heavy domains.

3. Do Experts Actually Specialize?

One of the theoretical promises of MoE is that different experts will become specialists in different topics (e.g., one expert for math, another for biology). The authors analyzed the routing decisions to see if this actually happened.

Figure 4: Proportion of tokens assigned to each expert on different dataset subsets.

Figure 4 visualizes which experts were selected for different datasets:

  • SlimOrca (General instruction following)
  • Magicoder (Coding)
  • MetaMathQA (Mathematics)

The results are fascinating. Look at Expert 1 (Teal). It is heavily activated for Magicoder (Coding) but less so for Math. Conversely, Expert 6 (Light Yellow) sees massive activation for MetaMathQA. This confirms that the PESC method successfully encourages experts to specialize syntactically and semantically, even though they share the same FFN backbone.

4. The Impact of Expert Count

Does adding more experts help?

Table 4: Evaluation on different numbers of experts in the MoE layers.

Table 4 suggests yes. Moving from 4 experts to 16 experts yields consistent gains across Code, Math, and General Knowledge (MMLU). This indicates that the PESC method scales well—you can add more adapters (cheap parameters) to gain more capacity without a significant training penalty.

Conclusion and Implications

The “Camelidae” paper presents a compelling step forward for the accessibility of Large Language Models. By introducing Parameter-Efficient Sparsity Crafting (PESC), the authors have provided a blueprint for turning standard dense models into powerful Mixture-of-Experts systems without the prohibitive memory costs usually associated with MoE training.

Key Takeaways:

  1. Efficiency: PESC drastically reduces the number of trainable parameters needed to create an MoE by sharing FFN weights and training only Adapters.
  2. Performance: The resulting sparse models outperform their dense parents and compete with much larger proprietary models.
  3. Specialization: Even with shared backbones, the inserted adapters allow experts to specialize in distinct domains like math or coding.

This research implies that the barrier to entry for training high-capacity, general-purpose models is lowering. Students and researchers with limited GPU resources can now effectively “upcycle” open-source models into sparse expert systems capable of tackling complex, multi-domain instructions.