The world of Large Language Models (LLMs) is currently in an arms race. From GPT-4 to Llama and Gemini, the push has been to build bigger, more capable models. However, we are hitting a wall: computational cost. Training these massive “Dense” models—where every single parameter is active for every single calculation—is becoming prohibitively expensive.

Enter the Mixture of Experts (MoE) architecture. MoE models promise a way out of this efficiency bottleneck by using “conditional computation.” Instead of firing every neuron for every token, the model activates only a specific subset of “experts” relevant to the current input. It’s the difference between consulting a whole university faculty for a simple math problem versus just asking the math professor.

But here is the billion-dollar question: Do the established rules of AI training—the “Scaling Laws”—apply to this different architecture?

Historically, we have relied on scaling laws derived from Dense models (like the famous Chinchilla laws) to predict how much data and compute we need. In this post, we will dive deep into a comparative analysis by Wang et al., investigating whether these laws transfer to MoE models and what that means for the future of efficient AI.

The Foundation: Understanding Scaling Laws

Before we dissect the differences, we need to understand the baseline. Scaling laws are essentially the “physics” of training neural networks. They observe that model performance (measured by loss) improves predictably as you increase three ingredients:

  1. N: The number of parameters (Model Scale).
  2. D: The amount of training data (Tokens).
  3. C: The compute budget (FLOPs).

For standard Dense models, this relationship follows a power-law. Previous research established the following equation to predict training loss:

\[ L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + \sigma \]

Subject to the constraint that the total compute budget (\(C\)) is a function of parameters and data:

\[ \text{subject to } C = \mathrm{FLOPs}(N, D) \]

Here, \(A\), \(B\), \(\alpha\), and \(\beta\) are coefficients specific to the model architecture, and \(\sigma\) represents the irreducible noise in the dataset (the best possible loss you could ever achieve).
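
To make the power law concrete, here is a minimal sketch of a loss predictor in Python. The coefficient values are illustrative placeholders, not fitted values from the paper discussed in this post.

```python
# Minimal sketch of the Dense power-law loss predictor.
# Coefficients below are illustrative placeholders, not fitted values
# from the paper discussed in this post.

def predicted_loss(N, D, A=400.0, B=410.0, alpha=0.34, beta=0.28, sigma=1.7):
    """L(N, D) = A / N**alpha + B / D**beta + sigma."""
    return A / N**alpha + B / D**beta + sigma

# Doubling the data for a fixed 1B-parameter model only shrinks the
# B / D**beta term; sigma is the floor the loss can never drop below.
print(predicted_loss(N=1e9, D=20e9))
print(predicted_loss(N=1e9, D=40e9))
```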

The Challenge with MoE

Mixture of Experts models introduce a new variable: \(E\) (the number of experts).

In an MoE model, you might have 8 experts, but for any given token, a “gating network” might only route the signal to 2 of them. This means the model has a massive number of total parameters, but a much smaller number of active parameters during inference.
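
A back-of-the-envelope sketch makes the total-versus-active gap concrete. All layer sizes below are made-up illustrative numbers, and the accounting is deliberately simplified (one gating projection plus standard two-matrix feed-forward experts):

```python
# Back-of-the-envelope sketch of total vs. active parameters in a
# top-k MoE layer. All sizes here are made-up illustrative numbers.

def moe_layer_params(d_model, d_ff, num_experts, top_k):
    """Parameter counts for one (simplified) MoE feed-forward layer.

    Each expert is a two-matrix FFN (d_model -> d_ff -> d_model);
    the gating network is a single d_model x num_experts projection.
    """
    expert_params = 2 * d_model * d_ff            # one expert's weights
    gate_params = d_model * num_experts           # router / gating network
    total = num_experts * expert_params + gate_params
    active = top_k * expert_params + gate_params  # params touched per token
    return total, active

total, active = moe_layer_params(d_model=2048, d_ff=8192, num_experts=8, top_k=2)
print(f"total params per layer:  {total:,}")
print(f"active params per token: {active:,}")
```

With 8 experts and top-2 routing, roughly a quarter of the layer’s parameters are touched per token, even though all of them must be stored in memory.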

Previous researchers attempted to fit MoE models to scaling laws by treating the number of experts as a separate logarithmic term:

Previous MoE scaling law attempt with logarithmic terms.

However, this equation suggests that as the model gets bigger, the benefit of adding more experts vanishes. The researchers behind today’s paper realized that for practical MoE setups (where \(E < 100\)), this complexity is unnecessary. They hypothesized that the fundamental power-law framework should still hold, just with a unified adjustment.

A Unified Scaling Law

The researchers propose a new, unified scaling law that bridges the gap between Dense and MoE architectures. By simplifying the interaction between model scale (\(N\)) and experts (\(E\)), they derived this elegant equation:

The proposed Unified Scaling Law equation.

In this equation:

  • \(N\) is the model scale (specifically, non-embedding FLOPs divided by tokens).
  • \(E\) is the number of experts.
  • \(D\) is the training tokens.

Does this theory hold up in reality? The team trained various MoE models (from 200M up to 1.5B parameters) on over 100 billion tokens to verify this.
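
To give a flavor of how such a law is calibrated in practice, here is a sketch of fitting coefficients to measured (N, E, D, loss) points with scipy.optimize.curve_fit. The functional form below is a placeholder that simply folds the expert count into the model-scale term; it is not the paper’s exact unified equation, and the data is synthesized purely to show the mechanics.

```python
# Sketch of calibrating a scaling law against measured training runs.
# The functional form below is a stand-in (it folds the expert count E
# into the model-scale term); it is NOT the exact equation from the paper.
import numpy as np
from scipy.optimize import curve_fit

def unified_form(X, A, B, alpha, beta, gamma, sigma):
    """Placeholder form: L = A / (N * E**gamma)**alpha + B / D**beta + sigma."""
    N, E, D = X
    return A / (N * E**gamma) ** alpha + B / D**beta + sigma

# In practice, (N, E, D, loss) tuples come from a sweep of small training
# runs (the paper uses 200M-1.5B-parameter models on ~100B tokens).
# Here we synthesize them from the form itself just to show the mechanics.
rng = np.random.default_rng(0)
N = rng.uniform(2e8, 1.5e9, size=32)
E = rng.choice([4, 8, 16], size=32).astype(float)
D = rng.uniform(2e10, 1e11, size=32)
true = (400.0, 410.0, 0.34, 0.28, 0.5, 1.7)   # made-up "ground truth"
loss = unified_form((N, E, D), *true) + rng.normal(0, 0.01, size=32)

fitted, _ = curve_fit(unified_form, (N, E, D), loss,
                      p0=[300, 300, 0.3, 0.3, 0.4, 1.5], maxfev=50000)
print(dict(zip(["A", "B", "alpha", "beta", "gamma", "sigma"], fitted.round(3))))
```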

As shown below, the experimental results (the blue line) match the predicted curve (the orange line) almost perfectly.

Figure 1: The extrapolated scaling curves for 1.5B Mixture of Experts (MoE) models.

This is a significant finding. It indicates that MoE models follow the same fundamental physics as Dense models. We don’t need to reinvent the wheel; we just need to calibrate it.

Optimal Resource Allocation: The “Compute-Optimal” Frontier

Now that we have a working equation, we can ask the most practical question in AI engineering: If I have a fixed budget of Compute (\(C\)), how should I split it between making the model bigger (\(N\)) or buying more data (\(D\))?

Mathematically, we are trying to solve this minimization problem:

\[ N_{opt}(C),\; D_{opt}(C) = \underset{N,\, D \;\text{s.t.}\; \mathrm{FLOPs}(N, D) = C}{\arg\min} \; L(N, D) \]

By taking the derivative of the loss function with respect to \(N\) and \(D\), the researchers derived formulas for the optimal number of tokens (\(D_{opt}\)) and optimal model scale (\(N_{opt}\)) for any given budget:

\[ N_{opt} \propto C^{\alpha_N} \]

\[ D_{opt} \propto C^{\alpha_D} \]
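
To see where these formulas come from, and why the two exponents should sum to one, here is the standard derivation sketch. It assumes the common approximation \(C \approx 6ND\) (the paper’s exact FLOPs accounting may differ): substitute \(D = C/6N\) into the loss and set the derivative with respect to \(N\) to zero.

\[ L(N) = \frac{A}{N^{\alpha}} + B\left(\frac{6N}{C}\right)^{\beta} + \sigma, \qquad \frac{\partial L}{\partial N} = 0 \;\Rightarrow\; N_{opt} \propto C^{\frac{\beta}{\alpha + \beta}}, \quad D_{opt} = \frac{C}{6 N_{opt}} \propto C^{\frac{\alpha}{\alpha + \beta}} \]

In other words, \(\alpha_N + \alpha_D = 1\), and the fitted exponents in Table 1 below satisfy this almost exactly.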

The “Data-Hungry” Nature of MoE

When the researchers calculated the coefficients for these equations, they found a fascinating distinction between the architectures.

Table 1: Coefficients of optimal model and data scaling allocation.

Look at the MoE Model row in Table 1.

  • The exponent for optimal model scale (\(\alpha_N\)) is 0.590, which is higher than the Dense model’s 0.507.
  • The exponent for optimal data (\(\alpha_D\)) is 0.410, which is lower than the Dense model’s 0.493.

What does this mean? For a fixed compute budget, MoE models benefit more from increasing the model size (parameters) than Dense models do, and they need proportionally fewer tokens to reach the same loss, which makes them more data-efficient. The analysis suggests MoE models have about 16.37% better data utilization than Dense models. If you are training an MoE, you should prioritize scaling up your model architecture slightly more aggressively than you would for a Dense model.
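
As a quick illustration of what these exponents imply, the snippet below uses only the Table 1 values and ignores the proportionality constants, so it shows relative growth factors rather than absolute sizes:

```python
# How a 10x larger compute budget should be split, using only the
# exponents from Table 1 (proportionality constants are ignored, so
# these are relative growth factors, not absolute sizes).

EXPONENTS = {
    "Dense": {"alpha_N": 0.507, "alpha_D": 0.493},
    "MoE":   {"alpha_N": 0.590, "alpha_D": 0.410},
}

budget_growth = 10.0  # compute budget grows by 10x

for arch, exp in EXPONENTS.items():
    n_growth = budget_growth ** exp["alpha_N"]  # N_opt grows as C**alpha_N
    d_growth = budget_growth ** exp["alpha_D"]  # D_opt grows as C**alpha_D
    print(f"{arch}: model scale x{n_growth:.1f}, training tokens x{d_growth:.1f}")

# Expected: Dense grows N by ~3.2x and D by ~3.1x, while MoE grows N by
# ~3.9x and D by only ~2.6x -- the MoE budget leans harder on model size.
```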

Hyperparameter Tuning: Batch Size and Learning Rate

Knowing the model size is half the battle. The other half is training it correctly. Two of the most critical hyperparameters in Deep Learning are Batch Size and Learning Rate. The researchers found that the optimal settings for these also follow predictable power laws.

1. Optimal Batch Size (\(B_{opt}\))

The “Critical Batch Size” is the point beyond which increasing the batch size yields diminishing returns in training speed: larger batches no longer reduce the number of optimization steps proportionally, so each additional example buys less progress. This is closely related to the “noise” in the gradients. If your gradients are very noisy, you need a larger batch size to average out that noise and get a clear signal.

The noise scale (\(B_{noise}\)) is defined in terms of the Hessian matrix (\(H\)), the true (full-batch) gradient (\(G\)), and the per-example gradient covariance (\(\Sigma\)):

\[ B_{noise} = \frac{\operatorname{tr}(H \Sigma)}{G^{\top} H G} \]

Empirically, the optimal batch size is approximately equal to this noise scale:

\[ B_{opt} \approx B_{noise} \]
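
As a concrete illustration, the sketch below estimates the noise scale for a toy linear-regression model. It uses the common simplification from the large-batch-training literature that replaces the Hessian with the identity, \(B_{simple} = \operatorname{tr}(\Sigma) / \lVert G \rVert^{2}\); this is an assumption for illustration, not the paper’s exact measurement procedure.

```python
# Minimal sketch of estimating a (simplified) gradient noise scale for a
# toy linear-regression model. Uses B_simple = trace(Sigma) / |G|**2
# (Hessian replaced by the identity), not the exact Hessian-weighted form.
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_features = 4096, 32
X = rng.normal(size=(n_examples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + rng.normal(scale=0.5, size=n_examples)  # noisy labels

w = np.zeros(n_features)  # current (untrained) parameters

# Per-example gradients of 0.5 * (x_i . w - y_i)**2:  g_i = (x_i . w - y_i) * x_i
residuals = X @ w - y
per_example_grads = residuals[:, None] * X          # shape (n_examples, n_features)

G = per_example_grads.mean(axis=0)                  # full-batch "true" gradient
Sigma_trace = per_example_grads.var(axis=0).sum()   # trace of gradient covariance

B_simple = Sigma_trace / np.dot(G, G)
print(f"estimated noise scale (rough guide to optimal batch size): {B_simple:.1f}")
```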

The researchers mapped the training loss against optimal batch sizes for both architectures.

Figure 2: Heatmap of training loss vs optimal batch size.

The results, visualized in the log-log plots below, show a consistent power-law relationship. However, there is a crucial difference.

Figure 3: Log-log relationship of training loss vs optimal batch size.

The Insight: For the same level of training loss, MoE models have a smaller optimal batch size than Dense models.

This implies that MoE models have a smaller noise scale. The gradients calculated during MoE training are “cleaner” or less noisy than those in Dense models. This allows MoE models to achieve stable optimization with fewer samples per step.

2. Optimal Learning Rate (\(\epsilon_{opt}\))

If MoE models have less gradient noise, how does that affect the Learning Rate (LR)? Generally, if your signal is clean (low noise), you can afford to take bigger steps (higher LR) without overshooting the minimum.

The theoretical relationship suggests the optimal learning rate scales with the inverse of the loss:

\[ \epsilon_{opt} \propto \frac{1}{L} \]
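
In practice, this relationship can be fitted from a small sweep and then extrapolated to larger runs. The sketch below does exactly that with made-up (loss, best learning rate) pairs; real values would come from sweeps like those behind Figures 4 and 5.

```python
# Sketch of fitting the epsilon_opt-vs-loss power law from a small sweep
# and extrapolating it to a lower target loss. All numbers are made up
# for illustration.
import numpy as np

# (training loss, best learning rate found at that loss) -- illustrative
sweep = np.array([
    [3.2, 1.00e-3],
    [2.9, 1.10e-3],
    [2.6, 1.24e-3],
    [2.4, 1.33e-3],
])

log_L, log_lr = np.log(sweep[:, 0]), np.log(sweep[:, 1])
slope, intercept = np.polyfit(log_L, log_lr, 1)   # fit: log lr = slope * log L + b
print(f"fitted exponent: {slope:.2f}  (negative => LR rises as loss falls)")

# Extrapolate the recipe to a larger run expected to reach loss ~2.0
target_loss = 2.0
lr_suggestion = np.exp(intercept + slope * np.log(target_loss))
print(f"suggested LR at loss {target_loss}: {lr_suggestion:.2e}")
```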

The researchers verified this by plotting the optimal learning rate against loss.

Figure 4: Heatmap of training loss vs optimal learning rate.

And comparing the trends:

Figure 5 & 6: Log-log relationship of loss vs learning rate, and testing loss vs compute.

(Note: The image above contains two plots. Figure 5 on the left shows the Learning Rate scaling. Figure 6 on the right shows Generalization, which we will discuss next.)

Looking at Figure 5 (left side of the image above), we see that for a fixed loss, MoE models (the reddish dots) generally support a higher learning rate compared to Dense models.

The Synthesis:

  • Dense Models: high gradient noise \(\rightarrow\) need larger batches \(\rightarrow\) require smaller learning rates to remain stable.
  • MoE Models: low gradient noise \(\rightarrow\) can use smaller batches \(\rightarrow\) can handle larger learning rates.

This combination makes MoE models significantly faster to converge during training.

Generalization: Do They Actually Perform Better?

Theoretical efficiency is great, but does it translate to better downstream performance?

Referencing Figure 6 (the right side of the image shown in the previous section), the researchers plotted Testing Loss vs. Compute Budget.

You can clearly see that the MoE curve sits below the Dense curve. This indicates that for the exact same compute budget, MoE models achieve a lower testing loss.

The researchers validated this on several difficult benchmarks, including TriviaQA, MATH, and MMLU.

Table 2: Performance details of Dense and MoE models.

Table 2 confirms the theory. Look at the comparison between Dense-1B and MoE-1.5B (which have comparable active parameters/compute profiles). The MoE model significantly outperforms the Dense model on reasoning-heavy tasks like MATH (4.24 vs 1.48) and knowledge tasks like TriviaQA (26.25 vs 20.56).

Conclusion

This research provides a “user manual” for the next generation of efficient Large Language Models. By rigorously comparing Dense and MoE architectures, the authors have confirmed that we don’t need to fly blind when scaling up Mixture of Experts models.

Key Takeaways for Students and Practitioners:

  1. Universality: The power-law scaling frameworks developed for Dense models transfer remarkably well to MoE models. You can predict performance before you train.
  2. Efficiency: MoE models are not just computationally cheaper during inference; they are fundamentally more data-efficient during training (~16% better utilization).
  3. Stability: Because of their sparse nature, MoE models experience lower gradient noise. This allows for training recipes that use smaller batch sizes and higher learning rates, speeding up convergence.
  4. Strategy: If you are allocated a fixed compute budget for an MoE model, the math suggests you should lean slightly more toward increasing the model size (number of experts/parameters) rather than just piling on more tokens.

As the AI community moves toward trillion-parameter models, these insights into resource allocation and hyperparameter dynamics will be essential for training the intelligent systems of tomorrow without breaking the bank.