The world of Large Language Models (LLMs) is currently in an arms race. From GPT-4 to Llama and Gemini, the push has been to build bigger, more capable models. However, we are hitting a wall: computational cost. Training these massive “Dense” models—where every single parameter is active for every single calculation—is becoming prohibitively expensive.
Enter the Mixture of Experts (MoE) architecture. MoE models promise a way out of this efficiency bottleneck by using “conditional computation.” Instead of firing every neuron for every token, the model activates only a specific subset of “experts” relevant to the current input. It’s the difference between consulting a whole university faculty for a simple math problem versus just asking the math professor.
But here is the billion-dollar question: Do the established rules of AI training—the “Scaling Laws”—apply to this different architecture?
Historically, we have relied on scaling laws derived from Dense models (like the famous Chinchilla laws) to predict how much data and compute we need. In this post, we will dive deep into a comparative analysis by Wang et al., investigating whether these laws transfer to MoE models and what that means for the future of efficient AI.
The Foundation: Understanding Scaling Laws
Before we dissect the differences, we need to understand the baseline. Scaling laws are essentially the “physics” of training neural networks. They observe that model performance (measured by loss) improves predictably as you increase three ingredients:
- N: The number of parameters (Model Scale).
- D: The amount of training data (Tokens).
- C: The compute budget (FLOPs).
For standard Dense models, this relationship follows a power-law. Previous research established the following equation to predict training loss:
\[
L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + \sigma
\]
Subject to the constraint that the total compute budget (\(C\)) is a function of parameters and data:
\[
C \approx 6ND
\]
Here, \(A\), \(B\), \(\alpha\), and \(\beta\) are coefficients specific to the model architecture, and \(\sigma\) represents the irreducible noise in the dataset (the best possible loss you could ever achieve).
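To make the formula concrete, here is a minimal Python sketch that evaluates this loss surface together with the compute constraint. The coefficient values (A, B, \(\alpha\), \(\beta\), \(\sigma\)) are illustrative placeholders chosen for the example, not the values fitted in the paper.

```python
# Illustrative (NOT fitted) coefficients for the Dense scaling law
# L(N, D) = A / N^alpha + B / D^beta + sigma
A, B = 400.0, 2000.0
ALPHA, BETA = 0.34, 0.28
SIGMA = 1.7  # irreducible loss

def dense_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for a Dense model with N parameters trained on D tokens."""
    return A / n_params**ALPHA + B / n_tokens**BETA + SIGMA

def compute_budget(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs via the standard C ~= 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens

# Example: a hypothetical 1B-parameter model trained on 100B tokens
print(dense_loss(1e9, 100e9), compute_budget(1e9, 100e9))
```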
The Challenge with MoE
Mixture of Experts models introduce a new variable: \(E\) (the number of experts).
In an MoE model, you might have 8 experts, but for any given token, a “gating network” might only route the signal to 2 of them. This means the model has a massive number of total parameters, but a much smaller number of active parameters during inference.
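To see what "activating only a subset of experts" looks like mechanically, here is a minimal top-2 routing sketch in NumPy. It is a simplified illustration of a softmax gating network (the gate matrix and dimensions are made up for the example), not the exact router used in the paper.

```python
import numpy as np

def top_k_route(token_embedding: np.ndarray, gate_weights: np.ndarray, k: int = 2):
    """Pick the top-k experts for one token.

    token_embedding: (d_model,) hidden state of the current token
    gate_weights:    (d_model, n_experts) hypothetical learned gating matrix
    Returns the chosen expert indices and their normalized routing weights.
    """
    logits = token_embedding @ gate_weights        # one score per expert
    top_idx = np.argsort(logits)[-k:]              # indices of the k largest scores
    probs = np.exp(logits[top_idx] - logits[top_idx].max())
    probs /= probs.sum()                           # softmax over the selected experts only
    return top_idx, probs

# 8 experts in total, but only 2 are active for this token
rng = np.random.default_rng(0)
x = rng.normal(size=16)
W_gate = rng.normal(size=(16, 8))
experts, weights = top_k_route(x, W_gate)
print(experts, weights)
```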
Previous researchers attempted to fit MoE models to scaling laws by treating the number of experts as a separate logarithmic term:
\[
\log L(N, E) = a \log N + b \log E + c \log N \cdot \log E + d
\]
However, this equation suggests that as the model gets bigger, the benefit of adding more experts vanishes. The researchers behind today’s paper realized that for practical MoE setups (where \(E < 100\)), this complexity is unnecessary. They hypothesized that the fundamental power-law framework should still hold, just with a unified adjustment.
A Unified Scaling Law
The researchers propose a new, unified scaling law that bridges the gap between Dense and MoE architectures. By simplifying the interaction between model scale (\(N\)) and experts (\(E\)), they derived this elegant equation:

In this equation:
- \(N\) is the model scale (specifically, non-embedding FLOPs divided by tokens).
- \(E\) is the number of experts.
- \(D\) is the training tokens.
Does this theory hold up in reality? The team trained various MoE models (from 200M up to 1.5B parameters) on over 100 billion tokens to verify this.
As shown below, the experimental results (the blue line) match the predicted curve (the orange line) almost perfectly.

This is a significant finding. It proves that MoE models follow the same fundamental physics as Dense models. We don’t need to reinvent the wheel; we just need to calibrate it.
Optimal Resource Allocation: The “Compute-Optimal” Frontier
Now that we have a working equation, we can ask the most practical question in AI engineering: If I have a fixed budget of Compute (\(C\)), how should I split it between making the model bigger (\(N\)) or buying more data (\(D\))?
Mathematically, we are trying to solve this minimization problem:
\[
N_{opt}(C),\; D_{opt}(C) \;=\; \underset{N,\,D\ \text{s.t.}\ \mathrm{FLOPs}(N, D)\,=\,C}{\arg\min}\; L(N, D)
\]
By taking the derivative of the loss function with respect to \(N\) and \(D\), the researchers derived formulas for the optimal number of tokens (\(D_{opt}\)) and optimal model scale (\(N_{opt}\)) for any given budget:
\[
D_{opt}(C) = K_D \, C^{\alpha_D}, \qquad N_{opt}(C) = K_N \, C^{\alpha_N}
\]

Here, \(K_D\) and \(K_N\) are fitted coefficients, while \(\alpha_D\) and \(\alpha_N\) are the fitted exponents reported in Table 1 below.
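For readers who want the missing step, here is a sketch of the standard constrained-optimization argument, assuming the Dense form \(L(N, D) = A/N^{\alpha} + B/D^{\beta} + \sigma\) and the approximation \(C \approx 6ND\) from earlier; the paper's fits follow the same pattern with its own coefficients. Substituting \(D = C/(6N)\) and setting the derivative with respect to \(N\) to zero gives:

\[
\frac{\partial}{\partial N}\left[\frac{A}{N^{\alpha}} + B\left(\frac{6N}{C}\right)^{\beta} + \sigma\right] = 0
\;\;\Longrightarrow\;\;
N_{opt} \propto C^{\frac{\beta}{\alpha+\beta}}, \qquad
D_{opt} = \frac{C}{6\,N_{opt}} \propto C^{\frac{\alpha}{\alpha+\beta}}
\]

Under these assumptions, \(\alpha_N = \beta/(\alpha+\beta)\) and \(\alpha_D = \alpha/(\alpha+\beta)\), which is why the two exponents in Table 1 sum to one for each architecture.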
The “Data-Hungry” Nature of MoE
When the researchers calculated the coefficients for these equations, they found a fascinating distinction between the architectures.
| Model | \(\alpha_N\) (optimal model scale) | \(\alpha_D\) (optimal data) |
| --- | --- | --- |
| Dense Model | 0.507 | 0.493 |
| MoE Model | 0.590 | 0.410 |

*Table 1: Fitted exponents of the compute-optimal allocation for Dense and MoE models.*
Look at the MoE Model row in Table 1.
- The exponent for optimal model scale (\(\alpha_N\)) is 0.590, which is higher than the Dense model’s 0.507.
- The exponent for optimal data (\(\alpha_D\)) is 0.410, which is lower than the Dense model’s 0.493.
What does this mean? For a fixed budget, MoE models benefit more from increasing the model size (parameters) than Dense models do, and they need relatively fewer tokens to get there because they use data more efficiently: the analysis suggests MoE models have about 16.37% better data utilization than Dense models. If you are training an MoE, you should prioritize scaling up your model architecture slightly more aggressively than you would for a Dense model.
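To make the allocation concrete, the sketch below plugs the Table 1 exponents into the power-law forms \(N_{opt} = K_N C^{\alpha_N}\) and \(D_{opt} = K_D C^{\alpha_D}\). The coefficients \(K_N\) and \(K_D\) are hypothetical placeholders (the paper fits its own values), so only the relative trend between the two architectures is meaningful here.

```python
def optimal_allocation(compute_flops: float, alpha_n: float, alpha_d: float,
                       k_n: float = 1.0, k_d: float = 1.0):
    """Split a compute budget via N_opt = K_N * C^alpha_N and D_opt = K_D * C^alpha_D.

    k_n and k_d are placeholder coefficients; the exponents come from Table 1.
    """
    n_opt = k_n * compute_flops**alpha_n
    d_opt = k_d * compute_flops**alpha_d
    return n_opt, d_opt

C = 1e21  # an arbitrary example budget in FLOPs
dense = optimal_allocation(C, alpha_n=0.507, alpha_d=0.493)
moe = optimal_allocation(C, alpha_n=0.590, alpha_d=0.410)

# With identical placeholder coefficients, the MoE exponents push a larger share
# of the budget toward model scale and a smaller share toward tokens.
print("Dense (N_opt, D_opt):", dense)
print("MoE   (N_opt, D_opt):", moe)
```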
Hyperparameter Tuning: Batch Size and Learning Rate
Knowing the model size is half the battle. The other half is training it correctly. Two of the most critical hyperparameters in Deep Learning are Batch Size and Learning Rate. The researchers found that the optimal settings for these also follow predictable power laws.
1. Optimal Batch Size (\(B_{opt}\))
The “Critical Batch Size” is the point where increasing the batch size further yields diminishing returns in training speed (measured in data efficiency). This is closely related to the “noise” in the gradients. If your gradients are very noisy, you need a larger batch size to average out that noise and get a clear signal.
The relationship between noise (\(B_{noise}\)) and the Hessian matrix (\(H\)) is defined as:
\[
B_{noise} = \frac{\operatorname{tr}(H\Sigma)}{g^{\top} H g}
\]

where \(\Sigma\) is the covariance of the per-example gradients and \(g\) is the true gradient.
Empirically, the optimal batch size is approximately equal to this noise scale:
\[
B_{opt} \approx B_{noise}
\]
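As a rough illustration of what this noise scale measures, here is a NumPy sketch of the "simple" variant from the large-batch-training literature, which replaces the Hessian with the identity. It is an approximation for intuition, not necessarily the estimator used in the paper.

```python
import numpy as np

def simple_noise_scale(per_example_grads: np.ndarray) -> float:
    """Estimate the gradient noise scale from a batch of per-example gradients.

    Uses the simplified form B_simple = tr(Sigma) / |g|^2, where Sigma is the
    per-example gradient covariance and g is the mean gradient (the Hessian is
    dropped). per_example_grads: (batch, n_params) array of flattened gradients.
    """
    g_mean = per_example_grads.mean(axis=0)                     # estimate of the true gradient
    trace_sigma = per_example_grads.var(axis=0, ddof=1).sum()   # tr(Sigma)
    return trace_sigma / (g_mean @ g_mean)

# Toy check: noisier per-example gradients -> larger noise scale -> larger optimal batch
rng = np.random.default_rng(0)
true_grad = rng.normal(size=1000)
noisy_grads = true_grad + rng.normal(scale=5.0, size=(64, 1000))
print(simple_noise_scale(noisy_grads))
```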
The researchers mapped the training loss against optimal batch sizes for both architectures.

The results, visualized in the log-log plots below, show a consistent power-law relationship. However, there is a crucial difference.

The Insight: For the same level of training loss, MoE models have a smaller optimal batch size than Dense models.
This implies that MoE models have a smaller noise scale. The gradients calculated during MoE training are “cleaner” or less noisy than those in Dense models. This allows MoE models to achieve stable optimization with fewer samples per step.
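The power-law fits behind plots like these boil down to a linear regression in log-log space. The numbers below are hypothetical stand-ins purely to demonstrate the procedure; the paper fits its own measured (loss, optimal batch size) pairs.

```python
import numpy as np

# Hypothetical (loss, optimal batch size) measurements, used only to show the
# fitting procedure -- these are not values from the paper.
losses = np.array([3.2, 2.9, 2.7, 2.5, 2.3])
batch_sizes = np.array([0.3e6, 0.5e6, 0.7e6, 1.1e6, 1.8e6])  # tokens per batch

# A power law B_opt = c * L^(-gamma) is a straight line in log-log space,
# so ordinary least squares on the logs recovers its parameters.
slope, intercept = np.polyfit(np.log(losses), np.log(batch_sizes), deg=1)
gamma, c = -slope, np.exp(intercept)
print(f"B_opt ~ {c:.3g} * L^(-{gamma:.2f})")
```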
2. Optimal Learning Rate (\(\epsilon_{opt}\))
If MoE models have less gradient noise, how does that affect the Learning Rate (LR)? Generally, if your signal is clean (low noise), you can afford to take bigger steps (higher LR) without overshooting the minimum.
The theoretical relationship suggests the optimal learning rate scales with the inverse of the loss:
\[
\epsilon_{opt} \propto \frac{1}{L}
\]
The researchers verified this by plotting the optimal learning rate against loss.

And comparing the trends:

(Note: The image above contains two plots. Figure 5 on the left shows the Learning Rate scaling. Figure 6 on the right shows Generalization, which we will discuss next.)
Looking at Figure 5 (left side of the image above), we see that for a fixed loss, MoE models (the reddish dots) generally support a higher learning rate compared to Dense models.
The Synthesis:
- Dense Models: higher gradient noise \(\rightarrow\) larger optimal batches \(\rightarrow\) smaller learning rates to remain stable.
- MoE Models: lower gradient noise \(\rightarrow\) smaller optimal batches \(\rightarrow\) larger learning rates.
This combination makes MoE models significantly faster to converge during training.
Generalization: Do They Actually Perform Better?
Theoretical efficiency is great, but does it translate to better downstream performance?
Referencing Figure 6 (the right side of the image shown in the previous section), the researchers plotted Testing Loss vs. Compute Budget.
You can clearly see that the MoE curve sits below the Dense curve. This indicates that for the exact same compute budget, MoE models achieve a lower testing loss.
The researchers validated this on several difficult benchmarks, including TriviaQA, MATH, and MMLU.
| Model | TriviaQA | MATH |
| --- | --- | --- |
| Dense-1B | 20.56 | 1.48 |
| MoE-1.5B | 26.25 | 4.24 |

*Table 2: Downstream benchmark comparison between Dense-1B and MoE-1.5B (selected results).*
Table 2 confirms the theory. Look at the comparison between Dense-1B and MoE-1.5B (which have comparable active parameters/compute profiles). The MoE model significantly outperforms the Dense model on reasoning-heavy tasks like MATH (4.24 vs 1.48) and knowledge tasks like TriviaQA (26.25 vs 20.56).
Conclusion
This research provides a “user manual” for the next generation of efficient Large Language Models. By rigorously comparing Dense and MoE architectures, the authors have confirmed that we don’t need to fly blind when scaling up Mixture of Experts models.
Key Takeaways for Students and Practitioners:
- Universality: The power-law scaling frameworks developed for Dense models transfer remarkably well to MoE models. You can predict performance before you train.
- Efficiency: MoE models are not just computationally cheaper during inference; they are fundamentally more data-efficient during training (~16% better utilization).
- Stability: Because of their sparse nature, MoE models experience lower gradient noise. This allows for training recipes that use smaller batch sizes and higher learning rates, speeding up convergence.
- Strategy: If you are allocated a fixed compute budget for an MoE model, the math suggests you should lean slightly more toward increasing the model size (number of experts/parameters) rather than just piling on more tokens.
As the AI community moves toward trillion-parameter models, these insights into resource allocation and hyperparameter dynamics will be essential for training the intelligent systems of tomorrow without breaking the bank.