Training Large Language Models (LLMs) is an expensive, high-stakes endeavor. Imagine allocating thousands of GPUs and millions of dollars to train a model like LLaMA or GPT, only to have the training run diverge halfway through. The loss value shoots up suddenly, a phenomenon known as a loss spike, and weeks of progress can be ruined.
Loss spikes are a fundamental issue in deep learning, particularly for Transformers. While engineers have developed various “band-aids”—like restarting training from a previous checkpoint or skipping data batches—the root causes remain partially obscure.
In a fascinating paper titled “Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes,” researchers from NTT Human Informatics Laboratories propose a novel explanation and a surprisingly elegant solution. They suggest that the problem lies in the non-uniformity of parameter norms. Simply put: some weights in the network are initialized to be much smaller than others. When these tiny weights are updated, they undergo massive relative changes, destabilizing the network.
Their solution is a technique called WeSaR (Weight Scaling as Reparameterization). In this post, we will tear down the mathematics of loss spikes, understand why standard initialization methods fail, and explore how WeSaR stabilizes training, allowing for faster convergence and better performance.
The Hidden Culprit: The Update Ratio
To understand why LLM training becomes unstable, we first need to look at how parameters change during training. In a neural network, we update a parameter matrix \(W\) by adding a small change, \(\Delta W\), calculated via back-propagation.
The researchers propose that the absolute size of the update \(\|\Delta W\|\) isn’t the only thing that matters. What matters is the Update Ratio:
\[ \text{Update Ratio} = \frac{\|\Delta W\|}{\|W\|} \]
This ratio represents the magnitude of the update relative to the parameter itself.
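To make this concrete, here is a rough sketch (not code from the paper) of how you might log the update ratio in PyTorch by snapshotting parameters before an optimizer step and comparing afterward; `model`, `loss`, and `optimizer` are assumed to exist in your training loop.

```python
import torch

def update_ratios(model: torch.nn.Module, before: dict) -> dict:
    """Return ||delta W|| / ||W|| for each parameter after an optimizer step."""
    ratios = {}
    for name, param in model.named_parameters():
        delta = param.detach() - before[name]
        ratios[name] = (delta.norm() / (before[name].norm() + 1e-12)).item()
    return ratios

# Usage sketch inside a training loop:
# before = {n: p.detach().clone() for n, p in model.named_parameters()}
# loss.backward(); optimizer.step(); optimizer.zero_grad()
# print(update_ratios(model, before))
```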
The Problem with Tiny Weights
Deep neural networks require specific initialization strategies (like He Initialization or Xavier Initialization) to prevent gradients from vanishing (becoming zero) or exploding (becoming infinity) as they pass through deep layers.
In Transformer models, to satisfy these requirements, specific layers—particularly the projection layers in the Feed-Forward Networks (\(W_d\))—must be initialized with very small values.
Here is the catch: if a parameter's norm \(\|W\|\) is tiny at initialization, even a moderate update \(\|\Delta W\|\) produces a massive Update Ratio.
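To put illustrative numbers on it: an update of norm \(0.005\) is a \(0.5\%\) relative change to a weight of norm \(1.0\), but a \(50\%\) relative change to a weight of norm \(0.01\):
\[\frac{0.005}{1.0} = 0.5\% \qquad \text{vs.} \qquad \frac{0.005}{0.01} = 50\%\]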

As shown in Figure 1 above, look at the bottom chart. The update ratio for \(W_d\) (the down-projection layer, plotted in red) is significantly higher than other parameters at the start of training. Now look at the top chart: this instability correlates directly with the massive spikes in loss.
The baseline method (blue line) struggles with these massive relative changes. The network is essentially performing “heart surgery” on these tiny parameters with “sledgehammer” updates.
Background: The Delicate Balance of Initialization
Before diving into the solution, we must understand why those weights were small in the first place.
Avoiding Exploding Gradients
In a deep network, we want the scale of the gradients to remain constant from the last layer back to the first. If gradients grow as they go backward, training diverges. If they shrink, the model stops learning.
Mathematically, for a layer mapping \(x\) to \(y\), we need the expectation of the gradient norms to match:
\[\mathbb{E}\!\left[\left\|\frac{\partial \mathcal{L}}{\partial x}\right\|^2\right] \approx \mathbb{E}\!\left[\left\|\frac{\partial \mathcal{L}}{\partial y}\right\|^2\right], \quad \text{where } \mathcal{L} \text{ is the loss.}\]
The Transformer Challenge
Transformers use residual connections (skip connections). The output of a layer is \(y = f(x) + x\). When you perform back-propagation through a sum, gradients flow through both paths.
\[\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial f(x)}{\partial x} + \frac{\partial \mathcal{L}}{\partial y}\]
Because of this residual path, the variance of the gradients tends to accumulate as you go deeper into the network (which can have dozens of layers). To counteract this, standard initialization techniques (like those used in GPT-2) scale down the weights of the residual layers by a factor of \(1/\sqrt{2N}\), where \(N\) is the number of layers.
For a 13-billion parameter model with 40 layers, this factor is roughly \(0.11\). This forces matrices like \(W_d\) to be initialized with much smaller norms than other matrices. This satisfies the gradient stability requirement but violates the update stability requirement, leading to the spikes we saw in Figure 1.
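Here is a minimal sketch of this depth-dependent shrinking in the GPT-2 style (the base standard deviation of 0.02 is illustrative); it shows where the \(0.11\) comes from and why \(W_d\) ends up so small.

```python
import math
import torch.nn as nn

num_layers = 40                                   # depth of a 13B-class model
base_std = 0.02                                   # typical base initialization std
residual_scale = 1.0 / math.sqrt(2 * num_layers)  # ~= 0.112 for 40 layers

def init_down_projection(linear: nn.Linear) -> None:
    # Residual output projections (e.g., W_d) get the extra 1/sqrt(2N) shrink,
    # which is what leaves them with much smaller norms than other matrices.
    nn.init.normal_(linear.weight, mean=0.0, std=base_std * residual_scale)
```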
The Solution: WeSaR
The researchers propose WeSaR to resolve this conflict. The goal is to satisfy two conflicting requirements simultaneously:
- Gradient Requirement: The effective weights must be small to prevent exploding gradients.
- Update Requirement: The actual weight parameters must be large enough so that updates (\(\Delta W\)) don’t cause massive relative changes.
Decoupling Scale from Direction
WeSaR achieves this by reparameterizing the weights. Instead of learning a single matrix \(W\) that is used directly in the computation, the model learns a scalar gate \(\alpha\) and a matrix \(W\), and uses their product as the effective weight.
The forward pass looks like this:
\[\bar{W}_{\cdot} = \alpha_{\cdot}\, W_{\cdot}\]
Here:
- \(W_{\cdot}\) is the Actual Parameter.
- \(\bar{W}_{\cdot}\) is the Virtual Parameter used in the computation.
- \(\alpha\) is a trainable Gate Parameter.
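A minimal PyTorch sketch of such a gated layer might look like the following; the module and argument names are mine, not the paper's, and the default \(\sigma\) is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLinear(nn.Module):
    """WeSaR-style reparameterization sketch: the effective weight is alpha * W."""
    def __init__(self, in_features: int, out_features: int,
                 sigma: float = 0.02, alpha_init: float = 1.0):
        super().__init__()
        # Actual parameter W: drawn with the same standard deviation for every layer.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * sigma)
        # Trainable scalar gate alpha carrying the layer-specific scale.
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Virtual parameter \bar{W} = alpha * W is what the computation sees.
        return F.linear(x, self.alpha * self.weight)
```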
How It Works
1. Uniform Initialization of Actual Parameters (\(W\)): In standard methods, different layers get different standard deviations based on their size or depth. In WeSaR, all actual parameter matrices (\(W\)) are initialized with the same standard deviation \(\sigma\).
\[W_{\cdot} \sim \mathcal{N}(0,\ \sigma^2) \quad \text{for every actual parameter matrix}\]
Crucially, this \(\sigma\) can be chosen to be small but uniform, ensuring that no single layer is “fragile” relative to others.
2. Adjustment via the Gate (\(\alpha\)): The necessary scaling required for stable back-propagation (the \(1/\sqrt{2N}\) factor mentioned earlier) is applied to the gate parameter \(\alpha\), not the matrix \(W\).
This means the effective weight \(\bar{W}\) is small (satisfying gradient constraints), but the actual weight \(W\) is standard (satisfying update stability).
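Building on the sketch above, one plausible initialization routine (the paper's exact recipe may differ) keeps \(\sigma\) uniform across layers and pushes the \(1/\sqrt{2N}\) factor into the gate.

```python
import math
import torch
import torch.nn as nn

def init_gated_layer(layer, num_layers: int, is_residual_projection: bool,
                     sigma: float = 0.02) -> None:
    """Initialize a GatedLinear from the sketch above (illustrative recipe)."""
    # 1. Every actual parameter matrix W gets the same standard deviation sigma.
    nn.init.normal_(layer.weight, mean=0.0, std=sigma)
    # 2. The gate, not the matrix, absorbs the scaling needed for gradient stability,
    #    e.g., the 1/sqrt(2N) shrink for residual output projections such as W_d.
    scale = 1.0 / math.sqrt(2 * num_layers) if is_residual_projection else 1.0
    with torch.no_grad():
        layer.alpha.fill_(scale)
```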
Why Adam Doesn’t Break This
You might wonder: if we scale the weights, won’t the gradients just scale to match, cancelling out the benefit? This is where the Adam optimizer comes in.
Adam updates parameters based on the ratio of the gradient momentum (\(M_t\)) and the square root of the gradient variance (\(V_t\)).
\[\Delta W_t = -\eta\, \frac{M_t}{\sqrt{V_t} + \epsilon}\]
(ignoring bias correction; \(\eta\) is the learning rate and \(\epsilon\) a small constant).
When we reparameterize \(\bar{W} = \alpha W\), the gradients with respect to \(W\) are scaled by \(\alpha\). However, because Adam divides the momentum by the root variance, this scalar scaling factor largely cancels out in the update step.
Scaling the gradients by \(\alpha\) scales \(M_t\) by \(\alpha\) and \(V_t\) by \(\alpha^2\), so (ignoring \(\epsilon\)):
\[\frac{\alpha M_t}{\sqrt{\alpha^2 V_t}} = \frac{M_t}{\sqrt{V_t}} \qquad (\alpha > 0)\]
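A quick numerical sanity check (a toy example, not from the paper) makes the cancellation concrete: feeding Adam a gradient sequence scaled by a constant \(\alpha\) produces essentially the same final update as the unscaled sequence.

```python
import torch

def adam_final_update(grads, beta1=0.9, beta2=0.999, eps=1e-8, lr=1e-3):
    """Run Adam's moment updates over a sequence of scalar gradients."""
    m = torch.tensor(0.0)
    v = torch.tensor(0.0)
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (v_hat.sqrt() + eps)

torch.manual_seed(0)
grads = torch.randn(200)                 # a fake gradient sequence for one weight
alpha = 0.11                             # the kind of shrink a residual projection sees

u_plain = adam_final_update(list(grads))
u_scaled = adam_final_update([alpha * g for g in grads])
print(u_plain.item(), u_scaled.item())   # nearly identical despite the 0.11x scaling
```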
This theoretical insight confirms that WeSaR changes the geometry of the optimization landscape in a way that stabilizes the relative updates without breaking the learning process.
Experimental Results
The team tested WeSaR on Transformer models ranging from 130 Million to 13 Billion parameters. They compared it against the popular Small Initialization method (used in GPT-J and other open-source models) and other reparameterization techniques like Weight Normalization.
Eliminating Loss Spikes
The primary goal was to kill the spikes. Looking at the training loss for the 13B parameter model, the difference is stark.

As seen in Figure 2, the baseline (Small Init) suffers from sharp spikes early in training. The WeSaR model (Red) not only avoids these spikes but achieves a consistently lower loss value throughout the run.
Stabilizing Update Ratios
Does WeSaR actually fix the “Update Ratio” problem?

Figure 9 (above) shows the update ratios for the 40th layer of the 13B model. The dashed lines (Baseline) show massive fluctuations. The solid lines (Proposed WeSaR) are remarkably stable. This stability confirms that initializing all weights uniformly prevents specific layers from becoming “hypersensitive” to updates.
Investigating Parameter Norms
To prove that the mechanism works as intended, the researchers plotted the norms of the weights during training.

In Figure 3, notice the “Baseline” (Blue) line. It starts near zero and grows linearly. This growth represents the model desperately trying to increase the magnitude of those tiny initialized weights. This rapid growth is associated with instability.
In contrast, the Proposed method (Red and Pink lines) maintains a stable, higher norm for the actual parameters \(W_d\) and \(W_u\). The necessary scaling is handled by \(\alpha\), leaving the matrix \(W\) stable.
Performance Comparison
Finally, does this stability translate to better language modeling? Yes.

Table 5 shows that WeSaR achieves lower (better) perplexity on WikiText and LAMBADA benchmarks across all model sizes (130M, 1.3B, and 13B).
Notably, WeSaR outperformed other sophisticated reparameterization techniques as well.

As shown in Table 6, WeSaR is more effective than Weight Normalization and \(\sigma\)Reparam. Furthermore, because WeSaR only adds a single scalar multiplication (the gate) rather than complex normalization operations (like dividing by the standard deviation of the whole matrix), it is computationally cheaper. Weight Normalization, for instance, increased training time by 12.6%, whereas WeSaR added negligible overhead.
Hyperparameter Robustness
One of the hidden benefits of WeSaR is that it decouples the standard deviation of initialization (\(\sigma\)) from the hidden dimension size (\(d\)).
In standard Transformers, you often initialize weights with \(\sigma = \sqrt{1/d}\). As models get wider (larger \(d\)), \(\sigma\) must get smaller. WeSaR allows you to set a fixed, small \(\sigma\) (e.g., \(\sigma^2 = 4\times 10^{-5}\)) regardless of model size.
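As a worked comparison with an illustrative hidden size of \(d = 5120\):
\[\sigma_{\text{standard}} = \sqrt{1/5120} \approx 0.014, \qquad \sigma_{\text{WeSaR}} = \sqrt{4\times 10^{-5}} \approx 0.0063 \ \text{(fixed for every model size)}\]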

Table 9 demonstrates that WeSaR is robust to different \(\sigma\) values and, crucially, allows for higher learning rates (1e-3 vs the standard 5e-4). Being able to train at higher learning rates without crashing is a “holy grail” for LLM pre-training, as it speeds up convergence significantly.
Conclusion
The “Loss Spike” has long been a ghost in the machine for LLM practitioners—a random, frustrating event usually blamed on “bad data” or “bad luck.” The work by Nishida et al. shines a light on a structural cause: the arithmetic conflict between gradient stability (which demands tiny weights) and update stability (which demands normal-sized weights).
WeSaR offers a compelling solution:
- Reparameterize: Use a gate \(\alpha\) to handle the scale.
- Initialize Uniformly: Let the actual weights \(W\) sit in a safe, uniform range.
By doing so, WeSaR stabilizes the update ratios, prevents loss spikes, and enables faster, more aggressive training schedules. As language models continue to scale to trillions of parameters, efficient and stable initialization techniques like WeSaR will likely become standard components in the deep learning stack.