The world of Large Language Models (LLMs) is dominated by a single, relentless trend: scaling. With every generation, models get larger, context windows get longer, and the capabilities become more impressive. However, this growth comes with a hefty price tag. Fine-tuning these massive models on specific downstream tasks—like medical diagnosis or legal summarization—requires computational resources that are out of reach for most researchers and students.
To solve this, the community turned to Parameter-Efficient Fine-Tuning (PEFT). Methods like LoRA (Low-Rank Adaptation) became the gold standard, letting us train only a tiny number of additional parameters while the original weights stay frozen. But there is a hidden inefficiency in how we currently approach PEFT: most popular methods focus heavily on the Attention blocks of the Transformer architecture.
But what about the rest of the model?
In this post, we will dive deep into SparseGrad, a novel research paper that shifts the focus to the often-overlooked MLP (Multi-Layer Perceptron) blocks. We will explore how transforming gradients into a new “sparse” space allows us to fine-tune these dense layers efficiently, outperforming current state-of-the-art methods like LoRA.
The Elephant in the Room: MLP Blocks
Before understanding the solution, we must understand the problem with current PEFT approaches. A Transformer model is essentially a stack of layers, primarily consisting of Attention mechanisms and Feed-Forward Networks (MLPs).
Popular techniques like LoRA typically apply low-rank matrices to the Attention weights. While effective, this ignores a massive portion of the model’s anatomy. As the authors of SparseGrad point out, MLP blocks actually contain about half—sometimes significantly more—of the total parameters in modern Transformers.

As shown in Table 1 above, for models like LLaMa-2, the MLP blocks constitute 64% of the total parameter count. By ignoring these blocks or struggling to fine-tune them efficiently, we are leaving a massive amount of the model’s capacity on the table.
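If you want to sanity-check this split yourself, a few lines of PyTorch over the Hugging Face weights are enough. The snippet below is only an illustrative sketch: it assumes you have access to a LLaMA-2 checkpoint on the Hub, and it counts every parameter whose module path contains `.mlp.`, which matches the naming used by the Hugging Face LLaMA implementation.

```python
from transformers import AutoModelForCausalLM

# Load a LLaMA-2 checkpoint (gated on the Hub; any decoder-only model works).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Count parameters living in the feed-forward (MLP) sub-modules.
mlp_params = sum(p.numel() for name, p in model.named_parameters() if ".mlp." in name)
total_params = sum(p.numel() for p in model.parameters())

print(f"MLP share of parameters: {100 * mlp_params / total_params:.1f}%")
```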
The challenge has always been that MLPs are “dense.” Unlike attention heads which can be approximated well with low-rank decomposition, MLP layers usually require updating a lot of parameters to change the model’s behavior effectively, which spikes memory usage. The SparseGrad researchers propose a different route: Selectivity via Transformation.
The Core Intuition: Changing the Perspective
Imagine you are looking at a starry night sky. From where you stand, the stars (data points) look scattered randomly everywhere. If you wanted to “paint” over the important stars, you’d have to paint the whole sky.
But what if you could tilt your head (or rotate the universe) in such a way that all the important stars aligned into a single, thin line? Suddenly, you only need to paint that one line, and you can ignore the rest of the black space.
This is the intuition behind SparseGrad.
The researchers hypothesized that while the gradients (the signals telling the model how to update weights) look dense and messy in the standard coordinate system, there exists a specific basis (a specific rotation of the mathematical space) where these gradients become sparse. In this new space, only about 1% of the elements are significant, and the rest are near zero.
The SparseGrad Method
The SparseGrad method operates in three distinct phases: the Preliminary Phase (finding the rotation), the Layer Replacement (applying the rotation), and the Sparse-by-Dense Multiplication (optimizing the computation).
1. The Preliminary Phase: Finding the Basis
To find the “angle” that makes the gradients sparse, the method starts with a calibration step.
- Freeze the entire model.
- Unfreeze only the linear layers in the MLP blocks.
- Run a few steps of standard backpropagation on a small set of data.
This generates a collection of weight gradient matrices. The researchers then stack these matrices into a massive 3D tensor and apply a technique called Higher Order SVD (HOSVD).
HOSVD allows them to decompose this tensor to find two transformation matrices, \(U\) and \(V\). These matrices act as our “rotators.” They define the transition from the standard parameter space to the new, sparse parameter space.
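To make this concrete, here is a minimal sketch of what the calibration step could look like in PyTorch. The function name `find_sparsifying_basis`, the use of `torch.linalg.svd` on the mode unfoldings, and the convention that \(U\) rotates the input side while \(V\) rotates the output side are my own illustrative assumptions, not the authors' code.

```python
import torch

def find_sparsifying_basis(grad_snapshots):
    """Estimate the rotations U and V from a list of weight-gradient matrices
    (each of shape out_features x in_features) collected during calibration.
    This is a plain HOSVD: each factor is the matrix of left singular vectors
    of the corresponding mode unfolding of the stacked gradient tensor."""
    G = torch.stack(grad_snapshots)                            # (steps, out, in)

    out_unfold = G.permute(1, 0, 2).reshape(G.shape[1], -1)    # rows indexed by out_features
    in_unfold = G.permute(2, 0, 1).reshape(G.shape[2], -1)     # rows indexed by in_features

    V = torch.linalg.svd(out_unfold, full_matrices=True).U     # (out, out) output-side rotation
    U = torch.linalg.svd(in_unfold, full_matrices=True).U      # (in, in)  input-side rotation
    return U, V
```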
Does it work? Visually, the difference is striking.

In Figure 2, the heatmap on the left shows the gradients in the original space—activity is scattered everywhere. The heatmap on the right shows the gradients after being multiplied by \(U\) and \(V^T\). Notice how the activity is now concentrated, leaving vast areas of “green” (near-zero values). This sparsity is what we want to exploit.
2. The SparseGradLinear Layer
Once the transformation matrices \(U\) and \(V\) are found, the researchers replace the standard Linear layers in the MLP blocks with a custom SparseGradLinear layer.
In a standard linear layer, the output \(Y\) is calculated as \(Y = XW^T\). In the SparseGrad layer, the weights are stored in the transformed space. Let’s call the new weights \(\tilde{W}\).
The mathematical relationship is \(\tilde{W}^T = UW^TV^T\).
To make this work during training without breaking the network, the layer effectively becomes a sandwich of three operations:
- Transform the input (fixed operation).
- Apply the trainable sparse weights (core operation).
- Transform the output back to the original space (fixed operation).

Figure 1 illustrates this flow. In the SparseGradLinear Layer (bottom row), the input passes through \(U^T\), then the trainable weights \(\tilde{W}^T\), and finally \(V\). Crucially, during backpropagation, gradients for \(U\) and \(V\) are not computed. They are frozen. We only update \(\tilde{W}\), and because we are in the “sparse” space, we only need to update the top ~1% of influential values in \(\tilde{W}\).
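Putting these pieces together, a minimal PyTorch sketch of such a layer might look like the following. The class name, the `density` argument, and the use of a gradient hook to keep only the largest ~1% of gradient entries are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class SparseGradLinear(nn.Module):
    """Sketch of a linear layer whose trainable weights live in the transformed space."""

    def __init__(self, linear: nn.Linear, U: torch.Tensor, V: torch.Tensor,
                 density: float = 0.01):
        super().__init__()
        # Fixed rotations: U acts on the input side, V on the output side.
        self.register_buffer("U", U)                 # (in_features,  in_features)
        self.register_buffer("V", V)                 # (out_features, out_features)

        # Trainable weights stored in the new basis: W_tilde^T = U W^T V^T.
        self.W_tilde_T = nn.Parameter(U @ linear.weight.data.T @ V.T)

        # Keep only the largest ~1% of gradient entries on every backward pass.
        self.density = density
        self.W_tilde_T.register_hook(self._sparsify_grad)

        self.bias = linear.bias                      # bias (if any) is reused unchanged

    def _sparsify_grad(self, grad: torch.Tensor) -> torch.Tensor:
        k = max(1, int(self.density * grad.numel()))
        threshold = grad.abs().flatten().kthvalue(grad.numel() - k + 1).values
        return grad * (grad.abs() >= threshold)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ((X U^T) W_tilde^T) V reproduces X W^T as long as U and V are orthogonal.
        y = ((x @ self.U.T) @ self.W_tilde_T) @ self.V
        return y if self.bias is None else y + self.bias
```

Because \(U\) and \(V\) are registered as buffers, they never receive gradients; only `W_tilde_T` is updated, and the hook discards all but the most influential entries of its gradient.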
To ensure that PyTorch handles the gradients correctly across these transformations, the authors derive custom autograd rules for the transformed layer.
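For intuition, applying the standard chain rule to the forward pass \(Y = ((XU^T)\tilde{W}^T)V\) gives gradients of the following form (a sketch of the derivation; the paper's exact notation may differ):

\[
\frac{\partial \mathcal{L}}{\partial \tilde{W}} = V \left(\frac{\partial \mathcal{L}}{\partial Y}\right)^{T} X U^{T},
\qquad
\frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial Y} V^{T} \tilde{W} U.
\]

Only the first expression is needed to update the trainable weights; the second simply propagates the signal to earlier layers.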

3. Sparse-by-Dense Optimization
Simply having mathematical sparsity isn’t enough; computers need to know how to skip the zeros to save time and memory.
The researchers analyzed the gradients in the new space and found a “strided” structure: as shown in Figure 3 below, entire rows of the gradient matrix are often zero, with the nonzero entries concentrated in just a few rows.

The histogram on the right confirms that for the vast majority of the training, the percentage of nonzero elements remains incredibly low (often under 1%).
To leverage this, they implemented a Sparse-by-Dense matrix multiplication strategy. Instead of performing a full matrix multiplication (which would waste time multiplying zeros), they select only the indices of rows and columns that contain nonzero elements.
The update matrix \(C\) is then computed only over these selected row and column indices, and the resulting nonzero values are mapped back into Coordinate (COO) format so that the optimizer can apply a sparse update to \(\tilde{W}\).
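As a rough illustration of this idea (not the authors' implementation; the function name, the argument shapes, and the whole-row selection are assumptions), the restricted multiplication and the COO packing could look like this:

```python
import torch

def sparse_by_dense_grad(grad_out_t, x_t, active_rows):
    """Compute only the nonzero rows of the weight gradient in the new basis.

    grad_out_t  : (batch, out) loss gradient w.r.t. the transformed output
    x_t         : (batch, in)  transformed input (X U^T)
    active_rows : LongTensor with the indices of the few rows that actually change
    """
    # The full dense gradient would be grad_out_t.T @ x_t with shape (out, in);
    # here we multiply only the selected rows against the dense input.
    C = grad_out_t[:, active_rows].T @ x_t                     # (n_active, in)

    # Pack the computed values into a sparse COO tensor for the optimizer.
    n_in = x_t.shape[1]
    rows = active_rows.repeat_interleave(n_in)
    cols = torch.arange(n_in, device=x_t.device).repeat(active_rows.numel())
    return torch.sparse_coo_tensor(torch.stack([rows, cols]), C.reshape(-1),
                                   size=(grad_out_t.shape[1], n_in))
```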
This trick is vital. Without it, SparseGrad would just be a theoretical mathematical improvement. With it, it becomes a practical, high-speed training method.
Performance and Efficiency
So, how does SparseGrad stack up against the titans of PEFT like LoRA and other selective methods like MeProp?
Speed and Memory
One of the biggest selling points of LoRA is that it reduces memory usage significantly. SparseGrad aims to compete here while targeting the denser MLP layers.

Table 3 shows the breakdown on the GLUE benchmark.
- Regular Fine-Tuning (FT): Uses the most memory (1345 MB) and is slower.
- LoRA: The most memory-efficient (944 MB) and fastest.
- SparseGrad (SD, Sparse-by-Dense): Sits comfortably in the middle. It uses about 12% less memory than regular fine-tuning and runs significantly faster than the naive SparseGrad implementation.
While LoRA still holds the crown for absolute minimum memory usage (roughly a 30% saving over full fine-tuning), SparseGrad’s more modest saving still makes it a viable option for hardware-constrained environments, especially considering the performance gains we discuss next.
Accuracy on NLU Tasks (GLUE)
Efficiency is useless if the model becomes stupid. The researchers tested SparseGrad on the GLUE benchmark (a suite of natural language understanding tasks) using BERT and RoBERTa models.

Table 5 highlights a key finding: SparseGrad outperforms LoRA.
- On BERT, SparseGrad achieved an average score of 82.6, compared to LoRA’s 81.6. It even marginally beat the Full Fine-Tuning (82.5), suggesting that the sparsity constraint acts as a beneficial regularizer (preventing overfitting).
- On RoBERTa-base, SparseGrad (83.6) again beat LoRA (83.1).
The trend continues with larger models.

In Table 4, using the RoBERTa-large model, SparseGrad scores 92.4 on STSB, beating LoRA’s 92.1. Across the board, the results suggest that updating MLP layers in this transformed, sparse space yields better representations than updating Attention layers via low-rank approximation.
Generative Tasks: LLaMa-2
To prove this isn’t just for older encoder-only models, the authors applied SparseGrad to LLaMa-2 (7B parameters) for a Question-Answering task using the OpenAssistant dataset.

Table 6 shows the results evaluated by GPT-4 (using the MT-Bench protocol).
- SparseGrad Score: 5.132
- LoRA Score: 5.025
- Regular FT Score: 4.407
Surprisingly, both PEFT methods beat the Full Fine-Tuning (likely due to overfitting in the full model on limited data), but SparseGrad took the top spot.
The paper includes an interesting qualitative example in its appendix. When asked to write a persuasive email for an introverted friend, the LoRA model wrote a generic email. The SparseGrad model, however, specifically acknowledged the friend’s anxiety about public speaking (“As you know, public speaking can be a nerve-wracking experience…”), demonstrating a higher level of nuance and instruction-following capability.
Conclusion: A New Tool for the Toolkit
The introduction of SparseGrad teaches us an important lesson about Deep Learning: sometimes the standard coordinate system we use to view our data isn’t the most efficient one.
By mathematically rotating the gradients of the massive MLP blocks, SparseGrad reveals that these dense layers are actually quite sparse in information content. We don’t need to update every single parameter to get state-of-the-art performance; we just need to find the right parameters.
Key Takeaways:
- MLPs Matter: Ignoring MLP blocks in fine-tuning leaves performance on the table.
- Basis Transformation: Rotating gradients into a basis found via HOSVD makes them highly sparse (~99% of entries near zero).
- Efficiency vs. Performance: SparseGrad is slightly heavier on memory than LoRA but consistently delivers better model performance on both NLU and Generative tasks.
For students and researchers looking to fine-tune models where final accuracy is paramount, SparseGrad offers a compelling alternative to LoRA. It reminds us that in the era of massive models, looking at the problem from a different angle (literally, a different vector basis) can make the impossible manageable.