Large Language Models (LLMs) like LLaMA and GPT-4 have revolutionized natural language processing, but their power comes at a cost. With billions of parameters, these models are computationally expensive to run, memory-hungry, and environmentally demanding. This has sparked a race to make them smaller, faster, and more efficient—without sacrificing their remarkable performance.

Two popular techniques for model compression are quantization (using fewer bits to represent numbers) and network pruning (removing weights entirely). While quantization has seen rapid progress in the LLM space, pruning has lagged behind. Why? Because most traditional pruning methods require costly retraining to recover lost accuracy, which is often prohibitively expensive for billion-parameter models. Even the simplest method—removing the smallest weights, known as magnitude pruning—fails dramatically on LLMs. This suggests that large language models are uniquely sensitive to this kind of simplification.

This is where a new approach from researchers at CMU, Meta AI, and Bosch AI comes in. They introduce a method called Wanda (Pruning by Weights and Activations), which is remarkably simple, requires no retraining or weight updates, and proves to be incredibly effective. Wanda challenges the long-held assumption that pruning decisions should depend solely on weight magnitudes. By considering the input activations in addition, it provides a much clearer picture of a weight’s true importance—unlocking an efficient and powerful way to create sparse, high-performing LLMs.

In this post, we’ll explore how Wanda works, why it succeeds where others stumble, and what its results reveal about the hidden structure of large language models.


Background: The Pruning Problem and an LLM-Specific Clue

Before unpacking Wanda, let’s review two ideas that motivated its design.

Magnitude Pruning: The Classical Baseline

Magnitude pruning is the most intuitive way to make a neural network sparse. The principle is simple: weights with small absolute values contribute less to the model’s outputs than those with large absolute values. To prune, you set a sparsity target (say, 50%), rank all weights by their magnitudes, and remove the smallest ones.

This method has long been a strong baseline, often followed by fine-tuning to let the remaining weights readjust. But when applied to LLMs, magnitude pruning collapses. A 50% sparse LLaMA model constructed this way performs disastrously, revealing that in LLMs, raw weight size doesn’t tell the whole story.
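
As a concrete reference point, here is a minimal PyTorch sketch of layer-wise magnitude pruning (my own illustration, not code from any paper): rank a layer's weights by absolute value and zero out the smallest fraction.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Classical baseline: zero out the smallest-magnitude weights of a layer."""
    k = int(weight.numel() * sparsity)                     # number of weights to remove
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    return weight * (weight.abs() > threshold)             # keep only weights above the cutoff
```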

The Clue: Emergent Large-Magnitude Features

So what else determines importance? A clue comes from a fascinating property in very large Transformer models—emergent large-magnitude features. Researchers at FAIR observed that once a model reaches around six billion parameters, a small subset of feature dimensions develops activation values that are up to 100× larger than average.

These “outlier” features aren’t noise—they’re essential. If you mask them, the model’s predictive ability collapses. This discovery explained why standard quantization methods often failed: they couldn’t represent these extreme values accurately. But it also suggested something deeper. If certain input features have huge magnitudes, then the weights connected to them are likely critical, regardless of their own size.
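
To make "100× larger than average" concrete, here is a small, hypothetical PyTorch snippet for spotting candidate outlier dimensions in a layer's activations; the 20× threshold is purely illustrative and not the criterion used in the original study.

```python
import torch

def find_outlier_features(hidden_states: torch.Tensor, ratio: float = 20.0) -> torch.Tensor:
    """Flag feature dimensions whose typical magnitude dwarfs the rest.

    hidden_states: (num_tokens, hidden_dim) activations from one Transformer layer.
    Returns the indices of candidate outlier dimensions.
    """
    per_dim = hidden_states.abs().mean(dim=0)   # mean |activation| per feature dimension
    return torch.nonzero(per_dim > ratio * per_dim.median()).flatten()
```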


The Core of Wanda: Using Activations to Judge Weights

Armed with this insight, the authors built Wanda—a hybrid of simplicity and data-awareness. It adds a new dimension to pruning by factoring in both the weights and their associated input activations. Wanda has two key components: a smarter pruning metric and a more localized comparison strategy.

Let’s start with a motivating example.

Imagine a single neuron computes:

\[ y = w_1x_1 + w_2x_2 \]

Magnitude pruning would look only at the weights: if \(|w_1| < |w_2|\), it prunes \(w_1\). But what if \(x_1\) is an outlier feature, massively larger than \(x_2\)? In that case, \(|w_1x_1|\) might exceed \(|w_2x_2|\), meaning \(w_1\) contributes much more to the output. Pruning \(w_1\) would be the wrong decision.
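
Plugging in some made-up numbers makes the failure mode obvious:

```python
w1, w2 = 0.1, 1.0      # |w1| < |w2|, so magnitude pruning would drop w1
x1, x2 = 100.0, 1.0    # but x1 is an outlier feature with a huge activation

print(abs(w1) < abs(w2))            # True: magnitude pruning removes w1
print(abs(w1 * x1), abs(w2 * x2))   # 10.0 vs 1.0: w1 actually contributes 10x more to y
```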

This simple example exposes magnitude pruning’s flaw: it ignores the scale of the data flowing through the weights. Wanda fixes this by weighting each connection’s importance using both weight magnitude and activation magnitude.


1. The Wanda Metric: Weights × Activations

For each weight \(\mathbf{W}_{ij}\) connecting input neuron \(j\) to output neuron \(i\), Wanda assigns an importance score:

\[ \mathbf{S}_{ij} = |\mathbf{W}_{ij}| \cdot \|\mathbf{X}_j\|_2 \]


Each weight is scored by multiplying its magnitude by the norm of its corresponding input activation.

Let’s break it down:

  • \(|\mathbf{W}_{ij}|\): absolute value of the weight.
  • \(\|\mathbf{X}_j\|_2\): L2 norm of the j-th input feature’s activations, computed over a small calibration batch (often 128 sequences from C4).

Their product yields a simple yet powerful measure of a weight’s true influence. If an input dimension consistently produces large activations, its associated weights receive higher importance scores—even if the weights themselves are small.

This makes Wanda remarkably robust: the L2 norms can be estimated from a tiny calibration set, without retraining or access to the original training data.
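
In PyTorch, the metric amounts to a single broadcasted multiplication. The sketch below is my own paraphrase of the formula (not the authors' released code), assuming the calibration activations for a linear layer have already been flattened into a `(num_tokens, in_features)` tensor:

```python
import torch

def wanda_scores(weight: torch.Tensor, calib_inputs: torch.Tensor) -> torch.Tensor:
    """Score each weight as |W_ij| * ||X_j||_2.

    weight:       (out_features, in_features) matrix of a linear layer
    calib_inputs: (num_tokens, in_features) activations from a small calibration batch
    """
    feature_norms = calib_inputs.norm(p=2, dim=0)   # ||X_j||_2 for each input feature j
    return weight.abs() * feature_norms             # broadcast across output rows
```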


2. Per-Output Pruning: Comparing Weights Locally

Once every weight has a score, we decide which to remove. Most pruning methods compare weights across an entire layer, keeping the largest ones globally (layer-wise pruning). Wanda instead makes comparisons per output neuron.

For each output neuron, Wanda ranks that neuron's incoming weights on their own and removes, say, the lowest-scoring half of them.

A diagram comparing Magnitude Pruning and Wanda. Magnitude pruning groups all weights together, while Wanda groups them by output row and multiplies by activation norms before pruning.

Wanda prunes weights locally per output neuron, multiplying by input activation norms before ranking. This yields a more balanced and effective sparsity pattern.

Surprisingly, this local pruning rule works far better for LLMs. Keeping each output neuron evenly connected preserves structural balance within layers—a property seemingly critical for these models’ stability. Interestingly, this advantage doesn’t extend to vision models, underscoring its LLM-specific nature.
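
Concretely, the comparison group is each row of the weight matrix. A minimal sketch of building such a per-output mask from the scores (again my own illustration, with a 50% default) might look like this:

```python
import torch

def per_output_mask(scores: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Keep-mask that prunes the lowest-scoring weights independently within each output row."""
    out_features, in_features = scores.shape
    num_prune = int(in_features * sparsity)
    prune_idx = torch.argsort(scores, dim=1)[:, :num_prune]  # lowest scores in every row
    mask = torch.ones_like(scores)
    mask.scatter_(1, prune_idx, 0.0)                          # zero out the pruned positions
    return mask.bool()
```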


The Wanda Procedure: A One-Pass Algorithm

Putting it all together, the Wanda pruning process is exceptionally straightforward:

  1. Collect Activations: Run a forward pass on a small calibration batch to capture input activations for each layer.
  2. Compute Norms: Calculate the L2 norm for each input dimension, forming a per-feature scale vector.
  3. Score Weights: Compute Wanda’s importance scores as \(|W_{ij}| \cdot \|\mathbf{X}_j\|_2\).
  4. Prune Per Output: Within each row of the weight matrix, set the lowest-scoring weights to zero until your target sparsity is reached.

No gradients. No iterative updates. After pruning, the model can be used immediately.
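
Stitching the pieces together, a compact and deliberately simplified end-to-end sketch could use forward hooks to capture each linear layer's inputs on the calibration batch and then prune in place. Unlike the authors' implementation, which processes layers sequentially to save memory, this toy version stores all captured activations at once:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def wanda_prune_model(model: nn.Module, calib_batch: torch.Tensor, sparsity: float = 0.5) -> None:
    """One-pass Wanda-style pruning of every nn.Linear in `model` (illustrative sketch)."""
    captured = {}

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach()
            captured[name] = x.reshape(-1, x.shape[-1])       # (num_tokens, in_features)
        return hook

    # 1. Collect activations with temporary forward hooks on one calibration pass
    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules() if isinstance(m, nn.Linear)]
    model(calib_batch)
    for h in handles:
        h.remove()

    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        # 2-3. Per-feature norms and importance scores |W_ij| * ||X_j||_2
        feature_norms = captured[name].norm(p=2, dim=0)
        scores = module.weight.abs() * feature_norms
        # 4. Prune the lowest-scoring weights within each output row
        num_prune = int(module.in_features * sparsity)
        prune_idx = torch.argsort(scores, dim=1)[:, :num_prune]
        mask = torch.ones_like(scores)
        mask.scatter_(1, prune_idx, 0.0)
        module.weight.mul_(mask)                              # zero the pruned weights in place
```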


Theory Connection: Simplifying Second-Order Pruning

Wanda’s simplicity hides a fascinating theoretical connection to complex second-order methods. SparseGPT, for instance, formulates pruning as solving a layer-wise reconstruction problem using Hessian (second-order derivative) information. Its pruning metric depends on matrix inverses of the Hessian:

\[ \mathbf{S}_{ij} = \left[ \, |\mathbf{W}|^2 \, \big/ \, \operatorname{diag}\!\left( (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1} \right) \right]_{ij} \]

where \(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}\) is the (regularized) Hessian of the layer-wise reconstruction problem.

Computing those inverses is immensely expensive. The Wanda authors show that if you assume the Hessian is diagonal and remove regularization (\(\lambda = 0\)), the SparseGPT metric simplifies to the square of the Wanda metric:

\[ \mathbf{S}_{ij} = \left[ \, |\mathbf{W}|^2 \, \big/ \, \operatorname{diag}\!\left( (\mathbf{X}^\top\mathbf{X})^{-1} \right) \right]_{ij} = |\mathbf{W}_{ij}|^2 \cdot \|\mathbf{X}_j\|_2^2 = \left( |\mathbf{W}_{ij}| \cdot \|\mathbf{X}_j\|_2 \right)^2, \]

since the diagonal of \(\mathbf{X}^\top\mathbf{X}\) has entries \(\|\mathbf{X}_j\|_2^2\), and inverting a diagonal matrix simply inverts each entry.

Under mild assumptions, the computationally heavy SparseGPT metric reduces to Wanda’s simple product form, providing a practical approximation to second-order pruning.

This result grounds Wanda theoretically: it captures the essential information of second-order methods while eliminating their computational burden.
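
As a quick sanity check of that algebra (my own toy verification, with made-up dimensions), one can build a diagonal Hessian from per-feature norms and confirm that the SparseGPT score equals the squared Wanda score when \(\lambda = 0\):

```python
import numpy as np

rng = np.random.default_rng(0)
c_out, c_in, n = 4, 6, 128                      # toy layer sizes and calibration tokens
W = rng.normal(size=(c_out, c_in))
X = rng.normal(size=(c_in, n)) * rng.uniform(0.1, 10.0, size=(c_in, 1))  # per-feature scales

# The simplifying assumption: keep only the diagonal of the Hessian, with lambda = 0
H = np.diag(np.sum(X**2, axis=1))

sparsegpt_score = W**2 / np.diag(np.linalg.inv(H))    # [|W|^2 / diag(H^{-1})]_ij
wanda_score = np.abs(W) * np.linalg.norm(X, axis=1)   # |W_ij| * ||X_j||_2

assert np.allclose(sparsegpt_score, wanda_score**2)   # identical up to the square
```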

A comparison of pruning methods summarizes this tradeoff:

A table comparing Magnitude, SparseGPT, and Wanda pruning methods on weight updates, calibration data, pruning metric, and complexity.

Wanda combines theoretical elegance with practical efficiency, achieving comparable results at a fraction of the computational cost.


Experiments and Results

The authors evaluated Wanda extensively on LLaMA and LLaMA‑2 models, comparing against magnitude pruning and SparseGPT.

Zero‑Shot Accuracy

On seven standard zero-shot tasks, Wanda dramatically outperforms magnitude pruning and matches SparseGPT—all without any weight updates.

Table showing mean zero-shot accuracies for LLaMA and LLaMA-2 models pruned with different methods. Wanda is highly competitive with SparseGPT.

Wanda consistently performs near or equal to SparseGPT, despite needing no weight updates.

A standout result: at 50% sparsity, LLaMA‑65B and LLaMA‑2‑70B pruned with Wanda perform nearly as well as the original dense models. This suggests that large LLMs contain exact sparse sub-networks, ones that work as-is, with no weight updates or fine-tuning needed.


Language Modeling Perplexity

Perplexity measures predictive capability (lower is better). On WikiText, Wanda delivers strong results.

Table showing WikiText perplexity for pruned LLaMA and LLaMA-2 models. Wanda again performs comparably to SparseGPT and much better than magnitude pruning.

Wanda nearly matches SparseGPT while vastly outperforming magnitude pruning.

For example, magnitude pruning drives LLaMA‑7B's perplexity up to 17.29, while Wanda achieves 7.26, nearly matching SparseGPT's 7.22 and far closer to the dense model's 5.68.


Speed and Efficiency

Wanda’s design isn’t only effective—it’s fast.

Wanda's pruning metric can be computed orders of magnitude faster than SparseGPT's.

Table comparing the time in seconds to compute pruning metrics for SparseGPT and Wanda across different model sizes. Wanda is orders of magnitude faster.

Wanda completes pruning hundreds of times faster than SparseGPT—critical for massive models.

SparseGPT takes over 1,350 seconds (22 minutes) on a 65B model, while Wanda finishes in just 5.6 seconds, a 240× speedup.

Structured sparsity (like NVIDIA’s 2:4 scheme) can further accelerate inference. Using Wanda-produced 2:4 masks gives about 1.6× faster matrix multiplications.

Table showing inference speedup in milliseconds for matrix multiplication in LLaMA-65B with 2:4 sparsity.

Wanda enables direct hardware acceleration benefits with structured sparsity.
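
For reference, deriving a 2:4 mask from Wanda scores is just a grouped top-k along the input dimension (sketch below). Actually realizing the speedup additionally requires exporting the masked weights to the hardware's semi-structured sparse format, which this snippet does not do:

```python
import torch

def wanda_24_mask(scores: torch.Tensor) -> torch.Tensor:
    """Keep the 2 highest-scoring weights in every contiguous group of 4 inputs."""
    out_features, in_features = scores.shape
    assert in_features % 4 == 0
    grouped = scores.reshape(out_features, in_features // 4, 4)
    top2 = grouped.topk(2, dim=-1).indices        # indices of the 2 largest scores per group
    mask = torch.zeros_like(grouped)
    mask.scatter_(-1, top2, 1.0)
    return mask.reshape(out_features, in_features).bool()
```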


Diving Deeper: Ablations and Insights

To understand why Wanda is so effective, the authors ran a series of ablation studies.

Fine‑Tuning Improves Sparse Models

First, they tested whether fine‑tuning can restore pruned models’ performance. Even lightweight fine‑tuning methods like LoRA bring noticeable gains.

Table showing that fine-tuning (both LoRA and Full) can significantly recover the performance of a Wanda-pruned model, bringing it very close to the dense original.

LoRA and full fine-tuning close nearly all the gap between Wanda-pruned and dense models.

With LoRA, structured 2:4 sparse LLaMA‑7B increases from 48.5% to 54.5% accuracy. Full fine‑tuning pushes unstructured 50% sparse models to 58.1%, within a whisker of the original 59.9%.
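
In practice, this kind of recovery can be set up with an off-the-shelf adapter library. A rough sketch using the Hugging Face `peft` package follows; the hyperparameters are illustrative rather than the paper's exact configuration. Because LoRA freezes the base weights, the sparsity mask is preserved throughout fine-tuning:

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,        # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],          # attention projections of LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(pruned_model, lora_cfg)    # pruned_model: the Wanda-pruned LLM
model.print_trainable_parameters()                # only the LoRA adapters are trainable
# Train with a standard causal-LM objective; the frozen, pruned base weights keep their zeros.
```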


Which Component Matters More?

Is Wanda’s success driven by its activation-based metric or per-output pruning rule? The authors systematically tested combinations.

Table ablating the pruning metric and comparison group. The best performance is achieved by combining Wanda’s metric with its per-output comparison group.

The combination of Wanda’s metric and per-output grouping yields the best results by a clear margin.

The results show that both components contribute substantially, and only their combination reaches the lowest perplexity.


Robustness to Calibration Data

SparseGPT’s Hessian-based metric needs many calibration samples for stability. Wanda’s norm-based metric, by contrast, works well even with one.

A line chart showing perplexity vs. the number of calibration samples for SparseGPT and Wanda. Wanda is much more robust, performing well even with a single sample.

Wanda’s pruning score is stable with very few samples, making it ideal when data is scarce.

This robustness further underscores Wanda’s practicality for real-world deployments.


Weight Updates: Are They Necessary?

SparseGPT requires iterative weight updates to offset pruned connections. Wanda skips this entirely—but does that hurt?

Table showing the effect of adding a weight update step to magnitude pruning and Wanda. It helps magnitude pruning significantly but offers almost no benefit to Wanda.

Adding weight updates helps magnitude pruning but barely affects Wanda—its pruned models already work “as is.”

For magnitude pruning, updates dramatically improve results; for Wanda, they barely help, confirming that Wanda finds inherently strong sparse subnetworks within the pretrained weights.


Conclusion: The Elegance of Simplicity

Wanda demonstrates that smart simplicity beats complexity for LLM pruning. By incorporating activation magnitudes into the pruning decision and pruning per output neuron, it captures the essence of weight importance in LLMs—efficiently and effectively.

Key takeaways:

  1. Activations matter: Outlier features make activation-aware metrics essential for large models.
  2. Structure matters: Per-output pruning preserves balance and is crucial for stability in LLMs.
  3. Sparse subnetworks exist: Wanda’s strong performance without updates indicates that pretrained LLMs contain ready-to-use sparse architectures.

Fast, data-efficient, and theoretically grounded, Wanda offers a practical baseline for future LLM pruning—and invites deeper exploration of sparsity and efficiency at scale.

In the race to build faster, leaner AI, Wanda reminds us that sometimes the smartest solutions are the simplest.