Large Language Models (LLMs) like those in the GPT family have revolutionized AI, demonstrating remarkable capabilities across tasks—from writing code to summarizing documents and generating creative text. But this power comes at a staggering computational cost. The largest open-source models, such as OPT‑175B and BLOOM‑176B, contain around 175 billion parameters each, demanding enormous storage and memory. Running such a model for inference can require multiple high-end GPUs—for example, five NVIDIA A100s with 80 GB of memory each—placing it well out of reach for most developers and researchers.

How can we make these massive models more accessible?

The answer lies in model compression. One widely used strategy, quantization, reduces the precision of a model’s weights (for example, from 16-bit numbers to 4-bit values). A complementary approach, pruning, removes some of the model’s connections entirely—making its weight matrices sparse.
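
To make the two ideas concrete, here is a minimal NumPy sketch (a toy illustration of the concepts, not code from any particular library) that applies naive round-to-nearest 4-bit quantization and 50% magnitude pruning to a small weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)   # a toy "layer" weight matrix

# Quantization: snap each weight to one of 16 levels (4 bits, round-to-nearest).
scale = np.abs(W).max() / 7                      # symmetric 4-bit grid: integers -8..7
W_quant = np.clip(np.round(W / scale), -8, 7) * scale

# Pruning: zero out the 50% of weights with the smallest magnitude.
threshold = np.median(np.abs(W))
mask = np.abs(W) >= threshold                    # True = keep, False = prune
W_sparse = W * mask

print("mean quantization error:", np.abs(W - W_quant).mean())
print("fraction of zeros after pruning:", 1 - mask.mean())
```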

However, pruning giant LLMs has proven daunting: the most accurate pruning methods depend on extensive retraining to recover the lost accuracy, and retraining a model with hundreds of billions of parameters is computationally unaffordable. Simpler one‑shot pruning techniques exist—they prune without any retraining—but they either fail to maintain acceptable accuracy or are far too slow to apply to billion‑parameter models.

This is the challenge addressed by the groundbreaking paper SparseGPT: Massive Language Models Can Be Accurately Pruned in One‑Shot. The researchers introduce SparseGPT, the first method that can accurately and efficiently prune massive language models in a single pass, with no fine‑tuning. The results are astonishing: SparseGPT can remove over 100 billion parameters—achieving up to 60% sparsity—with negligible accuracy loss.

In this article, we’ll unpack how SparseGPT works, step through its core algorithm, and explore why its results represent a milestone in efficient large‑scale model deployment.


Understanding the Challenge: Why Pruning Breaks Down at Scale

Modern neural networks are typically compressed layer by layer—a process known as layer‑wise pruning. Instead of tackling all parameters at once, we work on each layer individually. For a given layer with weight matrix \( \mathbf{W}_{\ell} \), the goal is to find a sparse version that behaves almost identically to the original. Mathematically, we aim to find a binary mask \( \mathbf{M}_{\ell} \) (0 for pruned weights, 1 for kept weights) and updated weights \( \widehat{\mathbf{W}}_{\ell} \) that minimize output differences.

Formally, the objective is

\[
\operatorname*{argmin}_{\mathbf{M}_{\ell},\,\widehat{\mathbf{W}}_{\ell}} \;\bigl\lVert\, \mathbf{W}_{\ell}\mathbf{X}_{\ell} - \bigl(\mathbf{M}_{\ell} \odot \widehat{\mathbf{W}}_{\ell}\bigr)\mathbf{X}_{\ell} \,\bigr\rVert_{2}^{2},
\]

where \( \mathbf{X}_{\ell} \) denotes the layer’s input and \( \odot \) is element‑wise multiplication. In other words, pruning at the layer level seeks to reconstruct the original layer output as closely as possible after removing selected weights.

Solving this optimization involves two interlocking steps:

  1. Mask Selection: Decide which weights to prune—often using magnitude pruning, which simply removes the smallest‑valued weights.
  2. Weight Reconstruction: Adjust the remaining weights to compensate for lost information.

The difficulty is that these steps depend on each other: the best mask depends on how well we can reconstruct, and vice versa. Sophisticated approaches like Optimal Brain Compression (OBC) attempt to handle both jointly—removing weights one at a time and perfectly reconstructing the rest—but this scales terribly. For a model like OPT‑175B, OBC would require weeks or months.
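
To see why the two steps are coupled, consider what happens if we stop after Step 1. The sketch below (a toy example of ours, not SparseGPT) selects a 50% magnitude mask per row and measures how far the layer output drifts when the surviving weights are left untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))      # layer weights (rows x input features)
X = rng.normal(size=(64, 256))     # calibration inputs (features x samples)

# Step 1 -- mask selection: magnitude pruning keeps the largest 50% per row.
k = W.shape[1] // 2
keep = np.argsort(-np.abs(W), axis=1)[:, :k]
M = np.zeros_like(W, dtype=bool)
np.put_along_axis(M, keep, True, axis=1)

# Skipping Step 2 -- the surviving weights keep their original values,
# and the layer output already drifts noticeably.
W_masked = W * M
err = np.linalg.norm(W @ X - W_masked @ X) / np.linalg.norm(W @ X)
print(f"relative output error with mask only, no reconstruction: {err:.3f}")
```

The residual error measured here is what Step 2 (weight reconstruction) must recover, and how well it can be recovered depends on which weights the mask removed.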

Even faster methods, such as AdaPrune, still take hours to prune just a billion‑parameter model, with runtime scaling linearly with model size. Extrapolated to 175 B parameters, this would take several weeks on a single GPU.

SparseGPT overcomes these limitations through a new algorithm that drastically accelerates weight reconstruction—making highly accurate one‑shot pruning feasible for models 100× larger than previous limits.


The Core Innovation: How SparseGPT Works

SparseGPT reframes layer‑wise pruning as an approximate sparse regression problem—and introduces a clever mechanism to make solving it practical for enormous matrices.

1. The Ideal but Impractical Solution

If we fix a pruning mask \( \mathbf{M} \), the exact optimal updates to the remaining weights can be computed by solving a regression for each row \( \mathbf{w}^i \) of the weight matrix:

For row \( i \), the optimal values of the unpruned weights are given by the masked least‑squares solution

\[
\mathbf{w}^{i}_{\mathbf{M}_i} = \bigl(\mathbf{X}_{\mathbf{M}_i}\mathbf{X}_{\mathbf{M}_i}^{\top}\bigr)^{-1}\,\mathbf{X}_{\mathbf{M}_i}\bigl(\mathbf{w}^{i}\mathbf{X}\bigr)^{\top},
\]

where \( \mathbf{X}_{\mathbf{M}_i} \) contains only the rows of the input corresponding to the weights kept in row \( i \).

Here, \( \mathbf{X}_{\mathbf{M}_i}\mathbf{X}_{\mathbf{M}_i}^{\top} \) is the Hessian matrix capturing second‑order curvature information. The challenge is that every row has a different pruning mask \( \mathbf{M}_i \), meaning a distinct Hessian inversion per row. In a large Transformer layer where the hidden size \( d_{\text{hidden}} \) exceeds 10,000, this translates to over ten thousand separate matrix inversions of dimension \(10{,}000\times10{,}000\); since each inversion costs \( O(d_{\text{hidden}}^3) \), the total work scales as \( O(d_{\text{hidden}}^4) \), which is completely infeasible.

Illustration of the row-Hessian challenge. Each row has a different pruning mask (white indicates pruned weights), requiring separate, costly inversions.

Figure: Each row’s unique pruning pattern prevents reuse of Hessian information, making scalability impossible.
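
For a fixed mask, the formula above is a masked least-squares solve, and translating it directly into code makes the bottleneck visible: every row builds and inverts its own masked Hessian. The sketch below is our own literal (and deliberately inefficient) rendering of that formula:

```python
import numpy as np

def exact_row_reconstruction(w, X, mask):
    """Optimal values for the kept weights of one row, given a fixed mask:
    w_M = (X_M X_M^T)^{-1} X_M (w X)^T, i.e. a least-squares refit of the
    surviving weights against the original row output w X."""
    X_M = X[mask]                          # input rows feeding the kept weights
    H_M = X_M @ X_M.T                      # this row's masked Hessian
    w_new = np.zeros_like(w)
    w_new[mask] = np.linalg.solve(H_M, X_M @ (w @ X))
    return w_new

rng = np.random.default_rng(0)
d, n = 64, 256
W = rng.normal(size=(16, d))
X = rng.normal(size=(d, n))

# Every row draws a different mask, so every row needs its own inversion --
# exactly the pattern that fails to scale once d exceeds 10,000.
masks = rng.random(W.shape) > 0.5
W_hat = np.stack([exact_row_reconstruction(w, X, m) for w, m in zip(W, masks)])
print("relative output error after exact refit:",
      np.linalg.norm(W @ X - W_hat @ X) / np.linalg.norm(W @ X))
```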

2. Observing the Iterative Structure: Optimal Brain Surgeon (OBS)

To circumvent the bottleneck, the authors looked to the Optimal Brain Surgeon (OBS) framework. OBS describes how to optimally update the remaining weights when pruning one specific weight \( w_m \), using precomputed inverse‑Hessian information. The OBS update provides an exact compensation vector \( \boldsymbol{\delta}_m \) and the associated scalar error \( \varepsilon_m \):

\[
\boldsymbol{\delta}_m = -\,\frac{w_m}{\bigl[\mathbf{H}^{-1}\bigr]_{mm}} \cdot \mathbf{H}^{-1}_{:,m},
\qquad
\varepsilon_m = \frac{w_m^{2}}{\bigl[\mathbf{H}^{-1}\bigr]_{mm}},
\qquad
\mathbf{H} = \mathbf{X}\mathbf{X}^{\top}.
\]

The OBS rule shows how the remaining weights adapt when one parameter is removed: the compensation depends only on the weight’s value and entries of the inverse Hessian.

Applying OBS repeatedly—one weight at a time—could in principle reproduce the exact optimal solution. But iterating this for billions of weights would still be far too slow.
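
In code, a single OBS step is tiny once \( \mathbf{H}^{-1} = (\mathbf{X}\mathbf{X}^{\top})^{-1} \) is available. The sketch below (our illustration, under the squared-output-error objective used throughout) also checks that the predicted error \( \varepsilon_m \) matches the actual increase in reconstruction error:

```python
import numpy as np

def obs_prune_one(w, H_inv, m):
    """Remove weight m from row w and optimally compensate the rest (OBS).
    Returns the updated row and the incurred squared error eps_m."""
    eps = w[m] ** 2 / H_inv[m, m]                  # error caused by pruning w_m
    delta = -(w[m] / H_inv[m, m]) * H_inv[:, m]    # compensation for all weights
    return w + delta, eps                          # (w + delta)[m] is exactly 0

rng = np.random.default_rng(0)
d, n = 32, 128
w = rng.normal(size=d)
X = rng.normal(size=(d, n))
H_inv = np.linalg.inv(X @ X.T)

w_pruned, eps = obs_prune_one(w, H_inv, m=7)
print("pruned weight is zero:", np.isclose(w_pruned[7], 0.0))
print("predicted error eps_m :", eps)
print("actual output error   :", np.sum((w @ X - w_pruned @ X) ** 2))
```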

3. SparseGPT’s Breakthrough: Hessian Synchronization

SparseGPT introduces a powerful simplification. Instead of updating all weights after each pruning step, the algorithm only updates weights to the right of the pruned column while freezing those on the left.

This column‑wise progression means all rows share the same subset of “active” weights during each update step. The result: the same inverse Hessian information can be reused for all rows.

A visualization of the SparseGPT algorithm. It processes the matrix column by column. When a weight is pruned (dark blue), it updates later columns, while frozen weights stay fixed. The shared inverse Hessians (orange) are reused across rows.

Figure: SparseGPT processes columns sequentially, updating only unfrozen weights—allowing efficient reuse of inverse Hessian blocks.

When moving from column \( j \) to \( j+1 \), the algorithm performs a lightweight update of the inverse Hessian requiring only \( O(d_{\text{col}}^2) \) time—drastically faster than recomputing it from scratch. This Hessian synchronization turns a computation that would scale as \( O(d_{\text{hidden}}^4) \) into one scaling as \( O(d_{\text{hidden}}^3) \): a reduction by a full factor of \( d_{\text{hidden}} \), which exceeds 10,000 in the largest layers.
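
Putting the pieces together, here is a deliberately simplified sketch of the column-by-column loop for unstructured sparsity (function and variable names are our own; the real implementation adds block-wise mask selection, Cholesky-based updates, and other refinements):

```python
import numpy as np

def sparsegpt_sketch(W, X, sparsity=0.5, damp=0.01):
    """Simplified SparseGPT-style loop: walk columns left to right, prune the
    lowest-score weights of each column, compensate only the not-yet-processed
    columns, and share one inverse Hessian across all rows."""
    W = W.copy()
    d = W.shape[1]
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d)       # dampening for stability
    H_inv = np.linalg.inv(H)
    n_prune = int(W.shape[0] * sparsity)              # toy choice: fixed count per column

    for j in range(d):
        # Toy mask selection via the OBS error score w^2 / [H^-1]_jj
        # (the real method selects adaptively over blocks of columns).
        scores = W[:, j] ** 2 / H_inv[j, j]
        prune_rows = np.argsort(scores)[:n_prune]

        # OBS compensation restricted to columns j..d-1; earlier columns stay frozen.
        delta = -np.outer(W[prune_rows, j] / H_inv[j, j], H_inv[j, j:])
        W[prune_rows, j:] += delta
        W[prune_rows, j] = 0.0                        # pruned weights are exactly zero

        # "Hessian synchronization": eliminate column j from the shared inverse
        # Hessian with one O(d^2) rank-one update instead of a fresh inversion.
        H_inv = H_inv - np.outer(H_inv[:, j], H_inv[j, :]) / H_inv[j, j]

    return W

rng = np.random.default_rng(0)
W_sparse = sparsegpt_sketch(rng.normal(size=(16, 64)), rng.normal(size=(64, 256)))
print("achieved sparsity:", np.mean(W_sparse == 0.0))
```

The crucial property is that `H_inv` is updated once per column and shared by every row, which is what eliminates the per-row inversions.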

Practically, SparseGPT can prune 175‑billion‑parameter models in under five hours on a single A100 GPU—something previously unthinkable.


Going Beyond Magnitude: Adaptive Mask Selection

SparseGPT doesn’t just prune efficiently—it chooses which weights to remove intelligently.

Instead of simply discarding the smallest weights, the algorithm estimates each weight’s pruning impact via the OBS error term \( \varepsilon_m \), which reflects how well the remaining weights can compensate. This makes it possible to identify low‑impact weights that aren’t necessarily small in magnitude.

To keep computation tractable, SparseGPT performs this selection block‑wise—for example, evaluating 128 columns at a time. This “iterative blocking” enables adaptive, non‑uniform sparsity distribution across columns while still preserving parallel efficiency. Sensitive columns can remain denser, while redundant ones are pruned more aggressively.
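
The selection step can be sketched as follows, using stand-in values for the inverse-Hessian diagonal (our illustration of the idea, not the paper’s exact procedure):

```python
import numpy as np

def blockwise_mask(W_block, H_inv_diag, sparsity=0.5):
    """Adaptive mask selection over one block of columns (e.g. 128 at a time):
    score every weight by the OBS error w^2 / [H^-1]_jj and prune the
    lowest-scoring half of the whole block, so individual columns may end up
    denser or sparser than 50%."""
    scores = W_block ** 2 / H_inv_diag            # broadcast over rows
    k = int(scores.size * sparsity)
    cutoff = np.partition(scores, k, axis=None)[k]
    return scores >= cutoff                       # True = keep

rng = np.random.default_rng(0)
W_block = rng.normal(size=(256, 128))             # one 128-column block of a layer
H_inv_diag = rng.uniform(0.5, 2.0, size=128)      # stand-in for diag(H^-1)
mask = blockwise_mask(W_block, H_inv_diag)
print("overall block sparsity:", 1 - mask.mean())
print("per-column density from", mask.mean(axis=0).min(), "to", mask.mean(axis=0).max())
```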


Structured Sparsity and Joint Quantization

SparseGPT naturally extends to semi‑structured sparsity patterns such as NVIDIA’s 2:4 format, where in every group of four consecutive weights exactly two are zero. The algorithm simply applies its OBS‑based selection criterion within these fixed blocks.
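
A minimal sketch of applying the same error score under a 2:4 constraint (mask selection only; the real algorithm also applies the compensation updates described earlier, and the inverse-Hessian diagonal here is again a stand-in):

```python
import numpy as np

def prune_2_4(W, H_inv_diag):
    """Enforce 2:4 sparsity: in every group of 4 consecutive weights along a
    row, zero the 2 with the lowest OBS error score w^2 / [H^-1]_jj."""
    rows, cols = W.shape
    assert cols % 4 == 0
    scores = (W ** 2 / H_inv_diag).reshape(rows, cols // 4, 4)
    drop = np.argsort(scores, axis=-1)[..., :2]   # two lowest scores per group of four
    mask = np.ones_like(scores, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return W * mask.reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
H_inv_diag = rng.uniform(0.5, 2.0, size=16)       # stand-in for diag(H^-1)
W_24 = prune_2_4(W, H_inv_diag)
# Every group of four consecutive weights now contains exactly two zeros.
print(np.count_nonzero(W_24.reshape(8, 4, 4), axis=-1))
```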

Even more remarkably, SparseGPT’s internal structure mirrors that of GPTQ, the leading post‑training quantization method. Since both rely on inverse‐Hessian updates, they can be merged into a single pass that performs pruning and quantization simultaneously.

This joint procedure yields compressed models that are both sparse and low‑precision, with no extra computational overhead—a powerful step toward fully optimized LLMs.
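
As a rough sketch of how the merge works (our own simplified rendering, not the authors’ fused GPTQ implementation; all names are ours), the column loop shown earlier can quantize the kept weights of each column in the same pass, feeding both the pruning error and the quantization error through the shared inverse Hessian:

```python
import numpy as np

def joint_prune_quantize(W, X, sparsity=0.5, bits=4, damp=0.01):
    """Single pass that prunes and quantizes column by column.  Pruned weights
    contribute their full value as error, kept weights contribute only their
    round-to-nearest quantization error; both are compensated via the shared
    inverse Hessian."""
    W = W.copy()
    n_rows, d = W.shape
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d)
    H_inv = np.linalg.inv(H)
    n_prune = int(n_rows * sparsity)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax                         # crude per-tensor scale

    for j in range(d):
        col = W[:, j].copy()
        prune = np.argsort(col ** 2 / H_inv[j, j])[:n_prune]        # OBS error score
        new_col = np.clip(np.round(col / scale), -qmax - 1, qmax) * scale
        new_col[prune] = 0.0
        err = col - new_col                                # pruning + quantization error
        W[:, j] = new_col
        W[:, j + 1:] -= np.outer(err / H_inv[j, j], H_inv[j, j + 1:])
        H_inv = H_inv - np.outer(H_inv[:, j], H_inv[j, :]) / H_inv[j, j]
    return W

rng = np.random.default_rng(0)
W_c = joint_prune_quantize(rng.normal(size=(16, 64)), rng.normal(size=(64, 256)))
# Note: small kept weights may also round to zero, so sparsity can exceed 50%.
print("sparsity:", np.mean(W_c == 0.0),
      "| distinct nonzero values:", np.unique(W_c[W_c != 0.0]).size)
```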


Experimental Results: SparseGPT in Action

The authors extensively benchmarked SparseGPT on the OPT and BLOOM model families, ranging from 125 M to 176 B parameters. The findings redefine what’s possible in large‑scale compression.

1. Superior Accuracy at High Sparsity

A comparison of SparseGPT and magnitude pruning on OPT‑175B. SparseGPT retains near‑baseline perplexity up to 60% sparsity, while magnitude pruning collapses early.

Figure 1: SparseGPT maintains accuracy up to 60% sparsity—more than six times further than magnitude pruning.

Perplexity, the standard measure of language model quality, skyrockets for magnitude‑pruned models beyond 10–30% sparsity. SparseGPT, in contrast, remains nearly indistinguishable from the dense baseline up to 60%. In practical terms, 100 billion weights can be removed without degrading text modeling quality.
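
For reference, perplexity is the exponentiated average negative log-likelihood a model assigns to held-out text, so lower is better and even small increases signal a real loss in modeling quality:

\[
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}\!\left(x_i \mid x_{<i}\right) \right).
\]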

2. Bigger Models Are Easier to Prune

Perplexity versus model size for different sparsity patterns. As models grow (right), all sparse curves converge to the dense baseline, showing that larger models are more compressible.

Figure 2: The performance gap between dense and sparse models vanishes as scale increases.

A striking discovery: larger models lose less accuracy for the same sparsity level. Across OPT variants from 125 M to 175 B parameters, the accuracy gap narrows dramatically, disappearing entirely for the largest models.

This suggests that as model size increases, redundancy grows faster than complexity, making enormous LLMs surprisingly easy to compress—an encouraging sign for efficient scaling.

3. Zero‑Shot Evaluation on Downstream Tasks

SparseGPT’s pruned models preserve not only language modeling ability but also downstream task competence.

Zero‑shot accuracy across tasks for OPT‑175B. Magnitude pruning fails completely, while SparseGPT matches dense performance—even under structured sparsity.

Figure 3: SparseGPT retains high zero‑shot accuracy across multiple benchmarks despite heavy pruning.

Across benchmarks like LAMBADA, PIQA, ARC, and StoryCloze, magnitude‑pruned models perform near random chance, while SparseGPT’s 50% sparse and 2:4 structured versions remain close to dense performance.

4. Merging Sparsity with Quantization

When the team combined 50% sparsity with 4-bit quantization, the resulting model occupied the same memory as a dense 3-bit quantized model—but achieved higher accuracy.
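
The memory match follows from simple accounting, under the assumption (ours, for illustration) that the sparse model stores its surviving 4-bit weights plus a 1-bit keep/prune flag for every original weight position:

\[
\underbrace{0.5 \times 4\ \text{bits}}_{\text{surviving weights}} \;+\; \underbrace{1\ \text{bit}}_{\text{mask}} \;=\; 3\ \text{bits per original weight},
\]

which is the same average footprint as a dense 3-bit model.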

Comparing a joint 50% sparse + 4-bit model with a 3-bit quantized equivalent. The hybrid approach consistently outperforms pure low-bit quantization for large models.

Figure 4: The combined approach (50% sparse + 4‑bit) surpasses traditional 3‑bit quantization at identical memory usage.

This synergy between pruning and quantization demonstrates that hybrid compression can outperform even specialized low‑bit alternatives.


Implications and Outlook

SparseGPT delivers a decisive proof of concept: accurate, efficient one‑shot pruning is possible for gigantic LLMs.

Key takeaways:

  1. Scalable One‑Shot Pruning Works. Through Hessian sharing and approximate updates, SparseGPT scales pruning from millions to hundreds of billions of parameters.
  2. Over‑parameterization Enables Compression. The finding that larger models are easier to prune highlights deep intrinsic redundancy within modern architectures.
  3. Gateway to Affordable Inference. Reducing active weights by 50–60% paves the way for faster and cheaper deployment—especially on CPUs or specialized sparse hardware.

The broader implication is profound: dense LLMs contain highly capable sparse subnetworks that can be uncovered directly, without additional training. SparseGPT identifies these subnetworks efficiently, accelerating the path toward democratized access to high‑performance language models.

By revealing that pruning and quantization can coexist seamlessly, SparseGPT charts a practical route to running world‑class models on everyday hardware—all while preserving near‑original capability.