Introduction

The capabilities of Large Language Models (LLMs) like Llama 3 and Qwen2.5 are growing at a staggering pace. However, as these models scale to hundreds of billions of parameters, the computational cost to run them—specifically during inference—is becoming prohibitive. Inference has two main bottlenecks: the compute-bound prefilling stage (processing your prompt) and the memory-bound generation stage (spitting out tokens one by one).

To make these models run on standard hardware (or just run faster on data center GPUs), we rely on quantization—reducing the precision of the model’s numbers from 16-bit floating-point (FP16) to integers like 8-bit or 4-bit. While quantizing weights is relatively solved, quantizing activations (the temporary data flowing through the network) and the KV cache (the model’s memory of the conversation) to 4-bit without turning the model into gibberish remains a massive challenge.

Why? Because of outliers. In LLM activations, a tiny fraction of values are orders of magnitude larger than the rest. If you try to squeeze these outliers onto a tiny 4-bit grid, you either clip them (losing critical information) or stretch the grid so wide that the small, nuanced values drown in quantization noise.

In this post, we take a deep dive into ResQ (Residual Quantization), a recently published method that proposes a mathematically principled way to handle these outliers. By combining mixed-precision quantization with principal component analysis (PCA), ResQ achieves state-of-the-art performance, effectively unlocking accurate 4-bit inference.

The Background: The War on Outliers

Before understanding ResQ, we need to understand the current landscape of quantization.

The Problem

When we quantize a matrix \(X\), we map its continuous values to a discrete set of integers. The standard equation looks like this:

\[
Q(X) = s_X \left\lfloor \frac{X}{s_X} \right\rceil, \qquad s_X = \frac{\max |X|}{2^{b-1} - 1},
\]

where \(b\) is the bit width and \(\lfloor\cdot\rceil\) denotes rounding to the nearest integer (values outside the representable range are clamped).

Here, \(s_X\) is the scale factor. If matrix \(X\) has one massive outlier (e.g., 100) while most values are between -1 and 1, \(s_X\) must be large to accommodate the 100. Consequently, the values between -1 and 1 might all get rounded to 0, destroying the signal.
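
To make this concrete, here is a minimal NumPy sketch of symmetric round-to-nearest quantization. The helper name quantize_rtn and the toy data are illustrative, not taken from the paper; the point is what a single outlier does to the scale factor.

```python
import numpy as np

def quantize_rtn(x, bits=4):
    """Symmetric round-to-nearest quantization with a per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1                # 7 for 4-bit
    s = np.abs(x).max() / qmax                # the scale factor s_X
    x_int = np.clip(np.round(x / s), -qmax - 1, qmax)
    return x_int * s                          # dequantize back to float

rng = np.random.default_rng(0)
x = rng.normal(0, 0.5, size=1000)             # well-behaved values
x[0] = 100.0                                  # one massive outlier

x_hat = quantize_rtn(x, bits=4)
# The outlier pushes s_X to ~14.3, so anything with |x| < ~7.1 rounds to zero.
print("scale factor:", round(np.abs(x).max() / 7, 2))
print("fraction of values crushed to exactly 0:", np.mean(x_hat == 0))
```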

Existing Solutions

Researchers have developed two primary strategies to fight this:

  1. Mixed-Precision (Outlier Channel Detection): Identify the specific channels (columns/rows) where outliers live and keep them in high precision (e.g., 8-bit or 16-bit), while crunching the rest to 4-bit. The challenge here is: How do you decide which channels are “important”? Most methods just look for the largest values (\(\ell_{\infty}\)-norm).
  2. Rotation (Uniform Precision): Multiply the activation matrix by a random rotation matrix. This “smears” the outliers across all channels, making the distribution smoother (more Gaussian) and easier to quantize uniformly. (Both strategies are sketched in code right after this list.)
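
The toy sketch below illustrates both strategies on synthetic data (the outlier channel index, the choice of k, and the variable names are all made up for illustration); it is not the exact recipe of any particular method.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 0.5, size=(256, 64))        # tokens x channels
X[:, 3] += 20.0                               # channel 3 carries the outliers

# Strategy 1: mixed precision -- keep the k channels with the largest
# maximum absolute value (the l_inf norm) in 8- or 16-bit.
k = 8
linf = np.abs(X).max(axis=0)
keep_hi = np.argsort(linf)[-k:]
print("outlier channel kept in high precision:", 3 in keep_hi)

# Strategy 2: uniform precision -- multiply by a random orthogonal matrix
# so the outlier energy is smeared across all 64 channels.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
X_rot = X @ Q
print("max |x| before rotation:", np.abs(X).max().round(1))
print("max |x| after rotation :", np.abs(X_rot).max().round(1))
```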

Figure 1: Comparison of quantization approaches: (a) mixed precision based on outliers, (b) rotation based, and (c) ResQ combining both.

As shown in Figure 1 above, existing methods usually pick one lane. (a) shows outlier detection, where the high-magnitude channels (orange) are kept in high precision. (b) shows rotation, where the data is scrambled into a smoother, more uniform distribution.

ResQ (c) asks: Why not do both? But specifically, why not use a better metric than just “magnitude” to decide what to keep in high precision?

The Core Method: ResQ

ResQ stands for Residual Quantization. The core philosophy is to identify a “low-rank subspace” that contains the most information (variance), keep that in high precision, and relegate the “residual” (everything else) to low precision. Crucially, it then applies random rotations within those subspaces to smooth the data out even further.

The Decomposition

The researchers propose splitting the input activations \(X\) and weights \(W\) using an orthogonal basis \(U\). They split this basis into two parts:

  • \(U_h\): The high-precision subspace (rank \(r\)).
  • \(U_l\): The low-precision subspace (rank \(d-r\)).

The quantized activation \(X_q\) is calculated as:

\[
X_q = Q_L\!\left(X U_l\right) U_l^{\top} + Q_H\!\left(X U_h\right) U_h^{\top}
\]

This equation says: project \(X\) onto the low-precision subspace (\(X U_l\)), quantize it aggressively (\(Q_L\)), and map it back via \(U_l^{\top}\); do the same for the high-precision subspace with the gentler quantizer \(Q_H\). Sum the two pieces, and you have your quantized activation.
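
Below is a rough NumPy sketch of that decomposition. It uses a random orthogonal basis rather than the PCA basis ResQ actually computes, and a simple per-tensor round-to-nearest quantizer, purely to show how the two pieces are projected, quantized, and recombined.

```python
import numpy as np

def quantize_rtn(X, bits):
    """Symmetric round-to-nearest quantization with a per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(X).max() / qmax
    return np.clip(np.round(X / s), -qmax - 1, qmax) * s

rng = np.random.default_rng(0)
d, r = 64, 8                                  # hidden size, high-precision rank
X = rng.normal(0, 0.5, size=(256, d))
X[:, :2] += 15.0                              # two outlier-heavy channels

# Any orthogonal basis makes the algebra work; ResQ picks U via PCA so that
# U_h spans the highest-variance directions.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
U_l, U_h = U[:, :d - r], U[:, d - r:]

# X_q = Q_L(X U_l) U_l^T + Q_H(X U_h) U_h^T
X_q = quantize_rtn(X @ U_l, bits=4) @ U_l.T + quantize_rtn(X @ U_h, bits=8) @ U_h.T
X_plain = quantize_rtn(X, bits=4)             # naive 4-bit for comparison

print("naive 4-bit error       :", np.linalg.norm(X - X_plain).round(1))
print("decomposed 4+8-bit error:", np.linalg.norm(X - X_q).round(1))
```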

Why Orthogonality Matters

You might be wondering, “Doesn’t splitting matrices and multiplying them add massive computational overhead?”

This is where the math gets elegant. Because the basis \(U\) is orthogonal, the cross-terms in the matrix multiplication vanish. When you multiply the quantized activation \(X_q\) by the quantized weight \(W_q\), the operation decomposes cleanly:

\[
X_q W_q^{\top} = Q_L(X U_l)\, Q_L(W U_l)^{\top} + Q_H(X U_h)\, Q_H(W U_h)^{\top}
\]

with \(W_q\) decomposed the same way as \(X_q\). Because \(U_l^{\top} U_h = 0\) and \(U_l^{\top} U_l = U_h^{\top} U_h = I\), the low/high cross-terms cancel, leaving only a low-low and a high-high product.

This means the hardware only needs to perform a 4-bit GEMM (General Matrix Multiply) for the bulk of the data and an 8-bit GEMM for the small high-precision part. There is no messy interaction between the two precisions.
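
Here is a quick numerical check of that identity. Plain integer rounding stands in for the real quantizers; the only point is the algebra.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
U, _ = np.linalg.qr(rng.normal(size=(d, d)))   # any orthogonal basis
U_l, U_h = U[:, :d - r], U[:, d - r:]

X = rng.normal(size=(32, d))
W = rng.normal(size=(48, d))

# Stand-ins for the quantized projections (rounding keeps the point simple).
XL, XH = np.round(X @ U_l), np.round(X @ U_h)
WL, WH = np.round(W @ U_l), np.round(W @ U_h)

# Full product of the decomposed operands ...
full = (XL @ U_l.T + XH @ U_h.T) @ (WL @ U_l.T + WH @ U_h.T).T
# ... equals two independent GEMMs, because U_l^T U_h = 0 and U^T U = I.
split = XL @ WL.T + XH @ WH.T

print("max difference:", np.abs(full - split).max())   # floating-point round-off only
```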

Figure 2 below illustrates this hardware-friendly flow. Notice how the large blue block (4-bit) and the thin teal block (8-bit) are processed separately and then simply added together.

Figure 2: Diagram of matrix multiplication with mixed-precision operands.

The Secret Sauce: PCA and Optimality

The most significant contribution of this paper is how they choose the high-precision subspace \(U_h\). Previous methods like QUIK selected channels based on maximum absolute value (magnitude). ResQ uses Principal Component Analysis (PCA).

The authors prove a theorem (Theorem 4.2) demonstrating that to minimize quantization error, you shouldn’t look for the largest values; you should look for the largest variance.

Equation showing the upper bound of quantization error.

The equation above gives an upper bound on the error. To make the error (left-hand side) as small as possible, you want the subtracted term on the right-hand side to be as large as possible, which means maximizing \(\|XP_h\|_F\). In linear algebra, the projection \(P_h\) that maximizes the Frobenius norm of the projected data is exactly the one onto the span of the top eigenvectors of the covariance matrix \(X^{\top}X\).

In simple terms: ResQ runs a quick calibration step (using PCA) to find the “directions” in the data that fluctuate the most. It assigns those directions to 8-bit precision. It assigns the boring, static directions to 4-bit.
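
In code, this calibration step boils down to an eigendecomposition. The sketch below uses synthetic activations and treats the uncentered second-moment matrix as the covariance; the paper’s actual calibration recipe (dataset, sample count, centering) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 2048, 64, 8                        # calibration tokens, channels, rank
X = rng.normal(0, 0.3, size=(n, d))
X[:, 5] += rng.normal(0, 8.0, size=n)        # one direction fluctuates wildly

# Second-moment matrix of the calibration activations (covariance up to centering).
cov = X.T @ X / n

# Its top eigenvectors span the highest-variance subspace -> 8-bit.
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
P_h = eigvecs[:, -r:]                        # high-precision projection
P_l = eigvecs[:, :-r]                        # low-precision projection

# Even though r << d, this subspace carries most of the signal energy.
ratio = np.linalg.norm(X @ P_h) ** 2 / np.linalg.norm(X) ** 2
print(f"share of ||X||_F^2 captured by the top-{r} subspace: {ratio:.2%}")
```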

Adding Rotation

Once the subspaces are identified via PCA, ResQ applies random rotations (\(R_l\) and \(R_h\)) inside those subspaces.

\[
U = P R = \begin{bmatrix} P_l & P_h \end{bmatrix} \begin{bmatrix} R_l & 0 \\ 0 & R_h \end{bmatrix} = \begin{bmatrix} P_l R_l & P_h R_h \end{bmatrix},
\]

where \(P\) is the PCA projection and \(R_l, R_h\) are random orthogonal rotations acting inside the low- and high-precision subspaces.

This rotation ensures that within the 4-bit group, no single channel is an outlier, and within the 8-bit group, the data is also well-distributed.
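
Here is a minimal sketch of how such a basis could be assembled, assuming the random rotations are drawn as the orthogonal factor of a Gaussian matrix (one common construction; the paper may use a different sampler, such as random Hadamard matrices).

```python
import numpy as np

def random_orthogonal(k, rng):
    """Random orthogonal matrix: the Q factor of a Gaussian matrix's QR."""
    Q, R = np.linalg.qr(rng.normal(size=(k, k)))
    return Q * np.sign(np.diag(R))            # sign fix for a well-spread draw

rng = np.random.default_rng(0)
d, r = 64, 8
P, _ = np.linalg.qr(rng.normal(size=(d, d)))  # stand-in for the PCA basis [P_l P_h]
P_l, P_h = P[:, :d - r], P[:, d - r:]

R_l = random_orthogonal(d - r, rng)
R_h = random_orthogonal(r, rng)

# U = [P_l R_l | P_h R_h]: the subspaces are unchanged, but the channels
# inside each one are mixed so that no single channel dominates.
U = np.concatenate([P_l @ R_l, P_h @ R_h], axis=1)
print("U is orthogonal:", np.allclose(U.T @ U, np.eye(d)))
```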

The Effect on Distributions

Does this complex math actually change the data? Yes, dramatically.

Figure 3 below shows the activation distribution.

  • (a) The baseline is jagged and noisy.
  • (b) Applying PCA (\(XP\)) sorts the channels by variance. You can see the right side (high variance) spikes up.
  • (c) Applying ResQ (\(XU\), which includes rotation) smooths everything out. The “INT4” section is flat and easy to quantize; the “INT8” section captures the heavy lifting.

Figure 3: Plots showing activation distribution: Baseline vs. Projected vs. ResQ.

Implementation in LLM Blocks

Implementing this in a Transformer isn’t as simple as just one matrix multiplication. The projections need to be fused into the weights where possible to avoid slowdowns.

The authors define specific projection matrices for different parts of the model:

  1. \(U_A\): For the inputs to Attention and Feed-Forward (FFN) blocks.
  2. \(U_B, U_C\): For the Value and Key heads in the KV cache.
  3. \(U_D\): For the massive down-projection in the FFN.

Diagram of a transformer block showing where U_A, U_B, U_C, and U_D are applied.

Key Engineering Trick: For \(U_D\), which operates on the hidden dimension of the FFN (which is huge), doing a full matrix multiplication is too slow. The authors smartly choose \(U_D\) to be a Hadamard matrix. Hadamard transforms can be computed using fast, specialized kernels (Fast Walsh-Hadamard Transform), making the projection almost free computationally.
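
For intuition, here is a textbook Fast Walsh-Hadamard Transform in plain NumPy; production kernels are fused GPU implementations, but the O(d log d) butterfly structure is the same.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard Transform of a length-d vector (d a power of two).

    Computes H_d x in O(d log d) operations instead of the O(d^2) of an
    explicit matrix multiply; dividing by sqrt(d) makes it orthogonal.
    """
    x = np.asarray(x, dtype=float).copy()
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def hadamard(d):
    """Explicit Sylvester Hadamard matrix, only for checking the fast version."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
x = rng.normal(size=16)
print("fast transform matches explicit H @ x:", np.allclose(fwht(x), hadamard(16) @ x))
```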

Experiments & Results

The researchers tested ResQ on Llama 2, Llama 3, Llama 3.2, and Qwen2.5 families. The setup generally quantizes Weights, Activations, and KV Cache all to 4-bit (W4A4KV4), keeping only 1/8th of the channels in 8-bit.

Perplexity and Accuracy

The results show a clear victory over competing methods like SpinQuant, QuaRot, and QUIK.

In Table 1, look at the Meta-Llama-3-70B column.

  • RTN (Round-to-Nearest) breaks the model completely (perplexity > 400).
  • GPTQ (Weight only) fails because activations aren’t handled.
  • SpinQuant (the previous state-of-the-art) gets 6.2 perplexity.
  • ResQ achieves 4.1 perplexity, significantly closer to the 16-bit baseline.

Table 1: Comparison of perplexity and accuracy across Llama models.

Generative Capabilities

It’s one thing to have good perplexity (predicting the next word), but can the model still do math and code?

Table 2 highlights performance on GSM8K (math) and HumanEval-style tasks (code). For Llama-3-8B, ResQ scores 33.6% on GSM8K, beating SpinQuant’s 29.8% and QUIK’s 2.3% (QUIK collapsed completely). This confirms that ResQ preserves the model’s “reasoning” abilities far better than other 4-bit techniques.

Table 2: Performance on generative tasks like GSM8K and LongBench.

Visualizing the Improvement

To visualize why ResQ works better, we can look at the signal-to-noise ratio (SNR) and the actual activation values.

Figure 7 compares the input activations.

  • Top (Baseline): The values range wildly from -15 to +10.
  • Bottom (ResQ): The values are tightly controlled, mostly staying between -5 and +5. This compact range is much “friendlier” for 4-bit quantization.

Figure 7: Comparisons of input activation distributions for Attention and FFN layers.

Speedup

Finally, the whole promise of post-training quantization (PTQ) is faster, cheaper inference. Does ResQ deliver?

On an NVIDIA RTX 3090, ResQ achieves up to a 3x speedup over the 16-bit baseline. Crucially, it is only about 14% slower than a pure naive INT4 kernel. This means the overhead of the mixed-precision handling (splitting the matrix) and the on-the-fly projections is negligible compared to the gains from reducing memory bandwidth.

Bar chart showing speedup of ResQ on RTX 3090.

Conclusion & Implications

ResQ represents a maturation in the field of LLM quantization. We have moved past simple rounding (RTN) and static outlier clipping. We are now entering an era where quantization is “aware” of the data structure.

By mathematically separating the high-variance “signal” from the low-variance “noise” using PCA, ResQ allows us to spend our “bit budget” where it matters most.

Key Takeaways:

  1. Don’t just look at magnitude: Variance (PCA) is a better indicator of importance for mixed-precision than simple absolute values.
  2. Orthogonality is efficient: Decomposing matrices into orthogonal subspaces allows for mixed-precision without complex cross-calculations.
  3. Rotation aids quantization: Even after finding the best subspaces, random rotation helps smooth out the remaining outliers.
  4. 4-bit W/A/KV is viable: We are closing the gap to 16-bit performance, making it feasible to run massive models like Llama-3-70B on consumer-grade hardware or significantly cheaper cloud instances.

As LLMs continue to grow, techniques like ResQ that optimize the inference stage are likely to become standard components of model deployment pipelines.