Introduction
The race to scale Large Language Models (LLMs) has hit a physical wall: GPU memory. With models now routinely exceeding 50 billion parameters, the computational resources required to fine-tune them for specific tasks are astronomical. A 65B parameter model, for instance, is not something you can easily load, let alone train, on a standard consumer GPU.
To address this, the community turned to Parameter-Efficient Finetuning (PEFT) and Quantization. Methods like QLoRA (Quantized Low-Rank Adaptation) have become the industry standard, allowing us to freeze a model, compress it to 4 bits, and train a tiny set of adapter parameters. This was a massive leap forward.
However, a new problem has emerged. While 4-bit quantization works reasonably well, pushing models down to 2-bit or 3-bit usually results in a complete collapse of performance. The model suffers from “catastrophic forgetting”—it loses the knowledge it learned during pretraining.
Why does this happen? The current state-of-the-art methods distort the model’s internal representations too much during the compression process. They break the “starting point” of the model.
In this post, we will deep dive into ApiQ (Activation-preserved initialization of Quantized LLMs), a novel framework proposed by researchers from the University of Amsterdam and eBay. ApiQ introduces a mathematically rigorous way to initialize quantization and adapter parameters simultaneously. The result? A 2-bit model that can actually be fine-tuned effectively, outperforming existing baselines by a significant margin.
The Background: LoRA, QLoRA, and the “Starting Point”
To understand ApiQ, we first need to understand the mechanics of the methods it improves upon.
The Standard: LoRA
Low-Rank Adaptation (LoRA) freezes the massive pretrained weight matrix \(W\) and injects trainable low-rank matrices, \(A\) and \(B\), alongside it. The new forward pass looks like this:
\[W' = W + AB^\top\]
Usually, \(B\) is initialized to zero, meaning at step zero of training, \(W' = W\). This is crucial: the training starts exactly where the pretrained model left off.
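For concreteness, here is a minimal PyTorch sketch of such a layer. The `LoRALinear` class, the shapes, and the initialization scale are illustrative assumptions, not code from the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update A @ B^T."""
    def __init__(self, W: torch.Tensor, rank: int = 16):
        super().__init__()
        d_in, d_out = W.shape
        self.W = nn.Parameter(W, requires_grad=False)          # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # zeros, so W' = W at step 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x @ (W + A B^T): identical to the pretrained layer at initialization
        return x @ self.W + (x @ self.A) @ self.B.T
```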
The Problem with QLoRA
QLoRA takes this a step further by quantizing \(W\) into a lower-precision format (like 4-bit integers), denoted as \(Q\). The equation becomes:
\[W' = Q + AB^\top\]
Here lies the issue. Quantization is lossy: \(Q\) is not equal to \(W\). Therefore, at the beginning of training (even with \(B = 0\)), \(W' = Q \neq W\).
This discrepancy means the starting point is broken. The model no longer behaves exactly like the pretrained model. When you push quantization to extreme limits (like 2-bit), the difference between \(W\) and \(Q\) becomes massive, leading to high error rates that fine-tuning struggles to recover from.
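To make this concrete, a toy per-tensor round-to-nearest quantizer (real 4-bit schemes such as NF4 use group-wise quantization, so this is only a rough illustration) shows how the gap between \(W\) and \(Q\) grows as the bit-width shrinks:

```python
import torch

def uniform_quantize(W: torch.Tensor, bits: int) -> torch.Tensor:
    """Toy round-to-nearest uniform quantization, then dequantization."""
    qmax = 2 ** bits - 1
    scale = (W.max() - W.min()) / qmax            # step size s
    zero_point = torch.round(-W.min() / scale)    # zero-point z
    W_int = torch.clamp(torch.round(W / scale) + zero_point, 0, qmax)
    return (W_int - zero_point) * scale           # dequantized Q

W = torch.randn(1024, 1024)
for bits in (4, 3, 2):
    Q = uniform_quantize(W, bits)
    print(f"{bits}-bit relative weight error: {(W - Q).norm() / W.norm():.3f}")
```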
The Partial Solution: LoftQ
Recent works like LoftQ attempted to fix this by initializing \(Q, A, \text{and } B\) such that they approximate the original weights as closely as possible:
\[ \text{Minimize } \| W - (Q + AB^\top) \| \]
While this reduces the weight error, it treats every layer independently. It ignores a critical reality of deep neural networks: error accumulation. A small error in layer 1 alters the input to layer 2, which causes more error, eventually cascading through the network.
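A rough sketch of the LoftQ idea, reusing the toy `uniform_quantize` helper from above: alternate between quantizing the current residual and fitting a rank-\(r\) SVD to the remaining error. The function below mirrors the recipe in spirit, not LoftQ's exact implementation:

```python
import torch

def loftq_style_init(W: torch.Tensor, bits: int, rank: int, iters: int = 5):
    """Alternate between quantizing the residual and taking a rank-r SVD of the
    remaining error, so that Q + A @ B.T tracks W (LoftQ-like sketch)."""
    A = torch.zeros(W.shape[0], rank)
    B = torch.zeros(W.shape[1], rank)
    for _ in range(iters):
        Q = uniform_quantize(W - A @ B.T, bits)     # quantize the current residual
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                  # absorb singular values into A
        B = Vh[:rank, :].T
    return Q, A, B
```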
The Core Method: ApiQ
The researchers behind ApiQ argue that minimizing weight error (like LoftQ) is the wrong objective. We shouldn’t care if the weights look different; we should care if the activations (the outputs of the layers) are preserved.
If a quantized layer produces the exact same output \(Y\) as the full-precision layer for a given input \(X\), the network won’t know the difference.
1. Activation-Preserved Initialization
The core optimization problem of ApiQ is formulated to minimize the difference between the output of the full-precision layer and the output of the quantized layer:

\[ \text{Minimize } \left\| XW - X^q\left(Q + AB^\top\right) \right\| \]
Where:
- \(X\): The input activation.
- \(W\): Fixed pretrained weights.
- \(X^q\): The input to the quantized layer (which comes from the previous quantized layer).
- \(Q, A, B\): The parameters we are optimizing.
This effectively aligns the activations across corresponding layers. Crucially, because \(X^q\) is the output from the previous quantized layer, ApiQ accounts for the errors flowing from shallower layers into deeper ones. It actively mitigates error propagation.
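Under the \(Y = XW\) convention used here, the per-layer objective can be sketched as a simple activation-matching loss. The function below is an illustrative assumption about how such a loss could be computed, not the paper's code, and in practice it would be averaged over calibration batches:

```python
import torch
import torch.nn.functional as F

def apiq_layer_loss(W, X, X_q, Q, A, B):
    """Activation-preservation objective for a single linear layer (sketch).

    X   -- input activations of the full-precision model
    X_q -- input activations arriving from the already-quantized previous layers
    """
    Y_fp = X @ W                   # reference output of the frozen pretrained layer
    Y_q = X_q @ (Q + A @ B.T)      # output of the quantized layer plus its adapters
    return F.mse_loss(Y_q, Y_fp)   # gradients flow into Q's parameters, A, and B
```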
Visualizing the Error Reduction
The difference in approach leads to drastic differences in internal error rates.

In Figure 4, look at the scale of the Y-axis. The activation error for standard QLoRA (orange line) and LoftQ (blue line) skyrockets as you move deeper into the network (higher layer indices). The errors accumulate.
In contrast, ApiQ (green line) keeps the activation error near zero throughout the entire depth of the model. By solving for activations rather than weights, ApiQ keeps the model “on track” even at very low bit-widths.
2. Block-wise vs. Layer-wise
The paper proposes two implementation strategies for ApiQ:
- ApiQ-lw (Layer-wise): Optimizes one linear layer at a time. It is memory efficient (running on consumer GPUs) but slower because it must proceed sequentially through every layer of the network.
- ApiQ-bw (Block-wise): Optimizes an entire Transformer block (Attention + MLP) at once.
The block-wise objective looks like this:

\[ \text{Minimize } \left\| \mathcal{F}(W, X) - \mathcal{F}\left(Q + AB^\top,\, X^q\right) \right\| \]

where \(\mathcal{F}\) denotes the mapping of an entire Transformer block, and the minimization runs jointly over all quantized weights and adapter pairs inside that block.
ApiQ-bw is the recommended approach. It is significantly faster to execute and allows for optimizing parameters across the whole block simultaneously. It effectively combines the benefits of quantization calibration with adapter initialization.
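A sketch of that block-wise objective in PyTorch, where `fp_block` and `q_block` are hypothetical handles to the frozen full-precision Transformer block and its quantized-plus-adapter counterpart:

```python
import torch
import torch.nn.functional as F

def apiq_block_loss(fp_block, q_block, X, X_q):
    """Block-wise variant (sketch): match the output of an entire Transformer
    block (attention + MLP) instead of a single linear layer."""
    with torch.no_grad():
        Y_fp = fp_block(X)          # frozen full-precision reference, no gradients
    Y_q = q_block(X_q)              # trainable: scales, zero-points, and all A/B pairs
    return F.mse_loss(Y_q, Y_fp)
```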
3. Gradient-Based Optimization
Unlike LoftQ, which uses Singular Value Decomposition (SVD), ApiQ uses a gradient-based approach. It treats the quantization parameters themselves (scale \(s\) and zero-point \(z\)) as trainable parameters alongside matrices \(A\) and \(B\).
To make the quantization step differentiable (since rounding is a non-differentiable operation), they use a Straight-Through Estimator (STE), allowing gradients to flow through the quantization function during this initialization phase.
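A common way to implement this kind of fake quantization with a straight-through estimator looks roughly like the following; it is a generic sketch with trainable `scale` and `zero_point`, not ApiQ's exact code:

```python
import torch

def ste_round(x: torch.Tensor) -> torch.Tensor:
    """Round in the forward pass, act as the identity in the backward pass."""
    return x + (torch.round(x) - x).detach()

def fake_quantize(W: torch.Tensor, scale: torch.Tensor,
                  zero_point: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform fake-quantization with trainable scale and zero-point.

    Because ste_round behaves as the identity in the backward pass, gradients
    reach both the weight matrix and the quantization parameters s and z."""
    qmax = 2 ** bits - 1
    W_int = torch.clamp(ste_round(W / scale + zero_point), 0, qmax)
    return (W_int - zero_point) * scale
```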
Why Initialization Matters: A Look at Distributions
You might wonder: “Why does initializing \(A\) and \(B\) cleverly matter so much? Can’t the model just learn the correct values during fine-tuning?”
For 4-bit models, yes, standard fine-tuning often recovers. But for 2-bit models, the initial distortion is too severe. Furthermore, the shape of the distribution matters for training stability.

Figure 5 compares the distributions of the parameters for LoftQ (left) and ApiQ (right).
- LoftQ: Notice the distribution of matrix \(B\) (orange). It often contains outliers and is asymmetric.
- ApiQ: The distributions for \(A\) and \(B\) are smooth, symmetric, and Gaussian-like.
Neural networks tend to train more stably when parameters follow well-behaved, roughly Gaussian distributions with controlled variance (the intuition behind standard initialization schemes like Xavier or He initialization). ApiQ therefore provides a much healthier optimization landscape for the subsequent fine-tuning phase.
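If you want to sanity-check your own adapter initialization, a quick distribution probe like this (an illustrative helper, not from the paper) makes heavy tails and outliers visible:

```python
import torch

def describe(name: str, t: torch.Tensor) -> None:
    """Print simple distribution statistics for an initialized adapter matrix."""
    flat = t.flatten().float()
    std = flat.std()
    kurt = ((flat - flat.mean()) ** 4).mean() / std ** 4   # ~3 for a Gaussian
    print(f"{name}: std={std:.4f}, excess kurtosis={kurt - 3:.2f}, "
          f"max |value|={flat.abs().max():.4f}")

A = torch.randn(4096, 64) * 0.01   # stand-in for an initialized adapter matrix
describe("A", A)                   # large excess kurtosis or extreme outliers
                                   # hint at a harder fine-tuning landscape
```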
Experiments and Results
The researchers tested ApiQ across a variety of models (Llama-2, Mistral, DeBERTa, RoBERTa) and tasks (language modeling, GLUE, arithmetic reasoning, and commonsense reasoning).
1. Finetuning Performance (The Main Event)
The most striking results appear in the 2-bit and 3-bit regimes.

In Figure 1, examine the bottom-left chart (GSM8K Accuracy).
- At Bit=2, standard methods like QLoRA (and even LoftQ) collapse to near 0% accuracy or perform very poorly.
- ApiQ (Orange/Green bars) maintains significantly higher accuracy, bridging the gap toward 4-bit performance.
This is reinforced by the tables provided in the paper.

Looking at Table 6:
- For Llama-2-7B at 2-bit on GSM8K (Math):
  - LoftQ achieves 20.9% accuracy.
  - ApiQ-bw achieves 33.5% accuracy.
- For Mistral-7B at 2-bit:
  - QLoRA and LoftQ basically fail (~2% accuracy).
  - ApiQ-bw recovers to 45.0% accuracy.
This demonstrates that ApiQ effectively rescues the model from the “catastrophic forgetting” usually caused by aggressive quantization.
2. Post-Training Quantization (PTQ)
Even if you don’t plan to fine-tune, ApiQ works as an excellent Post-Training Quantization method (where you just quantize and run inference).

In Table 3, ApiQ compares favorably against dedicated quantization methods like GPTQ, AWQ, and OmniQuant. At 2 bits (Llama-2-7B), ApiQ achieves a perplexity of 7.59, significantly lower (better) than GPTQ at 20.85, while AWQ's perplexity explodes entirely at this bit-width. This suggests that the initialized adapter matrices \(A\) and \(B\) successfully capture the information lost by the 2-bit weight compression.
3. Efficiency
Is this pre-initialization step expensive?

According to Table 4, quantizing Llama-2-7B with ApiQ-bw takes about 1.3 hours. While this is slower than GPTQ (0.2h), it is faster than OmniQuant and requires reasonable memory (12GB). Considering this is a one-time cost that enables viable 2-bit fine-tuning, the trade-off is highly favorable.
Weight Error vs. Activation Error
An interesting finding in the paper is the relationship between weight error and activation error.

Figure 3 shows the weight quantization error.
- Left: LoftQ reduces weight error better than QLoRA.
- Middle: ApiQ also reduces weight error significantly compared to QLoRA, even though its objective function targets activations.
This suggests that by optimizing for the output (activations), ApiQ implicitly finds a good weight configuration, but adds the extra benefit of handling layer-to-layer error propagation.
Conclusion & Implications
The “ApiQ” paper identifies a critical bottleneck in the lifecycle of Large Language Models: the disconnect between quantization and fine-tuning.
By treating the initialization of Low-Rank Adapters (LoRA) and the quantization of weights as a joint optimization problem centered on preserving activations, ApiQ achieves two major milestones:
- It halts the “snowball effect” of quantization errors propagating through deep networks.
- It creates a mathematically sound “starting point” for fine-tuning.
The implications are significant for students and researchers with limited hardware. If 2-bit quantization becomes viable without destroying model intelligence, we could see 7B and 13B parameter models running and training on devices with drastically less memory than is currently required. ApiQ is a strong step toward democratizing access to these powerful models.
Key Takeaways
- Don’t just look at weights: When compressing models, preserving the activations (outputs) prevents error accumulation.
- Initialization is Key: For low-bit (2-bit/3-bit) training, standard initialization fails. You need to calibrate your adapters before you start fine-tuning.
- Block-wise Optimization: Processing the network in transformer blocks is the sweet spot for efficiency and performance.