Large Language Models (LLMs) have transformed artificial intelligence, enabling breathtaking capabilities in text generation, reasoning, and understanding. But this power comes with a heavy price: gigantic model sizes, high memory demands, and substantial computational costs. Deploying these models efficiently—especially on constrained hardware—is a major engineering challenge.
One of the most effective tools to shrink these models is quantization—reducing the precision of model weights from high-precision floating-point numbers (like bfloat16) to low-precision integers (like int4). This can slash memory usage by 4× or more, enabling powerful models to run on consumer-grade hardware.
But quantization is tricky. Aggressive reduction in precision can severely hurt accuracy, especially with the simplest “calibration-free” methods that skip using example datasets during quantization.
A key troublemaker here is the outlier. Imagine a single, unusually large weight value in a matrix. Classic quantization methods must stretch their entire numeric range to fit it, which means all other values get represented with coarser steps—leading to significant errors. It’s like designing a painting where one huge corner is dedicated to a giant sun, forcing all other details into a cramped space.
Researchers from Huawei Zurich Research Center address this pain point in their paper, “SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights”. They introduce SINQ (Sinkhorn-Normalized Quantization), a calibration-free method that elegantly handles outliers by adding a second set of scaling factors. This gives SINQ the ability to balance quantization errors in a way that single-scale methods cannot. Let’s unpack how it works.
The Trouble with Outliers: A Quantization Primer
In Post-Training Quantization (PTQ), the aim is to represent a high-precision weight matrix \(\mathbf{W}\) with a low-precision integer matrix \(\mathbf{Q}\), plus a small set of higher-precision parameters to restore values to the original range.
The most common approach uses one scale vector (\(\vec{s}\)) and a zero-point (\(\vec{z}\)) per tile or group of weights:
\[ \mathbf{W}_{\text{approx}} = \vec{s} \odot (\mathbf{Q} + \vec{z}) \]
Here, \(\odot\) denotes element-wise (broadcasted) multiplication. The scale \(\vec{s}\) sets the quantization step size, and \(\vec{z}\) centers the range. When a group contains a single large outlier, its scale must be stretched to represent it, forcing large steps and large errors for every other number in that group.
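To make the single-scale setup concrete, here is a minimal PyTorch sketch of per-row round-to-nearest (RTN) quantization using the \(\vec{s}\), \(\vec{z}\) parameterization above. The per-row grouping, the bit width, and the small clamp on the scale are illustrative choices, not details taken from the paper.

```python
import torch

def rtn_quantize(W: torch.Tensor, bits: int = 4):
    """Round-to-nearest quantization with one scale and zero-point per row.

    Returns integer codes Q plus (s, z) such that W ≈ s * (Q + z),
    matching the single-scale formula above.
    """
    qmax = 2 ** bits - 1                              # e.g. 15 for 4-bit codes
    w_min = W.amin(dim=1, keepdim=True)               # per-row minimum
    w_max = W.amax(dim=1, keepdim=True)               # per-row maximum
    s = (w_max - w_min).clamp(min=1e-8) / qmax        # step size: stretched by any outlier in the row
    z = w_min / s                                     # (typically negative) zero-point
    Q = torch.clamp(torch.round(W / s - z), 0, qmax)  # integer codes in [0, qmax]
    W_approx = s * (Q + z)                            # dequantized approximation
    return Q, s, z, W_approx
```

Note how a single outlier in a row inflates `s` for that entire row, which is exactly the failure mode the strategies below try to avoid.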
Strategies to combat this include:
- Calibration: Methods like AWQ use sample data to identify important weights and preserve them with higher precision. Effective, but adds complexity, needs data, and can be slow.
- Non-Uniform Quantization: Schemes like NF4 place quantization levels unevenly, focusing where most weight values lie. Better accuracy but may be less hardware-friendly.
- Weight Space Transformations: Rotations like the Hadamard transform spread out outliers in the weight space, making them easier to quantize.
SINQ takes a different path: keeping uniform quantization, avoiding calibration, and still achieving results competitive with more sophisticated approaches.
The Core Method: Dual-Scaling with SINQ
The key insight:
Instead of controlling the quantization range along just one dimension (rows or columns), control it along both.
A New Way to Parameterize: Dual-Scaling
SINQ replaces the single scale vector with two:
- Row-wise scale vector: \(\vec{s}\)
- Column-wise scale vector: \(\vec{t}\)
Or, with optional shift vector \(\vec{z}\):
\[ \mathbf{W}_{\text{approx}} = \vec{s} \odot (\mathbf{Q} + \vec{z}) \odot \vec{t} \]
This unlocks new flexibility. Suppose there is an outlier at position \((i, j)\). A single-scale method must enlarge the scale for an entire row or column, hurting all the other values there.
With dual scaling, we can increase \(s_i\) for that row while decreasing \(t_j\) for its column, splitting the error between the two dimensions and keeping it contained.
Figure 1: Dual scaling allows shifting outlier impact between rows and columns, in contrast to single-scaling which forces it into one dimension.
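In code, the dual-scale reconstruction is just a broadcasted product. Here is a minimal sketch; the (rows, 1) and (1, cols) shapes are my assumption about how the broadcast is laid out, and the exact tiling in the paper may differ.

```python
import torch

def dual_scale_dequantize(Q: torch.Tensor, s: torch.Tensor,
                          z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Reconstruct weights from integer codes with dual scales.

    Q: (rows, cols) integer codes
    s: (rows, 1)    row-wise scales
    z: (rows, 1)    zero-points (optional shift)
    t: (1, cols)    column-wise scales
    """
    # Broadcasting applies s along rows and t along columns, i.e. s ⊙ (Q + z) ⊙ t.
    return s * (Q + z) * t
```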
Finding the Scales: The Matrix Imbalance Metric
Dual scales are powerful, but how do we choose them optimally? The authors define a proxy metric: matrix imbalance.
Imbalance \(I(\mathbf{W})\) is:
\[ I(\mathbf{W}) = \frac{\tilde{\sigma}_{\max}(\mathbf{W})}{\tilde{\sigma}_{\min}(\mathbf{W})} \]
where \(\tilde{\sigma}_{\max}\) is the maximum standard deviation across all rows and columns of \(\mathbf{W}\), and \(\tilde{\sigma}_{\min}\) is the corresponding minimum.
A low-imbalance matrix has similar spread in every row and column—making it easier to quantize, since no single dimension requires a disproportionately large scale.
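Written out directly, the metric is nearly a one-liner. A small sketch follows; the epsilon guard against a zero minimum is mine.

```python
import torch

def matrix_imbalance(W: torch.Tensor) -> torch.Tensor:
    """Imbalance I(W): largest divided by smallest standard deviation,
    taken over all rows and all columns of W."""
    row_std = W.std(dim=1)                     # one value per row
    col_std = W.std(dim=0)                     # one value per column
    stds = torch.cat([row_std, col_std])
    return stds.max() / stds.min().clamp(min=1e-12)
```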
The Sinkhorn-Knopp Connection
Directly minimizing imbalance is tricky: the max/min operations are awkward for gradient-based optimization. The authors draw inspiration from the Sinkhorn-Knopp algorithm, which classically alternates between normalizing the row sums and column sums of a matrix.
SINQ adapts this: instead of sums, it iteratively normalizes standard deviations.
Algorithm sketch:
- Start with original weights \(\mathbf{W}\).
- Repeat for \(n\) iterations:
  - Divide each column by its standard deviation.
  - Divide each row by its new standard deviation.
- Quantize the balanced matrix \(\hat{\mathbf{W}}\) using RTN or similar; the row and column normalization factors become the dual scales \(\vec{s}\) and \(\vec{t}\).
This converges quickly and yields nearly identical std. deviations across rows and columns—minimizing imbalance.
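A minimal sketch of this balancing loop is below, assuming alternating division by column and row standard deviations with the factors accumulated into the dual scales; the iteration count, the epsilon, and the use of plain `torch.std` are illustrative choices rather than the paper's exact implementation.

```python
import torch

def sinkhorn_std_normalize(W: torch.Tensor, n_iters: int = 16, eps: float = 1e-8):
    """Alternately normalize column and row standard deviations of W.

    Returns the balanced matrix W_hat and accumulated factors (s, t)
    such that W ≈ s * W_hat * t elementwise; quantizing W_hat with RTN
    then yields the dual-scaled form s ⊙ (Q + z) ⊙ t.
    """
    W_hat = W.clone()
    s = torch.ones(W.shape[0], 1, dtype=W.dtype, device=W.device)  # row factors
    t = torch.ones(1, W.shape[1], dtype=W.dtype, device=W.device)  # column factors
    for _ in range(n_iters):
        col_std = W_hat.std(dim=0, keepdim=True).clamp(min=eps)
        W_hat = W_hat / col_std
        t = t * col_std
        row_std = W_hat.std(dim=1, keepdim=True).clamp(min=eps)
        W_hat = W_hat / row_std
        s = s * row_std
    return W_hat, s, t
```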
Figure 2: SINQ’s imbalance optimization sharply reduces kurtosis and improves perplexity; focusing solely on kurtosis worsens imbalance and performance.
Playing Well with Others: A-SINQ
SINQ is modular—it meshes with other techniques. The authors demonstrate A-SINQ, combining SINQ with AWQ calibration:
- Apply SINQ normalization to balance the matrix.
- Apply AWQ scales to emphasize important weights.
- Quantize.
This preserves AWQ’s activation-awareness while benefiting from SINQ’s outlier management.
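As a rough illustration of how the two pieces could be composed, the sketch below chains the earlier `sinkhorn_std_normalize` and `rtn_quantize` sketches and folds externally computed per-column AWQ scales into the column factors. The function name, the `awq_col_scales` input, and the exact order of operations are my assumptions, not the paper's code.

```python
import torch

def a_sinq_quantize(W: torch.Tensor, awq_col_scales: torch.Tensor,
                    bits: int = 4, n_iters: int = 16):
    """Hypothetical A-SINQ-style pipeline: balance, apply AWQ scales, quantize.

    awq_col_scales: (1, cols) scales from a separate AWQ calibration pass.
    """
    W_hat, s, t = sinkhorn_std_normalize(W, n_iters=n_iters)  # step 1: balance the matrix
    W_hat = W_hat * awq_col_scales                            # step 2: magnify salient columns
    t = t / awq_col_scales                                    #         compensate in the column scales
    Q, s_q, z, _ = rtn_quantize(W_hat, bits=bits)             # step 3: round-to-nearest
    return Q, s * s_q, z, t                                   # W ≈ (s * s_q) ⊙ (Q + z) ⊙ t
```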
Experiments: Putting SINQ to the Test
The team evaluated SINQ across Qwen3 and DeepSeek models using:
- Perplexity (lower is better; measures language modeling quality)
- Flip rate (lower is better; measures prediction changes vs. full-precision baseline)
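For reference, a flip rate can be computed as the fraction of examples whose top prediction changes after quantization. The sketch below is one plausible reading; the paper's exact definition over its benchmark tasks may differ.

```python
import torch

def flip_rate(logits_fp: torch.Tensor, logits_q: torch.Tensor) -> float:
    """Fraction of positions where the quantized model's top-1 prediction
    differs from the full-precision model's prediction."""
    flips = logits_fp.argmax(dim=-1) != logits_q.argmax(dim=-1)
    return flips.float().mean().item()
```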
Uncalibrated, Uniform Quantization
SINQ’s sweet spot.
In 3-bit and 4-bit scenarios, SINQ consistently achieved the lowest perplexity (Table 1), outperforming RTN, Hadamard+RTN, and HQQ. In some cases, it halved the gap to the full-precision baseline.
Flip rates tell the same story: SINQ changes the model’s answers less often than competitors, meaning predictions are closer to the original.
In the memory–perplexity Pareto plot (Fig. 3), SINQ sits on or near the optimal trade-off curve—low memory and low perplexity.
SINQ also scales up to huge architectures like DeepSeek-V2.5-236B MoE, maintaining robustness.
Compatibility with Other Techniques
- Non-Uniform Quantization: Pairing SINQ with NF4 improves the baseline NF4 (Table 4). For the 32B model, uniform INT4 SINQ beats NF4—a testament to the core normalization’s power.
- Calibrated Quantization: A-SINQ (SINQ + AWQ) often sets new state-of-the-art results (Table 5). In some cases, calibration-free SINQ beats calibrated methods like GPTQ.
Speed Matters: Quantization Time
Speed is a major advantage. SINQ needs only ~1.1× the time of RTN—far faster than HQQ (>2×) and dramatically quicker than AWQ (>30×) or GPTQ.
Conclusion and Takeaways
The SINQ paper presents a simple, elegant approach to LLM quantization that is:
- Powerful: Dual scaling isolates outlier impact, resulting in higher accuracy.
- Smart: Matrix imbalance is an effective optimization target for easier quantization.
- Fast: State-of-the-art uncalibrated results with runtimes close to RTN—ideal when time or data is scarce.
- Versatile: Works with non-uniform levels and calibrated schemes for even better performance.
In a landscape where methods are growing more complex, SINQ shows that a thoughtful, well-targeted idea can narrow the gap between basic calibration-free quantization and elaborate techniques—making high-performance LLMs smaller, faster, and more accessible.