Large Language Models (LLMs) have transformed artificial intelligence, enabling breathtaking capabilities in text generation, reasoning, and understanding. But this power comes with a heavy price: gigantic model sizes, high memory demands, and substantial computational costs. Deploying these models efficiently—especially on constrained hardware—is a major engineering challenge.

One of the most effective tools to shrink these models is quantization—reducing the precision of model weights from high-precision floating-point numbers (like bfloat16) to low-precision integers (like int4). This can slash memory usage by 4× or more, enabling powerful models to run on consumer-grade hardware.

But quantization is tricky. Aggressive reduction in precision can severely hurt accuracy, especially with the simplest “calibration-free” methods that skip using example datasets during quantization.

A key troublemaker here is the outlier. Imagine a single, unusually large weight value in a matrix. Classic quantization methods must stretch their entire numeric range to fit it, which means all other values get represented with coarser steps—leading to significant errors. It’s like designing a painting where one huge corner is dedicated to a giant sun, forcing all other details into a cramped space.

Researchers from Huawei Zurich Research Center address this pain point in their paper, “SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights”. They introduce SINQ (Sinkhorn-Normalized Quantization), a calibration-free method that elegantly handles outliers by adding a second set of scaling factors. This gives SINQ the ability to balance quantization errors in a way that single-scale methods cannot. Let’s unpack how it works.

The Trouble with Outliers: A Quantization Primer

In Post-Training Quantization (PTQ), the aim is to represent a high-precision weight matrix \(\mathbf{W}\) with a low-precision integer matrix \(\mathbf{Q}\), plus a small set of higher-precision parameters to restore values to the original range.

The most common approach uses one scale vector (\(\vec{s}\)) and a zero-point (\(\vec{z}\)) per tile or group of weights:

\[ \mathbf{W}_{\text{approx}} = \vec{s} \odot (\mathbf{Q} + \vec{z}) \]

Here, \(\odot\) denotes element-wise multiplication. The scale \(\vec{s}\) determines quantization step size, and \(\vec{z}\) centers that range. When a group contains a single large outlier, its scale must be oversized to represent it—forcing large steps and large errors for all other numbers in that group.
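To make the notation concrete, here is a minimal round-to-nearest (RTN) sketch in PyTorch with one scale and one shift per row, so that \(\mathbf{W}_{\text{approx}} = \vec{s} \odot (\mathbf{Q} + \vec{z})\) as above. The per-row grouping, variable names, and clamping constant are illustrative choices, not the paper's implementation:

```python
import torch

def rtn_quantize(W: torch.Tensor, bits: int = 4):
    """Round-to-nearest with one scale and one shift per row, so that
    W_approx = s * (Q + z). A minimal sketch, not the paper's code."""
    qmax = 2 ** bits - 1                                # e.g. 15 levels for 4-bit
    w_min = W.min(dim=1, keepdim=True).values
    w_max = W.max(dim=1, keepdim=True).values
    s = (w_max - w_min).clamp(min=1e-8) / qmax          # step size per row
    z = w_min / s                                       # shift so codes start at 0
    Q = torch.clamp(torch.round(W / s - z), 0, qmax)    # integer codes in [0, qmax]
    W_approx = s * (Q + z)                              # dequantized approximation
    return Q, s, z, W_approx
```

A single outlier in a row inflates `w_max`, hence `s`, and therefore the rounding error of every other value in that row, which is exactly the problem described above.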

Strategies to combat this include:

  1. Calibration: Methods like AWQ use sample data to identify important weights and preserve them with higher precision. Effective, but adds complexity, needs data, and can be slow.
  2. Non-Uniform Quantization: Schemes like NF4 place quantization levels unevenly, focusing where most weight values lie. Better accuracy but may be less hardware-friendly.
  3. Weight Space Transformations: Rotations like the Hadamard transform spread out outliers in the weight space, making them easier to quantize.

SINQ takes a different path: keeping uniform quantization, avoiding calibration, and still achieving results competitive with more sophisticated approaches.

The Core Method: Dual-Scaling with SINQ

The key insight:
Instead of controlling the quantization range along just one dimension (rows or columns), control it along both.

A New Way to Parameterize: Dual-Scaling

SINQ replaces the single scale vector with two:

  • Row-wise scale vector: \(\vec{s}\)
  • Column-wise scale vector: \(\vec{t}\)

\[ \mathbf{W}_{\text{approx}} = \vec{s} \odot \mathbf{Q} \odot \vec{t} \]

Or, with the optional zero-point vector \(\vec{z}\):

\[ \mathbf{W}_{\text{approx}} = \vec{s} \odot (\mathbf{Q} + \vec{z}) \odot \vec{t} \]

This unlocks new flexibility. Suppose there’s an outlier at position \((i, j)\). Single-scale methods must enlarge the scale for an entire row or column, hurting all other values there.
With dual scaling, we increase \(s_i\) for that row while decreasing \(t_j\) for its column—splitting the error between row and column and keeping it contained.
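In code, dequantization with the dual parameterization is just two broadcasted multiplies. The sketch below assumes \(\vec{s}\) is stored with shape [rows, 1] and \(\vec{t}\) with shape [1, cols]; the shape convention is mine, not prescribed by the paper:

```python
from typing import Optional
import torch

def dual_scale_dequant(Q: torch.Tensor, s: torch.Tensor, t: torch.Tensor,
                       z: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Reconstruct W_approx = s * (Q + z) * t. Broadcasting applies s per row
    and t per column (the shape convention is an assumption of this sketch)."""
    core = Q if z is None else Q + z
    return s * core * t
```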

Figure 1: With scales along both dimensions of the matrix to be quantized, the impact of an outlier can be traded off between its row and its column, which is impossible with single-scale quantization. Left: conceptual illustration of quantization error distributions under single versus dual scaling. Right: example on a small matrix.

Finding the Scales: The Matrix Imbalance Metric

Dual scales are powerful, but how do we choose them optimally? The authors define a proxy metric: matrix imbalance.

Imbalance \(I(\mathbf{W})\) is:

\[ I(\mathbf{W}) = \frac{\tilde{\sigma}_{\max}(\mathbf{W})}{\tilde{\sigma}_{\min}(\mathbf{W})} \]

where \(\tilde{\sigma}_{\max}\) is the maximum standard deviation across all rows and columns, and \(\tilde{\sigma}_{\min}\) is the corresponding minimum.

A low-imbalance matrix has similar spread in every row and column—making it easier to quantize, since no single dimension requires a disproportionately large scale.
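Read literally, the metric is only a few lines of code; the helper below is an illustrative sketch, not the authors' implementation:

```python
import torch

def imbalance(W: torch.Tensor) -> torch.Tensor:
    """Matrix imbalance: largest standard deviation over all rows and columns
    divided by the smallest, as defined above."""
    row_std = W.std(dim=1)               # one std per row
    col_std = W.std(dim=0)               # one std per column
    stds = torch.cat([row_std, col_std])
    return stds.max() / stds.min()
```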

The Sinkhorn-Knopp Connection

Directly minimizing imbalance is tricky: the max and min operations are awkward for gradient-based optimization. The authors instead draw inspiration from the Sinkhorn-Knopp algorithm, which classically alternates between normalizing the row sums and the column sums of a nonnegative matrix until both are uniform.

SINQ adapts this: instead of sums, it iteratively normalizes standard deviations.

Algorithm sketch:

  1. Start with original weights \(\mathbf{W}\).
  2. Repeat for \(n\) iterations:
    • Divide each column by its std. deviation.
    • Divide each row by its new std. deviation.
  3. Quantize the balanced matrix \(\hat{\mathbf{W}}\) using RTN or similar; the row & column normalization factors become the dual scales \(\vec{s}\) and \(\vec{t}\).

This converges quickly and yields nearly identical std. deviations across rows and columns—minimizing imbalance.
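A direct reading of that sketch in PyTorch might look like the following; the iteration count, the epsilon, and the way the factors are accumulated are my choices rather than the reference implementation:

```python
import torch

def sinkhorn_std_normalize(W: torch.Tensor, n_iters: int = 10, eps: float = 1e-8):
    """Alternately divide columns and rows by their standard deviations, as in
    the algorithm sketch above, accumulating the factors into the dual scales."""
    W_hat = W.clone()
    s = torch.ones(W.shape[0], 1, dtype=W.dtype, device=W.device)  # row factors
    t = torch.ones(1, W.shape[1], dtype=W.dtype, device=W.device)  # column factors
    for _ in range(n_iters):
        col_std = W_hat.std(dim=0, keepdim=True).clamp(min=eps)
        W_hat = W_hat / col_std          # normalize column spreads
        t = t * col_std
        row_std = W_hat.std(dim=1, keepdim=True).clamp(min=eps)
        W_hat = W_hat / row_std          # normalize row spreads
        s = s * row_std
    return W_hat, s, t                   # W = s * W_hat * t (up to numerics)
```

Quantizing \(\hat{\mathbf{W}}\) with an RTN routine like the one sketched earlier, while keeping \(\vec{s}\) and \(\vec{t}\) in higher precision, recovers the SINQ parameterization.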

Figure 2: Results on Qwen3-1.7B. Minimizing imbalance with SINQ's algorithm (panels a, b) reduces both imbalance and kurtosis and improves perplexity; minimizing kurtosis directly (panels d, e) yields lower kurtosis but large imbalance, harming accuracy (panels c, f).

Playing Well with Others: A-SINQ

SINQ is modular—it meshes with other techniques. The authors demonstrate A-SINQ, combining SINQ with AWQ calibration:

  1. Apply SINQ normalization to balance the matrix.
  2. Apply AWQ scales to emphasize important weights.
  3. Quantize.

This preserves AWQ’s activation-awareness while benefiting from SINQ’s outlier management.
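The composition might be sketched as follows, reusing the routines from the earlier sketches. Here `awq_col_scales` is assumed to be a per-column importance scale obtained from a separate AWQ-style calibration pass (not shown); the paper's exact pipeline may differ:

```python
import torch

# Reuses rtn_quantize and sinkhorn_std_normalize from the sketches above.

def a_sinq_quantize(W: torch.Tensor, awq_col_scales: torch.Tensor, bits: int = 4):
    """Hypothetical ordering of the three steps listed above. awq_col_scales is
    assumed to have shape [1, cols], with larger values for salient channels."""
    W_hat, s, t = sinkhorn_std_normalize(W)          # 1. balance rows and columns
    W_hat = W_hat * awq_col_scales                   # 2. boost salient columns before rounding
    t = t / awq_col_scales                           #    ...folding the boost back into t
    Q, s_q, z, _ = rtn_quantize(W_hat, bits=bits)    # 3. quantize the balanced matrix
    return Q, s * s_q, z, t                          # W ~= (s * s_q) * (Q + z) * t
```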

Experiments: Putting SINQ to the Test

The team evaluated SINQ across Qwen3 and DeepSeek models using:

  • Perplexity (lower is better; measures language modeling quality)
  • Flip rate (lower is better; measures prediction changes vs. full-precision baseline)
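As a concrete reading of the flip-rate definition (the exact evaluation protocol in the paper may differ in details), the metric reduces to comparing per-example predictions:

```python
import torch

def flip_rate(preds_full: torch.Tensor, preds_quant: torch.Tensor) -> float:
    """Fraction of benchmark examples whose predicted answer differs between the
    full-precision and the quantized model. Inputs are per-example answer indices."""
    return (preds_full != preds_quant).float().mean().item()
```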

Uncalibrated, Uniform Quantization

This is SINQ’s sweet spot. In 3-bit and 4-bit settings, SINQ consistently achieved the lowest perplexity (Table 1), outperforming RTN, Hadamard+RTN, and HQQ; in some cases it halved the gap to the full-precision baseline.

Table 1: Weight-only uncalibrated uniform PTQ on Qwen3 models with 3-bit and 4-bit quantization, reporting perplexity and actual memory usage (GB). Lower is better for all metrics. The best result for a given setting is marked in bold.

Flip rates tell the same story: SINQ changes the model’s answers less often than competitors, meaning predictions are closer to the original.

Table 2: Flip rates (%) on HellaSwag, PIQA, and MMLU for Qwen3 models with 3-bit and 4-bit quantization. Lower is better. The best result for a given setting is marked in bold.

In the memory–perplexity Pareto plot (Fig. 3), SINQ sits on or near the optimal trade-off curve—low memory and low perplexity.

Figure 3: Pareto plot of memory vs. WikiText2 perplexity for Qwen3 models. SINQ is shown to be on or near the Pareto-optimal front for both 4-bit methods (a) and a mix of bit widths (b).

SINQ also scales up to huge architectures like DeepSeek-V2.5-236B MoE, maintaining robustness.

Table 3: Weight-only PTQ on large MoE models, showing SINQ’s strong performance on 3-bit and 4-bit quantization.

Compatibility with Other Techniques

  • Non-Uniform Quantization: Pairing SINQ with NF4 improves the baseline NF4 (Table 4). For the 32B model, uniform INT4 SINQ beats NF4—a testament to the core normalization’s power.
  • Calibrated Quantization: A-SINQ (SINQ + AWQ) often sets new state-of-the-art results (Table 5). In some cases, calibration-free SINQ beats calibrated methods like GPTQ.

Table 4: SINQ combined with non-uniform NF4 quantization, showing improved performance over the NF4 baseline.

Table 5: SINQ combined with calibrated AWQ (A-SINQ), demonstrating that it further improves performance and often sets the state of the art.

Speed Matters: Quantization Time

Speed is a major advantage. SINQ needs only ~1.1× the time of RTN—far faster than HQQ (>2×) and dramatically quicker than AWQ (>30×) or GPTQ.

Figure 5: A plot showing the distribution of quantization times for various methods on the Qwen3-32B model. SINQ is extremely fast, comparable to RTN, while calibrated methods like AWQ and GPTQ are much slower.

Conclusion and Takeaways

The SINQ paper presents a simple, elegant approach to LLM quantization that is:

  • Powerful: Dual scaling isolates outlier impact, resulting in higher accuracy.
  • Smart: Matrix imbalance is an effective optimization target for easier quantization.
  • Fast: State-of-the-art uncalibrated results with runtimes close to RTN—ideal when time or data is scarce.
  • Versatile: Works with non-uniform levels and calibrated schemes for even better performance.

In a landscape where methods are growing more complex, SINQ shows that a thoughtful, well-targeted idea can narrow the gap between basic calibration-free quantization and elaborate techniques—making high-performance LLMs smaller, faster, and more accessible.