Quantization Slashes AI Model Size By 75% With Minimal Quality Loss, But 2-Bit Compression Causes Near-Total Collapse

Mar 26, 2026
ngrok blog
Summary

Quantization can slash AI model sizes by 75% with minimal quality loss at 8-bit and 4-bit precision, but pushing compression to 2-bit causes near-total collapse, with 97% of benchmark questions going unanswered and responses devolving into incoherent loops, according to new testing on Qwen3.5 9B.

Key Points

  • Quantization compresses large language model parameters from high-precision floating point formats like bfloat16 down to smaller formats like 8-bit or 4-bit integers, reducing model size by up to 4x and increasing inference speed by over 2x with minimal quality loss.
  • Testing on Qwen3.5 9B reveals that 8-bit and 4-bit quantization retain strong performance across perplexity, KL divergence, and benchmark scores, while 2-bit quantization causes near-total model collapse, with 97% of benchmark questions going unanswered and responses devolving into incoherent loops.
  • Asymmetric quantization outperforms symmetric quantization by fitting the quantization range tightly around actual data values rather than centering on zero, reducing average parameter error from roughly 18% to 8.5% at 4-bit precision.
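
The symmetric-vs-asymmetric tradeoff described above can be sketched in a toy NumPy example. This is an illustration, not the article's actual test harness: the weight distribution, 4-bit settings, and error metric are assumptions, so the exact error percentages will differ from the roughly 18% and 8.5% figures reported, but the ranking (asymmetric beats symmetric on data not centered at zero) should hold.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "weights" skewed away from zero, where a tight asymmetric range helps most.
w = rng.normal(loc=0.5, scale=0.2, size=10_000).astype(np.float32)

def quantize_symmetric(x, bits=4):
    # Symmetric: range centered on zero, [-max|x|, +max|x|].
    qmax = 2 ** (bits - 1) - 1                # 7 for signed 4-bit
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                          # dequantized values

def quantize_asymmetric(x, bits=4):
    # Asymmetric: range fit tightly to [min(x), max(x)] via a zero-point offset.
    levels = 2 ** bits - 1                    # 15 levels for unsigned 4-bit
    scale = (x.max() - x.min()) / levels
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, levels)
    return (q - zero_point) * scale

def mean_rel_error(x, x_hat, eps=1e-8):
    # Average per-parameter relative error after round-tripping.
    return float(np.mean(np.abs(x - x_hat) / (np.abs(x) + eps)))

sym_err = mean_rel_error(w, quantize_symmetric(w))
asym_err = mean_rel_error(w, quantize_asymmetric(w))
print(f"symmetric 4-bit error:  {sym_err:.1%}")
print(f"asymmetric 4-bit error: {asym_err:.1%}")
```

Because the symmetric scheme must cover a zero-centered range even when the data sits mostly on one side, it wastes quantization levels on values that never occur; the asymmetric zero-point reclaims those levels, which is why it wins at low bit widths.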
