Quantization Slashes AI Model Size By 75% With Minimal Quality Loss, But 2-Bit Compression Causes Near-Total Collapse

Mar 26, 2026
ngrok blog
Summary

Quantization can slash AI model sizes by 75% with minimal quality loss at 8-bit and 4-bit precision, but pushing compression to 2-bit causes near-total collapse, with 97% of benchmark questions going unanswered and responses devolving into incoherent loops, according to new testing on Qwen3.5 9B.

Key Points

  • Quantization compresses large language model parameters from high-precision floating point formats like bfloat16 down to smaller formats like 8-bit or 4-bit integers, reducing model size by up to 4x and increasing inference speed by over 2x with minimal quality loss.
  • Testing on Qwen3.5 9B reveals that 8-bit and 4-bit quantization retain strong performance across perplexity, KL divergence, and benchmark scores, while 2-bit quantization causes near-total model collapse, with 97% of benchmark questions going unanswered and responses devolving into incoherent loops.
  • Asymmetric quantization outperforms symmetric quantization by fitting the quantization range tightly around actual data values rather than centering on zero, reducing average parameter error from roughly 18% to 8.5% at 4-bit precision.
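
The symmetric-vs-asymmetric tradeoff described above can be sketched in a toy NumPy example. This is an illustration, not the article's actual test harness: the weight distribution, 4-bit settings, and error metric are assumptions, so the exact error percentages will differ from the roughly 18% and 8.5% figures reported, but the ranking (asymmetric beats symmetric on data not centered at zero) should hold.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "weights" skewed away from zero, where a tight asymmetric range helps most.
w = rng.normal(loc=0.5, scale=0.2, size=10_000).astype(np.float32)

def quantize_symmetric(x, bits=4):
    # Symmetric: range centered on zero, [-max|x|, +max|x|].
    qmax = 2 ** (bits - 1) - 1                # 7 for signed 4-bit
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                          # dequantized values

def quantize_asymmetric(x, bits=4):
    # Asymmetric: range fit tightly to [min(x), max(x)] via a zero-point offset.
    levels = 2 ** bits - 1                    # 15 levels for unsigned 4-bit
    scale = (x.max() - x.min()) / levels
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, levels)
    return (q - zero_point) * scale

def mean_rel_error(x, x_hat, eps=1e-8):
    # Average per-parameter relative error after round-tripping.
    return float(np.mean(np.abs(x - x_hat) / (np.abs(x) + eps)))

sym_err = mean_rel_error(w, quantize_symmetric(w))
asym_err = mean_rel_error(w, quantize_asymmetric(w))
print(f"symmetric 4-bit error:  {sym_err:.1%}")
print(f"asymmetric 4-bit error: {asym_err:.1%}")
```

Because the symmetric scheme must cover a zero-centered range even when the data sits mostly on one side, it wastes quantization levels on values that never occur; the asymmetric zero-point reclaims those levels, which is why it wins at low bit widths.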
