Google DeepMind Releases Gemma 4 QAT Models, Shrinking AI Memory Use to Under 1GB for Consumer Devices

Jun 08, 2026

Google

Summary

Google DeepMind has released Gemma 4 checkpoints trained with Quantization-Aware Training (QAT). A mobile-specialized quantization schema brings the Gemma 4 E2B memory footprint under 1GB while preserving model quality, making on-device AI practical on everyday consumer hardware.

Key Points

Google DeepMind is releasing new Gemma 4 model checkpoints optimized with Quantization-Aware Training (QAT), dramatically reducing memory requirements while preserving model quality for use on everyday edge devices and consumer GPUs.
Unlike standard Post-Training Quantization, which compresses a model after training has finished, QAT integrates the compression process directly into training. A custom mobile-specialized quantization schema reduces the Gemma 4 E2B memory footprint to under 1GB through techniques such as static activations, channel-wise quantization, and targeted 2-bit compression.
The QAT models are available now on Hugging Face in multiple formats and are supported by popular developer tools including llama.cpp, Ollama, LM Studio, vLLM, SGLang, and MLX, enabling seamless local and on-device deployment across desktop and mobile platforms.
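
To make the channel-wise quantization mentioned above more concrete, here is a minimal sketch in plain Python. It is not Google's quantization schema and it does no training: it only shows the quantize-then-dequantize rounding step (often called fake quantization) that QAT-style methods typically simulate during training, and compares one scale shared by the whole tensor against one scale per output channel. The function names, the 4-bit setting, and the toy weights are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def fake_quantize(values: List[float], bits: int) -> List[float]:
    """Symmetric round-to-nearest quantization followed by dequantization.

    All values share one scale, derived from the largest magnitude.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / qmax
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # integer level
        out.append(q * scale)  # back to floating point
    return out


def per_tensor(weights: Matrix, bits: int) -> Matrix:
    """One scale for the entire weight matrix."""
    cols = len(weights[0])
    flat = [v for row in weights for v in row]
    deq = fake_quantize(flat, bits)
    return [deq[i * cols:(i + 1) * cols] for i in range(len(weights))]


def per_channel(weights: Matrix, bits: int) -> Matrix:
    """One scale per row (treated here as one output channel)."""
    return [fake_quantize(row, bits) for row in weights]


def mean_abs_error(a: Matrix, b: Matrix) -> float:
    """Average absolute difference between two matrices of equal shape."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Hypothetical toy weights: one channel with small values, one with large.
toy_weights = [
    [0.01, -0.02, 0.015, 0.005],
    [1.0, -2.0, 1.5, 0.5],
]

err_tensor = mean_abs_error(toy_weights, per_tensor(toy_weights, bits=4))
err_channel = mean_abs_error(toy_weights, per_channel(toy_weights, bits=4))
print(f"per-tensor error:  {err_tensor:.5f}")
print(f"per-channel error: {err_channel:.5f}")
```

With a single shared scale, the small-valued channel is rounded to zero because the scale is dictated by the large-valued channel. Giving each channel its own scale avoids that, which is the general motivation for channel-wise quantization at low bit widths.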
