Google Releases Open-Source AI Drafters That Speed Up Gemma 4 Inference by Up to 3x Without Sacrificing Quality
Summary
Google has released open-source Multi-Token Prediction (MTP) drafters for its Gemma 4 models. The drafters use speculative decoding to make inference up to 3x faster with no loss in output quality, and they are freely available on Hugging Face and Kaggle under the Apache 2.0 license.
Key Points
- Google releases Multi-Token Prediction (MTP) drafters for Gemma 4 models, delivering up to a 3x inference speedup with zero degradation in output quality or reasoning accuracy.
- The MTP drafters use speculative decoding: a lightweight drafter model proposes several tokens ahead, and the larger Gemma 4 target model verifies all of them in a single parallel forward pass, accepting the drafted tokens that agree with its own output and substituting its own token at the first disagreement. Because the expensive model runs once per batch of drafted tokens instead of once per token, latency drops substantially.
- The MTP drafters are available under the open-source Apache 2.0 license on Hugging Face and Kaggle and are supported in multiple inference frameworks, covering deployments from on-device edge hardware to high-performance workstations and the cloud.
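The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal sketch of generic greedy speculative decoding, not Google's MTP drafter code or any Hugging Face or Kaggle API: `toy_target` and `toy_drafter` are hypothetical integer-valued stand-ins for Gemma 4 and its drafter, and `k` (tokens drafted per round) is an illustrative parameter. It covers greedy decoding only; sampled decoding uses a probabilistic acceptance rule that is not shown here.

```python
from typing import Callable

# Hypothetical model interface: maps a token context to its greedy next token.
Model = Callable[[list[int]], int]


def toy_target(ctx: list[int]) -> int:
    """Hypothetical stand-in for the large target model (greedy next token)."""
    return (3 * ctx[-1] + len(ctx)) % 7


def toy_drafter(ctx: list[int]) -> int:
    """Hypothetical stand-in for the cheap drafter: agrees with the target
    except at context lengths divisible by 5, so some drafts get rejected."""
    guess = toy_target(ctx)
    return (guess + 1) % 7 if len(ctx) % 5 == 0 else guess


def greedy_decode(target: Model, prompt: list[int], max_new_tokens: int) -> list[int]:
    """Baseline: one sequential target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: Model, drafter: Model, prompt: list[int],
                       max_new_tokens: int, k: int) -> tuple[list[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target verification passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The drafter proposes k tokens autoregressively (cheap sequential calls).
        draft: list[int] = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))
        # 2. The target checks every drafted position. A real system gets all k + 1
        #    predictions from ONE batched forward pass; this loop simulates that pass.
        target_passes += 1
        accepted: list[int] = []
        for i in range(k):
            expected = target(tokens + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            accepted.append(draft[i])
        else:
            # Every draft matched, so the same pass also yields one extra "bonus" token.
            accepted.append(target(tokens + draft))
        tokens.extend(accepted)
    # The last round may overshoot; trim to the requested length.
    return tokens[:len(prompt) + max_new_tokens], target_passes


if __name__ == "__main__":
    prompt = [1, 2]
    fast, passes = speculative_decode(toy_target, toy_drafter, prompt, max_new_tokens=20, k=4)
    assert fast == greedy_decode(toy_target, prompt, 20)  # identical output: no quality loss
    print(f"{passes} target passes for 20 new tokens")  # prints "5 target passes for 20 new tokens"
```

Because every emitted token is either a draft the target agreed with or the target's own correction, the output matches plain greedy decoding of the target exactly. The speedup comes only from needing fewer sequential target passes, and it shrinks as the drafter's acceptance rate falls.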