Google Releases Gemma 4 Speed Boost Tech, Delivering Up to 3x Faster AI Token Generation

May 07, 2026
Ars Technica

Summary

Google has released Multi-Token Prediction (MTP) drafters for Gemma 4 that speed up token generation by up to 3x with no loss in output quality. The drafters are available under the Apache 2.0 license and work with major frameworks including MLX, vLLM, SGLang, and Ollama.

Key Points

  • Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 open AI models, delivering up to 3x faster token generation without any loss in output quality.
  • The speedup comes from speculative decoding: a lightweight draft model uses otherwise idle compute to predict several future tokens, and the main Gemma model then verifies those tokens in parallel, so multiple tokens can be accepted in a single forward pass.
  • The MTP drafters are available under the permissive Apache 2.0 license and are compatible with popular frameworks including MLX, vLLM, SGLang, and Ollama. Reported speed gains range from 2.5x on Apple M4 silicon to over 3x on mobile devices running smaller models.
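
The draft-and-verify loop described in the second key point can be illustrated with a toy sketch. This is not Google's implementation or any framework's API: the functions `toy_target`, `toy_draft`, `speculative_step`, `generate`, and `plain_generate` are hypothetical names invented for illustration, the "models" are simple arithmetic rules, and decoding is greedy. The verification loop here runs sequentially for readability; in a real system it would be one batched forward pass of the main model.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    # Stand-in for the large model (hypothetical rule, not a real model).
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    # Stand-in for the lightweight drafter: agrees with the target
    # except at every sixth context length, where it guesses wrong.
    guess = toy_target(ctx)
    return (guess + 1) % 7 if len(ctx) % 6 == 0 else guess


def speculative_step(target: NextToken, draft: NextToken,
                     ctx: List[int], k: int) -> List[int]:
    """One round: draft k tokens, then keep the prefix the target agrees with."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(ctx + proposed))

    accepted: List[int] = []
    for tok in proposed:
        expected = target(ctx + accepted)
        if tok != expected:
            # First mismatch: take the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target(ctx + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens with speculation; also return the number of rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


def plain_generate(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


if __name__ == "__main__":
    tokens, rounds = generate(toy_target, toy_draft, [1], 20)
    print(tokens == plain_generate(toy_target, [1], 20), rounds)
```

Because every accepted token is one the target model would have chosen itself, the output matches plain greedy decoding while needing fewer verification rounds than tokens generated. Production systems that sample rather than decode greedily use a probabilistic acceptance rule instead of the exact-match check shown here.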
