Xiaomi Hits 1,000 Tokens Per Second on Trillion-Parameter AI, Claiming 15x Speed Advantage Over ChatGPT and Claude
Summary
Xiaomi and TileRT claim a major AI speed breakthrough: over 1,000 tokens per second on a trillion-parameter model running on just 8 commodity GPUs, roughly 15 times faster than ChatGPT and Claude. The result relies on FP4 quantization and speculative decoding, and an open-source model checkpoint is already live on Hugging Face.
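As a quick sanity check on the headline ratio, the sketch below divides the claimed 1,000 tokens per second by the 68–71 tokens per second cited for ChatGPT and Claude. The 1,000-token response length and the helper names (`speedup`, `seconds_for`) are illustrative assumptions, not figures or code from Xiaomi or TileRT.

```python
# Throughput figures as quoted in the article (tokens per second).
claimed_tps = 1000.0           # Xiaomi/TileRT claim ("over 1,000")
baseline_tps = (68.0, 71.0)    # range cited for ChatGPT and Claude


def speedup(fast_tps: float, slow_tps: float) -> float:
    """How many times faster one throughput is than another."""
    return fast_tps / slow_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady `tps`."""
    return tokens / tps


ratios = [speedup(claimed_tps, b) for b in baseline_tps]
print(f"speedup range: {min(ratios):.1f}x to {max(ratios):.1f}x")

# Assumed example: a 1,000-token response at each speed.
response_tokens = 1000
print(f"claimed:  {seconds_for(response_tokens, claimed_tps):.1f} s")
print(f"baseline: {seconds_for(response_tokens, baseline_tps[0]):.1f} s")
```

At exactly 1,000 tokens per second the ratio lands between 14x and 15x, so "roughly 15 times" holds as a rounded figure, and more comfortably so if sustained throughput is above 1,000.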
Key Points
- Xiaomi and inference partner TileRT report over 1,000 tokens per second on a 1-trillion-parameter AI model running on a standard 8-GPU commodity node. That is roughly 15 times faster than ChatGPT and Claude, which operate at around 68–71 tokens per second.
- The breakthrough relies on two techniques. FP4 quantization compresses the model's expert layers to 4-bit precision with a reported near-zero quality loss. DFlash speculative decoding proposes and verifies entire blocks of tokens in a single pass rather than generating one token at a time.
- A limited API trial for MiMo-V2.5-Pro-UltraSpeed runs June 9–23, priced at 3 times the standard MiMo rate for approximately 10 times the generation speed. The FP4-DFlash model checkpoint is already open-sourced on Hugging Face for community testing.
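To make the block-wise "propose and verify" idea concrete, here is a minimal sketch of generic greedy speculative decoding. It is not DFlash or TileRT's implementation, whose internals are not described here. The two toy "models" (`target_next`, `draft_next`) and the function `speculative_generate` are hypothetical stand-ins: a cheap drafter guesses a block of tokens, and the large model checks the whole block in what counts as one pass.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def target_next(ctx: Sequence[int]) -> int:
    # Hypothetical large model: a deterministic function of the context.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: Sequence[int]) -> int:
    # Hypothetical cheap drafter: agrees with the target except when the
    # context length is a multiple of 5, where it guesses wrong.
    guess = target_next(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 50


def greedy_generate(prompt: Sequence[int], n_new: int,
                    target: NextToken) -> List[int]:
    # Baseline: one target pass per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_generate(prompt: Sequence[int], n_new: int, block: int,
                         target: NextToken,
                         draft: NextToken) -> Tuple[List[int], int]:
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The drafter proposes a block of tokens.
        proposal: List[int] = []
        for _ in range(block):
            proposal.append(draft(out + proposal))
        # 2. The target verifies the block. A real system scores every
        #    position in one batched forward pass; this toy loops over
        #    positions but counts the whole check as a single pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction ends the block
                break
        else:
            # Whole block accepted: the same pass yields one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[len(prompt):][:n_new], target_passes


prompt, n_new = [1, 2, 3], 20
tokens, passes = speculative_generate(prompt, n_new, 4, target_next, draft_next)
baseline = greedy_generate(prompt, n_new, target_next)
print("same output as one-at-a-time decoding:", tokens == baseline)
print("target passes:", passes, "vs", n_new)
```

In this greedy variant every emitted token is one the target itself would have chosen, so the output matches one-at-a-time decoding exactly. The speed gain comes only from needing fewer passes through the large model, and it shrinks as the drafter's guesses get worse.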