TriAttention Compresses KV Cache by 10.7x and Doubles Inference Speed Without Sacrificing Accuracy
Summary
TriAttention is a new open-source method that compresses the KV cache by 10.7x and doubles inference speed on long reasoning tasks while matching full-attention accuracy. It achieves this with trigonometric frequency-domain compression, integrates with vLLM, and enables easy local deployment on memory-constrained GPUs.
Key Points
- TriAttention is a new open-source method that compresses KV cache by 10.7x and boosts inference throughput by 2.5x on long reasoning tasks like AIME25, all while matching full attention accuracy.
- The system applies trigonometric frequency-domain compression to pre-RoPE Q/K vectors, scoring and retaining only the most relevant KV cache entries. This avoids the costly representative-query selection step used by existing methods.
- TriAttention integrates seamlessly with vLLM via an auto-discovered plugin, exposes an OpenAI-compatible API, and enables local deployment of large models on memory-constrained GPUs like the 24GB RTX 4090 through OpenClaw compatibility.
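The source does not publish TriAttention's scoring algorithm, so the following is only an illustrative sketch of the general idea described above: project pre-RoPE query and key vectors onto cos/sin (trigonometric frequency) features, score each cached key against the query in that compressed space, and retain only the top-scoring KV entries. All names, the frequency bands, and the keep ratio are assumptions, not the method's actual implementation.

```python
import numpy as np

def tri_score_and_prune(q, K, keep_ratio=0.25, n_freq=8):
    """Hypothetical sketch: q is a (d,) pre-RoPE query, K is a (T, d) matrix
    of pre-RoPE keys. Returns sorted indices of the KV entries to retain."""
    freqs = 2.0 ** np.arange(n_freq)  # illustrative frequency bands (assumed)

    def tri_feats(x):
        # Map each coordinate to cos/sin features at several frequencies,
        # i.e. a simple trigonometric frequency-domain representation.
        phase = x[..., None] * freqs                      # (..., d, n_freq)
        feats = np.concatenate([np.cos(phase), np.sin(phase)], axis=-1)
        return feats.reshape(*x.shape[:-1], -1)           # flatten per vector

    qf, Kf = tri_feats(q), tri_feats(K)
    scores = Kf @ qf                 # relevance of each cached key to the query
    k = max(1, int(keep_ratio * K.shape[0]))
    return np.sort(np.argsort(scores)[-k:])  # keep the top-k entries

rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 16))     # 64 cached keys of dimension 16
query = rng.normal(size=16)
kept = tri_score_and_prune(query, keys)  # indices of retained cache entries
```

With `keep_ratio=0.25`, 64 cached entries are pruned down to 16, a 4x reduction; the reported 10.7x compression would correspond to a more aggressive (and presumably adaptive) retention policy.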
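Because TriAttention exposes an OpenAI-compatible API through vLLM, a deployed server can be queried like any OpenAI-style endpoint. The sketch below builds such a request; the base URL, model name, and prompt are placeholders (a locally running vLLM server is assumed, not confirmed by the source).

```python
import json

# Assumed local endpoint for a vLLM server with the TriAttention plugin loaded.
BASE_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "my-long-context-model",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Solve this competition problem step by step."}
    ],
    "max_tokens": 2048,
}

# To actually send the request (requires the server to be running):
#   import urllib.request
#   req = urllib.request.Request(
#       BASE_URL,
#       data=json.dumps(payload).encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
print(json.dumps(payload, indent=2))
```

Since the API follows the OpenAI chat-completions shape, existing OpenAI client libraries should also work by pointing their base URL at the local server.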