TriAttention Compresses KV Cache by 10.7x and Doubles Inference Speed Without Sacrificing Accuracy

Apr 08, 2026

Summary

TriAttention, a new open-source method, compresses the KV cache by 10.7x and doubles inference speed on long reasoning tasks while maintaining full-attention accuracy. It achieves this with trigonometric frequency-domain compression, integrates seamlessly with vLLM, and enables easy local deployment on memory-constrained GPUs.

Key Points

  • TriAttention is a new open-source method that compresses the KV cache by 10.7x and boosts inference throughput by 2.5x on long reasoning tasks like AIME25, all while matching full-attention accuracy.
  • The system uses trigonometric frequency-domain compression of pre-RoPE Q/K vectors to score and retain only the most relevant KV cache entries, eliminating the costly representative-query selection step used in existing methods.
  • TriAttention integrates with vLLM via an auto-discovered plugin, exposes an OpenAI-compatible API, and enables local deployment of large models on memory-constrained GPUs such as the 24GB RTX 4090 through OpenClaw compatibility.
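The scoring-and-retention idea in the second point can be sketched in a few lines. This is not TriAttention's actual algorithm, only a toy illustration under assumptions: a DCT-II cosine basis stands in for the paper's trigonometric frequency-domain transform, and relevance is scored as max similarity between recent queries and cached keys in that domain. The function name and keep ratio are invented for the example.

```python
import numpy as np

def score_kv_entries(queries, keys, keep_ratio=0.25):
    """Toy sketch: score cached (pre-RoPE) key vectors against recent
    queries in a trigonometric frequency domain, then keep only the
    top-scoring fraction of the KV cache."""
    d = keys.shape[-1]
    n = np.arange(d)
    # DCT-II cosine basis (d x d), a stand-in for the frequency transform.
    basis = np.cos(np.pi * (n + 0.5) * n[:, None] / d)
    q_freq = queries @ basis.T   # (num_queries, d)
    k_freq = keys @ basis.T      # (cache_len, d)
    # One relevance score per cached key: best match over the query window.
    scores = (q_freq @ k_freq.T).max(axis=0)
    keep = max(1, int(len(keys) * keep_ratio))
    kept_idx = np.argsort(scores)[-keep:]    # indices of entries to retain
    return np.sort(kept_idx)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64))     # recent query vectors
K = rng.standard_normal((128, 64))   # cached key vectors
idx = score_kv_entries(Q, K, keep_ratio=0.25)
print(len(idx))  # 32 of 128 entries retained, a 4x cache reduction
```

With `keep_ratio=0.25` the cache shrinks 4x here; reaching the reported 10.7x would correspond to a keep ratio near 0.09, and the real method presumably scores entries far more carefully than this max-similarity heuristic.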
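The deployment path in the last point follows vLLM's standard plugin and serving workflow, so it would plausibly look like the commands below. The package name `triattention` is an assumption (check the project's README for the real one); the model name is just an example. The `vllm serve` command and the OpenAI-compatible `/v1/chat/completions` endpoint are standard vLLM.

```shell
# Install vLLM plus the plugin package (name assumed; vLLM auto-discovers
# plugins registered via Python entry points, so no config change is needed).
pip install vllm triattention

# Serve a model; an attention plugin hooks in at engine load time.
vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 32768

# Query the standard OpenAI-compatible API that vLLM exposes:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'
```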
