TriAttention Compresses KV Cache by 10.7x and Doubles Inference Speed Without Sacrificing Accuracy
TriAttention, a new open-source method, compresses the KV cache by 10.7x and doubles inference speed on long reasoning tasks while matching full-attention accuracy. It works by compressing cached keys and values in the trigonometric frequency domain, and it integrates with vLLM for easy local deployment on memory-constrained GPUs.
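The summary does not spell out TriAttention's actual algorithm, but the general idea of trigonometric frequency-domain compression can be sketched with a discrete cosine transform (a trigonometric basis): transform the cache along the sequence axis, keep only the lowest-frequency coefficients, and invert on read. The code below is an illustrative sketch under that assumption; the function names, shapes, and `keep_ratio` parameter are hypothetical and are not TriAttention's or vLLM's API.

```python
# Illustrative sketch only -- NOT the published TriAttention algorithm.
# Shows generic trigonometric frequency-domain compression of a KV cache
# via a DCT along the sequence axis, truncating high-frequency coefficients.
import numpy as np
from scipy.fft import dct, idct

def compress_kv(kv: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Compress a KV tensor of shape (seq_len, num_heads, head_dim).

    Transforms along the sequence axis and keeps only the lowest
    `keep_ratio` fraction of frequency coefficients.
    """
    coeffs = dct(kv, type=2, norm="ortho", axis=0)  # to frequency domain
    k = max(1, int(kv.shape[0] * keep_ratio))       # coefficients to retain
    return coeffs[:k]                               # drop high frequencies

def decompress_kv(coeffs: np.ndarray, seq_len: int) -> np.ndarray:
    """Reconstruct an approximation of the original KV tensor."""
    padded = np.zeros((seq_len,) + coeffs.shape[1:], dtype=coeffs.dtype)
    padded[: coeffs.shape[0]] = coeffs              # zero-fill truncated bands
    return idct(padded, type=2, norm="ortho", axis=0)  # back to sequence domain

# Hypothetical cache: 4096 tokens, 32 heads, head_dim 128.
kv = np.random.randn(4096, 32, 128).astype(np.float32)
compressed = compress_kv(kv, keep_ratio=1 / 10.7)   # ~10.7x smaller along seq
approx = decompress_kv(compressed, seq_len=kv.shape[0])
```

Truncating DCT coefficients acts as a low-pass approximation of the cache along the sequence dimension; whether that preserves attention accuracy at 10.7x, as the article claims TriAttention does, depends on details the summary does not provide.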