IndexCache Cuts DeepSeek Sparse Attention Compute by 75%, Delivering Up to 1.82× Faster AI Inference on H100 GPUs

Mar 16, 2026

Summary

IndexCache cuts DeepSeek Sparse Attention indexer compute by 75% by reusing top-k token indices across transformer layers. It delivers up to 1.82× faster AI inference on H100 GPUs with negligible quality loss and zero extra GPU memory overhead, and is available now as a patch for SGLang and vLLM.

Key Points

  • IndexCache is a new inference acceleration method that eliminates up to 75% of redundant indexer computation in DeepSeek Sparse Attention (DSA) by reusing top-k token indices across adjacent transformer layers, achieving up to 1.82× prefill and 1.48× decode speedup on H100 hardware.
  • It partitions layers into "Full" layers, which run their own indexer, and "Shared" layers, which reuse cached indices from a nearby Full layer. Both training-free and training-aware variants are available; only 1/4 of the indexers need to run, with negligible quality degradation across benchmarks.
  • IndexCache ships as a patch for SGLang and vLLM, supports models including DeepSeek-V3.2 and GLM-5 (744B), and exposes a single parameter controlling how frequently layers retain their indexer, all with zero additional GPU memory overhead.
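The Full/Shared layer scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the actual IndexCache implementation or API: the function and parameter names (`topk_indices`, `sparse_attention_layers`, `stride`) are invented for clarity, and real DSA indexers score tokens per attention head on the GPU rather than over a flat Python list.

```python
import random


def topk_indices(scores, k):
    # Indexer stand-in: pick the k highest-scoring token positions
    # that sparse attention will attend to.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def sparse_attention_layers(layer_scores, k, stride=4):
    """Select token indices for each layer of a transformer stack.

    Only every `stride`-th layer is a "Full" layer that runs its own
    indexer; the "Shared" layers in between reuse the most recent
    cached indices. With stride=4, only 1/4 of the indexers run,
    and no extra memory is needed beyond the indices already computed.
    """
    cached = None
    selected = []
    for layer, scores in enumerate(layer_scores):
        if layer % stride == 0:  # Full layer: compute fresh top-k indices
            cached = topk_indices(scores, k)
        # Shared layers fall through and reuse `cached` as-is.
        selected.append(cached)
    return selected


# 8 layers with stride 4: the indexer runs only on layers 0 and 4.
random.seed(0)
layer_scores = [[random.random() for _ in range(16)] for _ in range(8)]
sel = sparse_attention_layers(layer_scores, k=4)
```

In this toy setting, layers 1-3 attend to exactly the tokens layer 0 selected, mirroring how IndexCache exploits the similarity of top-k selections in adjacent layers; the `stride` knob plays the role of the configuration parameter that controls how often a layer keeps its indexer.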
