IndexCache Cuts DeepSeek Sparse Attention Compute by 75%, Delivering Up to 1.82× Faster AI Inference on H100 GPUs

Mar 16, 2026

Summary

IndexCache cuts DeepSeek Sparse Attention indexer compute by 75% by reusing top-k token indices across transformer layers. It delivers up to 1.82× faster AI inference on H100 GPUs with negligible quality loss and zero extra GPU memory overhead, and is available now as a patch for SGLang and vLLM.

Key Points

  • IndexCache is a new inference acceleration method that eliminates up to 75% of redundant indexer computation in DeepSeek Sparse Attention (DSA) by reusing top-k token indices across adjacent transformer layers, achieving up to 1.82× prefill and 1.48× decode speedup on H100 hardware.
  • It partitions layers into "Full" layers, which run their own indexer, and "Shared" layers, which reuse cached indices from a nearby Full layer. Both training-free and training-aware variants are available; only 1/4 of the indexers need to run, with negligible quality degradation across benchmarks.
  • IndexCache ships as a patch for SGLang and vLLM, supports models including DeepSeek-V3.2 and GLM-5 (744B), and exposes a single parameter controlling how frequently layers retain their indexer, all with zero additional GPU memory overhead.
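The Full/Shared layer scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the actual IndexCache implementation or API: the function and parameter names (`topk_indices`, `sparse_attention_layers`, `stride`) are invented for clarity, and real DSA indexers score tokens per attention head on the GPU rather than over a flat Python list.

```python
import random


def topk_indices(scores, k):
    # Indexer stand-in: pick the k highest-scoring token positions
    # that sparse attention will attend to.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def sparse_attention_layers(layer_scores, k, stride=4):
    """Select token indices for each layer of a transformer stack.

    Only every `stride`-th layer is a "Full" layer that runs its own
    indexer; the "Shared" layers in between reuse the most recent
    cached indices. With stride=4, only 1/4 of the indexers run,
    and no extra memory is needed beyond the indices already computed.
    """
    cached = None
    selected = []
    for layer, scores in enumerate(layer_scores):
        if layer % stride == 0:  # Full layer: compute fresh top-k indices
            cached = topk_indices(scores, k)
        # Shared layers fall through and reuse `cached` as-is.
        selected.append(cached)
    return selected


# 8 layers with stride 4: the indexer runs only on layers 0 and 4.
random.seed(0)
layer_scores = [[random.random() for _ in range(16)] for _ in range(8)]
sel = sparse_attention_layers(layer_scores, k=4)
```

In this toy setting, layers 1-3 attend to exactly the tokens layer 0 selected, mirroring how IndexCache exploits the similarity of top-k selections in adjacent layers; the `stride` knob plays the role of the configuration parameter that controls how often a layer keeps its indexer.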
