Nvidia Cuts LLM Memory Costs by 8x With New Dynamic Memory Sparsification Technique

Feb 12, 2026
VentureBeat

Summary

Nvidia's new Dynamic Memory Sparsification (DMS) technique cuts large language model memory costs by up to 8x while maintaining accuracy. It delivers up to 5x higher throughput on models such as Qwen3-8B, and because it is already available in Nvidia's Model Optimizer framework, enterprises can deploy it rapidly on a single DGX H100.

Key Points

  • Nvidia researchers develop Dynamic Memory Sparsification (DMS), a technique that reduces large language model reasoning memory costs by up to 8x without sacrificing accuracy.
  • DMS works by training LLMs to intelligently identify and discard non-essential tokens from the KV cache using a 'delayed eviction' mechanism, allowing models to reason longer and deeper within the same memory budget.
  • Already released as part of Nvidia's Model Optimizer framework, DMS delivers up to 5x higher throughput on models like Qwen3-8B and requires no custom hardware, making it accessible for enterprise deployment within hours on a single DGX H100.
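The "delayed eviction" idea from the key points above can be illustrated with a toy KV-cache pruner. This is a hedged sketch, not Nvidia's DMS implementation: DMS trains the model itself to decide which tokens to discard, whereas this example uses hand-supplied importance scores, and all class, method, and parameter names (`SlidingEvictionKVCache`, `budget`, `delay`) are hypothetical. The core idea it demonstrates is that a token flagged as non-essential is not dropped immediately; it remains readable for a few more steps before being evicted, so nearby attention can still use it.

```python
from collections import deque


class SlidingEvictionKVCache:
    """Toy sketch of delayed-eviction KV-cache pruning (illustrative only,
    not Nvidia's DMS). Tokens flagged as non-essential stay readable for
    `delay` more steps before they are actually removed."""

    def __init__(self, budget, delay):
        self.budget = budget    # max unflagged tokens kept long-term
        self.delay = delay      # steps a flagged token remains readable
        self.cache = []         # entries: [token_id, score, flagged]
        self.pending = deque()  # (entry, evict_at_step), in flag order
        self.step = 0

    def append(self, token_id, score):
        """Add a token; if over budget, flag the lowest-score unflagged
        token for eviction `delay` steps in the future."""
        self.step += 1
        self.cache.append([token_id, score, False])
        live = [e for e in self.cache if not e[2]]
        if len(live) > self.budget:
            victim = min(live, key=lambda e: e[1])
            victim[2] = True  # flagged, but still visible for now
            self.pending.append((victim, self.step + self.delay))
        # actually evict tokens whose grace period has expired
        while self.pending and self.pending[0][1] <= self.step:
            entry, _ = self.pending.popleft()
            self.cache.remove(entry)

    def visible_tokens(self):
        """All tokens attention could still read, flagged or not."""
        return [e[0] for e in self.cache]
```

With a budget of 2 and a delay of 2, appending a third token flags the least important one, yet it stays visible for two more steps before disappearing, which is the behavior the delayed-eviction mechanism is described as providing within a fixed memory budget.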
