SakanaAI Launches Open-Source Sparse LLM Kernels Promising Faster Inference and Lower Memory on H100 GPUs

May 11, 2026
Source: GitHub

Summary

SakanaAI releases 'sparser-faster-llms,' an open-source repository of custom CUDA kernels for H100 GPUs that use sparsity to dramatically speed up LLM inference and cut memory usage, with pretrained models from 0.5B to 2B parameters now available on Hugging Face Hub.

Key Points

  • The 'sparser-faster-llms' repository provides custom CUDA kernels, optimized for H100 GPUs, that exploit sparsity in large language models to raise inference throughput and cut memory usage during both training and inference.
  • The project introduces the TwELL packing format with two kernel variants — standard 'twell' and 'twell-flex' for non-uniform sparsity — and ships pretrained sparse checkpoints ranging from 0.5B to 2B parameters available on Hugging Face Hub.
  • Developers can benchmark the sparse kernels against standard PyTorch references, train their own sparse models using the provided multi-GPU launch scripts with DeepSpeed integration, and optionally measure GPU energy consumption during inference runs.
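The TwELL packing format itself is not documented in this summary, but its name suggests a variant of the classic ELLPACK (ELL) layout, which packs each row's nonzeros and their column indices into rectangular, GPU-friendly arrays padded to the widest row. The sketch below illustrates plain ELL packing in pure Python; the function names `pack_ell` and `ell_matvec` are illustrative, not part of the SakanaAI repository, and the 'twell-flex' variant presumably relaxes the uniform padding this simple layout requires for non-uniform sparsity.

```python
def pack_ell(dense):
    """Pack a row-major dense matrix (list of lists) into ELLPACK layout.

    ELL keeps, for each row, the nonzero values and their column indices,
    padded to the widest row so both arrays are rectangular -- the kind of
    regularity GPU kernels want. Padded slots hold value 0.0 at column 0,
    so they contribute nothing to a matrix-vector product.
    """
    nnz_cols = [[j for j, v in enumerate(row) if v != 0.0] for row in dense]
    width = max(len(c) for c in nnz_cols)  # max nonzeros in any row
    vals = [[dense[r][j] for j in c] + [0.0] * (width - len(c))
            for r, c in enumerate(nnz_cols)]
    cols = [c + [0] * (width - len(c)) for c in nnz_cols]
    return vals, cols

def ell_matvec(vals, cols, x):
    """Sparse matrix-vector product computed directly from the packed arrays."""
    return [sum(v * x[j] for v, j in zip(vrow, crow))
            for vrow, crow in zip(vals, cols)]
```

Because every row occupies the same padded width, a GPU thread per row reads `vals` and `cols` with fully coalesced, predictable accesses; the cost is wasted work on padding when row sparsity is uneven, which is exactly the case a flexible variant would target.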

