AWS Launches llm-d Disaggregated Inference on SageMaker HyperPod and EKS, Boosting LLM Throughput by Up to 70%

Mar 18, 2026
Amazon Web Services

Summary

AWS is launching llm-d disaggregated inference on SageMaker HyperPod and Amazon EKS. llm-d, an open-source, Kubernetes-native framework built on vLLM, splits the compute-bound prefill phase and the memory-bound decode phase of LLM inference across distributed GPUs, delivering up to 70% higher throughput.

Key Points

  • AWS and the llm-d team are launching disaggregated inference capabilities on AWS, introducing a new container image that bundles AWS-specific libraries such as Elastic Fabric Adapter (EFA) support and libfabric, enabling multi-node disaggregated inference and expert parallelism on Amazon SageMaker HyperPod and Amazon EKS.
  • llm-d is an open source, Kubernetes-native framework built on vLLM that separates the compute-bound prefill phase and memory-bound decode phase of LLM inference across distributed GPU resources, enabling cache-aware routing, tiered prefix caching, and wide expert parallelism for large MoE models.
  • Benchmarking results show that llm-d's prefill/decode disaggregation path increases tokens per second by up to 70% compared to a standard vLLM deployment at high concurrency, with performance gains driven by NIXL-powered KV cache transfers over EFA networking on AWS instances.
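The prefill/decode split described above can be illustrated with a minimal sketch. This is not llm-d's or vLLM's actual API; it is a toy Python model of why the two phases have different resource profiles: prefill makes one large compute-bound pass over the whole prompt and emits a KV cache, while decode generates one token per step and must re-read the entire cache each time, making it memory-bandwidth-bound. In a disaggregated deployment the two functions would run on separate worker pools, with the cache shipped between them (on AWS, via NIXL over EFA); here they simply run in sequence.

```python
def prefill(prompt_tokens):
    """Compute-bound phase: one big pass over the full prompt.

    Returns a stand-in KV cache with one (key, value) entry per token.
    """
    return [(t, t) for t in prompt_tokens]


def decode(kv_cache, max_new_tokens):
    """Memory-bound phase: each step re-reads the whole cache.

    The 'attention' here is a placeholder reduction over all cached keys;
    the point is the access pattern, not the math.
    """
    generated = []
    for _ in range(max_new_tokens):
        next_token = sum(k for k, _ in kv_cache) % 50257  # toy vocab size
        generated.append(next_token)
        kv_cache.append((next_token, next_token))  # cache grows each step
    return generated


# In a disaggregated setup, prefill() and decode() live on different GPU
# pools and the cache is transferred between them; here we chain them locally.
cache = prefill([101, 102, 103])
print(decode(cache, 4))  # → [306, 612, 1224, 2448]
```

Separating the phases lets an operator scale and schedule each pool independently, which is where the reported high-concurrency throughput gains come from.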
