NVIDIA Unveils Open RDMA Protocol With AMD, Microsoft, and OpenAI to Supercharge Gigascale AI Networking
Summary
NVIDIA, alongside AMD, Microsoft, and OpenAI, has unveiled Multipath Reliable Connection (MRC), an open RDMA transport protocol for gigascale AI networking. MRC distributes the traffic of a single connection across multiple network paths, with the aim of making large-scale AI training clusters faster and more resilient.
Key Points
- NVIDIA Spectrum-X Ethernet is emerging as a standard for gigascale AI networking: OpenAI, Microsoft, and Oracle are deploying it to power some of the world's largest AI training clusters.
- Multipath Reliable Connection (MRC), a new RDMA transport protocol, is being released as an open specification through the Open Compute Project. It lets a single RDMA connection distribute traffic across multiple network paths, improving throughput, load balancing, and resilience.
- MRC, developed in collaboration with AMD, Broadcom, Intel, Microsoft, and OpenAI, uses hardware-speed failure detection and intelligent retransmission to keep thousands of GPUs synchronized, minimizing disruptions and maximizing efficiency during large-scale AI training runs.
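
To make the multipath idea in the points above more concrete, here is a minimal Python sketch of one connection spraying packets across several paths, dropping a path once it fails, and retransmitting the lost packet on a healthy path. It is a toy model, not the MRC specification: the function name `send_multipath`, its parameters, the round-robin spraying policy, and the instant loss detection are all assumptions made for illustration.

```python
from collections import deque


def send_multipath(num_packets, paths, failed_path=None, fail_at=None):
    """Toy model of one connection spraying packets over several paths.

    Hypothetical parameters (not from the MRC specification):
      num_packets  number of sequence-numbered packets to send
      paths        list of path names
      failed_path  name of a path that goes down during the transfer
      fail_at      number of send attempts after which failed_path is down

    Returns (in_order, delivered, retransmitted):
      in_order       sequence numbers as the receiver would reassemble them
      delivered      dict mapping sequence number -> path that carried it
      retransmitted  sequence numbers that had to be sent a second time
    """
    healthy = list(paths)
    queue = deque(range(num_packets))
    delivered = {}
    retransmitted = []
    sent = 0  # total send attempts, including ones that are lost

    while queue:
        if not healthy:
            raise RuntimeError("no healthy paths left")
        seq = queue.popleft()
        # Assumed policy: round-robin over the paths that are still healthy.
        path = healthy[sent % len(healthy)]
        sent += 1

        path_is_down = (
            path == failed_path and fail_at is not None and sent > fail_at
        )
        if path_is_down:
            # Stand-in for fast failure detection: stop using the path
            # and put the lost packet back at the front of the queue.
            healthy.remove(path)
            retransmitted.append(seq)
            queue.appendleft(seq)
            continue

        delivered[seq] = path

    # Packets may arrive over different paths; the receiver orders by sequence number.
    in_order = sorted(delivered)
    return in_order, delivered, retransmitted


if __name__ == "__main__":
    in_order, delivered, retx = send_multipath(
        8, ["p0", "p1", "p2", "p3"], failed_path="p1", fail_at=3
    )
    print("received in order:", in_order)
    print("carried by:", delivered)
    print("retransmitted:", retx)
```

In this sketch, path `p1` carries one packet before it fails. The next packet assigned to it is detected as lost and resent on `p0`, and the transfer completes on the three remaining paths without restarting the connection. A real transport would also have to handle acknowledgments, congestion control, and out-of-order placement, which this toy model leaves out.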