Netflix Engineers Uncover Linux Kernel Lock Bug Causing Container Stalls Lasting Tens of Seconds
Summary
Netflix engineers traced container stalls lasting tens of seconds to contention on a global mount lock in the Linux kernel, triggered by thousands of concurrent bind mount operations. Hardware architecture plays a critical role in how severe the stalls become, and mitigations are already underway.
Key Points
- Netflix engineers found that container scaling bottlenecks on modern CPUs trace deep into the Linux kernel itself: thousands of concurrent bind mount operations trigger severe contention on a global mount lock in the virtual filesystem (VFS) layer, causing node stalls lasting tens of seconds.
- Hardware architecture proved critical to the problem: older dual-socket, NUMA-based AWS instances suffer dramatically worse lock contention than newer single-socket Intel and AMD instances, and disabling hyperthreading alone improved latency by up to 30% in some configurations.
- Netflix is deploying two key mitigations. The first redesigns overlay filesystems to reduce per-container mount operations from O(n) to O(1), where n is the number of layers, by grouping layers under a common parent. The second routes workloads to CPU architectures that handle global locks more gracefully. Together they underscore the need for hardware-software co-design at scale.
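
To make the O(n) to O(1) reduction concrete, here is a minimal sketch that only counts bind mount operations. It is a toy model, not Netflix's implementation: the function names, the assumption of exactly one bind mount per layer in the baseline, and the burst size are all hypothetical and chosen for illustration.

```python
"""Toy count of bind mounts needed to start containers (illustrative only)."""


def bind_mounts_per_layer(num_layers: int) -> int:
    # Assumed baseline: one bind mount for every image layer, so O(n).
    return num_layers


def bind_mounts_grouped(num_layers: int) -> int:
    # Assumed mitigation: one bind mount of the layers' common parent, so O(1).
    return 1 if num_layers > 0 else 0


def total_bind_mounts(num_containers: int, num_layers: int, strategy) -> int:
    # In this model, every counted operation contends for the same global lock.
    return num_containers * strategy(num_layers)


if __name__ == "__main__":
    containers, layers = 100, 50  # hypothetical burst, not a figure from Netflix
    print(total_bind_mounts(containers, layers, bind_mounts_per_layer))  # 5000
    print(total_bind_mounts(containers, layers, bind_mounts_grouped))    # 100
```

Under these assumptions, the grouped layout makes the number of lock acquisitions independent of image depth, so the pressure on the global mount lock grows only with the number of containers starting at once.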