Wafer-Scale AI Chips Achieve 2,700 Tokens Per Second, 10x Faster Than Traditional GPU Systems

Oct 26, 2025
ACM SIGOPS

Summary

Wafer-scale AI chips deliver 2,700 tokens per second, 10 times the throughput of traditional GPU systems, by integrating hundreds of thousands of cores with massive on-chip memory on a single wafer. Guided by the new PLMR model, the WaferLLM system achieves sub-millisecond per-token inference latency.

Key Points

  • Wafer-scale AI chips integrate hundreds of thousands of cores and massive on-chip memory on a single wafer, offering 100 to 1,000 times the memory bandwidth and communication efficiency of traditional multi-chip systems
  • Researchers introduce the PLMR model (Parallelism, Latency, Memory, Routing) to capture the key constraints of wafer-scale computing, including highly non-uniform memory access and constrained routing resources, which current AI software stacks do not handle effectively (a rough sketch of the four axes follows this list)
  • The WaferLLM system demonstrates sub-millisecond per-token inference latency on wafer-scale hardware, achieving 2,700 tokens/s versus 260 tokens/s on an 8-GPU system, enabling efficient test-time scaling for AI applications (the arithmetic is checked below)
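
The article does not spell out how the PLMR model is formalized, so the following is only a minimal Python sketch of the four axes as a device description. Every field name, value, and method here is an illustrative assumption, not the paper's actual abstraction; the core count and local-memory figures loosely echo publicly described wafer-scale parts.

```python
from dataclasses import dataclass

@dataclass
class PLMRDevice:
    """Hypothetical description of a wafer-scale device along the four
    PLMR axes; names and numbers are assumptions for illustration."""
    cores: int              # P: degree of on-wafer parallelism
    hop_latency_ns: float   # L: per-hop latency of the on-wafer mesh (non-uniform access)
    local_mem_kib: int      # M: per-core local memory budget
    links_per_core: int     # R: routing resources available at each core

    def hops_within(self, budget_ns: float) -> int:
        # How many mesh hops fit into a latency budget: a stand-in for why
        # placement must keep communicating cores physically close.
        return int(budget_ns // self.hop_latency_ns)

# Illustrative values only: hundreds of thousands of cores, tiny local memory.
wafer = PLMRDevice(cores=850_000, hop_latency_ns=10.0,
                   local_mem_kib=48, links_per_core=5)
print(wafer.hops_within(1_000))  # hops affordable within a 1-microsecond budget
```

The point of the sketch is that all four quantities constrain a layout at once, which is what the second key point means by challenges that current AI software stacks cannot handle effectively.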
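The headline figures can be sanity-checked with simple arithmetic; only the two throughput numbers below come from the article, and the derived speedup and latency are straightforward division, not additional measured results.

```python
wafer_tps = 2700   # WaferLLM on wafer-scale hardware, tokens/s (from article)
gpu_tps = 260      # 8-GPU baseline, tokens/s (from article)

speedup = wafer_tps / gpu_tps      # ~10.4x, consistent with the "10x" claim
latency_ms = 1000 / wafer_tps      # ~0.37 ms, i.e. sub-millisecond per token

print(f"speedup: {speedup:.1f}x")                 # speedup: 10.4x
print(f"per-token latency: {latency_ms:.2f} ms")  # per-token latency: 0.37 ms
```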
