397-Billion Parameter AI Model Runs on MacBook Pro With 48GB RAM at 4.4 Tokens Per Second Using Custom C/Metal Engine
Summary
A custom C/Metal inference engine called Flash-MoE is now running a massive 397-billion parameter AI model on a standard MacBook Pro with 48GB RAM, streaming 209GB directly from SSD at 4.4 tokens per second — with 58 documented experiments revealing that Apple Silicon's unified memory architecture defies conventional optimization wisdom.
Key Points
- A custom C/Metal inference engine called Flash-MoE successfully runs the 397-billion parameter Qwen3.5-397B-A17B model on a MacBook Pro with just 48GB of RAM, achieving 4.4+ tokens per second with full tool-calling support by streaming the 209GB model directly from SSD.
- Key performance wins include an FMA-optimized GPU dequantization kernel (a 12% speedup), Accelerate BLAS for the linear-attention layers (a 64% improvement), and a 'Trust the OS' caching strategy: simply relying on the OS page cache, which achieves a roughly 71% expert hit rate on its own, beat every custom caching scheme tested.
- Built entirely in C, Objective-C, and hand-tuned Metal shaders with no Python frameworks, the project documents 58 experiments revealing that many intuitive optimizations like LZ4 compression, prefetching, and speculative routing actually hurt performance due to unified memory architecture constraints on Apple Silicon.