Show HN: Running Gemma-4 26B at 124 tokens/sec on a CPU, no GPU
Category: ai-ml
Tags: llm-inference, cpu-inference, gemma-4, mixture-of-experts, quantization, speculative-decoding, benchmarking
Score: 7.5/10 (Innovation: 7, Technical: 8, Documentation: 8, Utility: 7)
This project provides scripts and benchmarks for running Google's Gemma-4 26B mixture-of-experts model efficiently on a CPU, reaching up to 124 tokens/sec through batching and speculative decoding. Its most notable finding is that the output head, not the experts, is the primary memory bottleneck. That challenges a common assumption about mixture-of-experts models and offers practical optimization guidance for deploying large language models without GPUs.
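To illustrate why a dense output head can dominate memory traffic, here is a minimal back-of-the-envelope sketch. It assumes decoding is limited purely by memory bandwidth, so that each decode step must stream the full head matrix from RAM once. Every number below (vocabulary size, hidden size, weight precision, bandwidth, tokens per step) is a hypothetical placeholder, not a value taken from the project or from Gemma-4, and the helper names are made up for this example.

```python
def matrix_bytes(rows: int, cols: int, bits_per_weight: float) -> float:
    """Bytes needed to stream a rows x cols weight matrix once."""
    return rows * cols * bits_per_weight / 8


def bandwidth_bound_tokens_per_sec(
    bytes_per_step: float,
    bandwidth_bytes_per_sec: float,
    tokens_per_step: float = 1.0,
) -> float:
    """Upper bound on throughput if a decode step is purely bandwidth-bound.

    tokens_per_step > 1 models batching or accepted speculative tokens,
    where one pass over the weights yields several output tokens.
    """
    return tokens_per_step * bandwidth_bytes_per_sec / bytes_per_step


# Hypothetical placeholders, NOT measured or published values.
VOCAB_SIZE = 256_000      # assumed vocabulary size
HIDDEN_SIZE = 4_096       # assumed model width
HEAD_BITS = 8             # assumed output-head precision
BANDWIDTH = 100e9         # assumed 100 GB/s of memory bandwidth

head_bytes = matrix_bytes(VOCAB_SIZE, HIDDEN_SIZE, HEAD_BITS)
single = bandwidth_bound_tokens_per_sec(head_bytes, BANDWIDTH)
amortized = bandwidth_bound_tokens_per_sec(head_bytes, BANDWIDTH, tokens_per_step=4)

print(f"output head: {head_bytes / 1e9:.2f} GB per step")
print(f"head-only bound, 1 token/step: {single:.1f} tokens/sec")
print(f"head-only bound, 4 tokens/step: {amortized:.1f} tokens/sec")
```

Under these assumptions the head is read in full on every step regardless of which experts are routed to, which is why amortizing one pass over several tokens (batching, or speculative decoding with accepted drafts) raises the bound proportionally.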
Target audience: AI researchers, ML engineers, backend developers
Repository: https://apeg.dev/writing/running-gemma4-26b-on-a-cpu/ · Shell · NOASSERTION · 1 star