Show HN: Running Gemma-4 26B at 124 tokens/sec on a CPU, no GPU

Category: ai-ml

Tags: llm-inference, cpu-inference, gemma-4, mixture-of-experts, quantization, speculative-decoding, benchmarking

Score: 7.5/10 (Innovation: 7, Technical: 8, Documentation: 8, Utility: 7)

This project provides scripts and benchmarks for running Google's Gemma-4 26B mixture-of-experts model on a CPU, with no GPU, reaching up to 124 tokens/sec through batching and speculative decoding. Its most interesting finding is that the output head, not the experts, is the primary memory bottleneck. That challenges a common assumption about mixture-of-experts inference and offers practical optimization guidance for deploying large language models without GPUs.
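To make the bottleneck claim concrete, here is a minimal back-of-envelope sketch of how memory traffic bounds decode speed. It is not taken from the project: every constant below (vocabulary size, hidden dimension, active expert parameters, bit widths, memory bandwidth) is a hypothetical placeholder, not a Gemma-4 or benchmark figure. The only assumption about the technique is the standard one that bandwidth-bound decoding can go no faster than memory bandwidth divided by the bytes streamed per step.

```python
def bytes_read_per_step(params: int, bits_per_weight: float) -> float:
    """Bytes streamed from memory to apply `params` weights once."""
    return params * bits_per_weight / 8


def decode_tokens_per_sec(bandwidth_bytes_per_sec: float,
                          bytes_per_step: float,
                          tokens_per_step: int = 1) -> float:
    """Upper bound on throughput when decoding is memory-bandwidth bound.

    `tokens_per_step` models batching or accepted speculative tokens and
    assumes every token in the step reuses the same weights. That holds
    for a dense output head, but not necessarily for routed experts.
    """
    return bandwidth_bytes_per_sec / bytes_per_step * tokens_per_step


# Hypothetical values, chosen only to make the arithmetic easy to follow.
VOCAB_SIZE = 200_000                  # hypothetical vocabulary size
HIDDEN_DIM = 2_000                    # hypothetical hidden dimension
ACTIVE_EXPERT_PARAMS = 1_000_000_000  # hypothetical expert weights used per token
BANDWIDTH = 100e9                     # hypothetical memory bandwidth, bytes/sec

# Output head kept at 16 bits per weight; experts quantized to 4 bits.
head_bytes = bytes_read_per_step(VOCAB_SIZE * HIDDEN_DIM, bits_per_weight=16)
expert_bytes = bytes_read_per_step(ACTIVE_EXPERT_PARAMS, bits_per_weight=4)
total_bytes = head_bytes + expert_bytes

head_share = head_bytes / total_bytes
single_stream = decode_tokens_per_sec(BANDWIDTH, total_bytes)

if __name__ == "__main__":
    print(f"output head share of bytes per step: {head_share:.1%}")
    print(f"single-stream upper bound: {single_stream:.1f} tokens/sec")
```

With these placeholder numbers, a large unquantized output head outweighs aggressively quantized experts in bytes read per step, which is the kind of situation the summary describes. Real proportions depend on the actual model shape and quantization scheme.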

Target audience: AI researchers, ML engineers, backend developers

Repository: https://apeg.dev/writing/running-gemma4-26b-on-a-cpu/ · Shell · NOASSERTION · 1 star
