Show HN: OS Megakernel that matches M5 Max Tok/W at 2x the Throughput on RTX 3090
Category: ai-ml
Tags: llm-inference, cuda-optimization, kernel-fusion
Score: 7.3/10 (Innovation: 8, Technical: 9, Documentation: 7, Utility: 5)
Luce Megakernel is a single fused CUDA kernel that runs all 24 layers of Qwen 3.5-0.8B in one dispatch, eliminating the kernel launch overhead between layers. It shows that architecture-specific optimization can let a 2020-era RTX 3090 match the Apple M5 Max's energy efficiency (tokens per watt) while delivering 2x the throughput. It is notable for targeting hybrid DeltaNet/Attention architectures, an emerging pattern in next-generation models.
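To make the two claims above concrete, here is a minimal arithmetic sketch in Python. It is not taken from the repository: the function names, the count of 5 kernels per layer, and the throughput and power figures are hypothetical placeholders chosen only to illustrate the relationships. It shows that fusing all layers collapses the per-token dispatch count to one, and that matching tokens per watt at 2x the throughput implies roughly 2x the power draw.

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency: (tok/s) / W, which is tokens per joule."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tokens_per_second / watts


def launches_per_token(num_layers: int, kernels_per_layer: int, fused: bool) -> int:
    """Kernel dispatches needed for one decode step.

    A fully fused megakernel issues a single dispatch; an unfused
    pipeline issues one dispatch per kernel in every layer.
    """
    return 1 if fused else num_layers * kernels_per_layer


# Hypothetical placeholder figures, NOT measurements from the project.
baseline_tps, baseline_watts = 100.0, 50.0
fused_tps = 2 * baseline_tps        # "2x the throughput"
fused_watts = 2 * baseline_watts    # power draw that keeps tok/W equal

# Equal efficiency at double throughput means double power.
assert tokens_per_joule(fused_tps, fused_watts) == tokens_per_joule(
    baseline_tps, baseline_watts
)

# 24 layers (from the summary) times an assumed 5 kernels per layer.
print(launches_per_token(24, 5, fused=False))
print(launches_per_token(24, 5, fused=True))
```

The per-launch cost itself depends on the driver and hardware, so the sketch counts dispatches rather than estimating time saved.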
Target audience: ai-engineers, ml-researchers, gpu-developers
Repository: https://github.com/Luce-Org/luce-megakernel · CUDA · 2 stars