Show HN: OS Megakernel that matches M5 Max Tok/W at 2x the Throughput on RTX 3090

Category: ai-ml

Tags: llm-inference, cuda-optimization, kernel-fusion

Score: 7.3/10 (Innovation: 8, Technical: 9, Documentation: 7, Utility: 5)

Luce Megakernel is a single fused CUDA kernel that processes all 24 layers of Qwen 3.5-0.8B in one dispatch, eliminating inter-layer kernel launch overhead. It demonstrates that architecture-specific optimization can let an RTX 3090, a GPU released in 2020, match the Apple M5 Max's energy efficiency (tokens per watt) while delivering 2x the throughput. It targets hybrid DeltaNet/Attention architectures, an emerging pattern in next-generation models.
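The two claims above can be sketched as simple arithmetic. This is a rough illustration, not the project's benchmark method: the function names are hypothetical, every number is a placeholder rather than a measurement, and the model assumes one kernel launch per layer for the unfused case (real frameworks may issue more).

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency: tok/s divided by W gives tokens per joule."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tokens_per_second / watts


def launch_overhead_us(launches_per_token: int, overhead_per_launch_us: float) -> float:
    """Total launch overhead per generated token, in microseconds."""
    return launches_per_token * overhead_per_launch_us


# Placeholder figures (assumptions, not measured values).
NUM_LAYERS = 24           # layer count stated in the summary above
OVERHEAD_US = 5.0         # assumed cost of one kernel launch

# Unfused: at least one launch per layer. Fused: one dispatch for all layers.
unfused = launch_overhead_us(NUM_LAYERS, OVERHEAD_US)
fused = launch_overhead_us(1, OVERHEAD_US)

# Equal efficiency at 2x throughput implies 2x the power draw.
device_a = tokens_per_joule(tokens_per_second=100.0, watts=50.0)
device_b = tokens_per_joule(tokens_per_second=200.0, watts=100.0)

print(f"launch overhead per token: unfused={unfused} us, fused={fused} us")
print(f"tokens per joule: A={device_a}, B={device_b}")
```

Under this model, matching tokens per watt while doubling throughput means the faster device draws roughly twice the power, so the comparison is about efficiency parity and not lower absolute consumption.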

Target audience: ai-engineers, ml-researchers, gpu-developers

Repository: https://github.com/Luce-Org/luce-megakernel · CUDA · 2 stars
