Show HN: We built an LLM inference engine in pure Python – no PyTorch, no Triton
Category: infrastructure
Tags: llm-inference, gpu-kernels, zero-dependency
Score: 8.3/10 (Innovation: 8, Technical: 9, Documentation: 8, Utility: 8)
ZSE is a zero-dependency LLM inference engine written in pure Python, with its own kernel compiler targeting CUDA, ROCm, and Metal. The project claims faster cold starts and lower memory usage than vLLM. It also ships built-in RAG, LoRA hot-swap, and a full server stack, positioning it as an alternative for GPU serving without a PyTorch or Triton dependency.
Target audience: backend devs, devops, ml-engineers
Repository: https://github.com/Zyora-Dev/zse/releases/tag/v2.0.0 · Python · 151 stars