Show HN: We built an LLM inference engine in pure Python – no PyTorch, no Triton

Category: infrastructure

Tags: llm-inference, gpu-kernels, zero-dependency

Score: 8.3/10 (Innovation: 8, Technical: 9, Documentation: 8, Utility: 8)

ZSE is a zero-dependency LLM inference engine written in pure Python. It ships its own kernel compiler targeting CUDA, ROCm, and Metal, and reports faster cold starts and lower memory usage than vLLM. It also bundles RAG, LoRA hot-swap, and a full server stack, positioning it as an alternative for GPU serving that does not depend on PyTorch or Triton.

Target audience: backend devs, devops, ml-engineers

Repository: https://github.com/Zyora-Dev/zse/releases/tag/v2.0.0 · Python · 151 stars
