Show HN: GLQ, LLM quantization using the E8 lattice
Category: ai-ml
Tags: llm-quantization, e8-lattice, cuda-kernel, model-compression
Score: 8.3/10 (Innovation: 8, Technical: 9, Documentation: 8, Utility: 8)
GLQ is a post-training quantization method for large language models that uses E8 lattice codebooks to compress weights to 2 to 8 bits per weight without significant quality loss. It combines a Randomized Hadamard Transform (RHT), LDLQ error feedback, and fused CUDA kernels to reach near-bf16 throughput, which makes it a practical and innovative approach to deploying LLMs efficiently.
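A minimal pure-Python sketch of two building blocks named above: a randomized Hadamard rotation, followed by nearest-point rounding onto the E8 lattice in blocks of 8 weights, using the standard Conway-Sloane decoder (E8 is D8 together with D8 shifted by 1/2 in every coordinate). This is not GLQ's code or API. The function names (`rht`, `nearest_e8`, `quantize_row`), the single per-row `scale`, and the use of the unbounded lattice are illustrative assumptions. A real codebook at a fixed bit width has to be a finite set of lattice points. LDLQ error feedback and the CUDA kernels are omitted.

```python
import math
import random
from typing import List, Sequence


def fwht(v: Sequence[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    a = list(v)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    s = 1.0 / math.sqrt(n)  # scale so the transform is its own inverse
    return [x * s for x in a]


def rht(v: Sequence[float], signs: Sequence[float]) -> List[float]:
    """Randomized Hadamard Transform: flip signs, then apply the orthonormal Hadamard."""
    return fwht([x * s for x, s in zip(v, signs)])


def inverse_rht(v: Sequence[float], signs: Sequence[float]) -> List[float]:
    """Undo rht: the normalized Hadamard and the sign flips are both self-inverse."""
    return [x * s for x, s in zip(fwht(v), signs)]


def nearest_d8(x: Sequence[float]) -> List[float]:
    """Closest integer vector with an even coordinate sum (the D8 lattice)."""
    r = [float(math.floor(t + 0.5)) for t in x]
    if int(sum(r)) % 2 != 0:
        # Re-round the coordinate with the largest rounding error the other way.
        k = max(range(len(x)), key=lambda i: abs(x[i] - r[i]))
        r[k] += 1.0 if x[k] > r[k] else -1.0
    return r


def nearest_e8(x: Sequence[float]) -> List[float]:
    """Closest E8 point: the better of the D8 candidate and the D8 + 1/2 candidate."""
    a = nearest_d8(x)
    b = [t + 0.5 for t in nearest_d8([t - 0.5 for t in x])]
    da = sum((p - q) ** 2 for p, q in zip(x, a))
    db = sum((p - q) ** 2 for p, q in zip(x, b))
    return a if da <= db else b


def quantize_row(w: Sequence[float], scale: float, seed: int = 0) -> List[float]:
    """Rotate a weight row, round each 8-block to scale * E8, rotate back (hypothetical)."""
    n = len(w)
    assert n >= 8 and n & (n - 1) == 0, "row length must be a power of two >= 8"
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    z = rht(w, signs)
    zq: List[float] = []
    for i in range(0, n, 8):
        block = [t / scale for t in z[i:i + 8]]
        zq.extend(scale * t for t in nearest_e8(block))
    return inverse_rht(zq, signs)


if __name__ == "__main__":
    rng = random.Random(0)
    w = [rng.gauss(0.0, 1.0) for _ in range(64)]
    wq = quantize_row(w, scale=0.25)
    print(sum((a - b) ** 2 for a, b in zip(w, wq)))
```

The rotation is orthonormal, so a row's reconstruction error equals the rounding error in the rotated domain. With this scaling the covering radius of E8 is 1, which bounds that error by `(len(w) / 8) * scale**2`. The rotation itself exists to spread outlier weights across the block before rounding.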
Target audience: machine learning engineers, AI researchers, backend devs
Repository: https://github.com/cnygaard/glq · Python · Apache-2.0 · 3 stars