Show HN: GLQ, LLM quantization using the E8 lattice

Category: ai-ml

Tags: llm-quantization, e8-lattice, cuda-kernel, model-compression

Score: 8.3/10 (Innovation: 8, Technical: 9, Documentation: 8, Utility: 8)

GLQ is a post-training quantization method for large language models. It uses E8 lattice codebooks to compress weights to 2 to 8 bits, and the project reports no significant quality loss. It combines a Randomized Hadamard Transform, LDLQ error feedback, and fused CUDA kernels to reach inference throughput close to bf16, which makes it a practical and innovative approach for deploying LLMs efficiently.
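The core idea (rotate weights with a randomized Hadamard transform, then snap each 8-dimensional block to the nearest E8 lattice point) can be sketched in pure Python. This is not GLQ's code. It is a minimal illustration under stated assumptions: it uses the unbounded E8 lattice scaled by a hypothetical `step` parameter (a real 2 to 8 bit codebook would use a finite subset of E8), it omits LDLQ error feedback and the CUDA kernels, and all function names are hypothetical. The nearest-point search uses the standard Conway and Sloane decomposition E8 = D8 ∪ (D8 + ½).

```python
import math
import random


def fwht(x):
    """Fast Walsh-Hadamard transform, normalized so the matrix is orthonormal."""
    n = len(x)
    assert n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def random_signs(n, seed):
    """Random +/-1 diagonal used by the randomized Hadamard transform."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rht(x, signs):
    """Randomized Hadamard transform: H @ diag(signs) @ x."""
    return fwht([s * v for s, v in zip(signs, x)])


def inverse_rht(y, signs):
    """Inverse RHT: diag(signs) @ H @ y (normalized H is symmetric and orthogonal)."""
    return [s * v for s, v in zip(signs, fwht(y))]


def nearest_dn(x):
    """Nearest point of D_n (integer vectors with an even coordinate sum)."""
    r = [round(v) for v in x]
    if sum(r) % 2 != 0:
        # Re-round the coordinate with the largest rounding error the other way.
        k = max(range(len(x)), key=lambda i: abs(x[i] - r[i]))
        r[k] += 1 if x[k] > r[k] else -1
    return r


def nearest_e8(x):
    """Nearest E8 point: check D8 and the coset D8 + 1/2, keep the closer one."""
    a = [float(v) for v in nearest_dn(x)]
    b = [v + 0.5 for v in nearest_dn([v - 0.5 for v in x])]
    da = sum((p - q) ** 2 for p, q in zip(x, a))
    db = sum((p - q) ** 2 for p, q in zip(x, b))
    return a if da <= db else b


def quantize_e8(w, step, seed=0):
    """Hypothetical sketch: RHT, snap 8-dim blocks to step * E8, undo the RHT."""
    signs = random_signs(len(w), seed)
    z = rht(w, signs)
    q = []
    for i in range(0, len(z), 8):
        block = [v / step for v in z[i:i + 8]]
        q.extend(step * v for v in nearest_e8(block))
    return inverse_rht(q, signs)


if __name__ == "__main__":
    rng = random.Random(1)
    w = [rng.gauss(0.0, 1.0) for _ in range(64)]
    w_hat = quantize_e8(w, step=0.5)
    print(sum((a - b) ** 2 for a, b in zip(w, w_hat)))
```

Because the normalized Hadamard matrix is orthogonal, the rotation does not change the total squared error, and since the covering radius of E8 is 1 in these coordinates, the error of `w_hat` is at most `len(w) / 8 * step**2`.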

Target audience: machine learning engineers, AI researchers, backend devs

Repository: https://github.com/cnygaard/glq · Python · Apache-2.0 · 3 stars
