Show HN: Best setup local LLM found for a 5090
Category: ai-ml
Tags: llm, inference, llama.cpp
Score: 7.3/10 (Innovation: 7, Technical: 9, Documentation: 8, Utility: 6)
This project provides a detailed guide and configuration for running a 35B Mixture-of-Experts LLM (Qwen 3.6) with a 450k-token context window on a single 32GB GPU using llama.cpp, TurboQuant, and YaRN scaling. It is notable for pushing local LLM inference on consumer hardware to an extreme context length within a tight memory budget, at an acknowledged cost in retrieval accuracy.
Target audience: backend devs, data engineers
Repository: https://local-llm.utop.workers.dev/
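The memory trade-off described in the summary can be sketched with some back-of-the-envelope arithmetic. The Python below is a rough illustration only, not the project's own method: it estimates KV-cache size with the standard formula (keys plus values, per attention layer, per KV head, per token) and computes the YaRN scale factor as target context divided by native context. Every model parameter in the example (layer count, KV heads, head dimension, native context, bit widths) is a hypothetical placeholder, not a figure taken from the guide or from Qwen's published configuration.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_element: float) -> float:
    """Approximate KV-cache size in bytes.

    Counts one key and one value vector per token, per KV head,
    per attention layer, and ignores any per-block quantization overhead.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * bits_per_element / 8


def yarn_scale(target_ctx: int, native_ctx: int) -> float:
    """Context-extension factor: how far YaRN must stretch the native window."""
    return target_ctx / native_ctx


if __name__ == "__main__":
    # All values below are hypothetical placeholders for illustration.
    target_ctx = 450_000   # context length named in the summary
    native_ctx = 262_144   # assumed native training context
    n_layers = 10          # assumed number of full-attention layers
    n_kv_heads = 2         # assumed KV heads (grouped-query attention)
    head_dim = 256         # assumed per-head dimension

    print(f"YaRN scale factor: {yarn_scale(target_ctx, native_ctx):.2f}")
    for label, bits in [("16-bit cache", 16), ("4-bit cache", 4)]:
        size = kv_cache_bytes(target_ctx, n_layers, n_kv_heads, head_dim, bits)
        print(f"{label}: {size / 2**30:.2f} GiB")
```

The point of the sketch is the scaling behaviour: cache size grows linearly with context length and with bits per element, so quantizing the cache is what frees room for a longer window next to the model weights on a 32GB card.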