Show HN: Best setup local LLM found for a 5090

Category: ai-ml

Tags: llm, inference, llama.cpp

Score: 7.3/10 (Innovation: 7, Technical: 9, Documentation: 8, Utility: 6)

This project provides a detailed guide and configuration for running a 35B Mixture of Experts LLM (Qwen 3.6) with a 450k-token context window on a single 32 GB GPU using llama.cpp, TurboQuant, and YaRN scaling. It pushes the limits of local LLM inference on consumer hardware by fitting an extreme context length into a tight memory budget, though with acknowledged trade-offs in retrieval accuracy.
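
To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the project. The layer count, KV head count, head dimension, and native context length below are hypothetical placeholders rather than the real Qwen figures, and the formula assumes every layer keeps a standard attention KV cache, which may not hold for this model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Approximate KV cache size in bytes.

    Each token stores one K and one V vector per layer, each of
    n_kv_heads * head_dim values, at bits_per_value bits apiece.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return values * bits_per_value / 8


def yarn_scale_factor(target_ctx, native_ctx):
    """Context extension factor: target length over the trained length."""
    return target_ctx / native_ctx


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(n_layers=40, n_kv_heads=4, head_dim=128, n_ctx=450_000)
    gib = 1024 ** 3

    for bits in (16, 8, 4):
        size = kv_cache_bytes(bits_per_value=bits, **shape)
        print(f"{bits:>2}-bit KV cache: {size / gib:.1f} GiB")

    # Hypothetical native context length of 150k tokens.
    print("scale factor:", yarn_scale_factor(450_000, 150_000))
```

Under these assumed numbers, a 16-bit cache alone would exceed 32 GB at 450k tokens, which is why the cache has to be quantized aggressively before a context this long can sit next to the model weights on one card.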

Target audience: backend devs, data engineers

Repository: https://local-llm.utop.workers.dev/
