Show HN: Verdict – model evals on your own data, not someone else's benchmark
Category: devtools
Tags: llm-evaluation, benchmarking, python, ai-ml, devtools
Score: 5.8/10 (Innovation: 5, Technical: 6, Documentation: 7, Utility: 5)
Verdict is a Python framework for benchmarking LLMs against user-provided datasets, with pluggable metrics and support for multiple model providers. It gives teams a practical way to evaluate models on their own data and track improvements over time, but the project is early-stage and has seen little adoption so far.
Target audience: backend developers, data engineers, AI researchers
Repository: https://github.com/aevyraai/verdict · Python · License: NOASSERTION · 2 stars
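
To make "pluggable metrics" and "multiple model providers" in the summary above concrete, here is a minimal sketch of that general pattern in plain Python. It does not use Verdict's actual API, which this listing does not describe. Every name in it (`Example`, `evaluate`, `exact_match`, `token_overlap`, and the stub models) is hypothetical, and the stub functions stand in for real provider calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type aliases; none of these names come from Verdict itself.
Model = Callable[[str], str]          # prompt -> completion
Metric = Callable[[str, str], float]  # (prediction, reference) -> score in [0, 1]


@dataclass
class Example:
    prompt: str
    reference: str


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def token_overlap(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the prediction."""
    pred = set(prediction.lower().split())
    ref = set(reference.lower().split())
    if not ref:
        return 0.0
    return len(pred & ref) / len(ref)


def evaluate(
    models: Dict[str, Model],
    metrics: Dict[str, Metric],
    dataset: List[Example],
) -> Dict[str, Dict[str, float]]:
    """Run every model over the dataset and average each metric per model."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    results: Dict[str, Dict[str, float]] = {}
    for model_name, model in models.items():
        predictions = [model(ex.prompt) for ex in dataset]
        results[model_name] = {
            metric_name: sum(
                metric(pred, ex.reference) for pred, ex in zip(predictions, dataset)
            ) / len(dataset)
            for metric_name, metric in metrics.items()
        }
    return results


# Stub "providers": a real setup would call an LLM API here instead.
def echo_model(prompt: str) -> str:
    return prompt


def lookup_model(prompt: str) -> str:
    answers = {"2+2=": "4", "capital of France?": "Paris"}
    return answers.get(prompt, "")


dataset = [Example("2+2=", "4"), Example("capital of France?", "Paris")]
scores = evaluate(
    models={"echo": echo_model, "lookup": lookup_model},
    metrics={"exact_match": exact_match, "token_overlap": token_overlap},
    dataset=dataset,
)
print(scores)
```

Treating models and metrics as plain callables keyed by name means a new provider or a new scoring function can be added without changing the evaluation loop, which is the property the summary attributes to the framework.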