Show HN: Verdict – model evals on your own data, not someone else's benchmark

Category: devtools

Tags: llm-evaluation, benchmarking, python, ai-ml, devtools

Score: 5.8/10 (Innovation: 5, Technical: 6, Documentation: 7, Utility: 5)

Verdict is a Python framework for benchmarking LLMs against user-provided datasets, with pluggable metrics and support for multiple model providers. It offers a practical way to evaluate models and track improvements over time, but the project is early-stage and has seen little adoption so far.

Target audience: backend devs, data engineers, AI researchers

Repository: https://github.com/aevyraai/verdict · Python · NOASSERTION · 2 stars
