Show HN: HermesBench – workflow reliability evals for personal AI agents

Category: ai-ml

Tags: benchmark, ai-agent, evaluation-harness

Score: 6.0/10 (Innovation: 6, Technical: 7, Documentation: 6, Utility: 5)

HermesBench is a reliability-first benchmark and evaluation harness for personal AI agent configurations, targeting Hermes Agent setups. It provides 27 workflow recipes across 9 categories to test agent reliability on real-world tasks such as calendar, email, and finance. It separates driver adapters from target adapters, combines deterministic checks with LLM judgment, and emphasizes agent-driven workflows.
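To make the driver/target split and the two-stage scoring more concrete, here is a minimal Python sketch. It is not HermesBench's actual API: every name in it (Recipe, DriverAdapter, TargetAdapter, evaluate, the 0.7 threshold, the stub judge) is a hypothetical chosen for illustration. Running the judge only after the deterministic checks pass is also an assumption, not something the project description states.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Recipe:
    """Hypothetical workflow recipe: a named task with steps and checks."""
    name: str
    category: str
    steps: list[str]
    # Deterministic checks are pure functions over the transcript.
    checks: list[Callable[[list[str]], bool]]


class DriverAdapter(Protocol):
    """Hypothetical interface: whatever issues the workflow's prompts."""
    def prompts(self, recipe: Recipe) -> list[str]: ...


class TargetAdapter(Protocol):
    """Hypothetical interface: the agent configuration under test."""
    def run(self, prompt: str) -> str: ...


# A judge maps (recipe, transcript) to a score in [0, 1]. A real harness
# would call an LLM here; this sketch only uses a stub.
Judge = Callable[[Recipe, list[str]], float]


@dataclass
class Result:
    recipe: str
    deterministic_pass: bool
    judge_score: float
    passed: bool


def evaluate(recipe: Recipe, driver: DriverAdapter, target: TargetAdapter,
             judge: Judge, threshold: float = 0.7) -> Result:
    # The driver decides what to ask; the target produces the responses.
    transcript = [target.run(p) for p in driver.prompts(recipe)]
    deterministic_pass = all(check(transcript) for check in recipe.checks)
    # Assumption: skip the judge when a deterministic check already failed.
    judge_score = judge(recipe, transcript) if deterministic_pass else 0.0
    passed = deterministic_pass and judge_score >= threshold
    return Result(recipe.name, deterministic_pass, judge_score, passed)


class ScriptedDriver:
    """Stub driver that replays the recipe's steps verbatim."""
    def prompts(self, recipe: Recipe) -> list[str]:
        return list(recipe.steps)


class EchoTarget:
    """Stub target standing in for a real agent."""
    def run(self, prompt: str) -> str:
        return f"done: {prompt}"


def stub_judge(recipe: Recipe, transcript: list[str]) -> float:
    """Placeholder for LLM judgment: full marks for any non-empty transcript."""
    return 1.0 if transcript else 0.0


recipe = Recipe(
    name="schedule-meeting",
    category="calendar",
    steps=["Create a 30-minute meeting tomorrow at 10:00"],
    checks=[lambda t: len(t) == 1, lambda t: t[0].startswith("done:")],
)

result = evaluate(recipe, ScriptedDriver(), EchoTarget(), stub_judge)
print(result)
```

Keeping the driver and target behind separate interfaces means the same recipe can, in principle, be replayed against different agent configurations without touching the checks. The deterministic checks catch hard failures cheaply before any model-based scoring is involved.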

Target audience: AI engineers, agent developers

Repository: https://verkyyi.github.io/hermesbench/ · HTML · MIT · 1 star
