Show HN: Caliper – pass@k reliability testing for Claude Code and Codex skills
Category: devtools
Tags: reliability-testing, ai-agents, evaluation-framework
Score: 6.8/10 (Innovation: 6, Technical: 6, Documentation: 8, Utility: 7)
Caliper is a reliability-testing framework for AI coding-agent skills. It supports pass@k evaluation and baseline comparison across backends such as Claude Code and Codex. By combining LLM-based judging, deterministic assertions, and a structured spec format, it fills a practical gap for developers building and iterating on agent skills.
Target audience: backend devs, DevOps, data engineers, AI engineers
Repository: https://github.com/edonadei/caliper · Python · 13 stars
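
For readers unfamiliar with the pass@k metric named in the description, here is a minimal sketch of the commonly used unbiased estimator: given n recorded runs of which c passed, it estimates the chance that at least one of k runs drawn without replacement passes. This is not taken from Caliper's code; the function name `pass_at_k` and the run counts in the example are hypothetical, and Caliper may compute or report the metric differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n total runs, c of which passed.

    Hypothetical helper for illustration; not part of Caliper's API.
    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    subset of k runs contains at least one passing run.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k runs include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Made-up counts: a skill passing 3 of 10 runs vs. a baseline passing 1 of 10.
    for k in (1, 5):
        with_skill = pass_at_k(10, 3, k)
        baseline = pass_at_k(10, 1, k)
        print(f"k={k}: skill={with_skill:.3f} baseline={baseline:.3f}")
```

Comparing the estimate for a skill-enabled run set against a baseline run set at the same k is one simple way to read a "baseline comparison"; how Caliper actually reports it is best checked in the repository.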