Show HN: Caliper – pass@k reliability testing for Claude Code and Codex skills

Category: devtools

Tags: reliability-testing, ai-agents, evaluation-framework

Score: 6.8/10 (Innovation: 6, Technical: 6, Documentation: 8, Utility: 7)

Caliper is a reliability testing framework for AI coding agent skills. It supports pass@k evaluation and baseline comparison across backends such as Claude Code and Codex. Its combination of LLM-based judging, deterministic assertions, and a structured spec format fills a practical gap for developers building and iterating on agent skills.
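
For readers unfamiliar with pass@k, the sketch below shows the standard unbiased estimator used in code-generation evaluation: given n recorded runs of which c passed, it estimates the chance that at least one of k runs passes. This illustrates the metric only. Whether Caliper computes it this way is an assumption, and the function name `pass_at_k` is hypothetical, not part of Caliper's API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded runs, c of which passed.

    Hypothetical helper for illustration; not Caliper's API.
    Uses the standard estimator 1 - C(n - c, k) / C(n, k): one minus the
    probability that k runs drawn without replacement all failed.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k exceeds the number of failures (n - c),
    # so the estimate is exactly 1.0 whenever fewer than k runs failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 runs of a skill, 3 passed.
    print(f"pass@1 = {pass_at_k(10, 3, 1):.2f}")  # about 0.30
    print(f"pass@5 = {pass_at_k(10, 3, 5):.2f}")  # about 0.92
```

A skill that passes 3 of 10 runs looks weak at k=1 but strong at k=5, which is why reporting the value of k alongside the score matters when comparing against a baseline.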

Target audience: backend devs, DevOps, data engineers, AI engineers

Repository: https://github.com/edonadei/caliper · Python · 13 stars
