Show HN: Gandalf the Grader

Category: devtools

Tags: agent-judge, verification, evaluation, agent-environments, rubric-grading

Score: 7.5/10 (Innovation: 7, Technical: 8, Documentation: 8, Utility: 7)

Gandalf the Grader is a reactive agent-as-judge for rubric-graded agent environments. It runs inside the same environment as the rollout agent and grades each criterion against stateful artifacts and tool outputs. Its three design choices (environment alignment, reactive verification, and swappable domain guidance) are presented as outperforming traditional verifiers on both cost and accuracy. The project is most relevant to agent evaluation in RL and benchmark settings.
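
To make the idea of grading against stateful artifacts concrete, here is a minimal Python sketch. It is not the project's API: `Criterion`, `grade`, and `demo` are hypothetical names invented for illustration, and the checks are plain file inspections rather than an LLM-backed judge. It only shows the general pattern of evaluating each rubric criterion against what actually exists in the environment instead of what the agent says it did.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Criterion:
    """One rubric item plus a check that inspects the environment (hypothetical)."""
    description: str
    check: Callable[[Path], bool]


def grade(workdir: Path, rubric: list[Criterion]) -> dict[str, bool]:
    """Evaluate every criterion against files on disk, not the agent's transcript."""
    return {c.description: c.check(workdir) for c in rubric}


def report_mentions_total(workdir: Path) -> bool:
    """Pass only if report.txt exists and contains the word 'total'."""
    report = workdir / "report.txt"
    return report.is_file() and "total" in report.read_text()


def demo() -> dict[str, bool]:
    """Simulate a rollout that wrote one of two expected artifacts, then grade it."""
    rubric = [
        Criterion("report.txt exists", lambda d: (d / "report.txt").is_file()),
        Criterion("report mentions total", report_mentions_total),
        Criterion("summary.json exists", lambda d: (d / "summary.json").is_file()),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # Stand-in for the rollout agent's work: it wrote the report only.
        (workdir / "report.txt").write_text("total: 42\n")
        return grade(workdir, rubric)


if __name__ == "__main__":
    for description, passed in demo().items():
        print(f"{'PASS' if passed else 'FAIL'}  {description}")
```

In this sketch the third criterion fails because `summary.json` was never written, regardless of anything the agent might have claimed. A real judge would presumably replace the hard-coded checks with tool calls made inside the shared environment.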

Target audience: AI researchers, RL engineers, and agent framework developers

Repository: https://github.com/Handshake-AI-Research/gandalf-the-grader · Python · Apache-2.0 · 11 stars
