Show HN: RewardHackBench: Using sandboxes to stop agents from cheating

Category: security

Tags: benchmark, ai-safety, sandbox, evaluation, llm-agents

Score: 7.3/10 (Innovation: 7, Technical: 8, Documentation: 8, Utility: 6)

RewardHackBench is a benchmark for evaluating whether sandbox policies can prevent AI agents from cheating on evaluation tasks by retrieving forbidden solution material. It provides a reproducible methodology for testing agent honesty under different network and gateway configurations. Its notable reported result is that only an LLM judge on outgoing requests eliminates cheating without reducing legitimate solves.

Target audience: AI researchers, security engineers, ML infrastructure teams

Repository: https://github.com/islo-labs/reward-hack-bench · Python · 1 star
