Show HN: RewardHackBench: Using sandboxes to stop agents from cheating
Category: security
Tags: benchmark, ai-safety, sandbox, evaluation, llm-agents
Score: 7.3/10 (Innovation: 7, Technical: 8, Documentation: 8, Utility: 6)
RewardHackBench is a benchmark that tests whether sandbox policies can stop AI agents from cheating on evaluation tasks by retrieving forbidden solution material. It is interesting because it offers a rigorous, reproducible methodology for measuring agent cheating under different network and gateway configurations. The surprising reported result is that only an LLM judge on outgoing requests eliminates cheating without reducing legitimate solves.
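To make the idea of a gateway policy concrete, here is a minimal Python sketch of an egress check that a sandbox might apply to each outgoing request. It is not taken from the repository: the policy names (`open`, `no_network`, `denylist`, `judge`), the `OutgoingRequest` type, and the `keyword_judge` stand-in are all hypothetical, and a real judge would call an LLM rather than match keywords.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet
from urllib.parse import urlparse


@dataclass(frozen=True)
class OutgoingRequest:
    """A request the agent wants to send through the sandbox gateway (hypothetical type)."""
    url: str
    task_description: str


# Hypothetical judge signature: returns True when the request looks like an
# attempt to fetch forbidden solution material.
Judge = Callable[[OutgoingRequest], bool]


def keyword_judge(request: OutgoingRequest) -> bool:
    """Stand-in for an LLM judge; a crude keyword check used only for illustration."""
    suspicious = ("solution", "answer-key", "writeup")
    return any(word in request.url.lower() for word in suspicious)


def allow_request(
    request: OutgoingRequest,
    policy: str,
    blocked_hosts: FrozenSet[str] = frozenset(),
    judge: Judge = keyword_judge,
) -> bool:
    """Decide whether the gateway forwards the request under the given policy."""
    if policy == "open":
        return True  # no restriction at all
    if policy == "no_network":
        return False  # blocks cheating, but also blocks legitimate lookups
    if policy == "denylist":
        # Only blocks hosts known in advance; mirrors slip through.
        return urlparse(request.url).hostname not in blocked_hosts
    if policy == "judge":
        # Inspect the request itself rather than a static host list.
        return not judge(request)
    raise ValueError(f"unknown policy: {policy}")


docs = OutgoingRequest("https://docs.python.org/3/library/json.html", "parse a JSON file")
leak = OutgoingRequest("https://mirror.example.net/task-17/solution.py", "parse a JSON file")

for policy in ("open", "no_network", "denylist", "judge"):
    hosts = frozenset({"example.com"})
    print(policy, allow_request(docs, policy, hosts), allow_request(leak, policy, hosts))
```

In this toy setup, the static denylist lets the mirrored solution through and the no-network policy blocks the legitimate documentation lookup, while the per-request judge separates the two. That is only an illustration of the trade-off the summary describes, not a reproduction of the benchmark's measurements.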
Target audience: AI researchers, security engineers, ML infrastructure teams
Repository: https://github.com/islo-labs/reward-hack-bench · Python · 1 star