Show HN: New Benchmark from SWE-bench team is 0% solved

Category: ai-ml

Tags: benchmark, ai-agents, code-generation

Score: 7.3/10 (Innovation: 7, Technical: 9, Documentation: 7, Utility: 6)

ProgramBench is a benchmark from the SWE-bench team that tests whether AI agents can rebuild a complete program from scratch, given only its compiled binary and documentation, with no source code or internet access. It comprises 200 tasks, ranging from small utilities such as jq to large projects such as SQLite, backed by more than 248,000 behavioral tests. The submission title reports a 0% solve rate, which positions it as a demanding evaluation of code generation and software design capabilities.
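To make "behavioral test" concrete, here is a minimal sketch of one way such a check could work: run the reference binary and the rebuilt program on the same input and compare what they do. This is an illustration only, not ProgramBench's actual harness. The function names (`run`, `behaves_the_same`) are hypothetical, and the "reference", "candidate", and "broken" programs are small Python one-liners standing in for a real compiled binary and an agent's rebuild.

```python
import subprocess
import sys


def run(cmd, stdin_text=""):
    """Run a command and return the observable behavior we compare on."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=10
    )
    return proc.returncode, proc.stdout


def behaves_the_same(reference_cmd, candidate_cmd, stdin_text=""):
    """True if both programs give the same exit code and stdout for this input."""
    return run(reference_cmd, stdin_text) == run(candidate_cmd, stdin_text)


# Stand-ins (assumptions): a "reference binary" that upper-cases stdin,
# a faithful rebuild, and a rebuild that misses the behavior.
reference = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
candidate = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
broken = [sys.executable, "-c",
          "import sys; sys.stdout.write(sys.stdin.read())"]

# A tiny hypothetical test suite: each case is one stdin payload.
cases = ["hello\n", "", "MiXed 123\n"]
pass_rate = sum(behaves_the_same(reference, candidate, c) for c in cases) / len(cases)
```

A real harness would presumably also vary command-line arguments, files, and stderr, but the comparison idea is the same: the rebuild is judged by what it does, not by how its source looks.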

Target audience: AI researchers, machine learning engineers, and deep learning practitioners working on code generation and software engineering benchmarks

Repository: https://programbench.com/ · Python · MIT · 391 stars
