New ErdosBench Benchmark Launches to Test AI Systems on Complex Mathematical Research Problems
Summary
ErdosBench is a new benchmark that tests AI systems on complex Erdős-style mathematical research problems. It ships with a public smoke test of 14 problems and tools for blind evaluation, while the full 226-problem corpus remains private.
Key Points
- ErdosBench is a newly released research-mathematics benchmark designed to evaluate whether AI systems can function as useful mathematical research assistants by tackling Erdős-style problems, including finding obstructions, checking proof gaps, and running finite experiments.
- The public smoke-test repository contains 14 problem statements (AI-ERDOS-001 through AI-ERDOS-012, plus AI-ERDOS-195 and AI-ERDOS-208). The full 226-problem corpus, private splits, answer keys, and verifier internals remain undisclosed.
- Researchers can run blind evaluations using the provided scripts to generate prompts, collect model outputs in JSONL format, and validate results against a required schema; allowed verdicts range from 'solved' to 'no_progress'.
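
The JSONL collection and validation step described above can be sketched roughly as follows. This is an illustration only, not the benchmark's own script: the field names "problem_id" and "verdict" and the function validate_jsonl are hypothetical, and only the two verdict values named in the text are included, since the intermediate values of the real schema are not listed here.

```python
import json

# The 14 public smoke-test IDs named in the text: 001..012 plus 195 and 208.
PUBLIC_IDS = [f"AI-ERDOS-{n:03d}" for n in list(range(1, 13)) + [195, 208]]

# Only the two endpoint verdicts are named in the text. The real schema
# allows further values between them, which are not reproduced here.
KNOWN_VERDICTS = ("solved", "no_progress")


def validate_jsonl(text, allowed_verdicts=KNOWN_VERDICTS, allowed_ids=PUBLIC_IDS):
    """Return a list of (line_number, message) problems found in JSONL text.

    Assumption: each record is a JSON object with the hypothetical keys
    "problem_id" and "verdict". The actual required schema may differ.
    """
    errors = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((lineno, "record is not a JSON object"))
            continue
        if record.get("problem_id") not in allowed_ids:
            errors.append((lineno, "unknown or missing problem_id"))
        if record.get("verdict") not in allowed_verdicts:
            errors.append((lineno, "unknown or missing verdict"))
    return errors
```

Calling validate_jsonl on the contents of an output file returns an empty list when every line passes these checks. Passing the allowed verdicts and IDs as parameters keeps the sketch usable once the full list from the actual schema is known.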