New ErdosBench Benchmark Launches to Test AI Systems on Complex Mathematical Research Problems

Jun 16, 2026
GitHub

Summary

ErdosBench is a new benchmark that challenges AI systems to tackle complex Erdős-style mathematical research problems. It offers a public smoke test of 14 problems, while the full corpus of 226 problems stays private; tools for running blind evaluations are provided.

Key Points

  • ErdosBench is a newly released research-mathematics benchmark designed to evaluate whether AI systems can function as useful mathematical research assistants by tackling Erdős-style problems, including finding obstructions, checking proof gaps, and running finite experiments.
  • The public smoke-test repository contains 14 problem statements (AI-ERDOS-001 through AI-ERDOS-012, plus AI-ERDOS-195 and AI-ERDOS-208), while the full 226-problem corpus, private splits, answer keys, and verifier internals remain undisclosed.
  • Researchers can run blind evaluations using the provided scripts, which generate prompts, collect model outputs in JSONL format, and validate results against a required schema. Allowed verdicts range from 'solved' to 'no_progress'.
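
To make the JSONL validation step concrete, here is a minimal Python sketch of what checking model outputs might look like. It is not the benchmark's own validator: the field names `problem_id` and `verdict` are assumptions, and only the verdicts 'solved' and 'no_progress' are named in the source, so any intermediate verdicts in the real schema would need to be added by hand.

```python
import json

# The 14 public problem IDs as listed in the article:
# AI-ERDOS-001 .. AI-ERDOS-012, plus AI-ERDOS-195 and AI-ERDOS-208.
PUBLIC_IDS = [f"AI-ERDOS-{n:03d}" for n in range(1, 13)]
PUBLIC_IDS += ["AI-ERDOS-195", "AI-ERDOS-208"]

# Assumption: only these two verdicts are named in the article; the real
# schema likely allows intermediate values that are not listed here.
KNOWN_VERDICTS = ("solved", "no_progress")

# Assumption: hypothetical field names, not taken from the real schema.
REQUIRED_FIELDS = ("problem_id", "verdict")


def validate_line(line):
    """Return a list of error strings for one JSONL line (empty if valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS
              if name not in record]
    if "problem_id" in record and record["problem_id"] not in PUBLIC_IDS:
        errors.append(f"unknown problem_id: {record['problem_id']!r}")
    if "verdict" in record and record["verdict"] not in KNOWN_VERDICTS:
        errors.append(f"unrecognized verdict: {record['verdict']!r}")
    return errors


def validate_jsonl(text):
    """Map 1-based line numbers to errors, skipping blank and valid lines."""
    report = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            errors = validate_line(line)
            if errors:
                report[lineno] = errors
    return report


if __name__ == "__main__":
    sample = "\n".join([
        '{"problem_id": "AI-ERDOS-001", "verdict": "solved"}',
        '{"problem_id": "AI-ERDOS-999", "verdict": "no_progress"}',
    ])
    for lineno, errors in validate_jsonl(sample).items():
        print(lineno, errors)
```

Reporting errors per line, rather than stopping at the first bad record, suits JSONL because each line is an independent record and one malformed output need not invalidate the rest of a run.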
