Oxford Study Finds Nearly Half of AI Benchmarks Lack Scientific Rigor and Oversell Performance
Oxford researchers found that nearly half of 445 AI benchmarks lack scientific rigor and oversell performance. Many fail to define what they measure or to conduct proper statistical analysis, prompting calls for stricter evaluation standards.