Google DeepMind Launches ProEval, an Open-Source Bayesian Tool That Cuts AI Evaluation Costs by Up to 100x
Summary
Google DeepMind has launched ProEval, a free, open-source tool that uses Bayesian Quadrature to cut the cost of evaluating generative AI models by up to 100x. It achieves approximately ±1% accuracy in its estimates while scoring only a fraction of the samples a typical evaluation uses, and it proactively identifies model failure patterns across major benchmarks.
Key Points
- Google DeepMind has released ProEval on GitHub as an open-source tool that applies Bayesian Quadrature to cut generative AI evaluation costs by up to 100x while proactively surfacing model failure patterns.
- ProEval achieves approximately ±1% accuracy while using only a fraction of the samples a typical evaluation requires. It also supports transfer learning: pre-trained Gaussian Process surrogates generalize immediately to new models on benchmarks such as GSM8K, MMLU, and StrategyQA.
- The tool is designed to integrate easily into existing GenAI evaluation pipelines, is licensed under Apache 2.0, and comes with a technical report on arXiv (arXiv:2604.23099) that researchers can cite.
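ProEval's own interface is not described here, so the sketch below does not reproduce it. It is a minimal, textbook illustration of the general idea behind using Bayesian Quadrature for evaluation, under assumptions of our own: each benchmark item has a feature vector, per-item scores are modeled by a Gaussian Process with an RBF kernel, a prior mean of 0.5 and small observation noise, and the target is the average score over all items. The helper names (`rbf`, `solve`, `bq_estimate`), the kernel choice, and all parameter values are hypothetical. Transfer learning via pre-trained surrogates and failure-pattern discovery are not shown.

```python
# Hypothetical sketch of discrete Bayesian quadrature for a benchmark average.
# This is NOT ProEval's API; kernel, prior mean and noise are assumptions.
import math


def rbf(x, y, lengthscale=1.0):
    """Squared-exponential (RBF) kernel between two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * d2 / lengthscale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def bq_estimate(features, observed, lengthscale=1.0, noise=1e-4, prior_mean=0.5):
    """Posterior (mean, std) of the average score over ALL items.

    features: one feature vector per benchmark item (e.g. an embedding).
    observed: {item_index: score} for the items actually evaluated.
    """
    N = len(features)
    idx = sorted(observed)
    # Gram matrix of the evaluated items, with observation noise on the diagonal.
    K = [[rbf(features[i], features[j], lengthscale) + (noise if a == b else 0.0)
          for b, j in enumerate(idx)]
         for a, i in enumerate(idx)]
    # z[j] = Cov(benchmark average, score of item j): mean kernel value vs. all items.
    z = [sum(rbf(features[j], f, lengthscale) for f in features) / N for j in idx]
    resid = [observed[i] - prior_mean for i in idx]
    mean = prior_mean + sum(zj * aj for zj, aj in zip(z, solve(K, resid)))
    # Prior variance of the average: (1/N^2) * sum of all pairwise kernel values.
    prior_var = sum(rbf(f, g, lengthscale) for f in features for g in features) / N ** 2
    var = prior_var - sum(zj * wj for zj, wj in zip(z, solve(K, z)))
    return mean, math.sqrt(max(var, 0.0))


# Synthetic benchmark: 150 "easy" items the model gets right (score 1) clustered
# near feature 0, and 50 "hard" items it gets wrong (score 0) clustered near 10.
features = [[0.001 * k] for k in range(150)] + [[10.0 + 0.001 * k] for k in range(50)]
scores = [1.0] * 150 + [0.0] * 50
sample = {0: scores[0], 150: scores[150]}  # evaluate just one item per cluster
est, sd = bq_estimate(features, sample)
print(f"estimated mean score {est:.3f} +/- {sd:.3f} (true {sum(scores) / len(scores):.3f})")
```

Because each evaluated item is weighted by how strongly it correlates with the unevaluated items (the `z` vector), scoring one item per cluster recovers the 150/50 split, and the estimate lands close to the true 0.75 with only 2 of 200 items scored. The clusters here are idealized; in practice the savings depend on how well the kernel and features capture similarity between items.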