New AI Benchmark Tests Web Browsing Skills, Challenging Models with Complex Queries
Summary
BrowseComp, a new AI benchmark, evaluates web browsing skill with 1,266 complex queries that require synthesizing information across multiple websites. A deep research model trained for web browsing achieved 51.5% accuracy, well ahead of other models on this challenging task.
Key Points
- BrowseComp is a new benchmark that measures the ability of AI agents to locate hard-to-find information on the internet
- It consists of 1,266 challenging problems that require browsing multiple websites to solve, each with a short, verifiable answer
- A deep research model trained for web browsing achieved 51.5% accuracy on BrowseComp, significantly outperforming other models
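Because each problem has a short, verifiable answer, scoring reduces to comparing a model's final answer against the reference and reporting the fraction answered correctly. The sketch below illustrates that accuracy metric under stated assumptions; the question ids, sample answers, and normalization rule are hypothetical stand-ins, not the official BrowseComp grading harness.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace
    so that trivially different spellings of a short answer still match."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def accuracy(predictions: dict[str, str], references: dict[str, str]) -> float:
    """Fraction of questions whose normalized prediction equals the reference answer."""
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(answer)
        for qid, answer in references.items()
    )
    return correct / len(references)


if __name__ == "__main__":
    # Hypothetical reference answers keyed by question id (stand-ins for real benchmark items).
    references = {"q1": "Kon-Tiki", "q2": "1987"}
    # In a real run, each prediction would come from a browsing agent answering the query.
    predictions = {"q1": "kon tiki", "q2": "1986"}
    print(f"Accuracy: {accuracy(predictions, references):.1%}")  # -> Accuracy: 50.0%
```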