Leading AI Models Achieve Only 24% Accuracy on Real White-Collar Work Tasks in New Benchmark Test
Summary
On the new APEX-Agents benchmark, leading AI models reach at most 24% accuracy on real white-collar tasks drawn from consulting, investment banking, and law. Even the top scorers, Gemini 3 Flash and GPT-5.2, struggle most when a task requires tracking information across multiple workplace tools such as Slack and Google Drive.
Key Points
- The new APEX-Agents benchmark tests leading AI models on real white-collar tasks from consulting, investment banking, and law, with the best model reaching only 24% accuracy
- AI systems struggle most with tracking information across multiple domains and tools such as Slack and Google Drive, which is how professionals actually work
- Gemini 3 Flash leads at 24% accuracy, followed by GPT-5.2 at 23%, while other models, including Opus 4.5 and GPT-5, score around 18%