Cursor Engineers Slash Tool Call Errors by 10x Using AI-Driven Evals and Custom Model Tuning
Summary
Cursor engineers cut unexpected tool call errors tenfold in a single sprint by combining AI-driven evaluations, A/B testing, anomaly detection, and custom model tuning to sharpen their agent harness and prepare for a multi-agent future.
Key Points
- Cursor engineers continuously refine their agent harness, combining vision-driven development with quantitative evals, A/B testing, and real-usage metrics such as code 'Keep Rate' and user satisfaction signals to measure and improve agent performance.
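A metric like 'Keep Rate' could be computed as the fraction of accepted AI suggestions that users ultimately retain. The sketch below is a hypothetical illustration; the schema, field names, and review-window semantics are assumptions, not Cursor's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    """One AI-generated code suggestion shown to a user (hypothetical schema)."""
    suggestion_id: str
    accepted: bool            # user applied the suggestion
    kept_after_review: bool   # suggestion survived (was not reverted) after a review window

def keep_rate(suggestions: list[Suggestion]) -> float:
    """Fraction of accepted suggestions the user ultimately kept."""
    accepted = [s for s in suggestions if s.accepted]
    if not accepted:
        return 0.0
    kept = sum(1 for s in accepted if s.kept_after_review)
    return kept / len(accepted)

batch = [
    Suggestion("a", accepted=True, kept_after_review=True),
    Suggestion("b", accepted=True, kept_after_review=False),
    Suggestion("c", accepted=False, kept_after_review=False),
    Suggestion("d", accepted=True, kept_after_review=True),
]
print(round(keep_rate(batch), 2))  # 2 of 3 accepted suggestions kept
```

Tracked over time per model or per release, a ratio like this gives a real-usage signal that complements offline evals.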
- The team monitors tool call errors, classifies them by type, and pairs anomaly detection alerts with automated weekly log reviews to catch regressions and drive down bugs, cutting unexpected tool call errors by an order of magnitude in a recent sprint.
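Classifying tool call errors and alerting on anomalous spikes could look like the following minimal sketch; the error categories, keyword rules, and z-score threshold are illustrative assumptions, not Cursor's internals:

```python
import statistics
from collections import Counter

def classify(error_message: str) -> str:
    """Bucket a failed tool call into a coarse error type (hypothetical categories)."""
    msg = error_message.lower()
    if "timeout" in msg:
        return "timeout"
    if "schema" in msg or "invalid argument" in msg:
        return "malformed_arguments"
    if "not found" in msg:
        return "unknown_tool"
    return "unexpected"

def is_anomalous(daily_counts: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's error count if it sits more than z_threshold standard
    deviations above the historical mean (simple z-score anomaly check)."""
    mean = statistics.mean(daily_counts)
    stdev = statistics.pstdev(daily_counts)
    if stdev == 0:
        return today > mean
    return (today - mean) / stdev > z_threshold

logs = ["Timeout calling edit_file", "Invalid argument: bad schema", "Tool not found: grep2"]
print(Counter(classify(m) for m in logs))
print(is_anomalous([12, 9, 11, 10, 13], today=40))  # large spike flagged
```

Counting errors per category over time is what makes the "drive bugs down" loop measurable: a spike in one bucket points at a specific regression rather than a vague error rate.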
- The harness is deeply customized per model, accounting for differences in tool formats, prompting styles, and quirky behaviors such as 'context anxiety', and is being built to support a multi-agent future where orchestration across specialized agents will be the core challenge.
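Per-model customization of this kind might be expressed as a configuration layer the harness switches on at runtime. Everything below, including the model names, fields, and the early-context-trimming workaround for 'context anxiety', is a hypothetical illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HarnessProfile:
    """Per-model settings the agent harness switches on (illustrative only)."""
    tool_format: str          # e.g. native function calling vs. XML-tagged tools
    system_prompt_style: str  # how instructions are phrased for this model
    trim_context_early: bool  # workaround for models that degrade as the
                              # context window fills ("context anxiety")

# Hypothetical registry keyed by model family.
PROFILES = {
    "model_a": HarnessProfile("native_function_calling", "terse_imperative", False),
    "model_b": HarnessProfile("xml_tools", "verbose_explanatory", True),
}

def profile_for(model: str) -> HarnessProfile:
    # Fall back to a conservative default for unknown models.
    return PROFILES.get(model, HarnessProfile("native_function_calling", "terse_imperative", True))

print(profile_for("model_b").tool_format)  # xml_tools
```

Isolating these differences behind a profile keeps the core agent loop model-agnostic, which is also what a multi-agent orchestrator needs when routing work across specialized agents.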