Cursor Engineers Slash Tool Call Errors by 10x Using AI-Driven Evals and Custom Model Tuning
Summary
Cursor engineers cut unexpected tool call errors tenfold in a single sprint by combining AI-driven evaluations, A/B testing, anomaly detection, and custom model tuning to sharpen their agent harness and prepare for a multi-agent future.
Key Points
- Cursor engineers continuously refine their agent harness, combining vision-driven development with quantitative evals, A/B testing, and real-usage metrics such as code 'Keep Rate' and user satisfaction signals to measure and improve agent performance.
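A metric like 'Keep Rate' could be computed as the fraction of accepted AI suggestions that users ultimately retain. The sketch below is a hypothetical illustration; the schema, field names, and review-window semantics are assumptions, not Cursor's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    """One AI-generated code suggestion shown to a user (hypothetical schema)."""
    suggestion_id: str
    accepted: bool            # user applied the suggestion
    kept_after_review: bool   # suggestion survived (was not reverted) after a review window

def keep_rate(suggestions: list[Suggestion]) -> float:
    """Fraction of accepted suggestions the user ultimately kept."""
    accepted = [s for s in suggestions if s.accepted]
    if not accepted:
        return 0.0
    kept = sum(1 for s in accepted if s.kept_after_review)
    return kept / len(accepted)

batch = [
    Suggestion("a", accepted=True, kept_after_review=True),
    Suggestion("b", accepted=True, kept_after_review=False),
    Suggestion("c", accepted=False, kept_after_review=False),
    Suggestion("d", accepted=True, kept_after_review=True),
]
print(round(keep_rate(batch), 2))  # 2 of 3 accepted suggestions kept
```

Tracked over time per model or per release, a ratio like this gives a real-usage signal that complements offline evals.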
- The team monitors tool call errors, classifies them by type, and pairs anomaly detection alerts with automated weekly log reviews to catch regressions and drive down bugs, cutting unexpected tool call errors by an order of magnitude in a recent sprint.
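Classifying tool call errors and alerting on anomalous spikes could look like the following minimal sketch; the error categories, keyword rules, and z-score threshold are illustrative assumptions, not Cursor's internals:

```python
import statistics
from collections import Counter

def classify(error_message: str) -> str:
    """Bucket a failed tool call into a coarse error type (hypothetical categories)."""
    msg = error_message.lower()
    if "timeout" in msg:
        return "timeout"
    if "schema" in msg or "invalid argument" in msg:
        return "malformed_arguments"
    if "not found" in msg:
        return "unknown_tool"
    return "unexpected"

def is_anomalous(daily_counts: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's error count if it sits more than z_threshold standard
    deviations above the historical mean (simple z-score anomaly check)."""
    mean = statistics.mean(daily_counts)
    stdev = statistics.pstdev(daily_counts)
    if stdev == 0:
        return today > mean
    return (today - mean) / stdev > z_threshold

logs = ["Timeout calling edit_file", "Invalid argument: bad schema", "Tool not found: grep2"]
print(Counter(classify(m) for m in logs))
print(is_anomalous([12, 9, 11, 10, 13], today=40))  # large spike flagged
```

Counting errors per category over time is what makes the "drive bugs down" loop measurable: a spike in one bucket points at a specific regression rather than a vague error rate.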
- The harness is deeply customized per model, accounting for differences in tool formats, prompting styles, and quirky behaviors such as 'context anxiety', and is being built to support a multi-agent future where orchestration across specialized agents will be the core challenge.
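Per-model customization of this kind might be expressed as a configuration layer the harness switches on at runtime. Everything below, including the model names, fields, and the early-context-trimming workaround for 'context anxiety', is a hypothetical illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HarnessProfile:
    """Per-model settings the agent harness switches on (illustrative only)."""
    tool_format: str          # e.g. native function calling vs. XML-tagged tools
    system_prompt_style: str  # how instructions are phrased for this model
    trim_context_early: bool  # workaround for models that degrade as the
                              # context window fills ("context anxiety")

# Hypothetical registry keyed by model family.
PROFILES = {
    "model_a": HarnessProfile("native_function_calling", "terse_imperative", False),
    "model_b": HarnessProfile("xml_tools", "verbose_explanatory", True),
}

def profile_for(model: str) -> HarnessProfile:
    # Fall back to a conservative default for unknown models.
    return PROFILES.get(model, HarnessProfile("native_function_calling", "terse_imperative", True))

print(profile_for("model_b").tool_format)  # xml_tools
```

Isolating these differences behind a profile keeps the core agent loop model-agnostic, which is also what a multi-agent orchestrator needs when routing work across specialized agents.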