Cursor Engineers Slash Tool Call Errors by 10x Using AI-Driven Evals and Custom Model Tuning

May 01, 2026
Cursor

Summary

Cursor engineers cut unexpected tool call errors tenfold in a single sprint by combining AI-driven evaluations, A/B testing, anomaly detection, and custom model tuning. The same work sharpens their agent harness and prepares it for a multi-agent future.

Key Points

  • Cursor engineers are continuously refining their agent harness by combining vision-driven development with quantitative evals, A/B testing, and real-usage metrics like code 'Keep Rate' and user satisfaction signals to measure and improve agent performance.
  • The team monitors tool call errors, classifies them by type, and uses anomaly detection alerts alongside automated weekly log reviews to catch regressions and drive bugs down. In a recent sprint, this reduced unexpected tool call errors by an order of magnitude.
  • The harness is deeply customized per model, accounting for differences in tool formats, prompting styles, and even quirky behaviors like 'context anxiety'. It is being built to support a multi-agent future where orchestration across specialized agents will be the core challenge.
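
The error-monitoring loop described in the second point can be illustrated with a minimal sketch. This is not Cursor's implementation: the error categories, the `classify`, `unexpected_rate`, and `is_anomalous` helpers, and the z-score threshold are all hypothetical, chosen only to show how classifying errors by type and alerting on a rate spike might fit together.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical categories: errors the agent can be expected to hit and recover from.
# The article does not list the real taxonomy.
EXPECTED = {"invalid_arguments", "file_not_found", "timeout"}


def classify(error_type: str) -> str:
    """Label a tool call error as 'expected' or 'unexpected'."""
    return "expected" if error_type in EXPECTED else "unexpected"


def unexpected_rate(calls: list[dict]) -> float:
    """Fraction of all tool calls that ended in an unexpected error."""
    if not calls:
        return 0.0
    counts = Counter(classify(c["error"]) for c in calls if c.get("error"))
    return counts["unexpected"] / len(calls)


def is_anomalous(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's rate if it is more than `threshold` standard
    deviations above the mean of the historical rates (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > threshold


# Example with made-up data: one unexpected error out of four calls.
calls = [
    {"tool": "edit_file", "error": None},
    {"tool": "read_file", "error": "file_not_found"},
    {"tool": "edit_file", "error": "schema_mismatch"},
    {"tool": "grep", "error": None},
]
rate = unexpected_rate(calls)
history = [0.010, 0.012, 0.011, 0.009, 0.010]
alert = is_anomalous(history, rate)
```

A z-score check is only one of many possible detectors; it is used here because it needs no dependencies and makes the "alert on a regression" idea concrete.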
