New DeepSWE Benchmark Tests AI Coding Agents on 113 Real-World Software Engineering Tasks

May 27, 2026
GitHub

Summary

DeepSWE is a new benchmark that tests AI coding agents on 113 real-world software engineering tasks across five programming languages. It runs each task in an isolated Docker environment and uses behavior-based grading to evaluate frontier coding agents such as Claude Code, Codex, and Gemini CLI.

Key Points

  • DeepSWE is a new benchmark designed to evaluate frontier coding agents on 113 original, long-horizon software engineering tasks drawn from active open-source repositories, spanning TypeScript, Go, Python, JavaScript, and Rust.
  • Each task follows the Harbor task format: a prompt, an isolated Docker environment, and a program-based verifier that grades solutions on observable behavior rather than internal code structure. A reference solution is available only to human and AI reviewers.
  • The benchmark runs via Pier, a sandboxed coding-agent evaluation framework that supports multiple AI models and agents, including Claude Code, Codex, and Gemini CLI, with options for parallel execution on Modal and flexible task sampling.
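
To make the idea of behavior-based grading concrete, here is a minimal sketch in Python. It is not the Harbor task format or the Pier API; the names `BehaviorCheck`, `run_check`, and `grade` are hypothetical, and the two toy "solutions" are invented for illustration. The point it shows is that a verifier of this kind only runs the program and inspects what it outputs, so two solutions with different internals receive the same score.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class BehaviorCheck:
    """Hypothetical check: run a command, then compare exit code and stdout."""
    argv: list
    expected_stdout: str
    expected_exit_code: int = 0


def run_check(check: BehaviorCheck, timeout_s: float = 30.0) -> bool:
    """Return True if the command's observable behavior matches the expectation."""
    try:
        proc = subprocess.run(
            check.argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False
    return (
        proc.returncode == check.expected_exit_code
        and proc.stdout.strip() == check.expected_stdout.strip()
    )


def grade(checks) -> float:
    """Fraction of checks passed; the solution's source code is never inspected."""
    if not checks:
        return 0.0
    passed = sum(run_check(c) for c in checks)
    return passed / len(checks)


# Two toy solutions with different internal structure but the same behavior.
loop_solution = "total = 0\nfor i in range(1, 5):\n    total += i\nprint(total)"
formula_solution = "print(4 * 5 // 2)"
wrong_solution = "print(11)"


def checks_for(source: str):
    """Build the checks for one candidate solution (assumed task: print 1+2+3+4)."""
    return [BehaviorCheck([sys.executable, "-c", source], expected_stdout="10")]
```

Calling `grade(checks_for(loop_solution))` and `grade(checks_for(formula_solution))` gives the same result because only exit code and stdout are compared. A real benchmark verifier would run inside the task's isolated environment and use a much richer set of checks.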
