New DeepSWE Benchmark Tests AI Coding Agents on 113 Real-World Software Engineering Tasks
Summary
DeepSWE is a new benchmark that tests AI coding agents on 113 real-world software engineering tasks across five programming languages. It uses isolated Docker environments and behavior-based grading to evaluate frontier coding agents such as Claude Code, Codex, and Gemini CLI.
Key Points
- DeepSWE is a new benchmark designed to evaluate frontier coding agents on 113 original, long-horizon software engineering tasks drawn from active open-source repositories, spanning TypeScript, Go, Python, JavaScript, and Rust.
- Each task follows the Harbor task format: a prompt, an isolated Docker environment, and a program-based verifier that grades solutions on observable behavior rather than internal code structure. A reference solution is included but is available only to human and AI reviewers.
- The benchmark runs via Pier, a sandboxed coding-agent evaluation framework that supports multiple AI models and agents, including Claude Code, Codex, and Gemini CLI, with options for parallel execution on Modal and flexible task sampling.
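
To illustrate what grading on observable behavior rather than internal code structure can look like, here is a minimal Python sketch. It is not the Harbor verifier or any part of Pier; the names `BehaviorCheck`, `grade`, and the toy candidate programs are hypothetical and exist only for this example. The idea is that the grader runs a candidate solution as a black box, feeds it input, and compares what it prints, so two solutions with different internals receive the same verdict if they behave the same.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorCheck:
    """One observable expectation: given this stdin, expect this stdout."""
    stdin: str
    expected_stdout: str


def grade(command: list[str], checks: list[BehaviorCheck], timeout: float = 10.0) -> bool:
    """Return True only if `command` passes every check.

    The candidate is treated as a black box: only its exit code and
    stdout are inspected, never its source code.
    """
    for check in checks:
        try:
            result = subprocess.run(
                command,
                input=check.stdin,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0:
            return False
        if result.stdout.strip() != check.expected_stdout.strip():
            return False
    return True


# Hypothetical candidates: A and B differ internally but behave identically.
SOLUTION_A = "import sys; print(sum(int(x) for x in sys.stdin.read().split()))"
SOLUTION_B = (
    "import sys\n"
    "total = 0\n"
    "for token in sys.stdin.read().split():\n"
    "    total += int(token)\n"
    "print(total)"
)
BROKEN = "print(0)"  # ignores its input entirely

CHECKS = [BehaviorCheck("1 2 3", "6"), BehaviorCheck("10 -4", "6")]

if __name__ == "__main__":
    for name, source in [("A", SOLUTION_A), ("B", SOLUTION_B), ("broken", BROKEN)]:
        verdict = grade([sys.executable, "-c", source], CHECKS)
        print(f"solution {name}: {'pass' if verdict else 'fail'}")
```

In a benchmark setting the candidate would presumably run inside the task's isolated container rather than as a local subprocess, but the grading principle in this sketch is the same: the verdict depends only on what the program does.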