Why we no longer evaluate SWE-bench Verified

OpenAI Blog

An OpenAI analysis finds that SWE-bench Verified is increasingly contaminated by training leakage, and recommends SWE-bench Pro for future coding agent evaluation.

Categories: Research

Excerpt

SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.