Introducing SWE-bench Verified

OpenAI Blog

OpenAI releases SWE-bench Verified, a human-validated subset of the SWE-bench software engineering benchmark, to reduce false positives and provide a more reliable evaluation of code models' true capabilities.

Categories: OSS & Tools, Research

Excerpt

We’re releasing a human-validated subset of SWE-bench that more reliably evaluates AI models’ ability to solve real-world software issues.