Introducing SWE-bench Verified
OpenAI releases SWE-bench Verified, a human-validated subset of the SWE-bench software engineering benchmark, to reduce false positives and provide a more reliable evaluation of the true capabilities of code models.
Excerpt
We’re releasing a human-validated subset of SWE-bench that more reliably evaluates AI models’ ability to solve real-world software issues.
Read at source: https://openai.com/index/introducing-swe-bench-verified