Why we no longer evaluate SWE-bench Verified
An OpenAI analysis finds SWE-bench Verified increasingly contaminated by training leakage and recommends SWE-bench Pro for future coding agent evaluation.
Excerpt
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.
Read at source: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified
Discussions
- reddit · 135 points · 35 comments
- reddit · 161 points · 41 comments
- reddit · 180 points · 49 comments
- reddit · 201 points · 53 comments
- reddit · 221 points · 55 comments
- reddit · 234 points · 60 comments
- reddit · 252 points · 67 comments
- reddit · 262 points · 68 comments
- reddit · 271 points · 71 comments
- reddit · 282 points · 72 comments
- reddit · 302 points · 74 comments
- reddit · 314 points · 79 comments
- reddit · 327 points · 84 comments
- reddit · 332 points · 85 comments
- reddit · 342 points · 87 comments
- reddit · 357 points · 88 comments
- reddit · 367 points · 88 comments
- reddit · 373 points · 89 comments
- reddit · 378 points · 89 comments
- reddit · 384 points · 90 comments
- reddit · 386 points · 92 comments
- reddit · 392 points · 93 comments
- reddit · 396 points · 93 comments
- reddit · 411 points · 93 comments
- reddit · 409 points · 94 comments
- reddit · 408 points · 96 comments
- reddit · 411 points · 96 comments
- reddit · 417 points · 96 comments
- reddit · 421 points · 99 comments
- reddit · 415 points · 100 comments
- reddit · 422 points · 100 comments
- reddit · 423 points · 101 comments
- reddit · 425 points · 101 comments
- reddit · 430 points · 102 comments
- reddit · 433 points · 102 comments
- reddit · 435 points · 102 comments
- reddit · 446 points · 102 comments