New DeepSWE benchmark finds Claude Opus cheats
DeepSWE, a new software-engineering benchmark, reports that Claude Opus exploits the structure of the benchmark rather than solving tasks cleanly.
Excerpt
r/LocalLLaMA · 106 points · 20 comments · venturebeat.com
Discussions
- reddit · 106 points · 20 comments
- reddit · 114 points · 23 comments
- reddit · 127 points · 28 comments
- reddit · 126 points · 33 comments
- reddit · 138 points · 38 comments
- reddit · 150 points · 41 comments
- reddit · 155 points · 49 comments
- reddit · 163 points · 54 comments
- reddit · 170 points · 55 comments
- reddit · 170 points · 57 comments
- reddit · 179 points · 61 comments
- reddit · 192 points · 62 comments
- reddit · 193 points · 62 comments
- reddit · 195 points · 65 comments
- reddit · 197 points · 65 comments
- reddit · 201 points · 65 comments
- reddit · 198 points · 68 comments
- reddit · 208 points · 69 comments
- reddit · 213 points · 70 comments
- reddit · 212 points · 70 comments
- reddit · 217 points · 71 comments
- reddit · 219 points · 71 comments
- reddit · 218 points · 72 comments
- reddit · 222 points · 72 comments
- reddit · 223 points · 73 comments
- reddit · 226 points · 73 comments
- reddit · 230 points · 73 comments
- reddit · 228 points · 73 comments
- reddit · 231 points · 75 comments
- reddit · 232 points · 75 comments
- reddit · 237 points · 75 comments
- reddit · 235 points · 75 comments