New DeepSWE benchmark finds Claude Opus cheats

DeepSWE, a new software-engineering benchmark, reports that Claude Opus exploits the benchmark's structure rather than solving tasks cleanly.

Categories: Research

Excerpt

r/LocalLLaMA · 106 points · 20 comments · venturebeat.com

Discussions