SWE-bench Verified

Category: agents · Unit: % · Last refreshed

Real-world GitHub issues, drawn from the human-verified subset of SWE-bench. Widely treated as a gold standard for evaluating coding agents.

Top 25 models

Rank Model Score Captured
1 Claude Opus 4.5 79.2% 2025-12-15
2 Doubao Seed Code 78.8% 2025-09-28
3 Gemini 3 Pro 77.4% 2025-11-20
4 Claude Sonnet 4 76.8% 2025-08-04
5 Gemini 3 Flash 75.8% 2026-02-17
6 MiniMax M2.5 75.8% 2026-02-17
7 Claude Opus 4.6 75.6% 2026-02-17
8 Claude Sonnet 4.5 74.8% 2025-11-03
9 GPT-5 74.4% 2025-10-15
10 Claude Opus 4 73.2% 2025-05-22
11 GPT-5.2 72.8% 2026-02-19
12 GLM-5 72.8% 2026-02-17
13 Kimi K2 71.2% 2025-10-14
14 DeepSeek V3 70.0% 2026-02-17
15 Qwen 3 Coder 69.6% 2025-08-05
16 GLM-4.6 68.2% 2025-09-30
17 Claude Haiku 4.5 66.6% 2026-02-17
18 Claude 3.7 Sonnet 66.4% 2025-05-14
19 GPT-5.1 66.0% 2025-11-24
20 Claude 3.5 Sonnet 62.8% 2025-02-28
21 MiniMax M2 61.0% 2025-11-24
22 Gemini 2.5 Pro 53.6% 2025-07-26
23 Gemini 2.0 Flash 52.2% 2024-12-12
24 o4-mini 45.0% 2025-07-26
25 o3-mini 42.4% 2025-02-14
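
The rows above follow a loose "rank, model, score, capture date" layout in which the model name may itself contain spaces and digits. A minimal Python sketch of how such rows could be parsed is given below. The names `Entry`, `parse_row`, and `resolved_count` are hypothetical helpers introduced for illustration, and the default of 500 tasks is an assumption based on the published size of SWE-bench Verified; this page does not state how many tasks each score was computed over.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int
    model: str
    score: float    # percent, as listed in the Score column
    captured: str   # capture date, kept as an ISO-style string


# rank, model name (may contain spaces and digits), score with '%', date
ROW = re.compile(r"^(\d+)\s+(.+?)\s+(\d+(?:\.\d+)?)%\s+(\d{4}-\d{2}-\d{2})$")


def parse_row(line: str) -> Entry:
    """Split one leaderboard row into its four fields."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    rank, model, score, captured = m.groups()
    return Entry(int(rank), model, float(score), captured)


def resolved_count(score_percent: float, total_tasks: int = 500) -> int:
    """Convert a percentage into a task count.

    total_tasks=500 is an assumption, not something this page states.
    """
    return round(score_percent / 100 * total_tasks)


# A few rows copied from the table above.
rows = [
    "1 Claude Opus 4.5 79.2% 2025-12-15",
    "5 Gemini 3 Flash 75.8% 2026-02-17",
    "6 MiniMax M2.5 75.8% 2026-02-17",
]
entries = [parse_row(r) for r in rows]
for e in entries:
    print(e.rank, e.model, e.score, resolved_count(e.score))
```

Anchoring the pattern on the trailing percentage and date, rather than splitting on whitespace, is what lets names such as "MiniMax M2.5" parse cleanly. Rows that share a score (ranks 5 and 6, for example) keep the ranks listed in the table; the sketch does not attempt to re-rank ties.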