Terminal-Bench 2.0

Category: agents · Unit: %

Evaluates agents on completing real tasks in an actual terminal/shell environment.

Top 21 models

Rank Model Score Captured
1 GPT-5.1 100.0% 2026-01-15
2 Claude Opus 4.7 89.9% 2026-05-01
3 Gemini 3.1 Pro 88.8% 2026-03-08
4 GPT-5 88.1% 2026-04-30
5 Claude Opus 4.6 82.0% 2026-03-28
6 GPT-5.3 Codex 82.0% 2026-03-26
7 GPT-5.4 81.8% 2026-03-07
8 Gemini 3 Flash 71.0% 2026-03-06
9 Gemini 3 Pro 67.4% 2026-01-25
10 GPT-5.2 65.8% 2026-02-10
11 Grok 4 61.8% 2026-04-01
12 GLM-5 60.0% 2026-04-03
13 Claude Opus 4.5 58.4% 2026-01-16
14 Claude Sonnet 4.6 52.1% 2026-04-17
15 Claude Sonnet 4.5 42.7% 2025-11-13
16 MiniMax M2 42.5% 2026-05-02
17 Kimi K2 42.5% 2026-01-27
18 MiniMax M2.5 42.2% 2026-02-17
19 DeepSeek V3 39.6% 2026-02-08
20 GLM-4.7 33.3% 2026-02-07
21 Qwen 3 Coder 27.2% 2025-12-26