Terminal-Bench 2.0
A benchmark of agents completing real tasks in an actual terminal (shell) environment.
Top 21 models
| Rank | Model | Score | Date captured |
|---|---|---|---|
| 1 | GPT-5.1 | 100.0% | 2026-01-15 |
| 2 | Claude Opus 4.7 | 89.9% | 2026-05-01 |
| 3 | Gemini 3.1 Pro | 88.8% | 2026-03-08 |
| 4 | GPT-5 | 88.1% | 2026-04-30 |
| 5 | Claude Opus 4.6 | 82.0% | 2026-03-28 |
| 6 | GPT-5.3 Codex | 82.0% | 2026-03-26 |
| 7 | GPT-5.4 | 81.8% | 2026-03-07 |
| 8 | Gemini 3 Flash | 71.0% | 2026-03-06 |
| 9 | Gemini 3 Pro | 67.4% | 2026-01-25 |
| 10 | GPT-5.2 | 65.8% | 2026-02-10 |
| 11 | Grok 4 | 61.8% | 2026-04-01 |
| 12 | GLM-5 | 60.0% | 2026-04-03 |
| 13 | Claude Opus 4.5 | 58.4% | 2026-01-16 |
| 14 | Claude Sonnet 4.6 | 52.1% | 2026-04-17 |
| 15 | Claude Sonnet 4.5 | 42.7% | 2025-11-13 |
| 16 | MiniMax M2 | 42.5% | 2026-05-02 |
| 17 | Kimi K2 | 42.5% | 2026-01-27 |
| 18 | MiniMax M2.5 | 42.2% | 2026-02-17 |
| 19 | DeepSeek V3 | 39.6% | 2026-02-08 |
| 20 | GLM-4.7 | 33.3% | 2026-02-07 |
| 21 | Qwen 3 Coder | 27.2% | 2025-12-26 |
Upstream leaderboard: https://www.tbench.ai/leaderboard