AI model benchmark leaderboards

Track frontier AI model rankings across SWE-bench, LMArena, Terminal-Bench, METR, coding, reasoning, and agent benchmarks in one crawlable leaderboard index.

AA Coding Index (code)

Artificial Analysis composite coding score: an equally weighted average of SciCode, Terminal-Bench Hard, and LiveCodeBench. Higher = better.

# Model Score
1 GPT-5 59.1%
2 GPT-5.4 57.2%
3 Claude Opus 4 56.7%
4 Gemini 3.1 Pro 55.5%
5 GPT-5.3 Codex 53.1%
6 Claude Opus 4.7 53.1%
7 Claude Sonnet 4.6 50.9%
8 GPT-5.2 48.7%
9 Claude Opus 4.6 48.1%
10 Claude Opus 4.5 47.8%

latest 2026-06-09 · upstream leaderboard
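
As a rough illustration of the equally weighted average described above, the sketch below averages three component scores. The function name and the input values are hypothetical, chosen only for the example; they are not Artificial Analysis data, and the real index may apply normalization that is not described here.

```python
from statistics import fmean


def coding_index(scicode: float, terminal_bench_hard: float, livecodebench: float) -> float:
    """Equally weighted average of three component scores (each in percent).

    Hypothetical helper: each benchmark contributes one third of the result.
    """
    return fmean([scicode, terminal_bench_hard, livecodebench])


# Made-up component scores, not real benchmark results.
example = coding_index(scicode=40.0, terminal_bench_hard=50.0, livecodebench=60.0)
print(f"{example:.1f}%")
```

With equal weights, a one-point gain on any single component moves the composite by a third of a point, so no single benchmark dominates the ranking.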

Chatbot Arena (chat)

Crowdsourced human-preference Elo ratings across all models.

# Model Score
1 Claude Opus 4.6 1501
2 Claude Opus 4.7 1488
3 Gemini 3.1 Pro 1482
4 Gemini 3 Pro 1480
5 GPT-5.4 1471
6 GLM-5 1470
7 GPT-5 1468
8 Gemini 3 Flash 1466
9 Claude Opus 4 1461
10 Gemini 2.5 Pro 1457

latest 2026-06-05 · upstream leaderboard
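
To give a feel for what a gap between two Elo ratings means, here is a minimal sketch that converts a rating difference into an expected head-to-head win rate. It assumes the textbook Elo logistic formula with a 400-point scale; the arena's actual rating model may differ, and the function name is hypothetical.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected probability that A is preferred over B under standard Elo.

    Assumption: classic base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Ratings taken from rows 1 and 10 of the table above.
p = expected_win_rate(1501, 1457)
print(f"{p:.2f}")
```

Under this assumption, the 44-point spread between first and tenth place corresponds to a win rate of roughly 56%, so the top ten are closer in head-to-head preference than the rank order alone suggests.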

Chatbot Arena · Open Weights (chat)

LMArena Elo filtered to non-proprietary licenses. Where OSS actually stands.

# Model Score
1 GLM-5 1470
2 Kimi K2 1455
3 GLM-4.6 1441
4 GLM-4.7 1436
5 Mistral Large 1430
6 DeepSeek R1 1428
7 DeepSeek V3 1424
8 DeepSeek V3.1 1420
9 Qwen 3 235B 1419
10 Qwen 3 Coder 1356

latest 2026-06-05 · upstream leaderboard

METR Time Horizon (agents)

Length of task, measured by how long it takes a human expert, that an AI can complete at ~50% success, from METR's Time Horizon 1.1 suite (HCAST + SWAA). Scores are shown in hours and minutes. Longer = more capable.

# Model Score
1 Claude Opus 4.6 11h59m
2 Gemini 3.1 Pro 6h24m
3 GPT-5.2 5h52m
4 GPT-5.3 Codex 5h50m
5 GPT-5.4 5h42m
6 Claude Opus 4.5 4h53m
7 Gemini 3 Pro 3h44m
8 GPT-5.1 Codex Max 3h44m
9 GPT-5 3h23m
10 o3 2h00m

latest 2026-03-05 · upstream leaderboard
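
The scores in this table are durations. A minimal sketch of rendering a horizon given in minutes in that hours-and-minutes form follows, assuming the value is rounded to the nearest whole minute before it is split, so the minutes part can never reach 60. The function name and the sample inputs are hypothetical.

```python
def format_horizon(minutes: float) -> str:
    """Format a duration in minutes as e.g. '11h59m'.

    Rounds to the nearest whole minute first, then splits into hours and
    minutes, so the minutes part always stays in the range 0..59.
    """
    total = round(minutes)           # round once, before splitting
    hours, mins = divmod(total, 60)  # carry whole hours out of the minutes
    return f"{hours}h{mins:02d}m"


print(format_horizon(119.6))  # 2h00m
print(format_horizon(718.8))  # 11h59m
```

Rounding the hours and the minutes separately is the usual source of out-of-range values: 119.6 minutes would split into 1 hour and 59.6 minutes, and the remainder would then round up to 60.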

OpenRouter · Weekly Usage (usage)

Where developers are actually spending tokens this week, per OpenRouter's top-weekly ranking. A real-world adoption signal, not a capability measure.

# Model Score
1 Claude Sonnet 4.6 #7
2 Claude Opus 4.7 #8
3 Claude Opus 4 #9
4 DeepSeek V3 #10
5 Gemini 3 Flash #11
6 Gemini 2.5 Flash #13
7 Claude Opus 4.6 #19
8 GPT-5 #20
9 Gemini 3.1 Flash #23
10 GPT-4o mini #24

latest 2026-06-09 · upstream leaderboard

SWE-bench Verified (agents)

Real-world GitHub issues; human-verified subset. Gold standard for coding agents.

# Model Score
1 Claude Opus 4.5 79.2%
2 Doubao Seed Code 78.8%
3 Gemini 3 Pro 77.4%
4 Claude Sonnet 4 76.8%
5 Gemini 3 Flash 75.8%
6 MiniMax M2.5 75.8%
7 Claude Opus 4.6 75.6%
8 Claude Sonnet 4.5 74.8%
9 GPT-5 74.4%
10 Claude Opus 4 73.2%

latest 2026-02-17 · upstream leaderboard

Terminal-Bench 2.0 (agents)

Agents completing real tasks in an actual terminal/shell environment.

# Model Score
1 GPT-5.1 100.0%
2 Claude Opus 4.7 89.9%
3 Gemini 3.1 Pro 88.8%
4 GPT-5 88.1%
5 Claude Opus 4.6 82.0%
6 GPT-5.3 Codex 82.0%
7 GPT-5.4 81.8%
8 Gemini 3 Flash 71.0%
9 Gemini 3 Pro 67.4%
10 GPT-5.2 65.8%

latest 2026-05-01 · upstream leaderboard