We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R]

By TimoKerre

· r/MachineLearning · Apr 23, 2026

Benchmark of 18 LLMs on OCR with 7k+ calls finds smaller/older models match premium accuracy at a fraction of the cost; dataset and framework open-sourced.

Categories: Research

Excerpt

**TL;DR:** We were overpaying for OCR, so we compared flagship models with cheaper and older ones. New mini-bench + leaderboard. Free tool to test your own documents. Open source.

We've been looking at OCR / document extraction workflows and kept seeing the same pattern: too many teams are either stuck in legacy OCR pipelines or badly overpaying for LLM calls by defaulting to the newest/biggest model.

We put together a curated set of 42 standard documents and ran each of the 18 models 10 times per document under identical conditions: 7,560 calls in total.

Main takeaway: for standard OCR, smaller and older models match premium accuracy at a fraction of the cost.

We track pass^n (reliability at scale), cost-per-success, latency, and critical field accuracy.

Everything is open source: [https://github.com/ArbitrHq/ocr-mini-bench](https://github.com/ArbitrHq/ocr-mini-bench)

Leaderboard: [https://arbitrhq.ai/leaderboards/](https://arbitrhq.ai/leaderboards/)

Curious whether this matches what others here are seeing.
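For readers unfamiliar with the metrics named in the post, here is a rough sketch of how pass^n and cost-per-success could be computed for one model on one document. The post does not spell out its formulas, so the definitions below are assumptions: pass^n is taken to mean "the probability that n independent runs all succeed", estimated from repeated runs, and cost-per-success is taken to be total spend divided by the number of successful calls. The function names are hypothetical and are not taken from the ocr-mini-bench repository.

```python
from math import comb


def pass_hat_n(successes: int, trials: int, n: int) -> float:
    """Estimate P(all n independent runs succeed) from `successes` out of `trials`.

    Assumed definition (not confirmed by the post): the fraction of
    size-n subsets of the observed runs that contain only successes.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= n <= trials:
        raise ValueError("n must be between 1 and trials")
    # comb(successes, n) is 0 when successes < n, so one failure in
    # 10 runs already drives pass^10 to zero.
    return comb(successes, n) / comb(trials, n)


def cost_per_success(total_cost_usd: float, successes: int) -> float:
    """Total spend divided by successful calls (assumed definition)."""
    if successes == 0:
        return float("inf")  # no successes: every dollar was wasted
    return total_cost_usd / successes


# Call count stated in the post: 42 documents x 18 models x 10 runs each.
TOTAL_CALLS = 42 * 18 * 10

if __name__ == "__main__":
    # Illustrative numbers only, not results from the benchmark.
    print(pass_hat_n(successes=9, trials=10, n=1))
    print(pass_hat_n(successes=9, trials=10, n=2))
    print(pass_hat_n(successes=9, trials=10, n=10))
    print(cost_per_success(total_cost_usd=1.0, successes=4))
    print(TOTAL_CALLS)
```

Under this assumed definition, a model that succeeds 9 times out of 10 looks fine on average accuracy but scores zero on pass^10, which is one way to read "reliability at scale".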

Read at source: https://www.reddit.com/r/MachineLearning/comments/1st9v81/we_benchmarked_18_llms_on_ocr_7k_calls_cheaperold/

Discussions

reddit · 50 points · 20 comments
reddit · 50 points · 21 comments
reddit · 51 points · 25 comments