Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA [D]
Benchmark of 6 document QA approaches on MMLongBench-Doc finds that attaching the PDF directly to a vision LLM (Claude Sonnet 4.5) ranks 5th in accuracy at 52.0% and costs the most at $0.2552/query, behind four of the five other approaches, including both premium OCR pipelines, on image-heavy PDFs.
Excerpt
I benchmarked vision-capable LLMs (the "just attach the PDF and let the model read it" pattern) against OCR-based pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc ([https://github.com/mayubo2333/MMLongBench-Doc](https://github.com/mayubo2333/MMLongBench-Doc)). The benchmark covered 171 questions in total, with Claude Sonnet 4.5 as the LLM.
Post-retry results:
|Approach|Accuracy|$/query|
|:-|:-|:-|
|LlamaCloud premium + full-context|59.6%|$0.1885|
|Azure premium + full-context|58.5%|$0.2051|
|Azure basic + full-context|54.4%|$0.1062|
|Agentic RAG|53.2%|$0.0827|
|**Native PDF (vision LLM)**|**52.0%**|**$0.2552**|
|LlamaCloud basic + full-context|50.9%|$0.1049|
Native PDF came 5th of 6 on accuracy and was the most expensive arm at $0.2552 per query.
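To make the ranking easy to re-check, here is a small sketch that loads the table above and recomputes the accuracy rank and the most expensive arm. The `cost_per_correct` helper is my own derived metric (dollars per correctly answered question), not something reported in the post, so treat it as an illustration only.

```python
# Figures copied from the results table: (approach, accuracy %, $ per query).
RESULTS = [
    ("LlamaCloud premium + full-context", 59.6, 0.1885),
    ("Azure premium + full-context", 58.5, 0.2051),
    ("Azure basic + full-context", 54.4, 0.1062),
    ("Agentic RAG", 53.2, 0.0827),
    ("Native PDF (vision LLM)", 52.0, 0.2552),
    ("LlamaCloud basic + full-context", 50.9, 0.1049),
]


def accuracy_rank(name):
    """1-based rank of an approach when sorted by accuracy, best first."""
    ordered = sorted(RESULTS, key=lambda row: row[1], reverse=True)
    return [row[0] for row in ordered].index(name) + 1


def most_expensive():
    """Name of the approach with the highest cost per query."""
    return max(RESULTS, key=lambda row: row[2])[0]


def cost_per_correct(accuracy_pct, cost_per_query):
    """Derived metric (an assumption, not from the post): $ per correct answer."""
    return cost_per_query / (accuracy_pct / 100.0)


if __name__ == "__main__":
    by_value = sorted(RESULTS, key=lambda row: cost_per_correct(row[1], row[2]))
    for name, acc, cost in by_value:
        print(f"{name:36s} ${cost_per_correct(acc, cost):.4f} per correct answer")
```

Dividing cost by accuracy is only one way to combine the two columns; it ignores which questions each arm gets right, so it should not replace the paired comparison.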
Two findings:
Vision underperformed on chart-heavy and table-heavy pages, the territory that the "vision LLMs make OCR obsolete" claim most often points to. Premium OCR with layout extraction held up better there.
The native-PDF arm had a 7% intrinsic failure rate (12 of 171 queries, related to PDF file size) that survived retries. The first pass produced 27 failures, and each failed query was retried with exponential backoff for up to 5 attempts. Fifteen recovered and 12 stayed permanently broken. The permanent failures were concentrated in two specific PDFs that fail for predictable transport-layer reasons (the blog identifies them). OCR-based arms had a 0% intrinsic failure rate after retries.
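For readers who want to reproduce the retry policy, a minimal sketch of exponential backoff follows. The post only says failed queries got 5 attempts with exponential backoff; the base delay, the cap, the jitter, and the `TransientError` class are all assumptions of mine, and whether the first pass counts toward the 5 is not stated.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a failure that is worth retrying."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts calls have failed.

    After the k-th failure (k starting at 0) the wait is
    base_delay * 2**k seconds, capped at max_delay, plus up to 10% jitter.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                # Out of attempts: the caller records a permanent failure.
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

A query that still raises after the last attempt is what the post counts as an intrinsic failure. Retrying cannot help when the cause is deterministic, such as a file that is too large for the request, which matches the 12 queries that never recovered.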
Caveats: 30 docs is a small sample. I ran McNemar's pairwise test to determine which gaps
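As a reference for the McNemar's test mentioned in the caveats, here is a minimal sketch of the exact (binomial) variant for two arms scored on the same questions. The post does not say which variant was used, so the exact form is an assumption, and the function name and the boolean-list input format are hypothetical.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on paired per-question outcomes.

    a_correct and b_correct are equal-length sequences of booleans, one entry
    per question. Only discordant pairs (exactly one arm correct) carry
    information. Returns (a_only, b_only, p_value).
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both arms must be scored on the same questions")
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # no discordant pairs, no evidence of a gap
    k = min(a_only, b_only)
    # Under the null, each discordant pair favors either arm with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)
```

Because the test uses only the questions where the two arms disagree, two arms with a visible accuracy gap can still fail to differ significantly when the number of discordant questions is small, which is the risk with 171 questions.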
Read at source: https://www.reddit.com/r/MachineLearning/comments/1tm0cqg/visioncapable_llms_vs_ocr_for_longdocument/