Qwen2-VL: To See the World More Clearly
Qwen2-VL is released with state-of-the-art (SoTA) visual understanding across benchmarks and comprehension of videos over 20 minutes long, extending the Qwen2 family to the multimodal frontier.
Excerpt
After a year of relentless effort, today we are thrilled to release Qwen2-VL! Qwen2-VL is the latest version of the vision-language models based on Qwen2 in the Qwen model family. Compared with Qwen-VL, Qwen2-VL has the following capabilities:
SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos of 20+ minutes: Qwen2-VL can understand videos over 20 minutes long, supporting high-quality video-based question answering, dialog, content creation, and similar tasks.
Read at source: https://qwenlm.github.io/blog/qwen2-vl/