OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
Format: Article
Language: English
Abstract: Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, on various text-related visual tasks, including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.
DOI: 10.48550/arxiv.2305.07895
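For readers who want a concrete feel for how a benchmark of this kind is typically scored, the sketch below shows a minimal, hypothetical evaluation loop: each sample pairs a question with one or more reference answers, and a prediction counts as correct if a normalized reference string appears in the model's response. This is only an illustrative assumption about a common OCR-benchmark scoring scheme, not the repository's actual API; the `Sample`, `normalize`, `is_correct`, and `score` names are invented for the example, and the stub model ignores the image input that a real pipeline would pass.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    question: str          # prompt shown to the model alongside the image
    references: List[str]  # acceptable ground-truth answers
    task: str              # e.g. "Text Recognition", "KIE", "HMER"


def normalize(text: str) -> str:
    # Lowercase and strip whitespace so trivial formatting
    # differences are not counted as errors.
    return "".join(text.lower().split())


def is_correct(prediction: str, references: List[str]) -> bool:
    # Accept the prediction if any normalized reference string
    # occurs inside the normalized model response.
    pred = normalize(prediction)
    return any(normalize(ref) in pred for ref in references)


def score(samples: List[Sample], model: Callable[[str], str]) -> float:
    # Fraction of samples the model answers correctly.
    hits = sum(is_correct(model(s.question), s.references) for s in samples)
    return hits / len(samples) if samples else 0.0


if __name__ == "__main__":
    # Toy run with a stub "model" that returns a fixed string;
    # a real evaluation would query an LMM with the image and question.
    data = [Sample("What text is in the image?", ["OCRBench"], "Text Recognition")]
    print(score(data, lambda q: "The image says: ocr bench"))  # prints 1.0
```

Substring matching after normalization is a deliberately lenient choice: it tolerates the free-form phrasing that large multimodal models produce while still requiring the target text to be read correctly.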