bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents
Despite the existence of numerous Optical Character Recognition (OCR) tools, the lack of comprehensive open-source systems hampers the progress of document digitization in various low-resource languages, including Bengali. Low-resource languages, especially those with an alphasyllabary writing syste...
Gespeichert in:
Hauptverfasser: | , , , , , , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Despite the existence of numerous Optical Character Recognition (OCR) tools,
the lack of comprehensive open-source systems hampers the progress of document
digitization in various low-resource languages, including Bengali. Low-resource
languages, especially those with an alphasyllabary writing system, suffer from
the lack of large-scale datasets for various document OCR components such as
word-level OCR, document layout extraction, and distortion correction; which
are available as individual modules in high-resource languages. In this paper,
we introduce Bengali$.$AI-BRACU-OCR (bbOCR): an open-source scalable document
OCR system that can reconstruct Bengali documents into a structured searchable
digitized format that leverages a novel Bengali text recognition model and two
novel synthetic datasets. We present extensive component-level and system-level
evaluation: both use a novel diversified evaluation dataset and comprehensive
evaluation metrics. Our extensive evaluation suggests that our proposed
solution is preferable over the current state-of-the-art Bengali OCR systems.
The source codes and datasets are available here:
https://bengaliai.github.io/bbocr. |
---|---|
DOI: | 10.48550/arxiv.2308.10647 |