MULDT: Multilingual Ultra-Lightweight Document Text Detection for Embedded Devices

Bibliographic details
Published in: IEEE Access, 2024, Vol. 12, p. 170530-170540
Main authors: Gayer, Alexander; Arlazarov, Vladimir V.
Format: Article
Language: English
Description
Summary: Recent research on text detection has focused on scenes "in the wild", while there is still a demand for a fast and high-quality model for the document domain. Since document OCR is often run on embedded devices such as smartphones, scanners or even AR glasses, this domain also imposes strict limits on the computational complexity, RAM usage, and power consumption of the models. In this work, we present MULDT, a multilingual ultra-lightweight document text detector for scans and photos of documents of any type and appearance, such as business documents or ID cards. Taking into account the main features of text detection on documents and the limitations of the task compared to text detection in the wild, we obtained an extremely small yet high-quality model. Extensive experiments on publicly available document datasets (MIDV-500, MIDV-2019, MIDV-2020, MIDV-LAIT, UrduDoc, SVRD, FUNSD, SROIE, and XFUND) demonstrate the versatility and quality of the proposed approach. MULDT achieves a recall of 0.826 on MIDV-2019, compared to 0.792 for PaddleOCR and 0.718 for CRAFT, while being comparable or sometimes better in quality on the other datasets. At the same time, MULDT is 89 times smaller than CRAFT, and 4 times smaller and 40% faster than PaddleOCR, making it an excellent choice for embedded devices. With 240k trainable parameters and a size of 890 KB, MULDT finds text in a 905 × 1280 image in just 129 milliseconds on the iPhone SE (2020) and 867 milliseconds on the iPhone 6 (2014).
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3474616