A knowledge-based recognition system for historical Mongolian documents

Bibliographic details
Published in: International journal on document analysis and recognition, 2016-09, Vol. 19 (3), p. 221-235
Main authors: Su, Xiangdong; Gao, Guanglai; Wei, Hongxi; Bao, Feilong
Format: Article
Language: English
Description
Summary: This paper proposes a knowledge-based system to recognize historical Mongolian documents in which the words exhibit remarkable variation and character overlapping. According to the characteristics of Mongolian word formation, the system combines a holistic scheme and a segmentation-based scheme for word recognition. Several types of words and isolated suffixes that cannot be segmented into glyph-units or do not require segmentation are recognized using the holistic scheme. The remaining words are recognized using the segmentation-based scheme, which is the focus of this paper. We exploit the knowledge of the glyph characteristics to segment words into glyph-units in the segmentation-based scheme. Convolutional neural networks are employed not only for word recognition in the holistic scheme, but also for glyph-unit recognition in the segmentation-based scheme. Based on the analysis of recognition errors in the segmentation-based scheme, the system is enhanced by integrating three strategies into glyph-unit recognition. These strategies involve incorporating baseline information, glyph-unit grouping, and recognizing under-segmented and over-segmented fragments. The proposed system achieves 80.86% word accuracy on the Mongolian Kanjur test samples.
ISSN: 1433-2833, 1433-2825
DOI: 10.1007/s10032-016-0267-1