Neural OCR Post-Hoc Correction of Historical Corpora


Bibliographic details
Published in: Transactions of the Association for Computational Linguistics, 2021-01, Vol. 9, pp. 479-493
Main authors: Lyu, Lijun; Koutraki, Maria; Krickl, Martin; Fetahu, Besnik
Format: Article
Language: English
Description
Abstract: Optical character recognition (OCR) is crucial for deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated by low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors. At the character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
ISSN: 2307-387X
DOI:10.1162/tacl_a_00379
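
The abstract's closing figure (a word error rate of 32.3% reduced by more than 89%) is easier to read with the arithmetic spelled out. The following is a minimal sketch, not the authors' evaluation code: it assumes the common definition of word error rate as word-level edit distance divided by the number of reference words, and it assumes the 89% is a relative reduction. The function names and the sample strings are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Assumption: this is the textbook WER definition; the paper may differ in detail.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Before processing reference word i, prev[j] is the edit distance
    # between the first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1] / len(ref)


def wer_after_relative_reduction(baseline_wer: float, relative_reduction: float) -> float:
    """Residual WER, assuming the reduction is relative to the baseline."""
    return baseline_wer * (1.0 - relative_reduction)


if __name__ == "__main__":
    # Made-up OCR output with two misread words out of three.
    print(word_error_rate("die Stadt Wien", "dic Stadt Wicn"))
    # Under the relative-reduction reading, 32.3% reduced by 89% leaves about 3.55%.
    print(wer_after_relative_reduction(0.323, 0.89))
```

Read this way, "more than 89%" would correspond to a residual word error rate below roughly 3.6%; if the authors meant something other than a relative reduction, the second function does not apply.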