OCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online Access: | Order full text |
Summary: | Since the dawn of the computing era, information has been represented
digitally so that it can be processed by electronic computers. Paper books and
documents were abundant and being widely published at that time; hence,
there was a need to convert them into digital format. OCR, short for Optical
Character Recognition, was conceived to translate paper-based books into digital
e-books. Regrettably, OCR systems remain error-prone: they produce
misspellings in the recognized text, especially when the source
document is of low printing quality. This paper proposes a post-processing,
context-sensitive OCR error correction method for detecting and correcting non-word
and real-word OCR errors. The cornerstone of the proposed approach is the use
of the Google Web 1T 5-gram data set as a dictionary of words to spell-check OCR
text. The Google data set incorporates a very large vocabulary and word
statistics drawn entirely from the Internet, making it a reliable source for
dictionary-based error correction. The core of the proposed solution is
a combination of three algorithms: the error detection, candidate spelling
generation, and error correction algorithms, all of which exploit information
extracted from the Google Web 1T 5-gram data set. Experiments conducted on scanned
images written in different languages showed a substantial improvement in the
OCR error correction rate. As a future development, the proposed algorithm is to
be parallelised so as to support parallel and distributed computing
architectures. |
---|---|
DOI: | 10.48550/arxiv.1204.0188 |
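
The three-stage pipeline named in the summary (error detection, candidate generation, context-based correction) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the paper's implementation: the counts in `UNIGRAMS` and `NGRAMS` are invented toy values standing in for the Google Web 1T data, candidates are generated by single-character edits (a swapped-in technique, since the summary does not say how candidates are produced), the context window is three words rather than five for brevity, and only non-word errors are handled. All function names are hypothetical.

```python
from collections import Counter

# Toy stand-ins for Google Web 1T counts (invented values, not real data).
UNIGRAMS = Counter({"the": 1000, "cat": 50, "car": 60, "sat": 40, "on": 800, "mat": 30})
NGRAMS = Counter({("the", "cat", "sat"): 12, ("the", "car", "sat"): 1})

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def detect_errors(tokens, unigrams):
    """Return positions of tokens that are missing from the unigram vocabulary."""
    return [i for i, tok in enumerate(tokens) if tok not in unigrams]


def edits1(word):
    """All strings one deletion, substitution, or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | replaces | inserts


def generate_candidates(word, unigrams):
    """Keep only the one-edit variants that occur in the vocabulary."""
    return sorted(w for w in edits1(word) if w in unigrams)


def correct(tokens, unigrams, ngrams):
    """Replace each detected error with the candidate that best fits its context."""
    tokens = list(tokens)
    for i in detect_errors(tokens, unigrams):
        candidates = generate_candidates(tokens[i], unigrams)
        if not candidates:
            continue  # nothing plausible found; leave the token unchanged

        def score(candidate):
            # Context = previous word, candidate, next word (where they exist).
            context = tuple(tokens[max(0, i - 1):i] + [candidate] + tokens[i + 1:i + 2])
            # Rank by context count first, then by unigram count as a tie-breaker.
            return (ngrams[context], unigrams[candidate])

        tokens[i] = max(candidates, key=score)
    return tokens


ocr_output = ["the", "cal", "sat", "on", "the", "mat"]
corrected = correct(ocr_output, UNIGRAMS, NGRAMS)
```

In this toy example "cal" has two in-vocabulary candidates, "car" and "cat". Although "car" has the higher unigram count, the context count for "the cat sat" is larger, so the context-sensitive score selects "cat".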