Optical Character Recognition and text cleaning in the indigenous South African languages

This article represents follow-up work on unpublished presentations by the authors of text and corpus cleaning strategies for the African languages. In this article we provide a comparative description of cleaning of web-sourced and text-sourced material to be used for the compilation of corpora wit...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Stellenbosch papers in linguistics plus 2022-01, Vol.64
Hauptverfasser: Prinsloo, Danie, Taljard, Elsabé, Goosen, Michelle
Format: Artikel
Sprache:eng
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:This article represents follow-up work on unpublished presentations by the authors of text and corpus cleaning strategies for the African languages. In this article we provide a comparative description of cleaning of web-sourced and text-sourced material to be used for the compilation of corpora with specific attention to cleaning of text-based material, since this is particularly relevant for the indigenous South African languages. For the purposes of this study, we use the term “web-sourced material” to refer to digital data sourced from the internet, whereas “text-based material” refers to hard copy textual material. We identify the different types of errors found in such texts, looking specifically at typical scanning errors in these languages, followed by an evaluation of three commercially available Optical Character Recognition (OCR) tools. We argue that the cleanness of texts is a matter of granularity, depending on the envisaged application of the corpus comprised by the texts. Text corpora which are to be utilized for e.g. lexicographic purposes can tolerate a higher level of ‘noise’ than those used for the compilation of e.g. spelling and grammar checkers. We conclude with some suggestions for text cleaning for the indigenous languages of South Africa.
ISSN:2224-3380
2224-3380
DOI:10.5842/64-1-867