Real-Word Errors in Arabic Texts: A Better Algorithm for Detection and Correction

Bibliographic details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019-08, Vol. 27 (8), pp. 1308-1320
Main authors: Azmi, Aqil M., Almutery, Manal N., Aboalsamh, Hatim A.
Format: Article
Language: English
Description
Abstract: A real-word (also known as semantic or context-sensitive) spelling error is a class of error that escapes the typical spell checker, which relies on dictionary look-up. This kind of error occurs when a user mistakenly types a correctly spelled word when another is intended, e.g., "I want a peace (piece) of cake." Further, these errors commonly arise in text written by people with dyslexia. Real-word errors are harder to detect because the context must be considered. In this paper, we propose a spell checker that detects and corrects real-word errors for the Arabic language. Our method avoids predefined confusion sets, a simple approach used by many works tackling this problem, which limits the list of words that can be detected and corrected. Thus, our system can detect and correct a larger set of real-word errors. For the detection phase, we employ word and stem n-gram (n = 1 to 3) language models along with machine learning, achieving a precision and recall of 83.5% and 99.2%, respectively. For the correction phase, we use n-grams, which results in an accuracy of 98%. Our scheme is robust, performing well even when the percentage of real-word errors is high. This makes the system suitable for handling errors in Arabic text produced by OCR.
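The general idea of detecting and correcting real-word errors from context, without a predefined confusion set, can be illustrated with a small sketch. This is not the authors' system: it assumes a toy English corpus, word bigrams only (no stems, no trigrams), add-one smoothing, candidates drawn from the whole vocabulary by edit distance, and a simple score ratio in place of the machine-learning detector mentioned in the abstract. All names (`train`, `context_score`, `correct`, `max_dist`, `ratio`) are hypothetical.

```python
from collections import Counter


def train(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def context_score(word, left, right, unigrams, bigrams):
    """P(word | left) * P(right | word) with add-one smoothing."""
    v = len(unigrams)
    p_left = (bigrams[(left, word)] + 1) / (unigrams[left] + v)
    p_right = (bigrams[(word, right)] + 1) / (unigrams[word] + v)
    return p_left * p_right


def correct(sentence, unigrams, bigrams, max_dist=2, ratio=2.0):
    """Replace a word when a nearby vocabulary word fits the context
    at least `ratio` times better (assumed threshold, not from the paper)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab = [w for w in unigrams if w not in ("<s>", "</s>")]
    out = []
    for i in range(1, len(tokens) - 1):
        word, left, right = tokens[i], tokens[i - 1], tokens[i + 1]
        best = word
        best_score = ratio * context_score(word, left, right, unigrams, bigrams)
        for cand in vocab:
            if cand != word and edit_distance(word, cand) <= max_dist:
                score = context_score(cand, left, right, unigrams, bigrams)
                if score > best_score:
                    best, best_score = cand, score
        out.append(best)
    return " ".join(out)


# Toy training data (assumption: stands in for a large Arabic corpus).
corpus = [
    "i want a piece of cake",
    "she ate a piece of cake",
    "a piece of bread",
    "we want peace in the world",
    "they live in peace",
]
unigrams, bigrams = train(corpus)
print(correct("i want a peace of cake", unigrams, bigrams))
```

Because candidates come from the whole vocabulary rather than a fixed list of confusable pairs, any in-vocabulary word close to the typed word can be proposed. The cost is that every token is compared against every vocabulary word, so a real system would need an index or a tighter candidate filter.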
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/TASLP.2019.2918404