A hybrid model for spelling error detection and correction for Urdu language

Bibliographic details
Published in: Neural computing & applications 2021-11, Vol.33 (21), p.14707-14721
Main authors: Aziz, Romila; Anwar, Muhammad Waqas; Jamal, Muhammad Hasan; Bajwa, Usama Ijaz
Format: Article
Language: English
Description
Summary: Detecting and correcting misspelled words in written text is of great importance in many natural language processing applications. Errors can be broadly classified into two groups: spelling errors and contextual errors. Spelling errors occur when the misspelled words do not exist in a dictionary and are meaningless, while contextual errors occur when the words do exist in the dictionary but are not used as the writer intended. This paper presents an “Urdu Spell Checker” that detects incorrectly spelled words using the widely used lexicon lookup approach and provides a list of correctly spelled candidate words by applying the edit distance technique, which covers all types of spelling errors. To identify the best candidate word, the paper proposes a hybrid model that ranks the words in the candidate list. Multiple ranking techniques, namely Soundex, Shapex, LCS and N-gram, are used standalone as well as in combination to determine the best technique in terms of F1 score. A dictionary containing 48,551 words is developed from the UMC corpus and an Urdu newspaper corpus. The hybrid model achieves an F1 score of 94.02% when considering the top five suggested words and an F1 score of 88.29% when considering only the top suggested word.
ISSN: 0941-0643, 1433-3058
DOI: 10.1007/s00521-021-06110-7
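
The pipeline outlined in the summary (lexicon lookup for detection, edit distance for candidate generation, ranking of the candidate list) can be illustrated with a minimal sketch. This is not the authors' implementation. Several things here are assumptions made for illustration: the function names, the toy Latin-script lexicon standing in for the 48,551-word Urdu dictionary, the use of plain Levenshtein distance with a threshold of 2, and the ranking, which uses only LCS length with edit distance as a tie-breaker rather than the paper's hybrid of Soundex, Shapex, LCS and N-gram.

```python
def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def suggest(word, lexicon, k=5, max_distance=2):
    """Return up to k ranked suggestions, or [] if the word is in the lexicon."""
    # Detection: lexicon lookup. A word found in the lexicon is not flagged.
    if word in lexicon:
        return []
    # Candidate generation: lexicon words within max_distance edits
    # (the threshold of 2 is an assumption, not a value from the paper).
    cands = [w for w in lexicon if edit_distance(word, w) <= max_distance]
    # Ranking: longer LCS first, then smaller edit distance, then alphabetical.
    # This stands in for the paper's hybrid ranking, which is not reproduced here.
    cands.sort(key=lambda w: (-lcs_length(word, w), edit_distance(word, w), w))
    return cands[:k]


# Hypothetical toy lexicon; the paper's dictionary is Urdu.
LEXICON = {"book", "back", "boot", "look", "cook", "table"}

if __name__ == "__main__":
    print(suggest("bok", LEXICON, k=1))
    print(suggest("bok", LEXICON, k=5))
```

Passing `k=1` or `k=5` mirrors the top-one and top-five evaluation settings mentioned in the summary. Scanning the whole lexicon for each query is fine for a sketch but would likely be replaced by an index in a real system.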