Building a Multilevel Inflection Handling Stemmer to Improve Search Effectiveness for Urdu Language
Published in: | IEEE Access 2024, Vol. 12, p. 39313-39329 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | English |
Summary: | Stemming is an essential step in many Natural Language Processing (NLP) applications. It reduces variant forms of query words to a standard form, which avoids the vocabulary mismatch problem in Information Retrieval (IR) systems. Because of Urdu's specific grammatical rules and complex morphological structures, finding an effective stemming algorithm for the language is challenging. Although several stemming algorithms have been proposed for Urdu text, none of them extracts the stem from multilevel inflected forms. To the best of our knowledge, this is the first effort to propose and evaluate a novel Urdu Text Stemmer (UTS) that can handle multilevel inflected forms in Urdu text. The proposed scheme was evaluated experimentally on custom-developed text-based and word-based corpora and compared with state-of-the-art stemming algorithms. The results show that UTS outperforms existing Urdu stemmers, achieving an accuracy of 94.92% on the word corpus and 91.8% on the text corpus. We also evaluated the proposed system in an Information Retrieval application for Urdu, using the Collection for Urdu Retrieval Evaluation (CURE) dataset, where it improved both recall and precision. |
---|---|
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2024.3373714 |