UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language

In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language: many words are formed by adding suffixes, and the inventory of suffixes is large. This makes it difficult to find the stem of a word. The methodology is proposed for doing...

Bibliographic details
Main authors: Sharipov, Maksud; Yuldashov, Ollabergan
Format: Article
Language: eng
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Sharipov, Maksud
Yuldashov, Ollabergan
description In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language: many words are formed by adding suffixes, and the inventory of suffixes is large. This makes it difficult to find the stem of a word. The proposed methodology stems Uzbek words with an affix-stripping approach and does not rely on any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes, and a finite state machine (FSM) is designed for each class according to morphological rules. The fifteen FSMs are linked together to form the Basic FSM. A lexicon of affixes in XML format was created, and a stemming application for Uzbek words was developed based on the FSMs.
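The affix-stripping idea in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the XML layout, the three suffix classes, the `order` attribute, and the names `LEXICON_XML`, `load_classes`, and `stem` are all assumptions made for illustration, and the fifteen linked FSMs are replaced here by a much simpler linear chain of states, one per suffix class, visited from the outermost suffix inward. The suffixes listed are common Uzbek case, possessive, and plural endings, not the paper's lexicon.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon layout: the record does not show the paper's XML schema.
# "order" gives the position of each class counted from the end of the word.
LEXICON_XML = """
<affixes>
  <class name="case" order="1">
    <affix>ning</affix><affix>dan</affix><affix>ni</affix>
    <affix>ga</affix><affix>da</affix>
  </class>
  <class name="possessive" order="2">
    <affix>imiz</affix><affix>ingiz</affix><affix>im</affix><affix>ing</affix>
  </class>
  <class name="plural" order="3">
    <affix>lar</affix>
  </class>
</affixes>
"""


def load_classes(xml_text):
    """Parse the lexicon into a list of (class name, affixes), outermost class first."""
    root = ET.fromstring(xml_text)
    ordered = sorted(root.findall("class"), key=lambda c: int(c.get("order")))
    classes = []
    for cls in ordered:
        # Longest affix first, so that "ning" is tried before "ni".
        affixes = sorted((a.text for a in cls.findall("affix")), key=len, reverse=True)
        classes.append((cls.get("name"), affixes))
    return classes


def stem(word, classes, min_stem=2):
    """Strip at most one affix per class, moving from the word end toward the stem."""
    for _name, affixes in classes:
        for affix in affixes:
            # Refuse to strip if the remainder would be shorter than min_stem.
            if word.endswith(affix) and len(word) - len(affix) >= min_stem:
                word = word[: -len(affix)]
                break  # move on to the next state (next suffix class)
    return word


classes = load_classes(LEXICON_XML)
print(stem("kitoblarimizdan", classes))  # kitob + lar + imiz + dan
print(stem("maktabda", classes))         # maktab + da
```

Because the sketch, like the approach in the description, uses no dictionary of word forms, it cannot tell a real suffix from a stem that happens to end in the same letters; a short word such as "dada" would be over-stemmed here. The `min_stem` guard is a crude safeguard introduced for this example only.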
doi_str_mv 10.48550/arxiv.2210.16011
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2210_16011</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2210_16011</sourcerecordid><originalsourceid>FETCH-LOGICAL-a671-f028b2e2518af4c79c3d94cddf1700956912720c9e853c6e957c98fb16e0fe603</originalsourceid><addsrcrecordid>eNotj81Kw0AURmfjQqoP4Mp5gdQ7k8yfu1p_IVDQug43kzsxmEnKNC3q06vR1QeHjwOHsQsBy8IqBVeYPrrjUsofIDQIcco2r181vb9MFCOla35LR-rHXaRh4mPgyJ8PPWU3uKeGz6duaPmqb8fUTW-RhzHxWcBLHNoDtnTGTgL2ezr_3wXb3t9t149ZuXl4Wq_KDLURWQBpa0lSCYuh8Mb5vHGFb5ogDIBT2glpJHhHVuVek1PGOxtqoQkCacgX7PJPOxdVu9RFTJ_Vb1k1l-XfhKRIVg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language</title><source>arXiv.org</source><creator>Sharipov, Maksud ; Yuldashov, Ollabergan</creator><creatorcontrib>Sharipov, Maksud ; Yuldashov, Ollabergan</creatorcontrib><description>In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language, so many words are formed by adding suffixes, and the number of suffixes is also large. For this reason, it is difficult to find a stem of words. The methodology is proposed for doing the stemming of the Uzbek words with an affix stripping approach whereas not including any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes and designed as finite state machines (FSMs) for each class according to morphological rules. We created fifteen FSMs and linked them together to create the Basic FSM. 
A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs.</description><identifier>DOI: 10.48550/arxiv.2210.16011</identifier><language>eng</language><subject>Computer Science - Computation and Language</subject><creationdate>2022-10</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2210.16011$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2210.16011$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Sharipov, Maksud</creatorcontrib><creatorcontrib>Yuldashov, Ollabergan</creatorcontrib><title>UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language</title><description>In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language, so many words are formed by adding suffixes, and the number of suffixes is also large. For this reason, it is difficult to find a stem of words. The methodology is proposed for doing the stemming of the Uzbek words with an affix stripping approach whereas not including any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes and designed as finite state machines (FSMs) for each class according to morphological rules. We created fifteen FSMs and linked them together to create the Basic FSM. 
A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs.</description><subject>Computer Science - Computation and Language</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj81Kw0AURmfjQqoP4Mp5gdQ7k8yfu1p_IVDQug43kzsxmEnKNC3q06vR1QeHjwOHsQsBy8IqBVeYPrrjUsofIDQIcco2r181vb9MFCOla35LR-rHXaRh4mPgyJ8PPWU3uKeGz6duaPmqb8fUTW-RhzHxWcBLHNoDtnTGTgL2ezr_3wXb3t9t149ZuXl4Wq_KDLURWQBpa0lSCYuh8Mb5vHGFb5ogDIBT2glpJHhHVuVek1PGOxtqoQkCacgX7PJPOxdVu9RFTJ_Vb1k1l-XfhKRIVg</recordid><startdate>20221028</startdate><enddate>20221028</enddate><creator>Sharipov, Maksud</creator><creator>Yuldashov, Ollabergan</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20221028</creationdate><title>UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language</title><author>Sharipov, Maksud ; Yuldashov, Ollabergan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a671-f028b2e2518af4c79c3d94cddf1700956912720c9e853c6e957c98fb16e0fe603</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Computer Science - Computation and Language</topic><toplevel>online_resources</toplevel><creatorcontrib>Sharipov, Maksud</creatorcontrib><creatorcontrib>Yuldashov, Ollabergan</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Sharipov, Maksud</au><au>Yuldashov, Ollabergan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language</atitle><date>2022-10-28</date><risdate>2022</risdate><abstract>In this paper we present 
a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language, so many words are formed by adding suffixes, and the number of suffixes is also large. For this reason, it is difficult to find a stem of words. The methodology is proposed for doing the stemming of the Uzbek words with an affix stripping approach whereas not including any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes and designed as finite state machines (FSMs) for each class according to morphological rules. We created fifteen FSMs and linked them together to create the Basic FSM. A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs.</abstract><doi>10.48550/arxiv.2210.16011</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2210.16011
ispartof
issn
language eng
recordid cdi_arxiv_primary_2210_16011
source arXiv.org
subjects Computer Science - Computation and Language
title UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-12T00%3A21%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=UzbekStemmer:%20Development%20of%20a%20Rule-Based%20Stemming%20Algorithm%20for%20Uzbek%20Language&rft.au=Sharipov,%20Maksud&rft.date=2022-10-28&rft_id=info:doi/10.48550/arxiv.2210.16011&rft_dat=%3Carxiv_GOX%3E2210_16011%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true