UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language

In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language, so many words are formed by adding suffixes, and the number of suffixes is large. For this reason, it is difficult to find the stem of a word. The methodology is proposed for doing...

Bibliographic details
Main authors:	Sharipov, Maksud; Yuldashov, Ollabergan
Format:	Article
Language:	eng
Subjects:	Computer Science - Computation and Language

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Sharipov, Maksud ; Yuldashov, Ollabergan
description	In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language: many words are formed by adding suffixes, and the number of suffixes is large, which makes it difficult to find the stem of a word. We propose a methodology for stemming Uzbek words with an affix-stripping approach that does not rely on any database of normal word forms of the Uzbek language. Word affixes are classified into fifteen classes, and a finite state machine (FSM) is designed for each class according to morphological rules. We created fifteen FSMs and linked them together to create the Basic FSM. A lexicon of affixes in XML format was created, and a stemming application for Uzbek words has been developed based on the FSMs.
doi_str_mv	10.48550/arxiv.2210.16011
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2210_16011</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2210_16011</sourcerecordid><originalsourceid>FETCH-LOGICAL-a671-f028b2e2518af4c79c3d94cddf1700956912720c9e853c6e957c98fb16e0fe603</originalsourceid><addsrcrecordid>eNotj81Kw0AURmfjQqoP4Mp5gdQ7k8yfu1p_IVDQug43kzsxmEnKNC3q06vR1QeHjwOHsQsBy8IqBVeYPrrjUsofIDQIcco2r181vb9MFCOla35LR-rHXaRh4mPgyJ8PPWU3uKeGz6duaPmqb8fUTW-RhzHxWcBLHNoDtnTGTgL2ezr_3wXb3t9t149ZuXl4Wq_KDLURWQBpa0lSCYuh8Mb5vHGFb5ogDIBT2glpJHhHVuVek1PGOxtqoQkCacgX7PJPOxdVu9RFTJ_Vb1k1l-XfhKRIVg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language</title><source>arXiv.org</source><creator>Sharipov, Maksud ; Yuldashov, Ollabergan</creator><creatorcontrib>Sharipov, Maksud ; Yuldashov, Ollabergan</creatorcontrib><description>In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language, so many words are formed by adding suffixes, and the number of suffixes is also large. For this reason, it is difficult to find a stem of words. The methodology is proposed for doing the stemming of the Uzbek words with an affix stripping approach whereas not including any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes and designed as finite state machines (FSMs) for each class according to morphological rules. We created fifteen FSMs and linked them together to create the Basic FSM. 
A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs.</description><identifier>DOI: 10.48550/arxiv.2210.16011</identifier><language>eng</language><subject>Computer Science - Computation and Language</subject><creationdate>2022-10</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2210.16011$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2210.16011$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Sharipov, Maksud</creatorcontrib><creatorcontrib>Yuldashov, Ollabergan</creatorcontrib><title>UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language</title><description>In this paper we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language, so many words are formed by adding suffixes, and the number of suffixes is also large. For this reason, it is difficult to find a stem of words. The methodology is proposed for doing the stemming of the Uzbek words with an affix stripping approach whereas not including any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes and designed as finite state machines (FSMs) for each class according to morphological rules. We created fifteen FSMs and linked them together to create the Basic FSM. 
A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs.</description><subject>Computer Science - Computation and Language</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj81Kw0AURmfjQqoP4Mp5gdQ7k8yfu1p_IVDQug43kzsxmEnKNC3q06vR1QeHjwOHsQsBy8IqBVeYPrrjUsofIDQIcco2r181vb9MFCOla35LR-rHXaRh4mPgyJ8PPWU3uKeGz6duaPmqb8fUTW-RhzHxWcBLHNoDtnTGTgL2ezr_3wXb3t9t149ZuXl4Wq_KDLURWQBpa0lSCYuh8Mb5vHGFb5ogDIBT2glpJHhHVuVek1PGOxtqoQkCacgX7PJPOxdVu9RFTJ_Vb1k1l-XfhKRIVg</recordid><startdate>20221028</startdate><enddate>20221028</enddate><creator>Sharipov, Maksud</creator><creator>Yuldashov, Ollabergan</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20221028</creationdate><title>UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language</title><author>Sharipov, Maksud ; Yuldashov, Ollabergan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a671-f028b2e2518af4c79c3d94cddf1700956912720c9e853c6e957c98fb16e0fe603</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Computer Science - Computation and Language</topic><toplevel>online_resources</toplevel><creatorcontrib>Sharipov, Maksud</creatorcontrib><creatorcontrib>Yuldashov, Ollabergan</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Sharipov, Maksud</au><au>Yuldashov, Ollabergan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language</atitle><date>2022-10-28</date><risdate>2022</risdate><abstract>In this paper we present 
a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language, so many words are formed by adding suffixes, and the number of suffixes is also large. For this reason, it is difficult to find a stem of words. The methodology is proposed for doing the stemming of the Uzbek words with an affix stripping approach whereas not including any database of the normal word forms of the Uzbek language. Word affixes are classified into fifteen classes and designed as finite state machines (FSMs) for each class according to morphological rules. We created fifteen FSMs and linked them together to create the Basic FSM. A lexicon of affixes in XML format was created and a stemming application for Uzbek words has been developed based on the FSMs.</abstract><doi>10.48550/arxiv.2210.16011</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2210.16011
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2210_16011
source	arXiv.org
subjects	Computer Science - Computation and Language
title	UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-12T00%3A21%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=UzbekStemmer:%20Development%20of%20a%20Rule-Based%20Stemming%20Algorithm%20for%20Uzbek%20Language&rft.au=Sharipov,%20Maksud&rft.date=2022-10-28&rft_id=info:doi/10.48550/arxiv.2210.16011&rft_dat=%3Carxiv_GOX%3E2210_16011%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
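
The description field above mentions affix stripping driven by finite state machines and an affix lexicon stored in XML. The following is only a minimal sketch of that general idea, not the authors' implementation: the XML layout, the three affix classes (the paper uses fifteen), the stripping order, and the names `LEXICON_XML`, `STRIP_ORDER`, `MIN_STEM_LEN`, `load_lexicon`, and `stem` are all assumptions made for illustration. The suffixes shown are common Uzbek noun endings.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon layout; the schema used in the paper is not given in this record.
LEXICON_XML = """
<affixes>
  <class name="case">
    <affix>ning</affix>
    <affix>dan</affix>
    <affix>ni</affix>
    <affix>ga</affix>
    <affix>da</affix>
  </class>
  <class name="possessive">
    <affix>ingiz</affix>
    <affix>imiz</affix>
    <affix>ing</affix>
    <affix>im</affix>
  </class>
  <class name="plural">
    <affix>lar</affix>
  </class>
</affixes>
"""

# Assumed chain of states, visited from the end of the word inward.
# Each state may strip at most one affix of its class, or strip nothing.
STRIP_ORDER = ["case", "possessive", "plural"]

# Assumed guard against stripping a word down to almost nothing.
MIN_STEM_LEN = 2


def load_lexicon(xml_text):
    """Parse the XML lexicon into {class name: affixes, longest first}."""
    root = ET.fromstring(xml_text)
    lexicon = {}
    for cls in root.findall("class"):
        affixes = [a.text.strip() for a in cls.findall("affix")]
        lexicon[cls.get("name")] = sorted(affixes, key=len, reverse=True)
    return lexicon


def stem(word, lexicon):
    """Strip at most one affix per class, following STRIP_ORDER."""
    current = word.lower()
    for cls_name in STRIP_ORDER:
        for affix in lexicon.get(cls_name, []):
            remaining = len(current) - len(affix)
            if current.endswith(affix) and remaining >= MIN_STEM_LEN:
                current = current[:remaining]
                break  # move on to the next state
    return current


if __name__ == "__main__":
    lexicon = load_lexicon(LEXICON_XML)
    print(stem("kitoblarimni", lexicon))  # kitob (kitob + lar + im + ni)
    print(stem("maktabda", lexicon))      # maktab (maktab + da)
```

Because no dictionary of word forms is consulted, a sketch like this will over-strip any stem that happens to end in a listed affix; the minimum-length guard only limits that risk and does not remove it.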