Automatic Extraction of Indonesian Stopwords

The rapid growth of the Indonesian language content on the Internet has drawn researchers’ attention. By using natural language processing, they can extract high-value information from such content and documents. However, processing large and numerous documents is very time-consuming and computation...

Bibliographic details
Published in:	International journal of advanced computer science & applications 2023, Vol.14 (2)
Main authors:	Achsan, Harry Tursulistyono Yani ; Suhartanto, Heru ; Wibowo, Wahyu Catur ; Dewi, Deshinta A. ; Ismed, Khairul
Format:	Article
Language:	eng
Subjects:	Documents ; Language ; Natural language processing ; Words (language)
Online access:	Full text

container_end_page
container_issue	2
container_start_page
container_title	International journal of advanced computer science & applications
container_volume	14
creator	Achsan, Harry Tursulistyono Yani ; Suhartanto, Heru ; Wibowo, Wahyu Catur ; Dewi, Deshinta A. ; Ismed, Khairul
description	The rapid growth of the Indonesian language content on the Internet has drawn researchers’ attention. By using natural language processing, they can extract high-value information from such content and documents. However, processing large and numerous documents is very time-consuming and computationally expensive. Reducing these computational costs requires attribute reduction by removing some common words or stopwords. This research aims to extract stopwords automatically from a large corpus, about seven million words, in the Indonesian language downloaded from the web. The problem is that Indonesian is a low-resource language, making it challenging to develop an automatic stopword extractor. The method used is Term Frequency – Inverse Document Frequency (TF-IDF) and presents a methodology for ranking stopwords using TFs and IDFs, which is applicable to even a small corpus (as low as one document). It is an automatic method that can be applied to many different languages with no prior linguistic knowledge required. There are two novelties or contributions in this method: it can show all words found in all documents, and it has an automatic cut-off technique for selecting the top rank of stopwords candidates in the Indonesian language, overcoming one of the most challenging aspects of stopwords extraction.
doi_str_mv	10.14569/IJACSA.2023.0140221
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2791786186</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2791786186</sourcerecordid><originalsourceid>FETCH-LOGICAL-c274t-5e1ac686977d544469352e039d24e1750dfd65501eadafae679de041ea167af63</originalsourceid><addsrcrecordid>eNotkEtLxDAUhYMoOIzzD1wU3Nqam2e7LGV0KgMuRsFdCE0CHZymJinqv7fzOJt7DhzugQ-he8AFMC6qp_a1bnZ1QTChBQaGCYErtCDARc65xNcnX-aA5ectWsW4x7NoRURJF-ixnpI_6NR32fo3Bd2l3g-Zd1k7GD_Y2Osh2yU__vhg4h26cfor2tXlLtHH8_q92eTbt5e2qbd5RyRLObegO1GKSkrDGWOiopzYedIQZkFybJwRnGOw2minrZCVsZjNEYTUTtAlejj_HYP_nmxMau-nMMyTisgKZCmgPLbYudUFH2OwTo2hP-jwpwCrExp1RqOOaNQFDf0H75NVng</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2791786186</pqid></control><display><type>article</type><title>Automatic Extraction of Indonesian Stopwords</title><source>EZB-FREE-00999 freely available EZB journals</source><creator>Achsan, Harry Tursulistyono Yani ; Suhartanto, Heru ; Wibowo, Wahyu Catur ; Dewi, Deshinta A. ; Ismed, Khairul</creator><creatorcontrib>Achsan, Harry Tursulistyono Yani ; Suhartanto, Heru ; Wibowo, Wahyu Catur ; Dewi, Deshinta A. ; Ismed, Khairul</creatorcontrib><description>The rapid growth of the Indonesian language content on the Internet has drawn researchers’ attention. By using natural language processing, they can extract high-value information from such content and documents. However, processing large and numerous documents is very time-consuming and computationally expensive. Reducing these computational costs requires attribute reduction by removing some common words or stopwords. This research aims to extract stopwords automatically from a large corpus, about seven million words, in the Indonesian language downloaded from the web. The problem is that Indonesian is a low-resource language, making it challenging to develop an automatic stopword extractor. 
The method used is Term Frequency – Inverse Document Frequency (TF-IDF) and presents a methodology for ranking stopwords using TFs and IDFs, which is applicable to even a small corpus (as low as one document). It is an automatic method that can be applied to many different languages with no prior linguistic knowledge required. There are two novelties or contributions in this method: it can show all words found in all documents, and it has an automatic cut-off technique for selecting the top rank of stopwords candidates in the Indonesian language, overcoming one of the most challenging aspects of stopwords extraction.</description><identifier>ISSN: 2158-107X</identifier><identifier>EISSN: 2156-5570</identifier><identifier>DOI: 10.14569/IJACSA.2023.0140221</identifier><language>eng</language><publisher>West Yorkshire: Science and Information (SAI) Organization Limited</publisher><subject>Documents ; Language ; Natural language processing ; Words (language)</subject><ispartof>International journal of advanced computer science & applications, 2023, Vol.14 (2)</ispartof><rights>2023. This work is licensed under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,4024,27923,27924,27925</link.rule.ids></links><search><creatorcontrib>Achsan, Harry Tursulistyono Yani</creatorcontrib><creatorcontrib>Suhartanto, Heru</creatorcontrib><creatorcontrib>Wibowo, Wahyu Catur</creatorcontrib><creatorcontrib>Dewi, Deshinta A.</creatorcontrib><creatorcontrib>Ismed, Khairul</creatorcontrib><title>Automatic Extraction of Indonesian Stopwords</title><title>International journal of advanced computer science & applications</title><description>The rapid growth of the Indonesian language content on the Internet has drawn researchers’ attention. By using natural language processing, they can extract high-value information from such content and documents. However, processing large and numerous documents is very time-consuming and computationally expensive. Reducing these computational costs requires attribute reduction by removing some common words or stopwords. This research aims to extract stopwords automatically from a large corpus, about seven million words, in the Indonesian language downloaded from the web. The problem is that Indonesian is a low-resource language, making it challenging to develop an automatic stopword extractor. The method used is Term Frequency – Inverse Document Frequency (TF-IDF) and presents a methodology for ranking stopwords using TFs and IDFs, which is applicable to even a small corpus (as low as one document). It is an automatic method that can be applied to many different languages with no prior linguistic knowledge required. 
There are two novelties or contributions in this method: it can show all words found in all documents, and it has an automatic cut-off technique for selecting the top rank of stopwords candidates in the Indonesian language, overcoming one of the most challenging aspects of stopwords extraction.</description><subject>Documents</subject><subject>Language</subject><subject>Natural language processing</subject><subject>Words (language)</subject><issn>2158-107X</issn><issn>2156-5570</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>8G5</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><sourceid>GUQSH</sourceid><sourceid>M2O</sourceid><recordid>eNotkEtLxDAUhYMoOIzzD1wU3Nqam2e7LGV0KgMuRsFdCE0CHZymJinqv7fzOJt7DhzugQ-he8AFMC6qp_a1bnZ1QTChBQaGCYErtCDARc65xNcnX-aA5ectWsW4x7NoRURJF-ixnpI_6NR32fo3Bd2l3g-Zd1k7GD_Y2Osh2yU__vhg4h26cfor2tXlLtHH8_q92eTbt5e2qbd5RyRLObegO1GKSkrDGWOiopzYedIQZkFybJwRnGOw2minrZCVsZjNEYTUTtAlejj_HYP_nmxMau-nMMyTisgKZCmgPLbYudUFH2OwTo2hP-jwpwCrExp1RqOOaNQFDf0H75NVng</recordid><startdate>2023</startdate><enddate>2023</enddate><creator>Achsan, Harry Tursulistyono Yani</creator><creator>Suhartanto, Heru</creator><creator>Wibowo, Wahyu Catur</creator><creator>Dewi, Deshinta A.</creator><creator>Ismed, Khairul</creator><general>Science and Information (SAI) Organization 
Limited</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7XB</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>8G5</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>COVID</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>GUQSH</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>M2O</scope><scope>MBDVC</scope><scope>P5Z</scope><scope>P62</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>Q9U</scope></search><sort><creationdate>2023</creationdate><title>Automatic Extraction of Indonesian Stopwords</title><author>Achsan, Harry Tursulistyono Yani ; Suhartanto, Heru ; Wibowo, Wahyu Catur ; Dewi, Deshinta A. ; Ismed, Khairul</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c274t-5e1ac686977d544469352e039d24e1750dfd65501eadafae679de041ea167af63</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Documents</topic><topic>Language</topic><topic>Natural language processing</topic><topic>Words (language)</topic><toplevel>online_resources</toplevel><creatorcontrib>Achsan, Harry Tursulistyono Yani</creatorcontrib><creatorcontrib>Suhartanto, Heru</creatorcontrib><creatorcontrib>Wibowo, Wahyu Catur</creatorcontrib><creatorcontrib>Dewi, Deshinta A.</creatorcontrib><creatorcontrib>Ismed, Khairul</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Research Library (Alumni Edition)</collection><collection>ProQuest Central (Alumni 
Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>Coronavirus Research Database</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>Research Library Prep</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>Research Library</collection><collection>Research Library (Corporate)</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest Central Basic</collection><jtitle>International journal of advanced computer science & applications</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Achsan, Harry Tursulistyono Yani</au><au>Suhartanto, Heru</au><au>Wibowo, Wahyu Catur</au><au>Dewi, Deshinta A.</au><au>Ismed, Khairul</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Automatic Extraction of Indonesian Stopwords</atitle><jtitle>International journal of advanced computer science & applications</jtitle><date>2023</date><risdate>2023</risdate><volume>14</volume><issue>2</issue><issn>2158-107X</issn><eissn>2156-5570</eissn><abstract>The rapid growth of the Indonesian language 
content on the Internet has drawn researchers’ attention. By using natural language processing, they can extract high-value information from such content and documents. However, processing large and numerous documents is very time-consuming and computationally expensive. Reducing these computational costs requires attribute reduction by removing some common words or stopwords. This research aims to extract stopwords automatically from a large corpus, about seven million words, in the Indonesian language downloaded from the web. The problem is that Indonesian is a low-resource language, making it challenging to develop an automatic stopword extractor. The method used is Term Frequency – Inverse Document Frequency (TF-IDF) and presents a methodology for ranking stopwords using TFs and IDFs, which is applicable to even a small corpus (as low as one document). It is an automatic method that can be applied to many different languages with no prior linguistic knowledge required. There are two novelties or contributions in this method: it can show all words found in all documents, and it has an automatic cut-off technique for selecting the top rank of stopwords candidates in the Indonesian language, overcoming one of the most challenging aspects of stopwords extraction.</abstract><cop>West Yorkshire</cop><pub>Science and Information (SAI) Organization Limited</pub><doi>10.14569/IJACSA.2023.0140221</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 2158-107X
ispartof	International journal of advanced computer science & applications, 2023, Vol.14 (2)
issn	2158-107X 2156-5570
language	eng
recordid	cdi_proquest_journals_2791786186
source	EZB-FREE-00999 freely available EZB journals
subjects	Documents ; Language ; Natural language processing ; Words (language)
title	Automatic Extraction of Indonesian Stopwords
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T14%3A43%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Automatic%20Extraction%20of%20Indonesian%20Stopwords&rft.jtitle=International%20journal%20of%20advanced%20computer%20science%20&%20applications&rft.au=Achsan,%20Harry%20Tursulistyono%20Yani&rft.date=2023&rft.volume=14&rft.issue=2&rft.issn=2158-107X&rft.eissn=2156-5570&rft_id=info:doi/10.14569/IJACSA.2023.0140221&rft_dat=%3Cproquest_cross%3E2791786186%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2791786186&rft_id=info:pmid/&rfr_iscdi=true
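
The description field of this record says that stopword candidates are ranked using term frequencies (TF) and inverse document frequencies (IDF), followed by an automatic cut-off, and that the approach works even on a single document. The record does not give the scoring formula or the cut-off rule, so the following is only a minimal sketch under assumptions: the score (relative TF divided by a smoothed IDF), the largest-gap cut-off, the whitespace tokenizer, and the function names `rank_stopword_candidates` and `cut_at_largest_gap` are all hypothetical and are not taken from the article. The sample sentences are made up for illustration.

```python
import math
from collections import Counter


def rank_stopword_candidates(documents):
    """Return (word, score) pairs, most stopword-like first (assumed scoring)."""
    tokenized = [doc.lower().split() for doc in documents]  # naive tokenizer
    n_docs = len(tokenized)
    total_tokens = sum(len(tokens) for tokens in tokenized)

    term_freq = Counter()  # occurrences of each word across the corpus
    doc_freq = Counter()   # number of documents containing each word
    for tokens in tokenized:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    scores = {}
    for word, tf in term_freq.items():
        # Smoothed IDF: equals 1.0 when the word occurs in every document,
        # so the score is still defined for a one-document corpus.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0
        # Assumption: high TF and low IDF together indicate a stopword.
        scores[word] = (tf / total_tokens) / idf

    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def cut_at_largest_gap(ranked):
    """Hypothetical cut-off: keep the words ranked above the largest score drop."""
    if len(ranked) < 2:
        return [word for word, _ in ranked]
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [word for word, _ in ranked[:cut]]


if __name__ == "__main__":
    sample = [
        "saya makan nasi di rumah",
        "dia pergi ke pasar di kota",
        "kami tinggal di desa",
    ]
    ranked = rank_stopword_candidates(sample)
    print(cut_at_largest_gap(ranked))
```

In this toy corpus the only word shared by all three sentences is "di", so it receives the highest score. A real corpus would need proper tokenization and a cut-off rule validated against the article itself.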