A Comparative Study of Synthetic Over-sampling Method to Improve the Classification of Poor Households in Yogyakarta Province

The problem of class imbalance has attracted attention from researchers in recent years. Class imbalance occurs when the data have unequal proportions across two or more groups, usually called the minority and majority classes. These problems relate to creation of b...

Bibliographic details
Published in: IOP conference series. Earth and environmental science 2018-11, Vol.187 (1), p.12048
Main authors: Santoso, B, Wijayanto, H, Notodiputro, K A, Sartono, B
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue 1
container_start_page 12048
container_title IOP conference series. Earth and environmental science
container_volume 187
creator Santoso, B
Wijayanto, H
Notodiputro, K A
Sartono, B
description The problem of class imbalance has attracted attention from researchers in recent years. Class imbalance occurs when the data have unequal proportions across two or more groups, usually called the minority and majority classes. It biases parameter estimates and causes misclassification, especially of the minority class. This leads to incorrect predictions for the minority class and ultimately puts policy making at risk. Several approaches have been proposed to correct misclassification, including data-based and algorithm-based approaches. Among data-based approaches, over-sampling is currently very popular. It balances the class distribution by adding synthetic data. This paper discusses strategies for adding synthetic data in order to improve classification accuracy, and it reviews several over-sampling methods for class-imbalanced problems. Specifically, the classification of poor households is illustrated using National Socio-Economic Survey (Susenas) data stratified by urban and rural areas. Finally, K-Nearest Neighbor (KNN), Naïve Bayes, Support Vector Machine (SVM) and Generalized Linear Model (GLM) classifiers are employed, and classification performance is evaluated by comparing sensitivity and the area under the ROC curve (AUC). The simulation results show bias in the parameter estimates of both the intercept and the slope. The bias grows as the data become more unbalanced and as the sample gets smaller. Meanwhile, classification accuracy decreases as the probability value decreases (higher imbalance), especially for small samples. The decrease occurs mainly in minority-class accuracy (sensitivity) and in AUC. The simulation results indicate that synthetic over-sampling can improve classification accuracy for the minority class by increasing sensitivity and AUC. This occurs at small probabilities (unbalanced data). In line with the simulation results, over-sampling also shows evidence of improving the prediction of poor households in Yogyakarta Province. On the other hand, it can also lead to decreased accuracy and specificity. Further research is therefore required to obtain more accurate predictions across all performance measures.
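The abstract describes balancing classes by adding synthetic minority data and judging classifiers by sensitivity and AUC. The sketch below illustrates both ideas in plain Python. It is not the authors' procedure: the record does not say which over-sampling methods the paper compares, so SMOTE-style interpolation between neighbouring minority points is an assumption here, and the function names `smote_like_oversample`, `sensitivity_specificity` and `auc` are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority points.

    Assumption: SMOTE-style interpolation, used for illustration only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance to base.
        others = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity (minority recall) and specificity from hard predictions.

    Assumes both classes are present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


def auc(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for s_pos in pos_scores:
        for s_neg in neg_scores:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

In use, the synthetic points would be appended to the training minority class only, while sensitivity, specificity and AUC would be computed on untouched held-out data, so that the trade-off the abstract reports (higher sensitivity, possibly lower specificity) stays visible.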
doi_str_mv 10.1088/1755-1315/187/1/012048
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2559477531</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2559477531</sourcerecordid><originalsourceid>FETCH-LOGICAL-c3808-99c027e8d78c1e3a56955442d39ec7531cad1d185f4dfdc1779346b4dfb1ef7e3</originalsourceid><addsrcrecordid>eNqFkEFLwzAYhosoOKd_QQJevNQlTdOkxzGmG0w2mB48hSxJt8ytqUk76MH_bspkIgie8oXvfd6EJ4puEXxAkLEBooTECCMyQIwO0ACiBKbsLOqdFuenGdLL6Mr7LYQZTXHeiz6HYGT3lXCiNgcNlnWjWmALsGzLeqNrI8H8oF3sxb7amXINnnW9sQrUFkz3lbMBCTEw2gnvTWFkaLFlxy-sdWBiG683dqc8MCV4s-tWvAtXC7AIpCmlvo4uCrHz-ub77Eevj-OX0SSezZ-mo-EslphBFue5hAnVTFEmkcaCZDkhaZoonGtJCUZSKKQQI0WqCiURpTlOs1W4rJAuqMb96O7YG7780Whf861tXBme5AkheUq7kpDKjinprPdOF7xyZi9cyxHknWreWeSdUR5Uc8SPqgOYHEFjq5_mf6H7P6DxePkrxitV4C8-146k</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2559477531</pqid></control><display><type>article</type><title>A Comparative Study of Synthetic Over-sampling Method to Improve the Classification of Poor Households in Yogyakarta Province</title><source>IOP Publishing Free Content</source><source>EZB-FREE-00999 freely available EZB journals</source><source>IOPscience extra</source><creator>Santoso, B ; Wijayanto, H ; Notodiputro, K A ; Sartono, B</creator><creatorcontrib>Santoso, B ; Wijayanto, H ; Notodiputro, K A ; Sartono, B</creatorcontrib><description>The problems of class imbalance have attracted concerns from researchers in the last few years. Class imbalance problems occur when the data had unbalanced proportions between two or more groups of data which are usually called as minority and majority classes. These problems relate to creation of bias in parameter estimation as well as misclassification of the objects especially for the minority class. These will lead to incorrect prediction of the minority class, and eventually will risk the policy making. 
Several approaches have been proposed to correct misclassification such as data-based and algorithm-based approaches. As a data-based approach, over-sampling method is very popular nowadays. This approach is basically balancing the distribution of data through addition of synthetic data. This paper discusses the strategies of adding synthetic data in order to improve the accuracy of classification. Moreover, this paper also reviews several over sampling methods for class imbalanced problems. Specifically, the classification of poor households is illustrated by using the National Socio-Economic Survey (Susenas) data which has been stratified according to urban and rural areas. Finally, the K-Nearest Neighbor (KNN), Naïve Bayes, Support Vector Machine (SVM) and Generalized Linear Model (GLM) are employed to evaluate the classification performance by comparing the value of sensitivity and area under the ROC curve (AUC). The simulation result shows that there are bias on parameter estimation both on interception and on slope. The bias gets bigger as the data condition becomes more unbalanced and on small sample. Meanwhile, the classification accuracy will decrease with the decrement of probability (high imbalanced) value especially in the data with small sample. Decreased accuracy of classification mainly occurs in the minority class (sensitivity) and AUC. Based on the simulation result, it is clear that the synthetic over sampling approach can improve the accuracy of classification in minority class through increasing sensitivity value and AUC value. This occur at the small probability (unbalanced data). In line with the simulation results, the over sampling approach also shows the evident of improving the prediction of poor households in Yogyakarta Province. But on the other hand, it can also lead to decreased accuracy and specificity. 
However, further research is required to obtain a more accurate prediction result for all performance measures.</description><identifier>ISSN: 1755-1307</identifier><identifier>ISSN: 1755-1315</identifier><identifier>EISSN: 1755-1315</identifier><identifier>DOI: 10.1088/1755-1315/187/1/012048</identifier><language>eng</language><publisher>Bristol: IOP Publishing</publisher><subject>Accuracy ; Algorithms ; Bayesian analysis ; Bias ; Classification ; Comparative studies ; Generalized linear models ; Households ; Interception ; Parameter estimation ; Performance evaluation ; Predictions ; Rural areas ; Sampling ; Sampling methods ; Simulation ; Statistical models ; Support vector machines</subject><ispartof>IOP conference series. Earth and environmental science, 2018-11, Vol.187 (1), p.12048</ispartof><rights>Published under licence by IOP Publishing Ltd</rights><rights>2018. This work is published under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c3808-99c027e8d78c1e3a56955442d39ec7531cad1d185f4dfdc1779346b4dfb1ef7e3</citedby><cites>FETCH-LOGICAL-c3808-99c027e8d78c1e3a56955442d39ec7531cad1d185f4dfdc1779346b4dfb1ef7e3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://iopscience.iop.org/article/10.1088/1755-1315/187/1/012048/pdf$$EPDF$$P50$$Giop$$Hfree_for_read</linktopdf><link.rule.ids>314,780,784,27923,27924,38867,38889,53839,53866</link.rule.ids></links><search><creatorcontrib>Santoso, B</creatorcontrib><creatorcontrib>Wijayanto, H</creatorcontrib><creatorcontrib>Notodiputro, K A</creatorcontrib><creatorcontrib>Sartono, B</creatorcontrib><title>A 
Comparative Study of Synthetic Over-sampling Method to Improve the Classification of Poor Households in Yogyakarta Province</title><title>IOP conference series. Earth and environmental science</title><addtitle>IOP Conf. Ser.: Earth Environ. Sci</addtitle><description>The problems of class imbalance have attracted concerns from researchers in the last few years. Class imbalance problems occur when the data had unbalanced proportions between two or more groups of data which are usually called as minority and majority classes. These problems relate to creation of bias in parameter estimation as well as misclassification of the objects especially for the minority class. These will lead to incorrect prediction of the minority class, and eventually will risk the policy making. Several approaches have been proposed to correct misclassification such as data-based and algorithm-based approaches. As a data-based approach, over-sampling method is very popular nowadays. This approach is basically balancing the distribution of data through addition of synthetic data. This paper discusses the strategies of adding synthetic data in order to improve the accuracy of classification. Moreover, this paper also reviews several over sampling methods for class imbalanced problems. Specifically, the classification of poor households is illustrated by using the National Socio-Economic Survey (Susenas) data which has been stratified according to urban and rural areas. Finally, the K-Nearest Neighbor (KNN), Naïve Bayes, Support Vector Machine (SVM) and Generalized Linear Model (GLM) are employed to evaluate the classification performance by comparing the value of sensitivity and area under the ROC curve (AUC). The simulation result shows that there are bias on parameter estimation both on interception and on slope. The bias gets bigger as the data condition becomes more unbalanced and on small sample. 
Meanwhile, the classification accuracy will decrease with the decrement of probability (high imbalanced) value especially in the data with small sample. Decreased accuracy of classification mainly occurs in the minority class (sensitivity) and AUC. Based on the simulation result, it is clear that the synthetic over sampling approach can improve the accuracy of classification in minority class through increasing sensitivity value and AUC value. This occur at the small probability (unbalanced data). In line with the simulation results, the over sampling approach also shows the evident of improving the prediction of poor households in Yogyakarta Province. But on the other hand, it can also lead to decreased accuracy and specificity. However, further research is required to obtain a more accurate prediction result for all performance measures.</description><subject>Accuracy</subject><subject>Algorithms</subject><subject>Bayesian analysis</subject><subject>Bias</subject><subject>Classification</subject><subject>Comparative studies</subject><subject>Generalized linear models</subject><subject>Households</subject><subject>Interception</subject><subject>Parameter estimation</subject><subject>Performance evaluation</subject><subject>Predictions</subject><subject>Rural areas</subject><subject>Sampling</subject><subject>Sampling methods</subject><subject>Simulation</subject><subject>Statistical models</subject><subject>Support vector 
machines</subject><issn>1755-1307</issn><issn>1755-1315</issn><issn>1755-1315</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>O3W</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNqFkEFLwzAYhosoOKd_QQJevNQlTdOkxzGmG0w2mB48hSxJt8ytqUk76MH_bspkIgie8oXvfd6EJ4puEXxAkLEBooTECCMyQIwO0ACiBKbsLOqdFuenGdLL6Mr7LYQZTXHeiz6HYGT3lXCiNgcNlnWjWmALsGzLeqNrI8H8oF3sxb7amXINnnW9sQrUFkz3lbMBCTEw2gnvTWFkaLFlxy-sdWBiG683dqc8MCV4s-tWvAtXC7AIpCmlvo4uCrHz-ub77Eevj-OX0SSezZ-mo-EslphBFue5hAnVTFEmkcaCZDkhaZoonGtJCUZSKKQQI0WqCiURpTlOs1W4rJAuqMb96O7YG7780Whf861tXBme5AkheUq7kpDKjinprPdOF7xyZi9cyxHknWreWeSdUR5Uc8SPqgOYHEFjq5_mf6H7P6DxePkrxitV4C8-146k</recordid><startdate>20181119</startdate><enddate>20181119</enddate><creator>Santoso, B</creator><creator>Wijayanto, H</creator><creator>Notodiputro, K A</creator><creator>Sartono, B</creator><general>IOP Publishing</general><scope>O3W</scope><scope>TSCCA</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ATCPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>PATMY</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PYCSY</scope></search><sort><creationdate>20181119</creationdate><title>A Comparative Study of Synthetic Over-sampling Method to Improve the Classification of Poor Households in Yogyakarta Province</title><author>Santoso, B ; Wijayanto, H ; Notodiputro, K A ; Sartono, 
B</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c3808-99c027e8d78c1e3a56955442d39ec7531cad1d185f4dfdc1779346b4dfb1ef7e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Accuracy</topic><topic>Algorithms</topic><topic>Bayesian analysis</topic><topic>Bias</topic><topic>Classification</topic><topic>Comparative studies</topic><topic>Generalized linear models</topic><topic>Households</topic><topic>Interception</topic><topic>Parameter estimation</topic><topic>Performance evaluation</topic><topic>Predictions</topic><topic>Rural areas</topic><topic>Sampling</topic><topic>Sampling methods</topic><topic>Simulation</topic><topic>Statistical models</topic><topic>Support vector machines</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Santoso, B</creatorcontrib><creatorcontrib>Wijayanto, H</creatorcontrib><creatorcontrib>Notodiputro, K A</creatorcontrib><creatorcontrib>Sartono, B</creatorcontrib><collection>IOP Publishing Free Content</collection><collection>IOPscience (Open Access)</collection><collection>CrossRef</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Agricultural &amp; Environmental Science Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>Environmental Science Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One 
Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Environmental Science Collection</collection><jtitle>IOP conference series. Earth and environmental science</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Santoso, B</au><au>Wijayanto, H</au><au>Notodiputro, K A</au><au>Sartono, B</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A Comparative Study of Synthetic Over-sampling Method to Improve the Classification of Poor Households in Yogyakarta Province</atitle><jtitle>IOP conference series. Earth and environmental science</jtitle><addtitle>IOP Conf. Ser.: Earth Environ. Sci</addtitle><date>2018-11-19</date><risdate>2018</risdate><volume>187</volume><issue>1</issue><spage>12048</spage><pages>12048-</pages><issn>1755-1307</issn><issn>1755-1315</issn><eissn>1755-1315</eissn><abstract>The problems of class imbalance have attracted concerns from researchers in the last few years. Class imbalance problems occur when the data had unbalanced proportions between two or more groups of data which are usually called as minority and majority classes. These problems relate to creation of bias in parameter estimation as well as misclassification of the objects especially for the minority class. These will lead to incorrect prediction of the minority class, and eventually will risk the policy making. Several approaches have been proposed to correct misclassification such as data-based and algorithm-based approaches. As a data-based approach, over-sampling method is very popular nowadays. This approach is basically balancing the distribution of data through addition of synthetic data. This paper discusses the strategies of adding synthetic data in order to improve the accuracy of classification. 
Moreover, this paper also reviews several over sampling methods for class imbalanced problems. Specifically, the classification of poor households is illustrated by using the National Socio-Economic Survey (Susenas) data which has been stratified according to urban and rural areas. Finally, the K-Nearest Neighbor (KNN), Naïve Bayes, Support Vector Machine (SVM) and Generalized Linear Model (GLM) are employed to evaluate the classification performance by comparing the value of sensitivity and area under the ROC curve (AUC). The simulation result shows that there are bias on parameter estimation both on interception and on slope. The bias gets bigger as the data condition becomes more unbalanced and on small sample. Meanwhile, the classification accuracy will decrease with the decrement of probability (high imbalanced) value especially in the data with small sample. Decreased accuracy of classification mainly occurs in the minority class (sensitivity) and AUC. Based on the simulation result, it is clear that the synthetic over sampling approach can improve the accuracy of classification in minority class through increasing sensitivity value and AUC value. This occur at the small probability (unbalanced data). In line with the simulation results, the over sampling approach also shows the evident of improving the prediction of poor households in Yogyakarta Province. But on the other hand, it can also lead to decreased accuracy and specificity. However, further research is required to obtain a more accurate prediction result for all performance measures.</abstract><cop>Bristol</cop><pub>IOP Publishing</pub><doi>10.1088/1755-1315/187/1/012048</doi><tpages>18</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1755-1307
ispartof IOP conference series. Earth and environmental science, 2018-11, Vol.187 (1), p.12048
issn 1755-1307
1755-1315
1755-1315
language eng
recordid cdi_proquest_journals_2559477531
source IOP Publishing Free Content; EZB-FREE-00999 freely available EZB journals; IOPscience extra
subjects Accuracy
Algorithms
Bayesian analysis
Bias
Classification
Comparative studies
Generalized linear models
Households
Interception
Parameter estimation
Performance evaluation
Predictions
Rural areas
Sampling
Sampling methods
Simulation
Statistical models
Support vector machines
title A Comparative Study of Synthetic Over-sampling Method to Improve the Classification of Poor Households in Yogyakarta Province
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T12%3A19%3A01IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Comparative%20Study%20of%20Synthetic%20Over-sampling%20Method%20to%20Improve%20the%20Classification%20of%20Poor%20Households%20in%20Yogyakarta%20Province&rft.jtitle=IOP%20conference%20series.%20Earth%20and%20environmental%20science&rft.au=Santoso,%20B&rft.date=2018-11-19&rft.volume=187&rft.issue=1&rft.spage=12048&rft.pages=12048-&rft.issn=1755-1307&rft.eissn=1755-1315&rft_id=info:doi/10.1088/1755-1315/187/1/012048&rft_dat=%3Cproquest_cross%3E2559477531%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2559477531&rft_id=info:pmid/&rfr_iscdi=true