A Comparative Study of Synthetic Over-sampling Method to Improve the Classification of Poor Households in Yogyakarta Province
The problem of class imbalance has attracted researchers' attention in recent years. Class imbalance occurs when the data have unbalanced proportions between two or more groups, usually called the minority and majority classes. These problems relate to creation of b...
Published in: | IOP conference series. Earth and environmental science 2018-11, Vol.187 (1), p.12048 |
---|---|
Main authors: | Santoso, B; Wijayanto, H; Notodiputro, K A; Sartono, B |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | 1 |
container_start_page | 12048 |
container_title | IOP conference series. Earth and environmental science |
container_volume | 187 |
creator | Santoso, B Wijayanto, H Notodiputro, K A Sartono, B |
description | The problem of class imbalance has attracted researchers' attention in recent years. Class imbalance occurs when the data have unbalanced proportions between two or more groups, usually called the minority and majority classes. It biases parameter estimates and causes misclassification, especially of the minority class. This leads to incorrect predictions for the minority class and ultimately puts policy making at risk. Several approaches have been proposed to correct misclassification, including data-based and algorithm-based approaches. Among data-based approaches, over-sampling is currently very popular; it balances the distribution of the data by adding synthetic data. This paper discusses strategies for adding synthetic data to improve classification accuracy and reviews several over-sampling methods for class-imbalance problems. Specifically, the classification of poor households is illustrated using National Socio-Economic Survey (Susenas) data stratified by urban and rural areas. K-Nearest Neighbor (KNN), Naïve Bayes, Support Vector Machine (SVM) and Generalized Linear Model (GLM) classifiers are evaluated by comparing sensitivity and area under the ROC curve (AUC). The simulation results show bias in the parameter estimates for both the intercept and the slope. The bias grows as the data become more unbalanced and as the sample gets smaller. Classification accuracy likewise decreases as the probability decreases (high imbalance), especially in small samples. The decrease occurs mainly in minority-class accuracy (sensitivity) and in AUC. 
Based on the simulation results, synthetic over-sampling can improve classification accuracy for the minority class by increasing sensitivity and AUC. This occurs when the probability is small (unbalanced data). In line with the simulation results, over-sampling also shows evidence of improving the prediction of poor households in Yogyakarta Province. On the other hand, it can also decrease accuracy and specificity. Further research is therefore required to obtain more accurate predictions across all performance measures. |
doi_str_mv | 10.1088/1755-1315/187/1/012048 |
format | Article |
fullrecord | Publisher: Bristol: IOP Publishing. Publication date: 2018-11-19. Abbreviated journal title: IOP Conf. Ser.: Earth Environ. Sci. Total pages: 18. Peer reviewed; open access (free_for_read). Rights: Published under licence by IOP Publishing Ltd; 2018, published under http://creativecommons.org/licenses/by/3.0/. PDF: https://iopscience.iop.org/article/10.1088/1755-1315/187/1/012048/pdf |
fulltext | fulltext |
identifier | ISSN: 1755-1307 |
ispartof | IOP conference series. Earth and environmental science, 2018-11, Vol.187 (1), p.12048 |
issn | 1755-1307 1755-1315 1755-1315 |
language | eng |
recordid | cdi_proquest_journals_2559477531 |
source | IOP Publishing Free Content; EZB-FREE-00999 freely available EZB journals; IOPscience extra |
subjects | Accuracy Algorithms Bayesian analysis Bias Classification Comparative studies Generalized linear models Households Interception Parameter estimation Performance evaluation Predictions Rural areas Sampling Sampling methods Simulation Statistical models Support vector machines |
title | A Comparative Study of Synthetic Over-sampling Method to Improve the Classification of Poor Households in Yogyakarta Province |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T12%3A19%3A01IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Comparative%20Study%20of%20Synthetic%20Over-sampling%20Method%20to%20Improve%20the%20Classification%20of%20Poor%20Households%20in%20Yogyakarta%20Province&rft.jtitle=IOP%20conference%20series.%20Earth%20and%20environmental%20science&rft.au=Santoso,%20B&rft.date=2018-11-19&rft.volume=187&rft.issue=1&rft.spage=12048&rft.pages=12048-&rft.issn=1755-1307&rft.eissn=1755-1315&rft_id=info:doi/10.1088/1755-1315/187/1/012048&rft_dat=%3Cproquest_cross%3E2559477531%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2559477531&rft_id=info:pmid/&rfr_iscdi=true |
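
The description field above says that over-sampling balances the classes by adding synthetic data and that performance is judged by sensitivity. The record does not say which over-sampling methods the paper compares, so the following is only a minimal sketch under an assumption: it uses SMOTE-style interpolation between neighbouring minority points, one common way to generate synthetic samples. The function names, the toy data, and the parameters `k` and `seed` are hypothetical and chosen for illustration.

```python
import random


def oversample_minority(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style).
    Assumes at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def sensitivity(y_true, y_pred, positive=1):
    """True-positive rate: share of actual positives predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data (hypothetical): 4 minority points against a majority class of size 10.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_size = 10
new_points = oversample_minority(minority, majority_size - len(minority))
balanced_minority = minority + new_points
```

Because each synthetic point lies on a segment between two real minority points, the added data stay inside the region the minority class already occupies. In practice the over-sampling would be applied to the training split only, so that sensitivity and AUC are measured on untouched data.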