OUBoost: boosting based over and under sampling technique for handling imbalanced data

Bibliographic Details
Published in: International journal of machine learning and cybernetics, 2023-10, Vol. 14 (10), p. 3393-3411
Main authors: Mostafaei, Sahar Hassanzadeh; Tanha, Jafar
Format: Article
Language: English
Online access: Full text
Description: Most real-world datasets contain imbalanced data. Learning from datasets where the number of samples in one class (the minority) is much smaller than in another class (the majority) produces classifiers that are biased toward the majority class. The overall prediction accuracy on such datasets is often higher than 90%, while the accuracy for the minority class is considerably lower. In this paper, we first propose a new technique for under-sampling the majority class of imbalanced datasets based on the Peak clustering method. We then propose a novel boosting-based algorithm for learning from imbalanced datasets, named OUBoost, which combines the proposed Peak under-sampling algorithm with the SMOTE over-sampling technique inside the boosting procedure. In OUBoost, misclassified examples are not given equal weights: the algorithm selects useful examples from the majority class and creates synthetic examples for the minority class, thereby updating the sample weights indirectly. We designed experiments on 30 real-world imbalanced datasets using several evaluation metrics, such as Recall, MCC, Gmean, and F-score. The results show improved prediction performance on the minority class for most of the datasets when using OUBoost. We further report time comparisons and statistical tests to analyze the proposed algorithm in more detail.
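
The short Python sketch below illustrates the general idea described in the abstract; it is not the authors' implementation. An AdaBoost-style loop re-samples the training set in every round: the majority class is under-sampled (random selection here stands in for the paper's Peak clustering-based selection) and the minority class is augmented with simplified SMOTE-style interpolated samples before a weak learner is fit. The function names ouboost_sketch and smote_like are hypothetical.

    # Minimal sketch of boosting with per-round over- and under-sampling.
    # NOTE: this is an illustration, not the published OUBoost code; random
    # under-sampling stands in for the paper's Peak clustering-based selection,
    # and smote_like is a simplified stand-in for SMOTE.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def smote_like(X_min, n_new):
        # Interpolate between random pairs of minority samples (simplified SMOTE).
        i = rng.integers(0, len(X_min), n_new)
        j = rng.integers(0, len(X_min), n_new)
        lam = rng.random((n_new, 1))
        return X_min[i] + lam * (X_min[j] - X_min[i])

    def ouboost_sketch(X, y, n_rounds=10):
        # y is 0/1 with 1 the minority class; returns weak learners and their votes.
        learners, alphas = [], []
        w = np.full(len(y), 1.0 / len(y))  # boosting weights over original samples
        for _ in range(n_rounds):
            maj = np.where(y == 0)[0]
            mino = np.where(y == 1)[0]
            keep = rng.choice(maj, size=len(mino), replace=False)  # under-sample majority
            X_syn = smote_like(X[mino], len(mino))                 # over-sample minority
            X_t = np.vstack([X[keep], X[mino], X_syn])
            y_t = np.concatenate([np.zeros(len(keep)), np.ones(2 * len(mino))])
            w_t = np.concatenate([w[keep], w[mino], np.full(len(X_syn), w[mino].mean())])
            stump = DecisionTreeClassifier(max_depth=1).fit(X_t, y_t, sample_weight=w_t)
            pred = stump.predict(X)
            err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)                  # AdaBoost-style learner weight
            w *= np.exp(alpha * (pred != y))                       # emphasize misclassified originals
            w /= w.sum()
            learners.append(stump)
            alphas.append(alpha)
        return learners, np.array(alphas)

    def ensemble_predict(learners, alphas, X):
        votes = sum(a * (2 * clf.predict(X) - 1) for clf, a in zip(learners, alphas))
        return (votes > 0).astype(int)

In the actual OUBoost algorithm, the under-sampling step is guided by Peak clustering of the majority class and the resampling itself is what updates the effective sample weights; the random choices above are placeholders for those components.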
DOI: 10.1007/s13042-023-01839-0
Publisher: Berlin/Heidelberg: Springer Berlin Heidelberg
Rights: The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023
ISSN: 1868-8071
EISSN: 1868-808X
Source: SpringerLink Journals - AutoHoldings; ProQuest Central
Subjects: Accuracy
Algorithms
Artificial Intelligence
Classification
Clustering
Complex Systems
Computational Intelligence
Control
Data integrity
Datasets
Engineering
Learning
Machine learning
Mechatronics
Methods
Original Article
Pattern Recognition
Performance evaluation
Robotics
Sampling methods
Sampling techniques
Statistical tests
Systems Biology