RACOG and wRACOG: Two Probabilistic Oversampling Techniques

As machine learning techniques mature and are used to tackle complex scientific problems, challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-represented in comparison with other classes. Existing oversampling approaches for addressing th...

Detailed description

Bibliographic details
Published in: IEEE transactions on knowledge and data engineering 2015-01, Vol.27 (1), p.222-234
Main authors: Das, Barnan; Krishnan, Narayanan C.; Cook, Diane J.
Format: Article
Language: eng
container_end_page 234
container_issue 1
container_start_page 222
container_title IEEE transactions on knowledge and data engineering
container_volume 27
creator Das, Barnan
Krishnan, Narayanan C.
Cook, Diane J.
description As machine learning techniques mature and are used to tackle complex scientific problems, challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-represented in comparison with other classes. Existing oversampling approaches for addressing this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not represented well, which leads to high misclassification error. We introduce two probabilistic oversampling approaches, namely RACOG and wRACOG, for synthetically generating and strategically selecting new minority class samples. The proposed approaches use the joint probability distribution of data attributes and Gibbs sampling to generate new minority class samples. While RACOG selects samples produced by the Gibbs sampler based on a predefined lag, wRACOG selects those samples that have the highest probability of being misclassified by the existing learning model. We validate our approach using nine UCI data sets that were carefully modified to exhibit class imbalance and one new application domain data set with inherent extreme class imbalance. In addition, we compare the classification performance of the proposed methods with three other existing resampling techniques.
doi_str_mv 10.1109/TKDE.2014.2324567
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4814938</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>6816044</ieee_id><sourcerecordid>1835675476</sourcerecordid><originalsourceid>FETCH-LOGICAL-c485t-1997407111a899ae1f5c4775c62376c7289079ce4751e3e0d43f8f835f8547553</originalsourceid><addsrcrecordid>eNpVkFtLAzEQhYMoVqs_QATZR1-2ZjbJJlEQSq1VLFRkfQ5pmm0je6mbXvDfm9pa9ClDzplzhg-hC8AdACxvspeHfifBQDsJSShL-QE6AcZEnICEwzBjCjEllLfQqfcfGGPBBRyjVsKDIjk9QXdv3d5oEOlqEq1_xtsoW9fRa1OP9dgVzi-ciUYr23hdzgtXTaPMmlnlPpfWn6GjXBfenu_eNnp_7Ge9p3g4Gjz3usPYUMEWMcjQhDkAaCGltpAzQzlnJk0ITw1PhMRcGks5A0ssnlCSi1wQlgsW_hhpo_tt7nw5Lu3E2GrR6ELNG1fq5kvV2qn_SuVmalqvFBVAJREh4HoX0NSbwxeqdN7YotCVrZdeQShLeShLgxW2VtPU3jc239cAVhvoagNdbaCrHfSwc_X3vv3GL-VguNwanLV2L6cCUkwp-QbLP4Pf</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1835675476</pqid></control><display><type>article</type><title>RACOG and wRACOG: Two Probabilistic Oversampling Techniques</title><source>IEEE Electronic Library (IEL)</source><creator>Das, Barnan ; Krishnan, Narayanan C. ; Cook, Diane J.</creator><creatorcontrib>Das, Barnan ; Krishnan, Narayanan C. ; Cook, Diane J.</creatorcontrib><description>As machine learning techniques mature and are used to tackle complex scientific problems, challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-represented in comparison with other classes. Existing oversampling approaches for addressing this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not represented well which leads to high misclassification error. 
We introduce two probabilistic oversampling approaches, namely RACOG and wRACOG, to synthetically generating and strategically selecting new minority class samples. The proposed approaches use the joint probability distribution of data attributes and Gibbs sampling to generate new minority class samples. While RACOG selects samples produced by the Gibbs sampler based on a predefined lag, wRACOG selects those samples that have the highest probability of being misclassified by the existing learning model. We validate our approach using nine UCI data sets that were carefully modified to exhibit class imbalance and one new application domain data set with inherent extreme class imbalance. In addition, we compare the classification performance of the proposed methods with three other existing resampling techniques.</description><identifier>ISSN: 1041-4347</identifier><identifier>EISSN: 1558-2191</identifier><identifier>DOI: 10.1109/TKDE.2014.2324567</identifier><identifier>PMID: 27041974</identifier><identifier>CODEN: ITKEEH</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>Approximation algorithms ; Approximation methods ; Joints ; Kernel ; Machine learning algorithms ; Probabilistic logic ; Probability distribution</subject><ispartof>IEEE transactions on knowledge and data engineering, 2015-01, Vol.27 (1), 
p.222-234</ispartof><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c485t-1997407111a899ae1f5c4775c62376c7289079ce4751e3e0d43f8f835f8547553</citedby><cites>FETCH-LOGICAL-c485t-1997407111a899ae1f5c4775c62376c7289079ce4751e3e0d43f8f835f8547553</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/6816044$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>230,314,780,784,796,885,27924,27925,54758</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/6816044$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/27041974$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Das, Barnan</creatorcontrib><creatorcontrib>Krishnan, Narayanan C.</creatorcontrib><creatorcontrib>Cook, Diane J.</creatorcontrib><title>RACOG and wRACOG: Two Probabilistic Oversampling Techniques</title><title>IEEE transactions on knowledge and data engineering</title><addtitle>TKDE</addtitle><addtitle>IEEE Trans Knowl Data Eng</addtitle><description>As machine learning techniques mature and are used to tackle complex scientific problems, challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-represented in comparison with other classes. Existing oversampling approaches for addressing this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not represented well which leads to high misclassification error. We introduce two probabilistic oversampling approaches, namely RACOG and wRACOG, to synthetically generating and strategically selecting new minority class samples. 
The proposed approaches use the joint probability distribution of data attributes and Gibbs sampling to generate new minority class samples. While RACOG selects samples produced by the Gibbs sampler based on a predefined lag, wRACOG selects those samples that have the highest probability of being misclassified by the existing learning model. We validate our approach using nine UCI data sets that were carefully modified to exhibit class imbalance and one new application domain data set with inherent extreme class imbalance. In addition, we compare the classification performance of the proposed methods with three other existing resampling techniques.</description><subject>Approximation algorithms</subject><subject>Approximation methods</subject><subject>Joints</subject><subject>Kernel</subject><subject>Machine learning algorithms</subject><subject>Probabilistic logic</subject><subject>Probability distribution</subject><issn>1041-4347</issn><issn>1558-2191</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpVkFtLAzEQhYMoVqs_QATZR1-2ZjbJJlEQSq1VLFRkfQ5pmm0je6mbXvDfm9pa9ClDzplzhg-hC8AdACxvspeHfifBQDsJSShL-QE6AcZEnICEwzBjCjEllLfQqfcfGGPBBRyjVsKDIjk9QXdv3d5oEOlqEq1_xtsoW9fRa1OP9dgVzi-ciUYr23hdzgtXTaPMmlnlPpfWn6GjXBfenu_eNnp_7Ge9p3g4Gjz3usPYUMEWMcjQhDkAaCGltpAzQzlnJk0ITw1PhMRcGks5A0ssnlCSi1wQlgsW_hhpo_tt7nw5Lu3E2GrR6ELNG1fq5kvV2qn_SuVmalqvFBVAJREh4HoX0NSbwxeqdN7YotCVrZdeQShLeShLgxW2VtPU3jc239cAVhvoagNdbaCrHfSwc_X3vv3GL-VguNwanLV2L6cCUkwp-QbLP4Pf</recordid><startdate>20150101</startdate><enddate>20150101</enddate><creator>Das, Barnan</creator><creator>Krishnan, Narayanan C.</creator><creator>Cook, Diane J.</creator><general>IEEE</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>5PM</scope></search><sort><creationdate>20150101</creationdate><title>RACOG and wRACOG: Two 
Probabilistic Oversampling Techniques</title><author>Das, Barnan ; Krishnan, Narayanan C. ; Cook, Diane J.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c485t-1997407111a899ae1f5c4775c62376c7289079ce4751e3e0d43f8f835f8547553</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Approximation algorithms</topic><topic>Approximation methods</topic><topic>Joints</topic><topic>Kernel</topic><topic>Machine learning algorithms</topic><topic>Probabilistic logic</topic><topic>Probability distribution</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Das, Barnan</creatorcontrib><creatorcontrib>Krishnan, Narayanan C.</creatorcontrib><creatorcontrib>Cook, Diane J.</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>IEEE transactions on knowledge and data engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Das, Barnan</au><au>Krishnan, Narayanan C.</au><au>Cook, Diane J.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>RACOG and wRACOG: Two Probabilistic Oversampling Techniques</atitle><jtitle>IEEE transactions on knowledge and data engineering</jtitle><stitle>TKDE</stitle><addtitle>IEEE Trans Knowl Data Eng</addtitle><date>2015-01-01</date><risdate>2015</risdate><volume>27</volume><issue>1</issue><spage>222</spage><epage>234</epage><pages>222-234</pages><issn>1041-4347</issn><eissn>1558-2191</eissn><coden>ITKEEH</coden><abstract>As machine 
learning techniques mature and are used to tackle complex scientific problems, challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-represented in comparison with other classes. Existing oversampling approaches for addressing this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not represented well which leads to high misclassification error. We introduce two probabilistic oversampling approaches, namely RACOG and wRACOG, to synthetically generating and strategically selecting new minority class samples. The proposed approaches use the joint probability distribution of data attributes and Gibbs sampling to generate new minority class samples. While RACOG selects samples produced by the Gibbs sampler based on a predefined lag, wRACOG selects those samples that have the highest probability of being misclassified by the existing learning model. We validate our approach using nine UCI data sets that were carefully modified to exhibit class imbalance and one new application domain data set with inherent extreme class imbalance. In addition, we compare the classification performance of the proposed methods with three other existing resampling techniques.</abstract><cop>United States</cop><pub>IEEE</pub><pmid>27041974</pmid><doi>10.1109/TKDE.2014.2324567</doi><tpages>13</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1041-4347
ispartof IEEE transactions on knowledge and data engineering, 2015-01, Vol.27 (1), p.222-234
issn 1041-4347
1558-2191
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4814938
source IEEE Electronic Library (IEL)
subjects Approximation algorithms
Approximation methods
Joints
Kernel
Machine learning algorithms
Probabilistic logic
Probability distribution
title RACOG and wRACOG: Two Probabilistic Oversampling Techniques
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T23%3A47%3A07IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=RACOG%20and%20wRACOG:%20Two%20Probabilistic%20Oversampling%20Techniques&rft.jtitle=IEEE%20transactions%20on%20knowledge%20and%20data%20engineering&rft.au=Das,%20Barnan&rft.date=2015-01-01&rft.volume=27&rft.issue=1&rft.spage=222&rft.epage=234&rft.pages=222-234&rft.issn=1041-4347&rft.eissn=1558-2191&rft.coden=ITKEEH&rft_id=info:doi/10.1109/TKDE.2014.2324567&rft_dat=%3Cproquest_RIE%3E1835675476%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1835675476&rft_id=info:pmid/27041974&rft_ieee_id=6816044&rfr_iscdi=true
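
The description field above says that RACOG draws new minority class samples with a Gibbs sampler over the joint distribution of the attributes and keeps samples at a predefined lag. The following is a minimal sketch of that idea, not the authors' implementation. It rests on assumptions this record does not confirm: attributes are discrete and coded as integers starting at 0, the joint distribution is approximated by a simple chain in which each attribute depends only on the previous one, and probabilities are Laplace-smoothed. The names `fit_chain`, `gibbs_oversample`, `n_values`, `burn_in`, `lag` and `alpha` are hypothetical.

```python
import random
from collections import Counter


def fit_chain(minority, n_values, alpha=1.0):
    """Estimate P(x_0) and P(x_j | x_{j-1}) from minority samples.

    Assumption: a chain-structured joint distribution with Laplace
    smoothing (alpha). This is an illustrative stand-in only.
    """
    d = len(n_values)
    first = Counter(row[0] for row in minority)
    pairs = [Counter((row[j - 1], row[j]) for row in minority) for j in range(1, d)]
    prev = [Counter(row[j - 1] for row in minority) for j in range(1, d)]

    def p_first(v):
        return (first[v] + alpha) / (len(minority) + alpha * n_values[0])

    def p_cond(j, v, u):
        # P(x_j = v | x_{j-1} = u) for j >= 1
        return (pairs[j - 1][(u, v)] + alpha) / (prev[j - 1][u] + alpha * n_values[j])

    return p_first, p_cond


def gibbs_oversample(minority, n_values, n_new, burn_in=100, lag=20, seed=0):
    """Generate n_new synthetic minority samples by Gibbs sampling."""
    rng = random.Random(seed)
    p_first, p_cond = fit_chain(minority, n_values)
    d = len(n_values)
    state = list(rng.choice(minority))  # start from a real minority sample
    samples = []
    sweep = 0
    while len(samples) < n_new:
        for j in range(d):
            # Full conditional of x_j under the chain model:
            # proportional to P(x_j | x_{j-1}) * P(x_{j+1} | x_j)
            weights = []
            for v in range(n_values[j]):
                w = p_first(v) if j == 0 else p_cond(j, v, state[j - 1])
                if j + 1 < d:
                    w *= p_cond(j + 1, state[j + 1], v)
                weights.append(w)
            state[j] = rng.choices(range(n_values[j]), weights=weights)[0]
        sweep += 1
        # Discard the burn-in sweeps, then keep every lag-th state
        if sweep > burn_in and (sweep - burn_in) % lag == 0:
            samples.append(tuple(state))
    return samples
```

Calling `gibbs_oversample(minority, n_values=[2, 2, 3], n_new=5)` on a small list of integer-coded tuples returns five synthetic tuples. Going by the description, wRACOG would replace the fixed-lag rule with a selection step that keeps the samples most likely to be misclassified by the current model; that step is not sketched here.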