A Classification Model For Class Imbalance Dataset Using Genetic Programming

For the last few decades, class imbalance has been one of the most challenging problems in fields such as data mining and machine learning. An imbalanced dataset is one in which the classes are distributed unevenly. This happens when the posi...

Bibliographic details
Published in: IEEE access 2019, Vol.7, p.71013-71037
Main authors: Tahir, Mirza Amaad Ul Haq; Asghar, Sohail; Manzoor, Awais; Noor, Muhammad Asim
Format: Article
Language: eng
Online access: Full text
container_end_page 71037
container_issue
container_start_page 71013
container_title IEEE access
container_volume 7
creator Tahir, Mirza Amaad Ul Haq
Asghar, Sohail
Manzoor, Awais
Noor, Muhammad Asim
description For the last few decades, class imbalance has been one of the most challenging problems in fields such as data mining and machine learning. An imbalanced dataset is one in which the classes are distributed unevenly. This happens when the positive class is much smaller than the negative class. In this case, most standard classification algorithms fail to identify examples of the positive class, even though the positive class is usually the key interest of the classification task. Several solutions to this problem have been proposed, such as over-sampling and under-sampling, changes at the classifier level, or the combination of two or more classifiers. However, most of these solutions are biased towards the negative class, are computationally expensive, have storage issues, or take a long time to train. An alternative approach is the genetic algorithm (GA), which has shown promising results. The GA is an evolutionary learning algorithm that uses the principles of Darwinian evolution; it is a powerful global search algorithm. The fitness function is a key parameter in GA: it determines how well a solution solves the given problem. In this paper, we propose a solution that uses entropy and information gain as the fitness function in GA, with the objective of improving impurity and giving a more balanced result without changing the original dataset. Experiments conducted on different datasets demonstrate the effectiveness of the proposed solution in comparison with several other state-of-the-art algorithms in terms of accuracy (Acc), geometric mean (GM), F-measure (FM), kappa, and Matthews correlation coefficient (MCC).
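As a rough illustration of what an entropy and information-gain fitness function could look like, the sketch below scores a candidate by how much it reduces label entropy on a toy imbalanced dataset. It is not the authors' implementation: the single-threshold rule, the function names (`entropy`, `information_gain`, `fitness`) and the data are all assumptions made for this example, and no evolutionary operators (selection, crossover, mutation) are shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, predicted):
    """Reduction in label entropy after partitioning the labels by `predicted`."""
    n = len(labels)
    groups = {}
    for y, p in zip(labels, predicted):
        groups.setdefault(p, []).append(y)
    # Weighted average entropy of the partitions produced by the candidate.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def fitness(threshold, xs, labels):
    """Hypothetical fitness of a one-gene candidate: the rule `x > threshold`."""
    predicted = [x > threshold for x in xs]
    return information_gain(labels, predicted)


# Toy imbalanced data (assumed): 2 positives (1) among 8 negatives (0).
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Pick the fittest of a few hand-written candidates; a GA would evolve these.
candidates = [0.25, 0.55, 0.85]
best = max(candidates, key=lambda t: fitness(t, xs, labels))
```

Because information gain depends on how pure the resulting partitions are rather than on the raw count of correct predictions, a candidate that isolates the two positive examples scores highest here even though the negative class dominates. The dataset itself is never resampled.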
doi_str_mv 10.1109/ACCESS.2019.2915611
format Article
publisher Piscataway: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2019
coden IAECCG
eissn 2169-3536
tpages 25
orcid https://orcid.org/0000-0001-6306-9150
https://orcid.org/0000-0002-7678-8282
toplevel peer_reviewed
online_resources
oa free_for_read
linktohtml https://ieeexplore.ieee.org/document/8709798
collection IEEE All-Society Periodicals Package (ASPP) 2005-present
IEEE Open Access Journals
IEEE All-Society Periodicals Package (ASPP) 1998-Present
IEEE Electronic Library (IEL)
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Engineered Materials Abstracts
METADEX
Technology Research Database
Materials Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DOAJ Directory of Open Access Journals
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2019, Vol.7, p.71013-71037
issn 2169-3536
2169-3536
language eng
recordid cdi_proquest_journals_2455617411
source IEEE Open Access Journals; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals
subjects Algorithms
Classification
Classifiers
Computational modeling
Correlation analysis
Correlation coefficients
Data mining
Datasets
Entropy
Entropy of solution
Evolutionary algorithms
Fitness
fitness function
genetic algorithm
Genetic algorithms
Geometric accuracy
Imbalanced dataset
Impurities
Information gain
Machine learning
Sampling
Search algorithms
Support vector machines
Training
title A Classification Model For Class Imbalance Dataset Using Genetic Programming
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T01%3A54%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Classification%20Model%20For%20Class%20Imbalance%20Dataset%20Using%20Genetic%20Programming&rft.jtitle=IEEE%20access&rft.au=Tahir,%20Mirza%20Amaad%20Ul%20Haq&rft.date=2019&rft.volume=7&rft.spage=71013&rft.epage=71037&rft.pages=71013-71037&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2019.2915611&rft_dat=%3Cproquest_ieee_%3E2455617411%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2455617411&rft_id=info:pmid/&rft_ieee_id=8709798&rft_doaj_id=oai_doaj_org_article_e4a6a1084c9f4607a8a5d91dfa36c6d7&rfr_iscdi=true