A Dataset Centric Feature Selection and Stacked Model to Detect Breast Cancer
The World Health Organisation has identified breast cancer (BC) as the most frequent cancer among women, accounting for 15 percent of all cancer deaths. Its accurate prediction is of utmost significance, as it not only prevents deaths but also avoids mistreatment. The conventional way of diagnosis includes...
Saved in:
Published in: | International journal of intelligent systems and applications, 2021-08, Vol. 13 (4), p. 24-37 |
---|---|
Main authors: | Chaudhuri, Avijit Kumar; Banerjee, Dilip K.; Das, Anirban |
Format: | Article |
Language: | English |
Online access: | Full text |
container_end_page | 37 |
container_issue | 4 |
container_start_page | 24 |
container_title | International journal of intelligent systems and applications |
container_volume | 13 |
creator | Chaudhuri, Avijit Kumar; Banerjee, Dilip K.; Das, Anirban |
description | The World Health Organisation has identified breast cancer (BC) as the most frequent cancer among women, accounting for 15 percent of all cancer deaths. Its accurate prediction is of utmost significance, as it not only prevents deaths but also avoids mistreatment. The conventional way of diagnosis includes the estimation of tumor size as a sign of plausible cancer. Machine learning (ML) techniques have shown their effectiveness in predicting disease; however, ML methods have been method centric rather than dataset centric. In this paper, the authors introduce a dataset centric approach (DCA) that deploys a genetic algorithm (GA) to identify the features and a learning ensemble classifier to predict using the right features. AdaBoost is one such approach: it trains the model by assigning weights to individual records, rather than experimenting with dataset splits alone, and performs hyper-parameter optimization. The authors simulate the results by varying base classifiers, i.e., logistic regression (LR), decision tree (DT), support vector machine (SVM), naive Bayes (NB), and random forest (RF), and by using 10-fold cross-validation with different training and testing splits of the dataset. The proposed DCA model with RF and 10-fold cross-validation demonstrated its potential with almost 100% performance in the classification results, which no prior research has reported. The DCA satisfies the underlying principles of data mining: the principle of parsimony, the principle of inclusion, the principle of discrimination, and the principle of optimality. The DCA is a democratic and unbiased ensemble approach: it allows all features and methods to compete at the start, but filters out the most reliable chain of steps and combinations that gives the highest accuracy. With fewer features and splits of 50-50, 66-34, and 10-fold cross-validation, the stacked model achieves 97% accuracy. These values, together with the reduction of features, improve upon prior research. Further, the proposed classifier is compared with some state-of-the-art machine-learning classifiers, namely random forest, naive Bayes, support vector machine with radial basis function kernel, and decision tree. For testing the classifiers, different performance metrics have been employed – accuracy, detection rate, sensitivity, specificity, receiver operating characteristic, area under the curve, and statistical tests such as the Wilcoxon signed-rank test and kappa statistics – to check the strength of the proposed DCA classifier. Various splits of training and testing data – namely 50–50%, 66–34%, 80–20% and 10-fold cross-validation – have been incorporated to test the credibility of the classification models in handling unbalanced data. Finally, the proposed DCA model demonstrated its potential with almost 100% performance in the classification results. The output results have also been compared with other research on the same dataset, where the proposed classifiers were found to be best across all performance dimensions. (Illustrative code sketches of the pipeline described here appear after the record fields below.) |
doi_str_mv | 10.5815/ijisa.2021.04.03 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 2074-904X |
ispartof | International journal of intelligent systems and applications, 2021-08, Vol.13 (4), p.24-37 |
issn | 2074-904X (ISSN); 2074-9058 (EISSN) |
language | eng |
recordid | cdi_proquest_journals_2798550521 |
source | EZB-FREE-00999 freely available EZB journals |
subjects | Accuracy; Breast cancer; Classification; Classifiers; Data mining; Datasets; Decision trees; Fatalities; Genetic algorithms; Identification methods; Kernel functions; Machine learning; Model accuracy; Optimization; Performance measurement; Principles; Radial basis function; Rank tests; Statistical analysis; Statistical tests; Support vector machines; Training |
title | A Dataset Centric Feature Selection and Stacked Model to Detect Breast Cancer |
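The abstract above describes a genetic-algorithm feature-selection stage whose quality criterion is classification performance under 10-fold cross-validation. The sketch below is a minimal, self-contained illustration of that idea in Python with scikit-learn; it is not the authors' implementation. The dataset (scikit-learn's built-in Wisconsin diagnostic breast cancer data), the random-forest fitness classifier, and all GA settings (population size, number of generations, mutation rate) are assumptions chosen for brevity.

```python
# Minimal GA over binary feature masks. Fitness = mean 10-fold CV accuracy of a
# random forest restricted to the selected columns. The dataset and all GA
# settings are illustrative stand-ins, not the paper's choices.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)     # stand-in for the paper's BC dataset
n_features = X.shape[1]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

def fitness(mask):
    """Mean 10-fold CV accuracy using only the features where mask is True."""
    if not mask.any():
        return 0.0
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=cv).mean()

# Initial population of random feature subsets (kept small so the sketch runs quickly).
pop = rng.random((12, n_features)) < 0.5
for _ in range(8):
    scores = np.array([fitness(m) for m in pop])
    pop = pop[np.argsort(scores)[::-1]]        # best masks first
    survivors = pop[:6]                        # truncation selection
    children = []
    while len(children) < 6:
        a, b = survivors[rng.integers(6, size=2)]
        cut = rng.integers(1, n_features)      # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < 0.05   # bit-flip mutation
        children.append(np.where(flip, ~child, child))
    pop = np.vstack([survivors, np.array(children)])

scores = np.array([fitness(m) for m in pop])
best = pop[int(np.argmax(scores))]
print(f"selected {int(best.sum())} of {n_features} features, "
      f"CV accuracy {scores.max():.3f}")
```

Once the GA has converged, the reduced matrix `X[:, best]` would feed the ensemble stage sketched next.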
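The abstract also describes a stacked ensemble over LR, DT, SVM (RBF kernel), naive Bayes and random-forest base learners, an AdaBoost variant, and an evaluation using 50-50, 66-34 and 80-20 splits plus 10-fold cross-validation with accuracy, sensitivity (detection rate), specificity, ROC AUC and kappa. The sketch below mirrors that protocol with scikit-learn defaults; the logistic-regression meta-learner, all hyper-parameters, and the stand-in dataset are assumptions, and a Wilcoxon signed-rank test could additionally be run on per-fold scores via scipy.stats.wilcoxon.

```python
# Stacked ensemble of the base learners named in the abstract, plus an AdaBoost
# variant, evaluated under several train/test splits and 10-fold CV.
# Hyper-parameters and the dataset are illustrative stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, cohen_kappa_score, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)       # stand-in dataset; label 1 = benign

base_learners = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]
stacked = StackingClassifier(estimators=base_learners,
                             final_estimator=LogisticRegression(max_iter=1000))
boosted = AdaBoostClassifier(n_estimators=200, random_state=0)

def report(model, test_size):
    """Fit on one stratified split and return the metrics listed in the abstract."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    return {
        "accuracy": accuracy_score(y_te, pred),
        "sensitivity": tp / (tp + fn),           # a.k.a. detection rate / recall
        "specificity": tn / (tn + fp),
        "roc_auc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
        "kappa": cohen_kappa_score(y_te, pred),
    }

for test_size in (0.50, 0.34, 0.20):             # 50-50, 66-34 and 80-20 splits
    print(f"test fraction {test_size}:", report(stacked, test_size))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold CV accuracy, stacked :", cross_val_score(stacked, X, y, cv=cv).mean())
print("10-fold CV accuracy, AdaBoost:", cross_val_score(boosted, X, y, cv=cv).mean())
```

Stratified splits are used here because the abstract emphasizes credibility on unbalanced data; swapping in the GA-selected feature subset from the previous sketch would complete the dataset-centric pipeline it describes.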