A Dataset Centric Feature Selection and Stacked Model to Detect Breast Cancer

The World Health Organisation has declared breast cancer (BC) the most frequent cancer among women, accounting for 15 percent of all cancer deaths. Its accurate prediction is of the utmost significance, as it not only prevents deaths but also stops mistreatment. The conventional way of diagnosis includes...
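The preview above contrasts tumour-size-based diagnosis with machine-learning prediction on richer feature sets. As a minimal, hypothetical illustration of that contrast (not taken from the paper), the sketch below scores a plain logistic-regression classifier once on size-related measurements alone and once on the full feature set; the scikit-learn copy of the Wisconsin breast cancer data is assumed as a stand-in, since this record does not specify the paper's exact dataset or preprocessing.

```python
# Hypothetical baseline: size-only features vs. the full feature set.
# The scikit-learn WDBC dataset is an assumed stand-in for the paper's data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Tumour-size proxies for the "conventional" diagnosis mentioned above.
size_cols = ["mean radius", "mean perimeter", "mean area"]
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

acc_size = cross_val_score(clf, X[size_cols], y, cv=10).mean()
acc_full = cross_val_score(clf, X, y, cv=10).mean()
print(f"10-fold CV accuracy, size features only: {acc_size:.3f}")
print(f"10-fold CV accuracy, all 30 features:    {acc_full:.3f}")
```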

Full description

Saved in:
Bibliographic Details
Published in: International journal of intelligent systems and applications 2021-08, Vol.13 (4), p.24-37
Main authors: Chaudhuri, Avijit Kumar, Banerjee, Dilip K., Das, Anirban
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 37
container_issue 4
container_start_page 24
container_title International journal of intelligent systems and applications
container_volume 13
creator Chaudhuri, Avijit Kumar
Banerjee, Dilip K.
Das, Anirban
description The World Health Organisation has declared breast cancer (BC) the most frequent cancer among women, accounting for 15 percent of all cancer deaths. Its accurate prediction is of the utmost significance, as it not only prevents deaths but also stops mistreatment. The conventional way of diagnosis includes the estimation of the tumor size as a sign of plausible cancer. Machine learning (ML) techniques have shown effectiveness in predicting disease. However, ML methods have been method centric rather than dataset centric. In this paper, the authors introduce a dataset centric approach (DCA) deploying a genetic algorithm (GA) method to identify the features and a learning ensemble classifier algorithm to predict using the right features. AdaBoost is such an approach: it trains the model by assigning weights to individual records rather than experimenting on the splitting of datasets alone, and performs hyper-parameter optimization. The authors simulate the results by varying base classifiers, i.e., using logistic regression (LR), decision tree (DT), support vector machine (SVM), naive Bayes (NB), and random forest (RF), and 10-fold cross-validation with different splits of the dataset as training and testing. The proposed DCA model with RF and 10-fold cross-validation demonstrated its potential with almost 100% performance in the classification results, which no prior research had reported so far. The DCA satisfies the underlying principles of data mining: the principle of parsimony, the principle of inclusion, the principle of discrimination, and the principle of optimality. This DCA is a democratic and unbiased ensemble approach, as it allows all features and methods to compete at the start but filters out the most reliable chain (of steps and combinations) that gives the highest accuracy. With fewer features and splits of 50-50, 66-34, and 10-fold cross-validation, the stacked model achieves 97% accuracy. These values and the reduction of features improve upon prior research works. Further, the proposed classifier is compared with some state-of-the-art machine-learning classifiers, namely random forest, naive Bayes, support-vector machine with radial basis function kernel, and decision tree. For testing the classifiers, different performance metrics have been employed – accuracy, detection rate, sensitivity, specificity, receiver operating characteristic, area under the curve, and some statistical tests such as the Wilcoxon signed-rank test and kappa statistics – to check the strength of the proposed DCA classifier. Various splits of training and testing data – namely, 50–50%, 66–34%, 80–20% and 10-fold cross-validation – have been incorporated in this research to test the credibility of the classification models in handling the unbalanced data. Finally, the proposed DCA model demonstrated its potential with almost 100% performance in the classification results. The output results have also been compared with other research on the same dataset, where the proposed classifiers were found to be best across all the performance dimensions.
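The approach described above pairs GA-based feature selection with a boosted ensemble evaluated under several train/test protocols. The following sketch shows one way such a loop could look: a small genetic algorithm over binary feature masks whose fitness is the 10-fold cross-validated accuracy of an AdaBoost classifier on the selected columns. The dataset, population size, genetic operators, and AdaBoost configuration are all illustrative assumptions; this is not the authors' implementation.

```python
# Illustrative GA feature selection + AdaBoost evaluation (assumed setup,
# not the authors' code). Fitness of a binary feature mask = mean 10-fold
# cross-validated accuracy of AdaBoost trained on the selected columns.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
n_features = X.shape[1]

def fitness(mask):
    # Empty masks are invalid; everything else is scored by CV accuracy.
    if mask.sum() == 0:
        return 0.0
    model = AdaBoostClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X[:, mask.astype(bool)], y, cv=10).mean()

def evolve(pop_size=12, generations=8, p_mut=0.05):
    pop = rng.integers(0, 2, size=(pop_size, n_features))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        # Tournament selection: the better of two random individuals survives.
        parents = np.array([
            pop[i] if scores[i] >= scores[j] else pop[j]
            for i, j in rng.integers(0, pop_size, size=(pop_size, 2))])
        # Single-point crossover on consecutive parent pairs.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n_features)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        # Bit-flip mutation.
        flips = rng.random(children.shape) < p_mut
        pop = np.where(flips, 1 - children, children)
    scores = np.array([fitness(ind) for ind in pop])
    best = scores.argmax()
    return pop[best], scores[best]

best_mask, best_acc = evolve()
print(f"selected {int(best_mask.sum())}/{n_features} features, "
      f"10-fold CV accuracy {best_acc:.3f}")
```

A stacked model of the kind the abstract mentions could then be fitted on the selected columns (for example with scikit-learn's StackingClassifier) and compared across the 50-50, 66-34, 80-20, and 10-fold protocols; those comparisons are left out of this sketch.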
doi_str_mv 10.5815/ijisa.2021.04.03
format Article
publisher Modern Education and Computer Science Press, Hong Kong
affiliation Research Scholar, Department of Computer Application, SEACOM SKILLS UNIVERSITY, Kendradangal, Bolpur, Dist:Birbhum, PIN - 731 236, West Bengal
fulltext fulltext
identifier ISSN: 2074-904X
ispartof International journal of intelligent systems and applications, 2021-08, Vol.13 (4), p.24-37
issn 2074-904X
2074-9058
language eng
recordid cdi_proquest_journals_2798550521
source EZB-FREE-00999 freely available EZB journals
subjects Accuracy
Breast cancer
Classification
Classifiers
Data mining
Datasets
Decision trees
Fatalities
Genetic algorithms
Identification methods
Kernel functions
Machine learning
Model accuracy
Optimization
Performance measurement
Principles
Radial basis function
Rank tests
Statistical analysis
Statistical tests
Support vector machines
Training
title A Dataset Centric Feature Selection and Stacked Model to Detect Breast Cancer