An Efficient and Effective Model to Handle Missing Data in Classification

Missing data is one of the most important causes of reduced classification accuracy. Many real datasets suffer from missing values, especially in the medical sciences. Imputation is a common way to deal with incomplete datasets. There are various imputation methods that can be applied, and the choi...

Bibliographic details
Published in: BioMed research international 2020, Vol.2020 (2020), p.1-11
Main authors: Mehrabani-Zeinabad, Kamran, Ayatollahi, Seyyed Mohammad Taghi, Doostfatemeh, Marziyeh
Format: Article
Language: eng
Online access: Full text
container_end_page 11
container_issue 2020
container_start_page 1
container_title BioMed research international
container_volume 2020
creator Mehrabani-Zeinabad, Kamran
Ayatollahi, Seyyed Mohammad Taghi
Doostfatemeh, Marziyeh
description Missing data is one of the most important causes of reduced classification accuracy. Many real datasets suffer from missing values, especially in the medical sciences. Imputation is a common way to deal with incomplete datasets. Various imputation methods can be applied, and the choice of the best method depends on dataset conditions such as sample size, percentage of missing values, and missingness mechanism. Therefore, a better solution is to classify incomplete datasets without imputation and without any loss of information. The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” (MIA) approach to address its inefficiency in handling missingness. The implementation of MIA-within-BART is named “BART.m”. As the abilities of BART.m have not been investigated in the classification of incomplete datasets, this simulation-based study aimed to provide such a resource. The results indicate that BART.m can be used even for datasets with 90 percent missing values and, more importantly, that it identifies irrelevant variables and removes them on its own. BART.m outperforms common models for classification with incomplete data in terms of accuracy and computational time. Based on these properties, BART.m is a high-accuracy model for classifying incomplete datasets that avoids any assumptions and preprocessing steps.
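The “Missingness Incorporated in Attributes” idea named in the description can be illustrated with a small sketch. Under MIA, a tree split on a variable may send missing values to the left child, send them to the right child, or separate missing from observed values outright, so no imputation is needed. The Python below is only an illustration under stated assumptions: the function names (`is_missing`, `mia_goes_left`, `best_mia_split`) are hypothetical, and the greedy Gini search on a single variable stands in for the Bayesian sampling of split rules that BART-type models use. It is not the BART.m implementation.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mia_goes_left(value, threshold, rule):
    """Route one value through an MIA-style split.

    rule is one of:
      "missing_left"        : x <= threshold OR x missing goes left
      "missing_right"       : x <= threshold goes left, missing goes right
      "missing_vs_observed" : missing goes left, every observed value goes right
    """
    if rule == "missing_vs_observed":
        return is_missing(value)
    if is_missing(value):
        return rule == "missing_left"
    return value <= threshold


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_mia_split(values, labels):
    """Greedy stand-in for split selection: lowest weighted Gini wins.

    Returns (score, rule, threshold), or None if no split separates the data.
    """
    observed = sorted({v for v in values if not is_missing(v)})
    candidates = [("missing_vs_observed", None)]
    for c in observed:
        candidates.append(("missing_left", c))
        candidates.append(("missing_right", c))

    best = None
    for rule, c in candidates:
        left = [y for v, y in zip(values, labels) if mia_goes_left(v, c, rule)]
        right = [y for v, y in zip(values, labels) if not mia_goes_left(v, c, rule)]
        if not left or not right:
            continue  # a split must put data on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, rule, c)
    return best


# Toy data in which missingness itself carries the class signal.
x = [1.0, 2.0, float("nan"), float("nan"), 5.0, 6.0]
y = [0, 0, 1, 1, 0, 0]
result = best_mia_split(x, y)
```

In this toy dataset the class label coincides with whether the value is missing, so a split on missingness alone separates the classes, which is the kind of information an imputation step would discard.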
doi_str_mv 10.1155/2020/8810143
format Article
fullrecord <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_7710403</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A697108258</galeid><sourcerecordid>A697108258</sourcerecordid><originalsourceid>FETCH-LOGICAL-c476t-42436fd2373e7e4bcf92653bb585783ba1a49c9404475e9e224231f9d754288f3</originalsourceid><addsrcrecordid>eNqFkctLJDEQxoOsrKLe9iwNe1lYR_N-XIRhfILLXnbPIZ2ujJGeRDs9yv73pplBWS_mklTlV19V8SH0jeBTQoQ4o5jiM60JJpztoH3KCJ9JwsmXtzdje-iolAdcjyYSG_kV7TFGjdFK76PbeWouQ4g-Qhobl7opAj_GZ2h-5Q76ZszNTc33NY6lxLRsLtzompiaRe9qota6MeZ0iHaD6wscbe8D9Pfq8s_iZnb3-_p2Mb-bea7kOOOUMxk6yhQDBbz1wVApWNsKLZRmrSOOG2845lwJMEApr5sE0ynBqdaBHaDzje7jul1B5-vcg-vt4xBXbvhns4v2_58U7-0yP1ulCOaYVYEfW4EhP62hjHYVi4e-dwnyuljKpcEGK4Mr-v0D-pDXQ6rrTZQSWDDO3qml68HGFHLt6ydRO5emdtVU6EqdbCg_5FIGCG8jE2wnM-1kpt2aWfGfG_w-ps69xM_o4w0NlYHg3mnClJScvQIJjKLK</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2467505343</pqid></control><display><type>article</type><title>An Efficient and Effective Model to Handle Missing Data in Classification</title><source>PubMed Central Open Access</source><source>Wiley Online Library Open Access</source><source>PubMed Central</source><source>Alma/SFX Local Collection</source><creator>Mehrabani-Zeinabad, Kamran ; Ayatollahi, Seyyed Mohammad Taghi ; Doostfatemeh, Marziyeh</creator><contributor>Morales-Quintana, Luis ; Luis Morales-Quintana</contributor><creatorcontrib>Mehrabani-Zeinabad, Kamran ; Ayatollahi, Seyyed Mohammad Taghi ; Doostfatemeh, Marziyeh ; Morales-Quintana, Luis ; Luis Morales-Quintana</creatorcontrib><description>Missing data is one of the most important causes in reduction of classification accuracy. Many real datasets suffer from missing values, especially in medical sciences. Imputation is a common way to deal with incomplete datasets. 
Various imputation methods can be applied, and the choice of the best method depends on dataset conditions such as sample size, percentage of missing values, and missingness mechanism. Therefore, a better solution is to classify incomplete datasets without imputation and without any loss of information. The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” (MIA) approach to address its inefficiency in handling missingness. The implementation of MIA-within-BART is named “BART.m”. As the abilities of BART.m have not been investigated in the classification of incomplete datasets, this simulation-based study aimed to provide such a resource. The results indicate that BART.m can be used even for datasets with 90 percent missing values and, more importantly, that it identifies irrelevant variables and removes them on its own. BART.m outperforms common models for classification with incomplete data in terms of accuracy and computational time. 
Based on the revealed properties, it can be said that BART.m is a high accuracy model in classification of incomplete datasets which avoids any assumptions and preprocess steps.</description><identifier>ISSN: 2314-6133</identifier><identifier>EISSN: 2314-6141</identifier><identifier>DOI: 10.1155/2020/8810143</identifier><identifier>PMID: 33299878</identifier><language>eng</language><publisher>Cairo, Egypt: Hindawi Publishing Corporation</publisher><subject>Accuracy ; Algorithms ; Bayesian analysis ; Biomedical research ; Classification ; Computer applications ; Computing time ; Data mining ; Datasets ; Decision tree ; Decision trees ; Information management ; Machine learning ; Mathematical models ; Methods ; Missing data ; Model accuracy ; Regression analysis ; Regression models ; Variables</subject><ispartof>BioMed research international, 2020, Vol.2020 (2020), p.1-11</ispartof><rights>Copyright © 2020 Kamran Mehrabani-Zeinabad et al.</rights><rights>COPYRIGHT 2020 John Wiley &amp; Sons, Inc.</rights><rights>Copyright © 2020 Kamran Mehrabani-Zeinabad et al. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. https://creativecommons.org/licenses/by/4.0</rights><rights>Copyright © 2020 Kamran Mehrabani-Zeinabad et al. 
2020</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c476t-42436fd2373e7e4bcf92653bb585783ba1a49c9404475e9e224231f9d754288f3</citedby><cites>FETCH-LOGICAL-c476t-42436fd2373e7e4bcf92653bb585783ba1a49c9404475e9e224231f9d754288f3</cites><orcidid>0000-0002-5691-9628 ; 0000-0003-3073-2600 ; 0000-0001-9619-6607</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7710403/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7710403/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,725,778,782,883,4012,27906,27907,27908,53774,53776</link.rule.ids></links><search><contributor>Morales-Quintana, Luis</contributor><contributor>Luis Morales-Quintana</contributor><creatorcontrib>Mehrabani-Zeinabad, Kamran</creatorcontrib><creatorcontrib>Ayatollahi, Seyyed Mohammad Taghi</creatorcontrib><creatorcontrib>Doostfatemeh, Marziyeh</creatorcontrib><title>An Efficient and Effective Model to Handle Missing Data in Classification</title><title>BioMed research international</title><description>Missing data is one of the most important causes in reduction of classification accuracy. Many real datasets suffer from missing values, especially in medical sciences. Imputation is a common way to deal with incomplete datasets. There are various imputation methods that can be applied, and the choice of the best method depends on the dataset conditions such as sample size, missing percent, and missing mechanism. Therefore, the better solution is to classify incomplete datasets without imputation and without any loss of information. 
The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” approach to solve its inefficiency in handling the missingness problem. Implementation of MIA-within-BART is named “BART.m”. As the abilities of BART.m are not investigated in classification of incomplete datasets, this simulation-based study aimed to provide such resource. The results indicate that BART.m can be used even for datasets with 90 missing present and more importantly, it diagnoses the irrelevant variables and removes them by its own. BART.m outperforms common models for classification with incomplete data, according to accuracy and computational time. Based on the revealed properties, it can be said that BART.m is a high accuracy model in classification of incomplete datasets which avoids any assumptions and preprocess steps.</description><subject>Accuracy</subject><subject>Algorithms</subject><subject>Bayesian analysis</subject><subject>Biomedical research</subject><subject>Classification</subject><subject>Computer applications</subject><subject>Computing time</subject><subject>Data mining</subject><subject>Datasets</subject><subject>Decision tree</subject><subject>Decision trees</subject><subject>Information management</subject><subject>Machine learning</subject><subject>Mathematical models</subject><subject>Methods</subject><subject>Missing data</subject><subject>Model accuracy</subject><subject>Regression analysis</subject><subject>Regression 
models</subject><subject>Variables</subject><issn>2314-6133</issn><issn>2314-6141</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RHX</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNqFkctLJDEQxoOsrKLe9iwNe1lYR_N-XIRhfILLXnbPIZ2ujJGeRDs9yv73pplBWS_mklTlV19V8SH0jeBTQoQ4o5jiM60JJpztoH3KCJ9JwsmXtzdje-iolAdcjyYSG_kV7TFGjdFK76PbeWouQ4g-Qhobl7opAj_GZ2h-5Q76ZszNTc33NY6lxLRsLtzompiaRe9qota6MeZ0iHaD6wscbe8D9Pfq8s_iZnb3-_p2Mb-bea7kOOOUMxk6yhQDBbz1wVApWNsKLZRmrSOOG2845lwJMEApr5sE0ynBqdaBHaDzje7jul1B5-vcg-vt4xBXbvhns4v2_58U7-0yP1ulCOaYVYEfW4EhP62hjHYVi4e-dwnyuljKpcEGK4Mr-v0D-pDXQ6rrTZQSWDDO3qml68HGFHLt6ydRO5emdtVU6EqdbCg_5FIGCG8jE2wnM-1kpt2aWfGfG_w-ps69xM_o4w0NlYHg3mnClJScvQIJjKLK</recordid><startdate>2020</startdate><enddate>2020</enddate><creator>Mehrabani-Zeinabad, Kamran</creator><creator>Ayatollahi, Seyyed Mohammad Taghi</creator><creator>Doostfatemeh, Marziyeh</creator><general>Hindawi Publishing Corporation</general><general>Hindawi</general><general>John Wiley &amp; Sons, Inc</general><general>Hindawi 
Limited</general><scope>ADJCN</scope><scope>AHFXO</scope><scope>RHU</scope><scope>RHW</scope><scope>RHX</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7QL</scope><scope>7QO</scope><scope>7T7</scope><scope>7TK</scope><scope>7U7</scope><scope>7U9</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>C1K</scope><scope>CCPQU</scope><scope>CWDGH</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>H94</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>LK8</scope><scope>M0S</scope><scope>M1P</scope><scope>M7N</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-5691-9628</orcidid><orcidid>https://orcid.org/0000-0003-3073-2600</orcidid><orcidid>https://orcid.org/0000-0001-9619-6607</orcidid></search><sort><creationdate>2020</creationdate><title>An Efficient and Effective Model to Handle Missing Data in Classification</title><author>Mehrabani-Zeinabad, Kamran ; Ayatollahi, Seyyed Mohammad Taghi ; Doostfatemeh, Marziyeh</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c476t-42436fd2373e7e4bcf92653bb585783ba1a49c9404475e9e224231f9d754288f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Accuracy</topic><topic>Algorithms</topic><topic>Bayesian analysis</topic><topic>Biomedical research</topic><topic>Classification</topic><topic>Computer applications</topic><topic>Computing 
time</topic><topic>Data mining</topic><topic>Datasets</topic><topic>Decision tree</topic><topic>Decision trees</topic><topic>Information management</topic><topic>Machine learning</topic><topic>Mathematical models</topic><topic>Methods</topic><topic>Missing data</topic><topic>Model accuracy</topic><topic>Regression analysis</topic><topic>Regression models</topic><topic>Variables</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Mehrabani-Zeinabad, Kamran</creatorcontrib><creatorcontrib>Ayatollahi, Seyyed Mohammad Taghi</creatorcontrib><creatorcontrib>Doostfatemeh, Marziyeh</creatorcontrib><collection>الدوريات العلمية والإحصائية - e-Marefa Academic and Statistical Periodicals</collection><collection>معرفة - المحتوى العربي الأكاديمي المتكامل - e-Marefa Academic Complete</collection><collection>Hindawi Publishing Complete</collection><collection>Hindawi Publishing Subscription Journals</collection><collection>Hindawi Publishing Open Access</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Bacteriology Abstracts (Microbiology B)</collection><collection>Biotechnology Research Abstracts</collection><collection>Industrial and Applied Microbiology Abstracts (Microbiology A)</collection><collection>Neurosciences Abstracts</collection><collection>Toxicology Abstracts</collection><collection>Virology and AIDS Abstracts</collection><collection>Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central 
(Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>Natural Science Collection (ProQuest)</collection><collection>Environmental Sciences and Pollution Management</collection><collection>ProQuest One Community College</collection><collection>Middle East &amp; Africa Database</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>AIDS and Cancer Research Abstracts</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Algology Mycology and Protozoology Abstracts (Microbiology C)</collection><collection>Biological Science Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>MEDLINE - 
Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>BioMed research international</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Mehrabani-Zeinabad, Kamran</au><au>Ayatollahi, Seyyed Mohammad Taghi</au><au>Doostfatemeh, Marziyeh</au><au>Morales-Quintana, Luis</au><au>Luis Morales-Quintana</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>An Efficient and Effective Model to Handle Missing Data in Classification</atitle><jtitle>BioMed research international</jtitle><date>2020</date><risdate>2020</risdate><volume>2020</volume><issue>2020</issue><spage>1</spage><epage>11</epage><pages>1-11</pages><issn>2314-6133</issn><eissn>2314-6141</eissn><abstract>Missing data is one of the most important causes in reduction of classification accuracy. Many real datasets suffer from missing values, especially in medical sciences. Imputation is a common way to deal with incomplete datasets. There are various imputation methods that can be applied, and the choice of the best method depends on the dataset conditions such as sample size, missing percent, and missing mechanism. Therefore, the better solution is to classify incomplete datasets without imputation and without any loss of information. The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” approach to solve its inefficiency in handling the missingness problem. Implementation of MIA-within-BART is named “BART.m”. As the abilities of BART.m are not investigated in classification of incomplete datasets, this simulation-based study aimed to provide such resource. The results indicate that BART.m can be used even for datasets with 90 missing present and more importantly, it diagnoses the irrelevant variables and removes them by its own. 
BART.m outperforms common models for classification with incomplete data, according to accuracy and computational time. Based on the revealed properties, it can be said that BART.m is a high accuracy model in classification of incomplete datasets which avoids any assumptions and preprocess steps.</abstract><cop>Cairo, Egypt</cop><pub>Hindawi Publishing Corporation</pub><pmid>33299878</pmid><doi>10.1155/2020/8810143</doi><tpages>11</tpages><orcidid>https://orcid.org/0000-0002-5691-9628</orcidid><orcidid>https://orcid.org/0000-0003-3073-2600</orcidid><orcidid>https://orcid.org/0000-0001-9619-6607</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2314-6133
ispartof BioMed research international, 2020, Vol.2020 (2020), p.1-11
issn 2314-6133
2314-6141
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_7710403
source PubMed Central Open Access; Wiley Online Library Open Access; PubMed Central; Alma/SFX Local Collection
subjects Accuracy
Algorithms
Bayesian analysis
Biomedical research
Classification
Computer applications
Computing time
Data mining
Datasets
Decision tree
Decision trees
Information management
Machine learning
Mathematical models
Methods
Missing data
Model accuracy
Regression analysis
Regression models
Variables
title An Efficient and Effective Model to Handle Missing Data in Classification
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T07%3A33%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20Efficient%20and%20Effective%20Model%20to%20Handle%20Missing%20Data%20in%20Classification&rft.jtitle=BioMed%20research%20international&rft.au=Mehrabani-Zeinabad,%20Kamran&rft.date=2020&rft.volume=2020&rft.issue=2020&rft.spage=1&rft.epage=11&rft.pages=1-11&rft.issn=2314-6133&rft.eissn=2314-6141&rft_id=info:doi/10.1155/2020/8810143&rft_dat=%3Cgale_pubme%3EA697108258%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2467505343&rft_id=info:pmid/33299878&rft_galeid=A697108258&rfr_iscdi=true