An Efficient and Effective Model to Handle Missing Data in Classification
Missing data is one of the most important causes in reduction of classification accuracy. Many real datasets suffer from missing values, especially in medical sciences. Imputation is a common way to deal with incomplete datasets. There are various imputation methods that can be applied, and the choi...
Published in: | BioMed research international 2020, Vol.2020 (2020), p.1-11 |
---|---|
Main authors: | Mehrabani-Zeinabad, Kamran ; Ayatollahi, Seyyed Mohammad Taghi ; Doostfatemeh, Marziyeh |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 11 |
---|---|
container_issue | 2020 |
container_start_page | 1 |
container_title | BioMed research international |
container_volume | 2020 |
creator | Mehrabani-Zeinabad, Kamran ; Ayatollahi, Seyyed Mohammad Taghi ; Doostfatemeh, Marziyeh |
description | Missing data is one of the most important causes of reduced classification accuracy. Many real datasets suffer from missing values, especially in the medical sciences. Imputation is a common way to deal with incomplete datasets. There are various imputation methods that can be applied, and the choice of the best method depends on dataset conditions such as sample size, percentage of missing values, and missingness mechanism. Therefore, a better solution is to classify incomplete datasets without imputation and without any loss of information. The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” (MIA) approach to solve its inefficiency in handling the missingness problem. The implementation of MIA-within-BART is named “BART.m”. As the abilities of BART.m have not been investigated in the classification of incomplete datasets, this simulation-based study aimed to provide such a resource. The results indicate that BART.m can be used even for datasets with 90 percent missing values and, more importantly, that it identifies irrelevant variables and removes them on its own. BART.m outperforms common models for classification with incomplete data in terms of accuracy and computational time. Based on these properties, BART.m is a high-accuracy model for classifying incomplete datasets that avoids any assumptions and preprocessing steps. |
doi_str_mv | 10.1155/2020/8810143 |
format | Article |
fullrecord | <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_7710403</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A697108258</galeid><sourcerecordid>A697108258</sourcerecordid><originalsourceid>FETCH-LOGICAL-c476t-42436fd2373e7e4bcf92653bb585783ba1a49c9404475e9e224231f9d754288f3</originalsourceid><addsrcrecordid>eNqFkctLJDEQxoOsrKLe9iwNe1lYR_N-XIRhfILLXnbPIZ2ujJGeRDs9yv73pplBWS_mklTlV19V8SH0jeBTQoQ4o5jiM60JJpztoH3KCJ9JwsmXtzdje-iolAdcjyYSG_kV7TFGjdFK76PbeWouQ4g-Qhobl7opAj_GZ2h-5Q76ZszNTc33NY6lxLRsLtzompiaRe9qota6MeZ0iHaD6wscbe8D9Pfq8s_iZnb3-_p2Mb-bea7kOOOUMxk6yhQDBbz1wVApWNsKLZRmrSOOG2845lwJMEApr5sE0ynBqdaBHaDzje7jul1B5-vcg-vt4xBXbvhns4v2_58U7-0yP1ulCOaYVYEfW4EhP62hjHYVi4e-dwnyuljKpcEGK4Mr-v0D-pDXQ6rrTZQSWDDO3qml68HGFHLt6ydRO5emdtVU6EqdbCg_5FIGCG8jE2wnM-1kpt2aWfGfG_w-ps69xM_o4w0NlYHg3mnClJScvQIJjKLK</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2467505343</pqid></control><display><type>article</type><title>An Efficient and Effective Model to Handle Missing Data in Classification</title><source>PubMed Central Open Access</source><source>Wiley Online Library Open Access</source><source>PubMed Central</source><source>Alma/SFX Local Collection</source><creator>Mehrabani-Zeinabad, Kamran ; Ayatollahi, Seyyed Mohammad Taghi ; Doostfatemeh, Marziyeh</creator><contributor>Morales-Quintana, Luis ; Luis Morales-Quintana</contributor><creatorcontrib>Mehrabani-Zeinabad, Kamran ; Ayatollahi, Seyyed Mohammad Taghi ; Doostfatemeh, Marziyeh ; Morales-Quintana, Luis ; Luis Morales-Quintana</creatorcontrib><description>Missing data is one of the most important causes in reduction of classification accuracy. Many real datasets suffer from missing values, especially in medical sciences. Imputation is a common way to deal with incomplete datasets. 
There are various imputation methods that can be applied, and the choice of the best method depends on dataset conditions such as sample size, percentage of missing values, and missingness mechanism. Therefore, a better solution is to classify incomplete datasets without imputation and without any loss of information. The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” (MIA) approach to solve its inefficiency in handling the missingness problem. The implementation of MIA-within-BART is named “BART.m”. As the abilities of BART.m have not been investigated in the classification of incomplete datasets, this simulation-based study aimed to provide such a resource. The results indicate that BART.m can be used even for datasets with 90 percent missing values and, more importantly, that it identifies irrelevant variables and removes them on its own. BART.m outperforms common models for classification with incomplete data in terms of accuracy and computational time. 
Based on these properties, BART.m is a high-accuracy model for classifying incomplete datasets that avoids any assumptions and preprocessing steps.</description><identifier>ISSN: 2314-6133</identifier><identifier>EISSN: 2314-6141</identifier><identifier>DOI: 10.1155/2020/8810143</identifier><identifier>PMID: 33299878</identifier><language>eng</language><publisher>Cairo, Egypt: Hindawi Publishing Corporation</publisher><subject>Accuracy ; Algorithms ; Bayesian analysis ; Biomedical research ; Classification ; Computer applications ; Computing time ; Data mining ; Datasets ; Decision tree ; Decision trees ; Information management ; Machine learning ; Mathematical models ; Methods ; Missing data ; Model accuracy ; Regression analysis ; Regression models ; Variables</subject><ispartof>BioMed research international, 2020, Vol.2020 (2020), p.1-11</ispartof><rights>Copyright © 2020 Kamran Mehrabani-Zeinabad et al.</rights><rights>COPYRIGHT 2020 John Wiley & Sons, Inc.</rights><rights>Copyright © 2020 Kamran Mehrabani-Zeinabad et al. This is an open access article distributed under the Creative Commons Attribution License (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. https://creativecommons.org/licenses/by/4.0</rights><rights>Copyright © 2020 Kamran Mehrabani-Zeinabad et al. 
2020</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c476t-42436fd2373e7e4bcf92653bb585783ba1a49c9404475e9e224231f9d754288f3</citedby><cites>FETCH-LOGICAL-c476t-42436fd2373e7e4bcf92653bb585783ba1a49c9404475e9e224231f9d754288f3</cites><orcidid>0000-0002-5691-9628 ; 0000-0003-3073-2600 ; 0000-0001-9619-6607</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7710403/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7710403/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,725,778,782,883,4012,27906,27907,27908,53774,53776</link.rule.ids></links><search><contributor>Morales-Quintana, Luis</contributor><contributor>Luis Morales-Quintana</contributor><creatorcontrib>Mehrabani-Zeinabad, Kamran</creatorcontrib><creatorcontrib>Ayatollahi, Seyyed Mohammad Taghi</creatorcontrib><creatorcontrib>Doostfatemeh, Marziyeh</creatorcontrib><title>An Efficient and Effective Model to Handle Missing Data in Classification</title><title>BioMed research international</title><description>Missing data is one of the most important causes in reduction of classification accuracy. Many real datasets suffer from missing values, especially in medical sciences. Imputation is a common way to deal with incomplete datasets. There are various imputation methods that can be applied, and the choice of the best method depends on the dataset conditions such as sample size, missing percent, and missing mechanism. Therefore, the better solution is to classify incomplete datasets without imputation and without any loss of information. 
The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” (MIA) approach to solve its inefficiency in handling the missingness problem. The implementation of MIA-within-BART is named “BART.m”. As the abilities of BART.m have not been investigated in the classification of incomplete datasets, this simulation-based study aimed to provide such a resource. The results indicate that BART.m can be used even for datasets with 90 percent missing values and, more importantly, that it identifies irrelevant variables and removes them on its own. BART.m outperforms common models for classification with incomplete data in terms of accuracy and computational time. Based on these properties, BART.m is a high-accuracy model for classifying incomplete datasets that avoids any assumptions and preprocessing steps.</description><subject>Accuracy</subject><subject>Algorithms</subject><subject>Bayesian analysis</subject><subject>Biomedical research</subject><subject>Classification</subject><subject>Computer applications</subject><subject>Computing time</subject><subject>Data mining</subject><subject>Datasets</subject><subject>Decision tree</subject><subject>Decision trees</subject><subject>Information management</subject><subject>Machine learning</subject><subject>Mathematical models</subject><subject>Methods</subject><subject>Missing data</subject><subject>Model accuracy</subject><subject>Regression analysis</subject><subject>Regression 
models</subject><subject>Variables</subject><issn>2314-6133</issn><issn>2314-6141</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RHX</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNqFkctLJDEQxoOsrKLe9iwNe1lYR_N-XIRhfILLXnbPIZ2ujJGeRDs9yv73pplBWS_mklTlV19V8SH0jeBTQoQ4o5jiM60JJpztoH3KCJ9JwsmXtzdje-iolAdcjyYSG_kV7TFGjdFK76PbeWouQ4g-Qhobl7opAj_GZ2h-5Q76ZszNTc33NY6lxLRsLtzompiaRe9qota6MeZ0iHaD6wscbe8D9Pfq8s_iZnb3-_p2Mb-bea7kOOOUMxk6yhQDBbz1wVApWNsKLZRmrSOOG2845lwJMEApr5sE0ynBqdaBHaDzje7jul1B5-vcg-vt4xBXbvhns4v2_58U7-0yP1ulCOaYVYEfW4EhP62hjHYVi4e-dwnyuljKpcEGK4Mr-v0D-pDXQ6rrTZQSWDDO3qml68HGFHLt6ydRO5emdtVU6EqdbCg_5FIGCG8jE2wnM-1kpt2aWfGfG_w-ps69xM_o4w0NlYHg3mnClJScvQIJjKLK</recordid><startdate>2020</startdate><enddate>2020</enddate><creator>Mehrabani-Zeinabad, Kamran</creator><creator>Ayatollahi, Seyyed Mohammad Taghi</creator><creator>Doostfatemeh, Marziyeh</creator><general>Hindawi Publishing Corporation</general><general>Hindawi</general><general>John Wiley & Sons, Inc</general><general>Hindawi 
Limited</general><scope>ADJCN</scope><scope>AHFXO</scope><scope>RHU</scope><scope>RHW</scope><scope>RHX</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7QL</scope><scope>7QO</scope><scope>7T7</scope><scope>7TK</scope><scope>7U7</scope><scope>7U9</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>C1K</scope><scope>CCPQU</scope><scope>CWDGH</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>H94</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>LK8</scope><scope>M0S</scope><scope>M1P</scope><scope>M7N</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-5691-9628</orcidid><orcidid>https://orcid.org/0000-0003-3073-2600</orcidid><orcidid>https://orcid.org/0000-0001-9619-6607</orcidid></search><sort><creationdate>2020</creationdate><title>An Efficient and Effective Model to Handle Missing Data in Classification</title><author>Mehrabani-Zeinabad, Kamran ; Ayatollahi, Seyyed Mohammad Taghi ; Doostfatemeh, Marziyeh</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c476t-42436fd2373e7e4bcf92653bb585783ba1a49c9404475e9e224231f9d754288f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Accuracy</topic><topic>Algorithms</topic><topic>Bayesian analysis</topic><topic>Biomedical research</topic><topic>Classification</topic><topic>Computer applications</topic><topic>Computing 
time</topic><topic>Data mining</topic><topic>Datasets</topic><topic>Decision tree</topic><topic>Decision trees</topic><topic>Information management</topic><topic>Machine learning</topic><topic>Mathematical models</topic><topic>Methods</topic><topic>Missing data</topic><topic>Model accuracy</topic><topic>Regression analysis</topic><topic>Regression models</topic><topic>Variables</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Mehrabani-Zeinabad, Kamran</creatorcontrib><creatorcontrib>Ayatollahi, Seyyed Mohammad Taghi</creatorcontrib><creatorcontrib>Doostfatemeh, Marziyeh</creatorcontrib><collection>الدوريات العلمية والإحصائية - e-Marefa Academic and Statistical Periodicals</collection><collection>معرفة - المحتوى العربي الأكاديمي المتكامل - e-Marefa Academic Complete</collection><collection>Hindawi Publishing Complete</collection><collection>Hindawi Publishing Subscription Journals</collection><collection>Hindawi Publishing Open Access</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Bacteriology Abstracts (Microbiology B)</collection><collection>Biotechnology Research Abstracts</collection><collection>Industrial and Applied Microbiology Abstracts (Microbiology A)</collection><collection>Neurosciences Abstracts</collection><collection>Toxicology Abstracts</collection><collection>Virology and AIDS Abstracts</collection><collection>Health & Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central 
(Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>Natural Science Collection (ProQuest)</collection><collection>Environmental Sciences and Pollution Management</collection><collection>ProQuest One Community College</collection><collection>Middle East & Africa Database</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>AIDS and Cancer Research Abstracts</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Health & Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Algology Mycology and Protozoology Abstracts (Microbiology C)</collection><collection>Biological Science Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>MEDLINE - Academic</collection><collection>PubMed 
Central (Full Participant titles)</collection><jtitle>BioMed research international</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Mehrabani-Zeinabad, Kamran</au><au>Ayatollahi, Seyyed Mohammad Taghi</au><au>Doostfatemeh, Marziyeh</au><au>Morales-Quintana, Luis</au><au>Luis Morales-Quintana</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>An Efficient and Effective Model to Handle Missing Data in Classification</atitle><jtitle>BioMed research international</jtitle><date>2020</date><risdate>2020</risdate><volume>2020</volume><issue>2020</issue><spage>1</spage><epage>11</epage><pages>1-11</pages><issn>2314-6133</issn><eissn>2314-6141</eissn><abstract>Missing data is one of the most important causes of reduced classification accuracy. Many real datasets suffer from missing values, especially in the medical sciences. Imputation is a common way to deal with incomplete datasets. There are various imputation methods that can be applied, and the choice of the best method depends on dataset conditions such as sample size, percentage of missing values, and missingness mechanism. Therefore, a better solution is to classify incomplete datasets without imputation and without any loss of information. The structure of the “Bayesian additive regression trees” (BART) model is improved with the “Missingness Incorporated in Attributes” (MIA) approach to solve its inefficiency in handling the missingness problem. The implementation of MIA-within-BART is named “BART.m”. As the abilities of BART.m have not been investigated in the classification of incomplete datasets, this simulation-based study aimed to provide such a resource. The results indicate that BART.m can be used even for datasets with 90 percent missing values and, more importantly, that it identifies irrelevant variables and removes them on its own. 
BART.m outperforms common models for classification with incomplete data in terms of accuracy and computational time. Based on these properties, BART.m is a high-accuracy model for classifying incomplete datasets that avoids any assumptions and preprocessing steps.</abstract><cop>Cairo, Egypt</cop><pub>Hindawi Publishing Corporation</pub><pmid>33299878</pmid><doi>10.1155/2020/8810143</doi><tpages>11</tpages><orcidid>https://orcid.org/0000-0002-5691-9628</orcidid><orcidid>https://orcid.org/0000-0003-3073-2600</orcidid><orcidid>https://orcid.org/0000-0001-9619-6607</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 2314-6133 |
ispartof | BioMed research international, 2020, Vol.2020 (2020), p.1-11 |
issn | 2314-6133 ; 2314-6141 |
language | eng |
recordid | cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_7710403 |
source | PubMed Central Open Access; Wiley Online Library Open Access; PubMed Central; Alma/SFX Local Collection |
subjects | Accuracy ; Algorithms ; Bayesian analysis ; Biomedical research ; Classification ; Computer applications ; Computing time ; Data mining ; Datasets ; Decision tree ; Decision trees ; Information management ; Machine learning ; Mathematical models ; Methods ; Missing data ; Model accuracy ; Regression analysis ; Regression models ; Variables |
title | An Efficient and Effective Model to Handle Missing Data in Classification |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T07%3A33%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20Efficient%20and%20Effective%20Model%20to%20Handle%20Missing%20Data%20in%20Classification&rft.jtitle=BioMed%20research%20international&rft.au=Mehrabani-Zeinabad,%20Kamran&rft.date=2020&rft.volume=2020&rft.issue=2020&rft.spage=1&rft.epage=11&rft.pages=1-11&rft.issn=2314-6133&rft.eissn=2314-6141&rft_id=info:doi/10.1155/2020/8810143&rft_dat=%3Cgale_pubme%3EA697108258%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2467505343&rft_id=info:pmid/33299878&rft_galeid=A697108258&rfr_iscdi=true |