Analyzing the effect of data preprocessing techniques using machine learning algorithms on the diagnosis of COVID‐19

Summary Real‐time polymerase chain reaction (RT‐PCR), known as the swab test, is a laboratory test that diagnoses COVID‐19 from respiratory samples. Because the coronavirus spread rapidly around the world, RT‐PCR testing could not deliver results quickly enough...

Bibliographic details
Published in: Concurrency and computation 2022-12, Vol.34 (28), p.e7393-n/a
Main authors: Erol, Gizemnur; Uzbaş, Betül; Yücelbaş, Cüneyt; Yücelbaş, Şule
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page n/a
container_issue 28
container_start_page e7393
container_title Concurrency and computation
container_volume 34
creator Erol, Gizemnur
Uzbaş, Betül
Yücelbaş, Cüneyt
Yücelbaş, Şule
description Summary Real‐time polymerase chain reaction (RT‐PCR), known as the swab test, is a laboratory test that diagnoses COVID‐19 from respiratory samples. Because the coronavirus spread rapidly around the world, RT‐PCR testing could not deliver results quickly enough. This created a need for complementary diagnostic methods, and machine learning studies began in this area. Medical data are challenging to work with, however, because they are often inconsistent, incomplete, difficult to scale, and very large. In addition, poor clinical decisions, irrelevant parameters, and limited data adversely affect the accuracy of such studies. Datasets containing COVID‐19 blood parameters are scarcer than other medical datasets, so this study aims to improve the existing ones. To obtain more consistent results in COVID‐19 machine learning studies, it investigates the effect of data preprocessing techniques on the classification of COVID‐19 data. First, categorical feature encoding and feature scaling were applied to a dataset with 15 features containing blood data of 279 patients, including gender and age information. Then, missing values were imputed using both the K‐nearest neighbor (KNN) algorithm and multivariate imputation by chained equations (MICE). The classes were balanced with the synthetic minority oversampling technique (SMOTE). The effect of these preprocessing techniques was analyzed on the ensemble learning algorithms bagging, AdaBoost, and random forest, and on the popular classifiers KNN, support vector machine, logistic regression, artificial neural network, and decision tree.
With SMOTE applied, the highest accuracies obtained with the bagging classifier were 83.42% with KNN imputation and 83.74% with MICE imputation. Without SMOTE, the highest accuracy reached with the same classifier was 83.91%, with KNN imputation. In conclusion, the study compares selected data preprocessing techniques, presents their effect on classification performance, and demonstrates experimentally the importance of choosing the right combination of preprocessing steps.
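The SMOTE balancing step named in the summary can be illustrated with a minimal sketch. This is not the authors' code, and the record does not say which implementation they used: the function name `smote_oversample`, its parameters, and the defaults `k=5` and `seed=0` are assumptions made for illustration. The sketch shows only the core idea, creating a synthetic minority-class row by interpolating between a minority row and one of its nearest minority neighbors.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows built from the minority-class rows.

    Hypothetical helper for illustration; name, signature, and defaults
    are assumptions, not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic
```

Every synthetic row lies on a line segment between two existing minority rows, so it stays within the range of the observed minority data. In an evaluation like the one the summary describes, oversampling would normally be applied to the training portion only, so that synthetic rows do not leak into the test data.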
doi_str_mv 10.1002/cpe.7393
format Article
fullrecord recordid: TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_9874401
sourceid: proquest_pubme
sourcerecordid: 2771086958
pqid: 2738291250
sourcetype: Open Access Repository
recordtype: article
addtitle: Concurr Comput
date: 2022-12-25
tpages: 16
eissn: 1532-0634
pmid: 36714180
publisher: Hoboken, USA: John Wiley & Sons, Inc
general: John Wiley & Sons, Inc; Wiley Subscription Services, Inc
rights: 2022 John Wiley & Sons, Ltd.
toplevel: peer_reviewed; online_resources
oa: free_for_read
orcidid: https://orcid.org/0000-0001-9347-9775; https://orcid.org/0000-0002-0255-5988; https://orcid.org/0000-0002-4005-6557; https://orcid.org/0000-0002-6758-8502
collection: PubMed; CrossRef; Computer and Information Systems Abstracts; Technology Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts – Academic; Computer and Information Systems Abstracts Professional; MEDLINE - Academic; PubMed Central (Full Participant titles)
linktopdf: https://onlinelibrary.wiley.com/doi/pdf/10.1002%2Fcpe.7393
linktohtml: https://onlinelibrary.wiley.com/doi/full/10.1002%2Fcpe.7393
backlink: https://www.ncbi.nlm.nih.gov/pubmed/36714180
fulltext fulltext
identifier ISSN: 1532-0626
ispartof Concurrency and computation, 2022-12, Vol.34 (28), p.e7393-n/a
issn 1532-0626
1532-0634
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_9874401
source Wiley Journals
subjects Algorithms
Artificial neural networks
Bagging
Balancing
Blood
Classifiers
COVID-19
Datasets
Decision analysis
Decision trees
Diagnostic systems
KNN imputation
Machine learning
multivariate imputation by chained equation
Parameters
Polymerase chain reaction
Preprocessing
Support vector machines
synthetic minority oversampling technique
Viral diseases
title Analyzing the effect of data preprocessing techniques using machine learning algorithms on the diagnosis of COVID‐19
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T08%3A30%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Analyzing%20the%20effect%20of%20data%20preprocessing%20techniques%20using%20machine%20learning%20algorithms%20on%20the%20diagnosis%20of%20COVID%E2%80%9019&rft.jtitle=Concurrency%20and%20computation&rft.au=Erol,%20Gizemnur&rft.date=2022-12-25&rft.volume=34&rft.issue=28&rft.spage=e7393&rft.epage=n/a&rft.pages=e7393-n/a&rft.issn=1532-0626&rft.eissn=1532-0634&rft_id=info:doi/10.1002/cpe.7393&rft_dat=%3Cproquest_pubme%3E2771086958%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2738291250&rft_id=info:pmid/36714180&rfr_iscdi=true