PanClassif: Improving pan cancer classification of single cell RNA-seq gene expression data using machine learning

Cancer is one of the leading causes of human death each year. In recent years, cancer identification and classification using machine learning have gained momentum owing to the availability of high-throughput sequencing data. RNA-seq is driving rapid growth in cancer research, and new insights of cance...

Bibliographic details
Published in: Genomics (San Diego, Calif.), 2022-03, Vol.114 (2), p.110264-110264, Article 110264
Main authors: Mahin, Kazi Ferdous; Robiuddin, Md; Islam, Mujahidul; Ashraf, Shayed; Yeasmin, Farjana; Shatabda, Swakkhar
Format: Article
Language: eng
Online access: Full text
container_end_page 110264
container_issue 2
container_start_page 110264
container_title Genomics (San Diego, Calif.)
container_volume 114
creator Mahin, Kazi Ferdous
Robiuddin, Md
Islam, Mujahidul
Ashraf, Shayed
Yeasmin, Farjana
Shatabda, Swakkhar
description Cancer is one of the leading causes of human death each year. In recent years, cancer identification and classification using machine learning have gained momentum owing to the availability of high-throughput sequencing data. RNA-seq is driving rapid growth in cancer research, and new insights into cancer and related treatments are coming to light. In this paper, we propose PanClassif, a method that requires only a small number of effective genes to detect cancer from RNA-seq data and that improves the performance of a wide range of machine learning classifiers. We use 22 cancer types from The Cancer Genome Atlas (TCGA), comprising 8287 cancer samples and 680 normal samples. First, PanClassif applies k-Nearest Neighbour (k-NN) smoothing to the samples to handle noise in the data. Effective genes are then selected with an ANOVA-based test. To balance the training data, PanClassif applies an oversampling method, SMOTE. We performed comprehensive experiments on the datasets using several classification algorithms. Experimental results show that PanClassif outperforms existing state-of-the-art methods and shows consistent performance on two single-cell RNA-seq datasets taken from the Gene Expression Omnibus (GEO). PanClassif improves the performance of a wide variety of classifiers for both binary cancer prediction and multi-class cancer classification. PanClassif is available as a Python package (https://pypi.org/project/panclassif/). All source code and materials for PanClassif are available at https://github.com/Zwei-inc/panclassif.
• Effective gene selection strategy for cancer classification from single-cell RNA-seq data.
• An improved machine learning based classification method for binary and pan-cancer classification.
• A publicly available Python-based tool on PyPI.
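The three preprocessing steps named in the description (k-NN smoothing, ANOVA-based gene selection, SMOTE oversampling) can be illustrated with a minimal sketch. This is not the code of the panclassif package: the function names `knn_smooth`, `anova_f`, `select_genes` and `smote_like`, the toy expression matrix, and all parameter values are assumptions made for illustration, and `smote_like` is a simplified interpolation in the spirit of SMOTE rather than the reference algorithm. Only the Python standard library is used.

```python
import math
import random
from statistics import mean


def knn_smooth(samples, k=1):
    """Replace each sample by the mean of itself and its k nearest neighbours."""
    smoothed = []
    for s in samples:
        # The sample itself has distance 0, so it is always among the k + 1 closest.
        nearest = sorted(samples, key=lambda t: math.dist(s, t))[: k + 1]
        smoothed.append([mean(col) for col in zip(*nearest)])
    return smoothed


def anova_f(values, labels):
    """One-way ANOVA F statistic of one gene across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    n, g = len(values), len(groups)
    ss_between = sum(len(vs) * (mean(vs) - grand) ** 2 for vs in groups.values())
    ss_within = sum((v - mean(vs)) ** 2 for vs in groups.values() for v in vs)
    if ss_within == 0:
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (g - 1)) / (ss_within / (n - g))


def select_genes(samples, labels, top):
    """Return the column indices of the `top` genes with the largest F statistic."""
    scores = [anova_f(list(col), labels) for col in zip(*samples)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:top])


def smote_like(minority, n_new, k=1, seed=0):
    """Create n_new synthetic samples by interpolating towards a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical): 5 samples x 3 genes, only gene 0 separates the classes.
X = [[1.0, 5.0, 0.2], [1.2, 5.1, 0.9], [0.9, 4.9, 0.5],
     [3.0, 5.0, 0.4], [3.2, 5.2, 0.7]]
y = ["normal", "normal", "normal", "tumor", "tumor"]

smoothed = knn_smooth(X, k=1)
genes = select_genes(smoothed, y, top=1)
reduced = [[row[j] for j in genes] for row in smoothed]
tumor = [r for r, label in zip(reduced, y) if label == "tumor"]
synthetic = smote_like(tumor, n_new=1, k=1)  # one extra sample balances 3 vs 2
print(genes, synthetic)
```

In practice the oversampling step would be applied to the training split only, as the description states, so that synthetic samples never leak into the evaluation data.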
doi_str_mv 10.1016/j.ygeno.2022.01.001
format Article
fullrecord recordid: TN_cdi_proquest_miscellaneous_2648836860
sourceid: proquest_cross
sourcerecordid: 2648836860
pqid: 2618517834
els_id: S0888754322000015
type: article (genre: article; ristype: JOUR)
date: 2022-03
pages: 110264-110264 (artnum 110264; tpages 1)
eissn: 1089-8646
pmid: 34998929
publisher: United States: Elsevier Inc
rights: 2022 The Authors. Copyright © 2022 The Authors. Published by Elsevier Inc. All rights reserved.
peer_reviewed: yes
oa: free_for_read
backlink: https://www.ncbi.nlm.nih.gov/pubmed/34998929 (View this record in MEDLINE/PubMed)
linktohtml: https://dx.doi.org/10.1016/j.ygeno.2022.01.001
collection: ScienceDirect Open Access Titles; Elsevier:ScienceDirect:Open Access; Medline; MEDLINE; MEDLINE (Ovid); PubMed; CrossRef; MEDLINE - Academic; AGRICOLA; AGRICOLA - Academic
delcategory: Remote Search Resource
fulltext fulltext
identifier ISSN: 0888-7543
ispartof Genomics (San Diego, Calif.), 2022-03, Vol.114 (2), p.110264-110264, Article 110264
issn 0888-7543
1089-8646
language eng
recordid cdi_proquest_miscellaneous_2648836860
source MEDLINE; DOAJ Directory of Open Access Journals; Elsevier ScienceDirect Journals; EZB-FREE-00999 freely available EZB journals
subjects Algorithms
Cancer detection
Classification
computer software
data collection
death
Gene Expression
Gene Expression Profiling
genome
genomics
Humans
Machine Learning
Neoplasms - diagnosis
Neoplasms - genetics
prediction
RNA-Seq
sequence analysis
Sequence Analysis, RNA - methods
Single cell RNA-Seq
Software
Software package
title PanClassif: Improving pan cancer classification of single cell RNA-seq gene expression data using machine learning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T06%3A19%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=PanClassif:%20Improving%20pan%20cancer%20classification%20of%20single%20cell%20RNA-seq%20gene%20expression%20data%20using%20machine%20learning&rft.jtitle=Genomics%20(San%20Diego,%20Calif.)&rft.au=Mahin,%20Kazi%20Ferdous&rft.date=2022-03&rft.volume=114&rft.issue=2&rft.spage=110264&rft.epage=110264&rft.pages=110264-110264&rft.artnum=110264&rft.issn=0888-7543&rft.eissn=1089-8646&rft_id=info:doi/10.1016/j.ygeno.2022.01.001&rft_dat=%3Cproquest_cross%3E2648836860%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2618517834&rft_id=info:pmid/34998929&rft_els_id=S0888754322000015&rfr_iscdi=true