PanClassif: Improving pan cancer classification of single cell RNA-seq gene expression data using machine learning

Cancer is one of the leading causes of human death each year. In recent years, cancer identification and classification using machine learning have gained momentum due to the availability of high-throughput sequencing data. RNA-seq is driving rapid progress in cancer research, and new insights into cance...

Bibliographic details
Published in:	Genomics (San Diego, Calif.), 2022-03, Vol.114 (2), p.110264-110264, Article 110264
Main authors:	Mahin, Kazi Ferdous; Robiuddin, Md; Islam, Mujahidul; Ashraf, Shayed; Yeasmin, Farjana; Shatabda, Swakkhar
Format:	Article
Language:	eng
Subjects:	Algorithms; Cancer detection; Classification; computer software; data collection; death; Gene Expression; Gene Expression Profiling; genome; genomics; Humans; Machine Learning; Neoplasms - diagnosis; Neoplasms - genetics; prediction; RNA-Seq; sequence analysis; Sequence Analysis, RNA - methods; Single cell RNA-Seq; Software; Software package
Online access:	Full text

container_end_page	110264
container_issue	2
container_start_page	110264
container_title	Genomics (San Diego, Calif.)
container_volume	114
creator	Mahin, Kazi Ferdous; Robiuddin, Md; Islam, Mujahidul; Ashraf, Shayed; Yeasmin, Farjana; Shatabda, Swakkhar
description	Cancer is one of the leading causes of human death each year. In recent years, cancer identification and classification using machine learning have gained momentum due to the availability of high-throughput sequencing data. RNA-seq is driving rapid progress in cancer research, and new insights into cancer and related treatments are coming to light. In this paper, we propose PanClassif, a method that requires only a small set of effective genes to detect cancer from RNA-seq data and provides performance gains across a wide range of machine learning classifiers. We used samples of 22 cancer types from The Cancer Genome Atlas (TCGA), comprising 8287 cancer samples and 680 normal samples. First, PanClassif applies k-Nearest Neighbour (k-NN) smoothing to the samples to handle noise in the data. Effective genes are then selected with an ANOVA-based test. To balance the training data, PanClassif applies an oversampling method, SMOTE. We performed comprehensive experiments on the datasets using several classification algorithms. Experimental results show that PanClassif outperforms existing state-of-the-art methods and shows consistent performance on two single-cell RNA-seq datasets taken from the Gene Expression Omnibus (GEO). PanClassif improves the performance of a wide variety of classifiers for both binary cancer prediction and multi-class cancer classification. PanClassif is available as a Python package (https://pypi.org/project/panclassif/). All source code and materials for PanClassif are available at https://github.com/Zwei-inc/panclassif. • Effective gene selection strategy for cancer classification from single-cell RNA-Seq data. • An improved machine learning based classification method for binary and pan-cancer classification. • A publicly available Python-based tool on PyPI.
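The three preprocessing steps named in the description (k-NN smoothing, ANOVA-based gene selection, SMOTE oversampling) can be illustrated with a minimal NumPy sketch. This is not the API of the panclassif package, and the abstract does not give the exact formulations used in the paper. The function names `knn_smooth`, `anova_f`, and `smote_like`, the choice of k, the Euclidean distance, and the synthetic data are all assumptions made for illustration. The oversampler is a simplified SMOTE-style interpolation, not the reference implementation.

```python
import numpy as np

def knn_smooth(X, k=3):
    """Assumed smoothing: replace each sample (row) by the mean of its
    k + 1 closest rows in Euclidean distance, which normally includes itself."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k + 1]
    return X[idx].mean(axis=1)

def anova_f(X, y):
    """One-way ANOVA F statistic for every gene (column) across the classes in y."""
    classes = np.unique(y)
    grand = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = len(X) - len(classes)
    # Small constant guards against division by zero for constant genes.
    return (ss_between / df_between) / (ss_within / df_within + 1e-12)

def smote_like(X, y, k=3, seed=0):
    """Simplified SMOTE-style oversampling: for each minority class, create
    synthetic samples on the segment between a sample and one of its k
    nearest same-class neighbours until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for c, n in zip(classes, counts):
        if n == target or n < 2:
            continue
        Xc = X[y == c]
        d = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
        nn = np.argsort(d, axis=1)[:, :min(k, n - 1)]
        n_new = target - n
        base = rng.integers(0, n, size=n_new)
        neigh = nn[base, rng.integers(0, nn.shape[1], size=n_new)]
        gap = rng.random((n_new, 1))
        X_parts.append(Xc[base] + gap * (Xc[neigh] - Xc[base]))
        y_parts.append(np.full(n_new, c))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Synthetic stand-in for an expression matrix: 60 samples x 20 genes,
# 50 "normal" (0) and 10 "cancer" (1); only genes 0-2 differ between classes.
rng = np.random.default_rng(42)
y = np.array([0] * 50 + [1] * 10)
X = rng.normal(size=(60, 20))
X[y == 1, :3] += 4.0

X_smooth = knn_smooth(X, k=3)
scores = anova_f(X_smooth, y)
genes = np.argsort(scores)[::-1][:3]          # indices of the 3 top-scoring genes
X_bal, y_bal = smote_like(X_smooth[:, genes], y, k=3)
print(sorted(genes.tolist()), X_bal.shape, np.bincount(y_bal).tolist())
```

In practice one would more likely use established implementations, for example `f_classif` with `SelectKBest` from scikit-learn and `SMOTE` from imbalanced-learn, and apply oversampling to the training split only, as the description states.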
doi_str_mv	10.1016/j.ygeno.2022.01.001
format	Article
pmid	34998929
cop	United States
pub	Elsevier Inc
rights	Copyright © 2022 The Authors. Published by Elsevier Inc. All rights reserved.
oa	free_for_read
toplevel	peer_reviewed; online_resources
fulltext	fulltext
identifier	ISSN: 0888-7543
ispartof	Genomics (San Diego, Calif.), 2022-03, Vol.114 (2), p.110264-110264, Article 110264
issn	0888-7543 1089-8646
language	eng
recordid	cdi_proquest_miscellaneous_2648836860
source	MEDLINE; DOAJ Directory of Open Access Journals; Elsevier ScienceDirect Journals; EZB-FREE-00999 freely available EZB journals
subjects	Algorithms; Cancer detection; Classification; computer software; data collection; death; Gene Expression; Gene Expression Profiling; genome; genomics; Humans; Machine Learning; Neoplasms - diagnosis; Neoplasms - genetics; prediction; RNA-Seq; sequence analysis; Sequence Analysis, RNA - methods; Single cell RNA-Seq; Software; Software package
title	PanClassif: Improving pan cancer classification of single cell RNA-seq gene expression data using machine learning
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T06%3A19%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=PanClassif:%20Improving%20pan%20cancer%20classification%20of%20single%20cell%20RNA-seq%20gene%20expression%20data%20using%20machine%20learning&rft.jtitle=Genomics%20(San%20Diego,%20Calif.)&rft.au=Mahin,%20Kazi%20Ferdous&rft.date=2022-03&rft.volume=114&rft.issue=2&rft.spage=110264&rft.epage=110264&rft.pages=110264-110264&rft.artnum=110264&rft.issn=0888-7543&rft.eissn=1089-8646&rft_id=info:doi/10.1016/j.ygeno.2022.01.001&rft_dat=%3Cproquest_cross%3E2648836860%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2618517834&rft_id=info:pmid/34998929&rft_els_id=S0888754322000015&rfr_iscdi=true