PanClassif: Improving pan cancer classification of single cell RNA-seq gene expression data using machine learning
Cancer is one of the major causes of human death per year. In recent years, cancer identification and classification using machine learning have gained momentum due to the availability of high throughput sequencing data. Using RNA-seq, cancer research is blooming day by day and new insights of cance...
Published in: | Genomics (San Diego, Calif.), 2022-03, Vol.114 (2), p.110264-110264, Article 110264 |
---|---|
Main authors: | Mahin, Kazi Ferdous; Robiuddin, Md; Islam, Mujahidul; Ashraf, Shayed; Yeasmin, Farjana; Shatabda, Swakkhar |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 110264 |
---|---|
container_issue | 2 |
container_start_page | 110264 |
container_title | Genomics (San Diego, Calif.) |
container_volume | 114 |
creator | Mahin, Kazi Ferdous; Robiuddin, Md; Islam, Mujahidul; Ashraf, Shayed; Yeasmin, Farjana; Shatabda, Swakkhar |
description | Cancer is one of the major causes of human death each year. In recent years, cancer identification and classification using machine learning have gained momentum owing to the availability of high-throughput sequencing data. RNA-seq is driving rapid progress in cancer research, and new insights into cancer and related treatments are coming to light. In this paper, we propose PanClassif, a method that requires only a few effective genes to detect cancer from RNA-seq data and provides a performance gain across a wide range of machine learning classifiers. We took samples of 22 cancer types from The Cancer Genome Atlas (TCGA), comprising 8287 cancer samples and 680 normal samples. First, PanClassif applies k-Nearest Neighbour (k-NN) smoothing to the samples to handle noise in the data. Effective genes are then selected with an ANOVA-based test. To balance the training data, PanClassif applies an oversampling method, SMOTE. We performed comprehensive experiments on the datasets using several classification algorithms. The experimental results show that PanClassif outperforms existing state-of-the-art methods and performs consistently on two single-cell RNA-seq datasets taken from the Gene Expression Omnibus (GEO). PanClassif improves the performance of a wide variety of classifiers for both binary cancer prediction and multi-class cancer classification. PanClassif is available as a Python package (https://pypi.org/project/panclassif/). All source code and materials of PanClassif are available at https://github.com/Zwei-inc/panclassif.
• Effective gene selection strategy for cancer classification from single-cell RNA-seq data.
• An improved machine-learning-based classification method for binary and pan-cancer classification.
• A publicly available Python-based tool on PyPI. |
doi_str_mv | 10.1016/j.ygeno.2022.01.001 |
format | Article |
fullrecord | recordid: TN_cdi_proquest_miscellaneous_2648836860; sourceid: proquest_cross; sourceformat: XML; type: article; PMID: 34998929; DOI: 10.1016/j.ygeno.2022.01.001; ISSN: 0888-7543; EISSN: 1089-8646; publisher: United States: Elsevier Inc; rights: 2022 The Authors. Copyright © 2022 The Authors. Published by Elsevier Inc. All rights reserved.; peer reviewed: yes; open access: free_for_read; links: https://dx.doi.org/10.1016/j.ygeno.2022.01.001, https://www.ncbi.nlm.nih.gov/pubmed/34998929 (MEDLINE/PubMed record); collections: ScienceDirect Open Access Titles, Elsevier:ScienceDirect:Open Access, MEDLINE, MEDLINE (Ovid), PubMed, CrossRef, MEDLINE - Academic, AGRICOLA, AGRICOLA - Academic |
fulltext | fulltext |
identifier | ISSN: 0888-7543 |
ispartof | Genomics (San Diego, Calif.), 2022-03, Vol.114 (2), p.110264-110264, Article 110264 |
issn | 0888-7543; 1089-8646 |
language | eng |
recordid | cdi_proquest_miscellaneous_2648836860 |
source | MEDLINE; DOAJ Directory of Open Access Journals; Elsevier ScienceDirect Journals; EZB-FREE-00999 freely available EZB journals |
subjects | Algorithms; Cancer detection; Classification; computer software; data collection; death; Gene Expression; Gene Expression Profiling; genome; genomics; Humans; Machine Learning; Neoplasms - diagnosis; Neoplasms - genetics; prediction; RNA-Seq; sequence analysis; Sequence Analysis, RNA - methods; Single cell RNA-Seq; Software; Software package |
title | PanClassif: Improving pan cancer classification of single cell RNA-seq gene expression data using machine learning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-22T06%3A19%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=PanClassif:%20Improving%20pan%20cancer%20classification%20of%20single%20cell%20RNA-seq%20gene%20expression%20data%20using%20machine%20learning&rft.jtitle=Genomics%20(San%20Diego,%20Calif.)&rft.au=Mahin,%20Kazi%20Ferdous&rft.date=2022-03&rft.volume=114&rft.issue=2&rft.spage=110264&rft.epage=110264&rft.pages=110264-110264&rft.artnum=110264&rft.issn=0888-7543&rft.eissn=1089-8646&rft_id=info:doi/10.1016/j.ygeno.2022.01.001&rft_dat=%3Cproquest_cross%3E2648836860%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2618517834&rft_id=info:pmid/34998929&rft_els_id=S0888754322000015&rfr_iscdi=true |
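
The abstract in this record describes three preprocessing steps applied before classification: k-NN smoothing, ANOVA-based gene selection, and oversampling of the training data with SMOTE. The following is a minimal standard-library sketch of those steps, not the API of the `panclassif` package. All function names and the toy expression matrix are hypothetical, the smoothing rule (averaging each sample with its k nearest neighbours) is only one plausible reading of "k-NN smoothing", and `smote_like` is a simplified interpolation in the spirit of SMOTE rather than a reference implementation.

```python
import math
import random


def knn_smooth(samples, k):
    """Assumed smoothing rule: average each sample with its k nearest neighbours."""
    smoothed = []
    for i, s in enumerate(samples):
        # Rank all other samples by Euclidean distance to sample i.
        others = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(s, samples[j]),
        )
        group = [s] + [samples[j] for j in others[:k]]
        smoothed.append([sum(col) / len(group) for col in zip(*group)])
    return smoothed


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single gene across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return math.inf if between > 0 else 0.0
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (between / df_between) / (within / df_within)


def select_genes(samples, labels, n_genes):
    """Column indices of the n_genes genes with the largest F statistic."""
    scores = [anova_f(list(col), labels) for col in zip(*samples)]
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return sorted(ranked[:n_genes])


def smote_like(samples, labels, k, rng):
    """Grow each minority class to the majority size by interpolating
    between a sample and one of its k nearest same-class neighbours."""
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(g) for g in by_class.values())
    out_x, out_y = list(samples), list(labels)
    for y, group in by_class.items():
        for _ in range(target - len(group)):
            base = rng.choice(group)
            neighbours = sorted(
                (s for s in group if s is not base),
                key=lambda s: math.dist(base, s),
            )[:k]
            other = rng.choice(neighbours) if neighbours else base
            t = rng.random()  # position on the segment between base and other
            out_x.append([a + t * (b - a) for a, b in zip(base, other)])
            out_y.append(y)
    return out_x, out_y


# Invented toy data: 6 samples x 3 genes; only gene 0 separates the classes.
X = [
    [5.0, 1.0, 0.2],
    [5.2, 0.9, 0.8],
    [4.8, 1.1, 0.5],
    [5.1, 1.0, 0.1],
    [1.0, 1.0, 0.6],
    [1.2, 0.9, 0.3],
]
y = ["cancer"] * 4 + ["normal"] * 2

X_smooth = knn_smooth(X, k=1)
genes = select_genes(X_smooth, y, n_genes=1)
X_sel = [[row[g] for g in genes] for row in X_smooth]
X_bal, y_bal = smote_like(X_sel, y, k=1, rng=random.Random(0))
```

The balanced matrix `X_bal` with labels `y_bal` would then be passed to any classifier. In practice, oversampling should be applied to the training split only, as the abstract states, so that synthetic samples do not leak into evaluation.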