Deep Learning Benchmarks on L1000 Gene Expression Data

Gene expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hamp...

Bibliographic details
Published in: IEEE/ACM transactions on computational biology and bioinformatics 2020-11, Vol.17 (6), p.1846-1857
Main authors: McDermott, Matthew B.A., Wang, Jennifer, Zhao, Wen-Ning, Sheridan, Steven D., Szolovits, Peter, Kohane, Isaac, Haggarty, Stephen J., Perlis, Roy H.
Format: Article
Language: eng
container_end_page 1857
container_issue 6
container_start_page 1846
container_title IEEE/ACM transactions on computational biology and bioinformatics
container_volume 17
creator McDermott, Matthew B.A.
Wang, Jennifer
Zhao, Wen-Ning
Sheridan, Steven D.
Szolovits, Peter
Kohane, Isaac
Haggarty, Stephen J.
Perlis, Roy H.
description Gene expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hampered by the lack of published benchmark tasks and well characterized baselines. In this work, we establish such benchmarks and baselines by profiling many classifiers against biologically motivated tasks on two curated views of a large, public gene expression dataset (the LINCS corpus) and one privately produced dataset. We provide these two curated views of the public LINCS dataset and our benchmark tasks to enable direct comparisons to future methodological work and help spur deep learning method development on this modality. In addition to profiling a battery of traditional classifiers, including linear models, random forests, decision trees, K nearest neighbor (KNN) classifiers, and feed-forward artificial neural networks (FF-ANNs), we also test a method novel to this data modality: graph convolutional neural networks (GCNNs), which allow us to incorporate prior biological domain knowledge. We find that GCNNs can be highly performant, with large datasets, whereas FF-ANNs consistently perform well. Non-neural classifiers are dominated by linear models and KNN classifiers.
doi_str_mv 10.1109/TCBB.2019.2910061
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_ieee_primary_8686113</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>8686113</ieee_id><sourcerecordid>2468772759</sourcerecordid><originalsourceid>FETCH-LOGICAL-c447t-b9c42f7995404a6021b2f4c40e909060bf70a21ed697d3a9ede7ec33589bb29c3</originalsourceid><addsrcrecordid>eNpdkVtrGzEQhUVoqXPpDwiFspCXvKwzuqy081KoL3UKhrwkz0Irzyab2lpXskv67ytj1yR5mmHmO4cZDmOXHIacA97cj0ejoQCOQ4EcQPMTdsqrypSIWn3Y9aoqK9RywM5SegYQCkF9YgMJiFkGp0xPiNbFnFwMXXgsRhT808rFX6noQzHPplDMKFAxfVlHSqnL04nbuAv2sXXLRJ8P9Zw9_Jjej2_L-d3s5_j7vPRKmU3ZoFeiNYiVAuU0CN6IVnkFhICgoWkNOMFpodEspENakCEvZVVj0wj08px92_uut82KFp7CJrqlXccuH_nX9q6zbzehe7KP_R-rsQapZTa4PhjE_veW0sauuuRpuXSB-m2yQnAQuuJcZfTqHfrcb2PI71mhdG2MMBVmiu8pH_uUIrXHYzjYXSp2l4rdpWIPqWTN19dfHBX_Y8jAlz3QEdFxXetacy7lP-HLjlM</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2468772759</pqid></control><display><type>article</type><title>Deep Learning Benchmarks on L1000 Gene Expression Data</title><source>IEEE Electronic Library (IEL)</source><creator>McDermott, Matthew B.A. ; Wang, Jennifer ; Zhao, Wen-Ning ; Sheridan, Steven D. ; Szolovits, Peter ; Kohane, Isaac ; Haggarty, Stephen J. ; Perlis, Roy H.</creator><creatorcontrib>McDermott, Matthew B.A. ; Wang, Jennifer ; Zhao, Wen-Ning ; Sheridan, Steven D. ; Szolovits, Peter ; Kohane, Isaac ; Haggarty, Stephen J. ; Perlis, Roy H.</creatorcontrib><description>Gene expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hampered by the lack of published benchmark tasks and well characterized baselines. 
In this work, we establish such benchmarks and baselines by profiling many classifiers against biologically motivated tasks on two curated views of a large, public gene expression dataset (the LINCS corpus) and one privately produced dataset. We provide these two curated views of the public LINCS dataset and our benchmark tasks to enable direct comparisons to future methodological work and help spur deep learning method development on this modality. In addition to profiling a battery of traditional classifiers, including linear models, random forests, decision trees, K nearest neighbor (KNN) classifiers, and feed-forward artificial neural networks (FF-ANNs), we also test a method novel to this data modality: graph convolutional neural networks (GCNNs), which allow us to incorporate prior biological domain knowledge. We find that GCNNs can be highly performant, with large datasets, whereas FF-ANNs consistently perform well. Non-neural classifiers are dominated by linear models and KNN classifiers.</description><identifier>ISSN: 1545-5963</identifier><identifier>EISSN: 1557-9964</identifier><identifier>DOI: 10.1109/TCBB.2019.2910061</identifier><identifier>PMID: 30990190</identifier><identifier>CODEN: ITCBCY</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>Artificial neural networks ; Benchmark testing ; Benchmarks ; Biological system modeling ; Classifiers ; Data models ; Datasets ; Decision trees ; Deep learning ; Gene expression ; gene expression data ; Genomes ; Learning algorithms ; Machine learning ; model development ; Neural networks</subject><ispartof>IEEE/ACM transactions on computational biology and bioinformatics, 2020-11, Vol.17 (6), p.1846-1857</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2020</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c447t-b9c42f7995404a6021b2f4c40e909060bf70a21ed697d3a9ede7ec33589bb29c3</citedby><cites>FETCH-LOGICAL-c447t-b9c42f7995404a6021b2f4c40e909060bf70a21ed697d3a9ede7ec33589bb29c3</cites><orcidid>0000-0002-5862-6757 ; 0000-0001-6048-9707</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/8686113$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>230,314,780,784,796,885,27924,27925,54758</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/8686113$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/30990190$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>McDermott, Matthew B.A.</creatorcontrib><creatorcontrib>Wang, Jennifer</creatorcontrib><creatorcontrib>Zhao, Wen-Ning</creatorcontrib><creatorcontrib>Sheridan, Steven D.</creatorcontrib><creatorcontrib>Szolovits, Peter</creatorcontrib><creatorcontrib>Kohane, Isaac</creatorcontrib><creatorcontrib>Haggarty, Stephen J.</creatorcontrib><creatorcontrib>Perlis, Roy H.</creatorcontrib><title>Deep Learning Benchmarks on L1000 Gene Expression Data</title><title>IEEE/ACM transactions on computational biology and bioinformatics</title><addtitle>TCBB</addtitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><description>Gene expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hampered by the lack of published benchmark tasks and well characterized baselines. 
In this work, we establish such benchmarks and baselines by profiling many classifiers against biologically motivated tasks on two curated views of a large, public gene expression dataset (the LINCS corpus) and one privately produced dataset. We provide these two curated views of the public LINCS dataset and our benchmark tasks to enable direct comparisons to future methodological work and help spur deep learning method development on this modality. In addition to profiling a battery of traditional classifiers, including linear models, random forests, decision trees, K nearest neighbor (KNN) classifiers, and feed-forward artificial neural networks (FF-ANNs), we also test a method novel to this data modality: graph convolutional neural networks (GCNNs), which allow us to incorporate prior biological domain knowledge. We find that GCNNs can be highly performant, with large datasets, whereas FF-ANNs consistently perform well. Non-neural classifiers are dominated by linear models and KNN classifiers.</description><subject>Artificial neural networks</subject><subject>Benchmark testing</subject><subject>Benchmarks</subject><subject>Biological system modeling</subject><subject>Classifiers</subject><subject>Data models</subject><subject>Datasets</subject><subject>Decision trees</subject><subject>Deep learning</subject><subject>Gene expression</subject><subject>gene expression data</subject><subject>Genomes</subject><subject>Learning algorithms</subject><subject>Machine learning</subject><subject>model development</subject><subject>Neural 
networks</subject><issn>1545-5963</issn><issn>1557-9964</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpdkVtrGzEQhUVoqXPpDwiFspCXvKwzuqy081KoL3UKhrwkz0Irzyab2lpXskv67ytj1yR5mmHmO4cZDmOXHIacA97cj0ejoQCOQ4EcQPMTdsqrypSIWn3Y9aoqK9RywM5SegYQCkF9YgMJiFkGp0xPiNbFnFwMXXgsRhT808rFX6noQzHPplDMKFAxfVlHSqnL04nbuAv2sXXLRJ8P9Zw9_Jjej2_L-d3s5_j7vPRKmU3ZoFeiNYiVAuU0CN6IVnkFhICgoWkNOMFpodEspENakCEvZVVj0wj08px92_uut82KFp7CJrqlXccuH_nX9q6zbzehe7KP_R-rsQapZTa4PhjE_veW0sauuuRpuXSB-m2yQnAQuuJcZfTqHfrcb2PI71mhdG2MMBVmiu8pH_uUIrXHYzjYXSp2l4rdpWIPqWTN19dfHBX_Y8jAlz3QEdFxXetacy7lP-HLjlM</recordid><startdate>20201101</startdate><enddate>20201101</enddate><creator>McDermott, Matthew B.A.</creator><creator>Wang, Jennifer</creator><creator>Zhao, Wen-Ning</creator><creator>Sheridan, Steven D.</creator><creator>Szolovits, Peter</creator><creator>Kohane, Isaac</creator><creator>Haggarty, Stephen J.</creator><creator>Perlis, Roy H.</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>JG9</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-5862-6757</orcidid><orcidid>https://orcid.org/0000-0001-6048-9707</orcidid></search><sort><creationdate>20201101</creationdate><title>Deep Learning Benchmarks on L1000 Gene Expression Data</title><author>McDermott, Matthew B.A. 
; Wang, Jennifer ; Zhao, Wen-Ning ; Sheridan, Steven D. ; Szolovits, Peter ; Kohane, Isaac ; Haggarty, Stephen J. ; Perlis, Roy H.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c447t-b9c42f7995404a6021b2f4c40e909060bf70a21ed697d3a9ede7ec33589bb29c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Artificial neural networks</topic><topic>Benchmark testing</topic><topic>Benchmarks</topic><topic>Biological system modeling</topic><topic>Classifiers</topic><topic>Data models</topic><topic>Datasets</topic><topic>Decision trees</topic><topic>Deep learning</topic><topic>Gene expression</topic><topic>gene expression data</topic><topic>Genomes</topic><topic>Learning algorithms</topic><topic>Machine learning</topic><topic>model development</topic><topic>Neural networks</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>McDermott, Matthew B.A.</creatorcontrib><creatorcontrib>Wang, Jennifer</creatorcontrib><creatorcontrib>Zhao, Wen-Ning</creatorcontrib><creatorcontrib>Sheridan, Steven D.</creatorcontrib><creatorcontrib>Szolovits, Peter</creatorcontrib><creatorcontrib>Kohane, Isaac</creatorcontrib><creatorcontrib>Haggarty, Stephen J.</creatorcontrib><creatorcontrib>Perlis, Roy H.</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered 
Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical &amp; Transportation Engineering Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology &amp; Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>McDermott, Matthew B.A.</au><au>Wang, Jennifer</au><au>Zhao, Wen-Ning</au><au>Sheridan, Steven D.</au><au>Szolovits, Peter</au><au>Kohane, Isaac</au><au>Haggarty, Stephen J.</au><au>Perlis, Roy H.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Deep Learning Benchmarks on L1000 Gene Expression Data</atitle><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle><stitle>TCBB</stitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><date>2020-11-01</date><risdate>2020</risdate><volume>17</volume><issue>6</issue><spage>1846</spage><epage>1857</epage><pages>1846-1857</pages><issn>1545-5963</issn><eissn>1557-9964</eissn><coden>ITCBCY</coden><abstract>Gene 
expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hampered by the lack of published benchmark tasks and well characterized baselines. In this work, we establish such benchmarks and baselines by profiling many classifiers against biologically motivated tasks on two curated views of a large, public gene expression dataset (the LINCS corpus) and one privately produced dataset. We provide these two curated views of the public LINCS dataset and our benchmark tasks to enable direct comparisons to future methodological work and help spur deep learning method development on this modality. In addition to profiling a battery of traditional classifiers, including linear models, random forests, decision trees, K nearest neighbor (KNN) classifiers, and feed-forward artificial neural networks (FF-ANNs), we also test a method novel to this data modality: graph convolutional neural networks (GCNNs), which allow us to incorporate prior biological domain knowledge. We find that GCNNs can be highly performant, with large datasets, whereas FF-ANNs consistently perform well. Non-neural classifiers are dominated by linear models and KNN classifiers.</abstract><cop>United States</cop><pub>IEEE</pub><pmid>30990190</pmid><doi>10.1109/TCBB.2019.2910061</doi><tpages>12</tpages><orcidid>https://orcid.org/0000-0002-5862-6757</orcidid><orcidid>https://orcid.org/0000-0001-6048-9707</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1545-5963
ispartof IEEE/ACM transactions on computational biology and bioinformatics, 2020-11, Vol.17 (6), p.1846-1857
issn 1545-5963
1557-9964
language eng
recordid cdi_ieee_primary_8686113
source IEEE Electronic Library (IEL)
subjects Artificial neural networks
Benchmark testing
Benchmarks
Biological system modeling
Classifiers
Data models
Datasets
Decision trees
Deep learning
Gene expression
gene expression data
Genomes
Learning algorithms
Machine learning
model development
Neural networks
title Deep Learning Benchmarks on L1000 Gene Expression Data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T18%3A57%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Deep%20Learning%20Benchmarks%20on%20L1000%20Gene%20Expression%20Data&rft.jtitle=IEEE/ACM%20transactions%20on%20computational%20biology%20and%20bioinformatics&rft.au=McDermott,%20Matthew%20B.A.&rft.date=2020-11-01&rft.volume=17&rft.issue=6&rft.spage=1846&rft.epage=1857&rft.pages=1846-1857&rft.issn=1545-5963&rft.eissn=1557-9964&rft.coden=ITCBCY&rft_id=info:doi/10.1109/TCBB.2019.2910061&rft_dat=%3Cproquest_RIE%3E2468772759%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2468772759&rft_id=info:pmid/30990190&rft_ieee_id=8686113&rfr_iscdi=true