Deep Learning Benchmarks on L1000 Gene Expression Data

Gene expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hamp...

Bibliographic Details
Published in:	IEEE/ACM transactions on computational biology and bioinformatics 2020-11, Vol.17 (6), p.1846-1857
Main authors:	McDermott, Matthew B.A.; Wang, Jennifer; Zhao, Wen-Ning; Sheridan, Steven D.; Szolovits, Peter; Kohane, Isaac; Haggarty, Stephen J.; Perlis, Roy H.
Format:	Article
Language:	eng
Subjects:	Artificial neural networks; Benchmark testing; Benchmarks; Biological system modeling; Classifiers; Data models; Datasets; Decision trees; Deep learning; Gene expression; gene expression data; Genomes; Learning algorithms; Machine learning; model development; Neural networks

container_end_page	1857
container_issue	6
container_start_page	1846
container_title	IEEE/ACM transactions on computational biology and bioinformatics
container_volume	17
creator	McDermott, Matthew B.A.; Wang, Jennifer; Zhao, Wen-Ning; Sheridan, Steven D.; Szolovits, Peter; Kohane, Isaac; Haggarty, Stephen J.; Perlis, Roy H.
description	Gene expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hampered by the lack of published benchmark tasks and well-characterized baselines. In this work, we establish such benchmarks and baselines by profiling many classifiers against biologically motivated tasks on two curated views of a large, public gene expression dataset (the LINCS corpus) and one privately produced dataset. We provide these two curated views of the public LINCS dataset and our benchmark tasks to enable direct comparisons to future methodological work and help spur deep learning method development on this modality. In addition to profiling a battery of traditional classifiers, including linear models, random forests, decision trees, K-nearest-neighbor (KNN) classifiers, and feed-forward artificial neural networks (FF-ANNs), we also test a method novel to this data modality: graph convolutional neural networks (GCNNs), which allow us to incorporate prior biological domain knowledge. We find that GCNNs can be highly performant with large datasets, whereas FF-ANNs consistently perform well. Non-neural classifiers are dominated by linear models and KNN classifiers.
doi_str_mv	10.1109/TCBB.2019.2910061
format	Article
pmid	30990190
coden	ITCBCY
publisher	United States: IEEE
rights	Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020
orcid	https://orcid.org/0000-0002-5862-6757; https://orcid.org/0000-0001-6048-9707
oa	free_for_read
toplevel	peer_reviewed; online_resources
tpages	12
fulltext	fulltext_linktorsrc
identifier	ISSN: 1545-5963
ispartof	IEEE/ACM transactions on computational biology and bioinformatics, 2020-11, Vol.17 (6), p.1846-1857
issn	1545-5963 1557-9964
language	eng
recordid	cdi_ieee_primary_8686113
source	IEEE Electronic Library (IEL)
subjects	Artificial neural networks; Benchmark testing; Benchmarks; Biological system modeling; Classifiers; Data models; Datasets; Decision trees; Deep learning; Gene expression; gene expression data; Genomes; Learning algorithms; Machine learning; model development; Neural networks
title	Deep Learning Benchmarks on L1000 Gene Expression Data
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T18%3A57%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Deep%20Learning%20Benchmarks%20on%20L1000%20Gene%20Expression%20Data&rft.jtitle=IEEE/ACM%20transactions%20on%20computational%20biology%20and%20bioinformatics&rft.au=McDermott,%20Matthew%20B.A.&rft.date=2020-11-01&rft.volume=17&rft.issue=6&rft.spage=1846&rft.epage=1857&rft.pages=1846-1857&rft.issn=1545-5963&rft.eissn=1557-9964&rft.coden=ITCBCY&rft_id=info:doi/10.1109/TCBB.2019.2910061&rft_dat=%3Cproquest_RIE%3E2468772759%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2468772759&rft_id=info:pmid/30990190&rft_ieee_id=8686113&rfr_iscdi=true