Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction

Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also la...

Bibliographic details
Published in:	PloS one 2020-05, Vol.15 (5), p.e0232528
Main authors:	Shapovalov, Maxim; Dunbrack, Jr, Roland L; Vucetic, Slobodan
Format:	Article
Language:	eng
Subjects:	Ablation; Accuracy; Acids; Algorithms; Amino Acid Sequence; Amino acids; Amino Acids - chemistry; Artificial neural networks; Biology and Life Sciences; Coils; Computer and Information Sciences; Computer architecture; Databases, Protein - statistics & numerical data; Deep Learning; Evaluation; Homology; Identity; Methods; Neural networks; Neural Networks, Computer; Physical Sciences; Predictions; Protein structure; Protein structure prediction; Protein Structure, Secondary; Proteins; Proteins - chemistry; Protocol (computers); Research and Analysis Methods; Secondary structure; Software; Solvents; Structure (Literature); Tertiary structure; Test sets; Testing; Time; Training
Online access:	Full text

container_end_page
container_issue	5
container_start_page	e0232528
container_title	PloS one
container_volume	15
creator	Shapovalov, Maxim; Dunbrack, Jr, Roland L; Vucetic, Slobodan
description	Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.
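The description mentions an input window of ±14 amino acids and a per-residue 3-label accuracy (helix, sheet, coil). The following is a minimal sketch of those two ideas only, not the SecNet implementation: the function names, the `-` padding symbol, and the single-letter labels `H`, `E`, `C` are assumptions introduced for illustration.

```python
from typing import List

HALF_WINDOW = 14  # from the description: input window of ±14 amino acids
PAD = "-"         # hypothetical padding symbol for positions beyond the chain ends


def residue_windows(sequence: str, half_window: int = HALF_WINDOW) -> List[str]:
    """Return one window of 2 * half_window + 1 residues centred on each position."""
    padded = PAD * half_window + sequence + PAD * half_window
    width = 2 * half_window + 1
    return [padded[i:i + width] for i in range(len(sequence))]


def label_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted label equals the observed label."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have the same length")
    if not observed:
        raise ValueError("label strings must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)
```

With these definitions, `residue_windows("MKTAYIAKQR")` yields ten windows of 29 characters each, with the residue being predicted at index 14 of its window, and `label_accuracy("HHHEEC", "HHHECC")` scores five matching labels out of six.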
doi_str_mv	10.1371/journal.pone.0232528
format	Article
fullrecord	<record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_2399252986</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A622865091</galeid><doaj_id>oai_doaj_org_article_b067861708474aa8b824924060e7855d</doaj_id><sourcerecordid>A622865091</sourcerecordid><originalsourceid>FETCH-LOGICAL-c692t-28fc4db408657ceb09318aa7a6ad4787e9059031dab29b7194e0cffe8ed43d8e3</originalsourceid><addsrcrecordid>eNqNk01v1DAQhiMEoqXwDxBEQkJw2MWxE8e-IFUVHysVVeLrajn2ZNclG29tZ6H_ntluWm1QDygHW-NnXo_fyWTZ84LMC1YX7y79EHrdzTe-hzmhjFZUPMiOC8nojFPCHh7sj7InMV4SUjHB-ePsiFFWl7WojrPtl6FLrtUGEthco-B1dDH3bZ6Cdr3rlxi0eYKYdnvj-63vhuQ8knkPQ7hZ0m8ffsW89SHfBJ_A9XkEZK0O13lMYTBpCIBnYJ3ZJT_NHrW6i_BsXE-yHx8_fD_7PDu_-LQ4Oz2fGS5pmlHRmtI2JRG8qg00RLJCaF1rri3WX4MklSSssLqhsqkLWQIxbQsCbMmsAHaSvdzrbjof1WhZVJRJiX5JwZFY7Anr9aXaBLfGmpXXTt0EfFgqHZIzHaiG8FrwoiairEutRSNoKWlJOAH0srKo9X68bWjWYA30aGI3EZ2e9G6lln6rakoo5xIF3owCwV8N6Llau2ig63QPftjXLZjAjiL66h_0_teN1FLjA1zferzX7ETVKacUbSWyQGp-D4WfhbXDNkLrMD5JeDtJQCbBn7TUQ4xq8e3r_7MXP6fs6wN2BbpLqzj-b3EKlnvQBB9jgPbO5IKo3XjcuqF246HG8cC0F4cNuku6nQf2F-ajDKQ</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2399252986</pqid></control><display><type>article</type><title>Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction</title><source>PLoS</source><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><source>Free Full-Text Journals in Chemistry</source><creator>Shapovalov, Maxim ; Dunbrack, Jr, Roland L ; Vucetic, Slobodan</creator><creatorcontrib>Shapovalov, Maxim ; Dunbrack, Jr, Roland L ; Vucetic, Slobodan</creatorcontrib><description>Protein secondary structure prediction remains a vital topic with broad applications. 
Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. 
Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.</description><identifier>ISSN: 1932-6203</identifier><identifier>EISSN: 1932-6203</identifier><identifier>DOI: 10.1371/journal.pone.0232528</identifier><identifier>PMID: 32374785</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Ablation ; Accuracy ; Acids ; Algorithms ; Amino Acid Sequence ; Amino acids ; Amino Acids - chemistry ; Artificial neural networks ; Biology and Life Sciences ; Coils ; Computer and Information Sciences ; Computer architecture ; Databases, Protein - statistics & numerical data ; Deep Learning ; Evaluation ; Homology ; Identity ; Methods ; Neural networks ; Neural Networks, Computer ; Physical Sciences ; Predictions ; Protein structure ; Protein structure prediction ; Protein Structure, Secondary ; Proteins ; Proteins - chemistry ; Protocol (computers) ; Research and Analysis Methods ; Secondary structure ; Software ; Solvents ; Structure (Literature) ; Tertiary structure ; Test sets ; Testing ; Time ; Training</subject><ispartof>PloS one, 2020-05, Vol.15 (5), p.e0232528</ispartof><rights>COPYRIGHT 2020 Public Library of Science</rights><rights>2020 Shapovalov et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2020 Shapovalov et al 2020 Shapovalov et al</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c692t-28fc4db408657ceb09318aa7a6ad4787e9059031dab29b7194e0cffe8ed43d8e3</citedby><cites>FETCH-LOGICAL-c692t-28fc4db408657ceb09318aa7a6ad4787e9059031dab29b7194e0cffe8ed43d8e3</cites><orcidid>0000-0002-9349-7647 ; 0000-0001-7674-6667</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7202669/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7202669/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,860,881,2096,2915,23845,27901,27902,53766,53768,79343,79344</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/32374785$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Shapovalov, Maxim</creatorcontrib><creatorcontrib>Dunbrack, Jr, Roland L</creatorcontrib><creatorcontrib>Vucetic, Slobodan</creatorcontrib><title>Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction</title><title>PloS one</title><addtitle>PLoS One</addtitle><description>Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. 
In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. 
Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.</description><subject>Ablation</subject><subject>Accuracy</subject><subject>Acids</subject><subject>Algorithms</subject><subject>Amino Acid Sequence</subject><subject>Amino acids</subject><subject>Amino Acids - chemistry</subject><subject>Artificial neural networks</subject><subject>Biology and Life Sciences</subject><subject>Coils</subject><subject>Computer and Information Sciences</subject><subject>Computer architecture</subject><subject>Databases, Protein - statistics & numerical data</subject><subject>Deep Learning</subject><subject>Evaluation</subject><subject>Homology</subject><subject>Identity</subject><subject>Methods</subject><subject>Neural networks</subject><subject>Neural Networks, Computer</subject><subject>Physical Sciences</subject><subject>Predictions</subject><subject>Protein structure</subject><subject>Protein structure prediction</subject><subject>Protein Structure, Secondary</subject><subject>Proteins</subject><subject>Proteins - chemistry</subject><subject>Protocol (computers)</subject><subject>Research and Analysis Methods</subject><subject>Secondary structure</subject><subject>Software</subject><subject>Solvents</subject><subject>Structure (Literature)</subject><subject>Tertiary structure</subject><subject>Test 
sets</subject><subject>Testing</subject><subject>Time</subject><subject>Training</subject><issn>1932-6203</issn><issn>1932-6203</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>BENPR</sourceid><sourceid>DOA</sourceid><recordid>eNqNk01v1DAQhiMEoqXwDxBEQkJw2MWxE8e-IFUVHysVVeLrajn2ZNclG29tZ6H_ntluWm1QDygHW-NnXo_fyWTZ84LMC1YX7y79EHrdzTe-hzmhjFZUPMiOC8nojFPCHh7sj7InMV4SUjHB-ePsiFFWl7WojrPtl6FLrtUGEthco-B1dDH3bZ6Cdr3rlxi0eYKYdnvj-63vhuQ8knkPQ7hZ0m8ffsW89SHfBJ_A9XkEZK0O13lMYTBpCIBnYJ3ZJT_NHrW6i_BsXE-yHx8_fD_7PDu_-LQ4Oz2fGS5pmlHRmtI2JRG8qg00RLJCaF1rri3WX4MklSSssLqhsqkLWQIxbQsCbMmsAHaSvdzrbjof1WhZVJRJiX5JwZFY7Anr9aXaBLfGmpXXTt0EfFgqHZIzHaiG8FrwoiairEutRSNoKWlJOAH0srKo9X68bWjWYA30aGI3EZ2e9G6lln6rakoo5xIF3owCwV8N6Llau2ig63QPftjXLZjAjiL66h_0_teN1FLjA1zferzX7ETVKacUbSWyQGp-D4WfhbXDNkLrMD5JeDtJQCbBn7TUQ4xq8e3r_7MXP6fs6wN2BbpLqzj-b3EKlnvQBB9jgPbO5IKo3XjcuqF246HG8cC0F4cNuku6nQf2F-ajDKQ</recordid><startdate>20200506</startdate><enddate>20200506</enddate><creator>Shapovalov, Maxim</creator><creator>Dunbrack, Jr, Roland L</creator><creator>Vucetic, Slobodan</creator><general>Public Library of Science</general><general>Public Library of Science 
(PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>IOV</scope><scope>ISR</scope><scope>3V.</scope><scope>7QG</scope><scope>7QL</scope><scope>7QO</scope><scope>7RV</scope><scope>7SN</scope><scope>7SS</scope><scope>7T5</scope><scope>7TG</scope><scope>7TM</scope><scope>7U9</scope><scope>7X2</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AO</scope><scope>8C1</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>ATCPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>C1K</scope><scope>CCPQU</scope><scope>D1I</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>H94</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>KB.</scope><scope>KB0</scope><scope>KL.</scope><scope>L6V</scope><scope>LK8</scope><scope>M0K</scope><scope>M0S</scope><scope>M1P</scope><scope>M7N</scope><scope>M7P</scope><scope>M7S</scope><scope>NAPCQ</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PATMY</scope><scope>PDBOC</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>PYCSY</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-9349-7647</orcidid><orcidid>https://orcid.org/0000-0001-7674-6667</orcidid></search><sort><creationdate>20200506</creationdate><title>Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction</title><author>Shapovalov, Maxim ; Dunbrack, Jr, Roland L ; Vucetic, 
Slobodan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c692t-28fc4db408657ceb09318aa7a6ad4787e9059031dab29b7194e0cffe8ed43d8e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Ablation</topic><topic>Accuracy</topic><topic>Acids</topic><topic>Algorithms</topic><topic>Amino Acid Sequence</topic><topic>Amino acids</topic><topic>Amino Acids - chemistry</topic><topic>Artificial neural networks</topic><topic>Biology and Life Sciences</topic><topic>Coils</topic><topic>Computer and Information Sciences</topic><topic>Computer architecture</topic><topic>Databases, Protein - statistics & numerical data</topic><topic>Deep Learning</topic><topic>Evaluation</topic><topic>Homology</topic><topic>Identity</topic><topic>Methods</topic><topic>Neural networks</topic><topic>Neural Networks, Computer</topic><topic>Physical Sciences</topic><topic>Predictions</topic><topic>Protein structure</topic><topic>Protein structure prediction</topic><topic>Protein Structure, Secondary</topic><topic>Proteins</topic><topic>Proteins - chemistry</topic><topic>Protocol (computers)</topic><topic>Research and Analysis Methods</topic><topic>Secondary structure</topic><topic>Software</topic><topic>Solvents</topic><topic>Structure (Literature)</topic><topic>Tertiary structure</topic><topic>Test sets</topic><topic>Testing</topic><topic>Time</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Shapovalov, Maxim</creatorcontrib><creatorcontrib>Dunbrack, Jr, Roland L</creatorcontrib><creatorcontrib>Vucetic, Slobodan</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Opposing 
Viewpoints</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Animal Behavior Abstracts</collection><collection>Bacteriology Abstracts (Microbiology B)</collection><collection>Biotechnology Research Abstracts</collection><collection>Nursing & Allied Health Database (ProQuest)</collection><collection>Ecology Abstracts</collection><collection>Entomology Abstracts (Full archive)</collection><collection>Immunology Abstracts</collection><collection>Meteorological & Geoastrophysical Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Virology and AIDS Abstracts</collection><collection>Agricultural Science Collection</collection><collection>Health & Medical Complete (ProQuest Database)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>ProQuest Public Health Database</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>Agricultural & Environmental Science Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology 
Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Environmental Sciences and Pollution Management</collection><collection>ProQuest One Community College</collection><collection>ProQuest Materials Science Collection</collection><collection>ProQuest Central</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>AIDS and Cancer Research Abstracts</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>Materials Science Database</collection><collection>Nursing & Allied Health Database (Alumni Edition)</collection><collection>Meteorological & Geoastrophysical Abstracts - Academic</collection><collection>ProQuest Engineering Collection</collection><collection>Biological Sciences</collection><collection>Agriculture Science Database</collection><collection>Health & Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Algology Mycology and Protozoology Abstracts (Microbiology C)</collection><collection>Biological Science Database</collection><collection>Engineering Database</collection><collection>Nursing & Allied Health Premium</collection><collection>ProQuest advanced technologies & aerospace journals</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Environmental Science Database</collection><collection>Materials Science Collection</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI 
Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection><collection>Environmental Science Collection</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>PloS one</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Shapovalov, Maxim</au><au>Dunbrack, Jr, Roland L</au><au>Vucetic, Slobodan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction</atitle><jtitle>PloS one</jtitle><addtitle>PLoS One</addtitle><date>2020-05-06</date><risdate>2020</risdate><volume>15</volume><issue>5</issue><spage>e0232528</spage><pages>e0232528-</pages><issn>1932-6203</issn><eissn>1932-6203</eissn><abstract>Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. 
In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>32374785</pmid><doi>10.1371/journal.pone.0232528</doi><tpages>e0232528</tpages><orcidid>https://orcid.org/0000-0002-9349-7647</orcidid><orcidid>https://orcid.org/0000-0001-7674-6667</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1932-6203
ispartof	PloS one, 2020-05, Vol.15 (5), p.e0232528
issn	1932-6203 1932-6203
language	eng
recordid	cdi_plos_journals_2399252986
source	PLoS; MEDLINE; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals; PubMed Central; Free Full-Text Journals in Chemistry
subjects	Ablation; Accuracy; Acids; Algorithms; Amino Acid Sequence; Amino acids; Amino Acids - chemistry; Artificial neural networks; Biology and Life Sciences; Coils; Computer and Information Sciences; Computer architecture; Databases, Protein - statistics & numerical data; Deep Learning; Evaluation; Homology; Identity; Methods; Neural networks; Neural Networks, Computer; Physical Sciences; Predictions; Protein structure; Protein structure prediction; Protein Structure, Secondary; Proteins; Proteins - chemistry; Protocol (computers); Research and Analysis Methods; Secondary structure; Software; Solvents; Structure (Literature); Tertiary structure; Test sets; Testing; Time; Training
title	Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-01T10%3A00%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Multifaceted%20analysis%20of%20training%20and%20testing%20convolutional%20neural%20networks%20for%20protein%20secondary%20structure%20prediction&rft.jtitle=PloS%20one&rft.au=Shapovalov,%20Maxim&rft.date=2020-05-06&rft.volume=15&rft.issue=5&rft.spage=e0232528&rft.pages=e0232528-&rft.issn=1932-6203&rft.eissn=1932-6203&rft_id=info:doi/10.1371/journal.pone.0232528&rft_dat=%3Cgale_plos_%3EA622865091%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2399252986&rft_id=info:pmid/32374785&rft_galeid=A622865091&rft_doaj_id=oai_doaj_org_article_b067861708474aa8b824924060e7855d&rfr_iscdi=true