Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction

Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also la...

Bibliographic details
Published in: PloS one 2020-05, Vol.15 (5), p.e0232528
Main authors: Shapovalov, Maxim, Dunbrack, Jr, Roland L, Vucetic, Slobodan
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue 5
container_start_page e0232528
container_title PloS one
container_volume 15
creator Shapovalov, Maxim
Dunbrack, Jr, Roland L
Vucetic, Slobodan
description Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.
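The description above mentions an input window of ±14 amino acids and a 3-label accuracy (helix, sheet, coil). The following is a minimal illustrative sketch of those two ideas only, not the SecNet implementation. The function names, the padding symbol, and the one-letter labels H, E, C are assumptions made for this example; the record does not specify them.

```python
from typing import List

HALF_WINDOW = 14  # from the abstract: input window of +/-14 residues
PAD = "-"         # hypothetical padding symbol for positions beyond the chain ends


def residue_windows(sequence: str, half_window: int = HALF_WINDOW) -> List[str]:
    """Return one fixed-width window per residue, centered on that residue."""
    padded = PAD * half_window + sequence + PAD * half_window
    width = 2 * half_window + 1  # 29 positions when half_window is 14
    return [padded[i:i + width] for i in range(len(sequence))]


def three_label_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose label (assumed H, E, or C) matches."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have equal length")
    if not observed:
        raise ValueError("label strings must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    windows = residue_windows("ACDEFG")
    print(len(windows), len(windows[0]))
    print(three_label_accuracy("HHEC", "HHCC"))
```

Each window has 2 × 14 + 1 = 29 positions, so a network of this kind would see every residue together with 14 neighbors on each side; how SecNet actually encodes and pads its input is not stated in this record.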
doi_str_mv 10.1371/journal.pone.0232528
format Article
fullrecord <record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_2399252986</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A622865091</galeid><doaj_id>oai_doaj_org_article_b067861708474aa8b824924060e7855d</doaj_id><sourcerecordid>A622865091</sourcerecordid><originalsourceid>FETCH-LOGICAL-c692t-28fc4db408657ceb09318aa7a6ad4787e9059031dab29b7194e0cffe8ed43d8e3</originalsourceid><addsrcrecordid>eNqNk01v1DAQhiMEoqXwDxBEQkJw2MWxE8e-IFUVHysVVeLrajn2ZNclG29tZ6H_ntluWm1QDygHW-NnXo_fyWTZ84LMC1YX7y79EHrdzTe-hzmhjFZUPMiOC8nojFPCHh7sj7InMV4SUjHB-ePsiFFWl7WojrPtl6FLrtUGEthco-B1dDH3bZ6Cdr3rlxi0eYKYdnvj-63vhuQ8knkPQ7hZ0m8ffsW89SHfBJ_A9XkEZK0O13lMYTBpCIBnYJ3ZJT_NHrW6i_BsXE-yHx8_fD_7PDu_-LQ4Oz2fGS5pmlHRmtI2JRG8qg00RLJCaF1rri3WX4MklSSssLqhsqkLWQIxbQsCbMmsAHaSvdzrbjof1WhZVJRJiX5JwZFY7Anr9aXaBLfGmpXXTt0EfFgqHZIzHaiG8FrwoiairEutRSNoKWlJOAH0srKo9X68bWjWYA30aGI3EZ2e9G6lln6rakoo5xIF3owCwV8N6Llau2ig63QPftjXLZjAjiL66h_0_teN1FLjA1zferzX7ETVKacUbSWyQGp-D4WfhbXDNkLrMD5JeDtJQCbBn7TUQ4xq8e3r_7MXP6fs6wN2BbpLqzj-b3EKlnvQBB9jgPbO5IKo3XjcuqF246HG8cC0F4cNuku6nQf2F-ajDKQ</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2399252986</pqid></control><display><type>article</type><title>Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction</title><source>PLoS</source><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><source>Free Full-Text Journals in Chemistry</source><creator>Shapovalov, Maxim ; Dunbrack, Jr, Roland L ; Vucetic, Slobodan</creator><creatorcontrib>Shapovalov, Maxim ; Dunbrack, Jr, Roland L ; Vucetic, Slobodan</creatorcontrib><description>Protein secondary structure prediction remains a vital topic with broad applications. 
Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. 
Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.</description><identifier>ISSN: 1932-6203</identifier><identifier>EISSN: 1932-6203</identifier><identifier>DOI: 10.1371/journal.pone.0232528</identifier><identifier>PMID: 32374785</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Ablation ; Accuracy ; Acids ; Algorithms ; Amino Acid Sequence ; Amino acids ; Amino Acids - chemistry ; Artificial neural networks ; Biology and Life Sciences ; Coils ; Computer and Information Sciences ; Computer architecture ; Databases, Protein - statistics &amp; numerical data ; Deep Learning ; Evaluation ; Homology ; Identity ; Methods ; Neural networks ; Neural Networks, Computer ; Physical Sciences ; Predictions ; Protein structure ; Protein structure prediction ; Protein Structure, Secondary ; Proteins ; Proteins - chemistry ; Protocol (computers) ; Research and Analysis Methods ; Secondary structure ; Software ; Solvents ; Structure (Literature) ; Tertiary structure ; Test sets ; Testing ; Time ; Training</subject><ispartof>PloS one, 2020-05, Vol.15 (5), p.e0232528</ispartof><rights>COPYRIGHT 2020 Public Library of Science</rights><rights>2020 Shapovalov et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2020 Shapovalov et al 2020 Shapovalov et al</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c692t-28fc4db408657ceb09318aa7a6ad4787e9059031dab29b7194e0cffe8ed43d8e3</citedby><cites>FETCH-LOGICAL-c692t-28fc4db408657ceb09318aa7a6ad4787e9059031dab29b7194e0cffe8ed43d8e3</cites><orcidid>0000-0002-9349-7647 ; 0000-0001-7674-6667</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7202669/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7202669/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,860,881,2096,2915,23845,27901,27902,53766,53768,79343,79344</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/32374785$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Shapovalov, Maxim</creatorcontrib><creatorcontrib>Dunbrack, Jr, Roland L</creatorcontrib><creatorcontrib>Vucetic, Slobodan</creatorcontrib><title>Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction</title><title>PloS one</title><addtitle>PLoS One</addtitle><description>Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. 
In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. 
Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.</description><subject>Ablation</subject><subject>Accuracy</subject><subject>Acids</subject><subject>Algorithms</subject><subject>Amino Acid Sequence</subject><subject>Amino acids</subject><subject>Amino Acids - chemistry</subject><subject>Artificial neural networks</subject><subject>Biology and Life Sciences</subject><subject>Coils</subject><subject>Computer and Information Sciences</subject><subject>Computer architecture</subject><subject>Databases, Protein - statistics &amp; numerical data</subject><subject>Deep Learning</subject><subject>Evaluation</subject><subject>Homology</subject><subject>Identity</subject><subject>Methods</subject><subject>Neural networks</subject><subject>Neural Networks, Computer</subject><subject>Physical Sciences</subject><subject>Predictions</subject><subject>Protein structure</subject><subject>Protein structure prediction</subject><subject>Protein Structure, Secondary</subject><subject>Proteins</subject><subject>Proteins - chemistry</subject><subject>Protocol (computers)</subject><subject>Research and Analysis Methods</subject><subject>Secondary structure</subject><subject>Software</subject><subject>Solvents</subject><subject>Structure (Literature)</subject><subject>Tertiary structure</subject><subject>Test 
sets</subject><subject>Testing</subject><subject>Time</subject><subject>Training</subject><issn>1932-6203</issn><issn>1932-6203</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>BENPR</sourceid><sourceid>DOA</sourceid><recordid>eNqNk01v1DAQhiMEoqXwDxBEQkJw2MWxE8e-IFUVHysVVeLrajn2ZNclG29tZ6H_ntluWm1QDygHW-NnXo_fyWTZ84LMC1YX7y79EHrdzTe-hzmhjFZUPMiOC8nojFPCHh7sj7InMV4SUjHB-ePsiFFWl7WojrPtl6FLrtUGEthco-B1dDH3bZ6Cdr3rlxi0eYKYdnvj-63vhuQ8knkPQ7hZ0m8ffsW89SHfBJ_A9XkEZK0O13lMYTBpCIBnYJ3ZJT_NHrW6i_BsXE-yHx8_fD_7PDu_-LQ4Oz2fGS5pmlHRmtI2JRG8qg00RLJCaF1rri3WX4MklSSssLqhsqkLWQIxbQsCbMmsAHaSvdzrbjof1WhZVJRJiX5JwZFY7Anr9aXaBLfGmpXXTt0EfFgqHZIzHaiG8FrwoiairEutRSNoKWlJOAH0srKo9X68bWjWYA30aGI3EZ2e9G6lln6rakoo5xIF3owCwV8N6Llau2ig63QPftjXLZjAjiL66h_0_teN1FLjA1zferzX7ETVKacUbSWyQGp-D4WfhbXDNkLrMD5JeDtJQCbBn7TUQ4xq8e3r_7MXP6fs6wN2BbpLqzj-b3EKlnvQBB9jgPbO5IKo3XjcuqF246HG8cC0F4cNuku6nQf2F-ajDKQ</recordid><startdate>20200506</startdate><enddate>20200506</enddate><creator>Shapovalov, Maxim</creator><creator>Dunbrack, Jr, Roland L</creator><creator>Vucetic, Slobodan</creator><general>Public Library of Science</general><general>Public Library of Science 
(PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>IOV</scope><scope>ISR</scope><scope>3V.</scope><scope>7QG</scope><scope>7QL</scope><scope>7QO</scope><scope>7RV</scope><scope>7SN</scope><scope>7SS</scope><scope>7T5</scope><scope>7TG</scope><scope>7TM</scope><scope>7U9</scope><scope>7X2</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AO</scope><scope>8C1</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>ATCPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>C1K</scope><scope>CCPQU</scope><scope>D1I</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>H94</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>KB.</scope><scope>KB0</scope><scope>KL.</scope><scope>L6V</scope><scope>LK8</scope><scope>M0K</scope><scope>M0S</scope><scope>M1P</scope><scope>M7N</scope><scope>M7P</scope><scope>M7S</scope><scope>NAPCQ</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PATMY</scope><scope>PDBOC</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>PYCSY</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-9349-7647</orcidid><orcidid>https://orcid.org/0000-0001-7674-6667</orcidid></search><sort><creationdate>20200506</creationdate><title>Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction</title><author>Shapovalov, Maxim ; Dunbrack, Jr, Roland L ; Vucetic, 
Slobodan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c692t-28fc4db408657ceb09318aa7a6ad4787e9059031dab29b7194e0cffe8ed43d8e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Ablation</topic><topic>Accuracy</topic><topic>Acids</topic><topic>Algorithms</topic><topic>Amino Acid Sequence</topic><topic>Amino acids</topic><topic>Amino Acids - chemistry</topic><topic>Artificial neural networks</topic><topic>Biology and Life Sciences</topic><topic>Coils</topic><topic>Computer and Information Sciences</topic><topic>Computer architecture</topic><topic>Databases, Protein - statistics &amp; numerical data</topic><topic>Deep Learning</topic><topic>Evaluation</topic><topic>Homology</topic><topic>Identity</topic><topic>Methods</topic><topic>Neural networks</topic><topic>Neural Networks, Computer</topic><topic>Physical Sciences</topic><topic>Predictions</topic><topic>Protein structure</topic><topic>Protein structure prediction</topic><topic>Protein Structure, Secondary</topic><topic>Proteins</topic><topic>Proteins - chemistry</topic><topic>Protocol (computers)</topic><topic>Research and Analysis Methods</topic><topic>Secondary structure</topic><topic>Software</topic><topic>Solvents</topic><topic>Structure (Literature)</topic><topic>Tertiary structure</topic><topic>Test sets</topic><topic>Testing</topic><topic>Time</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Shapovalov, Maxim</creatorcontrib><creatorcontrib>Dunbrack, Jr, Roland L</creatorcontrib><creatorcontrib>Vucetic, Slobodan</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Opposing 
Viewpoints</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Animal Behavior Abstracts</collection><collection>Bacteriology Abstracts (Microbiology B)</collection><collection>Biotechnology Research Abstracts</collection><collection>Nursing &amp; Allied Health Database (ProQuest)</collection><collection>Ecology Abstracts</collection><collection>Entomology Abstracts (Full archive)</collection><collection>Immunology Abstracts</collection><collection>Meteorological &amp; Geoastrophysical Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Virology and AIDS Abstracts</collection><collection>Agricultural Science Collection</collection><collection>Health &amp; Medical Complete (ProQuest Database)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>ProQuest Public Health Database</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>Agricultural &amp; Environmental Science Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest 
Central</collection><collection>Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Environmental Sciences and Pollution Management</collection><collection>ProQuest One Community College</collection><collection>ProQuest Materials Science Collection</collection><collection>ProQuest Central</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>AIDS and Cancer Research Abstracts</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Materials Science Database</collection><collection>Nursing &amp; Allied Health Database (Alumni Edition)</collection><collection>Meteorological &amp; Geoastrophysical Abstracts - Academic</collection><collection>ProQuest Engineering Collection</collection><collection>Biological Sciences</collection><collection>Agriculture Science Database</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Algology Mycology and Protozoology Abstracts (Microbiology C)</collection><collection>Biological Science Database</collection><collection>Engineering Database</collection><collection>Nursing &amp; Allied Health Premium</collection><collection>ProQuest advanced technologies &amp; aerospace journals</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Environmental Science Database</collection><collection>Materials Science Collection</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One 
Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection><collection>Environmental Science Collection</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>PloS one</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Shapovalov, Maxim</au><au>Dunbrack, Jr, Roland L</au><au>Vucetic, Slobodan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction</atitle><jtitle>PloS one</jtitle><addtitle>PLoS One</addtitle><date>2020-05-06</date><risdate>2020</risdate><volume>15</volume><issue>5</issue><spage>e0232528</spage><pages>e0232528-</pages><issn>1932-6203</issn><eissn>1932-6203</eissn><abstract>Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. 
In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>32374785</pmid><doi>10.1371/journal.pone.0232528</doi><tpages>e0232528</tpages><orcidid>https://orcid.org/0000-0002-9349-7647</orcidid><orcidid>https://orcid.org/0000-0001-7674-6667</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1932-6203
ispartof PloS one, 2020-05, Vol.15 (5), p.e0232528
issn 1932-6203
1932-6203
language eng
recordid cdi_plos_journals_2399252986
source PLoS; MEDLINE; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals; PubMed Central; Free Full-Text Journals in Chemistry
subjects Ablation
Accuracy
Acids
Algorithms
Amino Acid Sequence
Amino acids
Amino Acids - chemistry
Artificial neural networks
Biology and Life Sciences
Coils
Computer and Information Sciences
Computer architecture
Databases, Protein - statistics & numerical data
Deep Learning
Evaluation
Homology
Identity
Methods
Neural networks
Neural Networks, Computer
Physical Sciences
Predictions
Protein structure
Protein structure prediction
Protein Structure, Secondary
Proteins
Proteins - chemistry
Protocol (computers)
Research and Analysis Methods
Secondary structure
Software
Solvents
Structure (Literature)
Tertiary structure
Test sets
Testing
Time
Training
title Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-01T10%3A00%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Multifaceted%20analysis%20of%20training%20and%20testing%20convolutional%20neural%20networks%20for%20protein%20secondary%20structure%20prediction&rft.jtitle=PloS%20one&rft.au=Shapovalov,%20Maxim&rft.date=2020-05-06&rft.volume=15&rft.issue=5&rft.spage=e0232528&rft.pages=e0232528-&rft.issn=1932-6203&rft.eissn=1932-6203&rft_id=info:doi/10.1371/journal.pone.0232528&rft_dat=%3Cgale_plos_%3EA622865091%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2399252986&rft_id=info:pmid/32374785&rft_galeid=A622865091&rft_doaj_id=oai_doaj_org_article_b067861708474aa8b824924060e7855d&rfr_iscdi=true