Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function
Abstract Motivation Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data which are not available for thi...
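Saved in:
Published in: | Bioinformatics 2021-04, Vol.37 (2), p.162-170 |
---|---|
Main authors: | Villegas-Morcillo, Amelia; Makrodimitris, Stavros; van Ham, Roeland C H J; Gomez, Angel M; Sanchez, Victoria; Reinders, Marcel J T |
Format: | Article |
Language: | eng |
Subjects: | Amino Acid Sequence; Neural Networks, Computer; Original Papers; Proteins - genetics; Software |
Online access: | Full text |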
container_end_page | 170 |
---|---|
container_issue | 2 |
container_start_page | 162 |
container_title | Bioinformatics |
container_volume | 37 |
creator | Villegas-Morcillo, Amelia; Makrodimitris, Stavros; van Ham, Roeland C H J; Gomez, Angel M; Sanchez, Victoria; Reinders, Marcel J T |
description | Abstract
Motivation
Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require large amounts of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels is available.
Results
We applied an existing deep sequence model, pretrained in an unsupervised setting, to the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. It also partly removes the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not improve performance, hinting that 3D structure may also be learned during the unsupervised pretraining.
Availability and implementation
Implementations of all models used can be found at https://github.com/stamakro/GCN-for-Structure-and-Function.
Supplementary information
Supplementary data are available at Bioinformatics online. |
doi_str_mv | 10.1093/bioinformatics/btaa701 |
format | Article |
fullrecord | sourceid: proquest_pubme; recordid: TN_cdi_pubmed_primary_32797179; sourcerecordid: 2434481065; sourcetype: Open Access Repository; publisher: England: Oxford University Press; pmid: 32797179; date: 2021-04-19; tpages: 9; rights: The Author(s) 2020. Published by Oxford University Press.; lds50: peer_reviewed; oa: free_for_read; orcidid: https://orcid.org/0000-0002-3111-4268, https://orcid.org/0000-0002-3286-049X; linktohtml: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055213/; linktopdf: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055213/pdf/; backlink: https://www.ncbi.nlm.nih.gov/pubmed/32797179 |
fulltext | fulltext |
identifier | ISSN: 1367-4803 |
ispartof | Bioinformatics, 2021-04, Vol.37 (2), p.162-170 |
issn | 1367-4803; 1460-2059; 1367-4811 |
language | eng |
recordid | cdi_pubmed_primary_32797179 |
source | Oxford Journals Open Access Collection; MEDLINE; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central; Alma/SFX Local Collection |
subjects | Amino Acid Sequence; Neural Networks, Computer; Original Papers; Proteins - genetics; Software |
title | Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-18T09%3A18%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unsupervised%20protein%20embeddings%20outperform%20hand-crafted%20sequence%20and%20structure%20features%20at%20predicting%20molecular%20function&rft.jtitle=Bioinformatics&rft.au=Villegas-Morcillo,%20Amelia&rft.date=2021-04-19&rft.volume=37&rft.issue=2&rft.spage=162&rft.epage=170&rft.pages=162-170&rft.issn=1367-4803&rft.eissn=1460-2059&rft_id=info:doi/10.1093/bioinformatics/btaa701&rft_dat=%3Cproquest_pubme%3E2434481065%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2434481065&rft_id=info:pmid/32797179&rft_oup_id=10.1093/bioinformatics/btaa701&rfr_iscdi=true |
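
The abstract names one-hot encoding and k-mer counts as hand-crafted baseline features and a two-layer perceptron as the prediction model. The sketch below illustrates what those terms can mean in practice. It is not taken from the authors' repository: the function names, the 20-letter alphabet constant, the ReLU hidden layer and the sigmoid outputs are all assumptions chosen for illustration, and the weights in the usage note are placeholders rather than trained values.

```python
import math
from itertools import product

# Assumption: the 20 standard amino acids; other symbols are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Return one 20-element row per residue, with a single 1 at the residue's index.

    Residues outside AMINO_ACIDS produce an all-zero row.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:
            row[index[residue]] = 1
        rows.append(row)
    return rows


def kmer_counts(sequence, k=2):
    """Return a fixed-length vector (20**k entries) of overlapping k-mer counts."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    position = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in position:
            counts[position[kmer]] += 1
    return counts


def two_layer_perceptron(features, w1, b1, w2, b2):
    """Forward pass of a hypothetical two-layer perceptron.

    w1 and w2 hold one weight vector per output unit of their layer.
    The hidden layer uses ReLU; each output is an independent sigmoid,
    so one protein can receive several function labels at once.
    """
    hidden = [max(0.0, sum(x * w for x, w in zip(features, unit)) + b)
              for unit, b in zip(w1, b1)]
    logits = [sum(h * w for h, w in zip(hidden, unit)) + b
              for unit, b in zip(w2, b2)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]
```

As a usage example, `kmer_counts("MKVLAAGIV", k=1)` gives a 20-dimensional composition vector that can be passed as `features`, with `w1` holding one 20-element weight vector per hidden unit. In the setting the abstract describes, the input would instead be a fixed-length embedding from the pretrained sequence model, and the weights would be learned from labeled proteins.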