DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets
Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources ty...
Published in: | PLoS computational biology 2022-10, Vol.18 (10), p.e1010610 |
---|---|
Main authors: | Russo, Elena Tea ; Barone, Federico ; Bateman, Alex ; Cozzini, Stefano ; Punta, Marco ; Laio, Alessandro |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | 10 |
container_start_page | e1010610 |
container_title | PLoS computational biology |
container_volume | 18 |
creator | Russo, Elena Tea ; Barone, Federico ; Bateman, Alex ; Cozzini, Stefano ; Punta, Marco ; Laio, Alessandro |
description | Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository. |
doi_str_mv | 10.1371/journal.pcbi.1010610 |
format | Article |
fullrecord | <record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_2737143059</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A724753998</galeid><doaj_id>oai_doaj_org_article_14dfd283ab414e45854d9b5c99ebf75a</doaj_id><sourcerecordid>A724753998</sourcerecordid><originalsourceid>FETCH-LOGICAL-c633t-2ac079bb16c3b55007c422e549ee554074dd463c22318034c9b7ae367b0c55973</originalsourceid><addsrcrecordid>eNqVkk1vEzEQhlcIREvhHyCwxAUOCfb6K-4BqUr5iFRBBfRseb2zW4eNHWxvRf49DkmrBnFBPnjkeead8aupqucETwmV5O0yjNGbYbq2jZsSTLAg-EF1TDinE0n57OG9-Kh6ktIS4xIq8bg6oqIWhRfH1fX55bwzq1N05dO4hnjjErRoHUMG51HJuGGD7GBScp2zJrvgUbNB5-CTyxt0CeYHmg9jyhCd71Ho0GBiDyjBzxG8BdSabBLk9LR61JkhwbP9fVJdfXj_ff5pcvHl42J-djGxgtI8qY3FUjUNEZY2nGMsLatr4EwBcM6wZG3LBLV1TckMU2ZVIw1QIRtsOVeSnlQvd7rrISS9NynpWhbPGMVcFWKxI9pglnod3crEjQ7G6T8PIfbaxOzsAJqwtmvrGTUNIwwYn3HWqoZbpaDpJDdF692-29isoLXgczTDgehhxrtr3YcbrURNyixF4PVeIIbiWMp65ZKFYTAewriduxYMK0FZQV_9hf77d9Md1ZvyAee7UPraclpYORs8dK68n8maSU6VmpWCNwcFhcnwK_dmTEkvvn39D_bzIct2rI0hpQjdnSsE6-0G346vtxus9xtcyl7cd_Su6HZl6W-ZZu0O</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2737143059</pqid></control><display><type>article</type><title>DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets</title><source>PubMed (Medline)</source><source>MEDLINE</source><source>Public Library of Science</source><source>Directory of Open Access Journals</source><source>EZB Electronic Journals Library</source><creator>Russo, Elena Tea ; Barone, Federico ; Bateman, Alex ; Cozzini, Stefano ; Punta, Marco ; Laio, Alessandro</creator><contributor>Dunbrack, Roland L.</contributor><creatorcontrib>Russo, Elena Tea ; Barone, Federico ; Bateman, Alex ; Cozzini, Stefano ; Punta, Marco ; Laio, Alessandro ; Dunbrack, Roland L.</creatorcontrib><description>Proteins that are known only at a sequence level outnumber those with an experimental 
characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. 
These results are made available to the scientific community through a dedicated repository.</description><identifier>ISSN: 1553-7358</identifier><identifier>ISSN: 1553-734X</identifier><identifier>EISSN: 1553-7358</identifier><identifier>DOI: 10.1371/journal.pcbi.1010610</identifier><identifier>PMID: 36260616</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Algorithms ; Amino Acid Sequence ; Annotations ; Automatic classification ; Biology and Life Sciences ; Classification ; Cluster Analysis ; Clustering ; Databases, Protein ; Datasets ; Density ; Domains ; Hypotheses ; Identification and classification ; Methods ; Protein Domains ; Protein families ; Proteins ; Proteins - genetics ; Research and Analysis Methods ; Sequences</subject><ispartof>PLoS computational biology, 2022-10, Vol.18 (10), p.e1010610</ispartof><rights>COPYRIGHT 2022 Public Library of Science</rights><rights>2022 Russo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2022 Russo et al 2022 Russo et al</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c633t-2ac079bb16c3b55007c422e549ee554074dd463c22318034c9b7ae367b0c55973</citedby><cites>FETCH-LOGICAL-c633t-2ac079bb16c3b55007c422e549ee554074dd463c22318034c9b7ae367b0c55973</cites><orcidid>0000-0002-6982-4660 ; 0000-0001-5696-670X ; 0000-0002-0061-2328 ; 0000-0001-9164-7907</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC9621593/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC9621593/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,864,885,2102,2928,23866,27924,27925,53791,53793,79600,79601</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/36260616$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><contributor>Dunbrack, Roland L.</contributor><creatorcontrib>Russo, Elena Tea</creatorcontrib><creatorcontrib>Barone, Federico</creatorcontrib><creatorcontrib>Bateman, Alex</creatorcontrib><creatorcontrib>Cozzini, Stefano</creatorcontrib><creatorcontrib>Punta, Marco</creatorcontrib><creatorcontrib>Laio, Alessandro</creatorcontrib><title>DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets</title><title>PLoS computational biology</title><addtitle>PLoS Comput Biol</addtitle><description>Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. 
Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. 
These results are made available to the scientific community through a dedicated repository.</description><subject>Algorithms</subject><subject>Amino Acid Sequence</subject><subject>Annotations</subject><subject>Automatic classification</subject><subject>Biology and Life Sciences</subject><subject>Classification</subject><subject>Cluster Analysis</subject><subject>Clustering</subject><subject>Databases, Protein</subject><subject>Datasets</subject><subject>Density</subject><subject>Domains</subject><subject>Hypotheses</subject><subject>Identification and classification</subject><subject>Methods</subject><subject>Protein Domains</subject><subject>Protein families</subject><subject>Proteins</subject><subject>Proteins - genetics</subject><subject>Research and Analysis Methods</subject><subject>Sequences</subject><issn>1553-7358</issn><issn>1553-734X</issn><issn>1553-7358</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><sourceid>DOA</sourceid><recordid>eNqVkk1vEzEQhlcIREvhHyCwxAUOCfb6K-4BqUr5iFRBBfRseb2zW4eNHWxvRf49DkmrBnFBPnjkeead8aupqucETwmV5O0yjNGbYbq2jZsSTLAg-EF1TDinE0n57OG9-Kh6ktIS4xIq8bg6oqIWhRfH1fX55bwzq1N05dO4hnjjErRoHUMG51HJuGGD7GBScp2zJrvgUbNB5-CTyxt0CeYHmg9jyhCd71Ho0GBiDyjBzxG8BdSabBLk9LR61JkhwbP9fVJdfXj_ff5pcvHl42J-djGxgtI8qY3FUjUNEZY2nGMsLatr4EwBcM6wZG3LBLV1TckMU2ZVIw1QIRtsOVeSnlQvd7rrISS9NynpWhbPGMVcFWKxI9pglnod3crEjQ7G6T8PIfbaxOzsAJqwtmvrGTUNIwwYn3HWqoZbpaDpJDdF692-29isoLXgczTDgehhxrtr3YcbrURNyixF4PVeIIbiWMp65ZKFYTAewriduxYMK0FZQV_9hf77d9Md1ZvyAee7UPraclpYORs8dK68n8maSU6VmpWCNwcFhcnwK_dmTEkvvn39D_bzIct2rI0hpQjdnSsE6-0G346vtxus9xtcyl7cd_Su6HZl6W-ZZu0O</recordid><startdate>20221019</startdate><enddate>20221019</enddate><creator>Russo, Elena Tea</creator><creator>Barone, 
Federico</creator><creator>Bateman, Alex</creator><creator>Cozzini, Stefano</creator><creator>Punta, Marco</creator><creator>Laio, Alessandro</creator><general>Public Library of Science</general><general>Public Library of Science (PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISN</scope><scope>ISR</scope><scope>3V.</scope><scope>7QO</scope><scope>7QP</scope><scope>7TK</scope><scope>7TM</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AL</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>K9.</scope><scope>LK8</scope><scope>M0N</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>Q9U</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-6982-4660</orcidid><orcidid>https://orcid.org/0000-0001-5696-670X</orcidid><orcidid>https://orcid.org/0000-0002-0061-2328</orcidid><orcidid>https://orcid.org/0000-0001-9164-7907</orcidid></search><sort><creationdate>20221019</creationdate><title>DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets</title><author>Russo, Elena Tea ; Barone, Federico ; Bateman, Alex ; Cozzini, Stefano ; Punta, Marco ; Laio, 
Alessandro</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c633t-2ac079bb16c3b55007c422e549ee554074dd463c22318034c9b7ae367b0c55973</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Algorithms</topic><topic>Amino Acid Sequence</topic><topic>Annotations</topic><topic>Automatic classification</topic><topic>Biology and Life Sciences</topic><topic>Classification</topic><topic>Cluster Analysis</topic><topic>Clustering</topic><topic>Databases, Protein</topic><topic>Datasets</topic><topic>Density</topic><topic>Domains</topic><topic>Hypotheses</topic><topic>Identification and classification</topic><topic>Methods</topic><topic>Protein Domains</topic><topic>Protein families</topic><topic>Proteins</topic><topic>Proteins - genetics</topic><topic>Research and Analysis Methods</topic><topic>Sequences</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Russo, Elena Tea</creatorcontrib><creatorcontrib>Barone, Federico</creatorcontrib><creatorcontrib>Bateman, Alex</creatorcontrib><creatorcontrib>Cozzini, Stefano</creatorcontrib><creatorcontrib>Punta, Marco</creatorcontrib><creatorcontrib>Laio, Alessandro</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Canada</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Biotechnology Research Abstracts</collection><collection>Calcium & Calcified Tissue Abstracts</collection><collection>Neurosciences Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>ProQuest Health & Medical Collection</collection><collection>ProQuest Central (purchase pre-March 
2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>AUTh Library subscriptions: ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>Biological Sciences</collection><collection>Computing Database</collection><collection>Health & Medical Collection (Alumni Edition)</collection><collection>PML(ProQuest Medical Library)</collection><collection>ProQuest Biological Science Journals</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>Biotechnology and 
BioEngineering Abstracts</collection><collection>ProQuest Publicly Available Content database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest Central Basic</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>Directory of Open Access Journals</collection><jtitle>PLoS computational biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Russo, Elena Tea</au><au>Barone, Federico</au><au>Bateman, Alex</au><au>Cozzini, Stefano</au><au>Punta, Marco</au><au>Laio, Alessandro</au><au>Dunbrack, Roland L.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets</atitle><jtitle>PLoS computational biology</jtitle><addtitle>PLoS Comput Biol</addtitle><date>2022-10-19</date><risdate>2022</risdate><volume>18</volume><issue>10</issue><spage>e1010610</spage><pages>e1010610-</pages><issn>1553-7358</issn><issn>1553-734X</issn><eissn>1553-7358</eissn><abstract>Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 
2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>36260616</pmid><doi>10.1371/journal.pcbi.1010610</doi><tpages>e1010610</tpages><orcidid>https://orcid.org/0000-0002-6982-4660</orcidid><orcidid>https://orcid.org/0000-0001-5696-670X</orcidid><orcidid>https://orcid.org/0000-0002-0061-2328</orcidid><orcidid>https://orcid.org/0000-0001-9164-7907</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1553-7358 |
ispartof | PLoS computational biology, 2022-10, Vol.18 (10), p.e1010610 |
issn | 1553-7358 ; 1553-734X ; 1553-7358 |
language | eng |
recordid | cdi_plos_journals_2737143059 |
source | PubMed (Medline); MEDLINE; Public Library of Science; Directory of Open Access Journals; EZB Electronic Journals Library |
subjects | Algorithms ; Amino Acid Sequence ; Annotations ; Automatic classification ; Biology and Life Sciences ; Classification ; Cluster Analysis ; Clustering ; Databases, Protein ; Datasets ; Density ; Domains ; Hypotheses ; Identification and classification ; Methods ; Protein Domains ; Protein families ; Proteins ; Proteins - genetics ; Research and Analysis Methods ; Sequences |
title | DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T05%3A52%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=DPCfam:%20Unsupervised%20protein%20family%20classification%20by%20Density%20Peak%20Clustering%20of%20large%20sequence%20datasets&rft.jtitle=PLoS%20computational%20biology&rft.au=Russo,%20Elena%20Tea&rft.date=2022-10-19&rft.volume=18&rft.issue=10&rft.spage=e1010610&rft.pages=e1010610-&rft.issn=1553-7358&rft.eissn=1553-7358&rft_id=info:doi/10.1371/journal.pcbi.1010610&rft_dat=%3Cgale_plos_%3EA724753998%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2737143059&rft_id=info:pmid/36260616&rft_galeid=A724753998&rft_doaj_id=oai_doaj_org_article_14dfd283ab414e45854d9b5c99ebf75a&rfr_iscdi=true |
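
The description in this record names Density Peak Clustering as the core of DPCfam but does not spell out the algorithm. The following is a minimal, generic sketch of the density-peak idea on a handful of 2D points, not the DPCfam pipeline itself: the function name `density_peak_clustering`, the cutoff `d_c`, the fixed `n_clusters`, and the Euclidean distance are all assumptions made for illustration (the article works with a distance between protein regions and far larger data).

```python
import math


def density_peak_clustering(points, d_c, n_clusters):
    """Toy Density Peak Clustering; returns one cluster label per point.

    Illustrative only: names and parameters are hypothetical, and the
    O(n^2) distance matrix would not scale to millions of sequences.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # rho[i]: local density, the number of other points closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta[i]: distance to the nearest point that comes earlier in `order`
    # (i.e. denser, or equally dense with a lower index).
    delta = [0.0] * n
    nearest_denser = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest point: by convention, delta is its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_denser[i] = j

    # Cluster centers: points with the largest rho * delta.
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], i))[:n_clusters]
    labels = [-1] * n
    for label, i in enumerate(centers):
        labels[i] = label

    # Every other point inherits the label of its nearest denser neighbor,
    # which has already been labeled because we walk in density order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels


# Two well-separated groups of five points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = density_peak_clustering(points, d_c=1.0, n_clusters=2)
print(labels)
```

In this toy input the two middle points have the highest density and lie far from each other, so they are picked as centers and each corner point joins the group of its nearest denser neighbor. Fixing `n_clusters` in advance is a simplification chosen here; in practice centers are often selected by inspecting the density and distance values.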