DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets

Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources ty...

Bibliographic details
Published in: PLoS computational biology 2022-10, Vol.18 (10), p.e1010610
Main authors: Russo, Elena Tea, Barone, Federico, Bateman, Alex, Cozzini, Stefano, Punta, Marco, Laio, Alessandro
Format: Article
Language: eng
Online access: Full text
container_end_page
container_issue 10
container_start_page e1010610
container_title PLoS computational biology
container_volume 18
creator Russo, Elena Tea
Barone, Federico
Bateman, Alex
Cozzini, Stefano
Punta, Marco
Laio, Alessandro
description Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
doi_str_mv 10.1371/journal.pcbi.1010610
format Article
fullrecord <record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_2737143059</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A724753998</galeid><doaj_id>oai_doaj_org_article_14dfd283ab414e45854d9b5c99ebf75a</doaj_id><sourcerecordid>A724753998</sourcerecordid><originalsourceid>FETCH-LOGICAL-c633t-2ac079bb16c3b55007c422e549ee554074dd463c22318034c9b7ae367b0c55973</originalsourceid><addsrcrecordid>eNqVkk1vEzEQhlcIREvhHyCwxAUOCfb6K-4BqUr5iFRBBfRseb2zW4eNHWxvRf49DkmrBnFBPnjkeead8aupqucETwmV5O0yjNGbYbq2jZsSTLAg-EF1TDinE0n57OG9-Kh6ktIS4xIq8bg6oqIWhRfH1fX55bwzq1N05dO4hnjjErRoHUMG51HJuGGD7GBScp2zJrvgUbNB5-CTyxt0CeYHmg9jyhCd71Ho0GBiDyjBzxG8BdSabBLk9LR61JkhwbP9fVJdfXj_ff5pcvHl42J-djGxgtI8qY3FUjUNEZY2nGMsLatr4EwBcM6wZG3LBLV1TckMU2ZVIw1QIRtsOVeSnlQvd7rrISS9NynpWhbPGMVcFWKxI9pglnod3crEjQ7G6T8PIfbaxOzsAJqwtmvrGTUNIwwYn3HWqoZbpaDpJDdF692-29isoLXgczTDgehhxrtr3YcbrURNyixF4PVeIIbiWMp65ZKFYTAewriduxYMK0FZQV_9hf77d9Md1ZvyAee7UPraclpYORs8dK68n8maSU6VmpWCNwcFhcnwK_dmTEkvvn39D_bzIct2rI0hpQjdnSsE6-0G346vtxus9xtcyl7cd_Su6HZl6W-ZZu0O</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2737143059</pqid></control><display><type>article</type><title>DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets</title><source>PubMed (Medline)</source><source>MEDLINE</source><source>Public Library of Science</source><source>Directory of Open Access Journals</source><source>EZB Electronic Journals Library</source><creator>Russo, Elena Tea ; Barone, Federico ; Bateman, Alex ; Cozzini, Stefano ; Punta, Marco ; Laio, Alessandro</creator><contributor>Dunbrack, Roland L.</contributor><creatorcontrib>Russo, Elena Tea ; Barone, Federico ; Bateman, Alex ; Cozzini, Stefano ; Punta, Marco ; Laio, Alessandro ; Dunbrack, Roland L.</creatorcontrib><description>Proteins that are known only at a sequence level outnumber those with an experimental 
characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. 
These results are made available to the scientific community through a dedicated repository.</description><identifier>ISSN: 1553-7358</identifier><identifier>ISSN: 1553-734X</identifier><identifier>EISSN: 1553-7358</identifier><identifier>DOI: 10.1371/journal.pcbi.1010610</identifier><identifier>PMID: 36260616</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Algorithms ; Amino Acid Sequence ; Annotations ; Automatic classification ; Biology and Life Sciences ; Classification ; Cluster Analysis ; Clustering ; Databases, Protein ; Datasets ; Density ; Domains ; Hypotheses ; Identification and classification ; Methods ; Protein Domains ; Protein families ; Proteins ; Proteins - genetics ; Research and Analysis Methods ; Sequences</subject><ispartof>PLoS computational biology, 2022-10, Vol.18 (10), p.e1010610</ispartof><rights>COPYRIGHT 2022 Public Library of Science</rights><rights>2022 Russo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2022 Russo et al 2022 Russo et al</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c633t-2ac079bb16c3b55007c422e549ee554074dd463c22318034c9b7ae367b0c55973</citedby><cites>FETCH-LOGICAL-c633t-2ac079bb16c3b55007c422e549ee554074dd463c22318034c9b7ae367b0c55973</cites><orcidid>0000-0002-6982-4660 ; 0000-0001-5696-670X ; 0000-0002-0061-2328 ; 0000-0001-9164-7907</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC9621593/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC9621593/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,864,885,2102,2928,23866,27924,27925,53791,53793,79600,79601</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/36260616$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><contributor>Dunbrack, Roland L.</contributor><creatorcontrib>Russo, Elena Tea</creatorcontrib><creatorcontrib>Barone, Federico</creatorcontrib><creatorcontrib>Bateman, Alex</creatorcontrib><creatorcontrib>Cozzini, Stefano</creatorcontrib><creatorcontrib>Punta, Marco</creatorcontrib><creatorcontrib>Laio, Alessandro</creatorcontrib><title>DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets</title><title>PLoS computational biology</title><addtitle>PLoS Comput Biol</addtitle><description>Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. 
Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. 
These results are made available to the scientific community through a dedicated repository.</description><subject>Algorithms</subject><subject>Amino Acid Sequence</subject><subject>Annotations</subject><subject>Automatic classification</subject><subject>Biology and Life Sciences</subject><subject>Classification</subject><subject>Cluster Analysis</subject><subject>Clustering</subject><subject>Databases, Protein</subject><subject>Datasets</subject><subject>Density</subject><subject>Domains</subject><subject>Hypotheses</subject><subject>Identification and classification</subject><subject>Methods</subject><subject>Protein Domains</subject><subject>Protein families</subject><subject>Proteins</subject><subject>Proteins - genetics</subject><subject>Research and Analysis Methods</subject><subject>Sequences</subject><issn>1553-7358</issn><issn>1553-734X</issn><issn>1553-7358</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><sourceid>DOA</sourceid><recordid>eNqVkk1vEzEQhlcIREvhHyCwxAUOCfb6K-4BqUr5iFRBBfRseb2zW4eNHWxvRf49DkmrBnFBPnjkeead8aupqucETwmV5O0yjNGbYbq2jZsSTLAg-EF1TDinE0n57OG9-Kh6ktIS4xIq8bg6oqIWhRfH1fX55bwzq1N05dO4hnjjErRoHUMG51HJuGGD7GBScp2zJrvgUbNB5-CTyxt0CeYHmg9jyhCd71Ho0GBiDyjBzxG8BdSabBLk9LR61JkhwbP9fVJdfXj_ff5pcvHl42J-djGxgtI8qY3FUjUNEZY2nGMsLatr4EwBcM6wZG3LBLV1TckMU2ZVIw1QIRtsOVeSnlQvd7rrISS9NynpWhbPGMVcFWKxI9pglnod3crEjQ7G6T8PIfbaxOzsAJqwtmvrGTUNIwwYn3HWqoZbpaDpJDdF692-29isoLXgczTDgehhxrtr3YcbrURNyixF4PVeIIbiWMp65ZKFYTAewriduxYMK0FZQV_9hf77d9Md1ZvyAee7UPraclpYORs8dK68n8maSU6VmpWCNwcFhcnwK_dmTEkvvn39D_bzIct2rI0hpQjdnSsE6-0G346vtxus9xtcyl7cd_Su6HZl6W-ZZu0O</recordid><startdate>20221019</startdate><enddate>20221019</enddate><creator>Russo, Elena Tea</creator><creator>Barone, 
Federico</creator><creator>Bateman, Alex</creator><creator>Cozzini, Stefano</creator><creator>Punta, Marco</creator><creator>Laio, Alessandro</creator><general>Public Library of Science</general><general>Public Library of Science (PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISN</scope><scope>ISR</scope><scope>3V.</scope><scope>7QO</scope><scope>7QP</scope><scope>7TK</scope><scope>7TM</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AL</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>K9.</scope><scope>LK8</scope><scope>M0N</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>Q9U</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-6982-4660</orcidid><orcidid>https://orcid.org/0000-0001-5696-670X</orcidid><orcidid>https://orcid.org/0000-0002-0061-2328</orcidid><orcidid>https://orcid.org/0000-0001-9164-7907</orcidid></search><sort><creationdate>20221019</creationdate><title>DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets</title><author>Russo, Elena Tea ; Barone, Federico ; Bateman, Alex ; Cozzini, Stefano ; Punta, Marco ; Laio, 
Alessandro</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c633t-2ac079bb16c3b55007c422e549ee554074dd463c22318034c9b7ae367b0c55973</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Algorithms</topic><topic>Amino Acid Sequence</topic><topic>Annotations</topic><topic>Automatic classification</topic><topic>Biology and Life Sciences</topic><topic>Classification</topic><topic>Cluster Analysis</topic><topic>Clustering</topic><topic>Databases, Protein</topic><topic>Datasets</topic><topic>Density</topic><topic>Domains</topic><topic>Hypotheses</topic><topic>Identification and classification</topic><topic>Methods</topic><topic>Protein Domains</topic><topic>Protein families</topic><topic>Proteins</topic><topic>Proteins - genetics</topic><topic>Research and Analysis Methods</topic><topic>Sequences</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Russo, Elena Tea</creatorcontrib><creatorcontrib>Barone, Federico</creatorcontrib><creatorcontrib>Bateman, Alex</creatorcontrib><creatorcontrib>Cozzini, Stefano</creatorcontrib><creatorcontrib>Punta, Marco</creatorcontrib><creatorcontrib>Laio, Alessandro</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Canada</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Biotechnology Research Abstracts</collection><collection>Calcium &amp; Calcified Tissue Abstracts</collection><collection>Neurosciences Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>ProQuest Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 
2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>AUTh Library subscriptions: ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Biological Sciences</collection><collection>Computing Database</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>PML(ProQuest Medical Library)</collection><collection>ProQuest Biological Science Journals</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace 
Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>ProQuest Publicly Available Content database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest Central Basic</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>Directory of Open Access Journals</collection><jtitle>PLoS computational biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Russo, Elena Tea</au><au>Barone, Federico</au><au>Bateman, Alex</au><au>Cozzini, Stefano</au><au>Punta, Marco</au><au>Laio, Alessandro</au><au>Dunbrack, Roland L.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets</atitle><jtitle>PLoS computational biology</jtitle><addtitle>PLoS Comput Biol</addtitle><date>2022-10-19</date><risdate>2022</risdate><volume>18</volume><issue>10</issue><spage>e1010610</spage><pages>e1010610-</pages><issn>1553-7358</issn><issn>1553-734X</issn><eissn>1553-7358</eissn><abstract>Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 
2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>36260616</pmid><doi>10.1371/journal.pcbi.1010610</doi><tpages>e1010610</tpages><orcidid>https://orcid.org/0000-0002-6982-4660</orcidid><orcidid>https://orcid.org/0000-0001-5696-670X</orcidid><orcidid>https://orcid.org/0000-0002-0061-2328</orcidid><orcidid>https://orcid.org/0000-0001-9164-7907</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1553-7358
ispartof PLoS computational biology, 2022-10, Vol.18 (10), p.e1010610
issn 1553-7358
1553-734X
1553-7358
language eng
recordid cdi_plos_journals_2737143059
source PubMed (Medline); MEDLINE; Public Library of Science; Directory of Open Access Journals; EZB Electronic Journals Library
subjects Algorithms
Amino Acid Sequence
Annotations
Automatic classification
Biology and Life Sciences
Classification
Cluster Analysis
Clustering
Databases, Protein
Datasets
Density
Domains
Hypotheses
Identification and classification
Methods
Protein Domains
Protein families
Proteins
Proteins - genetics
Research and Analysis Methods
Sequences
title DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T05%3A52%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=DPCfam:%20Unsupervised%20protein%20family%20classification%20by%20Density%20Peak%20Clustering%20of%20large%20sequence%20datasets&rft.jtitle=PLoS%20computational%20biology&rft.au=Russo,%20Elena%20Tea&rft.date=2022-10-19&rft.volume=18&rft.issue=10&rft.spage=e1010610&rft.pages=e1010610-&rft.issn=1553-7358&rft.eissn=1553-7358&rft_id=info:doi/10.1371/journal.pcbi.1010610&rft_dat=%3Cgale_plos_%3EA724753998%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2737143059&rft_id=info:pmid/36260616&rft_galeid=A724753998&rft_doaj_id=oai_doaj_org_article_14dfd283ab414e45854d9b5c99ebf75a&rfr_iscdi=true