An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins

Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank where the zinc ions usually have catalytic, regulatory or structural roles critical for the function of the protein. Accurate prediction of zinc-binding sites is not only useful for the inference of protein function...

Bibliographic Details
Published in: PloS one 2012-11, Vol.7 (11), p.e49716-e49716
Main Authors: Zheng, Cheng; Wang, Mingjun; Takemoto, Kazuhiro; Akutsu, Tatsuya; Zhang, Ziding; Song, Jiangning
Format: Article
Language: eng
Subjects:
Online Access: Full text
container_end_page e49716
container_issue 11
container_start_page e49716
container_title PloS one
container_volume 7
creator Zheng, Cheng
Wang, Mingjun
Takemoto, Kazuhiro
Akutsu, Tatsuya
Zhang, Ziding
Song, Jiangning
description Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank, where the zinc ions usually play catalytic, regulatory or structural roles critical to protein function. Accurate prediction of zinc-binding sites is not only useful for the inference of protein function but also important for the prediction of 3D structure. Here, we present a new integrative framework that combines multiple sequence and structural properties with graph-theoretic network features, followed by an efficient feature selection step, to improve prediction of zinc-binding sites. We investigate what information can be retrieved from the sequence, structure and network levels that is relevant to zinc-binding site prediction. We perform a two-step feature selection using random forest to remove redundant features and quantify the relative importance of the retrieved features. Benchmarked on a high-quality structural dataset containing 1,103 protein chains and 484 zinc-binding residues, our method achieved >80% recall at a precision of 75% for the zinc-binding residues Cys, His, Glu and Asp in 5-fold cross-validation tests; at the same 75% precision, this recall is 10%-28% higher than that of SitePredict and zincfinder at the residue level on the same dataset. The independent test also indicates that our method achieved recall of 0.790 and 0.759 at the residue and protein levels, respectively, outperforming the other two methods. Moreover, the AUC (Area Under the Curve) and AURPC (Area Under the Recall-Precision Curve) of our method are also better than those of the other two methods. Our method can not only be applied to large-scale identification of zinc-binding sites when structural information of the target is available, but can also give valuable insights into important features arising from different levels that collectively characterize the zinc-binding sites.
The scripts and datasets are available at http://protein.cau.edu.cn/zincidentifier/.
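The abstract above reports performance as recall at a fixed precision of 75%. As a rough illustration of how such a figure can be read off a ranked list of per-residue predictions, here is a minimal Python sketch. It is not the authors' code: the function name `recall_at_precision`, the toy labels and the toy scores are all assumptions made for this example, and the authors' own scripts at the URL above may compute the metric differently.

```python
def recall_at_precision(labels, scores, target_precision=0.75):
    """Return the highest recall reachable with precision >= target_precision.

    labels: 1 for a zinc-binding residue, 0 otherwise (toy data in this sketch).
    scores: classifier confidence per residue; higher means more likely binding.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")

    # Rank residues from the most to the least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best_recall = 0.0
    true_positives = 0
    for index, (score, label) in enumerate(ranked):
        true_positives += label
        # A score cutoff cannot separate tied scores, so only evaluate
        # after the last residue of each group of equal scores.
        if index + 1 < len(ranked) and ranked[index + 1][0] == score:
            continue
        n_predicted = index + 1
        precision = true_positives / n_predicted
        recall = true_positives / total_positives
        if precision >= target_precision:
            best_recall = max(best_recall, recall)
    return best_recall


# Hypothetical toy example: six residues, three of them zinc-binding.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
print(recall_at_precision(toy_labels, toy_scores, 0.75))
```

Sweeping every score cutoff and keeping the best recall that still satisfies the precision constraint is one common convention; comparing methods "at equal precision", as the abstract does, amounts to calling such a function once per method with the same `target_precision`.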
doi_str_mv 10.1371/journal.pone.0049716
format Article
fullrecord <record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_1339168596</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A477090549</galeid><doaj_id>oai_doaj_org_article_ebd6b62b0bfa4adb82b65bc7cb2c7c2c</doaj_id><sourcerecordid>A477090549</sourcerecordid><originalsourceid>FETCH-LOGICAL-c758t-3a5e5d25cd7b7a53a499ac6ab7a91ae82af459ec9073915577a4c274af06e9e33</originalsourceid><addsrcrecordid>eNqNk1trFTEQxxdRbK1-A9GAIPpwjslmk5x9EUrxUigUvL2G2ezsntTdZJtkW_UT-LHNsaelR_oggVx_809mMlMUTxldMq7YmzM_BwfDcvIOl5RWtWLyXrHPal4uZEn5_VvzveJRjGeUCr6S8mGxV3ImpRJ8v_h96Ih1CfsAyV4gMX6c5pTnPmuTLsCIlz58Jw1EbIl3BEi69IuYcCIBXOtH0vmAMREYeh9sWo_EjlPwFxjJFLC1ZqNFfEd-WWcWjXWtdT2JNmXAusz4hNbFx8WDDoaIT7bjQfH1_bsvRx8XJ6cfjo8OTxZGiVVacBAo2lKYVjUKBIeqrsFIyIuaAa5K6CpRo6mp4jUTQimoTKkq6KjEGjk_KJ5f6U6Dj3obxKgZz7hciVpm4viKaD2c6SnYEcJP7cHqvxs-9BpCsmZAjU0rG1k2tOmggrZZlY0UjVGmKXNXmqz1dnvb3IzYGnQpwLAjunvi7Fr3_kLz7BetaBZ4tRUI_nzOcdajjQaHARz6Ob-bqVowJjnL6It_0Lu921I9ZAes63y-12xE9WGlFK2pqOpMLe-gcmtxtCZnXGfz_o7B6x2DzCT8kXqYY9THnz_9P3v6bZd9eYtdIwxpHf0wb5Iq7oLVFWiCjzFgdxNkRvWmYK6joTcFo7cFk82e3f6gG6PrCuF_AFoAFKA</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1339168596</pqid></control><display><type>article</type><title>An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>PubMed Central</source><source>Free Full-Text Journals in Chemistry</source><source>Public Library of Science (PLoS)</source><creator>Zheng, Cheng ; Wang, Mingjun ; Takemoto, Kazuhiro ; Akutsu, Tatsuya ; Zhang, Ziding ; Song, Jiangning</creator><creatorcontrib>Zheng, Cheng ; Wang, Mingjun ; Takemoto, Kazuhiro ; Akutsu, Tatsuya ; Zhang, Ziding ; Song, 
Jiangning</creatorcontrib><description>Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank where the zinc ions usually have catalytic, regulatory or structural roles critical for the function of the protein. Accurate prediction of zinc-binding sites is not only useful for the inference of protein function but also important for the prediction of 3D structure. Here, we present a new integrative framework that combines multiple sequence and structural properties and graph-theoretic network features, followed by an efficient feature selection to improve prediction of zinc-binding sites. We investigate what information can be retrieved from the sequence, structure and network levels that is relevant to zinc-binding site prediction. We perform a two-step feature selection using random forest to remove redundant features and quantify the relative importance of the retrieved features. Benchmarking on a high-quality structural dataset containing 1,103 protein chains and 484 zinc-binding residues, our method achieved &gt;80% recall at a precision of 75% for the zinc-binding residues Cys, His, Glu and Asp on 5-fold cross-validation tests, which is a 10%-28% higher recall at the 75% equal precision compared to SitePredict and zincfinder at residue level using the same dataset. The independent test also indicates that our method has achieved recall of 0.790 and 0.759 at residue and protein levels, respectively, which is a performance better than the other two methods. Moreover, AUC (the Area Under the Curve) and AURPC (the Area Under the Recall-Precision Curve) by our method are also respectively better than those of the other two methods. Our method can not only be applied to large-scale identification of zinc-binding sites when structural information of the target is available, but also give valuable insights into important features arising from different levels that collectively characterize the zinc-binding sites. 
The scripts and datasets are available at http://protein.cau.edu.cn/zincidentifier/.</description><identifier>ISSN: 1932-6203</identifier><identifier>EISSN: 1932-6203</identifier><identifier>DOI: 10.1371/journal.pone.0049716</identifier><identifier>PMID: 23166753</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Algorithms ; Amino acids ; Amino Acids - chemistry ; Amino Acids - metabolism ; Apoproteins - chemistry ; Apoproteins - metabolism ; Artificial intelligence ; Banks (Finance) ; Binding proteins ; Binding Sites ; Bioinformatics ; Biology ; Biotechnology ; Catalysis ; Computational Biology - methods ; Computer applications ; Enzymes ; Forests ; Graph theory ; Identification methods ; Information retrieval ; International conferences ; Internet ; Laboratories ; Mathematics ; Metalloproteins ; Metalloproteins - chemistry ; Metalloproteins - metabolism ; Methods ; Models, Biological ; Neural networks ; Protein Binding ; Proteins ; Recall ; Reproducibility of Results ; Residues ; ROC Curve ; Solvents ; Zinc ; Zinc - chemistry ; Zinc - metabolism</subject><ispartof>PloS one, 2012-11, Vol.7 (11), p.e49716-e49716</ispartof><rights>COPYRIGHT 2012 Public Library of Science</rights><rights>2012 Zheng et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2012 Zheng et al 2012 Zheng et al</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c758t-3a5e5d25cd7b7a53a499ac6ab7a91ae82af459ec9073915577a4c274af06e9e33</citedby><cites>FETCH-LOGICAL-c758t-3a5e5d25cd7b7a53a499ac6ab7a91ae82af459ec9073915577a4c274af06e9e33</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC3499040/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC3499040/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,860,881,2096,2915,23845,27901,27902,53766,53768,79342,79343</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/23166753$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Zheng, Cheng</creatorcontrib><creatorcontrib>Wang, Mingjun</creatorcontrib><creatorcontrib>Takemoto, Kazuhiro</creatorcontrib><creatorcontrib>Akutsu, Tatsuya</creatorcontrib><creatorcontrib>Zhang, Ziding</creatorcontrib><creatorcontrib>Song, Jiangning</creatorcontrib><title>An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins</title><title>PloS one</title><addtitle>PLoS One</addtitle><description>Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank where the zinc ions usually have catalytic, regulatory or structural roles critical for the function of the protein. 
Accurate prediction of zinc-binding sites is not only useful for the inference of protein function but also important for the prediction of 3D structure. Here, we present a new integrative framework that combines multiple sequence and structural properties and graph-theoretic network features, followed by an efficient feature selection to improve prediction of zinc-binding sites. We investigate what information can be retrieved from the sequence, structure and network levels that is relevant to zinc-binding site prediction. We perform a two-step feature selection using random forest to remove redundant features and quantify the relative importance of the retrieved features. Benchmarking on a high-quality structural dataset containing 1,103 protein chains and 484 zinc-binding residues, our method achieved &gt;80% recall at a precision of 75% for the zinc-binding residues Cys, His, Glu and Asp on 5-fold cross-validation tests, which is a 10%-28% higher recall at the 75% equal precision compared to SitePredict and zincfinder at residue level using the same dataset. The independent test also indicates that our method has achieved recall of 0.790 and 0.759 at residue and protein levels, respectively, which is a performance better than the other two methods. Moreover, AUC (the Area Under the Curve) and AURPC (the Area Under the Recall-Precision Curve) by our method are also respectively better than those of the other two methods. Our method can not only be applied to large-scale identification of zinc-binding sites when structural information of the target is available, but also give valuable insights into important features arising from different levels that collectively characterize the zinc-binding sites. 
The scripts and datasets are available at http://protein.cau.edu.cn/zincidentifier/.</description><subject>Algorithms</subject><subject>Amino acids</subject><subject>Amino Acids - chemistry</subject><subject>Amino Acids - metabolism</subject><subject>Apoproteins - chemistry</subject><subject>Apoproteins - metabolism</subject><subject>Artificial intelligence</subject><subject>Banks (Finance)</subject><subject>Binding proteins</subject><subject>Binding Sites</subject><subject>Bioinformatics</subject><subject>Biology</subject><subject>Biotechnology</subject><subject>Catalysis</subject><subject>Computational Biology - methods</subject><subject>Computer applications</subject><subject>Enzymes</subject><subject>Forests</subject><subject>Graph theory</subject><subject>Identification methods</subject><subject>Information retrieval</subject><subject>International conferences</subject><subject>Internet</subject><subject>Laboratories</subject><subject>Mathematics</subject><subject>Metalloproteins</subject><subject>Metalloproteins - chemistry</subject><subject>Metalloproteins - metabolism</subject><subject>Methods</subject><subject>Models, Biological</subject><subject>Neural networks</subject><subject>Protein Binding</subject><subject>Proteins</subject><subject>Recall</subject><subject>Reproducibility of Results</subject><subject>Residues</subject><subject>ROC Curve</subject><subject>Solvents</subject><subject>Zinc</subject><subject>Zinc - chemistry</subject><subject>Zinc - 
metabolism</subject><issn>1932-6203</issn><issn>1932-6203</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2012</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><sourceid>BENPR</sourceid><sourceid>DOA</sourceid><recordid>eNqNk1trFTEQxxdRbK1-A9GAIPpwjslmk5x9EUrxUigUvL2G2ezsntTdZJtkW_UT-LHNsaelR_oggVx_809mMlMUTxldMq7YmzM_BwfDcvIOl5RWtWLyXrHPal4uZEn5_VvzveJRjGeUCr6S8mGxV3ImpRJ8v_h96Ih1CfsAyV4gMX6c5pTnPmuTLsCIlz58Jw1EbIl3BEi69IuYcCIBXOtH0vmAMREYeh9sWo_EjlPwFxjJFLC1ZqNFfEd-WWcWjXWtdT2JNmXAusz4hNbFx8WDDoaIT7bjQfH1_bsvRx8XJ6cfjo8OTxZGiVVacBAo2lKYVjUKBIeqrsFIyIuaAa5K6CpRo6mp4jUTQimoTKkq6KjEGjk_KJ5f6U6Dj3obxKgZz7hciVpm4viKaD2c6SnYEcJP7cHqvxs-9BpCsmZAjU0rG1k2tOmggrZZlY0UjVGmKXNXmqz1dnvb3IzYGnQpwLAjunvi7Fr3_kLz7BetaBZ4tRUI_nzOcdajjQaHARz6Ob-bqVowJjnL6It_0Lu921I9ZAes63y-12xE9WGlFK2pqOpMLe-gcmtxtCZnXGfz_o7B6x2DzCT8kXqYY9THnz_9P3v6bZd9eYtdIwxpHf0wb5Iq7oLVFWiCjzFgdxNkRvWmYK6joTcFo7cFk82e3f6gG6PrCuF_AFoAFKA</recordid><startdate>20121114</startdate><enddate>20121114</enddate><creator>Zheng, Cheng</creator><creator>Wang, Mingjun</creator><creator>Takemoto, Kazuhiro</creator><creator>Akutsu, Tatsuya</creator><creator>Zhang, Ziding</creator><creator>Song, Jiangning</creator><general>Public Library of Science</general><general>Public Library of Science 
(PLoS)</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>IOV</scope><scope>ISR</scope><scope>3V.</scope><scope>7QG</scope><scope>7QL</scope><scope>7QO</scope><scope>7RV</scope><scope>7SN</scope><scope>7SS</scope><scope>7T5</scope><scope>7TG</scope><scope>7TM</scope><scope>7U9</scope><scope>7X2</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AO</scope><scope>8C1</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>ATCPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>C1K</scope><scope>CCPQU</scope><scope>D1I</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>H94</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>KB.</scope><scope>KB0</scope><scope>KL.</scope><scope>L6V</scope><scope>LK8</scope><scope>M0K</scope><scope>M0S</scope><scope>M1P</scope><scope>M7N</scope><scope>M7P</scope><scope>M7S</scope><scope>NAPCQ</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PATMY</scope><scope>PDBOC</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>PYCSY</scope><scope>RC3</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope></search><sort><creationdate>20121114</creationdate><title>An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins</title><author>Zheng, Cheng ; Wang, Mingjun ; Takemoto, Kazuhiro ; Akutsu, Tatsuya ; Zhang, Ziding ; Song, 
Jiangning</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c758t-3a5e5d25cd7b7a53a499ac6ab7a91ae82af459ec9073915577a4c274af06e9e33</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2012</creationdate><topic>Algorithms</topic><topic>Amino acids</topic><topic>Amino Acids - chemistry</topic><topic>Amino Acids - metabolism</topic><topic>Apoproteins - chemistry</topic><topic>Apoproteins - metabolism</topic><topic>Artificial intelligence</topic><topic>Banks (Finance)</topic><topic>Binding proteins</topic><topic>Binding Sites</topic><topic>Bioinformatics</topic><topic>Biology</topic><topic>Biotechnology</topic><topic>Catalysis</topic><topic>Computational Biology - methods</topic><topic>Computer applications</topic><topic>Enzymes</topic><topic>Forests</topic><topic>Graph theory</topic><topic>Identification methods</topic><topic>Information retrieval</topic><topic>International conferences</topic><topic>Internet</topic><topic>Laboratories</topic><topic>Mathematics</topic><topic>Metalloproteins</topic><topic>Metalloproteins - chemistry</topic><topic>Metalloproteins - metabolism</topic><topic>Methods</topic><topic>Models, Biological</topic><topic>Neural networks</topic><topic>Protein Binding</topic><topic>Proteins</topic><topic>Recall</topic><topic>Reproducibility of Results</topic><topic>Residues</topic><topic>ROC Curve</topic><topic>Solvents</topic><topic>Zinc</topic><topic>Zinc - chemistry</topic><topic>Zinc - metabolism</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zheng, Cheng</creatorcontrib><creatorcontrib>Wang, Mingjun</creatorcontrib><creatorcontrib>Takemoto, Kazuhiro</creatorcontrib><creatorcontrib>Akutsu, Tatsuya</creatorcontrib><creatorcontrib>Zhang, Ziding</creatorcontrib><creatorcontrib>Song, Jiangning</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE 
(Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Opposing Viewpoints</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Animal Behavior Abstracts</collection><collection>Bacteriology Abstracts (Microbiology B)</collection><collection>Biotechnology Research Abstracts</collection><collection>Nursing &amp; Allied Health Database</collection><collection>Ecology Abstracts</collection><collection>Entomology Abstracts (Full archive)</collection><collection>Immunology Abstracts</collection><collection>Meteorological &amp; Geoastrophysical Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Virology and AIDS Abstracts</collection><collection>Agricultural Science Collection</collection><collection>Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>Public Health Database</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Collection</collection><collection>Agricultural &amp; Environmental Science 
Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>Natural Science Collection</collection><collection>Environmental Sciences and Pollution Management</collection><collection>ProQuest One Community College</collection><collection>ProQuest Materials Science Collection</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>AIDS and Cancer Research Abstracts</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Materials Science Database</collection><collection>Nursing &amp; Allied Health Database (Alumni Edition)</collection><collection>Meteorological &amp; Geoastrophysical Abstracts - Academic</collection><collection>ProQuest Engineering Collection</collection><collection>ProQuest Biological Science Collection</collection><collection>Agricultural Science Database</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Algology Mycology and Protozoology Abstracts (Microbiology C)</collection><collection>Biological Science Database</collection><collection>Engineering Database</collection><collection>Nursing &amp; Allied Health Premium</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Environmental Science Database</collection><collection>Materials Science 
Collection</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>Environmental Science Collection</collection><collection>Genetics Abstracts</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>PloS one</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zheng, Cheng</au><au>Wang, Mingjun</au><au>Takemoto, Kazuhiro</au><au>Akutsu, Tatsuya</au><au>Zhang, Ziding</au><au>Song, Jiangning</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins</atitle><jtitle>PloS one</jtitle><addtitle>PLoS One</addtitle><date>2012-11-14</date><risdate>2012</risdate><volume>7</volume><issue>11</issue><spage>e49716</spage><epage>e49716</epage><pages>e49716-e49716</pages><issn>1932-6203</issn><eissn>1932-6203</eissn><abstract>Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank where the zinc ions usually have catalytic, regulatory or structural roles critical for the function of the protein. Accurate prediction of zinc-binding sites is not only useful for the inference of protein function but also important for the prediction of 3D structure. Here, we present a new integrative framework that combines multiple sequence and structural properties and graph-theoretic network features, followed by an efficient feature selection to improve prediction of zinc-binding sites. 
We investigate what information can be retrieved from the sequence, structure and network levels that is relevant to zinc-binding site prediction. We perform a two-step feature selection using random forest to remove redundant features and quantify the relative importance of the retrieved features. Benchmarking on a high-quality structural dataset containing 1,103 protein chains and 484 zinc-binding residues, our method achieved &gt;80% recall at a precision of 75% for the zinc-binding residues Cys, His, Glu and Asp on 5-fold cross-validation tests, which is a 10%-28% higher recall at the 75% equal precision compared to SitePredict and zincfinder at residue level using the same dataset. The independent test also indicates that our method has achieved recall of 0.790 and 0.759 at residue and protein levels, respectively, which is a performance better than the other two methods. Moreover, AUC (the Area Under the Curve) and AURPC (the Area Under the Recall-Precision Curve) by our method are also respectively better than those of the other two methods. Our method can not only be applied to large-scale identification of zinc-binding sites when structural information of the target is available, but also give valuable insights into important features arising from different levels that collectively characterize the zinc-binding sites. The scripts and datasets are available at http://protein.cau.edu.cn/zincidentifier/.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>23166753</pmid><doi>10.1371/journal.pone.0049716</doi><tpages>e49716</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1932-6203
ispartof PloS one, 2012-11, Vol.7 (11), p.e49716-e49716
issn 1932-6203
1932-6203
language eng
recordid cdi_plos_journals_1339168596
source MEDLINE; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central; Free Full-Text Journals in Chemistry; Public Library of Science (PLoS)
subjects Algorithms
Amino acids
Amino Acids - chemistry
Amino Acids - metabolism
Apoproteins - chemistry
Apoproteins - metabolism
Artificial intelligence
Banks (Finance)
Binding proteins
Binding Sites
Bioinformatics
Biology
Biotechnology
Catalysis
Computational Biology - methods
Computer applications
Enzymes
Forests
Graph theory
Identification methods
Information retrieval
International conferences
Internet
Laboratories
Mathematics
Metalloproteins
Metalloproteins - chemistry
Metalloproteins - metabolism
Methods
Models, Biological
Neural networks
Protein Binding
Proteins
Recall
Reproducibility of Results
Residues
ROC Curve
Solvents
Zinc
Zinc - chemistry
Zinc - metabolism
title An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T20%3A40%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20integrative%20computational%20framework%20based%20on%20a%20two-step%20random%20forest%20algorithm%20improves%20prediction%20of%20zinc-binding%20sites%20in%20proteins&rft.jtitle=PloS%20one&rft.au=Zheng,%20Cheng&rft.date=2012-11-14&rft.volume=7&rft.issue=11&rft.spage=e49716&rft.epage=e49716&rft.pages=e49716-e49716&rft.issn=1932-6203&rft.eissn=1932-6203&rft_id=info:doi/10.1371/journal.pone.0049716&rft_dat=%3Cgale_plos_%3EA477090549%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1339168596&rft_id=info:pmid/23166753&rft_galeid=A477090549&rft_doaj_id=oai_doaj_org_article_ebd6b62b0bfa4adb82b65bc7cb2c7c2c&rfr_iscdi=true
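The `url` field above is a link-resolver address whose query string carries the citation as key/value pairs (`rft.jtitle`, `rft.volume`, repeated `rft_id` entries, and so on). A minimal Python sketch of pulling those values back out with the standard library follows. The helper name `parse_openurl` is an assumption for this example, and `OPENURL` is a shortened copy of the link that keeps only a few of its parameters.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the record's resolver link; most parameters are omitted.
OPENURL = (
    "https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004"
    "&rft.genre=article"
    "&rft.jtitle=PloS%20one"
    "&rft.au=Zheng,%20Cheng"
    "&rft.date=2012-11-14"
    "&rft.volume=7&rft.issue=11&rft.spage=e49716"
    "&rft.issn=1932-6203"
    "&rft_id=info:doi/10.1371/journal.pone.0049716"
    "&rft_id=info:pmid/23166753"
)


def parse_openurl(url):
    """Split a resolver link into citation fields and identifiers."""
    # parse_qs percent-decodes values and collects repeated keys into lists.
    query = parse_qs(urlsplit(url).query)

    # Citation metadata lives under the "rft." prefix; keep the first value.
    citation = {
        key[len("rft."):]: values[0]
        for key, values in query.items()
        if key.startswith("rft.")
    }

    # Identifiers are repeated "rft_id" entries shaped like "info:<scheme>/<id>".
    identifiers = {}
    for value in query.get("rft_id", []):
        if value.startswith("info:"):
            scheme, _, ident = value[len("info:"):].partition("/")
            if ident:
                identifiers[scheme] = ident
    return citation, identifiers


citation, identifiers = parse_openurl(OPENURL)
print(citation["jtitle"], citation["volume"], citation["issue"])
print(identifiers)
```

Because `rft_id` occurs more than once in the link, a parser that keeps only the last value per key would silently drop the DOI; `parse_qs` avoids that by returning a list for every key.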