An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins
Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank where the zinc ions usually have catalytic, regulatory or structural roles critical for the function of the protein. Accurate prediction of zinc-binding sites is not only useful for the inference of protein function...
Published in: | PloS one 2012-11, Vol.7 (11), p.e49716-e49716 |
---|---|
Main authors: | Zheng, Cheng; Wang, Mingjun; Takemoto, Kazuhiro; Akutsu, Tatsuya; Zhang, Ziding; Song, Jiangning |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | e49716 |
---|---|
container_issue | 11 |
container_start_page | e49716 |
container_title | PloS one |
container_volume | 7 |
creator | Zheng, Cheng; Wang, Mingjun; Takemoto, Kazuhiro; Akutsu, Tatsuya; Zhang, Ziding; Song, Jiangning |
description | Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank, where the zinc ions usually have catalytic, regulatory or structural roles critical for the function of the protein. Accurate prediction of zinc-binding sites is useful not only for inferring protein function but also for predicting 3D structure. Here, we present a new integrative framework that combines multiple sequence and structural properties with graph-theoretic network features, followed by an efficient feature selection, to improve prediction of zinc-binding sites. We investigate what information relevant to zinc-binding site prediction can be retrieved at the sequence, structure and network levels. We perform a two-step feature selection using random forest to remove redundant features and quantify the relative importance of the retrieved features. Benchmarked on a high-quality structural dataset containing 1,103 protein chains and 484 zinc-binding residues, our method achieved >80% recall at a precision of 75% for the zinc-binding residues Cys, His, Glu and Asp in 5-fold cross-validation tests, a 10%-28% higher recall at the same 75% precision than SitePredict and zincfinder at the residue level on the same dataset. The independent test also indicates that our method achieves recall of 0.790 and 0.759 at the residue and protein levels, respectively, outperforming the other two methods. Moreover, the AUC (Area Under the Curve) and AURPC (Area Under the Recall-Precision Curve) of our method are also better than those of the other two methods. Our method can not only be applied to large-scale identification of zinc-binding sites when structural information of the target is available, but also gives valuable insights into important features arising from different levels that collectively characterize the zinc-binding sites. The scripts and datasets are available at http://protein.cau.edu.cn/zincidentifier/. |
doi_str_mv | 10.1371/journal.pone.0049716 |
format | Article |
addtitle | PLoS One |
date | 2012-11-14 |
genre | article |
linktohtml | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3499040/ |
linktopdf | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3499040/pdf/ |
oa | free_for_read |
pmid | 23166753 |
publisher | United States: Public Library of Science |
rights | 2012 Zheng et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
toplevel | peer_reviewed; online_resources |
fulltext | fulltext |
identifier | ISSN: 1932-6203 |
ispartof | PloS one, 2012-11, Vol.7 (11), p.e49716-e49716 |
issn | 1932-6203 1932-6203 |
language | eng |
recordid | cdi_plos_journals_1339168596 |
source | MEDLINE; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central; Free Full-Text Journals in Chemistry; Public Library of Science (PLoS) |
subjects | Algorithms; Amino acids; Amino Acids - chemistry; Amino Acids - metabolism; Apoproteins - chemistry; Apoproteins - metabolism; Artificial intelligence; Banks (Finance); Binding proteins; Binding Sites; Bioinformatics; Biology; Biotechnology; Catalysis; Computational Biology - methods; Computer applications; Enzymes; Forests; Graph theory; Identification methods; Information retrieval; International conferences; Internet; Laboratories; Mathematics; Metalloproteins; Metalloproteins - chemistry; Metalloproteins - metabolism; Methods; Models, Biological; Neural networks; Protein Binding; Proteins; Recall; Reproducibility of Results; Residues; ROC Curve; Solvents; Zinc; Zinc - chemistry; Zinc - metabolism |
title | An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T20%3A40%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20integrative%20computational%20framework%20based%20on%20a%20two-step%20random%20forest%20algorithm%20improves%20prediction%20of%20zinc-binding%20sites%20in%20proteins&rft.jtitle=PloS%20one&rft.au=Zheng,%20Cheng&rft.date=2012-11-14&rft.volume=7&rft.issue=11&rft.spage=e49716&rft.epage=e49716&rft.pages=e49716-e49716&rft.issn=1932-6203&rft.eissn=1932-6203&rft_id=info:doi/10.1371/journal.pone.0049716&rft_dat=%3Cgale_plos_%3EA477090549%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1339168596&rft_id=info:pmid/23166753&rft_galeid=A477090549&rft_doaj_id=oai_doaj_org_article_ebd6b62b0bfa4adb82b65bc7cb2c7c2c&rfr_iscdi=true |
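
The description above reports performance as recall at a fixed precision of 75%. As a rough illustration of how such a figure can be computed from ranked prediction scores, here is a minimal Python sketch. It is not taken from the authors' scripts: the function name `recall_at_precision` and the toy scores and labels are assumptions made up for this example.

```python
from typing import Sequence


def recall_at_precision(
    scores: Sequence[float],
    labels: Sequence[int],
    min_precision: float = 0.75,
) -> float:
    """Highest recall over all score thresholds whose precision is >= min_precision.

    Hypothetical helper for illustration only; labels are 1 (binding) or 0.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank residues from the most to the least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Lower the threshold to the next distinct score, taking ties together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision:
            best_recall = max(best_recall, recall)
        i = j
    return best_recall


# Toy data (invented): six candidate residues, four of them true binders.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 1]
print(recall_at_precision(toy_scores, toy_labels, 0.75))  # 0.75
```

On the toy data, the threshold that admits the top four candidates gives 3 true positives and 1 false positive (precision 0.75), recovering 3 of the 4 binders, so the recall at 75% precision is 0.75. Comparing methods at a shared precision level, as the description does, amounts to evaluating this quantity for each method's scores on the same dataset.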