Simple topological properties predict functional misannotations in a metabolic network

Misannotation in sequence databases is an important obstacle for automated tools for gene function annotation, which rely extensively on comparison with sequences with known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are, therefore,...

Bibliographic details
Published in: Bioinformatics 2013-07, Vol.29 (13), p.i154-i161
Main authors: Liberal, Rodrigo, Pinney, John W
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page i161
container_issue 13
container_start_page i154
container_title Bioinformatics
container_volume 29
creator Liberal, Rodrigo
Pinney, John W
description Misannotation in sequence databases is an important obstacle for automated tools for gene function annotation, which rely extensively on comparison with sequences with known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are, therefore, needed to assist in the identification of misannotated gene products. In the case of enzymatic functions, each functional assignment implies the existence of a reaction within the organism's metabolic network; a first approximation to a genome-scale metabolic model can be obtained directly from an automated genome annotation. Any obvious problems in the network, such as dead end or disconnected reactions, can, therefore, be strong indications of misannotation. We demonstrate that a machine-learning approach using only network topological features can successfully predict the validity of enzyme annotations. The predictions are tested at three different levels. A random forest using topological features of the metabolic network and trained on curated sets of correct and incorrect enzyme assignments was found to have an accuracy of up to 86% in 5-fold cross-validation experiments. Further cross-validation against unseen enzyme superfamilies indicates that this classifier can successfully extrapolate beyond the classes of enzyme present in the training data. The random forest model was applied to several automated genome annotations, achieving an accuracy of ~60% in most cases when validated against recent genome-scale metabolic models. We also observe that when applied to draft metabolic networks for multiple species, a clear negative correlation is observed between predicted annotation quality and phylogenetic distance to the major model organism for biochemistry (Escherichia coli for prokaryotes and Homo sapiens for eukaryotes). Supplementary data are available at Bioinformatics online.
doi_str_mv 10.1093/bioinformatics/btt236
format Article
fullrecord PMID: 23812979; publisher: Oxford University Press (England); rights: The Author 2013. Published by Oxford University Press; peer reviewed; open access (free to read); creation date: 2013-07-01; PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694667/pdf/; HTML: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694667/; PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23812979
fulltext fulltext
identifier ISSN: 1367-4803
ispartof Bioinformatics, 2013-07, Vol.29 (13), p.i154-i161
issn 1367-4803
1367-4811
1460-2059
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_3694667
source Oxford Journals Open Access Collection; MEDLINE; EZB-FREE-00999 freely available EZB journals; PubMed Central; Alma/SFX Local Collection
subjects Annotations
Artificial Intelligence
Automated
Bioinformatics
Enzymes
Enzymes - classification
Escherichia coli
Genome
Genomes
Humans
Mathematical models
Metabolic Networks and Pathways
Molecular Sequence Annotation
Networks
Phylogeny
Plasmodium falciparum - enzymology
Topology
title Simple topological properties predict functional misannotations in a metabolic network
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T23%3A36%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Simple%20topological%20properties%20predict%20functional%20misannotations%20in%20a%20metabolic%20network&rft.jtitle=Bioinformatics&rft.au=Liberal,%20Rodrigo&rft.date=2013-07-01&rft.volume=29&rft.issue=13&rft.spage=i154&rft.epage=i161&rft.pages=i154-i161&rft.issn=1367-4803&rft.eissn=1367-4811&rft_id=info:doi/10.1093/bioinformatics/btt236&rft_dat=%3Cproquest_pubme%3E1373433274%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1373433274&rft_id=info:pmid/23812979&rfr_iscdi=true