Using machine learning tools for protein database biocuration assistance

Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challe...

Full description

Bibliographic details
Published in: Scientific reports 2018-07, Vol.8 (1), p.10148-10, Article 10148
Main authors: König, Caroline, Shaim, Ilmira, Vellido, Alfredo, Romero, Enrique, Alquézar, René, Giraldo, Jesús
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 10
container_issue 1
container_start_page 10148
container_title Scientific reports
container_volume 8
creator König, Caroline
Shaim, Ilmira
Vellido, Alfredo
Romero, Enrique
Alquézar, René
Giraldo, Jesús
description Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise, as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.
doi_str_mv 10.1038/s41598-018-28330-z
format Article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_6033909</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2064743165</sourcerecordid><originalsourceid>FETCH-LOGICAL-c574t-c0c62c73635885e7ec882d15bc32846e840c24c3cccc64ab0f6e8e802604ea733</originalsourceid><addsrcrecordid>eNp9UU1vEzEQtRAVrdr-gR7QSly4bPHX-uOChCqgSJW4tGfLO5mkrjZ2sL1I9NfjNGkoHBjJsj3z3huPHyEXjF4yKsyHItlgTU-Z6bkRgvaPr8gJp3LoueD89YvzMTkv5YG2GLiVzL4hx9xaralmJ-T6roS46tYe7kPEbkKf4zZRU5pKt0y52-RUMcRu4asffcFuDAnm7GtIsfOlhFJ9BDwjR0s_FTzf76fk7svn26vr_ub7129Xn256GLSsPVBQHLRQYjBmQI1gDF-wYQTBjVRoJAUuQUALJf1Ily2HhnJFJXotxCn5uNPdzOMaF4CxZj-5TQ5rn3-55IP7uxLDvVuln05RISy1TYDtBKDM4DICZvD1iXi4bBenmjshmbCmcd7vm-b0Y8ZS3ToUwGnyEdNcGlYpqa2kskHf_QN9SHOO7Uu2KKmlYGpoKL5_RE6lZFweBmDUbf11O39d89c9-eseG-nty9EPlGc3G0DsAKWV4grzn97_kf0NhKGxeg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2064743165</pqid></control><display><type>article</type><title>Using machine learning tools for protein database biocuration assistance</title><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Nature Free</source><source>Recercat</source><source>PubMed Central</source><source>Springer Nature OA/Free Journals</source><source>Free Full-Text Journals in Chemistry</source><creator>König, Caroline ; Shaim, Ilmira ; Vellido, Alfredo ; Romero, Enrique ; Alquézar, René ; Giraldo, Jesús</creator><creatorcontrib>König, Caroline ; Shaim, Ilmira ; Vellido, Alfredo ; Romero, Enrique ; Alquézar, René ; Giraldo, Jesús</creatorcontrib><description>Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. 
As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise , as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. 
These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.</description><identifier>ISSN: 2045-2322</identifier><identifier>EISSN: 2045-2322</identifier><identifier>DOI: 10.1038/s41598-018-28330-z</identifier><identifier>PMID: 29977071</identifier><language>eng</language><publisher>London: Nature Publishing Group UK</publisher><subject>631/114/129/2044 ; 631/114/2164 ; Aplicacions de la informàtica ; Aprenentatge automàtic ; Artificial intelligence ; Biocuration ; Bioinformàtica ; Biological knowledge dissemination ; Cell interactions ; Cell membranes ; Data mining ; G protein-coupled receptors ; G Protein-Coupled Receptors (GPCRs) ; Humanities and Social Sciences ; Information retrieval ; Informàtica ; Learning algorithms ; Machine learning ; Mineria de dades ; multidisciplinary ; Omics sciences ; Pharmacology ; Proteins ; Proteomics ; Proteòmica ; Recuperació de la informació ; Science ; Science (multidisciplinary) ; Sistemes d'informació ; Àrees temàtiques de la UPC</subject><ispartof>Scientific reports, 2018-07, Vol.8 (1), p.10148-10, Article 10148</ispartof><rights>The Author(s) 2018</rights><rights>Copyright Nature Publishing Group Jul 2018</rights><rights>Attribution 3.0 Spain info:eu-repo/semantics/openAccess &lt;a href="http://creativecommons.org/licenses/by/3.0/es/"&gt;http://creativecommons.org/licenses/by/3.0/es/&lt;/a&gt;</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c574t-c0c62c73635885e7ec882d15bc32846e840c24c3cccc64ab0f6e8e802604ea733</citedby><cites>FETCH-LOGICAL-c574t-c0c62c73635885e7ec882d15bc32846e840c24c3cccc64ab0f6e8e802604ea733</cites><orcidid>0000-0002-6420-0517 ; 
0000-0002-9843-1911</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC6033909/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC6033909/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,315,728,781,785,865,886,26979,27929,27930,41125,42194,51581,53796,53798</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/29977071$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>König, Caroline</creatorcontrib><creatorcontrib>Shaim, Ilmira</creatorcontrib><creatorcontrib>Vellido, Alfredo</creatorcontrib><creatorcontrib>Romero, Enrique</creatorcontrib><creatorcontrib>Alquézar, René</creatorcontrib><creatorcontrib>Giraldo, Jesús</creatorcontrib><title>Using machine learning tools for protein database biocuration assistance</title><title>Scientific reports</title><addtitle>Sci Rep</addtitle><addtitle>Sci Rep</addtitle><description>Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. 
Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise , as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.</description><subject>631/114/129/2044</subject><subject>631/114/2164</subject><subject>Aplicacions de la informàtica</subject><subject>Aprenentatge automàtic</subject><subject>Artificial intelligence</subject><subject>Biocuration</subject><subject>Bioinformàtica</subject><subject>Biological knowledge dissemination</subject><subject>Cell interactions</subject><subject>Cell membranes</subject><subject>Data mining</subject><subject>G protein-coupled receptors</subject><subject>G Protein-Coupled Receptors (GPCRs)</subject><subject>Humanities and Social Sciences</subject><subject>Information retrieval</subject><subject>Informàtica</subject><subject>Learning algorithms</subject><subject>Machine learning</subject><subject>Mineria de dades</subject><subject>multidisciplinary</subject><subject>Omics sciences</subject><subject>Pharmacology</subject><subject>Proteins</subject><subject>Proteomics</subject><subject>Proteòmica</subject><subject>Recuperació de la informació</subject><subject>Science</subject><subject>Science (multidisciplinary)</subject><subject>Sistemes d'informació</subject><subject>Àrees temàtiques de la 
UPC</subject><issn>2045-2322</issn><issn>2045-2322</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>C6C</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><sourceid>XX2</sourceid><recordid>eNp9UU1vEzEQtRAVrdr-gR7QSly4bPHX-uOChCqgSJW4tGfLO5mkrjZ2sL1I9NfjNGkoHBjJsj3z3huPHyEXjF4yKsyHItlgTU-Z6bkRgvaPr8gJp3LoueD89YvzMTkv5YG2GLiVzL4hx9xaralmJ-T6roS46tYe7kPEbkKf4zZRU5pKt0y52-RUMcRu4asffcFuDAnm7GtIsfOlhFJ9BDwjR0s_FTzf76fk7svn26vr_ub7129Xn256GLSsPVBQHLRQYjBmQI1gDF-wYQTBjVRoJAUuQUALJf1Ily2HhnJFJXotxCn5uNPdzOMaF4CxZj-5TQ5rn3-55IP7uxLDvVuln05RISy1TYDtBKDM4DICZvD1iXi4bBenmjshmbCmcd7vm-b0Y8ZS3ToUwGnyEdNcGlYpqa2kskHf_QN9SHOO7Uu2KKmlYGpoKL5_RE6lZFweBmDUbf11O39d89c9-eseG-nty9EPlGc3G0DsAKWV4grzn97_kf0NhKGxeg</recordid><startdate>20180705</startdate><enddate>20180705</enddate><creator>König, Caroline</creator><creator>Shaim, Ilmira</creator><creator>Vellido, Alfredo</creator><creator>Romero, Enrique</creator><creator>Alquézar, René</creator><creator>Giraldo, Jesús</creator><general>Nature Publishing Group UK</general><general>Nature Publishing 
Group</general><general>Nature</general><scope>C6C</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7X7</scope><scope>7XB</scope><scope>88A</scope><scope>88E</scope><scope>88I</scope><scope>8FE</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>LK8</scope><scope>M0S</scope><scope>M1P</scope><scope>M2P</scope><scope>M7P</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope><scope>7X8</scope><scope>XX2</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-6420-0517</orcidid><orcidid>https://orcid.org/0000-0002-9843-1911</orcidid></search><sort><creationdate>20180705</creationdate><title>Using machine learning tools for protein database biocuration assistance</title><author>König, Caroline ; Shaim, Ilmira ; Vellido, Alfredo ; Romero, Enrique ; Alquézar, René ; Giraldo, Jesús</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c574t-c0c62c73635885e7ec882d15bc32846e840c24c3cccc64ab0f6e8e802604ea733</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>631/114/129/2044</topic><topic>631/114/2164</topic><topic>Aplicacions de la informàtica</topic><topic>Aprenentatge automàtic</topic><topic>Artificial intelligence</topic><topic>Biocuration</topic><topic>Bioinformàtica</topic><topic>Biological knowledge dissemination</topic><topic>Cell interactions</topic><topic>Cell membranes</topic><topic>Data mining</topic><topic>G protein-coupled receptors</topic><topic>G Protein-Coupled Receptors (GPCRs)</topic><topic>Humanities and Social Sciences</topic><topic>Information 
retrieval</topic><topic>Informàtica</topic><topic>Learning algorithms</topic><topic>Machine learning</topic><topic>Mineria de dades</topic><topic>multidisciplinary</topic><topic>Omics sciences</topic><topic>Pharmacology</topic><topic>Proteins</topic><topic>Proteomics</topic><topic>Proteòmica</topic><topic>Recuperació de la informació</topic><topic>Science</topic><topic>Science (multidisciplinary)</topic><topic>Sistemes d'informació</topic><topic>Àrees temàtiques de la UPC</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>König, Caroline</creatorcontrib><creatorcontrib>Shaim, Ilmira</creatorcontrib><creatorcontrib>Vellido, Alfredo</creatorcontrib><creatorcontrib>Romero, Enrique</creatorcontrib><creatorcontrib>Alquézar, René</creatorcontrib><creatorcontrib>Giraldo, Jesús</creatorcontrib><collection>Springer Nature OA/Free Journals</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Biology Database (Alumni Edition)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Science Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community 
College</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Science Database</collection><collection>Biological Science Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central Basic</collection><collection>MEDLINE - Academic</collection><collection>Recercat</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Scientific reports</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>König, Caroline</au><au>Shaim, Ilmira</au><au>Vellido, Alfredo</au><au>Romero, Enrique</au><au>Alquézar, René</au><au>Giraldo, Jesús</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Using machine learning tools for protein database biocuration assistance</atitle><jtitle>Scientific reports</jtitle><stitle>Sci Rep</stitle><addtitle>Sci Rep</addtitle><date>2018-07-05</date><risdate>2018</risdate><volume>8</volume><issue>1</issue><spage>10148</spage><epage>10</epage><pages>10148-10</pages><artnum>10148</artnum><issn>2045-2322</issn><eissn>2045-2322</eissn><abstract>Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. 
As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise , as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.</abstract><cop>London</cop><pub>Nature Publishing Group UK</pub><pmid>29977071</pmid><doi>10.1038/s41598-018-28330-z</doi><tpages>10</tpages><orcidid>https://orcid.org/0000-0002-6420-0517</orcidid><orcidid>https://orcid.org/0000-0002-9843-1911</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2045-2322
ispartof Scientific reports, 2018-07, Vol.8 (1), p.10148-10, Article 10148
issn 2045-2322
2045-2322
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_6033909
source DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Nature Free; Recercat; PubMed Central; Springer Nature OA/Free Journals; Free Full-Text Journals in Chemistry
subjects 631/114/129/2044
631/114/2164
Aplicacions de la informàtica
Aprenentatge automàtic
Artificial intelligence
Biocuration
Bioinformàtica
Biological knowledge dissemination
Cell interactions
Cell membranes
Data mining
G protein-coupled receptors
G Protein-Coupled Receptors (GPCRs)
Humanities and Social Sciences
Information retrieval
Informàtica
Learning algorithms
Machine learning
Mineria de dades
multidisciplinary
Omics sciences
Pharmacology
Proteins
Proteomics
Proteòmica
Recuperació de la informació
Science
Science (multidisciplinary)
Sistemes d'informació
Àrees temàtiques de la UPC
title Using machine learning tools for protein database biocuration assistance
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-15T19%3A31%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Using%20machine%20learning%20tools%20for%20protein%20database%20biocuration%20assistance&rft.jtitle=Scientific%20reports&rft.au=K%C3%B6nig,%20Caroline&rft.date=2018-07-05&rft.volume=8&rft.issue=1&rft.spage=10148&rft.epage=10&rft.pages=10148-10&rft.artnum=10148&rft.issn=2045-2322&rft.eissn=2045-2322&rft_id=info:doi/10.1038/s41598-018-28330-z&rft_dat=%3Cproquest_pubme%3E2064743165%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2064743165&rft_id=info:pmid/29977071&rfr_iscdi=true