Using machine learning tools for protein database biocuration assistance
Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challe...
Published in: | Scientific reports 2018-07, Vol.8 (1), p.10148-10, Article 10148 |
---|---|
Main authors: | König, Caroline; Shaim, Ilmira; Vellido, Alfredo; Romero, Enrique; Alquézar, René; Giraldo, Jesús |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 10 |
---|---|
container_issue | 1 |
container_start_page | 10148 |
container_title | Scientific reports |
container_volume | 8 |
creator | König, Caroline ; Shaim, Ilmira ; Vellido, Alfredo ; Romero, Enrique ; Alquézar, René ; Giraldo, Jesús |
description | Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise, as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration. |
doi_str_mv | 10.1038/s41598-018-28330-z |
format | Article |
date | 2018-07-05 |
eissn | 2045-2322 |
pmid | 29977071 |
publisher | London: Nature Publishing Group UK |
stitle | Sci Rep |
tpages | 10 |
rights | The Author(s) 2018 ; Copyright Nature Publishing Group Jul 2018 ; Attribution 3.0 Spain info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by/3.0/es/ |
lds50 | peer_reviewed |
oa | free_for_read |
orcidid | https://orcid.org/0000-0002-6420-0517 ; https://orcid.org/0000-0002-9843-1911 |
linktohtml | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6033909/ |
linktopdf | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6033909/pdf/ |
backlink | https://www.ncbi.nlm.nih.gov/pubmed/29977071 (View this record in MEDLINE/PubMed) |
collection | Springer Nature OA/Free Journals ; PubMed ; CrossRef ; ProQuest Central (Corporate) ; Health & Medical Collection ; ProQuest Central (purchase pre-March 2016) ; Biology Database (Alumni Edition) ; Medical Database (Alumni Edition) ; Science Database (Alumni Edition) ; ProQuest SciTech Collection ; ProQuest Natural Science Collection ; Hospital Premium Collection ; Hospital Premium Collection (Alumni Edition) ; ProQuest Central (Alumni) (purchase pre-March 2016) ; ProQuest Central (Alumni Edition) ; ProQuest Central UK/Ireland ; ProQuest Central Essentials ; Biological Science Collection ; ProQuest Central ; Natural Science Collection ; ProQuest One Community College ; ProQuest Central Korea ; Health Research Premium Collection ; Health Research Premium Collection (Alumni) ; ProQuest Central Student ; SciTech Premium Collection ; ProQuest Health & Medical Complete (Alumni) ; ProQuest Biological Science Collection ; Health & Medical Collection (Alumni Edition) ; Medical Database ; Science Database ; Biological Science Database ; Access via ProQuest (Open Access) ; ProQuest One Academic Eastern Edition (DO NOT USE) ; ProQuest One Academic ; ProQuest One Academic UKI Edition ; ProQuest Central Basic ; MEDLINE - Academic ; Recercat ; PubMed Central (Full Participant titles) |
fulltext | fulltext |
identifier | ISSN: 2045-2322 |
ispartof | Scientific reports, 2018-07, Vol.8 (1), p.10148-10, Article 10148 |
issn | 2045-2322 2045-2322 |
language | eng |
recordid | cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_6033909 |
source | DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Nature Free; Recercat; PubMed Central; Springer Nature OA/Free Journals; Free Full-Text Journals in Chemistry |
subjects | 631/114/129/2044 ; 631/114/2164 ; Aplicacions de la informàtica ; Aprenentatge automàtic ; Artificial intelligence ; Biocuration ; Bioinformàtica ; Biological knowledge dissemination ; Cell interactions ; Cell membranes ; Data mining ; G protein-coupled receptors ; G Protein-Coupled Receptors (GPCRs) ; Humanities and Social Sciences ; Information retrieval ; Informàtica ; Learning algorithms ; Machine learning ; Mineria de dades ; multidisciplinary ; Omics sciences ; Pharmacology ; Proteins ; Proteomics ; Proteòmica ; Recuperació de la informació ; Science ; Science (multidisciplinary) ; Sistemes d'informació ; Àrees temàtiques de la UPC |
title | Using machine learning tools for protein database biocuration assistance |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-15T19%3A31%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Using%20machine%20learning%20tools%20for%20protein%20database%20biocuration%20assistance&rft.jtitle=Scientific%20reports&rft.au=K%C3%B6nig,%20Caroline&rft.date=2018-07-05&rft.volume=8&rft.issue=1&rft.spage=10148&rft.epage=10&rft.pages=10148-10&rft.artnum=10148&rft.issn=2045-2322&rft.eissn=2045-2322&rft_id=info:doi/10.1038/s41598-018-28330-z&rft_dat=%3Cproquest_pubme%3E2064743165%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2064743165&rft_id=info:pmid/29977071&rfr_iscdi=true |
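
The description above outlines the core idea: a sequence whose subtype label is misclassified too consistently by machine learning models is a candidate for label noise. The following is only a minimal sketch of that idea, not the authors' pipeline. The amino-acid composition features, the nearest-centroid classifier, the function names, the `threshold` value, and the toy sequences are all assumptions chosen to keep the example self-contained; the record does not state which models or sequence transformations the study used.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Alignment-free transformation (an assumed stand-in): amino-acid frequencies."""
    counts = Counter(seq)
    n = max(len(seq), 1)
    return [counts.get(a, 0) / n for a in AMINO_ACIDS]


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest in squared Euclidean distance."""
    sums = {}
    sizes = Counter(train_y)
    for vec, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
    best_label, best_dist = None, float("inf")
    for label in sorted(sums):
        centroid = [s / sizes[label] for s in sums[label]]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def flag_suspect_labels(seqs, labels, repeats=20, folds=5, threshold=0.8, seed=0):
    """Return indices misclassified in at least `threshold` of the repeated CV runs."""
    rng = random.Random(seed)
    X = [composition(s) for s in seqs]
    errors = [0] * len(seqs)
    for _ in range(repeats):
        order = list(range(len(seqs)))
        rng.shuffle(order)
        for f in range(folds):
            test_idx = order[f::folds]  # each index is held out once per repeat
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            train_X = [X[i] for i in train_idx]
            train_y = [labels[i] for i in train_idx]
            for i in test_idx:
                if nearest_centroid_predict(train_X, train_y, X[i]) != labels[i]:
                    errors[i] += 1
    return [i for i, e in enumerate(errors) if e / repeats >= threshold]


# Toy data invented for illustration: the last sequence resembles class "B"
# but carries the label "A", imitating a label-noise case.
toy_seqs = [
    "AAALLAAL", "ALALALAA", "LLAALLAA", "AALLALAL", "ALLAALLA",
    "KKEEKKEK", "KEKEKEKK", "EEKKEEKK", "KKEEKEKE", "EKKEEKKE",
    "KEKKEEKE",
]
toy_labels = ["A"] * 5 + ["B"] * 5 + ["A"]
suspects = flag_suspect_labels(toy_seqs, toy_labels)
print(suspects)  # indices a curator might review by hand
```

Counting errors over repeated cross-validation rather than a single split is what makes "too consistently misclassified" measurable: an index is flagged only if it is wrong in most of the runs. In practice the flagged entries would be handed to a curator for review, not relabeled automatically.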