Using machine learning tools for protein database biocuration assistance

Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challe...

Bibliographic details
Published in:	Scientific reports 2018-07, Vol.8 (1), p.10148-10, Article 10148
Main authors:	König, Caroline, Shaim, Ilmira, Vellido, Alfredo, Romero, Enrique, Alquézar, René, Giraldo, Jesús
Format:	Article
Language:	eng
Subjects:	631/114/129/2044 631/114/2164 Aplicacions de la informàtica Aprenentatge automàtic Artificial intelligence Biocuration Bioinformàtica Biological knowledge dissemination Cell interactions Cell membranes Data mining G protein-coupled receptors G Protein-Coupled Receptors (GPCRs) Humanities and Social Sciences Information retrieval Informàtica Learning algorithms Machine learning Mineria de dades multidisciplinary Omics sciences Pharmacology Proteins Proteomics Proteòmica Recuperació de la informació Science Science (multidisciplinary) Sistemes d'informació Àrees temàtiques de la UPC
Online access:	Full text

container_end_page	10
container_issue	1
container_start_page	10148
container_title	Scientific reports
container_volume	8
creator	König, Caroline Shaim, Ilmira Vellido, Alfredo Romero, Enrique Alquézar, René Giraldo, Jesús
description	Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise, as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.
doi_str_mv	10.1038/s41598-018-28330-z
format	Article
fullrecord	<record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_6033909</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2064743165</sourcerecordid><originalsourceid>FETCH-LOGICAL-c574t-c0c62c73635885e7ec882d15bc32846e840c24c3cccc64ab0f6e8e802604ea733</originalsourceid><addsrcrecordid>eNp9UU1vEzEQtRAVrdr-gR7QSly4bPHX-uOChCqgSJW4tGfLO5mkrjZ2sL1I9NfjNGkoHBjJsj3z3huPHyEXjF4yKsyHItlgTU-Z6bkRgvaPr8gJp3LoueD89YvzMTkv5YG2GLiVzL4hx9xaralmJ-T6roS46tYe7kPEbkKf4zZRU5pKt0y52-RUMcRu4asffcFuDAnm7GtIsfOlhFJ9BDwjR0s_FTzf76fk7svn26vr_ub7129Xn256GLSsPVBQHLRQYjBmQI1gDF-wYQTBjVRoJAUuQUALJf1Ily2HhnJFJXotxCn5uNPdzOMaF4CxZj-5TQ5rn3-55IP7uxLDvVuln05RISy1TYDtBKDM4DICZvD1iXi4bBenmjshmbCmcd7vm-b0Y8ZS3ToUwGnyEdNcGlYpqa2kskHf_QN9SHOO7Uu2KKmlYGpoKL5_RE6lZFweBmDUbf11O39d89c9-eseG-nty9EPlGc3G0DsAKWV4grzn97_kf0NhKGxeg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2064743165</pqid></control><display><type>article</type><title>Using machine learning tools for protein database biocuration assistance</title><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>Nature Free</source><source>Recercat</source><source>PubMed Central</source><source>Springer Nature OA/Free Journals</source><source>Free Full-Text Journals in Chemistry</source><creator>König, Caroline ; Shaim, Ilmira ; Vellido, Alfredo ; Romero, Enrique ; Alquézar, René ; Giraldo, Jesús</creator><creatorcontrib>König, Caroline ; Shaim, Ilmira ; Vellido, Alfredo ; Romero, Enrique ; Alquézar, René ; Giraldo, Jesús</creatorcontrib><description>Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. 
As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise , as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. 
These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.</description><identifier>ISSN: 2045-2322</identifier><identifier>EISSN: 2045-2322</identifier><identifier>DOI: 10.1038/s41598-018-28330-z</identifier><identifier>PMID: 29977071</identifier><language>eng</language><publisher>London: Nature Publishing Group UK</publisher><subject>631/114/129/2044 ; 631/114/2164 ; Aplicacions de la informàtica ; Aprenentatge automàtic ; Artificial intelligence ; Biocuration ; Bioinformàtica ; Biological knowledge dissemination ; Cell interactions ; Cell membranes ; Data mining ; G protein-coupled receptors ; G Protein-Coupled Receptors (GPCRs) ; Humanities and Social Sciences ; Information retrieval ; Informàtica ; Learning algorithms ; Machine learning ; Mineria de dades ; multidisciplinary ; Omics sciences ; Pharmacology ; Proteins ; Proteomics ; Proteòmica ; Recuperació de la informació ; Science ; Science (multidisciplinary) ; Sistemes d'informació ; Àrees temàtiques de la UPC</subject><ispartof>Scientific reports, 2018-07, Vol.8 (1), p.10148-10, Article 10148</ispartof><rights>The Author(s) 2018</rights><rights>Copyright Nature Publishing Group Jul 2018</rights><rights>Attribution 3.0 Spain info:eu-repo/semantics/openAccess <a href="http://creativecommons.org/licenses/by/3.0/es/">http://creativecommons.org/licenses/by/3.0/es/</a></rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c574t-c0c62c73635885e7ec882d15bc32846e840c24c3cccc64ab0f6e8e802604ea733</citedby><cites>FETCH-LOGICAL-c574t-c0c62c73635885e7ec882d15bc32846e840c24c3cccc64ab0f6e8e802604ea733</cites><orcidid>0000-0002-6420-0517 ; 
0000-0002-9843-1911</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC6033909/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC6033909/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,315,728,781,785,865,886,26979,27929,27930,41125,42194,51581,53796,53798</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/29977071$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>König, Caroline</creatorcontrib><creatorcontrib>Shaim, Ilmira</creatorcontrib><creatorcontrib>Vellido, Alfredo</creatorcontrib><creatorcontrib>Romero, Enrique</creatorcontrib><creatorcontrib>Alquézar, René</creatorcontrib><creatorcontrib>Giraldo, Jesús</creatorcontrib><title>Using machine learning tools for protein database biocuration assistance</title><title>Scientific reports</title><addtitle>Sci Rep</addtitle><addtitle>Sci Rep</addtitle><description>Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. 
Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise , as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.</description><subject>631/114/129/2044</subject><subject>631/114/2164</subject><subject>Aplicacions de la informàtica</subject><subject>Aprenentatge automàtic</subject><subject>Artificial intelligence</subject><subject>Biocuration</subject><subject>Bioinformàtica</subject><subject>Biological knowledge dissemination</subject><subject>Cell interactions</subject><subject>Cell membranes</subject><subject>Data mining</subject><subject>G protein-coupled receptors</subject><subject>G Protein-Coupled Receptors (GPCRs)</subject><subject>Humanities and Social Sciences</subject><subject>Information retrieval</subject><subject>Informàtica</subject><subject>Learning algorithms</subject><subject>Machine learning</subject><subject>Mineria de dades</subject><subject>multidisciplinary</subject><subject>Omics sciences</subject><subject>Pharmacology</subject><subject>Proteins</subject><subject>Proteomics</subject><subject>Proteòmica</subject><subject>Recuperació de la informació</subject><subject>Science</subject><subject>Science (multidisciplinary)</subject><subject>Sistemes d'informació</subject><subject>Àrees temàtiques de la 
UPC</subject><issn>2045-2322</issn><issn>2045-2322</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>C6C</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><sourceid>XX2</sourceid><recordid>eNp9UU1vEzEQtRAVrdr-gR7QSly4bPHX-uOChCqgSJW4tGfLO5mkrjZ2sL1I9NfjNGkoHBjJsj3z3huPHyEXjF4yKsyHItlgTU-Z6bkRgvaPr8gJp3LoueD89YvzMTkv5YG2GLiVzL4hx9xaralmJ-T6roS46tYe7kPEbkKf4zZRU5pKt0y52-RUMcRu4asffcFuDAnm7GtIsfOlhFJ9BDwjR0s_FTzf76fk7svn26vr_ub7129Xn256GLSsPVBQHLRQYjBmQI1gDF-wYQTBjVRoJAUuQUALJf1Ily2HhnJFJXotxCn5uNPdzOMaF4CxZj-5TQ5rn3-55IP7uxLDvVuln05RISy1TYDtBKDM4DICZvD1iXi4bBenmjshmbCmcd7vm-b0Y8ZS3ToUwGnyEdNcGlYpqa2kskHf_QN9SHOO7Uu2KKmlYGpoKL5_RE6lZFweBmDUbf11O39d89c9-eseG-nty9EPlGc3G0DsAKWV4grzn97_kf0NhKGxeg</recordid><startdate>20180705</startdate><enddate>20180705</enddate><creator>König, Caroline</creator><creator>Shaim, Ilmira</creator><creator>Vellido, Alfredo</creator><creator>Romero, Enrique</creator><creator>Alquézar, René</creator><creator>Giraldo, Jesús</creator><general>Nature Publishing Group UK</general><general>Nature Publishing 
Group</general><general>Nature</general><scope>C6C</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7X7</scope><scope>7XB</scope><scope>88A</scope><scope>88E</scope><scope>88I</scope><scope>8FE</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>LK8</scope><scope>M0S</scope><scope>M1P</scope><scope>M2P</scope><scope>M7P</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope><scope>7X8</scope><scope>XX2</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-6420-0517</orcidid><orcidid>https://orcid.org/0000-0002-9843-1911</orcidid></search><sort><creationdate>20180705</creationdate><title>Using machine learning tools for protein database biocuration assistance</title><author>König, Caroline ; Shaim, Ilmira ; Vellido, Alfredo ; Romero, Enrique ; Alquézar, René ; Giraldo, Jesús</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c574t-c0c62c73635885e7ec882d15bc32846e840c24c3cccc64ab0f6e8e802604ea733</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>631/114/129/2044</topic><topic>631/114/2164</topic><topic>Aplicacions de la informàtica</topic><topic>Aprenentatge automàtic</topic><topic>Artificial intelligence</topic><topic>Biocuration</topic><topic>Bioinformàtica</topic><topic>Biological knowledge dissemination</topic><topic>Cell interactions</topic><topic>Cell membranes</topic><topic>Data mining</topic><topic>G protein-coupled receptors</topic><topic>G Protein-Coupled Receptors (GPCRs)</topic><topic>Humanities and Social Sciences</topic><topic>Information 
retrieval</topic><topic>Informàtica</topic><topic>Learning algorithms</topic><topic>Machine learning</topic><topic>Mineria de dades</topic><topic>multidisciplinary</topic><topic>Omics sciences</topic><topic>Pharmacology</topic><topic>Proteins</topic><topic>Proteomics</topic><topic>Proteòmica</topic><topic>Recuperació de la informació</topic><topic>Science</topic><topic>Science (multidisciplinary)</topic><topic>Sistemes d'informació</topic><topic>Àrees temàtiques de la UPC</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>König, Caroline</creatorcontrib><creatorcontrib>Shaim, Ilmira</creatorcontrib><creatorcontrib>Vellido, Alfredo</creatorcontrib><creatorcontrib>Romero, Enrique</creatorcontrib><creatorcontrib>Alquézar, René</creatorcontrib><creatorcontrib>Giraldo, Jesús</creatorcontrib><collection>Springer Nature OA/Free Journals</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Health & Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Biology Database (Alumni Edition)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Science Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest 
Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Health & Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Science Database</collection><collection>Biological Science Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central Basic</collection><collection>MEDLINE - Academic</collection><collection>Recercat</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Scientific reports</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>König, Caroline</au><au>Shaim, Ilmira</au><au>Vellido, Alfredo</au><au>Romero, Enrique</au><au>Alquézar, René</au><au>Giraldo, Jesús</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Using machine learning tools for protein database biocuration assistance</atitle><jtitle>Scientific reports</jtitle><stitle>Sci Rep</stitle><addtitle>Sci Rep</addtitle><date>2018-07-05</date><risdate>2018</risdate><volume>8</volume><issue>1</issue><spage>10148</spage><epage>10</epage><pages>10148-10</pages><artnum>10148</artnum><issn>2045-2322</issn><eissn>2045-2322</eissn><abstract>Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. 
As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise , as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.</abstract><cop>London</cop><pub>Nature Publishing Group UK</pub><pmid>29977071</pmid><doi>10.1038/s41598-018-28330-z</doi><tpages>10</tpages><orcidid>https://orcid.org/0000-0002-6420-0517</orcidid><orcidid>https://orcid.org/0000-0002-9843-1911</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 2045-2322
ispartof	Scientific reports, 2018-07, Vol.8 (1), p.10148-10, Article 10148
issn	2045-2322 2045-2322
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_6033909
source	DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; Nature Free; Recercat; PubMed Central; Springer Nature OA/Free Journals; Free Full-Text Journals in Chemistry
subjects	631/114/129/2044 631/114/2164 Aplicacions de la informàtica Aprenentatge automàtic Artificial intelligence Biocuration Bioinformàtica Biological knowledge dissemination Cell interactions Cell membranes Data mining G protein-coupled receptors G Protein-Coupled Receptors (GPCRs) Humanities and Social Sciences Information retrieval Informàtica Learning algorithms Machine learning Mineria de dades multidisciplinary Omics sciences Pharmacology Proteins Proteomics Proteòmica Recuperació de la informació Science Science (multidisciplinary) Sistemes d'informació Àrees temàtiques de la UPC
title	Using machine learning tools for protein database biocuration assistance
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-15T19%3A31%3A38IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Using%20machine%20learning%20tools%20for%20protein%20database%20biocuration%20assistance&rft.jtitle=Scientific%20reports&rft.au=K%C3%B6nig,%20Caroline&rft.date=2018-07-05&rft.volume=8&rft.issue=1&rft.spage=10148&rft.epage=10&rft.pages=10148-10&rft.artnum=10148&rft.issn=2045-2322&rft.eissn=2045-2322&rft_id=info:doi/10.1038/s41598-018-28330-z&rft_dat=%3Cproquest_pubme%3E2064743165%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2064743165&rft_id=info:pmid/29977071&rfr_iscdi=true