BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale

A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery....


Bibliographic details
Published in:	PLoS computational biology 2020-04, Vol.16 (4), p.e1007617
Main authors:	Chen, Qingyu; Lee, Kyubum; Yan, Shankai; Kim, Sun; Wei, Chih-Hsuan; Lu, Zhiyong
Format:	Article
Language:	eng
Subjects:	Bioinformatics; Biology and Life Sciences; Biotechnology; Computational biology; Computer and Information Sciences; Concept mapping; Datasets; Drug interaction; Drug interactions; Electronic health records; Evaluation; Heart failure; Influence; Kinases; Knowledge; Learning algorithms; Machine learning; Medical literature; Medicine; Medicine and Health Sciences; Methods; Mutation; National libraries; OLE (Standard); Performance enhancement; Principal components analysis; Protein interaction; Proteins; Recognition; Semantics; Social Sciences; Software; Studies
Online access:	Full text

container_end_page
container_issue	4
container_start_page	e1007617
container_title	PLoS computational biology
container_volume	16
creator	Chen, Qingyu; Lee, Kyubum; Yan, Shankai; Kim, Sun; Wei, Chih-Hsuan; Lu, Zhiyong
description	A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. Capturing the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings, which involve the learning of vector representations of concepts using machine learning models, have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is one of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which, to the best of our knowledge, is by far the most comprehensive evaluation. The intrinsic evaluation results demonstrate that BioConceptVec consistently outperforms existing concept embeddings by a large margin in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.
doi_str_mv	10.1371/journal.pcbi.1007617
format	Article
fullrecord	<record><control><sourceid>gale_plos_</sourceid><recordid>TN_cdi_plos_journals_2403774308</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A632940667</galeid><doaj_id>oai_doaj_org_article_672492f8e3f84727b53d4b19eac823a1</doaj_id><sourcerecordid>A632940667</sourcerecordid><originalsourceid>FETCH-LOGICAL-c622t-b2961ad5e5213952fc4e99995438ad29a9048da2c09e7a365989e0349e17393a3</originalsourceid><addsrcrecordid>eNqVkk1v1DAQhiMEoqXwDxBE4sQhi-1x4pgDUlnxsVIFEl9Xa-JMglfZeGsnFfx7XDatugcO-OKx_czr8evJsqecrTgo_mrr5zDisNrbxq04Y6ri6l52yssSCgVlff9OfJI9inHLWAp19TA7AQFCKuCnWffW-bUfLe2nH2Rf5-tAOLmxz3Fsc7rCYT4sBzdRwGkOVDQYqc0b53fUOotDbg_5Oe0aattEx9yPOeYDhp7ymBB6nD3ocIj0ZJnPsu_v331bfywuPn_YrM8vClsJMRWN0BXHtqRScNCl6KwknUYpocZWaNRM1i0KyzQphKrUtSYGUhNXoAHhLHt-0N0PPprFomiEZKCUBFYnYnMgWo9bsw9uh-G38ejM3w0feoNhcnYgUykhtehqgq6WSqimhFY2XBPaWgDypPVmuW1ukhmWxingcCR6fDK6n6b3V0YJUAxYEnixCAR_OVOc_lHyQvXJSePGzicxu3PRmvMKhJasqlSiXh5R6Vcm-jX1OMdoNl-__Af76ZiVB9YGH2Og7vZ9nJnrTryp2Vx3olk6MaU9u-vNbdJN68Ef1djZEw</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2403774308</pqid></control><display><type>article</type><title>BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale</title><source>DOAJ Directory of Open Access Journals</source><source>Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals</source><source>PubMed Central</source><source>Public Library of Science (PLoS)</source><creator>Chen, Qingyu ; Lee, Kyubum ; Yan, Shankai ; Kim, Sun ; Wei, Chih-Hsuan ; Lu, Zhiyong</creator><creatorcontrib>Chen, Qingyu ; Lee, Kyubum ; Yan, Shankai ; Kim, Sun ; Wei, Chih-Hsuan ; Lu, Zhiyong</creatorcontrib><description>A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. 
The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings-which involve the learning of vector representations of concepts using machine learning models-have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we respectively performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, by far, the most comprehensive to our best knowledge. The intrinsic evaluation results demonstrate that BioConceptVec consistently has, by a large margin, better performance than existing concept embeddings in identifying similar and related concepts. 
More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.</description><identifier>ISSN: 1553-7358</identifier><identifier>ISSN: 1553-734X</identifier><identifier>EISSN: 1553-7358</identifier><identifier>DOI: 10.1371/journal.pcbi.1007617</identifier><identifier>PMID: 32324731</identifier><language>eng</language><publisher>United States: Public Library of Science</publisher><subject>Bioinformatics ; Biology and Life Sciences ; Biotechnology ; Computational biology ; Computer and Information Sciences ; Concept mapping ; Datasets ; Drug interaction ; Drug interactions ; Electronic health records ; Evaluation ; Heart failure ; Influence ; Kinases ; Knowledge ; Learning algorithms ; Machine learning ; Medical literature ; Medicine ; Medicine and Health Sciences ; Methods ; Mutation ; National libraries ; OLE (Standard) ; Performance enhancement ; Principal components analysis ; Protein interaction ; Proteins ; Recognition ; Semantics ; Social Sciences ; Software ; Studies</subject><ispartof>PLoS computational biology, 2020-04, Vol.16 (4), p.e1007617</ispartof><rights>COPYRIGHT 2020 Public Library of Science</rights><rights>This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication: https://creativecommons.org/publicdomain/zero/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c622t-b2961ad5e5213952fc4e99995438ad29a9048da2c09e7a365989e0349e17393a3</citedby><cites>FETCH-LOGICAL-c622t-b2961ad5e5213952fc4e99995438ad29a9048da2c09e7a365989e0349e17393a3</cites><orcidid>0000-0002-6036-1516 ; 0000-0003-2015-3939 ; 0000-0003-0369-4979 ; 0000-0001-9998-916X</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7237030/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7237030/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,860,881,2096,2915,23845,27901,27902,53766,53768,79569,79570</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/32324731$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Chen, Qingyu</creatorcontrib><creatorcontrib>Lee, Kyubum</creatorcontrib><creatorcontrib>Yan, Shankai</creatorcontrib><creatorcontrib>Kim, Sun</creatorcontrib><creatorcontrib>Wei, Chih-Hsuan</creatorcontrib><creatorcontrib>Lu, Zhiyong</creatorcontrib><title>BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale</title><title>PLoS computational biology</title><addtitle>PLoS Comput Biol</addtitle><description>A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. 
Concept embeddings-which involve the learning of vector representations of concepts using machine learning models-have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we respectively performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, by far, the most comprehensive to our best knowledge. The intrinsic evaluation results demonstrate that BioConceptVec consistently has, by a large margin, better performance than existing concept embeddings in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. 
Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.</description><subject>Bioinformatics</subject><subject>Biology and Life Sciences</subject><subject>Biotechnology</subject><subject>Computational biology</subject><subject>Computer and Information Sciences</subject><subject>Concept mapping</subject><subject>Datasets</subject><subject>Drug interaction</subject><subject>Drug interactions</subject><subject>Electronic health records</subject><subject>Evaluation</subject><subject>Heart failure</subject><subject>Influence</subject><subject>Kinases</subject><subject>Knowledge</subject><subject>Learning algorithms</subject><subject>Machine learning</subject><subject>Medical literature</subject><subject>Medicine</subject><subject>Medicine and Health Sciences</subject><subject>Methods</subject><subject>Mutation</subject><subject>National libraries</subject><subject>OLE (Standard)</subject><subject>Performance enhancement</subject><subject>Principal components analysis</subject><subject>Protein interaction</subject><subject>Proteins</subject><subject>Recognition</subject><subject>Semantics</subject><subject>Social 
Sciences</subject><subject>Software</subject><subject>Studies</subject><issn>1553-7358</issn><issn>1553-734X</issn><issn>1553-7358</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><sourceid>DOA</sourceid><recordid>eNqVkk1v1DAQhiMEoqXwDxBE4sQhi-1x4pgDUlnxsVIFEl9Xa-JMglfZeGsnFfx7XDatugcO-OKx_czr8evJsqecrTgo_mrr5zDisNrbxq04Y6ri6l52yssSCgVlff9OfJI9inHLWAp19TA7AQFCKuCnWffW-bUfLe2nH2Rf5-tAOLmxz3Fsc7rCYT4sBzdRwGkOVDQYqc0b53fUOotDbg_5Oe0aattEx9yPOeYDhp7ymBB6nD3ocIj0ZJnPsu_v331bfywuPn_YrM8vClsJMRWN0BXHtqRScNCl6KwknUYpocZWaNRM1i0KyzQphKrUtSYGUhNXoAHhLHt-0N0PPprFomiEZKCUBFYnYnMgWo9bsw9uh-G38ejM3w0feoNhcnYgUykhtehqgq6WSqimhFY2XBPaWgDypPVmuW1ukhmWxingcCR6fDK6n6b3V0YJUAxYEnixCAR_OVOc_lHyQvXJSePGzicxu3PRmvMKhJasqlSiXh5R6Vcm-jX1OMdoNl-__Af76ZiVB9YGH2Og7vZ9nJnrTryp2Vx3olk6MaU9u-vNbdJN68Ef1djZEw</recordid><startdate>20200401</startdate><enddate>20200401</enddate><creator>Chen, Qingyu</creator><creator>Lee, Kyubum</creator><creator>Yan, Shankai</creator><creator>Kim, Sun</creator><creator>Wei, Chih-Hsuan</creator><creator>Lu, Zhiyong</creator><general>Public Library of Science</general><general>Public Library of Science 
(PLoS)</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>ISN</scope><scope>ISR</scope><scope>3V.</scope><scope>7QO</scope><scope>7QP</scope><scope>7TK</scope><scope>7TM</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8AL</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AEUYN</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FR3</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>K9.</scope><scope>LK8</scope><scope>M0N</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>P5Z</scope><scope>P62</scope><scope>P64</scope><scope>PHGZM</scope><scope>PHGZT</scope><scope>PIMPY</scope><scope>PJZUB</scope><scope>PKEHL</scope><scope>PPXIY</scope><scope>PQEST</scope><scope>PQGLB</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>Q9U</scope><scope>RC3</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-6036-1516</orcidid><orcidid>https://orcid.org/0000-0003-2015-3939</orcidid><orcidid>https://orcid.org/0000-0003-0369-4979</orcidid><orcidid>https://orcid.org/0000-0001-9998-916X</orcidid></search><sort><creationdate>20200401</creationdate><title>BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale</title><author>Chen, Qingyu ; Lee, Kyubum ; Yan, Shankai ; Kim, Sun ; Wei, Chih-Hsuan ; Lu, 
Zhiyong</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c622t-b2961ad5e5213952fc4e99995438ad29a9048da2c09e7a365989e0349e17393a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Bioinformatics</topic><topic>Biology and Life Sciences</topic><topic>Biotechnology</topic><topic>Computational biology</topic><topic>Computer and Information Sciences</topic><topic>Concept mapping</topic><topic>Datasets</topic><topic>Drug interaction</topic><topic>Drug interactions</topic><topic>Electronic health records</topic><topic>Evaluation</topic><topic>Heart failure</topic><topic>Influence</topic><topic>Kinases</topic><topic>Knowledge</topic><topic>Learning algorithms</topic><topic>Machine learning</topic><topic>Medical literature</topic><topic>Medicine</topic><topic>Medicine and Health Sciences</topic><topic>Methods</topic><topic>Mutation</topic><topic>National libraries</topic><topic>OLE (Standard)</topic><topic>Performance enhancement</topic><topic>Principal components analysis</topic><topic>Protein interaction</topic><topic>Proteins</topic><topic>Recognition</topic><topic>Semantics</topic><topic>Social Sciences</topic><topic>Software</topic><topic>Studies</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Chen, Qingyu</creatorcontrib><creatorcontrib>Lee, Kyubum</creatorcontrib><creatorcontrib>Yan, Shankai</creatorcontrib><creatorcontrib>Kim, Sun</creatorcontrib><creatorcontrib>Wei, Chih-Hsuan</creatorcontrib><creatorcontrib>Lu, Zhiyong</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Gale In Context: Canada</collection><collection>Gale In Context: Science</collection><collection>ProQuest Central (Corporate)</collection><collection>Biotechnology Research Abstracts</collection><collection>Calcium & Calcified Tissue Abstracts</collection><collection>Neurosciences 
Abstracts</collection><collection>Nucleic Acids Abstracts</collection><collection>Health & Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest One Sustainability</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Engineering Research Database</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Health & Medical Complete (Alumni)</collection><collection>ProQuest Biological Science Collection</collection><collection>Computing Database</collection><collection>Health & Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Biological 
Science Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>ProQuest Central (New)</collection><collection>ProQuest One Academic (New)</collection><collection>Publicly Available Content Database</collection><collection>ProQuest Health & Medical Research Collection</collection><collection>ProQuest One Academic Middle East (New)</collection><collection>ProQuest One Health & Nursing</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Applied & Life Sciences</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest Central Basic</collection><collection>Genetics Abstracts</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>PLoS computational biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Chen, Qingyu</au><au>Lee, Kyubum</au><au>Yan, Shankai</au><au>Kim, Sun</au><au>Wei, Chih-Hsuan</au><au>Lu, Zhiyong</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale</atitle><jtitle>PLoS computational biology</jtitle><addtitle>PLoS Comput Biol</addtitle><date>2020-04-01</date><risdate>2020</risdate><volume>16</volume><issue>4</issue><spage>e1007617</spage><pages>e1007617-</pages><issn>1553-7358</issn><issn>1553-734X</issn><eissn>1553-7358</eissn><abstract>A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. 
The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings-which involve the learning of vector representations of concepts using machine learning models-have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we respectively performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, by far, the most comprehensive to our best knowledge. The intrinsic evaluation results demonstrate that BioConceptVec consistently has, by a large margin, better performance than existing concept embeddings in identifying similar and related concepts. 
More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.</abstract><cop>United States</cop><pub>Public Library of Science</pub><pmid>32324731</pmid><doi>10.1371/journal.pcbi.1007617</doi><orcidid>https://orcid.org/0000-0002-6036-1516</orcidid><orcidid>https://orcid.org/0000-0003-2015-3939</orcidid><orcidid>https://orcid.org/0000-0003-0369-4979</orcidid><orcidid>https://orcid.org/0000-0001-9998-916X</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1553-7358
ispartof	PLoS computational biology, 2020-04, Vol.16 (4), p.e1007617
issn	1553-7358 1553-734X 1553-7358
language	eng
recordid	cdi_plos_journals_2403774308
source	DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek - Frei zugängliche E-Journals; PubMed Central; Public Library of Science (PLoS)
subjects	Bioinformatics; Biology and Life Sciences; Biotechnology; Computational biology; Computer and Information Sciences; Concept mapping; Datasets; Drug interaction; Drug interactions; Electronic health records; Evaluation; Heart failure; Influence; Kinases; Knowledge; Learning algorithms; Machine learning; Medical literature; Medicine; Medicine and Health Sciences; Methods; Mutation; National libraries; OLE (Standard); Performance enhancement; Principal components analysis; Protein interaction; Proteins; Recognition; Semantics; Social Sciences; Software; Studies
title	BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-15T19%3A58%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=BioConceptVec:%20Creating%20and%20evaluating%20literature-based%20biomedical%20concept%20embeddings%20on%20a%20large%20scale&rft.jtitle=PLoS%20computational%20biology&rft.au=Chen,%20Qingyu&rft.date=2020-04-01&rft.volume=16&rft.issue=4&rft.spage=e1007617&rft.pages=e1007617-&rft.issn=1553-7358&rft.eissn=1553-7358&rft_id=info:doi/10.1371/journal.pcbi.1007617&rft_dat=%3Cgale_plos_%3EA632940667%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2403774308&rft_id=info:pmid/32324731&rft_galeid=A632940667&rft_doaj_id=oai_doaj_org_article_672492f8e3f84727b53d4b19eac823a1&rfr_iscdi=true