NEREL-BIO: a dataset of biomedical abstracts annotated with nested named entities

Abstract. Motivation: This article describes NEREL-BIO, an annotation scheme and corpus of PubMed abstracts in Russian, with a smaller number of abstracts in English. NEREL-BIO extends the general-domain dataset NEREL by introducing domain-specific entity types. The NEREL-BIO annotation scheme covers both gene...

Full description

Bibliographic details
Published in: Bioinformatics (Oxford, England), 2023-04, Vol.39 (4)
Main authors: Loukachevitch, Natalia; Manandhar, Suresh; Baral, Elina; Rozhkov, Igor; Braslavski, Pavel; Ivanov, Vladimir; Batura, Tatiana; Tutubalina, Elena
Format: Article
Language: eng
Online access: Full text
container_end_page
container_issue 4
container_start_page
container_title Bioinformatics (Oxford, England)
container_volume 39
creator Loukachevitch, Natalia
Manandhar, Suresh
Baral, Elina
Rozhkov, Igor
Braslavski, Pavel
Ivanov, Vladimir
Batura, Tatiana
Tutubalina, Elena
description Abstract. Motivation: This article describes NEREL-BIO, an annotation scheme and corpus of PubMed abstracts in Russian, with a smaller number of abstracts in English. NEREL-BIO extends the general-domain dataset NEREL by introducing domain-specific entity types. The NEREL-BIO annotation scheme covers both the general and biomedical domains, making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect. Results: NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. Thus, NEREL-BIO has the following specific features: it annotates nested named entities, and it can be used as a benchmark for cross-domain (NEREL → NEREL-BIO) and cross-language (English → Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension models and report their results. Availability and implementation: The dataset and annotation guidelines are freely available at https://github.com/nerel-ds/NEREL-BIO.
doi_str_mv 10.1093/bioinformatics/btad161
format Article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10129873</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/bioinformatics/btad161</oup_id><sourcerecordid>2793985918</sourcerecordid><originalsourceid>FETCH-LOGICAL-c457t-4c5e3ab0c71d31fd326859b0a19e6131830617882c9b35bf52b6dab94e74909a3</originalsourceid><addsrcrecordid>eNqNUU1LAzEUDKJorf4FydHL2rzNfsWLaKlaKBZFz-Elm9XIdlM3qeK_N9Iq9ebpDbx5M_MYQk6AnQETfKSss13j-gUGq_1IBayhgB0yAF6USVYB7G7hA3Lo_StjLGd5sU8OeMlYBpUYkPu7ycNkllxN5-cUaY0BvQnUNTQaLExtNbYUlQ896uApdp0LGExNP2x4oZ3x37jDyKSmCzZY44_IXoOtN8ebOSRP15PH8W0ym99Mx5ezRGd5GZJM54ajYrqEmkNT87SocqEYgjAFcKg4K6CsqlQLxXPV5KkqalQiM2UmmEA-JBdr3eVKRX8d_Xts5bK3C-w_pUMr_246-yKf3bsEBqmoSh4VTjcKvXtbxV_kwnpt2hY741ZepqXgIoaKWYakWFN177zvTfPrA0x-FyL_FiI3hcTDk-2Uv2c_DUQCrAlutfyv6BfLRp7q</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2793985918</pqid></control><display><type>article</type><title>NEREL-BIO: a dataset of biomedical abstracts annotated with nested named entities</title><source>MEDLINE</source><source>DOAJ Directory of Open Access Journals</source><source>Oxford Journals Open Access Collection</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><source>Alma/SFX Local Collection</source><creator>Loukachevitch, Natalia ; Manandhar, Suresh ; Baral, Elina ; Rozhkov, Igor ; Braslavski, Pavel ; Ivanov, Vladimir ; Batura, Tatiana ; Tutubalina, Elena</creator><contributor>Lu, Zhiyong</contributor><creatorcontrib>Loukachevitch, Natalia ; Manandhar, Suresh ; Baral, Elina ; Rozhkov, Igor ; Braslavski, Pavel ; Ivanov, Vladimir ; Batura, Tatiana ; Tutubalina, Elena ; Lu, Zhiyong</creatorcontrib><description>Abstract Motivation This article describes NEREL-BIO—an annotation scheme and corpus of PubMed abstracts in Russian and smaller number of abstracts in 
English. NEREL-BIO extends the general domain dataset NEREL by introducing domain-specific entity types. NEREL-BIO annotation scheme covers both general and biomedical domains making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect. Results NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. Thus, NEREL-BIO comprises the following specific features: annotation of nested named entities, it can be used as a benchmark for cross-domain (NEREL → NEREL-BIO) and cross-language (English → Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension models and report their results. Availability and implementation The dataset and annotation guidelines are freely available at https://github.com/nerel-ds/NEREL-BIO.</description><identifier>ISSN: 1367-4811</identifier><identifier>ISSN: 1367-4803</identifier><identifier>EISSN: 1367-4811</identifier><identifier>DOI: 10.1093/bioinformatics/btad161</identifier><identifier>PMID: 37004189</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Language ; Natural Language Processing ; Original Paper ; PubMed ; Semantics</subject><ispartof>Bioinformatics (Oxford, England), 2023-04, Vol.39 (4)</ispartof><rights>The Author(s) 2023. Published by Oxford University Press. 2023</rights><rights>The Author(s) 2023. 
Published by Oxford University Press.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c457t-4c5e3ab0c71d31fd326859b0a19e6131830617882c9b35bf52b6dab94e74909a3</citedby><cites>FETCH-LOGICAL-c457t-4c5e3ab0c71d31fd326859b0a19e6131830617882c9b35bf52b6dab94e74909a3</cites><orcidid>0000-0001-7936-0284 ; 0000-0003-4333-7888</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10129873/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC10129873/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,314,727,780,784,864,885,1603,27923,27924,53790,53792</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/37004189$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><contributor>Lu, Zhiyong</contributor><creatorcontrib>Loukachevitch, Natalia</creatorcontrib><creatorcontrib>Manandhar, Suresh</creatorcontrib><creatorcontrib>Baral, Elina</creatorcontrib><creatorcontrib>Rozhkov, Igor</creatorcontrib><creatorcontrib>Braslavski, Pavel</creatorcontrib><creatorcontrib>Ivanov, Vladimir</creatorcontrib><creatorcontrib>Batura, Tatiana</creatorcontrib><creatorcontrib>Tutubalina, Elena</creatorcontrib><title>NEREL-BIO: a dataset of biomedical abstracts annotated with nested named entities</title><title>Bioinformatics (Oxford, England)</title><addtitle>Bioinformatics</addtitle><description>Abstract Motivation This article describes NEREL-BIO—an annotation scheme and corpus of PubMed abstracts in Russian and smaller number of abstracts in English. NEREL-BIO extends the general domain dataset NEREL by introducing domain-specific entity types. 
NEREL-BIO annotation scheme covers both general and biomedical domains making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect. Results NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. Thus, NEREL-BIO comprises the following specific features: annotation of nested named entities, it can be used as a benchmark for cross-domain (NEREL → NEREL-BIO) and cross-language (English → Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension models and report their results. Availability and implementation The dataset and annotation guidelines are freely available at https://github.com/nerel-ds/NEREL-BIO.</description><subject>Language</subject><subject>Natural Language Processing</subject><subject>Original Paper</subject><subject>PubMed</subject><subject>Semantics</subject><issn>1367-4811</issn><issn>1367-4803</issn><issn>1367-4811</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>TOX</sourceid><sourceid>EIF</sourceid><recordid>eNqNUU1LAzEUDKJorf4FydHL2rzNfsWLaKlaKBZFz-Elm9XIdlM3qeK_N9Iq9ebpDbx5M_MYQk6AnQETfKSss13j-gUGq_1IBayhgB0yAF6USVYB7G7hA3Lo_StjLGd5sU8OeMlYBpUYkPu7ycNkllxN5-cUaY0BvQnUNTQaLExtNbYUlQ896uApdp0LGExNP2x4oZ3x37jDyKSmCzZY44_IXoOtN8ebOSRP15PH8W0ym99Mx5ezRGd5GZJM54ajYrqEmkNT87SocqEYgjAFcKg4K6CsqlQLxXPV5KkqalQiM2UmmEA-JBdr3eVKRX8d_Xts5bK3C-w_pUMr_246-yKf3bsEBqmoSh4VTjcKvXtbxV_kwnpt2hY741ZepqXgIoaKWYakWFN177zvTfPrA0x-FyL_FiI3hcTDk-2Uv2c_DUQCrAlutfyv6BfLRp7q</recordid><startdate>20230403</startdate><enddate>20230403</enddate><creator>Loukachevitch, Natalia</creator><creator>Manandhar, 
Suresh</creator><creator>Baral, Elina</creator><creator>Rozhkov, Igor</creator><creator>Braslavski, Pavel</creator><creator>Ivanov, Vladimir</creator><creator>Batura, Tatiana</creator><creator>Tutubalina, Elena</creator><general>Oxford University Press</general><scope>TOX</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0001-7936-0284</orcidid><orcidid>https://orcid.org/0000-0003-4333-7888</orcidid></search><sort><creationdate>20230403</creationdate><title>NEREL-BIO: a dataset of biomedical abstracts annotated with nested named entities</title><author>Loukachevitch, Natalia ; Manandhar, Suresh ; Baral, Elina ; Rozhkov, Igor ; Braslavski, Pavel ; Ivanov, Vladimir ; Batura, Tatiana ; Tutubalina, Elena</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c457t-4c5e3ab0c71d31fd326859b0a19e6131830617882c9b35bf52b6dab94e74909a3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Language</topic><topic>Natural Language Processing</topic><topic>Original Paper</topic><topic>PubMed</topic><topic>Semantics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Loukachevitch, Natalia</creatorcontrib><creatorcontrib>Manandhar, Suresh</creatorcontrib><creatorcontrib>Baral, Elina</creatorcontrib><creatorcontrib>Rozhkov, Igor</creatorcontrib><creatorcontrib>Braslavski, Pavel</creatorcontrib><creatorcontrib>Ivanov, Vladimir</creatorcontrib><creatorcontrib>Batura, Tatiana</creatorcontrib><creatorcontrib>Tutubalina, Elena</creatorcontrib><collection>Oxford Journals Open Access Collection</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE 
(Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Bioinformatics (Oxford, England)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Loukachevitch, Natalia</au><au>Manandhar, Suresh</au><au>Baral, Elina</au><au>Rozhkov, Igor</au><au>Braslavski, Pavel</au><au>Ivanov, Vladimir</au><au>Batura, Tatiana</au><au>Tutubalina, Elena</au><au>Lu, Zhiyong</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>NEREL-BIO: a dataset of biomedical abstracts annotated with nested named entities</atitle><jtitle>Bioinformatics (Oxford, England)</jtitle><addtitle>Bioinformatics</addtitle><date>2023-04-03</date><risdate>2023</risdate><volume>39</volume><issue>4</issue><issn>1367-4811</issn><issn>1367-4803</issn><eissn>1367-4811</eissn><abstract>Abstract Motivation This article describes NEREL-BIO—an annotation scheme and corpus of PubMed abstracts in Russian and smaller number of abstracts in English. NEREL-BIO extends the general domain dataset NEREL by introducing domain-specific entity types. NEREL-BIO annotation scheme covers both general and biomedical domains making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect. Results NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. 
Thus, NEREL-BIO comprises the following specific features: annotation of nested named entities, it can be used as a benchmark for cross-domain (NEREL → NEREL-BIO) and cross-language (English → Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension models and report their results. Availability and implementation The dataset and annotation guidelines are freely available at https://github.com/nerel-ds/NEREL-BIO.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>37004189</pmid><doi>10.1093/bioinformatics/btad161</doi><orcidid>https://orcid.org/0000-0001-7936-0284</orcidid><orcidid>https://orcid.org/0000-0003-4333-7888</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 1367-4811
ispartof Bioinformatics (Oxford, England), 2023-04, Vol.39 (4)
issn 1367-4811
1367-4803
1367-4811
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_10129873
source MEDLINE; DOAJ Directory of Open Access Journals; Oxford Journals Open Access Collection; EZB-FREE-00999 freely available EZB journals; PubMed Central; Alma/SFX Local Collection
subjects Language
Natural Language Processing
Original Paper
PubMed
Semantics
title NEREL-BIO: a dataset of biomedical abstracts annotated with nested named entities
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T06%3A33%3A42IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=NEREL-BIO:%20a%20dataset%20of%20biomedical%20abstracts%20annotated%20with%20nested%20named%20entities&rft.jtitle=Bioinformatics%20(Oxford,%20England)&rft.au=Loukachevitch,%20Natalia&rft.date=2023-04-03&rft.volume=39&rft.issue=4&rft.issn=1367-4811&rft.eissn=1367-4811&rft_id=info:doi/10.1093/bioinformatics/btad161&rft_dat=%3Cproquest_pubme%3E2793985918%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2793985918&rft_id=info:pmid/37004189&rft_oup_id=10.1093/bioinformatics/btad161&rfr_iscdi=true
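
The description field above refers to nested named entities, where a shorter entity lies inside a longer one. As a rough illustration only, the sketch below represents entities as character-offset spans and lists which spans contain which. The example phrase, the `Span` type, the `nested_pairs` helper, and the labels `DISORDER` and `ANATOMY` are assumptions made for this sketch; they are not taken from the NEREL-BIO corpus or its annotation guidelines.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # illustrative label, not from the NEREL-BIO guidelines


def nested_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where inner lies fully inside a longer outer span."""
    pairs = []
    for outer in spans:
        for inner in spans:
            if inner is outer:
                continue
            contains = outer.start <= inner.start and inner.end <= outer.end
            longer = (outer.end - outer.start) > (inner.end - inner.start)
            if contains and longer:
                pairs.append((outer, inner))
    return pairs


# Hypothetical example: a disorder mention with an anatomy mention nested in it.
text = "chronic heart failure"
spans = [Span(0, 21, "DISORDER"), Span(8, 13, "ANATOMY")]

for outer, inner in nested_pairs(spans):
    print(text[outer.start:outer.end], "contains", text[inner.start:inner.end])
```

A flat sequence tagger assigns one label per token, so it would have to pick either the outer or the inner span here; that is one way to see why the description calls nested entities harder to detect.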