NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval

Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to p...

Bibliographic details
Main authors: Katz, Uri; Vetzler, Matan; Cohen, Amir DN; Goldberg, Yoav
Format: Article
Language: eng
creator Katz, Uri
Vetzler, Matan
Cohen, Amir DN
Goldberg, Yoav
description Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. The first is a move towards more fine-grained -- and intersectional -- entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.
doi_str_mv 10.48550/arxiv.2310.14282
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2310_14282</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2310_14282</sourcerecordid><originalsourceid>FETCH-LOGICAL-a672-60f63e447cf2df112e7ba675abd275d803cebc12c0ab63c1567edb962ece07523</originalsourceid><addsrcrecordid>eNotj8FKAzEURbNxIdUPcGV-YGryMkmm7qSOVSgjlO6Hl-SlBNoZSUNp_946dnXhHjhwGHuSYl43WosXzOd0moO6HrKGBu5Z17UbKjnRiV75OxY8UuFxzLyjc-ErGihjSePAOzxQ4O1QUrnwDflxN6QJ4BD4TYH7B3YXcX-kx9vO2Paj3S4_q_X36mv5tq7QWKiMiEZRXVsfIUQpgay7Ao0ugNWhEcqT8xK8QGeUl9pYCm5hgDwJq0HN2PO_dgrqf3I6YL70f2H9FKZ-AWvUSPE</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval</title><source>arXiv.org</source><creator>Katz, Uri ; Vetzler, Matan ; Cohen, Amir DN ; Goldberg, Yoav</creator><creatorcontrib>Katz, Uri ; Vetzler, Matan ; Cohen, Amir DN ; Goldberg, Yoav</creatorcontrib><description>Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. 
The first is a move towards more fine-grained -- and intersectional -- entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.</description><identifier>DOI: 10.48550/arxiv.2310.14282</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computation and Language ; Computer Science - Information Retrieval</subject><creationdate>2023-10</creationdate><rights>http://creativecommons.org/licenses/by-sa/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2310.14282$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2310.14282$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Katz, Uri</creatorcontrib><creatorcontrib>Vetzler, Matan</creatorcontrib><creatorcontrib>Cohen, Amir DN</creatorcontrib><creatorcontrib>Goldberg, Yoav</creatorcontrib><title>NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval</title><description>Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named 
Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. The first is a move towards more fine-grained -- and intersectional -- entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. 
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Information Retrieval</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8FKAzEURbNxIdUPcGV-YGryMkmm7qSOVSgjlO6Hl-SlBNoZSUNp_946dnXhHjhwGHuSYl43WosXzOd0moO6HrKGBu5Z17UbKjnRiV75OxY8UuFxzLyjc-ErGihjSePAOzxQ4O1QUrnwDflxN6QJ4BD4TYH7B3YXcX-kx9vO2Paj3S4_q_X36mv5tq7QWKiMiEZRXVsfIUQpgay7Ao0ugNWhEcqT8xK8QGeUl9pYCm5hgDwJq0HN2PO_dgrqf3I6YL70f2H9FKZ-AWvUSPE</recordid><startdate>20231022</startdate><enddate>20231022</enddate><creator>Katz, Uri</creator><creator>Vetzler, Matan</creator><creator>Cohen, Amir DN</creator><creator>Goldberg, Yoav</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20231022</creationdate><title>NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval</title><author>Katz, Uri ; Vetzler, Matan ; Cohen, Amir DN ; Goldberg, Yoav</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a672-60f63e447cf2df112e7ba675abd275d803cebc12c0ab63c1567edb962ece07523</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Information Retrieval</topic><toplevel>online_resources</toplevel><creatorcontrib>Katz, Uri</creatorcontrib><creatorcontrib>Vetzler, Matan</creatorcontrib><creatorcontrib>Cohen, Amir DN</creatorcontrib><creatorcontrib>Goldberg, Yoav</creatorcontrib><collection>arXiv Computer 
Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Katz, Uri</au><au>Vetzler, Matan</au><au>Cohen, Amir DN</au><au>Goldberg, Yoav</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval</atitle><date>2023-10-22</date><risdate>2023</risdate><abstract>Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. The first is a move towards more fine-grained -- and intersectional -- entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. 
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.</abstract><doi>10.48550/arxiv.2310.14282</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2310.14282
language eng
recordid cdi_arxiv_primary_2310_14282
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Information Retrieval
title NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-19T02%3A17%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=NERetrieve:%20Dataset%20for%20Next%20Generation%20Named%20Entity%20Recognition%20and%20Retrieval&rft.au=Katz,%20Uri&rft.date=2023-10-22&rft_id=info:doi/10.48550/arxiv.2310.14282&rft_dat=%3Carxiv_GOX%3E2310_14282%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true