NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval

Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to p...

Detailed description

Bibliographic details
Main authors:	Katz, Uri ; Vetzler, Matan ; Cohen, Amir DN ; Goldberg, Yoav
Format:	Article
Language:	eng
Subjects:	Computer Science - Artificial Intelligence ; Computer Science - Computation and Language ; Computer Science - Information Retrieval
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Katz, Uri ; Vetzler, Matan ; Cohen, Amir DN ; Goldberg, Yoav
description	Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. The first is a move towards more fine-grained -- and intersectional -- entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.
doi_str_mv	10.48550/arxiv.2310.14282
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2310_14282</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2310_14282</sourcerecordid><originalsourceid>FETCH-LOGICAL-a672-60f63e447cf2df112e7ba675abd275d803cebc12c0ab63c1567edb962ece07523</originalsourceid><addsrcrecordid>eNotj8FKAzEURbNxIdUPcGV-YGryMkmm7qSOVSgjlO6Hl-SlBNoZSUNp_946dnXhHjhwGHuSYl43WosXzOd0moO6HrKGBu5Z17UbKjnRiV75OxY8UuFxzLyjc-ErGihjSePAOzxQ4O1QUrnwDflxN6QJ4BD4TYH7B3YXcX-kx9vO2Paj3S4_q_X36mv5tq7QWKiMiEZRXVsfIUQpgay7Ao0ugNWhEcqT8xK8QGeUl9pYCm5hgDwJq0HN2PO_dgrqf3I6YL70f2H9FKZ-AWvUSPE</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval</title><source>arXiv.org</source><creator>Katz, Uri ; Vetzler, Matan ; Cohen, Amir DN ; Goldberg, Yoav</creator><creatorcontrib>Katz, Uri ; Vetzler, Matan ; Cohen, Amir DN ; Goldberg, Yoav</creatorcontrib><description>Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. 
The first is a move towards more fine-grained -- and intersectional -- entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.</description><identifier>DOI: 10.48550/arxiv.2310.14282</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computation and Language ; Computer Science - Information Retrieval</subject><creationdate>2023-10</creationdate><rights>http://creativecommons.org/licenses/by-sa/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2310.14282$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2310.14282$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Katz, Uri</creatorcontrib><creatorcontrib>Vetzler, Matan</creatorcontrib><creatorcontrib>Cohen, Amir DN</creatorcontrib><creatorcontrib>Goldberg, Yoav</creatorcontrib><title>NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval</title><description>Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named 
Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. The first is a move towards more fine-grained -- and intersectional -- entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. 
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Information Retrieval</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8FKAzEURbNxIdUPcGV-YGryMkmm7qSOVSgjlO6Hl-SlBNoZSUNp_946dnXhHjhwGHuSYl43WosXzOd0moO6HrKGBu5Z17UbKjnRiV75OxY8UuFxzLyjc-ErGihjSePAOzxQ4O1QUrnwDflxN6QJ4BD4TYH7B3YXcX-kx9vO2Paj3S4_q_X36mv5tq7QWKiMiEZRXVsfIUQpgay7Ao0ugNWhEcqT8xK8QGeUl9pYCm5hgDwJq0HN2PO_dgrqf3I6YL70f2H9FKZ-AWvUSPE</recordid><startdate>20231022</startdate><enddate>20231022</enddate><creator>Katz, Uri</creator><creator>Vetzler, Matan</creator><creator>Cohen, Amir DN</creator><creator>Goldberg, Yoav</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20231022</creationdate><title>NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval</title><author>Katz, Uri ; Vetzler, Matan ; Cohen, Amir DN ; Goldberg, Yoav</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a672-60f63e447cf2df112e7ba675abd275d803cebc12c0ab63c1567edb962ece07523</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Information Retrieval</topic><toplevel>online_resources</toplevel><creatorcontrib>Katz, Uri</creatorcontrib><creatorcontrib>Vetzler, Matan</creatorcontrib><creatorcontrib>Cohen, Amir DN</creatorcontrib><creatorcontrib>Goldberg, Yoav</creatorcontrib><collection>arXiv Computer 
Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Katz, Uri</au><au>Vetzler, Matan</au><au>Cohen, Amir DN</au><au>Goldberg, Yoav</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval</atitle><date>2023-10-22</date><risdate>2023</risdate><abstract>Recognizing entities in texts is a central need in many information-seeking scenarios, and indeed, Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and corresponding NLP technology. Recent advances in large language models (LLMs) appear to provide effective solutions (also) for NER tasks that were traditionally handled with dedicated models, often matching or surpassing the abilities of the dedicated models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly more useful, and increasingly more challenging, variants. We present three variants of the NER task, together with a dataset to support them. The first is a move towards more fine-grained -- and intersectional -- entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type, and the expected result is all the sentences from a large, pre-indexed corpus that contain entities of these types, and their corresponding spans. We show that all of these are far from being solved. 
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types, to facilitate research towards all of these three goals.</abstract><doi>10.48550/arxiv.2310.14282</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2310.14282
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2310_14282
source	arXiv.org
subjects	Computer Science - Artificial Intelligence ; Computer Science - Computation and Language ; Computer Science - Information Retrieval
title	NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-19T02%3A17%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=NERetrieve:%20Dataset%20for%20Next%20Generation%20Named%20Entity%20Recognition%20and%20Retrieval&rft.au=Katz,%20Uri&rft.date=2023-10-22&rft_id=info:doi/10.48550/arxiv.2310.14282&rft_dat=%3Carxiv_GOX%3E2310_14282%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true