Named Entity Recognition in the Legal Domain using a Pointer Generator Network

Named Entity Recognition (NER) is the task of identifying and classifying named entities in unstructured text. In the legal domain, named entities of interest may include the case parties, judges, names of courts, case numbers, references to laws etc. We study the problem of legal NER with noisy tex...

Bibliographic details
Published in:	arXiv.org 2020-12
Main authors:	Skylaki, Stavroula, Oskooei, Ali, Bari, Omar, Herger, Nadja, Kriegman, Zac
Format:	Article
Language:	eng
Subjects:	Annotations Computer architecture Domains Legislation Neural networks Optical character recognition Standard data Training Unstructured data
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Skylaki, Stavroula Oskooei, Ali Bari, Omar Herger, Nadja Kriegman, Zac
description	Named Entity Recognition (NER) is the task of identifying and classifying named entities in unstructured text. In the legal domain, named entities of interest may include the case parties, judges, names of courts, case numbers, references to laws etc. We study the problem of legal NER with noisy text extracted from PDF files of filed court cases from US courts. The "gold standard" training data for NER systems provide annotation for each token of the text with the corresponding entity or non-entity label. We work with only partially complete training data, which differ from the gold standard NER data in that the exact location of the entities in the text is unknown and the entities may contain typos and/or OCR mistakes. To overcome the challenges of our noisy training data, e.g. text extraction errors and/or typos and unknown label indices, we formulate the NER task as a text-to-text sequence generation task and train a pointer generator network to generate the entities in the document rather than label them. We show that the pointer generator can be effective for NER in the absence of gold standard data and outperforms the common NER neural network architectures in long legal documents.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2471580294</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2471580294</sourcerecordid><originalsourceid>FETCH-proquest_journals_24715802943</originalsourceid><addsrcrecordid>eNqNisEKgkAUAJcgSMp_eNBZWFdNO5fVISSiuyz1sjV9r3ZXor-vQx_QaWBmRiJQSRJHRarURITOtVJKtchVliWBqCrd4wVK8sa_4Yhnbsh4wwSGwN8Q9tjoDtbc668YnKEGNBzYkEcLWyS02rOFCv2L7X0mxlfdOQx_nIr5pjytdtHD8nNA5-uWB0vfVKs0j7NCqmWa_Hd9AInwPbg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2471580294</pqid></control><display><type>article</type><title>Named Entity Recognition in the Legal Domain using a Pointer Generator Network</title><source>Free E- Journals</source><creator>Skylaki, Stavroula ; Oskooei, Ali ; Bari, Omar ; Herger, Nadja ; Kriegman, Zac</creator><creatorcontrib>Skylaki, Stavroula ; Oskooei, Ali ; Bari, Omar ; Herger, Nadja ; Kriegman, Zac</creatorcontrib><description>Named Entity Recognition (NER) is the task of identifying and classifying named entities in unstructured text. In the legal domain, named entities of interest may include the case parties, judges, names of courts, case numbers, references to laws etc. We study the problem of legal NER with noisy text extracted from PDF files of filed court cases from US courts. The "gold standard" training data for NER systems provide annotation for each token of the text with the corresponding entity or non-entity label. We work with only partially complete training data, which differ from the gold standard NER data in that the exact location of the entities in the text is unknown and the entities may contain typos and/or OCR mistakes. To overcome the challenges of our noisy training data, e.g. 
text extraction errors and/or typos and unknown label indices, we formulate the NER task as a text-to-text sequence generation task and train a pointer generator network to generate the entities in the document rather than label them. We show that the pointer generator can be effective for NER in the absence of gold standard data and outperforms the common NER neural network architectures in long legal documents.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Annotations ; Computer architecture ; Domains ; Legislation ; Neural networks ; Optical character recognition ; Standard data ; Training ; Unstructured data</subject><ispartof>arXiv.org, 2020-12</ispartof><rights>2020. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Skylaki, Stavroula</creatorcontrib><creatorcontrib>Oskooei, Ali</creatorcontrib><creatorcontrib>Bari, Omar</creatorcontrib><creatorcontrib>Herger, Nadja</creatorcontrib><creatorcontrib>Kriegman, Zac</creatorcontrib><title>Named Entity Recognition in the Legal Domain using a Pointer Generator Network</title><title>arXiv.org</title><description>Named Entity Recognition (NER) is the task of identifying and classifying named entities in unstructured text. In the legal domain, named entities of interest may include the case parties, judges, names of courts, case numbers, references to laws etc. 
We study the problem of legal NER with noisy text extracted from PDF files of filed court cases from US courts. The "gold standard" training data for NER systems provide annotation for each token of the text with the corresponding entity or non-entity label. We work with only partially complete training data, which differ from the gold standard NER data in that the exact location of the entities in the text is unknown and the entities may contain typos and/or OCR mistakes. To overcome the challenges of our noisy training data, e.g. text extraction errors and/or typos and unknown label indices, we formulate the NER task as a text-to-text sequence generation task and train a pointer generator network to generate the entities in the document rather than label them. We show that the pointer generator can be effective for NER in the absence of gold standard data and outperforms the common NER neural network architectures in long legal documents.</description><subject>Annotations</subject><subject>Computer architecture</subject><subject>Domains</subject><subject>Legislation</subject><subject>Neural networks</subject><subject>Optical character recognition</subject><subject>Standard data</subject><subject>Training</subject><subject>Unstructured data</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNisEKgkAUAJcgSMp_eNBZWFdNO5fVISSiuyz1sjV9r3ZXor-vQx_QaWBmRiJQSRJHRarURITOtVJKtchVliWBqCrd4wVK8sa_4Yhnbsh4wwSGwN8Q9tjoDtbc668YnKEGNBzYkEcLWyS02rOFCv2L7X0mxlfdOQx_nIr5pjytdtHD8nNA5-uWB0vfVKs0j7NCqmWa_Hd9AInwPbg</recordid><startdate>20201217</startdate><enddate>20201217</enddate><creator>Skylaki, Stavroula</creator><creator>Oskooei, Ali</creator><creator>Bari, Omar</creator><creator>Herger, 
Nadja</creator><creator>Kriegman, Zac</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20201217</creationdate><title>Named Entity Recognition in the Legal Domain using a Pointer Generator Network</title><author>Skylaki, Stavroula ; Oskooei, Ali ; Bari, Omar ; Herger, Nadja ; Kriegman, Zac</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_24715802943</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Annotations</topic><topic>Computer architecture</topic><topic>Domains</topic><topic>Legislation</topic><topic>Neural networks</topic><topic>Optical character recognition</topic><topic>Standard data</topic><topic>Training</topic><topic>Unstructured data</topic><toplevel>online_resources</toplevel><creatorcontrib>Skylaki, Stavroula</creatorcontrib><creatorcontrib>Oskooei, Ali</creatorcontrib><creatorcontrib>Bari, Omar</creatorcontrib><creatorcontrib>Herger, Nadja</creatorcontrib><creatorcontrib>Kriegman, Zac</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central 
Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Skylaki, Stavroula</au><au>Oskooei, Ali</au><au>Bari, Omar</au><au>Herger, Nadja</au><au>Kriegman, Zac</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Named Entity Recognition in the Legal Domain using a Pointer Generator Network</atitle><jtitle>arXiv.org</jtitle><date>2020-12-17</date><risdate>2020</risdate><eissn>2331-8422</eissn><abstract>Named Entity Recognition (NER) is the task of identifying and classifying named entities in unstructured text. In the legal domain, named entities of interest may include the case parties, judges, names of courts, case numbers, references to laws etc. We study the problem of legal NER with noisy text extracted from PDF files of filed court cases from US courts. The "gold standard" training data for NER systems provide annotation for each token of the text with the corresponding entity or non-entity label. We work with only partially complete training data, which differ from the gold standard NER data in that the exact location of the entities in the text is unknown and the entities may contain typos and/or OCR mistakes. To overcome the challenges of our noisy training data, e.g. 
text extraction errors and/or typos and unknown label indices, we formulate the NER task as a text-to-text sequence generation task and train a pointer generator network to generate the entities in the document rather than label them. We show that the pointer generator can be effective for NER in the absence of gold standard data and outperforms the common NER neural network architectures in long legal documents.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2020-12
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2471580294
source	Free E- Journals
subjects	Annotations Computer architecture Domains Legislation Neural networks Optical character recognition Standard data Training Unstructured data
title	Named Entity Recognition in the Legal Domain using a Pointer Generator Network
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T10%3A59%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Named%20Entity%20Recognition%20in%20the%20Legal%20Domain%20using%20a%20Pointer%20Generator%20Network&rft.jtitle=arXiv.org&rft.au=Skylaki,%20Stavroula&rft.date=2020-12-17&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2471580294%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2471580294&rft_id=info:pmid/&rfr_iscdi=true