Declarative Cleaning of Inconsistencies in Information Extraction

The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature and structure (from machine logs to informal...

Bibliographic details
Published in:	ACM transactions on database systems 2016-04, Vol.41 (1), p.1-44
Main authors:	Fagin, Ronald; Kimelfeld, Benny; Reiss, Frederick; Vansummeren, Stijn
Format:	Article
Language:	eng
Subjects:	Algorithms; Cleaning; Extraction; Information retrieval; Maintenance; Policies; Repairing
Online access:	Full text

container_end_page	44
container_issue	1
container_start_page	1
container_title	ACM transactions on database systems
container_volume	41
creator	Fagin, Ronald; Kimelfeld, Benny; Reiss, Frederick; Vansummeren, Stijn
description	The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature and structure (from machine logs to informal natural language), it is notoriously difficult to write IE programs that unambiguously extract the sought information. For example, during extraction, an IE program could annotate a substring as both an address and a person name. When this happens, the extracted information is said to be inconsistent, and some way of removing inconsistencies is crucial to compute the final output. Industrial-strength IE systems like GATE and IBM SystemT therefore provide a built-in collection of cleaning operations to remove inconsistencies from extracted relations. These operations, however, are collected in an ad hoc fashion through use cases. Ideally, we would like to allow IE developers to declare their own policies. But existing cleaning operations are defined in an algorithmic way, and hence it is not clear how to extend the built-in operations without requiring low-level coding of internal or external functions. We embark on the establishment of a framework for declarative cleaning of inconsistencies in IE through principles of database theory. Specifically, building upon the formalism of document spanners for IE, we adopt the concept of prioritized repairs, which has been recently proposed as an extension of the traditional database repairs to incorporate priorities among conflicting facts. We show that our framework captures the popular cleaning policies, as well as the POSIX semantics for extraction through regular expressions. We explore the problem of determining whether a cleaning declaration is unambiguous (i.e., always results in a single repair) and whether it increases the expressive power of the extraction language. We give both positive and negative results, some of which are general and some of which apply to policies used in practice.
doi_str_mv	10.1145/2877202
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_1808053987</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>1808053987</sourcerecordid><originalsourceid>FETCH-LOGICAL-c173t-e46067aab7694627d12cfcaec8a62bc0ea34a75bde2fd3a6e5a827ae0ad4a8143</originalsourceid><addsrcrecordid>eNotkE1Lw0AURQdRMFbxL2Snm-h8z2RZYtVCwY2uw8vkRUaSmTqTiv57W9rVvVwOd3EIuWX0gTGpHrk1hlN-RgqmlKmklvKcFFRoXqmaqUtylfMXpVTa2hRk-YRuhASz_8GyGRGCD59lHMp1cDFkn2cMzmMufdhPQ0zTHo2hXP3OCdyhXpOLAcaMN6dckI_n1XvzWm3eXtbNclM5ZsRcodRUG4DO6FpqbnrG3eAAnQXNO0cRhASjuh750AvQqMByA0ihl2CZFAtyf_zdpvi9wzy3k88OxxECxl1umaWWKlFbs0fvjqhLMeeEQ7tNfoL01zLaHiS1J0niHwYOWZI</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1808053987</pqid></control><display><type>article</type><title>Declarative Cleaning of Inconsistencies in Information Extraction</title><source>ACM Digital Library</source><creator>Fagin, Ronald ; Kimelfeld, Benny ; Reiss, Frederick ; Vansummeren, Stijn</creator><creatorcontrib>Fagin, Ronald ; Kimelfeld, Benny ; Reiss, Frederick ; Vansummeren, Stijn</creatorcontrib><description>The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature and structure (from machine logs to informal natural language), it is notoriously difficult to write IE programs that unambiguously extract the sought information. For example, during extraction, an IE program could annotate a substring as both an address and a person name. When this happens, the extracted information is said to be inconsistent , and some way of removing inconsistencies is crucial to compute the final output. 
Industrial-strength IE systems like GATE and IBM SystemT therefore provide a built-in collection of cleaning operations to remove inconsistencies from extracted relations. These operations, however, are collected in an ad hoc fashion through use cases. Ideally, we would like to allow IE developers to declare their own policies. But existing cleaning operations are defined in an algorithmic way, and hence it is not clear how to extend the built-in operations without requiring low-level coding of internal or external functions. We embark on the establishment of a framework for declarative cleaning of inconsistencies in IE through principles of database theory. Specifically, building upon the formalism of document spanners for IE, we adopt the concept of prioritized repairs , which has been recently proposed as an extension of the traditional database repairs to incorporate priorities among conflicting facts. We show that our framework captures the popular cleaning policies, as well as the POSIX semantics for extraction through regular expressions. We explore the problem of determining whether a cleaning declaration is unambiguous (i.e., always results in a single repair) and whether it increases the expressive power of the extraction language. 
We give both positive and negative results, some of which are general and some of which apply to policies used in practice.</description><identifier>ISSN: 0362-5915</identifier><identifier>EISSN: 1557-4644</identifier><identifier>DOI: 10.1145/2877202</identifier><language>eng</language><subject>Algorithms ; Cleaning ; Extraction ; Information retrieval ; Maintenance ; Policies ; Repairing</subject><ispartof>ACM transactions on database systems, 2016-04, Vol.41 (1), p.1-44</ispartof><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c173t-e46067aab7694627d12cfcaec8a62bc0ea34a75bde2fd3a6e5a827ae0ad4a8143</citedby><cites>FETCH-LOGICAL-c173t-e46067aab7694627d12cfcaec8a62bc0ea34a75bde2fd3a6e5a827ae0ad4a8143</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27901,27902</link.rule.ids></links><search><creatorcontrib>Fagin, Ronald</creatorcontrib><creatorcontrib>Kimelfeld, Benny</creatorcontrib><creatorcontrib>Reiss, Frederick</creatorcontrib><creatorcontrib>Vansummeren, Stijn</creatorcontrib><title>Declarative Cleaning of Inconsistencies in Information Extraction</title><title>ACM transactions on database systems</title><description>The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature and structure (from machine logs to informal natural language), it is notoriously difficult to write IE programs that unambiguously extract the sought information. For example, during extraction, an IE program could annotate a substring as both an address and a person name. 
When this happens, the extracted information is said to be inconsistent , and some way of removing inconsistencies is crucial to compute the final output. Industrial-strength IE systems like GATE and IBM SystemT therefore provide a built-in collection of cleaning operations to remove inconsistencies from extracted relations. These operations, however, are collected in an ad hoc fashion through use cases. Ideally, we would like to allow IE developers to declare their own policies. But existing cleaning operations are defined in an algorithmic way, and hence it is not clear how to extend the built-in operations without requiring low-level coding of internal or external functions. We embark on the establishment of a framework for declarative cleaning of inconsistencies in IE through principles of database theory. Specifically, building upon the formalism of document spanners for IE, we adopt the concept of prioritized repairs , which has been recently proposed as an extension of the traditional database repairs to incorporate priorities among conflicting facts. We show that our framework captures the popular cleaning policies, as well as the POSIX semantics for extraction through regular expressions. We explore the problem of determining whether a cleaning declaration is unambiguous (i.e., always results in a single repair) and whether it increases the expressive power of the extraction language. 
We give both positive and negative results, some of which are general and some of which apply to policies used in practice.</description><subject>Algorithms</subject><subject>Cleaning</subject><subject>Extraction</subject><subject>Information retrieval</subject><subject>Maintenance</subject><subject>Policies</subject><subject>Repairing</subject><issn>0362-5915</issn><issn>1557-4644</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2016</creationdate><recordtype>article</recordtype><recordid>eNotkE1Lw0AURQdRMFbxL2Snm-h8z2RZYtVCwY2uw8vkRUaSmTqTiv57W9rVvVwOd3EIuWX0gTGpHrk1hlN-RgqmlKmklvKcFFRoXqmaqUtylfMXpVTa2hRk-YRuhASz_8GyGRGCD59lHMp1cDFkn2cMzmMufdhPQ0zTHo2hXP3OCdyhXpOLAcaMN6dckI_n1XvzWm3eXtbNclM5ZsRcodRUG4DO6FpqbnrG3eAAnQXNO0cRhASjuh750AvQqMByA0ihl2CZFAtyf_zdpvi9wzy3k88OxxECxl1umaWWKlFbs0fvjqhLMeeEQ7tNfoL01zLaHiS1J0niHwYOWZI</recordid><startdate>20160407</startdate><enddate>20160407</enddate><creator>Fagin, Ronald</creator><creator>Kimelfeld, Benny</creator><creator>Reiss, Frederick</creator><creator>Vansummeren, Stijn</creator><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20160407</creationdate><title>Declarative Cleaning of Inconsistencies in Information Extraction</title><author>Fagin, Ronald ; Kimelfeld, Benny ; Reiss, Frederick ; Vansummeren, Stijn</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c173t-e46067aab7694627d12cfcaec8a62bc0ea34a75bde2fd3a6e5a827ae0ad4a8143</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2016</creationdate><topic>Algorithms</topic><topic>Cleaning</topic><topic>Extraction</topic><topic>Information retrieval</topic><topic>Maintenance</topic><topic>Policies</topic><topic>Repairing</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Fagin, 
Ronald</creatorcontrib><creatorcontrib>Kimelfeld, Benny</creatorcontrib><creatorcontrib>Reiss, Frederick</creatorcontrib><creatorcontrib>Vansummeren, Stijn</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>ACM transactions on database systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Fagin, Ronald</au><au>Kimelfeld, Benny</au><au>Reiss, Frederick</au><au>Vansummeren, Stijn</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Declarative Cleaning of Inconsistencies in Information Extraction</atitle><jtitle>ACM transactions on database systems</jtitle><date>2016-04-07</date><risdate>2016</risdate><volume>41</volume><issue>1</issue><spage>1</spage><epage>44</epage><pages>1-44</pages><issn>0362-5915</issn><eissn>1557-4644</eissn><abstract>The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature and structure (from machine logs to informal natural language), it is notoriously difficult to write IE programs that unambiguously extract the sought information. For example, during extraction, an IE program could annotate a substring as both an address and a person name. When this happens, the extracted information is said to be inconsistent , and some way of removing inconsistencies is crucial to compute the final output. 
Industrial-strength IE systems like GATE and IBM SystemT therefore provide a built-in collection of cleaning operations to remove inconsistencies from extracted relations. These operations, however, are collected in an ad hoc fashion through use cases. Ideally, we would like to allow IE developers to declare their own policies. But existing cleaning operations are defined in an algorithmic way, and hence it is not clear how to extend the built-in operations without requiring low-level coding of internal or external functions. We embark on the establishment of a framework for declarative cleaning of inconsistencies in IE through principles of database theory. Specifically, building upon the formalism of document spanners for IE, we adopt the concept of prioritized repairs , which has been recently proposed as an extension of the traditional database repairs to incorporate priorities among conflicting facts. We show that our framework captures the popular cleaning policies, as well as the POSIX semantics for extraction through regular expressions. We explore the problem of determining whether a cleaning declaration is unambiguous (i.e., always results in a single repair) and whether it increases the expressive power of the extraction language. We give both positive and negative results, some of which are general and some of which apply to policies used in practice.</abstract><doi>10.1145/2877202</doi><tpages>44</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0362-5915
ispartof	ACM transactions on database systems, 2016-04, Vol.41 (1), p.1-44
issn	0362-5915 1557-4644
language	eng
recordid	cdi_proquest_miscellaneous_1808053987
source	ACM Digital Library
subjects	Algorithms; Cleaning; Extraction; Information retrieval; Maintenance; Policies; Repairing
title	Declarative Cleaning of Inconsistencies in Information Extraction
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T19%3A54%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Declarative%20Cleaning%20of%20Inconsistencies%20in%20Information%20Extraction&rft.jtitle=ACM%20transactions%20on%20database%20systems&rft.au=Fagin,%20Ronald&rft.date=2016-04-07&rft.volume=41&rft.issue=1&rft.spage=1&rft.epage=44&rft.pages=1-44&rft.issn=0362-5915&rft.eissn=1557-4644&rft_id=info:doi/10.1145/2877202&rft_dat=%3Cproquest_cross%3E1808053987%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1808053987&rft_id=info:pmid/&rfr_iscdi=true