Ensuring Data Readiness for Quality Requirements with Help from Procedure Reuse

Assessing and improving the quality of data are fundamental challenges in Big-Data applications. These challenges have given rise to numerous solutions targeting transformation, integration, and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably...

Detailed description

Bibliographic details
Published in:	ACM journal of data and information quality 2021-09, Vol.13 (3), p.1-15
Main authors:	Chirkova, Rada; Doyle, Jon; Reutter, Juan
Format:	Article
Language:	eng
Online access:	Full text

container_end_page	15
container_issue	3
container_start_page	1
container_title	ACM journal of data and information quality
container_volume	13
creator	Chirkova, Rada; Doyle, Jon; Reutter, Juan
description	Assessing and improving the quality of data are fundamental challenges in Big-Data applications. These challenges have given rise to numerous solutions targeting transformation, integration, and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably well understood in isolation, not much attention has been given to the interplay between standalone tools in these areas. In this article, we focus on the problem of determining whether the available data-transforming procedures can be used together to bring about the desired quality characteristics of the data in business or analytics processes. For example, to help an organization avoid building a data-quality solution from scratch when facing a new analytics task, we ask whether the data quality can be improved by reusing the tools that are already available, and if so, which tools to apply, and in which order, all without presuming knowledge of the internals of the tools, which may be external or proprietary. Toward addressing this problem, we conduct a formal study in which individual data cleaning, data migration, or other data-transforming tools are abstracted as black-box procedures with only some of the properties exposed, such as their applicability requirements, the parts of the data that the procedure modifies, and the conditions that the data satisfy once the procedure has been applied. As a proof of concept, we provide foundational results on sequential applications of procedures abstracted in this way, to achieve prespecified data-quality objectives, for the use case of relational data and for procedures described by standard relational constraints. We show that, while reasoning in this framework may be computationally infeasible in general, there exist well-behaved cases in which these foundational results can be applied in practice for achieving desired data-quality results on Big Data.
doi_str_mv	10.1145/3428154
format	Article
fullrecord	<record><control><sourceid>crossref</sourceid><recordid>TN_cdi_crossref_primary_10_1145_3428154</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_1145_3428154</sourcerecordid><originalsourceid>FETCH-LOGICAL-c187t-32ee57a6e73564555ab4e236011b26bf997b140e70e9cbc4932f400e327b1af13</originalsourceid><addsrcrecordid>eNo9kM1KAzEYRYMoWKv4Ctm5Gs2X38lSarVCoSq6HjLpF43MT01mkL69IxZX93K43MUh5BLYNYBUN0LyEpQ8IjOwQhdgtTj-70qdkrOcPxnTJZcwI5tll8cUu3d65wZHX9BtY4c509An-jy6Jg77iX6NMWGL3ZDpdxw-6AqbHQ2pb-lT6j1ux4TTasx4Tk6CazJeHHJO3u6Xr4tVsd48PC5u14WH0gyF4IjKOI1GKC2VUq6WyIVmADXXdbDW1CAZGobW115awYNkDAWfuAsg5uTq79enPueEodql2Lq0r4BVvx6qgwfxAytLTuw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Ensuring Data Readiness for Quality Requirements with Help from Procedure Reuse</title><source>Access via ACM Digital Library</source><creator>Chirkova, Rada ; Doyle, Jon ; Reutter, Juan</creator><creatorcontrib>Chirkova, Rada ; Doyle, Jon ; Reutter, Juan</creatorcontrib><description>Assessing and improving the quality of data are fundamental challenges in Big-Data applications. These challenges have given rise to numerous solutions targeting transformation, integration, and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably well understood in isolation, not much attention has been given to the interplay between standalone tools in these areas. In this article, we focus on the problem of determining whether the available data-transforming procedures can be used together to bring about the desired quality characteristics of the data in business or analytics processes. 
For example, to help an organization avoid building a data-quality solution from scratch when facing a new analytics task, we ask whether the data quality can be improved by reusing the tools that are already available, and if so, which tools to apply, and in which order, all without presuming knowledge of the internals of the tools, which may be external or proprietary. Toward addressing this problem, we conduct a formal study in which individual data cleaning, data migration, or other data-transforming tools are abstracted as black-box procedures with only some of the properties exposed, such as their applicability requirements, the parts of the data that the procedure modifies, and the conditions that the data satisfy once the procedure has been applied. As a proof of concept, we provide foundational results on sequential applications of procedures abstracted in this way, to achieve prespecified data-quality objectives, for the use case of relational data and for procedures described by standard relational constraints. 
We show that, while reasoning in this framework may be computationally infeasible in general, there exist well-behaved cases in which these foundational results can be applied in practice for achieving desired data-quality results on Big Data.</description><identifier>ISSN: 1936-1955</identifier><identifier>EISSN: 1936-1963</identifier><identifier>DOI: 10.1145/3428154</identifier><language>eng</language><ispartof>ACM journal of data and information quality, 2021-09, Vol.13 (3), p.1-15</ispartof><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c187t-32ee57a6e73564555ab4e236011b26bf997b140e70e9cbc4932f400e327b1af13</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>315,781,785,27929,27930</link.rule.ids></links><search><creatorcontrib>Chirkova, Rada</creatorcontrib><creatorcontrib>Doyle, Jon</creatorcontrib><creatorcontrib>Reutter, Juan</creatorcontrib><title>Ensuring Data Readiness for Quality Requirements with Help from Procedure Reuse</title><title>ACM journal of data and information quality</title><description>Assessing and improving the quality of data are fundamental challenges in Big-Data applications. These challenges have given rise to numerous solutions targeting transformation, integration, and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably well understood in isolation, not much attention has been given to the interplay between standalone tools in these areas. In this article, we focus on the problem of determining whether the available data-transforming procedures can be used together to bring about the desired quality characteristics of the data in business or analytics processes. 
For example, to help an organization avoid building a data-quality solution from scratch when facing a new analytics task, we ask whether the data quality can be improved by reusing the tools that are already available, and if so, which tools to apply, and in which order, all without presuming knowledge of the internals of the tools, which may be external or proprietary. Toward addressing this problem, we conduct a formal study in which individual data cleaning, data migration, or other data-transforming tools are abstracted as black-box procedures with only some of the properties exposed, such as their applicability requirements, the parts of the data that the procedure modifies, and the conditions that the data satisfy once the procedure has been applied. As a proof of concept, we provide foundational results on sequential applications of procedures abstracted in this way, to achieve prespecified data-quality objectives, for the use case of relational data and for procedures described by standard relational constraints. 
We show that, while reasoning in this framework may be computationally infeasible in general, there exist well-behaved cases in which these foundational results can be applied in practice for achieving desired data-quality results on Big Data.</description><issn>1936-1955</issn><issn>1936-1963</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNo9kM1KAzEYRYMoWKv4Ctm5Gs2X38lSarVCoSq6HjLpF43MT01mkL69IxZX93K43MUh5BLYNYBUN0LyEpQ8IjOwQhdgtTj-70qdkrOcPxnTJZcwI5tll8cUu3d65wZHX9BtY4c509An-jy6Jg77iX6NMWGL3ZDpdxw-6AqbHQ2pb-lT6j1ux4TTasx4Tk6CazJeHHJO3u6Xr4tVsd48PC5u14WH0gyF4IjKOI1GKC2VUq6WyIVmADXXdbDW1CAZGobW115awYNkDAWfuAsg5uTq79enPueEodql2Lq0r4BVvx6qgwfxAytLTuw</recordid><startdate>20210930</startdate><enddate>20210930</enddate><creator>Chirkova, Rada</creator><creator>Doyle, Jon</creator><creator>Reutter, Juan</creator><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>20210930</creationdate><title>Ensuring Data Readiness for Quality Requirements with Help from Procedure Reuse</title><author>Chirkova, Rada ; Doyle, Jon ; Reutter, Juan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c187t-32ee57a6e73564555ab4e236011b26bf997b140e70e9cbc4932f400e327b1af13</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Chirkova, Rada</creatorcontrib><creatorcontrib>Doyle, Jon</creatorcontrib><creatorcontrib>Reutter, Juan</creatorcontrib><collection>CrossRef</collection><jtitle>ACM journal of data and information quality</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Chirkova, Rada</au><au>Doyle, Jon</au><au>Reutter, Juan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Ensuring Data 
Readiness for Quality Requirements with Help from Procedure Reuse</atitle><jtitle>ACM journal of data and information quality</jtitle><date>2021-09-30</date><risdate>2021</risdate><volume>13</volume><issue>3</issue><spage>1</spage><epage>15</epage><pages>1-15</pages><issn>1936-1955</issn><eissn>1936-1963</eissn><abstract>Assessing and improving the quality of data are fundamental challenges in Big-Data applications. These challenges have given rise to numerous solutions targeting transformation, integration, and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably well understood in isolation, not much attention has been given to the interplay between standalone tools in these areas. In this article, we focus on the problem of determining whether the available data-transforming procedures can be used together to bring about the desired quality characteristics of the data in business or analytics processes. For example, to help an organization avoid building a data-quality solution from scratch when facing a new analytics task, we ask whether the data quality can be improved by reusing the tools that are already available, and if so, which tools to apply, and in which order, all without presuming knowledge of the internals of the tools, which may be external or proprietary. Toward addressing this problem, we conduct a formal study in which individual data cleaning, data migration, or other data-transforming tools are abstracted as black-box procedures with only some of the properties exposed, such as their applicability requirements, the parts of the data that the procedure modifies, and the conditions that the data satisfy once the procedure has been applied. 
As a proof of concept, we provide foundational results on sequential applications of procedures abstracted in this way, to achieve prespecified data-quality objectives, for the use case of relational data and for procedures described by standard relational constraints. We show that, while reasoning in this framework may be computationally infeasible in general, there exist well-behaved cases in which these foundational results can be applied in practice for achieving desired data-quality results on Big Data.</abstract><doi>10.1145/3428154</doi><tpages>15</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 1936-1955
ispartof	ACM journal of data and information quality, 2021-09, Vol.13 (3), p.1-15
issn	1936-1955 1936-1963
language	eng
recordid	cdi_crossref_primary_10_1145_3428154
source	Access via ACM Digital Library
title	Ensuring Data Readiness for Quality Requirements with Help from Procedure Reuse
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-12T16%3A13%3A10IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Ensuring%20Data%20Readiness%20for%20Quality%20Requirements%20with%20Help%20from%20Procedure%20Reuse&rft.jtitle=ACM%20journal%20of%20data%20and%20information%20quality&rft.au=Chirkova,%20Rada&rft.date=2021-09-30&rft.volume=13&rft.issue=3&rft.spage=1&rft.epage=15&rft.pages=1-15&rft.issn=1936-1955&rft.eissn=1936-1963&rft_id=info:doi/10.1145/3428154&rft_dat=%3Ccrossref%3E10_1145_3428154%3C/crossref%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true