Lenses: an on-demand approach to ETL

Three mentalities have emerged in analytics. One view holds that reliable analytics is impossible without high-quality data, and relies on heavy-duty ETL processes and upfront data curation to provide it. The second view takes a more ad-hoc approach, collecting data into a data lake, and placing res...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Proceedings of the VLDB Endowment 2015-08, Vol.8 (12), p.1578-1589
Hauptverfasser: Yang, Ying, Meneghetti, Niccolò, Fehling, Ronny, Liu, Zhen Hua, Kennedy, Oliver
Format: Artikel
Sprache:eng
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Three mentalities have emerged in analytics. One view holds that reliable analytics is impossible without high-quality data, and relies on heavy-duty ETL processes and upfront data curation to provide it. The second view takes a more ad-hoc approach, collecting data into a data lake, and placing responsibility for data quality on the analyst querying it. A third, on-demand approach has emerged over the past decade in the form of numerous systems like Paygo or HLog, which allow for incremental curation of the data and help analysts to make principled trade-offs between data quality and effort. Though quite useful in isolation, these systems target only specific quality problems (e.g., Paygo targets only schema matching and entity resolution). In this paper, we explore the design of a general, extensible infrastructure for on-demand curation that is based on probabilistic query processing. We illustrate its generality through examples and show how such an infrastructure can be used to gracefully make existing ETL workflows "on-demand". Finally, we present a user interface for On-Demand ETL and address ensuing challenges, including that of efficiently ranking potential data curation tasks. Our experimental results show that On-Demand ETL is feasible and that our greedy ranking strategy for curation tasks, called CPI, is effective.
ISSN:2150-8097
2150-8097
DOI:10.14778/2824032.2824055