Demeter: An automatic framework for data migration in open data lakes

Bibliographic Details
Published in: Software, practice & experience, 2024-05, Vol.54 (5), p.721-743
Main authors: Kim, Dasol; Han, Jiwoo; Son, Siwoon; Gil, Myeong‐Seon; Moon, Yang‐Sae; Won, Heesun
Format: Article
Language: English
Description
Summary: An open data lake stores various forms and types of open data, and there is an increasing demand to manage raw data in tables rather than files for efficient data exploration and analysis. In this paper, we investigate the data management of open data lakes and identify the limitations of table migration and related problems. First, open data lakes suffer from preprocessing complexity, scale limitation, and platform dependency, which stem from the traditional data management method and the characteristics of open data. Second, existing studies on table migration suffer from lack of scalability, migration incompleteness, and scale limitation. In this work, we present a novel automation framework, called Demeter, which solves the three problems inherent in open data lakes by expanding automation. Specifically, it automates catalog collection and preprocessing tasks to address preprocessing complexity and scale limitation. It also supports platform universality for representative data platforms by automating catalog analysis and the detailed processing logic. Demeter then solves the three problems in table migration by adopting Airbyte, an open‐source ELT platform, and by enhancing automation capability with the Airbyte manager. Through extensive experiments, we verify that Demeter resolves all the problems above and demonstrate its scalability and universality. In addition, Demeter significantly outperforms CKAN by up to 508.5% in automation performance, up to 207.28% in processing time, and up to 917.17% in migration performance. These results indicate that Demeter is an excellent automation framework that increases the utilization of large‐scale open data and supports reliable Internet‐scale migration.
ISSN: 0038-0644, 1097-024X
DOI: 10.1002/spe.3294