Enhanced Regular Expression as a DGL for Generation of Synthetic Big Data
Published in: | Journal of Information Processing Systems 2023, 19(1), 79, pp.1-16 |
---|---|
Main authors: | , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Synthetic data generation is widely used for performance evaluation and functional testing of data-intensive applications, as well as in areas of data analytics such as privacy-preserving data publishing (PPDP) and statistical disclosure limitation/control. A significant amount of research has been conducted on tools and languages for data generation. However, existing tools and languages were developed for specific purposes and are unsuitable for other domains. In this article, we propose a regular expression-based data generation language (DGL) for flexible big data generation. To achieve a general-purpose and powerful DGL, we enhanced standard regular expressions to support data domains, type/format inference, sequence and random generation, probability distributions, and resource references. To implement the proposed language efficiently, we propose caching techniques for both intermediate results and database queries. We evaluated the proposed improvements experimentally. KCI Citation Count: 0 |
---|---|
ISSN: | 1976-913X 2092-805X |
DOI: | 10.3745/JIPS.04.0262 |
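
Illustrative note: the abstract describes using regular expressions as a generation language rather than a matching language. The sketch below shows that basic idea only, under stated assumptions: it is not the article's DGL, it handles only literals, bracketed character classes, `\d`, and `{n}` / `{m,n}` quantifiers, and the names `expand_class`, `parse`, and `generate` are hypothetical. The article's extensions (data domains, type/format inference, probability distributions, resource references, caching) are not modeled here.

```python
import random
import re


def expand_class(body):
    """Expand a character-class body such as 'a-f0-9' into a list of characters."""
    chars = []
    i = 0
    while i < len(body):
        # A range looks like 'a-z': start char, hyphen, end char.
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def parse(pattern):
    """Parse a small regex subset into (choices, min_repeat, max_repeat) items."""
    items = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            choices = expand_class(pattern[i + 1:end])
            i = end + 1
        elif pattern[i:i + 2] == "\\d":
            choices = list("0123456789")
            i += 2
        else:
            choices = [pattern[i]]  # any other character is treated as a literal
            i += 1
        lo = hi = 1
        if i < len(pattern) and pattern[i] == "{":
            end = pattern.index("}", i)
            bounds = pattern[i + 1:end].split(",")
            lo, hi = int(bounds[0]), int(bounds[-1])
            i = end + 1
        items.append((choices, lo, hi))
    return items


def generate(pattern, rng):
    """Produce one random string that the pattern would match."""
    out = []
    for choices, lo, hi in parse(pattern):
        count = rng.randint(lo, hi)  # uniform choice of repetition count
        out.extend(rng.choice(choices) for _ in range(count))
    return "".join(out)


PATTERN = r"[A-Z]{2}-\d{4}-[a-f0-9]{3,6}"
rng = random.Random(42)  # seeded so runs are repeatable
samples = [generate(PATTERN, rng) for _ in range(5)]
for s in samples:
    print(s, bool(re.fullmatch(PATTERN, s)))
```

Each generated string is checked against the same pattern with `re.fullmatch`, which is a simple way to confirm that generation and matching agree for this subset. Characters and repetition counts are drawn uniformly here; a generator with configurable probability distributions, as the abstract mentions, would replace those two draws.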