Schema mapping generation in the wild
Published in: | Information systems (Oxford) 2022-02, Vol.104, p.101904, Article 101904 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Schema mappings enable declarative and executable specification of transformations between different schematic representations of application concepts. Most work on mapping generation has assumed that the source and target schemas are well defined, e.g., with declared keys and foreign keys, and that the mapping generation processes exist to support the data engineer in the labour-intensive process of producing a high-quality integration. However, organizations increasingly have access to numerous independently produced datasets, e.g., in a data lake, with a requirement to produce rapid, best-effort integrations, without extensive manual effort. As a result, there is a need to generate mappings in settings without declared relationships, and thus on the basis of inferred profiling data, and over large numbers of sources. Our contributions include a dynamic programming algorithm for exploring the space of potential mappings, and techniques for propagating profiling data through mappings, so that the fitness of candidate mappings can be estimated. The paper also describes how the resulting mappings can be used to populate single and multi-relation target schemas. Experimental results show the effectiveness and scalability of the approach in a variety of synthetic and real-world scenarios.
• A dynamic programming algorithm searches a space of mappings for a set of relations. • The source relations are merged on the basis of relational metadata and profile data. • An inference mechanism for propagating profile data without data materialization. • A fitness function for comparing candidate mappings using inferred profiling data. • A method that populates a multi-relation target schema with constraints. • An empirical evaluation on real-world and (benchmark) synthetic datasets. |
---|---|
ISSN: | 0306-4379 1873-6076 |
DOI: | 10.1016/j.is.2021.101904 |
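
The abstract mentions a dynamic programming search over candidate mappings, guided by profile data that is propagated rather than computed on materialized data. The following is a minimal illustrative sketch of that general idea, not the paper's algorithm: the `Profile` class, the `merge` estimate (join on shared attribute names, keeping the smaller row count), the `fitness` ordering, and the sample relations are all hypothetical assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Inferred profile of a relation or candidate mapping (hypothetical)."""
    attributes: frozenset  # attribute names the candidate exposes
    rows: float            # estimated cardinality, never materialized


def merge(left: Profile, right: Profile) -> Optional[Profile]:
    """Estimate the profile of joining two candidates on shared attributes."""
    if not (left.attributes & right.attributes):
        return None  # no inferred relationship to join on
    # Crude assumption: the join keeps at most the smaller side's rows.
    return Profile(left.attributes | right.attributes,
                   min(left.rows, right.rows))


def fitness(profile: Profile, target: frozenset) -> tuple:
    """Assumed ordering: target coverage first, then estimated rows."""
    return (len(profile.attributes & target), profile.rows)


def best_mapping(sources: dict, target: frozenset):
    """Bottom-up search: the best plan per subset is built from smaller subsets."""
    names = sorted(sources)
    # best[subset] = (propagated profile, merge plan) for that set of relations
    best = {frozenset([n]): (sources[n], n) for n in names}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            subset = frozenset(combo)
            for k in range(1, size // 2 + 1):
                for left_names in combinations(combo, k):
                    left = frozenset(left_names)
                    right = subset - left
                    if left not in best or right not in best:
                        continue  # one side has no joinable plan
                    merged = merge(best[left][0], best[right][0])
                    if merged is None:
                        continue
                    current = best.get(subset)
                    if current is None or (fitness(merged, target)
                                           > fitness(current[0], target)):
                        best[subset] = (merged, (best[left][1], best[right][1]))
    return max(best.values(), key=lambda entry: fitness(entry[0], target))


# Hypothetical source profiles and target attributes.
sources = {
    "customer": Profile(frozenset({"cust_id", "name"}), 100),
    "orders": Profile(frozenset({"order_id", "cust_id", "total"}), 500),
    "product": Profile(frozenset({"sku", "price"}), 50),
}
target = frozenset({"name", "total"})
profile, plan = best_mapping(sources, target)
```

Here `plan` records which relations were merged and `profile` holds the estimate derived for the merged result. A realistic fitness function would draw on richer profiling data (for example inferred keys and value overlap) than this sketch assumes.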