Deep Clustering for Data Cleaning and Integration
Main Authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Deep Learning (DL) techniques now constitute the state of the art for
important problems in areas such as text and image processing, and there have
been impactful results that deploy DL in several data management tasks. Deep
Clustering (DC) has recently emerged as a sub-discipline of DL, in which data
representations are learned in tandem with clustering, with a view to
automatically identifying the features of the data that lead to improved
clustering results. While DC has been used to good effect in several domains,
particularly in image processing, the impact of DC on mainstream data
management tasks remains unexplored. In this paper, we address this gap by
investigating the impact of DC in data cleaning and integration tasks,
specifically schema inference, entity resolution, and domain discovery, tasks
that represent clustering from the perspective of tables, rows, and columns,
respectively. In this setting, we compare and contrast several DC and non-DC
clustering algorithms using standard benchmarks. The results show, among other
things, that the most effective DC algorithms consistently outperform non-DC
clustering algorithms for data integration tasks. However, we observed a
significant correlation between the DC method and embedding approaches for
rows, columns, and tables, highlighting that a suitable combination can
enhance the efficiency of DC methods. |
---|---|
DOI: | 10.48550/arxiv.2305.13494 |