LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes


Bibliographic details
Published in: Proceedings of the VLDB Endowment 2024-04, Vol.17 (8), p.1925-1938
Main authors: Deng, Yuhao; Chai, Chengliang; Cao, Lei; Yuan, Qin; Chen, Siyuan; Yu, Yanrui; Sun, Zhaoze; Wang, Junyi; Li, Jiajun; Cao, Ziqi; Jin, Kaisen; Zhang, Chi; Jiang, Yuqing; Zhang, Yuanfang; Wang, Yuping; Yuan, Ye; Wang, Guoren; Tang, Nan
Format: Article
Language: eng
Online access: Full text
Description
Summary: Discovering tables from poorly maintained data lakes is a significant challenge in data management. Two key tasks are identifying joinable and unionable tables, which are crucial for data integration, analysis, and machine learning. However, there is no comprehensive benchmark for evaluating existing methods. To address this, we introduce LakeBench, a large-scale table discovery benchmark. It evaluates the effectiveness, efficiency, and scalability of table join and union search methods. With over 16 million real tables, LakeBench is 1,600X larger than existing datasets and 100X larger in storage size. It includes synthesized and real queries with ground truth, totaling more than 10 thousand queries, 10X more than used in any existing evaluation. We spent over 7,500 human hours labeling these queries and constructing diverse query categories for thorough evaluation. Our benchmark thoroughly evaluates state-of-the-art table discovery methods, providing insights into their performance and highlighting research opportunities.
ISSN:2150-8097
DOI:10.14778/3659437.3659448