Adaptive and Robust Query Execution for Lakehouses at Scale

Bibliographic details
Published in: Proceedings of the VLDB Endowment 2024-08, Vol.17 (12), p.3947-3959
Main authors: Xue, Maryann; Bu, Yingyi; Somani, Abhishek; Fan, Wenchen; Liu, Ziqi; Chen, Steven; van Hovell, Herman; Samwel, Bart; Mokhtar, Mostafa; Korlapati, RK; Lam, Andy; Ma, Yunxiao; Ercegovac, Vuk; Li, Jiexing; Behm, Alexander; Li, Yuanjian; Li, Xiao; Krishnamurthy, Sriram; Shukla, Amit; Petropoulos, Michalis; Paranjpye, Sameer; Xin, Reynold; Zaharia, Matei
Format: Article
Language: English
Online access: Full text
Description
Summary: Many organizations have embraced the "Lakehouse" data management paradigm, which involves constructing structured data warehouses on top of open, unstructured data lakes. This approach stands in stark contrast to traditional, closed, relational databases and introduces challenges for the performance and stability of distributed query processors. Firstly, in large-scale, open Lakehouses with uncurated data, high ingestion rates, external tables, or deeply nested schemas, it is often costly or wasteful to maintain perfect and up-to-date table and column statistics. Secondly, inherently imperfect cardinality estimates for conjunctive predicates, joins, and user-defined functions can lead to bad query plans. Thirdly, given the sheer magnitude of data involved, relying strictly on static query plan decisions can result in performance and stability issues such as excessive data movement, substantial disk spillage, or high memory pressure. To address these challenges, this paper presents our design, implementation, evaluation, and practice of the Adaptive Query Execution (AQE) framework, which exploits natural execution pipeline breakers in query plans to collect accurate statistics and re-optimize the plans at runtime for both performance and robustness. On the TPC-DS benchmark, the technique demonstrates a per-query speedup of up to 25×. At Databricks, AQE has been successfully deployed in production for multiple years. It powers billions of queries and ETL jobs that process exabytes of data per day, through key enterprise products such as Databricks Runtime, Databricks SQL, and Delta Live Tables.
ISSN: 2150-8097
DOI: 10.14778/3685800.3685818
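
The summary describes re-optimizing a plan at runtime using statistics collected at pipeline breakers. The following is a minimal toy sketch of that general idea, not the paper's implementation: a join strategy is chosen once from a static estimate and again after one input has been materialized and its actual size is known. All names (`StageStats`, `choose_join_strategy`, `materialize`), the 10 MiB threshold, and the table sizes are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoff for broadcasting a join input; not a value from the paper.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class StageStats:
    """Size of one join input, either estimated up front or measured at runtime."""
    name: str
    size_bytes: int


def choose_join_strategy(left: StageStats, right: StageStats) -> str:
    """Pick a join strategy from whatever input sizes are currently known."""
    if min(left.size_bytes, right.size_bytes) <= BROADCAST_THRESHOLD_BYTES:
        return "broadcast_hash_join"
    return "shuffle_join"


def materialize(rows: list, bytes_per_row: int, name: str) -> StageStats:
    """Simulate a pipeline breaker: the stage output is complete, so its size is exact."""
    return StageStats(name, len(rows) * bytes_per_row)


# Static planning: only (possibly poor) estimates are available.
estimated_orders = StageStats("orders", 5 * 1024**3)
estimated_customers = StageStats("filtered_customers", 2 * 1024**3)
static_plan = choose_join_strategy(estimated_orders, estimated_customers)

# Runtime: the filter turned out to be far more selective than estimated.
filtered_rows = [(i, f"customer-{i}") for i in range(1_000)]
actual_customers = materialize(filtered_rows, bytes_per_row=64, name="filtered_customers")
adaptive_plan = choose_join_strategy(estimated_orders, actual_customers)

print("static plan:  ", static_plan)
print("adaptive plan:", adaptive_plan)
```

In this sketch the decision function is the same in both cases; only its inputs change, from estimates to measured sizes, which is why the second decision can differ from the first.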