Towards Resource Efficiency: Practical Insights into Large-Scale Spark Workloads at ByteDance


Bibliographic Details
Published in: Proceedings of the VLDB Endowment, 2024-08, Vol. 17 (12), pp. 3759-3771
Authors: Wu, Yixin, Huang, Xiuqi, Wei, Zhongjia, Cheng, Hang, Xin, Chaohui, Chen, Zuzhi, Chen, Binbin, Wu, Yufei, Wang, Hao, Zhang, Tieying, Shi, Rui, Gao, Xiaofeng, Liang, Yuming, Zhao, Pengwei, Chen, Guihai
Format: Article
Language: English
Online Access: Full text
Description
Abstract: At ByteDance, where we execute over a million Spark jobs and handle 500PB of shuffled data daily, ensuring resource efficiency is paramount for cost savings. However, optimizing resource efficiency in large-scale production environments poses significant challenges. Drawing from our practical experiences, we have identified three key issues critical to addressing resource efficiency in real-world production settings: (1) slow I/Os leading to excessive CPU and memory idleness, (2) coarse-grained resource control causing wastage, and (3) sub-optimal job configurations resulting in low utilization. To tackle these issues, we propose a resource efficiency governance framework for Spark workloads. Specifically, (1) we devise the multi-mechanism shuffle services, including Enhanced External Shuffle Service (ESS) and Cloud Shuffle Service (CSS), where CSS employs a push-based approach to enhance I/O efficiency through sequential reading. (2) We modify the Spark configuration parameter protocol, allowing for fine-grained resource control by introducing several new parameters such as milliCores and memoryBurst, as well as supporting operators with additional spill modes. (3) We design a two-stage configuration autotuning method, comprising rule-based and algorithm-based tuning, providing more reliable Spark configuration optimizations. By deploying these techniques on millions of Spark jobs in production over the last two years, we have achieved over 22% CPU utilization increase, 5% memory utilization increase, and 10% shuffle block time ratio decrease, effectively saving millions of CPU cores and petabytes of memory daily.
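To illustrate the flavor of the rule-based stage of a two-stage configuration autotuner like the one the abstract describes, here is a minimal sketch: it shrinks over-provisioned executor memory and cores based on observed peak usage from past runs. All thresholds, metric names, and the `rule_based_tune` helper are illustrative assumptions for this sketch, not ByteDance's actual rules or interfaces.

```python
# Hypothetical sketch of rule-based Spark config tuning (stage one of a
# two-stage autotuner). Thresholds and metric field names are assumptions.

def rule_based_tune(metrics: dict, config: dict) -> dict:
    """Return a tuned copy of `config`, shrinking allocations that the
    job's observed historical metrics show to be over-provisioned."""
    tuned = dict(config)

    # Rule 1: if peak memory use stays well below the allocation,
    # cut the allocation back to peak usage plus a 20% safety margin.
    peak_mem = metrics["peak_executor_memory_mb"]
    alloc_mem = config["spark.executor.memory_mb"]
    if peak_mem < 0.5 * alloc_mem:
        tuned["spark.executor.memory_mb"] = int(peak_mem * 1.2)

    # Rule 2: if average CPU utilization is low, remove one core
    # per executor (never dropping below one core).
    if metrics["avg_cpu_utilization"] < 0.3 and config["spark.executor.cores"] > 1:
        tuned["spark.executor.cores"] = config["spark.executor.cores"] - 1

    return tuned


# Example: a job that peaked at 2 GB of an 8 GB allocation with 20% CPU use
metrics = {"peak_executor_memory_mb": 2000, "avg_cpu_utilization": 0.2}
config = {"spark.executor.memory_mb": 8192, "spark.executor.cores": 4}
print(rule_based_tune(metrics, config))
```

In the paper's framework, the output of such conservative rules would then seed a second, algorithm-based stage that searches the configuration space more aggressively.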
ISSN: 2150-8097
DOI: 10.14778/3685800.3685804