GML: Efficiently Auto-Tuning Flink's Configurations Via Guided Machine Learning

The increasingly popular fused batch-streaming big data framework, Apache Flink, has many performance-critical as well as untamed configuration parameters. However, how to tune them for optimal performance has not yet been explored. Machine learning (ML) has been chosen to tune the configurations fo...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on parallel and distributed systems 2021-12, Vol.32 (12), p.2921-2935
Hauptverfasser: Guo, Yijin, Shan, Huasong, Huang, Shixin, Hwang, Kai, Fan, Jianping, Yu, Zhibin
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The increasingly popular fused batch-streaming big data framework, Apache Flink, has many performance-critical as well as untamed configuration parameters. However, how to tune them for optimal performance has not yet been explored. Machine learning (ML) has been chosen to tune the configurations for other big data frameworks (e.g., Apache Spark), showing significant performance improvements. However, it needs a long time to collect a large amount of training data by nature. In this article, we propose a guided machine learning (GML) approach to tune the configurations of Flink with significantly shorter time for collecting training data compared to traditional ML approaches. GML innovates two techniques. First, it leverages generative adversarial networks (GANs) to generate a part of training data, reducing the time needed for training data collection. Second, GML guides a ML algorithm to select configurations that the corresponding performance is higher than the average performance of random configurations. We evaluate GML on a lab cluster with 4 servers and a real production cluster in an internet company. The results show that GML significantly outperforms the state-of-the-art, DAC (Datasize-Aware-Configuration) (Z. Yu et al. 2018) for tuning the configurations of Spark, with 2.4× of reduced data collection time but with 30 percent reduced 99th percentile latency. When GML is used in the internet company, it reduces the latency by up to 57.8× compared to the configurations made by the company.
ISSN:1045-9219
1558-2183
DOI:10.1109/TPDS.2021.3081600