Bigflow: A General Optimization Layer for Distributed Computing Frameworks

As data volumes grow rapidly, distributed computations are widely employed in data-centers to provide cheap and efficient methods to process large-scale parallel datasets. Various computation models have been proposed to improve the abstraction of distributed datasets and hide the details of paralle...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Journal of computer science and technology 2020-03, Vol.35 (2), p.453-467
Hauptverfasser: Zhang, Yun-Cong, Wang, Xiao-Yang, Wang, Cong, Xu, Yao, Zhang, Jian-Wei, Lin, Xiao-Dong, Sun, Guang-Yu, Zheng, Gong-Lin, Yin, Shan-Hui, Ye, Xian-Jin, Li, Li, Song, Zhan, Miao, Dong-Dong
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:As data volumes grow rapidly, distributed computations are widely employed in data-centers to provide cheap and efficient methods to process large-scale parallel datasets. Various computation models have been proposed to improve the abstraction of distributed datasets and hide the details of parallelism. However, most of them follow the single-layer partitioning method, which limits developers to express a multi-level partitioning operation succinctly. To overcome the problem, we present the NDD (Nested Distributed Dataset) data model. It is a more compact and expressive extension of Spark RDD (Resilient Distributed Dataset), in order to remove the burden on developers to manually write the logic for multi-level partitioning cases. Base on the NDD model, we develop an open-source framework called Bigflow, which serves as an optimization layer over computation engines from most widely used processing frameworks. With the help of Bigflow, some advanced optimization techniques, which may only be applied by experienced programmers manually, are enabled automatically in a distributed data processing job. Currently, Bigflow is processing about 3 PB data volumes daily in the data-centers of Baidu. According to customer experience, it can significantly save code length and improve performance over the intuitive programming style.
ISSN:1000-9000
1860-4749
DOI:10.1007/s11390-020-9702-3