Hadoop Performance Prediction Model Based on Random Forest

Bibliographic details
Published in:	ZTE Communications 2013-06, Vol.11 (2), p.38-44
Main authors:	Bei, Z, Yu, Z, Zhang, H, Xu, C, Feng, S, Dong, Z
Format:	Article
Language:	English
Subjects:	Algorithms; Forests; Freeware; Mathematical models; Performance prediction; Programming; Workload; 系统; 性能预测; 机器学习算法; 森林; 模型; 基测试套件; 配置参数; 随机
Online access:	Full text

Description
Abstract:	MapReduce is a programming model for processing large data sets, and Hadoop is the most popular open-source implementation of MapReduce. To achieve high performance, up to 190 Hadoop configuration parameters must be manually tuned. This is not only time-consuming but also error-prone. In this paper, we propose a new performance model based on random forest, a recently developed machine-learning algorithm. The model, called RFMS, is used to predict the performance of a Hadoop system according to the system's configuration parameters. RFMS is created from 2000 distinct fine-grained performance observations with different Hadoop configurations. We test RFMS against the measured performance of representative workloads from the Hadoop Micro-benchmark suite. The results show that the prediction accuracy of RFMS averages 95% and reaches up to 99%. This new, highly accurate prediction model can be used to automatically optimize the performance of Hadoop systems.
ISSN:	1673-5188
DOI:	10.3969/j.issn.1673-5188.2013.02.006