Improving I/O Efficiency in Hadoop-Based Massive Data Analysis Programs

Bibliographic Details
Published in:	Scientific programming 2018-01, Vol.2018 (2018), p.1-9
Main Authors:	Lee, Kyong-Ha; Suh, Young-Kyoon; Kang, Woo Lam
Format:	Article
Language:	eng
Subjects:	Algorithms; Big Data; Data analysis; Data compression; Data management; Data processing; Efficiency; Endowment; Fault tolerance; International conferences; Parallel processing
Online Access:	Full text

Description
Summary:	Apache Hadoop has been a popular parallel processing tool in the era of big data. While practitioners have rewritten many conventional analysis algorithms to make them customized to Hadoop, the issue of inefficient I/O in Hadoop-based programs has been repeatedly reported in the literature. In this article, we address the problem of the I/O inefficiency in Hadoop-based massive data analysis by introducing our efficient modification of Hadoop. We first incorporate a columnar data layout into the conventional Hadoop framework, without any modification of the Hadoop internals. We also provide Hadoop with indexing capability to save a huge amount of I/O while processing not only selection predicates but also star-join queries that are often used in many analysis tasks.
ISSN:	1058-9244 1875-919X
DOI:	10.1155/2018/2682085