Fast scalable selection algorithms for large scale data

Bibliographic details
Main authors:	Thompson, Lee Parnell; Xu, Weijia; Miranker, Daniel P.
Format:	Conference paper
Language:	English
Subjects:	Algorithm design and analysis; Clustering algorithms; Computer science; Educational institutions; Hadoop; Libraries; Map Reduce; Median Finding; Partitioning algorithms; Selection Algorithms; Sorting

Description
Summary:	Selection finding, and its most common form, median finding, is used as a measure of central tendency in problems in biology, databases, and graphics. These problems often call selection finding many times as a subcomponent, so speed is important. The Map/Reduce framework has been shown to be an important tool for creating scalable applications. There are a number of valid implementations of selection algorithms within a Map/Reduce framework, several of which are compared in this paper. As the volume of data increases, subtle theoretical and implementation differences between algorithms can lead to significant differences in practice. An efficient and scalable selection finding method therefore has the potential to benefit a number of applications. This paper compares algorithms that have been redesigned or created for the Map/Reduce framework for the purpose of selection finding, that is, finding the k-th ranked element in an unordered set. It takes concepts from two existing selection algorithms and translates them into a novel method for the Map/Reduce framework, with two variations. Each approach uses a different methodology to reduce the total workload needed for a selection. All the algorithms are compared for scalability and efficiency in a computing cluster environment with up to 256 processing cores. The results show that the methods proposed in this paper outperform several common alternatives for identifying medians with Hadoop, including sorting, Pig, and BinMedian methods. The authors' implementations are available upon request.
DOI:	10.1109/BigData.2013.6691602
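
For readers unfamiliar with the problem stated in the summary, the following is a minimal single-machine sketch of selection finding, that is, returning the k-th ranked element of an unordered collection without fully sorting it. It uses a generic randomized pivot-and-discard approach (quickselect style) and is not the method proposed in the paper; the names `select_kth` and `median`, the fixed `seed` parameter, and the choice of the lower median are assumptions made for illustration only.

```python
import random


def select_kth(values, k, seed=0):
    """Return the k-th smallest element (1-indexed) of an unordered sequence.

    Illustrative only: each round picks a random pivot, counts how many
    candidates fall below or equal to it, and discards the side that
    cannot contain the k-th ranked element.
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    rng = random.Random(seed)  # fixed seed (assumption) keeps runs repeatable
    candidates = list(values)
    while True:
        pivot = rng.choice(candidates)
        lower = [v for v in candidates if v < pivot]
        equal_count = sum(1 for v in candidates if v == pivot)
        if k <= len(lower):
            # The target rank lies strictly below the pivot.
            candidates = lower
        elif k <= len(lower) + equal_count:
            # The target rank falls on the pivot value itself.
            return pivot
        else:
            # The target rank lies above the pivot; shift the rank accordingly.
            k -= len(lower) + equal_count
            candidates = [v for v in candidates if v > pivot]


def median(values):
    """Lower median: the element of rank (n + 1) // 2."""
    return select_kth(values, (len(values) + 1) // 2)
```

Each round only needs counts of elements below and equal to the pivot, which is the kind of per-partition tally that a distributed framework could in principle aggregate; how the paper's algorithms actually divide this work is not described in the record above.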