How data partitioning strategies and subset size influence the performance of an ensemble?

When dealing with big data, "divide and conquer" is the most commonly used strategy in practice to partition a big dataset into such smaller subsets that each subset can be handled by a computer or a node of cluster or cloud computing systems. However, among many existing partitioning or s...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Farrash, Majed, Wenjia Wang
Format: Tagungsbericht
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:When dealing with big data, "divide and conquer" is the most commonly used strategy in practice to partition a big dataset into such smaller subsets that each subset can be handled by a computer or a node of cluster or cloud computing systems. However, among many existing partitioning or sampling techniques, it is not clear which one is suitable and how the size of subset may affect the performance of further analysis. In this paper, after presenting a generic framework of ensemble approach for learning from big data, we focus our investigations on systematically evaluating the effect of partitioning strategies and subset size on ensemble performance. The experimental results have demonstrated that three investigated partitioning / sampling strategies behaved statistically similar but the subset size may affect the performance of the ensemble in very drastically different ways, which are grouped into three patterns, rather than just one default perception - the bigger the better.
DOI:10.1109/BigData.2013.6691732