How data partitioning strategies and subset size influence the performance of an ensemble?
Main authors: | , |
---|---|
Format: | Conference proceedings |
Language: | English |
Subjects: | |
Summary: | When dealing with big data, "divide and conquer" is the most commonly used strategy in practice: a big dataset is partitioned into smaller subsets, each of which can be handled by a single computer or by a node of a cluster or cloud computing system. However, among the many existing partitioning or sampling techniques, it is not clear which one is suitable or how subset size may affect the performance of further analysis. In this paper, after presenting a generic framework for an ensemble approach to learning from big data, we focus on systematically evaluating the effect of partitioning strategies and subset size on ensemble performance. The experimental results demonstrate that the three investigated partitioning/sampling strategies behaved statistically similarly, but that subset size may affect ensemble performance in drastically different ways, which fall into three patterns rather than following the single default perception that bigger is better. |
---|---|
DOI: | 10.1109/BigData.2013.6691732 |