Data chaos: An entropy based MapReduce framework for scalable learning

Bibliographic Details
Authors: Jiaoyan Chen, Huajun Chen, Xi Chen, Guozhou Zheng, Zhaohui Wu
Format: Conference paper
Language: English
Abstract: The chaos of data is the total unpredictability of all the data elements and can be quantified by Shannon entropy. In this paper, we first propose an entropy-based theoretical framework for machine learning, which states that chaos in the sample data decreases and rules advance as learning progresses. However, applying the theoretical framework directly is usually time consuming, because groups of rules have to be trained iteratively and the data chaos recalculated in each iteration. To implement the theoretical framework for scalable learning, we propose a MapReduce-based distributed computational framework. In a classification case study, the framework trains multiple classifiers in parallel and calculates the chaos of the sample set in each iteration, then resamples a small subset with the highest entropy for training in the next iteration, reducing the chaos in the sample data as quickly as possible. On typical classification benchmarks, our experiments report the entropy of the sample data and show that the theoretical framework is rational and can help improve the accuracy of machine learning. The computational framework also shows high efficiency and scalability for large-scale learning on a Hadoop cluster.
DOI: 10.1109/BigData.2013.6691736
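
For reference, the Shannon entropy that the abstract uses to quantify data chaos, written here for a sample set S whose elements fall into k classes with proportions p_1, ..., p_k (an assumption about the exact quantity the paper measures, which this record does not spell out), is

H(S) = -\sum_{c=1}^{k} p_c \log_2 p_c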
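
The abstract also outlines the per-iteration control flow: train several classifiers in parallel, score the samples by their chaos, and resample the highest-entropy subset for the next round. Below is a minimal single-machine Python sketch of that loop, assuming chaos is measured per sample as the entropy of a classifier committee's vote distribution; all names (shannon_entropy, committee_votes, entropy_resampling_loop, train_fn, subset_size) are illustrative, and the authors' actual system runs the training and entropy calculation as MapReduce jobs on Hadoop rather than in-process.

import math

def shannon_entropy(votes):
    # Entropy (in bits) of the class-vote distribution collected for one sample.
    total = sum(votes.values())
    return -sum((v / total) * math.log2(v / total) for v in votes.values() if v)

def committee_votes(classifiers, x):
    # Tally the labels predicted for sample x by each classifier in the committee.
    votes = {}
    for clf in classifiers:
        label = clf(x)
        votes[label] = votes.get(label, 0) + 1
    return votes

def entropy_resampling_loop(samples, train_fn, n_classifiers=4,
                            subset_size=100, n_iterations=5):
    # Iteratively train a committee, score every sample by the entropy of the
    # committee's votes (its "chaos"), and keep the highest-entropy subset as
    # the training set for the next iteration.
    training_set = list(samples)
    classifiers = []
    for it in range(n_iterations):
        # In the paper's setting this step is distributed; here the
        # classifiers are trained sequentially for illustration.
        classifiers = [train_fn(training_set) for _ in range(n_classifiers)]
        scored = [(shannon_entropy(committee_votes(classifiers, x)), x)
                  for x in samples]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        training_set = [x for _, x in scored[:subset_size]]
        mean_entropy = sum(e for e, _ in scored) / len(scored)
        print("iteration %d: mean sample entropy = %.3f" % (it, mean_entropy))
    return classifiers

In use, train_fn would wrap whatever base learner is actually trained on the cluster (the record does not name one), and a real deployment would express the committee training and the per-sample entropy scoring as map tasks whose outputs are reduced into the resampled subset; the sequential loop above only mirrors the control flow described in the abstract.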