A two-stage discretization algorithm based on information entropy

Discretization is an important and difficult preprocessing task for data mining and knowledge discovery. Although there are numerous discretization approaches, many suffer from certain drawbacks. Local approaches are efficient, but their generalization ability is weak. Global approaches consider all...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Applied intelligence (Dordrecht, Netherlands) Netherlands), 2017-12, Vol.47 (4), p.1169-1185
Hauptverfasser:	Wen, Liu-Ying, Min, Fan, Wang, Shi-Yuan
Format:	Artikel
Sprache:	eng
Schlagworte:	Algorithms Artificial Intelligence Computer Science Data mining Discretization Entropy Entropy (Information theory) Machines Manufacturing Mechanical Engineering Preprocessing Processes
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Discretization is an important and difficult preprocessing task for data mining and knowledge discovery. Although there are numerous discretization approaches, many suffer from certain drawbacks. Local approaches are efficient, but their generalization ability is weak. Global approaches consider all attributes simultaneously, but they have high time and space complexities. In this paper, we propose a two-stage discretization (TSD) algorithm based on information entropy. In the local discretization stage, we independently select k strong cuts for each attribute to minimize conditional entropy. The goal is to rapidly reduce the cardinality of the attributes, with minor information loss. In the global discretization stage, cuts for all attributes are considered simultaneously to form a scaled decision system. The minimal cut set that preserves the positive region is finally selected. We tested the new algorithm and seven popular algorithms on 28 datasets. Compared with other approaches, our algorithm has the best generalization ability, with a good information preserving ability, the highest classification accuracy, and reasonable time consumption.
ISSN:	0924-669X 1573-7497
DOI:	10.1007/s10489-017-0941-0