Feature selection algorithm for text classification based on improved mutual information

Bibliographic details
Published in: 哈尔滨工业大学学报(英文版) 2011-06, Vol.18 (3), p.144-148
Authors: 丛帅 张积宾 徐志明 王宇颖
Format: Article
Language: English
Online access: Full text
Description
Abstract: To address the poor text classification performance obtained with the traditional mutual information (MI) formula, a feature selection algorithm based on improved mutual information is proposed. The algorithm builds on earlier improved MI methods, which enhance the MI value of negative features and take feature frequency into account, and adds the concepts of concentration degree and dispersion degree. Formulas that embody concentration degree and dispersion degree were constructed, and the improved mutual information was implemented on top of them. The resulting feature selection algorithm was applied to a text classifier based on Biomimetic Pattern Recognition and compared with several other feature selection methods. The experimental results showed that the improved mutual information method greatly outperforms traditional mutual information feature selection and also performs better than information gain. By introducing concentration degree and dispersion degree, the improved method greatly improves the performance of the text classification system.
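
For orientation, the traditional MI baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the standard formula MI(t, c) = log(P(t, c) / (P(t) P(c))) estimated from document counts, not the authors' improved method: the abstract does not give the concentration degree and dispersion degree formulas, so they are not reproduced here. The function names `mutual_information` and `select_features` are hypothetical, and taking the maximum score over classes is an assumed aggregation choice.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Return {(term, label): MI} estimated from document-level counts.

    docs is a list of token lists, labels the matching class labels.
    Only (term, label) pairs that co-occur at least once are scored.
    """
    n = len(docs)
    class_count = Counter(labels)   # documents per class
    term_count = Counter()          # documents containing the term
    joint_count = Counter()         # documents of a class containing the term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):    # count each term once per document
            term_count[term] += 1
            joint_count[(term, label)] += 1

    scores = {}
    for (term, label), n_tc in joint_count.items():
        # log(P(t, c) / (P(t) * P(c))) = log(n_tc * n / (n_t * n_c))
        ratio = n_tc * n / (term_count[term] * class_count[label])
        scores[(term, label)] = math.log(ratio)
    return scores


def select_features(docs, labels, k):
    """Keep the k terms with the highest MI over any class (assumed aggregation)."""
    best = {}
    for (term, _), score in mutual_information(docs, labels).items():
        best[term] = max(score, best.get(term, float("-inf")))
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(best, key=lambda t: (-best[t], t))[:k]
```

In a toy corpus where "goal" appears in both sport documents and "match" in only one, this baseline gives the two terms the same score, because the formula ignores how often or how evenly a term occurs within a class. That is the kind of weakness that frequency-aware variants of MI aim to correct.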
ISSN:1005-9113