Feature reduction using mixture model of directional distributions

Bibliographic Details
Main Authors: Nguyen Duc Thang, Lihui Chen, Chee Keong Chan
Format: Conference Proceedings
Language: English
Subjects:
Online Access: Order full text
Description
Summary: Text data normally has thousands, or even tens of thousands, of features. This causes the well-known "curse of dimensionality" in text clustering. Feature reduction techniques have been proposed to address this problem by transforming the text data into a much lower-dimensional space and thereby improving clustering performance. On the other hand, also due to the high-dimensional nature of text, cosine similarity has proven more suitable than the Euclidean distance metric. This suggests modeling text as directional data. In this paper, we propose a novel feature reduction method based on a probabilistic mixture model of directional distributions. Empirical results on various benchmark datasets show that our method performs comparably with latent semantic analysis (LSA), and much better than standard methods such as document frequency (DF) and term contribution (TC).
DOI:10.1109/ICARCV.2008.4795874
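As a small illustration of why the summary argues for modeling text as directional data, the sketch below (plain NumPy, with toy term counts invented for illustration; this is not the authors' method or data) L2-normalizes term-frequency vectors onto the unit sphere and contrasts cosine similarity with Euclidean distance. Two documents with the same term mix but different lengths point in the same direction, yet are far apart in Euclidean distance.

```python
import numpy as np

# Toy term-frequency matrix: 4 documents x 6 terms (hypothetical counts,
# not taken from the paper's benchmark datasets).
X = np.array([
    [3, 0, 1, 0, 0, 0],
    [6, 0, 2, 0, 0, 0],   # same term profile as doc 0, but twice as long
    [0, 2, 0, 3, 1, 0],
    [0, 0, 0, 0, 2, 4],
], dtype=float)

# The "directional" view: project each document onto the unit hypersphere,
# so that only the direction of the vector (its term mix) matters.
X_dir = X / np.linalg.norm(X, axis=1, keepdims=True)

# Cosine similarity is then simply the dot product of unit vectors.
cos_sim = X_dir @ X_dir.T

# Euclidean distance on the raw counts is dominated by document length.
eucl = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

print("cosine similarity, doc 0 vs doc 1:", cos_sim[0, 1])  # 1.0: identical direction
print("Euclidean distance, doc 0 vs doc 1:", eucl[0, 1])    # ~3.16: large despite same topic
```

Directional mixture models such as the one proposed in the paper operate on these unit-normalized vectors, fitting clusters by direction rather than by absolute position in term space.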