Feature reduction using mixture model of directional distributions
Format: | Conference paper |
---|---|
Language: | English |
Abstract: | Text data normally has thousands, or even tens of thousands, of features. This causes the well-known "curse of dimensionality" in text clustering. Feature reduction techniques have been proposed to address this problem by transforming the text data into a much lower-dimensional space and improving clustering performance. On the other hand, also owing to the high-dimensional character of text, cosine similarity has been shown to be more suitable than the Euclidean distance metric. This suggests modeling text as directional data. In this paper, we propose a novel feature reduction method based on a probabilistic mixture model of directional distributions. Empirical results on various benchmark datasets show that our method performs comparably with latent semantic analysis (LSA) and much better than standard methods such as document frequency (DF) and term contribution (TC). |
DOI: | 10.1109/ICARCV.2008.4795874 |
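The abstract's premise, that cosine similarity fits high-dimensional text better than Euclidean distance, rests on a simple fact: once documents are L2-normalized onto the unit sphere, cosine similarity is just a dot product and is invariant to document length, which is exactly what makes directional (unit-vector) modeling natural. A minimal sketch of this idea follows; the toy term-count vectors and function names are illustrative, not taken from the paper:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: their dot product
    divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_normalize(v):
    """Project a term-frequency vector onto the unit sphere. After this,
    cosine similarity reduces to a plain dot product, which is why text
    is naturally treated as directional data."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Hypothetical term-count vectors for two documents on the same topic;
# doc2 is simply a longer document with the same term proportions.
doc1 = [3.0, 0.0, 1.0]
doc2 = [6.0, 0.0, 2.0]

# Cosine similarity ignores length and sees identical directions: 1.0
print(cosine_similarity(doc1, doc2))

# Euclidean distance, by contrast, grows with document length even
# though the topical content is the same.
print(math.dist(doc1, doc2))
```

Note how the Euclidean distance between the two topically identical documents is nonzero purely because of length, while the cosine similarity is exactly 1; mixture models of directional distributions (such as von Mises-Fisher mixtures) operate on the normalized vectors directly.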