A comparative study on term weighting schemes for text categorization
Main Authors: | , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Summary: | The term weighting scheme, which converts documents into vectors in the term space, is a vital step in automatic text categorization. Previous studies showed that, for the text categorization task, term weighting schemes rather than the kernel functions of SVMs dominate performance. In this paper, we conducted experiments to compare various term weighting schemes with SVM on two widely used benchmark data sets. We also presented a new term weighting scheme, tf.rf, for text categorization. The cross-scheme comparison was performed using McNemar's tests. The controlled experimental results showed that the newly proposed tf.rf scheme is significantly better than the other term weighting schemes. Compared with schemes based on the tf factor alone, the idf factor does not improve, and may even decrease, the term's discriminating power for text categorization. The binary and tf.chi representations significantly underperform the other term weighting schemes. |
---|---|
ISSN: | 2161-4393 2161-4407 |
DOI: | 10.1109/IJCNN.2005.1555890 |