A comparative study on term weighting schemes for text categorization

The term weighting scheme, which is used to convert documents into vectors in the term spaces, is a vital step in automatic text categorization. The previous studies showed that term weighting schemes dominate the performance rather than the kernel functions of SVMs for the text categorization task....

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Man Lan, Sam-Yuan Sung, Hwee-Boon Low, Chew-Lim Tan
Format: Tagungsbericht
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The term weighting scheme, which is used to convert documents into vectors in the term spaces, is a vital step in automatic text categorization. The previous studies showed that term weighting schemes dominate the performance rather than the kernel functions of SVMs for the text categorization task. In this paper, we conducted experiments to compare various term weighting schemes with SVM on two widely-used benchmark data sets. We also presented a new term weighting scheme tf.rf for text categorization. The cross-scheme comparison was performed by using McNemar's tests. The controlled experimental results showed that the newly proposed tf.rf scheme is significantly better than other term weighting schemes. Compared with schemes related with tf factor alone, the idf factor does not improve or even decrease the term's discriminating power for text categorization. The binary and tf.chi representations significantly underperform the other term weighting schemes.
ISSN:2161-4393
2161-4407
DOI:10.1109/IJCNN.2005.1555890