Composite Feature Extraction and Selection for Text Classification

Bibliographic Details
Published in: IEEE Access 2019, Vol. 7, p. 35208-35219
Main authors: Wan, Chuan; Wang, Yuling; Liu, Yaoze; Ji, Jinchao; Feng, Guozhong
Format: Article
Language: English
Description
Abstract: Although words are the basic semantic units of text, phrases and expressions carry additional information that is important for text classification. To capture this information, traditional algorithms extract composite features from word sequences or co-occurrences, such as bigrams and termsets, but ignore the influence of stop words and punctuation, which produces a large number of weak features. In this paper, we propose a text structure-based algorithm to extract composite features: termsets that cross punctuation marks or stop words in the text are excluded. To eliminate redundancy, a novel discriminative measure with two factors is suggested. One factor measures relevancy, while the other increases the values of composite features whose class frequencies are much smaller than those of their sub-features. Experiments on three benchmark datasets with both a support vector machine and a naive Bayes classifier illustrate the effectiveness of the approach.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2019.2904602
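
The extraction idea stated in the abstract (discard termsets that cross punctuation marks or stop words) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word list, the tokenization rule, the restriction to 2-termsets, and the names `split_segments`, `composite_features`, and `max_gap` are all assumptions made for illustration. The two-factor discriminative measure is not sketched, because this record does not give its formula.

```python
import re

# Illustrative stop-word list (an assumption; the record does not name one).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or",
              "is", "are", "to", "for", "with"}

# A token is either a run of letters/digits or a single punctuation character.
TOKEN_RE = re.compile(r"(?P<word>[a-z0-9]+)|(?P<punct>[^\sa-z0-9])")


def split_segments(text, stop_words=STOP_WORDS):
    """Split text into runs of content words bounded by punctuation or stop words."""
    segments, current = [], []
    for match in TOKEN_RE.finditer(text.lower()):
        token = match.group()
        if match.lastgroup == "word" and token not in stop_words:
            current.append(token)
        else:
            # Punctuation or a stop word closes the current segment.
            if current:
                segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def composite_features(text, max_gap=1):
    """Return 2-termsets (unordered word pairs) that stay inside one segment.

    max_gap=1 pairs only adjacent words; larger values also pair words
    that are up to max_gap positions apart within the same segment.
    """
    features = set()
    for segment in split_segments(text):
        for i, first in enumerate(segment):
            for second in segment[i + 1 : i + 1 + max_gap]:
                if first != second:
                    features.add(frozenset((first, second)))
    return features


if __name__ == "__main__":
    sample = "Support vector machines, and the naive Bayes classifier, work well."
    for pair in sorted(sorted(p) for p in composite_features(sample)):
        print(pair)
```

In the sample sentence, "machines" and "naive" are separated by a comma and two stop words, so no pair joins them, while pairs such as "naive"/"bayes" inside a single segment are kept. Using a set of `frozenset` pairs treats a termset as unordered, in contrast to an ordered bigram.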