A Study on Automatic Web Pages Categorization

Bibliographic details
Main authors: Sun Bo, Sun Qiurui, Chen Zhong, Fu Zengmei
Format: Conference proceeding
Language: English
Description
Summary: Since the Internet has become a huge repository of information, many studies address the issue of web page categorization. For web page classification, we want to find a subset of words that helps discriminate between different kinds of web pages, so we introduce feature selection. In this paper, we study feature selection methods such as ReliefF and Symmetrical Uncertainty (SU). The high-dimensional text vocabulary space is one of the main challenges of web page classification. We use Hidden Naive Bayes (HNB), Complement Class Naive Bayes, and other traditional techniques for web page classification. Results on a benchmark dataset show that HNB performs better than the other methods and that SU is more competitive than ReliefF for selecting relevant words in web page categorization.
DOI:10.1109/IADCC.2009.4809225
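
The summary names Symmetrical Uncertainty (SU) as a criterion for selecting relevant words. As a rough illustration only, and not the authors' implementation, the sketch below computes the standard definition SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) for one discrete word feature against the class labels. The function names and the toy data are hypothetical and do not come from the paper or its benchmark dataset.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def symmetrical_uncertainty(feature, labels):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x = entropy(feature)
    h_y = entropy(labels)
    if h_x + h_y == 0:
        # Both variables are constant, so there is nothing to measure.
        return 0.0
    # Joint entropy H(X, Y) from the paired observations.
    h_xy = entropy(list(zip(feature, labels)))
    mutual_info = h_x + h_y - h_xy
    return 2.0 * mutual_info / (h_x + h_y)


# Hypothetical toy data: four pages, two classes, two binary word features.
labels = ["sports", "sports", "news", "news"]
word_goal = [1, 1, 0, 0]  # occurs only on "sports" pages
word_the = [1, 0, 1, 0]   # occurs equally often in both classes

su_goal = symmetrical_uncertainty(word_goal, labels)
su_the = symmetrical_uncertainty(word_the, labels)
print(su_goal, su_the)
```

Under this definition, a word that separates the classes perfectly scores close to 1 and a word that is independent of the class scores close to 0, so ranking words by SU and keeping the top-scoring ones is one plausible way to shrink the vocabulary before training a classifier.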