Web Spam Detection by Learning from Small Labeled Samples

Bibliographic Details
Published in:	International Journal of Computer Applications 2012-01, Vol.50 (21), p.1-5
Main authors:	Karimpour, Jaber; Noroozi, Ali A; Alizadeh, Somayeh
Format:	Article
Language:	English
Subjects:	Algorithms; Classification; Learning; Search engines; Spamming; Training; Websites; World Wide Web
Online access:	Full text

Description
Abstract:	Web spamming tries to deceive search engines into ranking some pages higher than they deserve. Many methods have been proposed to combat web spamming and to detect spam pages. One basic method is classification, i.e., learning a classification model from previously labeled training data and using this model to classify web pages as spam or non-spam. A drawback of this method is that manually labeling a large number of web pages to generate the training data can be biased, inaccurate, labor-intensive, and time-consuming. In this paper, we propose a new method that resolves this drawback by using semi-supervised learning to label the training data automatically. To do this, we incorporate the Expectation-Maximization algorithm, an efficient and important algorithm for semi-supervised learning. Experiments carried out on real web spam data show that the new method performs very well in practice.
ISSN:	0975-8887
DOI:	10.5120/7924-0993
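
The abstract describes using Expectation-Maximization to label training pages automatically from a small labeled sample. The record does not say which base classifier or page features the paper uses, so the following is only a minimal sketch under stated assumptions: a Bernoulli naive Bayes model, three hypothetical binary features, and made-up toy data. All function and feature names are illustrative and are not taken from the paper.

```python
import math

def fit_nb(X, resp, alpha=1.0):
    """M-step: fit a Bernoulli naive Bayes model from soft labels.
    resp[i] is the current estimate of P(spam | page i)."""
    n, d = len(X), len(X[0])
    w_spam = sum(resp)
    w_ham = n - w_spam
    # Laplace smoothing (alpha) keeps every probability strictly inside (0, 1).
    prior_spam = (w_spam + alpha) / (n + 2 * alpha)
    theta_spam, theta_ham = [], []
    for j in range(d):
        s = sum(r * x[j] for x, r in zip(X, resp))
        h = sum((1 - r) * x[j] for x, r in zip(X, resp))
        theta_spam.append((s + alpha) / (w_spam + 2 * alpha))
        theta_ham.append((h + alpha) / (w_ham + 2 * alpha))
    return prior_spam, theta_spam, theta_ham

def posterior_spam(x, model):
    """Return P(spam | x) under the model, computed in log space."""
    prior, theta_spam, theta_ham = model
    log_s, log_h = math.log(prior), math.log(1 - prior)
    for xj, a, b in zip(x, theta_spam, theta_ham):
        log_s += math.log(a if xj else 1 - a)
        log_h += math.log(b if xj else 1 - b)
    m = max(log_s, log_h)
    e_s, e_h = math.exp(log_s - m), math.exp(log_h - m)
    return e_s / (e_s + e_h)

def em_label(X_lab, y_lab, X_unl, n_iter=20):
    """Semi-supervised EM: start from the labeled pages only, then
    alternate soft-labeling the unlabeled pages and refitting."""
    fixed = [float(y) for y in y_lab]      # labeled pages keep their labels
    model = fit_nb(X_lab, fixed)
    X_all = X_lab + X_unl
    for _ in range(n_iter):
        # E-step: estimate P(spam) for each unlabeled page.
        resp = fixed + [posterior_spam(x, model) for x in X_unl]
        # M-step: refit on labeled and soft-labeled pages together.
        model = fit_nb(X_all, resp)
    return model, [posterior_spam(x, model) for x in X_unl]

# Hypothetical binary features per page (assumptions, not from the paper):
# [has_hidden_text, high_keyword_density, many_outlinks]
X_lab = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 1]]
y_lab = [1, 1, 0, 0]                       # 1 = spam, 0 = non-spam
X_unl = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]

model, probs = em_label(X_lab, y_lab, X_unl)
auto_labels = [1 if p > 0.5 else 0 for p in probs]
```

In this sketch the labeled pages stay clamped to their given labels in every iteration, so the small labeled sample anchors the model while the unlabeled pages only contribute soft, probability-weighted counts. Thresholding `probs` at 0.5 to obtain `auto_labels` is one simple choice; a stricter threshold would trade coverage for label reliability.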