Entropy-based automated wrapper generation for weblog data extraction

Bibliographic details
Published in:	World wide web (Bussum) 2014-07, Vol.17 (4), p.827-846
Authors:	Gkotsis, George, Stepanyan, Karen, Cristea, Alexandra I., Joy, Mike
Format:	Article
Language:	eng
Subjects:	Automated Collection Computer Science Database Management Extraction HTML HyperText Markup Language Information retrieval Information Systems Applications (incl. Internet) Mathematical models Methodology Operating Systems Probability theory
Online access:	Full text

Description
Summary:	This paper proposes a fully automated information extraction methodology for weblogs. The methodology integrates a set of relevant approaches based on the use of web feeds and processing of HTML for the extraction of weblog properties. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a collection of weblogs reporting a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
ISSN:	1386-145X 1573-1413
DOI:	10.1007/s11280-013-0269-6
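
The matching step described in the summary (feed values compared against HTML elements across several posts, then one rule chosen per property) can be illustrated with a small sketch. This is not the authors' model: the names TextPathCollector and derive_rules, the tag.class path notation, and the sample posts are all hypothetical, and the score used here is a plain match frequency across posts, a simple stand-in for the paper's probabilistic scoring. It uses only the Python standard library.

```python
from collections import Counter
from html.parser import HTMLParser


class TextPathCollector(HTMLParser):
    """Record a (path, text) pair for every non-empty text node."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}  # elements with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, e.g. ["div.post", "h1.title"]
        self.found = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.found.append((" > ".join(self.stack), text))


def derive_rules(posts):
    """posts: list of (html, feed_values); feed_values maps property -> feed value.

    Returns property -> (best_path, score), where score is the fraction of
    posts in which that path held exactly the feed value (illustrative only).
    """
    votes = {}
    for html, feed_values in posts:
        parser = TextPathCollector()
        parser.feed(html)
        parser.close()
        for prop, value in feed_values.items():
            target = " ".join(value.split())
            # Count each candidate path at most once per post.
            matched = {path for path, text in parser.found if text == target}
            votes.setdefault(prop, Counter()).update(matched)
    return {
        prop: (counter.most_common(1)[0][0], counter.most_common(1)[0][1] / len(posts))
        for prop, counter in votes.items()
    }


# Hypothetical sample data: two posts from the same blog template.
posts = [
    ("<html><body><div class='post'><h1 class='title'>Hello</h1>"
     "<span class='author'>Ann</span><p>Hello</p></div></body></html>",
     {"title": "Hello", "author": "Ann"}),
    ("<html><body><div class='post'><h1 class='title'>Second post</h1>"
     "<span class='author'>Bob</span></div></body></html>",
     {"title": "Second post", "author": "Bob"}),
]
rules = derive_rules(posts)
print(rules)
```

In the first sample post the title text also appears in a paragraph, so a single post would leave two candidate paths for the title. Counting matches over several posts is what lets the heading path win, which is the intuition behind using multiple posts rather than one.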