EGA: An Algorithm for Automatic Semi-structured Web Documents Extraction

Bibliographic details
Main authors:	Li, Liyu; Tang, Shiwei; Yang, Dongqing; Wang, Tengjiao; Su, Zhihua
Format:	Conference proceeding
Language:	English
Subjects:	Applied sciences; Computer science, control theory, systems; Exact sciences and technology; Genetic Algorithm; Information Extraction; Machine Learning; Semi-structured Document; Software; XPath
Online access:	Full text

Description
Summary:	With the rapid expansion of the World Wide Web, more and more semi-structured documents appear on the web. In this paper, we study how to extract information from semi-structured web documents using automatically generated wrappers. To automate wrapper generation and data extraction, we develop a novel algorithm, EGA (EPattern Generation Algorithm), that constructs extraction patterns based on the local structural context features of the web documents. These optimal or near-optimal extraction patterns are expressed in the XPath language. Experimental results on RISE and our own data sets confirm the feasibility of our approach.
ISSN:	0302-9743, 1611-3349
DOI:	10.1007/978-3-540-24571-1_69
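
Illustration:	The summary says that the generated extraction patterns are expressed in XPath. The sketch below is not the EGA algorithm, whose details this record does not give. It only shows, under stated assumptions, how an XPath-style pattern could be applied to a semi-structured page once such a pattern exists. The page fragment, the field names, the two patterns, and the `extract` helper are all hypothetical, and the code uses the limited XPath subset of Python's standard `xml.etree.ElementTree` module on well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a
# semi-structured web document (real HTML would need a lenient parser).
PAGE = """
<html><body>
  <table>
    <tr><td class="title">Book A</td><td class="price">10.50</td></tr>
    <tr><td class="title">Book B</td><td class="price">7.25</td></tr>
  </table>
</body></html>
"""

# Hypothetical extraction patterns, one per field, written in the
# XPath subset that ElementTree supports. In the paper such patterns
# are generated automatically; here they are written by hand.
PATTERNS = {
    "title": ".//tr/td[@class='title']",
    "price": ".//tr/td[@class='price']",
}


def extract(page, patterns):
    """Apply each pattern to the page and collect the matched text."""
    root = ET.fromstring(page)
    return {
        field: [(node.text or "").strip() for node in root.findall(path)]
        for field, path in patterns.items()
    }


records = extract(PAGE, PATTERNS)
print(records)
```

Each pattern relies only on local structure (the enclosing row and a cell attribute), which is the kind of context the summary refers to; how EGA searches for such patterns is described in the full text, not here.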