EGA:An Algorithm for Automatic Semi-structured Web Documents Extraction
With the fast expansion of World Wide Web, more and more semi-structured web documents appear on the web. In this paper, we study how to extract information from the semi-structured web documents by automatically generated wrappers. To automate the wrapper generation and the data extraction process,...
Gespeichert in:
Hauptverfasser: | , , , , |
---|---|
Format: | Tagungsbericht |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | With the fast expansion of World Wide Web, more and more semi-structured web documents appear on the web. In this paper, we study how to extract information from the semi-structured web documents by automatically generated wrappers. To automate the wrapper generation and the data extraction process, we develop a novel algorithm EGA (EPattern Generation Algorithm) to conduct the extraction pattern based on the local structural context features of the web documents. These optimal or near optimal extraction patterns are described in XPath language. Experimental results on RISE and our own data sets confirm the feasibility of our approach. |
---|---|
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/978-3-540-24571-1_69 |