Web page repetitive structure and URL feature based Deep Web data extraction

Bibliographic details
Main authors: Xingyi Li, Yanyan Kong, Huaji Shi
Format: Conference proceeding
Language: English
Description
Summary: Noise interference in web pages and the need for multiple sample pages are the key issues in Deep Web data extraction. In this paper, we propose a novel approach to Deep Web data extraction based on the repetitive structure of web pages and on URL features. It uses continuous repetitive tag regions and similar URLs to partition the sample page into blocks, locate the data region, and extract a specific URL template, which is then used to quickly identify the data region and the boundaries of data records in similar pages. Experimental results show that our approach is highly effective for Deep Web data extraction.
DOI: 10.1109/ICCSNA.2010.5588744
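
The summary mentions extracting a URL template from similar URLs and reusing it to find data records. This record does not give the paper's algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: the similarity rule (same path and query shape with digit runs abstracted), the names `LinkCollector`, `url_template`, and `dominant_template`, and the sample page are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def url_template(url):
    """Assumed similarity rule: keep path and query, abstract digit runs."""
    parts = urlsplit(url)
    shape = parts.path + ("?" + parts.query if parts.query else "")
    return re.sub(r"\d+", "{n}", shape)


def dominant_template(html):
    """Return the most frequent URL template on a page and the links matching it."""
    collector = LinkCollector()
    collector.feed(html)
    counts = Counter(url_template(u) for u in collector.links)
    if not counts:
        return None, []
    template, _ = counts.most_common(1)[0]
    return template, [u for u in collector.links if url_template(u) == template]


# Hypothetical sample page: three record links share one template, one link is noise.
sample_page = """
<ul>
  <li><a href="/item?id=101">Record A</a></li>
  <li><a href="/item?id=102">Record B</a></li>
  <li><a href="/item?id=103">Record C</a></li>
</ul>
<a href="/about">About</a>
"""

template, record_links = dominant_template(sample_page)
print(template)
print(record_links)
```

In this sketch, the template learned from one sample page could be applied to similar pages by keeping only the links whose `url_template` value matches it. The block partitioning by repetitive tag regions that the summary also describes is not modeled here.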