A template-based method for theme information extraction from web pages

Bibliographic details
Main authors: Gui-Sheng Yin, Guang-Dong Guo, Jing-Jing Sun
Format: Conference proceeding
Language: English
Description
Summary: Web page templates combined with DOM technology can effectively extract simple structured information from web pages. Building on previous research, this paper presents a new method for inducing web page templates. The induced templates can contain the various layout elements of web pages. The main research contents are an edit-distance-based method for judging DOM document similarity, clustering methods that focus on web page structure, methods for extracting web page templates, and the implementation of an information extraction engine.
ISSN: 2161-9069
DOI: 10.1109/ICCASM.2010.5620763
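
The summary names edit distance as the basis for judging DOM document similarity, but this record does not give the paper's algorithm. The following is only a minimal sketch of the general idea, under a simplifying assumption: each page is flattened to its sequence of opening tags, and a plain Levenshtein distance is computed over those sequences rather than over the trees themselves. The names `tag_sequence`, `edit_distance`, and `dom_similarity` are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening-tag names in document order (assumed simplification)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Flatten an HTML string into its list of opening-tag names."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def dom_similarity(html_a, html_b):
    """Similarity in [0, 1]; 1.0 means identical tag sequences."""
    a, b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


page_a = "<html><body><div><h1>A</h1><p>x</p></div></body></html>"
page_b = "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>"
print(dom_similarity(page_a, page_b))
```

A pairwise score of this kind could serve as the distance measure for grouping structurally similar pages before a template is induced from each group. A tree edit distance would respect nesting more faithfully than this flattened version.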