An adaptive bottom up clustering approach for Web news extraction

Bibliographic details
Main authors: Jinlin Chen, Shankar, S., Kelly, A., Gningue, S., Rajaravivarma, R.
Format: Conference proceeding
Language: English
Description
Summary: This paper presents an adaptive bottom-up approach to Web news extraction based on human perception. The approach simulates how a human perceives and identifies Web news information, using an adaptive bottom-up clustering strategy to detect possible news areas. It first detects news areas based on the content function, space continuity, and formatting continuity of news information. It then identifies the detailed news content based on the position, format, and semantics of the detected news areas. Experimental results show that the approach performs much better (on average more than 99% in terms of F1 value) than previous approaches such as the tree edit distance based and visual wrapper based approaches. Furthermore, the approach does not assume that the tested Web pages are built from Web templates, as the tree edit distance based approach requires, nor does it need training sets, as the visual wrapper based approach does. The success of the approach demonstrates the strength of the perception-based Web information extraction methodology and makes it a promising route to automatic information extraction from sources whose presentation is designed for human readers.
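The summary names space continuity and formatting continuity as cues for detecting news areas but gives no algorithmic detail. The following is only a minimal illustrative sketch of a generic bottom-up merge over page blocks, not the authors' method: the `Block` type, the `continuous` test, the `cluster_blocks` function, and the fixed `max_gap` threshold are all hypothetical, and the content-function cue and the adaptive behaviour described in the paper are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical rendered page block (assumption, not from the paper)."""
    text: str
    top: float     # y coordinate of the upper edge, in pixels
    bottom: float  # y coordinate of the lower edge, in pixels
    font: str      # crude stand-in for the block's formatting


def continuous(a: Block, b: Block, max_gap: float) -> bool:
    """True if b follows a closely (space) and shares its font (format)."""
    return 0 <= b.top - a.bottom <= max_gap and a.font == b.font


def cluster_blocks(blocks, max_gap=12.0):
    """Merge vertically adjacent blocks bottom-up into candidate areas."""
    # Start with every block as its own cluster, ordered top to bottom.
    clusters = [[b] for b in sorted(blocks, key=lambda b: b.top)]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters) - 1):
            # Compare the last block of one cluster with the first of the next.
            if continuous(clusters[i][-1], clusters[i + 1][0], max_gap):
                clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
                merged = True
                break  # restart the scan after each merge
    return clusters


page = [
    Block("Menu", 0, 20, "nav"),
    Block("Para 1", 100, 140, "body"),
    Block("Para 2", 146, 200, "body"),
    Block("Footer", 400, 420, "body"),
]
areas = [[b.text for b in c] for c in cluster_blocks(page)]
print(areas)
```

In this toy page the two body paragraphs end up in one candidate area because they are close together and share a font, while the menu and the footer stay separate. A real system would need a richer formatting model and a threshold that adapts to the page rather than a fixed pixel gap.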
ISSN: 2379-1268, 2379-1276
DOI: 10.1109/WOCC.2009.5312904