Scalable Attribute-Value Extraction from Semi-structured Text

Bibliographic Details
Main Authors:	Yuk Wah Wong, Widdows, D., Lokovic, T., Nigam, K.
Format:	Conference Proceeding
Language:	English
Subjects:	Cloud computing; Clustering algorithms; Computer networks; Conferences; Costs; Data mining; Data processing; Decision trees; Machine learning algorithms; Training data

Description
Summary:	This paper describes a general methodology for extracting attribute-value pairs from Web pages. It consists of two phases: candidate generation, in which syntactically likely attribute-value pairs are annotated; and candidate filtering, in which semantically improbable annotations are removed. We describe three types of candidate generators and two types of candidate filters, all of which are designed to be massively parallelizable. Our methods can handle 1 billion Web pages in less than 6 hours with 1,000 machines. The best generator and filter combination achieves 70% F-measure compared to a hand-annotated corpus.
ISSN:	2375-9232 2375-9259
DOI:	10.1109/ICDMW.2009.81
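
The two-phase structure named in the summary (generate candidates, then filter them) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record does not say which generators or filters the paper uses, so the "Attribute: value" line pattern stands in for a syntactic generator and a cross-page frequency threshold stands in for a semantic filter. The names generate_candidates, filter_candidates, and min_pages are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (attribute, value)

# Assumed syntactic cue: a short alphabetic label, a colon, then a value.
# This is a stand-in pattern, not one of the paper's generators.
_COLON_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z ]{0,30}?)\s*:\s*(\S.*?)\s*$")


def generate_candidates(page_text: str) -> List[Pair]:
    """Phase 1 (generation): annotate lines that look like 'Attribute: value'."""
    pairs: List[Pair] = []
    for line in page_text.splitlines():
        match = _COLON_PATTERN.match(line)
        if match:
            pairs.append((match.group(1).lower(), match.group(2)))
    return pairs


def filter_candidates(pages_pairs: Iterable[List[Pair]], min_pages: int = 2) -> List[Pair]:
    """Phase 2 (filtering): keep pairs whose attribute occurs on enough pages.

    Page frequency is only a crude proxy for semantic plausibility, chosen
    here because it is simple to compute; it is an assumption of this sketch.
    """
    per_page = [list(pairs) for pairs in pages_pairs]
    page_counts: Counter = Counter()
    for pairs in per_page:
        # Count each attribute once per page, however often it repeats there.
        for attribute in {attr for attr, _ in pairs}:
            page_counts[attribute] += 1
    return [
        (attr, value)
        for pairs in per_page
        for attr, value in pairs
        if page_counts[attr] >= min_pages
    ]


if __name__ == "__main__":
    pages = [
        "Color: red\nNote to self: buy milk",
        "Color: blue\nWeight: 2 kg",
        "Weight: 3 kg",
    ]
    candidates = [generate_candidates(page) for page in pages]
    print(filter_candidates(candidates))
```

In this toy run, "note to self" appears on only one page and is dropped, while "color" and "weight" survive. Generation touches each page independently, and the filter needs only per-attribute counts, so both steps fit a map-then-aggregate layout of the kind the summary calls massively parallelizable.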