REBIEX: Record Boundary Identification and Extraction Through Pattern Mining

Bibliographic Details
Main author: Kulkarni, Parashuram
Format: Conference proceeding
Language: English
Online access: Full text
Description
Summary: Information on the web is often placed in a structure having a particular alignment and order. For example, Web pages produced by Web search engines, CGI scripts, etc. generally contain multiple records of information, with each record representing one unit of information and sharing a distinct visual pattern. The pattern formed by these records may lie in the structure of the documents or in the repetitive nature of their content. For effective information extraction, it becomes essential to identify record boundaries for these units of information and apply extraction rules to individual record elements. In this paper I present REBIEX, a system that automatically identifies and extracts repeated patterns formed by the data records in a fuzzy way, allowing for slight inconsistencies. It uses the structural elements of web documents as well as the content and categories of text elements in the documents, without the need for any training data or human intervention. Unlike current techniques, this one makes use of the fact that it is not only the HTML structure that repeats, but also the content matter of the document, which repeats consistently. The system also employs a novel algorithm to mine repeating patterns in a fuzzy way with high accuracy.
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/11581062_65
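
The summary does not spell out the REBIEX algorithm, so the following is only a minimal sketch of the general idea it describes: treating a page as a token sequence and grouping candidate records that repeat approximately rather than exactly. Everything here is an assumption made for illustration: the names TagCollector, split_records and fuzzy_records, the threshold parameter, and the use of Python's difflib.SequenceMatcher as a stand-in similarity measure. It also collapses all text into a single TEXT token, whereas the paper reports using the content and categories of text elements as well.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Flatten an HTML document into a sequence of tokens (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append("<" + tag + ">")

    def handle_data(self, data):
        # Every non-blank text node becomes the same placeholder token.
        if data.strip():
            self.tokens.append("TEXT")


def split_records(tokens, boundary):
    """Start a new candidate record at each occurrence of the boundary token."""
    records, current = [], []
    for tok in tokens:
        if tok == boundary and current:
            records.append(current)
            current = []
        current.append(tok)
    if current:
        records.append(current)
    return records


def fuzzy_records(html, threshold=0.8):
    """Return the group of approximately repeating records covering the most tokens."""
    parser = TagCollector()
    parser.feed(html)
    tokens = parser.tokens
    best = []
    # Try each distinct tag, in order of first appearance, as the record boundary.
    for boundary in dict.fromkeys(tokens):
        if boundary == "TEXT":
            continue
        records = [r for r in split_records(tokens, boundary) if r[0] == boundary]
        if len(records) < 2:
            continue
        reference = records[0]
        # Fuzzy match: keep records close enough to the first one, not only identical ones.
        similar = [
            r for r in records
            if SequenceMatcher(None, reference, r).ratio() >= threshold
        ]
        # On ties the earlier (outer) tag is kept because the comparison is strict.
        if sum(map(len, similar)) > sum(map(len, best)):
            best = similar
    return best
```

Called on a small result list in which one item lacks a field, a lower threshold keeps the incomplete item as a record while a stricter one drops it. This is one simple way to tolerate the "slight inconsistencies" the summary mentions, not a reconstruction of the paper's mining algorithm.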