Method and apparatus for page resource structuring

Bibliographic details
Main authors: ZHU DANJIN, SHI HONGDUN, JIA LIQUN, WU QIJI, CHEN LIYONG, ZHOU YI, YI YINGHUA, CHENG YAN, ZHANG SHAOJIE, WENG ZHIXUAN, ZHOU JIANBAO, LIU YI, XIE DONGHUA, YANG WENHUA, DUAN XUEJIAN, HE YONG, HU DAWEI
Format: Patent
Language: Chinese; English
Description
Summary: The invention provides a method and an apparatus for page resource structuring. The method comprises the steps of: creating a web page content capturing module and acquiring the HTML file corresponding to a web page; defining a Schema file that standardizes the XML result document generated after structuring; establishing a label mapping file that maps HTML labels, text properties and paragraph attributes to the labels defined by the Schema; and performing content identification according to the mapping relation and generating the corresponding structured document, thereby completing the structuring of the page resource. Conventional web page data acquisition generally covers only web page metadata. Compared with that conventional processing, the method and apparatus provided by the invention can quickly, intelligently and accurately acquire both the web page metadata and the effective content, and can fragment and structure the acquired content, more
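
The steps named in the summary (acquire HTML, map HTML labels to Schema-defined labels, emit a structured XML document) can be illustrated with a minimal sketch. This is not the patented implementation: the mapping table `TAG_MAP`, the element names (`Document`, `Title`, `Heading`, `Paragraph`), and the function `structure_page` are all hypothetical, the mapping uses only the HTML tag name (not text properties or paragraph attributes), and no Schema validation is performed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical label mapping: HTML tag -> element name in an assumed Schema.
TAG_MAP = {"title": "Title", "h1": "Heading", "h2": "Heading", "p": "Paragraph"}


class StructuringParser(HTMLParser):
    """Collects the text of mapped HTML tags into a flat XML tree."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("Document")  # assumed root element name
        self._current = None  # XML element currently collecting text

    def handle_starttag(self, tag, attrs):
        # Open a new XML element only for tags present in the mapping.
        if tag in TAG_MAP:
            self._current = ET.SubElement(self.root, TAG_MAP[tag])
            self._current.text = ""

    def handle_data(self, data):
        # Text outside any mapped tag is ignored.
        if self._current is not None:
            self._current.text += data

    def handle_endtag(self, tag):
        # Close the current element when its mapped tag ends.
        if tag in TAG_MAP:
            if self._current is not None:
                self._current.text = self._current.text.strip()
            self._current = None


def structure_page(html: str) -> str:
    """Return an XML string built from the mapped parts of an HTML page."""
    parser = StructuringParser()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    page = "<html><head><title>Demo</title></head><body><h1>News</h1><p>Hello <b>world</b></p></body></html>"
    print(structure_page(page))
```

Inline markup inside a mapped tag (such as `<b>`) is flattened into the parent's text, and mapped tags nested inside one another are not handled; a fuller version would need a stack and a validation step against the Schema file.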