Method and apparatus for page resource structuring
Main authors: | , , , , , , , , , , , , , , , , |
---|---|
Format: | Patent |
Language: | Chinese ; English |
Summary: | The invention provides a method and an apparatus for structuring page resources. The method comprises the steps of: creating a web page content capturing module and acquiring the HTML file corresponding to a web page; defining a Schema file that standardizes the XML result document generated by structuring; establishing a label mapping file that maps HTML labels, text properties, and paragraph attributes to the labels defined by the Schema; and identifying content according to the mapping relation and generating the corresponding structured document, thereby completing the structuring of the page resource. Conventional web page data acquisition generally covers only web page metadata. Compared with that conventional processing, the method and apparatus of the invention can quickly, intelligently, and accurately acquire both the metadata and the effective content of a web page, and can fragment and structure the acquired content, more ... |
---|
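
The steps named in the summary (acquire HTML, map HTML labels to Schema labels, emit a structured XML document) can be illustrated with a minimal sketch. This is not the patented implementation: the mapping table `LABEL_MAP`, the element names `document`, `heading`, and `paragraph`, and the function `structure_page` are all hypothetical, the HTML is assumed to be already downloaded, and validation against a Schema file is left out because the Python standard library does not provide it.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical label mapping: HTML tag -> label of an assumed result schema.
LABEL_MAP = {"title": "title", "h1": "heading", "h2": "heading", "p": "paragraph"}


class MappingParser(HTMLParser):
    """Collects the text of mapped HTML tags into an XML tree."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")  # assumed root element name
        self._open_tag = None               # mapped tag currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Start collecting only when no mapped tag is already open.
        if self._open_tag is None and tag in LABEL_MAP:
            self._open_tag = tag
            self._text = []

    def handle_data(self, data):
        # Text outside mapped tags (navigation, scripts, ...) is ignored.
        if self._open_tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            # Collapse runs of whitespace before storing the fragment.
            text = " ".join("".join(self._text).split())
            if text:
                ET.SubElement(self.root, LABEL_MAP[tag]).text = text
            self._open_tag = None


def structure_page(html: str) -> str:
    """Return an XML string with one element per mapped HTML fragment."""
    parser = MappingParser()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    page = "<html><head><title>Demo</title></head><body><h1>Intro</h1><p>Hello <b>world</b>.</p></body></html>"
    print(structure_page(page))
```

Keeping the mapping in a separate table mirrors the summary's separate label mapping file: the parser stays unchanged when the target labels change.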