Theme web portal crawler method

Bibliographic details
Main authors: XU JING, XU TIAN, BAO XIANYU, WEI TINGTING, LI YAN, HUANG DALIANG, ZHAO QINGYUE
Format: Patent
Language: chi ; eng
Description
Summary: The invention relates to the technical field of network information capture, and in particular to a crawler method for topic portal websites. The method comprises the following steps: link analysis and extraction, in which regular expressions are designed for the topic website so that parent-page links and child-page links can be identified; web page content extraction, in which the text content under each child-page link is extracted and stored in a static class; persistent data storage, which stores the text content extracted from each child-page link; and incremental capture, which captures updated content on the topic website by re-extracting the links on its home page at each incremental update and processing only the new links. Pages obtained by the crawler are almost never duplicated, the required topic content is captured accurately, and web pages with the same content are effectively prevented from being downloaded multiple times
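The four steps named in the summary can be illustrated with a minimal Python sketch. It is not the patented implementation: the URL patterns, the `example.com` addresses, the SQLite table layout, and all function names are assumptions made for illustration. The sketch assumes absolute links, identifies parent-page links without following them, and writes extracted text straight to the database rather than through the intermediate static class the summary mentions.

```python
import re
import sqlite3
from html.parser import HTMLParser

# Hypothetical URL patterns for an imagined topic site; a real site needs its own.
PARENT_LINK = re.compile(r"^https://example\.com/news/list_\d+\.html$")
CHILD_LINK = re.compile(r"^https://example\.com/news/\d{8}/\d+\.html$")


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def classify_links(html):
    """Step 1: split page links into parent-page and child-page links."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    parents = [u for u in parser.links if PARENT_LINK.match(u)]
    children = [u for u in parser.links if CHILD_LINK.match(u)]
    return parents, children


def extract_text(html):
    """Step 2: extract the text content of a child page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def open_store(path=":memory:"):
    """Step 3: persistent storage, one row per child-page URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    return conn


def incremental_crawl(home_url, fetch, conn):
    """Step 4: re-read the home page and process only links not yet stored.

    `fetch` is any callable mapping a URL to its HTML (an assumption of this
    sketch, so that the network layer can be swapped or mocked).
    """
    _, children = classify_links(fetch(home_url))
    new_urls = []
    for url in dict.fromkeys(children):  # de-duplicate, keep page order
        if conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone():
            continue  # already captured in an earlier run
        conn.execute(
            "INSERT INTO pages (url, content) VALUES (?, ?)",
            (url, extract_text(fetch(url))),
        )
        new_urls.append(url)
    conn.commit()
    return new_urls
```

Calling `incremental_crawl` repeatedly against the same store returns only the child links that appeared since the previous run, which is how the sketch avoids downloading the same page twice. Passing `fetch` as a parameter keeps the example independent of any particular HTTP library.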