Theme web portal crawler method
Main authors: | , , , , , , |
---|---|
Format: | Patent |
Language: | chi ; eng |
Subjects: | |
Summary: | The invention relates to the technical field of network information capture, in particular to a crawler method for topic portal websites. The method comprises the following steps: page link analysis and extraction, in which a regular expression is designed for the theme website to distinguish parent page links from child page links; webpage content extraction, in which the text content under each child page link is extracted and held in a static class; persistent data storage, in which the text content extracted from each child page link is stored; and incremental capture, in which updated content on the theme website is captured by re-extracting the links from the website's home page on each incremental update and processing only the new links. The pages obtained by the crawler contain almost no duplicates, the required theme content can be obtained accurately, and webpages containing the same content can be effectively prevented from being downloaded multiple times |
---|
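
The four steps named in the summary can be illustrated with a minimal sketch. It is not the patented implementation. The site `news.example.com`, the two URL patterns, all function and class names, the use of SQLite for persistence, and the injected `fetch` callable are assumptions made for illustration. The intermediate static class mentioned in the summary is omitted, and extracted text is written straight to the store.

```python
import re
import sqlite3
from html.parser import HTMLParser

# Hypothetical URL patterns for an imagined theme site (not from the patent).
PARENT_LINK = re.compile(r"^https?://news\.example\.com/[a-z]+/(index\.html)?$")
CHILD_LINK = re.compile(r"^https?://news\.example\.com/[a-z]+/\d{8}/\d+\.html$")


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def classify_links(html):
    """Step 1: split a page's links into parent (section) and child (article) links."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    parents = [u for u in parser.links if PARENT_LINK.match(u)]
    children = [u for u in parser.links if CHILD_LINK.match(u)]
    return parents, children


def extract_text(html):
    """Step 2: extract the text content of a child page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def open_store(path=":memory:"):
    """Step 3: persistent storage keyed by URL (SQLite is an assumed choice)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
    return conn


def incremental_crawl(conn, home_html, fetch):
    """Step 4: re-extract home page links and process only links not yet stored."""
    _, children = classify_links(home_html)
    seen = {row[0] for row in conn.execute("SELECT url FROM pages")}
    new_links = [u for u in dict.fromkeys(children) if u not in seen]
    for url in new_links:
        conn.execute(
            "INSERT INTO pages (url, body) VALUES (?, ?)",
            (url, extract_text(fetch(url))),
        )
    conn.commit()
    return new_links
```

Passing `fetch` in as a parameter keeps the sketch free of network code, so any HTTP client can be supplied. Because stored URLs act as the "already seen" set, a repeated run over an unchanged home page downloads nothing.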