Webpage crawling method and device

Bibliographic details
Main authors:	HUANG MIAOHUA, LI HENGXIN, LI GAOBIAO, CHEN YONGDONG, LUO HUIXING, CHEN GUIQIAO, PAN SHUAICHEN, ZHANG JIAN, ZENG MINGHUI, ZHENG WEIZHI, ZHANG ZHENG, GU SIYANG, PENG DONGMING, LEE YOUNG-BIN, LIAO WEIRONG, XIE SHUYONG, LIU XIAOZHOU, LIN SHAN, ZHAO GUIZHONG
Format:	Patent
Language:	Chinese; English
Subjects:	CALCULATING; COMPUTING; COUNTING; ELECTRIC DIGITAL DATA PROCESSING; PHYSICS

Description
Abstract:	An embodiment of the invention discloses a webpage crawling method and device. The method comprises: obtaining a topic relevance determination model pre-trained for a target topic, and determining at least one candidate link; crawling the webpage summary of the candidate webpage corresponding to each candidate link, and determining, based on the webpage summaries and the topic relevance determination model, at least one topic link among the candidate links that is related to the target topic; and crawling the webpage content of the topic webpage corresponding to each topic link. The technical scheme provided by the embodiment solves the problem of topic drift that arises when crawling webpage content related to a topic.
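
The two-stage flow in the abstract (filter candidate links by their summaries, then crawl only the links that pass) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `keyword_relevance` is a hypothetical keyword-overlap stand-in for the pre-trained topic relevance model, the names `focused_crawl`, `fetch_brief`, and `fetch_content` are invented for this sketch, and the fetchers are in-memory dictionaries with example.com URLs rather than real HTTP requests.

```python
from typing import Callable, Dict, Iterable, List


def keyword_relevance(brief: str, topic_terms: Iterable[str]) -> float:
    """Hypothetical stand-in for the pre-trained relevance model:
    returns the fraction of topic terms that occur in the summary."""
    terms = {t.lower() for t in topic_terms}
    if not terms:
        return 0.0
    words = set(brief.lower().split())
    return len(terms & words) / len(terms)


def focused_crawl(
    candidate_links: List[str],
    fetch_brief: Callable[[str], str],
    fetch_content: Callable[[str], str],
    is_relevant: Callable[[str], bool],
) -> Dict[str, str]:
    """Stage 1: fetch only each candidate's summary and keep relevant links.
    Stage 2: fetch full content for the retained topic links only."""
    topic_links = [
        link for link in candidate_links if is_relevant(fetch_brief(link))
    ]
    return {link: fetch_content(link) for link in topic_links}


# Illustrative in-memory data (assumed, not from the patent).
briefs = {
    "https://example.com/a": "solar panel efficiency news",
    "https://example.com/b": "celebrity gossip roundup",
}
contents = {
    "https://example.com/a": "Full article about solar panel efficiency.",
    "https://example.com/b": "Full article about celebrities.",
}
topic_terms = ["solar", "panel"]
fetched: List[str] = []  # records which full pages were actually requested


def fetch_content(link: str) -> str:
    fetched.append(link)
    return contents[link]


pages = focused_crawl(
    list(briefs),
    briefs.__getitem__,
    fetch_content,
    lambda brief: keyword_relevance(brief, topic_terms) >= 0.5,
)
print(sorted(pages))
```

Because the relevance decision is made on the summary alone, off-topic pages are never fetched in full, which is how this kind of filtering limits topic drift. The 0.5 threshold is an arbitrary choice for the sketch.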