Method for quickly classifying webpage topics based on HTML source code features

Bibliographic details
Main authors: ZHU YUJIA, JIAN XIAOYUN, CHEN JINHUI, YANG ZHE, WANG LIFANG
Format: Patent
Language: Chinese; English
Description
Abstract: The invention discloses a method for quickly classifying webpage topics based on HTML source code features. The method automatically parses the webpage source code to obtain image data that contains the webpage's layout characteristics. The features are chosen so that they effectively reflect the layout of the page: the content length and the link length contained in the selected tags, the hierarchical relationship to which the selected tags belong, and the distance relationship between the selected tags. A deep learning model is trained on the image data generated from the webpage source code to obtain the layout features contained in that data, so that massive numbers of webpages can be classified quickly and accurately by their layout features. According to the method, the webpage layout information contained in the webpage source code is effectively utilized, the layout information is automatically extracted and learned, and the constructed clas
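
The feature extraction step named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not say which tags are selected, how distance is measured, or how the image is laid out, so `SELECTED_TAGS`, the start-tag-count definition of distance, the four-column row layout, and the names `LayoutFeatureParser` and `layout_image` are all assumptions made for illustration. Only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser

# Assumption: the abstract does not list the selected tags; this set is illustrative.
SELECTED_TAGS = {"div", "p", "a", "li", "h1", "h2", "td"}
# HTML void elements never get an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LayoutFeatureParser(HTMLParser):
    """Collects one feature row per selected tag, in document order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # stack of (tag name, feature row or None)
        self.rows = []             # feature rows for selected tags
        self.tag_index = 0         # running count of start tags seen
        self.last_selected = None  # tag_index of the previous selected tag

    def handle_starttag(self, tag, attrs):
        self.tag_index += 1
        row = None
        if tag in SELECTED_TAGS:
            # Assumption: "distance" = number of start tags since the
            # previous selected tag (0 for the first one).
            distance = (0 if self.last_selected is None
                        else self.tag_index - self.last_selected)
            row = {"content_len": 0, "link_len": 0,
                   "depth": len(self.open_tags), "distance": distance}
            self.rows.append(row)
            self.last_selected = self.tag_index
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, row))

    def handle_endtag(self, tag):
        names = [name for name, _ in self.open_tags]
        if tag in names:
            # Close the nearest matching tag and anything left open inside it.
            last = len(names) - 1 - names[::-1].index(tag)
            del self.open_tags[last:]

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        in_link = any(name == "a" for name, _ in self.open_tags)
        # Credit the text to every enclosing selected tag.
        for _, row in self.open_tags:
            if row is not None:
                row["content_len"] += len(text)
                if in_link:
                    row["link_len"] += len(text)


def layout_image(html, height=8):
    """Return a height x 4 grid: content length, link length, depth, distance."""
    parser = LayoutFeatureParser()
    parser.feed(html)
    parser.close()
    image = []
    for row in parser.rows[:height]:
        features = (row["content_len"], row["link_len"],
                    row["depth"], row["distance"])
        # Clip to 0..255 so each cell can be read as a grayscale pixel.
        image.append([min(value, 255) for value in features])
    # Pad with empty rows so every page yields the same shape.
    image += [[0, 0, 0, 0] for _ in range(height - len(image))]
    return image
```

Calling `layout_image(page_source, height=64)` on each page would give fixed-shape grids that could be passed to an image classifier such as a convolutional network. The abstract does not name the deep learning model used, so that choice is left open here.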