HTML-content-page issuing time extraction method and system


Bibliographic details
Main authors: WU DONGYE, XIA JING, ZHENG YEPING, FENG DAHUI
Format: Patent
Language: Chinese; English
Description
Summary: The invention provides an HTML-content-page issuing time extraction method and system. The method comprises the following steps: the HTML is parsed to obtain the HTML segments that contain date-form text; positive and negative samples are selected according to manually annotated dates, and a label database is generated automatically from those samples; the samples in the label database are transformed into vectors to generate a character representation; an SVM model is trained on the character representation; the character representation of the HTML to be predicted is fed into the trained SVM model, and if the predicted value is positive, the corresponding date-form text is judged to be the issuing time of that HTML. With this method and system, the label database can be generated automatically from HTML, and the method and system remove the dependence on natural language,
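
The first three steps named in the summary (parsing the HTML for date-form text, labeling samples against a manually annotated date, and turning each sample into a vector) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the patented implementation: the date pattern `DATE_RE`, the helper names `DateSegmentParser`, `label_segments` and `to_vector`, and the four features are all hypothetical choices made for this example.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: dates such as 2016-03-01, 2016/3/1 or 2016年3月1日.
DATE_RE = re.compile(r"(\d{4})[-/.年](\d{1,2})[-/.月](\d{1,2})日?")


class DateSegmentParser(HTMLParser):
    """Collects every text node that contains date-form text, with its tag path."""

    VOID = {"br", "img", "meta", "link", "input", "hr"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.segments = []   # one dict per date found

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        for match in DATE_RE.finditer(data):
            year, month, day = (int(g) for g in match.groups())
            self.segments.append({
                "path": tuple(self.stack),
                "text": data.strip(),
                "date": (year, month, day),
            })


def label_segments(html, annotated_date):
    """Label a segment +1 if its date equals the annotated date, else -1."""
    parser = DateSegmentParser()
    parser.feed(html)
    parser.close()
    return [(seg, 1 if seg["date"] == annotated_date else -1)
            for seg in parser.segments]


def to_vector(seg):
    """Hypothetical structural features; none of them uses natural language."""
    path = seg["path"]
    return [
        float(len(path)),                # nesting depth of the text node
        float(len(seg["text"])),         # length of the surrounding text
        1.0 if "time" in path else 0.0,  # inside a <time> element
        1.0 if "a" in path else 0.0,     # inside a link
    ]
```

The resulting vectors and their +1/-1 labels could then be passed to any SVM implementation (for instance scikit-learn's `sklearn.svm.SVC`), and a positive prediction for a segment of an unseen page would mark its date as the issuing time. The training step is left out here so that the sketch needs only the standard library.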