HTML-content-page issuing time extraction method and system
Main authors: | , , , |
---|---|
Format: | Patent |
Language: | chi ; eng |
Summary: | The invention provides a method and system for extracting the issuing time of an HTML content page. The method includes the following steps: the HTML is parsed and the HTML segments containing date-form text are obtained; positive and negative samples are selected according to manually annotated dates, and a label database is generated automatically from these samples; the samples in the label database are transformed into vectors to produce a character representation; an SVM model is trained on the character representation; the character representation of the HTML to be predicted is fed into the trained SVM model, and if the predicted value is positive, the date text behind that character representation is judged to be the issuing time of that HTML. With this method and system, the label database can be generated automatically from HTML, and the method and system remove the dependence on natural language, |
---|
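
The first steps of the summary (finding date-form segments, labelling them against a manually annotated date, and turning them into vectors) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the date pattern, the names `DateSegmentParser`, `extract_segments`, `label_segments` and `to_vector`, and the three structural features are all hypothetical, and "character representation" is read here as a numeric feature vector. The SVM training step is not shown; the resulting vectors and labels could be passed to any SVM implementation.

```python
import re
from html.parser import HTMLParser

# Assumed date shapes: YYYY-MM-DD, YYYY/MM/DD, YYYY.MM.DD.
# The source does not define "date-form text"; this pattern is illustrative.
DATE_RE = re.compile(r"\b(\d{4})[-/.](\d{1,2})[-/.](\d{1,2})\b")

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def normalize(match):
    """Render a matched date as YYYY-MM-DD so it can be compared to the annotation."""
    year, month, day = (int(g) for g in match.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"


class DateSegmentParser(HTMLParser):
    """Collects (tag_path, date) pairs for every date-form string in text nodes."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.segments = []   # (tag_path, normalized_date)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        for match in DATE_RE.finditer(data):
            self.segments.append(("/".join(self.stack), normalize(match)))


def extract_segments(html):
    parser = DateSegmentParser()
    parser.feed(html)
    parser.close()
    return parser.segments


def label_segments(segments, annotated_date):
    """Label a segment +1 if it equals the manually annotated date, else -1."""
    return [(path, date, 1 if date == annotated_date else -1)
            for path, date in segments]


def to_vector(path, index, total):
    """Hypothetical structure-only features (no natural-language cues)."""
    tags = path.split("/") if path else []
    return [
        len(tags),                                          # nesting depth
        1.0 if "span" in tags or "time" in tags else 0.0,   # inline date container
        index / max(total - 1, 1),                          # relative position on page
    ]


page = """<html><body>
<div class="head"><span>2016-03-01</span></div>
<p>The event on 2015/12/24 was postponed.</p>
<div class="foot">Copyright 2010.01.01</div>
</body></html>"""

segments = extract_segments(page)
samples = label_segments(segments, "2016-03-01")
vectors = [to_vector(path, i, len(samples)) for i, (path, _, _) in enumerate(samples)]
labels = [label for _, _, label in samples]
```

Here `vectors` and `labels` form one page's contribution to the label database. Because the features come from tag structure and position rather than surrounding words, the sketch stays independent of the page's natural language, which is the property the summary emphasizes.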