Webpage text positioning method and device based on sequence labeling and computer equipment
Main authors: | , , , , , , , , |
---|---|
Format: | Patent |
Language: | chi ; eng |
Summary: | The invention relates to a webpage text positioning method and device based on sequence labeling, and to computer equipment. First, a regular expression is constructed to extract candidate text segments from the page source code. Because only a small part of the extracted segments contain the text content to be extracted, the segments undergo a preliminary classification, and those that actually contain text content are retained according to the differences between text segments and non-text segments. Second, all parameters used in the HMM (hidden Markov model) are assigned and initialized from the training set. Finally, the Viterbi algorithm is used to calculate the probability that each word in a text segment belongs to each label, the most probable labels are selected as the sequence labeling, and all content belonging to the text is located according to the label categories. |
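
The pipeline in the summary (regular-expression extraction, HMM parameters taken from a training set, Viterbi decoding) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the two-label scheme (`B` for body text, `O` for other), the regular expression, add-one smoothing, and the function names `extract_fragments`, `train_hmm` and `viterbi`. The preliminary classification step is not shown because the record does not say how it is performed.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical label set (an assumption): "B" = body text, "O" = other.
LABELS = ("B", "O")


def extract_fragments(html):
    """Return the non-empty text runs found between tags.

    The pattern is an illustrative assumption, not the patent's expression.
    """
    return [m.strip() for m in re.findall(r">([^<>]+)<", html) if m.strip()]


def train_hmm(tagged_sequences, smoothing=1.0):
    """Estimate HMM log-probabilities by counting over labeled sequences."""
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    vocab = set()
    for seq in tagged_sequences:
        prev = None
        for token, label in seq:
            vocab.add(token)
            emit[label][token] += 1
            if prev is None:
                start[label] += 1
            else:
                trans[prev][label] += 1
            prev = label
    n_labels = len(LABELS)
    n_tokens = len(vocab) + 1  # one extra slot for unseen tokens

    def logp(count, total, k):
        # Additive smoothing keeps every probability strictly positive.
        return math.log((count + smoothing) / (total + smoothing * k))

    start_total = sum(start.values())
    log_start = {l: logp(start[l], start_total, n_labels) for l in LABELS}
    log_trans = {
        a: {b: logp(trans[a][b], sum(trans[a].values()), n_labels)
            for b in LABELS}
        for a in LABELS
    }

    def log_emit(label, token):
        return logp(emit[label][token], sum(emit[label].values()), n_tokens)

    return log_start, log_trans, log_emit


def viterbi(tokens, log_start, log_trans, log_emit):
    """Return the most probable label sequence for the given tokens."""
    if not tokens:
        return []
    score = [{l: log_start[l] + log_emit(l, tokens[0]) for l in LABELS}]
    back = []
    for tok in tokens[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for l in LABELS:
            # Best predecessor label for reaching label l at this position.
            best = max(LABELS, key=lambda p: prev[p] + log_trans[p][l])
            cur[l] = prev[best] + log_trans[best][l] + log_emit(l, tok)
            ptr[l] = best
        score.append(cur)
        back.append(ptr)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda l: score[-1][l])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With a labeled training set in hand, `train_hmm` returns the three parameter tables, and `viterbi(tokens, *params)` yields one label per token; the tokens labeled `B` would then mark the located text. Working in log space avoids numeric underflow on long fragments.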