Data cleaning method and system, computer equipment and storage medium
Main authors: | , |
---|---|
Format: | Patent |
Language: | chi ; eng |
Subjects: | |
Summary: | The invention discloses a data cleaning method and system, computer equipment, and a storage medium. The method comprises three steps. (1) Obtaining document topic probability distributions: a topic model is applied to the collected original corpus to obtain the topic probability distribution of each document. (2) Clustering the topic probability distributions: a clustering algorithm is applied to the distributions to obtain a document sample set and/or outlier noise points. (3) Deleting irrelevant text: the outlier noise points are deleted. Because the topic category of each text is recognized automatically through topic clustering, the method is highly automated, can filter out texts whose content is irrelevant to the corpus, and is broadly applicable.
The present application discloses a data cleaning method, system, computer equipment, and storage medium. The data cleaning method comprises: a step of obtaining document topic probability distributions, in which a topic model is used to model the topics of the collected original corpus and obtain the topic probability distribution of each document; and a topic probability distribution clustering step, in which, according to the topic probability distributions, a clustering algorithm... |
---|
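
The clustering and deletion steps described in the summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation. The record names neither the topic model nor the clustering algorithm, so the sketch assumes a density-based algorithm in the style of DBSCAN with Euclidean distance, and it takes the per-document topic probability distributions as given (for example, the output of an LDA model). The names `dbscan`, `clean_corpus`, `eps`, `min_samples`, and the sample data are all hypothetical.

```python
import math

NOISE = -1  # label for outlier noise points


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE if it is an outlier.

    Assumption: a DBSCAN-style algorithm with Euclidean distance; the
    source only says "a clustering algorithm".
    """
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        # Point i is a core point: grow a new cluster from it.
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is also a core point
        cluster += 1
    return labels


def clean_corpus(documents, topic_distributions, eps=0.2, min_samples=2):
    """Split documents into those kept and those deleted as outliers."""
    labels = dbscan(topic_distributions, eps, min_samples)
    kept = [d for d, lab in zip(documents, labels) if lab != NOISE]
    removed = [d for d, lab in zip(documents, labels) if lab == NOISE]
    return kept, removed


# Hypothetical topic distributions over three topics (each row sums to 1).
docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
theta = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
    [0.34, 0.33, 0.33],  # no dominant topic, far from both groups
]

kept, removed = clean_corpus(docs, theta, eps=0.2, min_samples=2)
print("kept:", kept)
print("removed:", removed)
```

In this toy data the first four documents form two tight groups, while the last one has no close neighbour and is treated as a noise point. The values of `eps` and `min_samples` are arbitrary here and would need tuning on a real corpus.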