Data cleaning method and system, computer equipment and storage medium


Bibliographic details
Main authors: YOU YING, WEI HAITIAN
Format: Patent
Language: chi; eng
Description
Summary: The invention discloses a data cleaning method and system, computer equipment, and a storage medium. The method comprises three steps. In the topic probability distribution obtaining step, topic modeling is performed on the collected original corpus with a topic model to obtain the topic probability distribution of each document. In the topic probability distribution clustering step, a clustering algorithm is applied to the topic probability distributions to obtain a document sample set and/or outlier noise points. In the irrelevant text deleting step, the outlier noise points are deleted. This topic-clustering-based data cleaning method is highly automated: the topic category of each text is recognized automatically through topic clustering, texts whose content is irrelevant can be filtered out automatically, and the method generalizes well.

本申请公开了本发明提供了一种数据清洗方法、系统、计算机设备及存储介质,数据清洗方法包括:获得文档主题概率分布步骤:使用主题模型对收集到的原始语料进行主题建模,得到各文档的主题概率分布;主题概率分布聚类步骤:根据所述主题概率分布通过聚类算
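
The clustering and deletion steps described in the summary can be illustrated with a minimal sketch. It is not the patented implementation: the summary does not name the clustering algorithm, so DBSCAN is used here purely as an assumption, because it labels points that belong to no dense cluster as noise. The topic probability distributions are assumed to have been produced beforehand by some topic model (for example LDA), and the names `dbscan`, `clean_corpus`, `eps` and `min_samples`, as well as the toy data, are hypothetical.

```python
from math import sqrt

NOISE = -1  # label for outlier noise points (convention assumed for this sketch)


def euclidean(p, q):
    """Euclidean distance between two topic probability vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: return one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed by a cluster as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:  # core point: keep expanding the cluster
                queue.extend(nbrs)
        cluster += 1
    return labels


def clean_corpus(docs, topic_dists, eps=0.2, min_samples=2):
    """Cluster documents by topic distribution and delete the noise points."""
    labels = dbscan(topic_dists, eps, min_samples)
    kept = [d for d, lab in zip(docs, labels) if lab != NOISE]
    removed = [d for d, lab in zip(docs, labels) if lab == NOISE]
    return kept, removed


# Toy input: three documents dominated by topic 0, two by topic 1,
# and one unrelated document dominated by topic 2.
docs = ["a1", "a2", "a3", "b1", "b2", "spam"]
topic_dists = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.88, 0.07, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90],
]

kept, removed = clean_corpus(docs, topic_dists, eps=0.2, min_samples=2)
print(kept)     # documents that fall into a topic cluster
print(removed)  # documents treated as outlier noise and deleted
```

In this toy input the unrelated document has no neighbor within `eps`, so it is labeled as noise and removed, while the two topic groups are kept. On a real corpus, `eps` and `min_samples` would need tuning, and a distance better suited to probability distributions (such as Jensen-Shannon) might be preferable to the Euclidean distance assumed here.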