METHOD AND SYSTEM FOR DETECTING DUPLICATED DOCUMENT USING VECTOR QUANTIZATION

Bibliographic details
Main authors: HAN BYEONGHOON, KIM SUNG MIN
Format: Patent
Language: eng; kor
Description
Summary: Disclosed are a method and system for detecting duplicate documents using vector quantization. According to one embodiment, the method may comprise: acquiring a vector representation for each document in a document set through a similarity model trained to output vector representations of documents based on the semantic similarity between them; generating a key, implemented as a binary string, by vector-quantizing the vector representation; and detecting duplicate documents among the documents in the document set through the key. The invention can thereby determine whether documents in the set are duplicates of one another.
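
The three steps in the summary (vector representation, binary-string key, duplicate detection through the key) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the record does not say how the similarity model is built or which quantization scheme is used, so the embeddings are hard-coded hypothetical vectors, the quantizer is simple per-component thresholding, and the names `quantize_to_key` and `find_duplicate_groups` are invented for this example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def quantize_to_key(vector: Sequence[float], threshold: float = 0.0) -> str:
    """Turn a vector into a binary-string key.

    Assumption: each component becomes '1' if it exceeds the threshold,
    else '0'. The patent record does not specify the actual quantizer.
    """
    return "".join("1" if x > threshold else "0" for x in vector)


def find_duplicate_groups(
    embeddings: Dict[str, Sequence[float]],
) -> List[List[str]]:
    """Group document IDs whose vectors quantize to the same key."""
    buckets = defaultdict(list)
    for doc_id, vector in embeddings.items():
        buckets[quantize_to_key(vector)].append(doc_id)
    # Only keys shared by two or more documents indicate duplicates.
    return [ids for ids in buckets.values() if len(ids) > 1]


# Hypothetical vectors standing in for the similarity model's output.
example_embeddings = {
    "doc_a": [0.8, -0.3, 0.5, -0.9],
    "doc_b": [0.7, -0.2, 0.6, -0.8],  # close to doc_a
    "doc_c": [-0.4, 0.9, -0.1, 0.2],
}

duplicate_groups = find_duplicate_groups(example_embeddings)
print(duplicate_groups)
```

Because documents are compared by exact key match, candidate duplicates can be found with a single hash lookup per document instead of a pairwise vector comparison; whether the patented system matches keys exactly or tolerates a few differing bits is not stated in this record.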