METHOD AND SYSTEM FOR OBTAINING A VECTOR REPRESENTATION OF AN ELECTRONIC DOCUMENT

Bibliographic Details
Main authors:	VYSHEGORODTSEV, Kirill Evgenievich; BALASHOV, Aleksandr Viktorovich; RYUPICHEV, Dmitriy Yurievich; DAVIDOV, Dmitriy Georgievich
Format:	Patent
Language:	English; French; Russian
Subjects:	CALCULATING; COMPUTING; COUNTING; ELECTRIC DIGITAL DATA PROCESSING; PHYSICS

Description
Summary:	The invention relates to the field of computer technology for processing natural language, artificial language and any semiotic systems. The computer-implemented method is carried out by a processor and comprises two stages. First, a cluster-based m-skip-n-gram location model is generated, where an m-skip-n-gram is an individual word: a list of m-skip-n-grams to be used is determined; each m-skip-n-gram in the list is converted into a vector representation; and the m-skip-n-grams are clustered according to their vector representations. Second, a text document is processed with the resulting m-skip-n-gram location model: the occurrences of m-skip-n-grams in the text document are counted; clusters are identified in the text document on the basis of those occurrences; the number of occurrences of m-skip-n-grams in each cluster is totalled; and a vector representation of the text document is generated from the ordered sequence of the per-cluster totals. The technical result is a more accurate representation of text data in vector format, achieved by using vector representations of word m-skip-n-grams and by using them for the subsequent clustering of a text document in order to convert the document into vector form.
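
The two stages named in the summary (building a cluster model over words, then turning a document into a vector of per-cluster totals) can be illustrated with a minimal Python sketch. It is not the patented implementation. The summary does not say how words are embedded or which clustering algorithm is used, so the hand-made two-dimensional embeddings in TOY_EMBEDDINGS and the small deterministic k-means below are assumptions chosen only for illustration, and the names kmeans, build_model and document_vector are hypothetical. Following the summary, each m-skip-n-gram is treated as a single word.

```python
import math
from collections import Counter

# Assumption: toy 2-D word vectors standing in for a real embedding model.
TOY_EMBEDDINGS = {
    "cat": (1.0, 0.0),
    "dog": (0.9, 0.1),
    "car": (0.0, 1.0),
    "bus": (0.1, 0.9),
}


def kmeans(vectors, k, iters=20):
    """Tiny deterministic k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign every vector to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def build_model(embeddings, k):
    """Stage 1: list the words, take their vectors, cluster them.

    Returns a mapping word -> cluster index.
    """
    words = sorted(embeddings)  # the list of m-skip-n-grams to be used
    labels = kmeans([embeddings[w] for w in words], k)
    return dict(zip(words, labels))


def document_vector(text, model, k):
    """Stage 2: count word occurrences and total them per cluster."""
    counts = Counter(text.lower().split())
    totals = [0] * k
    for word, n in counts.items():
        if word in model:  # words outside the model are ignored
            totals[model[word]] += n
    return totals  # ordered sequence of per-cluster totals


model = build_model(TOY_EMBEDDINGS, k=2)
vector = document_vector("the cat chased the dog past a bus", model, k=2)
print(model)
print(vector)
```

In this sketch the document vector has one component per cluster, so its length is fixed by k rather than by vocabulary size, and words with similar embeddings contribute to the same component.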