Automation of archival documents morphological tagging (Preprint)

Bibliographic details
Published in: Mathematical Physics and Computer Modeling, 2019-08, Vol. 22, No. 4, pp. 53-63
Main authors: Komendantov, Anatoly S; Matveev, Alexander G; Svetlov, Andrey V
Format: Article
Language: English
Description
Summary: The paper describes an add-on to the stemming tool MyStem by I. Segalovich. We designed the application to give MyStem a convenient graphical interface that is intuitive and easy to learn for users who do not specialize in information technology. It turned out that MyStem correctly processes outdated vocabulary if it is passed to the program in modern Cyrillic. In addition to the convenient interface, our program has an option for working with the outdated Cyrillic alphabet. When it is turned on, the letters zelo and omega, for instance, are replaced by «ks» and «o» respectively; only then is the text passed to MyStem for analysis, after which the original characters are restored in the processed document. Our add-on thus intercepts the output of the MyStem tool, reformats it, and analyzes it in a special way. The application also allows homonymy to be resolved manually when the automatic tagging assigns incorrect morphological characteristics to a word. The main purpose of the application is to prepare the morphological tagging of documents of the archival fund «Mikhailovsky Stanichny Ataman» in order to create a linguistic corpus. In the course of this work, we solved the problem of correctly processing texts that contain outdated Cyrillic characters. To implement the functional and user-friendly graphical interface, we use the JavaFX platform (OpenJFX).
ISSN: 2587-6325, 2587-6902
DOI:10.15688/mpcm.jvolsu.2019.4.4
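
The substitution step described in the summary (replace outdated letters, analyze, then restore the original spelling) can be illustrated with a minimal Java sketch. It is not the authors' implementation: the class and method names are hypothetical, the substitution table (omega to «о», ksi to «кс») is an assumed example because the record does not give the full table, and the analyses are placeholder strings rather than real MyStem output. The sketch also assumes the analyzer returns exactly one analysis per token, in input order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the "replace, analyze, restore" round trip; all names are hypothetical. */
final class OldCyrillicRoundTrip {

    // Assumed substitution table: outdated letter -> modern spelling.
    private static final Map<Character, String> MODERN = new LinkedHashMap<>();
    static {
        MODERN.put('\u0461', "\u043E");       // small omega -> small o
        MODERN.put('\u0460', "\u041E");       // capital omega -> capital O
        MODERN.put('\u046F', "\u043A\u0441"); // small ksi -> "ks"
        MODERN.put('\u046E', "\u041A\u0441"); // capital ksi -> "Ks"
    }

    /** Rewrites one word in modern Cyrillic so the analyzer can recognize it. */
    static String normalize(String word) {
        StringBuilder out = new StringBuilder();
        for (char c : word.toCharArray()) {
            String modern = MODERN.get(c);
            out.append(modern != null ? modern : String.valueOf(c));
        }
        return out.toString();
    }

    /** Splits text on whitespace; the original spellings are kept for later restoration. */
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Builds the modernized text that would be handed to the analyzer. */
    static String analyzerInput(List<String> originals) {
        List<String> normalized = new ArrayList<>();
        for (String w : originals) {
            normalized.add(normalize(w));
        }
        return String.join(" ", normalized);
    }

    /**
     * Re-attaches the original spellings to the analyses. Assumes one analysis
     * per token, in input order (an assumption of this sketch).
     */
    static List<String> restore(List<String> originals, List<String> analyses) {
        if (originals.size() != analyses.size()) {
            throw new IllegalArgumentException("token/analysis count mismatch");
        }
        List<String> tagged = new ArrayList<>();
        for (int i = 0; i < originals.size(); i++) {
            tagged.add(originals.get(i) + "{" + analyses.get(i) + "}");
        }
        return tagged;
    }
}
```

In this sketch the original tokens are kept next to their modernized forms, so restoring is a positional pairing and does not require reversing the substitutions. A reverse substitution would be ambiguous, since a modern «о» in the output could have been either «о» or omega in the source.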