N-gram Analysis of Everyday Russian Speech: in Search of Multiword Units


Bibliographic details
Published in: Proceedings of the XXth Conference of Open Innovations Association FRUCT, 2024-04, Vol. 35 (2), p.838-https://youtu.be/9hsujc039pM
Main authors: Tatiana Sherstinova, Olga Markovich
Format: Article
Language: English
Description
Abstract: Based on a statistical analysis of transcripts of everyday spoken Russian, this study searches for stable multiword units. These units form a diverse set that spans several linguistic phenomena: compounds, idioms, colligations, collocations, collostructions, and multiword named entities. N-gram analysis helps identify such units by capturing the most frequently recurring word sequences. The data come from the transcribed part of the ORD corpus, known as “One Speech Day”, which contains about 1,000,000 tokens. Recorded continuously with volunteer participants in natural conversational settings, the corpus is one of the best resources for studying everyday Russian dialogue. The top 500 bigrams and trigrams were examined and categorized, and the most prevalent stable multiword units were identified. The findings are relevant to NLP tasks that involve processing spontaneous Russian speech (primarily speech recognition) and to teaching Russian as a second language.
ISSN: 2305-7254, 2343-0737
DOI: 10.5281/zenodo.11096963
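
The n-gram counting mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the tokenizer, the function names `tokenize` and `top_ngrams`, and the three sample utterances are assumptions made up for illustration, and none of the data comes from the ORD corpus. The sketch counts word sequences of length n within each utterance and returns the most frequent ones.

```python
import re
from collections import Counter


def tokenize(utterance):
    # Hypothetical tokenizer: lowercase, then keep runs of word characters.
    # In Python 3, \w is Unicode-aware for str patterns, so Cyrillic is matched.
    return re.findall(r"\w+", utterance.lower())


def top_ngrams(utterances, n, k):
    # Count n-grams per utterance so that no n-gram spans two utterances.
    counts = Counter()
    for utterance in utterances:
        tokens = tokenize(utterance)
        # zip over n shifted copies of the token list yields consecutive n-tuples.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Invented toy utterances, used only to exercise the functions.
sample = [
    "ну я не знаю",
    "я не знаю что делать",
    "ну вот я не знаю",
]

top_bigrams = top_ngrams(sample, n=2, k=3)
top_trigrams = top_ngrams(sample, n=3, k=1)
```

On this toy sample the most frequent trigram is "я не знаю", which occurs three times. A study of the kind described would apply the same counting to the full transcripts and then inspect the top-ranked bigrams and trigrams by hand.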