N-gram Analysis of Everyday Russian Speech: in Search of Multiword Units
Published in: | Proceedings of the XXth Conference of Open Innovations Association FRUCT 2024-04, Vol.35 (2), p.838-https://youtu.be/9hsujc039pM |
---|---|
Main authors: | , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Based on a statistical analysis of transcripts of everyday spoken Russian recordings, this research searches for stable multiword units. These units encompass a diverse set of multiword elements, spanning linguistic phenomena such as compounds, idioms, colligations, collocations, collostructions, and multiword named entities. N-gram analysis facilitates the identification of these units by capturing the most frequently recurring word sequences. Data for this research came from the transcribed part of the ORD corpus, known as “One Speech Day”, which contains about 1,000,000 tokens. Recorded continuously with volunteer participants in natural conversational settings, this corpus is an excellent resource for studying everyday Russian dialogue. The top 500 bigrams and trigrams were examined and categorized to identify the most prevalent stable multiword units. These findings are relevant to NLP tasks involving the processing of spontaneous Russian speech (primarily speech recognition) as well as to teaching Russian as a second language. |
---|---|
ISSN: | 2305-7254 2343-0737 |
DOI: | 10.5281/zenodo.11096963 |
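
The n-gram counting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the tokenizer, the function names (`tokenize`, `ngrams`, `top_ngrams`), and the three sample utterances are assumptions made up for illustration, and no ORD corpus data is used. The sketch counts word sequences within each utterance, without crossing utterance boundaries, and returns the most frequent ones.

```python
from collections import Counter
import re


def tokenize(utterance):
    # Assumed tokenizer: lowercase, then keep runs of letters/digits,
    # allowing internal hyphens (e.g. "как-то"). Works for Cyrillic
    # because Python's str patterns are Unicode-aware by default.
    return re.findall(r"[^\W_]+(?:-[^\W_]+)*", utterance.lower())


def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return zip(*(tokens[i:] for i in range(n)))


def top_ngrams(utterances, n, k):
    # Count n-grams per utterance so that no n-gram spans two utterances,
    # then return the k most frequent as (ngram_tuple, count) pairs.
    counts = Counter()
    for utterance in utterances:
        counts.update(ngrams(tokenize(utterance), n))
    return counts.most_common(k)


# Hypothetical toy utterances, invented for this example.
sample = [
    "ну я не знаю",
    "я не знаю что делать",
    "ну вот я не знаю",
]

for n in (2, 3):
    for gram, count in top_ngrams(sample, n, 3):
        print(n, " ".join(gram), count)
```

In this toy sample the trigram "я не знаю" occurs three times, once per utterance. A real study of the kind described would rank the top 500 bigrams and trigrams by such counts and then categorize them manually; that categorization step is outside the scope of this sketch.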