A data-based classification of Slavic languages: Indices of qualitative variation applied to grapheme frequencies
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Ord's graph is a simple graphical method for displaying frequency
distributions of data or theoretical distributions in the two-dimensional
plane. Its coordinates are ratios of the first three moments, either
empirical or theoretical. A modification of Ord's graph based on
ratios of indices of qualitative variation is presented. This
modification makes the graph applicable to categorical data as well.
In addition, the indices are normalized to take values between 0 and 1, which
makes it possible to compare data sets divided into different numbers of
categories. Both the original and the new graph are used to display grapheme
frequencies in eleven Slavic languages. As the original Ord's graph requires
numbers to be assigned to the categories, graphemes were ranked in decreasing
order of frequency. Data were taken from parallel corpora, i.e., we work with
grapheme frequencies from a Russian novel and its translations into ten other
Slavic languages. Cluster analysis is then applied to the graph coordinates.
While the original graph yields results that are not linguistically
interpretable, the modification reveals meaningful relations among the
languages. |
---|---|
DOI: | 10.48550/arxiv.1504.03608 |
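
The two kinds of coordinates mentioned in the summary can be illustrated with a minimal Python sketch. It assumes the usual definition of Ord's coordinates, I = m2/m1 and S = m3/m2, where m1 is the mean and m2, m3 are the second and third central moments of the rank distribution. The record does not say which indices of qualitative variation the authors use, so the normalized Gini-type index below is only one well-known example of an index scaled to the interval from 0 to 1. The function names `ord_coordinates` and `normalized_gini_iqv` are hypothetical.

```python
from typing import Sequence, Tuple


def ord_coordinates(freqs: Sequence[float]) -> Tuple[float, float]:
    """Return (I, S) = (m2 / m1, m3 / m2) for frequency-ranked categories.

    Assumption: categories receive the numbers 1, 2, ... after being
    sorted by decreasing frequency, as described in the summary.
    Not defined when all mass falls on one category (m2 == 0).
    """
    ranked = sorted(freqs, reverse=True)
    total = sum(ranked)
    probs = [f / total for f in ranked]
    ranks = range(1, len(probs) + 1)
    m1 = sum(r * p for r, p in zip(ranks, probs))              # mean rank
    m2 = sum((r - m1) ** 2 * p for r, p in zip(ranks, probs))  # variance
    m3 = sum((r - m1) ** 3 * p for r, p in zip(ranks, probs))  # 3rd central moment
    return m2 / m1, m3 / m2


def normalized_gini_iqv(freqs: Sequence[float]) -> float:
    """Illustrative index of qualitative variation, scaled to [0, 1].

    Assumption: this Gini-type index, k / (k - 1) * (1 - sum(p_i ** 2)),
    is used only as an example; the paper's indices may differ.
    Requires at least two categories.
    """
    k = len(freqs)
    total = sum(freqs)
    concentration = sum((f / total) ** 2 for f in freqs)
    return k / (k - 1) * (1.0 - concentration)


if __name__ == "__main__":
    toy_counts = [3, 1]  # made-up counts, not data from the paper
    print(ord_coordinates(toy_counts))
    print(normalized_gini_iqv(toy_counts))
```

The index needs no numbering of the categories, which is why a graph built on such indices applies to categorical data directly, whereas the moment-based coordinates depend on the ranks assigned to the graphemes.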