Impact of training corpus size on the quality of different types of language models for Serbian
Main authors: | , , |
---|---|
Format: | Conference proceedings |
Language: | eng |
Summary: | This paper describes a study of the relationship between language model quality and the size of the text corpus used for training. Three types of n-gram models developed for the Serbian language were included in the study: word-based, lemma-based and class-based models. These models were created to address the data sparsity problem, which is particularly pronounced because of the high degree of inflection in Serbian. The three model types were trained on corpora of different sizes and evaluated by perplexity on authentic text and on text with random word order, in order to obtain discrimination coefficient values. These values show that the three model types differ in their robustness to data sparsity, and they indicate a way of combining the models to achieve the best language representation for a given training corpus. |
---|---|
DOI: | 10.1109/TELFOR.2012.6419309 |
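
The evaluation described in the summary (perplexity on authentic text versus text with random word order) can be illustrated with a minimal sketch. The record does not define the discrimination coefficient, so the ratio used below (perplexity on shuffled text divided by perplexity on the original text) is an assumption, as are the add-one smoothed bigram model, the function names and the toy token list; none of these are taken from the paper.

```python
import math
import random
from collections import Counter


def train_bigram(tokens):
    """Count unigrams and bigrams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(tokens, unigrams, bigrams):
    """Perplexity of a token sequence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams) + 1  # one extra slot for unseen words
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = 0.0
    for prev, word in pairs:
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))


def discrimination_coefficient(tokens, unigrams, bigrams, seed=0):
    """ASSUMED definition: perplexity on shuffled text / perplexity on original text."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)  # randomize word order reproducibly
    return perplexity(shuffled, unigrams, bigrams) / perplexity(tokens, unigrams, bigrams)


if __name__ == "__main__":
    # Toy data for illustration only; not from the study.
    train = "the model reads the text and the model scores the text".split()
    test = "the model scores the text".split()
    uni, bi = train_bigram(train)
    print(perplexity(test, uni, bi))
    print(discrimination_coefficient(test, uni, bi))
```

Under this assumed definition, a value well above 1 would mean the model separates natural word order from random word order clearly; the same functions could be run on word, lemma or class sequences to compare the three model types.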