Ancient Greek language models

In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values. Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split i...

Detailed description

Bibliographic details
Main authors:	Stopponi, Pedrazzini, Peels-Matthey, McGillivray, Nissim
Format:	Dataset
Language:	eng
Keywords:	Ancient Greek; count-based; graph-based; syntactic embeddings; language models; word embeddings; word vector representations

Description
Summary:	In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values. Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.

[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.

[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

Diachronica models

Training data
Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:
   Classical subcorpus
   Hellenistic subcorpus
   Whole corpus
Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenestic is appended to the name of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, full_ for the whole corpus).

Models

Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.

Word2Vec
Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.
b. Skipgram with Negative S
DOI:	10.5281/zenodo.8369515
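
To make the count-based hyperparameters above (window=5, k=1, alpha=0.75) more concrete, the following is a minimal sketch of how a PPMI weighting of this kind is commonly computed. It is not the LSCDetection implementation. It assumes the usual reading of the parameters: window is a symmetric co-occurrence window, k is a shift subtracted as log(k) (so k=1 means no shift), and alpha is the exponent used to smooth the context distribution. The function names and the toy sentences (transliterated tokens) are invented for illustration and do not come from the Diorisis corpus.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count (word, context) pairs inside a symmetric window of tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def ppmi(counts, k=1, alpha=0.75):
    """Shifted PPMI with context-distribution smoothing (assumed reading of k, alpha)."""
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    total = sum(counts.values())

    # Raise context counts to the power alpha, then renormalise.
    smoothed = {c: n ** alpha for c, n in context_totals.items()}
    smoothed_total = sum(smoothed.values())

    shift = math.log(k)  # log(1) == 0, so k=1 leaves PMI unshifted
    weights = {}
    for (word, context), n in counts.items():
        p_joint = n / total
        p_word = word_totals[word] / total
        p_context = smoothed[context] / smoothed_total
        value = math.log(p_joint / (p_word * p_context)) - shift
        if value > 0:  # keep positive values only; the rest are implicitly 0
            weights[(word, context)] = value
    return weights


# Toy input, invented for illustration only.
toy_sentences = [
    ["theos", "agathos"],
    ["theos", "agathos"],
    ["logos", "kalos"],
]
toy_counts = cooccurrence_counts(toy_sentences, window=5)
toy_weights = ppmi(toy_counts, k=1, alpha=0.75)
for pair in sorted(toy_weights):
    print(pair, round(toy_weights[pair], 3))
```

In this sketch, a larger k discards weaker associations because more pairs fall below zero after the shift, while alpha below 1 dampens the PMI bias towards rare contexts.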