SERENGETI: Massively Multilingual Language Models for Africa
| Main Authors: | , , , |
| --- | --- |
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online Access: | Order full text |
Summary: Multilingual pretrained language models (mPLMs) acquire valuable,
generalizable linguistic information during pretraining and have advanced the
state of the art on task-specific finetuning. To date, only ~31 out of ~2,000
African languages are covered in existing language models. We ameliorate this
limitation by developing SERENGETI, a massively multilingual language model
that covers 517 African languages and language varieties. We evaluate our novel
models on eight natural language understanding tasks across 20 datasets,
comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms
other models on 11 datasets across the eight tasks, achieving 82.27 average
F_1. We also perform analyses of errors from our models, which allow us to
investigate the influence of language genealogy and linguistic similarity when
the models are applied under zero-shot settings. We will publicly release our
models for research (https://github.com/UBC-NLP/serengeti).
DOI: 10.48550/arxiv.2212.10785
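The abstract describes finetuning the released encoder on downstream natural language understanding tasks. As a rough illustration only, the sketch below finetunes a SERENGETI-style checkpoint on a toy binary text-classification set with Hugging Face transformers; the checkpoint name `UBC-NLP/serengeti`, the toy data, and the hyperparameters are assumptions rather than details taken from the paper or the linked repository.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Assumed checkpoint name; substitute the identifier actually released
# at https://github.com/UBC-NLP/serengeti.
model_id = "UBC-NLP/serengeti"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Toy two-example Swahili set standing in for one of the NLU benchmark datasets.
train = Dataset.from_dict(
    {"text": ["Habari njema sana!", "Hii ni mbaya kabisa."], "label": [1, 0]}
).map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=64)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="serengeti-cls",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        report_to=[],
    ),
    train_dataset=train,
)
trainer.train()
```

Zero-shot transfer of the kind analyzed in the paper would reuse a model finetuned this way on evaluation data in languages that were absent from the finetuning set.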