Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Main authors: |  |
---|---|
Format: | Article |
Language: | English |
Subjects: |  |
Online access: | Order full text |
Abstract: | We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, MMLU, and TruthfulQA. |
---|---|
DOI: | 10.48550/arxiv.2410.03730 |
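
The abstract names tokenizer optimization as one of the models' development principles. As a hedged illustration (not taken from the paper itself), the sketch below shows how a tokenizer's per-language efficiency could be probed via its fertility, i.e. the number of tokens produced per whitespace-separated word, a common proxy for how well a tokenizer covers a language. The Hugging Face repo id and the `trust_remote_code` flag are assumptions to be checked against the published model card, not details stated in this record.

```python
# Minimal sketch: compare tokenizer fertility (tokens per whitespace word)
# across a few EU languages. Lower fertility generally means the tokenizer
# segments that language more efficiently.
from transformers import AutoTokenizer

# Assumed repo id; verify against the actual model card before use.
REPO_ID = "openGPT-X/Teuken-7B-instruct-research-v0.4"

SAMPLES = {
    "English": "The committee approved the proposal after a short debate.",
    "German":  "Der Ausschuss billigte den Vorschlag nach einer kurzen Debatte.",
    "French":  "Le comité a approuvé la proposition après un bref débat.",
    "Polish":  "Komisja zatwierdziła wniosek po krótkiej debacie.",
}

def fertility(tokenizer, text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    return len(tokenizer.tokenize(text)) / len(text.split())

# trust_remote_code=True may be required for a custom tokenizer implementation;
# this is an assumption, not confirmed by this record.
tok = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)

for lang, text in SAMPLES.items():
    print(f"{lang:8s} fertility = {fertility(tok, text):.2f}")
```

Such a per-language comparison is one way to sanity-check a multilingual tokenizer against an English-centric one; the paper's own tokenizer evaluation methodology may differ.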