"Vorbe\c{s}ti Rom\^ane\c{s}te?" A Recipe to Train Powerful Romanian LLMs with English Instructions
Main authors: | |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Order full text |
Summary: | In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds their performance in other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and to train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage research on Romanian LLMs while concurrently creating a generalizable recipe, adequate for other low- or less-resourced languages. |
DOI: | 10.48550/arxiv.2406.18266 |