Improving Large Language Model Russian Adaptation with Preliminary Vocabulary Optimization
Published in: Lobachevskii Journal of Mathematics, 2024-07, Vol. 45 (7), pp. 3211-3219
Main authors:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: Most of a Large Language Model's (LLM) text comprehension capabilities come from generative pre-training on large corpora that include texts from different domains, languages, and tasks. As a consequence, LLM performance in a specific language depends on that language's representation in the training data, which for most state-of-the-art models is biased towards English. The issue is commonly alleviated by further pre-training on the target language; however, due to limited model capacity, this often results in knowledge forgetting and degraded text understanding. We argue that the performance drop can be avoided by employing parameter-efficient tuning methods that preserve the integrity of the original model. In this work, we investigate the effectiveness of different vocabulary optimization and adapter tuning schemes for Russian adaptation of LLMs. Our experimental results with the Solar-10.7B LLM show that the language adaptation process can be substantially accelerated by transferring the embeddings from smaller language-tuned counterparts. Moreover, we find that preliminary vocabulary optimization stabilizes subsequent adapter tuning, thus improving target language generalization. By applying our two-stage language adaptation approach, we obtain state-of-the-art results on the Russian SuperGLUE and MMLU-RU language understanding datasets for sub-30B parameter open-source LLMs.
ISSN: 1995-0802, 1818-9962
DOI: 10.1134/S1995080224604120
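
The abstract outlines a two-stage recipe: first transfer embeddings from a smaller language-tuned counterpart into the target LLM (vocabulary optimization), then train lightweight adapters while the original weights stay frozen. The sketch below shows one plausible way to realize this with the Hugging Face transformers and peft libraries; the donor model identifier, the choice of LoRA as the adapter method, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of the two-stage idea described in the abstract, not the
# authors' exact pipeline. The donor model path and LoRA settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "upstage/SOLAR-10.7B-v1.0"               # target model family from the abstract
donor_id = "path/to/smaller-russian-tuned-model"   # hypothetical language-tuned counterpart

base_tok = AutoTokenizer.from_pretrained(base_id)
donor_tok = AutoTokenizer.from_pretrained(donor_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
donor = AutoModelForCausalLM.from_pretrained(donor_id, torch_dtype=torch.bfloat16)

# Stage 1: vocabulary optimization / embedding transfer.
# Copy input embeddings for tokens shared by both vocabularies; tokens unique to
# the target keep their original vectors. (If hidden sizes differ, a learned
# projection would be needed instead of a direct copy.)
base_emb = base.get_input_embeddings().weight.data
donor_emb = donor.get_input_embeddings().weight.data
if base_emb.shape[1] == donor_emb.shape[1]:
    donor_vocab = donor_tok.get_vocab()
    with torch.no_grad():
        for token, base_idx in base_tok.get_vocab().items():
            donor_idx = donor_vocab.get(token)
            if donor_idx is not None:
                base_emb[base_idx].copy_(donor_emb[donor_idx])

# Stage 2: parameter-efficient adapter tuning (LoRA) on Russian text, keeping
# the original weights frozen to limit knowledge forgetting.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# ...continue with a standard causal-LM training loop on a Russian corpus.
```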