Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
Format: Article
Language: English
Online access: Order full text
Abstract: Vocabulary adaptation, which integrates new vocabulary into pre-trained language models (LMs), enables expansion to new languages and mitigates token over-fragmentation. However, existing approaches are limited by their reliance on heuristic or external embeddings. We propose VocADT, a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings while keeping the model's weights fixed. VocADT offers a flexible and scalable solution without requiring external resources or language constraints. Across 11 languages, spanning various scripts, levels of resource availability, and degrees of fragmentation, we demonstrate that VocADT outperforms the original Mistral model and other baselines across various multilingual tasks. We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation. We further fine-tune the adapted model on the generative task of machine translation and find that vocabulary adaptation remains beneficial after fine-tuning, and that VocADT is the most effective method.
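The core idea in the abstract — representing each new token's embedding as a learned linear combination of the frozen original embeddings — can be sketched as follows. This is not the authors' code; the sizes and the dense mixing matrix `A` are illustrative assumptions only.

```python
import numpy as np

# Hedged sketch of the linear-combination idea: new-vocabulary embeddings
# are formed as E_new = A @ E_old, where E_old (pre-trained embeddings)
# stays fixed and only the adapter matrix A would be trained.
rng = np.random.default_rng(0)

old_vocab, new_vocab, dim = 1000, 200, 16   # hypothetical sizes
E_old = rng.normal(size=(old_vocab, dim))    # frozen original embeddings

# Adapter: one mixing weight per (new token, old token) pair.
# In a real system A would be learned; here it is randomly initialized.
A = rng.normal(scale=1.0 / old_vocab, size=(new_vocab, old_vocab))

E_new = A @ E_old  # embeddings for the new vocabulary entries
assert E_new.shape == (new_vocab, dim)
```

Because gradients flow only into `A`, the rest of the model's weights can remain untouched, which is what makes the approach flexible across languages.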
DOI: 10.48550/arxiv.2410.09644