Improving Multilingual Models with Language-Clustered Vocabularies
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | State-of-the-art multilingual models depend on vocabularies that cover all of
the languages the model will expect to see at inference time, but the standard
methods for generating those vocabularies are not ideal for massively
multilingual applications. In this work, we introduce a novel procedure for
multilingual vocabulary generation that combines the separately trained
vocabularies of several automatically derived language clusters, thus balancing
the trade-off between cross-lingual subword sharing and language-specific
vocabularies. Our experiments show improvements across languages on key
multilingual benchmark tasks TyDi QA (+2.9 F1), XNLI (+2.1%), and WikiAnn NER
(+2.8 F1) and a factor of 8 reduction in out-of-vocabulary rate, all without
increasing the size of the model or data. |
---|---|
DOI: | 10.48550/arxiv.2010.12777 |
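
The summary describes the procedure only at a high level: derive language clusters automatically, train one vocabulary per cluster, and combine the results. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the article's implementation: character bigrams stand in for subword units, a greedy Jaccard-similarity pass stands in for the clustering step, frequency ranking stands in for a real vocabulary trainer, and all function names, thresholds, and the toy corpora are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count character n-grams per whitespace token (assumed stand-in for subwords)."""
    grams = Counter()
    for word in text.split():
        for i in range(max(len(word) - n + 1, 1)):
            grams[word[i:i + n]] += 1
    return grams


def jaccard(a, b):
    """Jaccard similarity between the key sets of two counters."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_languages(lang_units, threshold):
    """Greedy clustering (assumption): a language joins the first cluster
    whose seed language is at least `threshold` similar, else starts a new one."""
    clusters = []
    for lang in sorted(lang_units):
        for cluster in clusters:
            if jaccard(lang_units[lang], lang_units[cluster[0]]) >= threshold:
                cluster.append(lang)
                break
        else:
            clusters.append([lang])
    return clusters


def cluster_vocab(cluster, lang_units, size):
    """Build one vocabulary per cluster from the pooled unit counts."""
    total = Counter()
    for lang in cluster:
        total.update(lang_units[lang])
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [unit for unit, _ in ranked[:size]]


def build_vocab(corpora, per_cluster_size=8, threshold=0.1):
    """Cluster the languages, then take the union of the per-cluster vocabularies."""
    lang_units = {lang: char_ngrams(text) for lang, text in corpora.items()}
    clusters = cluster_languages(lang_units, threshold)
    vocab = set()
    for cluster in clusters:
        vocab.update(cluster_vocab(cluster, lang_units, per_cluster_size))
    return clusters, vocab


# Toy corpora (hypothetical), one short sentence per language.
corpora = {
    "en": "the cat sat on the mat",
    "de": "die katze sass auf der matte",
    "ru": "кошка сидела на коврике",
    "uk": "кішка сиділа на килимку",
}
clusters, vocab = build_vocab(corpora)
print(clusters)  # [['de', 'en'], ['ru', 'uk']]
print(len(vocab))
```

In this toy setup, each cluster receives its own vocabulary budget, so the Cyrillic-script languages are not crowded out by the Latin-script ones, while languages within a cluster still share units. That is the trade-off the summary refers to.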