Scaling Laws for Neural Machine Translation
Main Authors: | , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Keywords: | |
Online Access: | Order full text |
Summary: | We present an empirical study of scaling properties of encoder-decoder
Transformer models used in neural machine translation (NMT). We show that
cross-entropy loss as a function of model size follows a certain scaling law.
Specifically, (i) we propose a formula that describes the scaling behavior of
cross-entropy loss as a bivariate function of encoder and decoder size, and
show that it gives accurate predictions under a variety of scaling approaches
and languages; we show that the total number of parameters alone is not
sufficient for such purposes. (ii) We observe different power-law exponents
when scaling the decoder vs. the encoder, and provide recommendations for
optimal allocation of encoder/decoder capacity based on this observation.
(iii) We also report that the scaling behavior of the model is acutely
influenced by composition bias of the train/test sets, which we define as any
deviation from naturally generated text (via either machine-generated or
human-translated text). We observe that natural text on the target side enjoys
scaling, which manifests as successful reduction of the cross-entropy loss.
(iv) Finally, we investigate the relationship between the cross-entropy loss
and the quality of the generated translations. We find two different behaviors,
depending on the nature of the test data. For test sets that were originally
translated from the target language to the source language, both the loss and
the BLEU score improve as model size increases. In contrast, for test sets
originally translated from the source language to the target language, the loss
improves, but the BLEU score stops improving after a certain threshold. We
release generated text from all models used in this study. |
---|---|
DOI: | 10.48550/arxiv.2109.07740 |
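
The summary does not spell out the functional form of the proposed scaling law. As a minimal illustration only, the sketch below assumes a bivariate power law of the form L(N_e, N_d) = alpha * N_e^(-p_e) * N_d^(-p_d) + L_inf, where N_e and N_d stand for encoder and decoder size; the symbols, the toy measurements, and the use of `scipy.optimize.curve_fit` are illustrative assumptions, not the paper's actual formula or fitting procedure.

```python
# Illustrative sketch only: the functional form, toy numbers, and fitting
# procedure below are assumptions, not taken from the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(sizes, alpha, p_e, p_d, L_inf):
    """Assumed bivariate power law: alpha * N_e**-p_e * N_d**-p_d + L_inf."""
    n_e, n_d = sizes
    return alpha * n_e ** (-p_e) * n_d ** (-p_d) + L_inf

# Toy measurements: encoder/decoder sizes (in units of 1e8 parameters)
# and a dev-set cross-entropy loss observed for each configuration.
n_e  = np.array([1.0, 2.0, 4.0, 1.0, 2.0, 4.0])
n_d  = np.array([1.0, 1.0, 1.0, 4.0, 4.0, 4.0])
loss = np.array([2.100, 2.041, 1.987, 1.896, 1.857, 1.822])

# Least-squares fit of the four free parameters of the assumed law.
popt, _ = curve_fit(scaling_law, (n_e, n_d), loss, p0=[1.0, 0.2, 0.2, 1.0])
alpha, p_e, p_d, L_inf = popt
print(f"alpha={alpha:.3f}  p_e={p_e:.3f}  p_d={p_d:.3f}  L_inf={L_inf:.3f}")

# Comparing the fitted exponents p_e and p_d would indicate whether additional
# parameters are better spent on the encoder or the decoder, in the spirit of
# observation (ii) in the summary above.
```

Separate exponents for encoder and decoder are used here because the summary states that a single total-parameter count is not sufficient to predict the loss; the specific values fitted above are synthetic and carry no empirical meaning.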