AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
Format: | Article |
Language: | English |
Abstract: | In this work, we demonstrate that multilingual large-scale
sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising
and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners
than decoder-only models on various tasks. In particular, we train a 20 billion
parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B)
and show that it achieves state-of-the-art (SOTA) performance on 1-shot
summarization tasks, outperforming a much larger 540B PaLM decoder model.
AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for
low-resource languages, across almost all language pairs supported by the model
(Arabic, English, French, German, Hindi, Italian, Japanese, Marathi,
Portuguese, Spanish, Tamil, and Telugu) on the Flores-101 dataset. We also show
that, in the zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on the
SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual
tasks such as XNLI, XCOPA, PAWS-X, and XWinograd. Overall, our results present
a compelling case
for seq2seq models as a powerful alternative to decoder-only models for
Large-scale Language Model (LLM) training. |
DOI: | 10.48550/arxiv.2208.01448 |
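To make the few-shot setup described in the abstract concrete, the snippet below is a minimal sketch of 1-shot summarization prompting with a seq2seq model through the Hugging Face `transformers` API. The `google/flan-t5-base` checkpoint is only a stand-in for illustration; nothing here assumes how AlexaTM 20B itself is distributed or loaded.

```python
# Minimal sketch of 1-shot prompting with a seq2seq model.
# google/flan-t5-base is a stand-in checkpoint, not AlexaTM 20B itself.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"  # swap in another seq2seq checkpoint as needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 1-shot summarization prompt: one worked example followed by the test input.
prompt = (
    "Article: The city council approved the new transit budget on Tuesday.\n"
    "Summary: Council approves transit budget.\n\n"
    "Article: Researchers released a 20-billion-parameter multilingual seq2seq "
    "model that outperforms larger decoder-only models on several few-shot "
    "benchmarks.\n"
    "Summary:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```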