PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation
Format: Article
Language: English
Abstract: Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions of parameters such as GPT-3 have demonstrated strong performance on natural language understanding and generation with \textit{few-shot in-context} learning. In this work, we present our practice on training large-scale autoregressive language models named PanGu-$\alpha$, with up to 200 billion parameters. PanGu-$\alpha$ is developed under the MindSpore framework and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently: data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance the generalization ability of PanGu-$\alpha$, we collect 1.1TB of high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$\alpha$ in various scenarios, including text summarization, question answering, and dialogue generation. Moreover, we investigate the effect of model scale on few-shot performance across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$\alpha$ in performing various tasks under few-shot or zero-shot settings.
DOI: 10.48550/arxiv.2104.12369
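
The parallelism strategy described in the abstract can be illustrated with a short configuration sketch. The following is a minimal, hypothetical example of how data parallelism, op-level model parallelism, pipeline parallelism, optimizer parallelism, and rematerialization might be expressed with MindSpore's auto-parallel APIs; the toy layer, hidden size, and pipeline stage count are assumptions for illustration, not the authors' published training code.

```python
# A minimal sketch, assuming MindSpore 1.x APIs, of how the five parallelism
# dimensions named in the abstract map onto MindSpore's auto-parallel
# configuration. The toy layer, hidden size, and pipeline stage count are
# illustrative assumptions, not the authors' actual training setup.
from mindspore import nn, context
from mindspore.context import ParallelMode

# Graph mode on Ascend, matching the Ascend 910 cluster described above.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

# Data, op-level, pipeline, and optimizer parallelism are composed through
# one auto-parallel context call.
context.set_auto_parallel_context(
    parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL,  # op-level (sharded) model parallelism
    device_num=2048,                 # cluster size from the abstract
    pipeline_stages=16,              # hypothetical pipeline depth
    enable_parallel_optimizer=True,  # optimizer model parallelism
)

class ToyBlock(nn.Cell):
    """Stand-in for one transformer layer; PanGu-alpha's real layers differ."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.dense = nn.Dense(hidden, hidden)
        # Rematerialization: recompute this cell's activations during the
        # backward pass instead of storing them, trading compute for memory.
        self.dense.recompute()

    def construct(self, x):
        return self.dense(x)
```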