PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation
Format: Article
Language: English
Abstract: Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions of parameters such as GPT-3 have demonstrated strong performance on natural language understanding and generation with \textit{few-shot in-context} learning. In this work, we present our practice on training large-scale autoregressive language models named PanGu-$\alpha$, with up to 200 billion parameters. PanGu-$\alpha$ is developed under the MindSpore framework and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently: data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance the generalization ability of PanGu-$\alpha$, we collect 1.1TB of high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$\alpha$ in various scenarios, including text summarization, question answering, and dialogue generation. Moreover, we investigate the effect of model scale on few-shot performance across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$\alpha$ in performing various tasks under few-shot or zero-shot settings.
DOI: 10.48550/arxiv.2104.12369
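
The parallelism strategy described in the abstract can be illustrated with a short configuration sketch. The following is a minimal, hypothetical example of how data parallelism, op-level model parallelism, pipeline parallelism, optimizer parallelism, and rematerialization might be expressed with MindSpore's auto-parallel APIs; the toy layer, hidden size, and pipeline stage count are assumptions for illustration, not the authors' published training code.

```python
# A minimal sketch, assuming MindSpore 1.x APIs, of how the five parallelism
# dimensions named in the abstract map onto MindSpore's auto-parallel
# configuration. The toy layer, hidden size, and pipeline stage count are
# illustrative assumptions, not the authors' actual training setup.
from mindspore import nn, context
from mindspore.context import ParallelMode

# Graph mode on Ascend, matching the Ascend 910 cluster described above.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

# Data, op-level, pipeline, and optimizer parallelism are composed through
# one auto-parallel context call.
context.set_auto_parallel_context(
    parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL,  # op-level (sharded) model parallelism
    device_num=2048,                 # cluster size from the abstract
    pipeline_stages=16,              # hypothetical pipeline depth
    enable_parallel_optimizer=True,  # optimizer model parallelism
)

class ToyBlock(nn.Cell):
    """Stand-in for one transformer layer; PanGu-alpha's real layers differ."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.dense = nn.Dense(hidden, hidden)
        # Rematerialization: recompute this cell's activations during the
        # backward pass instead of storing them, trading compute for memory.
        self.dense.recompute()

    def construct(self, x):
        return self.dense(x)
```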