LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Larger transformer models always perform better on various tasks but require more costs to scale up the model size. To efficiently enlarge models, the mixture-of-experts (MoE) architecture is widely adopted, which consists of a gate network and a series of experts and keep the training cost constant...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Nie, Xiaonan, Liu, Qibin, Fu, Fangcheng, Zhu, Shenhan, Miao, Xupeng, Li, Xiaoyang, Zhang, Yang, Liu, Shouda, Cui, Bin
Format:	Artikel
Sprache:	eng
Schlagworte:	Computer Science - Distributed, Parallel, and Cluster Computing
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Schreiben Sie den ersten Kommentar!