LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Larger transformer models tend to perform better on various tasks but require greater cost to scale up the model size. To enlarge models efficiently, the mixture-of-experts (MoE) architecture is widely adopted: it consists of a gate network and a set of experts, and it keeps the training cost nearly constant...
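The snippet below is a minimal sketch of how such a sparsely gated layer works, assuming a standard top-k routing scheme; it is illustrative PyTorch code, not the implementation from the paper, and the names SimpleMoE, num_experts, and top_k are invented for the example. Each token is dispatched to only k experts, so per-token compute stays roughly constant as the number of experts grows.

# Minimal sketch of a top-k gated MoE layer (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # gate network producing routing scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)               # routing probabilities
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only the k selected experts run for each token, so FLOPs per token are
        # independent of the total number of experts.
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                             # tokens routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            weight = topk_scores[token_ids, slot].unsqueeze(-1)
            out[token_ids] += weight * expert(x[token_ids])
        return out

if __name__ == "__main__":
    moe = SimpleMoE(d_model=64)
    print(moe(torch.randn(16, 64)).shape)                      # torch.Size([16, 64])

In distributed MoE training, the dispatch of tokens to experts on other devices is realized with all-to-all communication; that communication cost is what the paper's LSH-based approach targets.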


Bibliographic Details
Main Authors: Nie, Xiaonan; Liu, Qibin; Fu, Fangcheng; Zhu, Shenhan; Miao, Xupeng; Li, Xiaoyang; Zhang, Yang; Liu, Shouda; Cui, Bin
Format: Article
Language: English