MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services
Saved in:
Main authors: | , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Abstract: | While modern internet services, such as chatbots, search engines, and online
advertising, demand the use of large-scale deep neural networks (DNNs),
distributed training and inference over heterogeneous computing systems are
desired to facilitate these DNN models. Mixture-of-Experts (MoE) is one of the
most common strategies to lower the cost of training subject to the overall
size of models/data through gating and parallelism in a divide-and-conquer
fashion. While DeepSpeed has made efforts in carrying out large-scale MoE
training over heterogeneous infrastructures, the efficiency of training and
inference could be further improved from several system aspects, including load
balancing, communication/computation efficiency, and memory footprint limits.
In this work, we present a novel MoESys that boosts efficiency in both
large-scale training and inference. Specifically, in the training procedure,
the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch
and Fusion communication over Hierarchical storage, so as to enjoy efficient
parallelisms. For scalable inference on a single node, especially when the
model size is larger than the GPU memory, MoESys builds the CPU-GPU memory jointly
into a ring of sections to load the model, and executes the computation tasks
across the memory sections in a round-robin manner for efficient inference. We
carried out extensive experiments to evaluate MoESys, in which MoESys successfully
trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated
Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The
comparison against the state-of-the-art shows that MoESys outperformed
DeepSpeed, with 33% higher throughput (tokens per second) in training and 13%
higher throughput in inference in general. Particularly, under unbalanced MoE
tasks, e.g., UFO, MoESys achieved 64% higher throughput with an 18% lower memory
footprint. |
---|---|
DOI: | 10.48550/arxiv.2205.10034 |
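
The abstract describes MoE as lowering training cost "through gating and parallelism in a divide-and-conquer fashion." The snippet below is a minimal, generic sketch of Sparsely-Gated top-k routing in PyTorch to make that concrete; it is not MoESys' implementation, and the module and parameter names (`SparseMoE`, `d_model`, `top_k`, the expert MLP shape) are illustrative assumptions.

```python
# Generic sketch of Sparsely-Gated MoE top-k routing (not MoESys' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)        # router / gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                   # x: [tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)            # [tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)      # route each token to k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):           # divide-and-conquer: each expert
            mask = (idx == e)                               # only sees the tokens routed to it
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

moe = SparseMoE()
print(moe(torch.randn(16, 64)).shape)                       # torch.Size([16, 64])
```

In a distributed setting the experts would live on different devices and tokens would be exchanged with all-to-all communication, which is where the load-balancing and communication costs mentioned in the abstract arise.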
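
The training-side techniques ("2D prefetch and Fusion communication over Hierarchical storage") are only named, not specified, in the abstract. As one possible illustration of the prefetch idea, the sketch below overlaps the host-to-device copy of the next layer's parameters with the current layer's computation using a background thread; the double-buffer scheme, helper names, and thread-based overlap are assumptions for illustration, not MoESys' actual scheduler.

```python
# Rough sketch: overlap parameter prefetch (CPU -> GPU) with computation.
import threading
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
layers = [nn.Linear(64, 64) for _ in range(4)]       # parameters live in host memory

def prefetch(layer, out):
    out["gpu"] = layer.to(device)                    # copy next layer's weights to the device

def run(x):
    x = x.to(device)
    slot = {"gpu": layers[0].to(device)}             # first layer is loaded eagerly
    for i in range(len(layers)):
        nxt, t = {}, None
        if i + 1 < len(layers):                      # start prefetching layer i+1 ...
            t = threading.Thread(target=prefetch, args=(layers[i + 1], nxt))
            t.start()
        x = slot["gpu"](x)                           # ... while computing with layer i
        if t is not None:
            t.join()                                 # wait for the prefetch, then swap buffers
            slot = nxt
    return x

print(run(torch.randn(8, 64)).shape)                 # torch.Size([8, 64])
```

Real overlap on GPUs would additionally require pinned host memory and a separate CUDA stream; the thread here only conveys the scheduling idea, not an efficient implementation.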
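
For inference, the abstract describes joining CPU and GPU memory into a ring of sections and executing computation across the sections round-robin, so that models larger than GPU memory can still run on a single node. The sketch below captures that idea generically: model sections stay in host memory while a small GPU-resident window cycles through them in order. The section granularity, the `window` parameter, and the plain sequential schedule are assumptions, not MoESys' design.

```python
# Generic sketch of a "ring of sections": cycle a small GPU-resident window
# over model sections kept in CPU memory, evicting the oldest section.
import torch
import torch.nn as nn
from itertools import cycle

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for a model too large for GPU memory: a stack of sections on the CPU.
sections = [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(6)]

def ring_inference(x, sections, window=2):
    """Run the sections in ring order, keeping at most `window` of them on the GPU."""
    ring = cycle(range(len(sections)))
    resident = []                                    # indices of sections currently on the GPU
    x = x.to(device)
    for _ in range(len(sections)):
        i = next(ring)
        sections[i].to(device)                       # stage the next section onto the GPU
        resident.append(i)
        x = sections[i](x)                           # compute on the resident section
        if len(resident) > window:                   # round-robin eviction back to CPU memory
            sections[resident.pop(0)].to("cpu")
    return x

print(ring_inference(torch.randn(8, 64), sections).shape)   # torch.Size([8, 64])
```

Keeping only a fixed window resident is what bounds the GPU memory footprint; combined with prefetching the next section while the current one computes, this is one way a single node can serve a model whose parameters exceed GPU memory.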