MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
Saved in:
Main Authors: | , , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Scaling the size of a model enhances its capabilities but significantly increases computational complexity. Mixture-of-Experts (MoE) models address this issue by allowing model size to scale up without substantially increasing training or inference costs. In MoE, an important module called the router distributes each token to the experts. Currently, the mainstream routing methods are dynamic routing and fixed routing. Despite their promising results, MoE models encounter several challenges. For dynamic routing methods, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Fixed routing methods can mitigate that issue, but they compromise the diversity of representations. In this paper, we propose **MaskMoE**, a method designed to enhance token-level learning by employing a routing **mask**ing technique within the **M**ixture-**o**f-**E**xperts model. MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance. |
---|---|
DOI: | 10.48550/arxiv.2407.09816 |
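
The abstract above describes routing masking only at a high level, so the following is a minimal sketch of how a per-token routing mask could be combined with a standard top-k MoE router, not the paper's exact method. The class name `MaskedTopKRouter`, the frequency threshold policy in `restrict_infrequent_tokens`, and the choice to apply the mask to the router logits before top-k selection are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedTopKRouter(nn.Module):
    """Illustrative MoE router that applies a per-vocabulary-ID mask over
    experts before top-k selection. A sketch of the routing-mask idea only,
    not the paper's implementation."""

    def __init__(self, hidden_dim: int, num_experts: int, vocab_size: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k
        # Binary visibility mask per vocabulary ID: 1 = expert visible to that
        # token, 0 = masked out. Every token starts with all experts visible.
        self.register_buffer("vocab_expert_mask",
                             torch.ones(vocab_size, num_experts))

    def restrict_infrequent_tokens(self, token_counts: torch.Tensor,
                                   threshold: int, visible_experts: int = 1):
        """Assumed policy: tokens seen fewer than `threshold` times are fixed
        to a small deterministic subset of experts, so their gradient signal
        is concentrated rather than dispersed across many experts."""
        num_experts = self.vocab_expert_mask.size(1)
        infrequent_ids = (token_counts < threshold).nonzero(as_tuple=True)[0]
        for tok_id in infrequent_ids.tolist():
            mask = torch.zeros(num_experts)
            start = tok_id % num_experts  # deterministic assignment by token ID
            for j in range(visible_experts):
                mask[(start + j) % num_experts] = 1.0
            self.vocab_expert_mask[tok_id] = mask

    def forward(self, hidden_states: torch.Tensor, token_ids: torch.Tensor):
        # hidden_states: (batch, seq, hidden_dim); token_ids: (batch, seq)
        logits = self.gate(hidden_states)                      # (B, S, E)
        mask = self.vocab_expert_mask[token_ids]                # (B, S, E)
        logits = logits.masked_fill(mask == 0, float("-inf"))   # hide masked experts
        probs = F.softmax(logits, dim=-1)
        # Masked experts receive exactly zero probability; if fewer than top_k
        # experts are visible, the extra slots carry zero routing weight.
        top_probs, top_expert_ids = probs.topk(self.top_k, dim=-1)
        return top_probs, top_expert_ids


# Tiny usage example with made-up sizes and hypothetical token frequencies.
if __name__ == "__main__":
    router = MaskedTopKRouter(hidden_dim=16, num_experts=4, vocab_size=100, top_k=2)
    counts = torch.randint(0, 50, (100,))        # hypothetical token frequency counts
    router.restrict_infrequent_tokens(counts, threshold=10, visible_experts=1)
    x = torch.randn(2, 5, 16)                    # (batch, seq, hidden)
    ids = torch.randint(0, 100, (2, 5))          # token IDs for the same positions
    weights, experts = router(x, ids)
    print(weights.shape, experts.shape)          # torch.Size([2, 5, 2]) for both
```

Under these assumptions, frequent tokens still route dynamically over all experts (preserving representation diversity), while infrequent tokens always reach the same small expert subset, which matches the abstract's stated goal of more comprehensive training for rare tokens.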