Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning
Format: Article
Language: English
Abstract: The Mixture of Experts (MoE) is a widely known neural architecture in which an ensemble of specialized sub-models optimizes overall performance at a constant computational cost. However, conventional MoEs pose challenges at scale due to the need to store all experts in memory. In this paper, we push MoE to the limit. We propose an extremely parameter-efficient MoE by uniquely combining the MoE architecture with lightweight experts. Our MoE architecture outperforms standard parameter-efficient fine-tuning (PEFT) methods and is on par with full fine-tuning while updating only the lightweight experts -- less than 1% of an 11B parameter model. Furthermore, our method generalizes to unseen tasks as it does not depend on any prior task knowledge. Our research underscores the versatility of the mixture-of-experts architecture, showcasing its ability to deliver robust performance even when subjected to rigorous parameter constraints. The code used in all experiments is publicly available at https://github.com/for-ai/parameter-efficient-moe.
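To make the general idea concrete, below is a minimal sketch, assumed rather than taken from the paper or its repository, of an MoE layer whose experts are lightweight low-rank (LoRA-style) adapters added on top of a frozen pretrained projection and mixed by a soft router. The class name `LightweightMoELayer`, the expert count, and the rank are illustrative choices; the point it demonstrates is that only the router and adapters are trainable, which keeps the trainable-parameter fraction very small.

```python
# Illustrative sketch (not the authors' exact implementation): an MoE layer with
# lightweight LoRA-style experts on top of a frozen dense projection. Only the
# router and the adapter parameters are trainable.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightweightMoELayer(nn.Module):
    def __init__(self, base_linear: nn.Linear, num_experts: int = 4, rank: int = 4):
        super().__init__()
        self.base = base_linear                      # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False

        d_in, d_out = base_linear.in_features, base_linear.out_features
        # Each expert is a low-rank update: (d_in x rank) and (rank x d_out).
        self.lora_a = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        # Router produces soft mixing weights over the lightweight experts.
        self.router = nn.Linear(d_in, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in)
        gate = F.softmax(self.router(x), dim=-1)                 # (batch, seq, E)
        # Low-rank expert outputs: (batch, seq, E, d_out)
        expert_out = torch.einsum("bsd,edr,ero->bseo", x, self.lora_a, self.lora_b)
        # Soft-merge the experts and add the result to the frozen base projection.
        mixed = torch.einsum("bse,bseo->bso", gate, expert_out)
        return self.base(x) + mixed


# Usage: wrap a frozen projection, run a forward pass, and count trainable parameters.
layer = LightweightMoELayer(nn.Linear(512, 512), num_experts=4, rank=4)
y = layer(torch.randn(2, 8, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"output: {tuple(y.shape)}, trainable params: {trainable} / {total}")
```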
DOI: 10.48550/arxiv.2309.05444