MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms both Mamba and the baseline Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35\times$ fewer training steps while preserving the inference performance gains of Mamba over the Transformer.
DOI: 10.48550/arxiv.2401.04081
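
To make the idea in the abstract concrete, below is a minimal sketch (not the authors' code) of combining a Mamba-style sequence-mixing layer with a sparse Mixture-of-Experts feed-forward layer in one residual block. The `SequenceMixer` class is a hypothetical stand-in for a real Mamba block, and the top-1 "switch"-style router is an assumed MoE design; neither detail is taken from the record above.

```python
# Sketch of an MoE-Mamba-style block: a sequence-mixing layer interleaved with a
# sparse MoE feed-forward layer. SequenceMixer is a placeholder for a real Mamba
# block; the top-1 router is an assumed (switch-style) MoE design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequenceMixer(nn.Module):
    """Placeholder for a Mamba block: any module mapping (B, L, D) -> (B, L, D)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal cumulative mean as a toy stand-in for selective state-space mixing.
        steps = torch.arange(1, x.size(1) + 1, device=x.device, dtype=x.dtype)
        mixed = x.cumsum(dim=1) / steps.view(1, -1, 1)
        return self.proj(mixed)


class MoEFeedForward(nn.Module):
    """Top-1 (switch-style) mixture of expert MLPs, routed per token."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.size(-1))              # (B*L, D)
        gates = F.softmax(self.router(tokens), dim=-1)  # (B*L, E)
        weight, expert_idx = gates.max(dim=-1)          # top-1 routing decision
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Each token is processed only by its chosen expert.
                out[mask] = weight[mask, None] * expert(tokens[mask])
        return out.reshape_as(x)


class MoEMambaBlock(nn.Module):
    """One block: sequence mixing followed by an MoE feed-forward, both residual."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mixer = SequenceMixer(d_model)
        self.moe = MoEFeedForward(d_model, d_ff, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))
        x = x + self.moe(self.norm2(x))
        return x


if __name__ == "__main__":
    block = MoEMambaBlock(d_model=64, d_ff=256, num_experts=8)
    x = torch.randn(2, 16, 64)   # (batch, sequence length, d_model)
    print(block(x).shape)        # torch.Size([2, 16, 64])
```

The point of the sketch is the layer ordering: because only one expert runs per token, the MoE layer adds parameters without a proportional increase in per-token compute, which is the scaling benefit the abstract attributes to combining SSMs with MoE.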