Cross-token Modeling with Conditional Computation
Format: Article
Language: English
Abstract: Mixture-of-Experts (MoE), a conditional computation architecture, has achieved promising performance by scaling the local module (i.e., the feed-forward network) of the transformer. However, scaling the cross-token module (i.e., self-attention) is challenging due to unstable training. This work proposes Sparse-MLP, an all-MLP model that applies sparsely activated MLPs to cross-token modeling. Specifically, in each Sparse block of our all-MLP model, we apply two stages of MoE layers: one with MLP experts mixing information within channels along the image patch dimension, the other with MLP experts mixing information within patches along the channel dimension. In addition, by proposing an importance-score routing strategy for MoE and redesigning the image representation shape, we further improve our model's computational efficiency. Experimentally, our models are more computation-efficient than Vision Transformers with comparable accuracy. Also, our models outperform MLP-Mixer by 2.5% on ImageNet Top-1 accuracy with fewer parameters and lower computational cost. On downstream tasks, i.e. Cifar10 and Cifar100, our models still achieve better performance than the baselines.
DOI: 10.48550/arxiv.2109.02008
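To illustrate the two-stage structure described in the abstract, the following is a minimal PyTorch sketch of a Sparse block: one MoE of MLP experts mixing along the patch dimension, followed by one mixing along the channel dimension. It is not the paper's implementation; the expert counts, hidden sizes, layer norms, residual connections, and the plain top-1 softmax routing used here are assumptions for illustration (the paper's importance-score routing is not reproduced).

```python
# Illustrative sketch only: hyperparameters and the top-1 routing below are
# assumptions, not taken from the Sparse-MLP paper.
import torch
import torch.nn as nn


class MoEMLP(nn.Module):
    """Mixture of MLP experts with simple top-1 routing over the last dimension."""

    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); send each input vector to its highest-scoring expert.
        scores = self.gate(x).softmax(dim=-1)        # (..., num_experts)
        top_score, top_idx = scores.max(dim=-1)      # (...,)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = expert(x[mask]) * top_score[mask].unsqueeze(-1)
        return out


class SparseBlock(nn.Module):
    """Two MoE stages: mix along the patch dimension, then along channels."""

    def __init__(self, num_patches: int, channels: int, num_experts: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        # Stage 1: experts act on the patch dimension (cross-token mixing).
        self.patch_moe = MoEMLP(num_patches, num_patches * 2, num_experts)
        # Stage 2: experts act on the channel dimension (per-patch mixing).
        self.channel_moe = MoEMLP(channels, channels * 2, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, channels)
        y = self.norm1(x).transpose(1, 2)             # (batch, channels, patches)
        x = x + self.patch_moe(y).transpose(1, 2)     # mix across patches
        x = x + self.channel_moe(self.norm2(x))       # mix across channels
        return x


if __name__ == "__main__":
    block = SparseBlock(num_patches=196, channels=512)
    patches = torch.randn(2, 196, 512)                # batch of patch embeddings
    print(block(patches).shape)                       # torch.Size([2, 196, 512])
```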