ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts
Format: Article
Language: English
Abstract: Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are
a promising approach for increasing model capacity, demonstrating excellent
scalability across multiple domains. In this paper, we integrate the MoE
structure into the classic Vision Transformer (ViT), naming it ViMoE, and
explore the potential of applying MoE to vision through a comprehensive study
on image classification and semantic segmentation. However, we observe that the
performance is sensitive to the configuration of MoE layers, making it
challenging to obtain optimal results without careful design. The underlying
cause is that inappropriate MoE layers lead to unreliable routing and hinder
experts from effectively acquiring helpful information. To address this, we
introduce a shared expert to learn and capture common knowledge, serving as an
effective way to construct stable ViMoE. Furthermore, we demonstrate how to
analyze expert routing behavior, revealing which MoE layers are capable of
specializing in handling specific information and which are not. This provides
guidance for retaining the critical layers while removing redundancies, thereby
advancing ViMoE to be more efficient without sacrificing accuracy. We aspire
for this work to offer new insights into the design of vision MoE models and
provide valuable empirical guidance for future research.
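
The abstract describes replacing feed-forward layers in ViT blocks with MoE layers and adding a shared expert that captures common knowledge to stabilize routing. The paper's implementation details are not reproduced in this record, so the following is only a minimal sketch, assuming top-1 token routing, MLP experts, and an always-active shared expert whose output is added to the routed output; all class names, layer sizes, and the expert count are hypothetical.

```python
# Minimal sketch of a ViMoE-style MoE feed-forward layer (assumptions:
# top-1 routing, MLP experts, shared expert always active).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Drop-in replacement for a ViT FFN: routed experts plus one shared expert."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # token-wise gating scores
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
                )
                for _ in range(num_experts)
            ]
        )
        # Shared expert: processes every token to learn common knowledge,
        # so unreliable routing in a given layer does not starve the block.
        self.shared_expert = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> flatten tokens for per-token routing
        b, n, d = x.shape
        tokens = x.reshape(-1, d)
        gate = F.softmax(self.router(tokens), dim=-1)  # (b*n, num_experts)
        weight, index = gate.max(dim=-1)               # top-1 expert per token

        routed = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = index == e
            if mask.any():
                routed[mask] = weight[mask].unsqueeze(-1) * expert(tokens[mask])

        out = routed + self.shared_expert(tokens)      # shared expert always on
        return out.reshape(b, n, d)


if __name__ == "__main__":
    layer = MoEFeedForward(dim=384, hidden_dim=1536, num_experts=4)
    patches = torch.randn(2, 197, 384)                 # e.g. ViT-S/16 token sequence
    print(layer(patches).shape)                        # torch.Size([2, 197, 384])
```

Inspecting which expert `index` assigns to each token, per layer, is one way to carry out the routing-behavior analysis the abstract mentions: layers whose tokens spread meaningfully across experts specialize, while layers that collapse onto one expert are candidates for removal.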
DOI: 10.48550/arxiv.2410.15732