Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Weight-sharing supernets are crucial for performance estimation in
cutting-edge neural architecture search (NAS) frameworks. Despite their ability
to generate diverse subnetworks without retraining, the quality of these
subnetworks is not guaranteed due to weight sharing. In NLP tasks such as
machine translation and pre-trained language modeling, there is a significant
performance gap between the supernet and the same model architecture trained
from scratch, necessitating retraining after the optimal architecture has been
identified.
This study introduces mixture-of-supernets, a generalized supernet formulation
that leverages mixture-of-experts (MoE) to enhance supernet model
expressiveness with minimal training overhead. Unlike conventional supernets,
this method employs an architecture-based routing mechanism, enabling indirect
sharing of model weights among subnetworks. The weights for a specific
architecture are customized through gradient descent, which minimizes
retraining time and significantly improves training efficiency in NLP. The
proposed method attains state-of-the-art (SoTA) performance in NAS for fast
machine translation models, exhibiting a superior latency-BLEU tradeoff
compared to HAT, the SoTA NAS framework for machine translation. Furthermore,
it excels in NAS for building memory-efficient, task-agnostic BERT models,
surpassing NAS-BERT and AutoDistil across various model sizes. The code can be
found at: https://github.com/UBC-NLP/MoS
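To make the architecture-based routing idea concrete, below is a minimal, hypothetical PyTorch sketch of a linear layer whose effective weights are a mixture of expert weight matrices, with mixing coefficients produced by a small router from an encoding of the sampled architecture. The class name, router shape, and architecture descriptor are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch of an architecture-routed MoE linear layer (not the paper's exact code).
import torch
import torch.nn as nn


class ArchRoutedMoELinear(nn.Module):
    """Linear layer whose weights are a convex combination of expert weights,
    with mixing coefficients computed from an encoding of the sampled architecture."""

    def __init__(self, max_in, max_out, num_experts, arch_embed_dim):
        super().__init__()
        # One full-size weight/bias per expert; subnetworks slice these to their sampled sizes.
        self.expert_weights = nn.Parameter(torch.randn(num_experts, max_out, max_in) * 0.02)
        self.expert_biases = nn.Parameter(torch.zeros(num_experts, max_out))
        # Router maps an architecture descriptor (e.g., sampled widths/depths) to expert scores.
        self.router = nn.Sequential(
            nn.Linear(arch_embed_dim, 64), nn.ReLU(), nn.Linear(64, num_experts)
        )

    def forward(self, x, arch_embedding, in_dim, out_dim):
        # Architecture-dependent mixing coefficients, learned jointly with the experts.
        alpha = torch.softmax(self.router(arch_embedding), dim=-1)  # (num_experts,)
        # Combine experts into a single weight, then slice to the subnetwork's sampled dimensions.
        weight = torch.einsum("e,eoi->oi", alpha, self.expert_weights)[:out_dim, :in_dim]
        bias = torch.einsum("e,eo->o", alpha, self.expert_biases)[:out_dim]
        return nn.functional.linear(x, weight, bias)


# Usage: the same layer serves different subnetworks with architecture-specific weights.
layer = ArchRoutedMoELinear(max_in=512, max_out=512, num_experts=4, arch_embed_dim=8)
x = torch.randn(2, 10, 256)                                  # activations of a narrower subnetwork
arch = torch.tensor([256., 384., 2., 4., 0., 0., 0., 0.])    # toy architecture descriptor
y = layer(x, arch, in_dim=256, out_dim=384)
print(y.shape)  # torch.Size([2, 10, 384])
```

Because the router input is the architecture itself, two different subnetworks obtain different effective weights even though they draw on the same pool of expert parameters, which is the sense in which weight sharing becomes indirect.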
DOI: 10.48550/arxiv.2306.04845