ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices
Format: Article
Language: English
Abstract: Neural Architecture Search (NAS) has shown promising performance in the
automatic design of vision transformers (ViT) exceeding 1G FLOPs. However,
designing lightweight and low-latency ViT models for diverse mobile devices
remains a significant challenge. In this work, we propose ElasticViT, a two-stage NAS
approach that trains a high-quality ViT supernet over a very large search space
that supports a wide range of mobile devices, and then searches an optimal
sub-network (subnet) for direct deployment. However, prior supernet training
methods that rely on uniform sampling suffer from the gradient conflict issue:
the sampled subnets can have vastly different model sizes (e.g., 50M vs. 2G
FLOPs), leading to different optimization directions and inferior performance.
To address this challenge, we propose two novel sampling techniques:
complexity-aware sampling and performance-aware sampling. Complexity-aware
sampling limits the FLOPs difference among the subnets sampled across adjacent
training steps, while covering different-sized subnets in the search space.
Performance-aware sampling further selects subnets that have good accuracy,
which can reduce gradient conflicts and improve supernet quality. Our
discovered ElasticViT models achieve top-1 accuracy from 67.2% to 80.0% on
ImageNet at 60M to 800M FLOPs without extra retraining,
outperforming all prior CNNs and ViTs in terms of accuracy and latency. Our
tiny and small models are also the first ViT models that surpass
state-of-the-art CNNs with significantly lower latency on mobile devices. For
instance, ElasticViT-S1 runs 2.62x faster than EfficientNet-B0 with 0.1% higher
accuracy.
DOI: 10.48550/arxiv.2303.09730
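The abstract describes complexity-aware sampling only at a high level: subnets drawn at adjacent training steps should not differ drastically in FLOPs, yet the whole range of model sizes must still be covered. Below is a minimal Python sketch of that idea, not the authors' implementation; the band boundaries and the `search_space.random_subnet()` / `subnet.flops` interfaces are hypothetical placeholders introduced purely for illustration.

```python
import random

# Hypothetical FLOPs bands (in MFLOPs) partitioning the search space, so that
# training still visits small, medium, and large subnets over time.
FLOPS_BANDS = [(50, 150), (150, 400), (400, 800), (800, 2000)]
MAX_BAND_STEP = 1  # adjacent steps may move at most one band apart


def sample_in_band(search_space, band, rng):
    """Rejection-sample a subnet whose FLOPs fall inside `band`.
    `search_space.random_subnet()` and `subnet.flops` are assumed APIs."""
    lo, hi = band
    while True:
        subnet = search_space.random_subnet(rng)
        if lo <= subnet.flops < hi:
            return subnet


def complexity_aware_schedule(search_space, num_steps, rng=None):
    """Yield one subnet per training step, limiting the FLOPs gap between
    subnets sampled at adjacent steps (e.g., never 50M followed by 2G)."""
    rng = rng or random.Random(0)
    band_idx = rng.randrange(len(FLOPS_BANDS))
    for _ in range(num_steps):
        # Random walk over bands: the next band index stays within
        # MAX_BAND_STEP of the current one, while the walk still covers
        # every band across many training steps.
        lo = max(0, band_idx - MAX_BAND_STEP)
        hi = min(len(FLOPS_BANDS) - 1, band_idx + MAX_BAND_STEP)
        band_idx = rng.randint(lo, hi)
        yield sample_in_band(search_space, FLOPS_BANDS[band_idx], rng)
```

Performance-aware sampling, the second technique named in the abstract, would additionally filter the candidates within a band by (estimated) accuracy before the supernet update; that step is omitted from this sketch.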