A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP
Format: | Article |
Language: | English |
Online access: | Order full text |
Abstract: | Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer- and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, have started to lead new trends, as they show promising results on the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH, which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPs, already on par with SOTA models that use sophisticated designs. The code and models are publicly available at https://github.com/microsoft/SPACH. |
DOI: | 10.48550/arxiv.2108.13002 |
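The key design named in the abstract, separating spatial processing from channel processing so that convolution, self-attention, or MLP modules can be swapped in for the spatial step, can be illustrated with a minimal PyTorch sketch. This is an illustration under assumptions, not the SPACH implementation: the names `SpachBlock` and `SelfAttention` are hypothetical, and the actual code lives in the linked repository.

```python
# Minimal sketch of a SPACH-style block, assuming a PyTorch API.
# SpachBlock and SelfAttention are hypothetical names; only the
# spatial/channel separation follows the abstract. See
# https://github.com/microsoft/SPACH for the real implementation.
import torch
import torch.nn as nn


class SpachBlock(nn.Module):
    """One block: a pluggable spatial-mixing module plus a channel MLP.

    Passing a convolution, self-attention, or MLP module as
    `spatial_mixing` is what would let the three structures be
    compared under one framework.
    """

    def __init__(self, dim: int, spatial_mixing: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial = spatial_mixing  # conv / Transformer / MLP variant
        self.norm2 = nn.LayerNorm(dim)
        self.channel = nn.Sequential(  # per-token channel processing
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); residual connections around both modules
        x = x + self.spatial(self.norm1(x))
        x = x + self.channel(self.norm2(x))
        return x


class SelfAttention(nn.Module):
    """Transformer variant of the spatial-mixing module."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


block = SpachBlock(dim=384, spatial_mixing=SelfAttention(384))
tokens = torch.randn(2, 196, 384)  # e.g. 14x14 image patches, 384 channels
assert block(tokens).shape == tokens.shape
```

Under this reading, swapping `SelfAttention` for a depthwise convolution or a token-mixing MLP would yield the CNN and MLP variants: only the spatial-mixing module changes between the three structures while everything else stays fixed, which is what makes the comparison fair.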