A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet...

Detailed description

Bibliographic details
Main authors: Zhao, Yucheng; Wang, Guangting; Tang, Chuanxin; Luo, Chong; Zeng, Wenjun; Zha, Zheng-Jun
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: Order full text
creator Zhao, Yucheng; Wang, Guangting; Tang, Chuanxin; Luo, Chong; Zeng, Wenjun; Zha, Zheng-Jun
description Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPS. It is already on par with the SOTA models with sophisticated designs. The code and models are publicly available at https://github.com/microsoft/SPACH.
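The abstract's key design point is that the SPACH framework separates spatial processing (mixing information across token positions) from channel processing (a per-token MLP), so that CNN, Transformer, and MLP spatial modules can be swapped behind one interface. A minimal numpy sketch of that separation is below; the function names, the MLP-style spatial module, and the residual layout are illustrative assumptions, not the authors' actual code (which is available at the linked GitHub repository).

```python
import numpy as np

def channel_mixing(x, w1, w2):
    """Per-token MLP over the channel dimension, shared across all tokens.
    (Real models typically use GELU; ReLU is used here for brevity.)"""
    return np.maximum(x @ w1, 0.0) @ w2

def spatial_mixing_mlp(x, w):
    """MLP-Mixer-style spatial module: a learned (tokens x tokens) matrix
    mixes information across token positions. In SPACH this slot could
    instead hold a convolution or a self-attention module."""
    return w @ x  # (n_tokens, n_tokens) @ (n_tokens, channels)

def spach_block(x, spatial_fn, wc1, wc2):
    """One block: spatial module, then channel module, each with a
    residual connection (a common design choice, assumed here)."""
    x = x + spatial_fn(x)
    x = x + channel_mixing(x, wc1, wc2)
    return x

# Toy sizes: 16 tokens, 32 channels, hidden width 64.
rng = np.random.default_rng(0)
n_tokens, channels, hidden = 16, 32, 64
x = rng.standard_normal((n_tokens, channels))
ws = rng.standard_normal((n_tokens, n_tokens)) * 0.01
wc1 = rng.standard_normal((channels, hidden)) * 0.01
wc2 = rng.standard_normal((hidden, channels)) * 0.01

y = spach_block(x, lambda t: spatial_mixing_mlp(t, ws), wc1, wc2)
print(y.shape)  # (16, 32): spatial and channel mixing preserve the token grid
```

Because only `spatial_fn` changes between variants, comparisons across the three structures stay fair, which is the point of the unified framework described in the abstract.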
doi_str_mv 10.48550/arxiv.2108.13002
format Article
creationdate 2021-08-30
rights http://creativecommons.org/licenses/by-nc-nd/4.0
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2108.13002
language eng
recordid cdi_arxiv_primary_2108_13002
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP