A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet...

Detailed description

Bibliographic details
Main authors: Zhao, Yucheng; Wang, Guangting; Tang, Chuanxin; Luo, Chong; Zeng, Wenjun; Zha, Zheng-Jun
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: Order full text
creator Zhao, Yucheng; Wang, Guangting; Tang, Chuanxin; Luo, Chong; Zeng, Wenjun; Zha, Zheng-Jun
description Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPS. It is already on par with the SOTA models with sophisticated designs. The code and models are publicly available at https://github.com/microsoft/SPACH.
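The abstract's key design point is that the SPACH framework separates spatial processing (mixing information across token positions) from channel processing (a per-token MLP), so that CNN, Transformer, and MLP spatial modules can be swapped behind one interface. A minimal numpy sketch of that separation is below; the function names, the MLP-style spatial module, and the residual layout are illustrative assumptions, not the authors' actual code (which is available at the linked GitHub repository).

```python
import numpy as np

def channel_mixing(x, w1, w2):
    """Per-token MLP over the channel dimension, shared across all tokens.
    (Real models typically use GELU; ReLU is used here for brevity.)"""
    return np.maximum(x @ w1, 0.0) @ w2

def spatial_mixing_mlp(x, w):
    """MLP-Mixer-style spatial module: a learned (tokens x tokens) matrix
    mixes information across token positions. In SPACH this slot could
    instead hold a convolution or a self-attention module."""
    return w @ x  # (n_tokens, n_tokens) @ (n_tokens, channels)

def spach_block(x, spatial_fn, wc1, wc2):
    """One block: spatial module, then channel module, each with a
    residual connection (a common design choice, assumed here)."""
    x = x + spatial_fn(x)
    x = x + channel_mixing(x, wc1, wc2)
    return x

# Toy sizes: 16 tokens, 32 channels, hidden width 64.
rng = np.random.default_rng(0)
n_tokens, channels, hidden = 16, 32, 64
x = rng.standard_normal((n_tokens, channels))
ws = rng.standard_normal((n_tokens, n_tokens)) * 0.01
wc1 = rng.standard_normal((channels, hidden)) * 0.01
wc2 = rng.standard_normal((hidden, channels)) * 0.01

y = spach_block(x, lambda t: spatial_mixing_mlp(t, ws), wc1, wc2)
print(y.shape)  # (16, 32): spatial and channel mixing preserve the token grid
```

Because only `spatial_fn` changes between variants, comparisons across the three structures stay fair, which is the point of the unified framework described in the abstract.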
doi_str_mv 10.48550/arxiv.2108.13002
format Article
creationdate 2021-08-30
rights http://creativecommons.org/licenses/by-nc-nd/4.0
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2108.13002
language eng
recordid cdi_arxiv_primary_2108_13002
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP