Delving Deep into the Generalization of Vision Transformers under Distribution Shifts

Vision Transformers (ViTs) have achieved impressive performance on various vision tasks, yet their generalization under distribution shifts (DS) remains poorly understood. In this work, we comprehensively study the out-of-distribution (OOD) generalization of ViTs. For systematic investigation, we first present a taxonomy of DS. We then perform extensive evaluations of ViT variants under different DS and compare their generalization with Convolutional Neural Network (CNN) models. Two important observations are obtained: 1) ViTs learn weaker biases towards backgrounds and textures, and stronger inductive biases towards shapes and structures, which is more consistent with human cognitive traits; consequently, ViTs generalize better than CNNs under DS. With the same or a smaller number of parameters, ViTs lead corresponding CNNs by more than 5% in top-1 accuracy under most types of DS. 2) As the model scale increases, ViTs strengthen these biases and thus gradually narrow the gap between in-distribution and OOD performance. To further improve the generalization of ViTs, we design Generalization-Enhanced ViTs (GE-ViTs) from the perspectives of adversarial learning, information theory, and self-supervised learning. By comprehensively investigating these GE-ViTs and comparing them with their corresponding CNN models, we observe: 1) for the enhanced models, larger ViTs still benefit more in OOD generalization; 2) GE-ViTs are more sensitive to hyper-parameters than their corresponding CNN models. We design a smoother learning strategy to achieve a stable training process and improve performance on OOD data by 4% over vanilla ViTs. We hope our comprehensive study sheds light on the design of more generalizable learning architectures.

Published in: arXiv.org, 2022-03
Main authors: Zhang, Chongzhi; Zhang, Mingyuan; Zhang, Shanghang; Jin, Daisheng; Zhou, Qiang; Cai, Zhongang; Zhao, Haiyu; Liu, Xianglong; Liu, Ziwei
Format: Article
Language: English
Online access: Full text
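The in-distribution (ID) versus OOD comparison reported in the abstract reduces to measuring top-1 accuracy on each evaluation split and taking the difference. A minimal sketch in plain Python; the logits and labels below are illustrative stand-ins, not the paper's models or data:

```python
# Sketch of the ID/OOD top-1 accuracy gap described in the abstract.
# The prediction scores here are hypothetical; a real evaluation would
# run a ViT or CNN over clean and distribution-shifted test sets.

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = sum(
        1 for scores, y in zip(logits, labels)
        if max(range(len(scores)), key=scores.__getitem__) == y
    )
    return correct / len(labels)

# Toy predictions: in-distribution split (clean) vs. OOD split (shifted).
id_logits  = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.3, 0.5], [0.9, 0.05, 0.05]]
id_labels  = [1, 0, 2, 0]
ood_logits = [[0.4, 0.5, 0.1], [0.3, 0.6, 0.1], [0.1, 0.2, 0.7], [0.5, 0.3, 0.2]]
ood_labels = [1, 0, 2, 0]

id_acc  = top1_accuracy(id_logits, id_labels)
ood_acc = top1_accuracy(ood_logits, ood_labels)
gap = id_acc - ood_acc  # the ID-OOD generalization gap the paper studies

print(f"ID top-1: {id_acc:.2f}  OOD top-1: {ood_acc:.2f}  gap: {gap:.2f}")
```

The paper's claim that larger ViTs "narrow the in-distribution and OOD performance gap" corresponds to this `gap` shrinking as model scale grows.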
EISSN: 2331-8422
Subjects: Information theory; Parameter sensitivity; Supervised learning; Taxonomy; Training