Delving Deep into the Generalization of Vision Transformers under Distribution Shifts

Vision Transformers (ViTs) have achieved impressive performance on various vision tasks, yet their generalization under distribution shifts (DS) remains poorly understood. In this work, we comprehensively study the out-of-distribution (OOD) generalization of ViTs. For systematic investigation, we first present a taxonomy of DS. We then perform extensive evaluations of ViT variants under different DS and compare their generalization with Convolutional Neural Network (CNN) models. Two important observations are obtained: 1) ViTs learn weaker biases towards backgrounds and textures, and stronger inductive biases towards shapes and structures, which is more consistent with human cognitive traits; consequently, ViTs generalize better than CNNs under DS. With the same or a smaller number of parameters, ViTs lead corresponding CNNs by more than 5% in top-1 accuracy under most types of DS. 2) As the model scale increases, ViTs strengthen these biases and thus gradually narrow the gap between in-distribution and OOD performance. To further improve the generalization of ViTs, we design Generalization-Enhanced ViTs (GE-ViTs) from the perspectives of adversarial learning, information theory, and self-supervised learning. By comprehensively investigating these GE-ViTs and comparing them with their corresponding CNN models, we observe: 1) for the enhanced models, larger ViTs still benefit more in OOD generalization; 2) GE-ViTs are more sensitive to hyper-parameters than their corresponding CNN models. We design a smoother learning strategy to achieve a stable training process and improve performance on OOD data by 4% over vanilla ViTs. We hope our comprehensive study sheds light on the design of more generalizable learning architectures.

Published in: arXiv.org, 2022-03
Main authors: Zhang, Chongzhi; Zhang, Mingyuan; Zhang, Shanghang; Jin, Daisheng; Zhou, Qiang; Cai, Zhongang; Zhao, Haiyu; Liu, Xianglong; Liu, Ziwei
Format: Article
Language: English
Online access: Full text
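The in-distribution (ID) versus OOD comparison reported in the abstract reduces to measuring top-1 accuracy on each evaluation split and taking the difference. A minimal sketch in plain Python; the logits and labels below are illustrative stand-ins, not the paper's models or data:

```python
# Sketch of the ID/OOD top-1 accuracy gap described in the abstract.
# The prediction scores here are hypothetical; a real evaluation would
# run a ViT or CNN over clean and distribution-shifted test sets.

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = sum(
        1 for scores, y in zip(logits, labels)
        if max(range(len(scores)), key=scores.__getitem__) == y
    )
    return correct / len(labels)

# Toy predictions: in-distribution split (clean) vs. OOD split (shifted).
id_logits  = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.2, 0.3, 0.5], [0.9, 0.05, 0.05]]
id_labels  = [1, 0, 2, 0]
ood_logits = [[0.4, 0.5, 0.1], [0.3, 0.6, 0.1], [0.1, 0.2, 0.7], [0.5, 0.3, 0.2]]
ood_labels = [1, 0, 2, 0]

id_acc  = top1_accuracy(id_logits, id_labels)
ood_acc = top1_accuracy(ood_logits, ood_labels)
gap = id_acc - ood_acc  # the ID-OOD generalization gap the paper studies

print(f"ID top-1: {id_acc:.2f}  OOD top-1: {ood_acc:.2f}  gap: {gap:.2f}")
```

The paper's claim that larger ViTs "narrow the in-distribution and OOD performance gap" corresponds to this `gap` shrinking as model scale grows.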
EISSN: 2331-8422
Subjects: Information theory; Parameter sensitivity; Supervised learning; Taxonomy; Training