Delving Deep into the Generalization of Vision Transformers under Distribution Shifts
Vision Transformers (ViTs) have achieved impressive performance on various vision tasks, yet their generalization under distribution shifts (DS) is rarely understood. In this work, we comprehensively study the out-of-distribution (OOD) generalization of ViTs. For systematic investigation, we first p...
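Published in: | arXiv.org 2022-03 |
---|---|
Main authors: | Zhang, Chongzhi ; Zhang, Mingyuan ; Zhang, Shanghang ; Jin, Daisheng ; Zhou, Qiang ; Cai, Zhongang ; Zhao, Haiyu ; Liu, Xianglong ; Liu, Ziwei |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |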
|
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Zhang, Chongzhi ; Zhang, Mingyuan ; Zhang, Shanghang ; Jin, Daisheng ; Zhou, Qiang ; Cai, Zhongang ; Zhao, Haiyu ; Liu, Xianglong ; Liu, Ziwei |
description | Vision Transformers (ViTs) have achieved impressive performance on various vision tasks, yet their generalization under distribution shifts (DS) remains poorly understood. In this work, we comprehensively study the out-of-distribution (OOD) generalization of ViTs. For systematic investigation, we first present a taxonomy of DS. We then perform extensive evaluations of ViT variants under different DS and compare their generalization with Convolutional Neural Network (CNN) models. Important observations are obtained: 1) ViTs learn weaker biases on backgrounds and textures, while they are equipped with stronger inductive biases towards shapes and structures, which is more consistent with human cognitive traits. Therefore, ViTs generalize better than CNNs under DS. With the same number of parameters or fewer, ViTs are ahead of corresponding CNNs by more than 5% in top-1 accuracy under most types of DS. 2) As the model scale increases, ViTs strengthen these biases and thus gradually narrow the in-distribution and OOD performance gap. To further improve the generalization of ViTs, we design the Generalization-Enhanced ViTs (GE-ViTs) from the perspectives of adversarial learning, information theory, and self-supervised learning. By comprehensively investigating these GE-ViTs and comparing with their corresponding CNN models, we observe: 1) For the enhanced model, larger ViTs still benefit more in OOD generalization. 2) GE-ViTs are more sensitive to the hyper-parameters than their corresponding CNN models. We design a smoother learning strategy to achieve a stable training process and obtain performance improvements on OOD data of 4% over vanilla ViTs. We hope our comprehensive study could shed light on the design of more generalizable learning architectures. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2541124114</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2541124114</sourcerecordid><originalsourceid>FETCH-proquest_journals_25411241143</originalsourceid><addsrcrecordid>eNqNissKwjAQRYMgWLT_MOC60Katurc-9la3EnFip9SkZhIXfr1V_AAXh3vgnpGIZJ5nyaqQciJi5jZNU7lYyrLMI3GssHuSuUGF2AMZb8E3CDs06FRHL-XJGrAaTsQfq50yrK27o2MI5ooOKmLv6BK-5aEh7Xkmxlp1jPFvp2K-3dTrfdI7-wjI_tza4MxwnWVZZJkcKPL_qjfrBkEs</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2541124114</pqid></control><display><type>article</type><title>Delving Deep into the Generalization of Vision Transformers under Distribution Shifts</title><source>Free E- Journals</source><creator>Zhang, Chongzhi ; Zhang, Mingyuan ; Zhang, Shanghang ; Jin, Daisheng ; Zhou, Qiang ; Cai, Zhongang ; Zhao, Haiyu ; Liu, Xianglong ; Liu, Ziwei</creator><creatorcontrib>Zhang, Chongzhi ; Zhang, Mingyuan ; Zhang, Shanghang ; Jin, Daisheng ; Zhou, Qiang ; Cai, Zhongang ; Zhao, Haiyu ; Liu, Xianglong ; Liu, Ziwei</creatorcontrib><description>Vision Transformers (ViTs) have achieved impressive performance on various vision tasks, yet their generalization under distribution shifts (DS) is rarely understood. In this work, we comprehensively study the out-of-distribution (OOD) generalization of ViTs. For systematic investigation, we first present a taxonomy of DS. We then perform extensive evaluations of ViT variants under different DS and compare their generalization with Convolutional Neural Network (CNN) models. Important observations are obtained: 1) ViTs learn weaker biases on backgrounds and textures, while they are equipped with stronger inductive biases towards shapes and structures, which is more consistent with human cognitive traits. Therefore, ViTs generalize better than CNNs under DS. 
With the same or less amount of parameters, ViTs are ahead of corresponding CNNs by more than 5% in top-1 accuracy under most types of DS. 2) As the model scale increases, ViTs strengthen these biases and thus gradually narrow the in-distribution and OOD performance gap. To further improve the generalization of ViTs, we design the Generalization-Enhanced ViTs (GE-ViTs) from the perspectives of adversarial learning, information theory, and self-supervised learning. By comprehensively investigating these GE-ViTs and comparing with their corresponding CNN models, we observe: 1) For the enhanced model, larger ViTs still benefit more for the OOD generalization. 2) GE-ViTs are more sensitive to the hyper-parameters than their corresponding CNN models. We design a smoother learning strategy to achieve a stable training process and obtain performance improvements on OOD data by 4% from vanilla ViTs. We hope our comprehensive study could shed light on the design of more generalizable learning architectures.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Information theory ; Parameter sensitivity ; Supervised learning ; Taxonomy ; Training</subject><ispartof>arXiv.org, 2022-03</ispartof><rights>2022. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Zhang, Chongzhi</creatorcontrib><creatorcontrib>Zhang, Mingyuan</creatorcontrib><creatorcontrib>Zhang, Shanghang</creatorcontrib><creatorcontrib>Jin, Daisheng</creatorcontrib><creatorcontrib>Zhou, Qiang</creatorcontrib><creatorcontrib>Cai, Zhongang</creatorcontrib><creatorcontrib>Zhao, Haiyu</creatorcontrib><creatorcontrib>Liu, Xianglong</creatorcontrib><creatorcontrib>Liu, Ziwei</creatorcontrib><title>Delving Deep into the Generalization of Vision Transformers under Distribution Shifts</title><title>arXiv.org</title><description>Vision Transformers (ViTs) have achieved impressive performance on various vision tasks, yet their generalization under distribution shifts (DS) is rarely understood. In this work, we comprehensively study the out-of-distribution (OOD) generalization of ViTs. For systematic investigation, we first present a taxonomy of DS. We then perform extensive evaluations of ViT variants under different DS and compare their generalization with Convolutional Neural Network (CNN) models. Important observations are obtained: 1) ViTs learn weaker biases on backgrounds and textures, while they are equipped with stronger inductive biases towards shapes and structures, which is more consistent with human cognitive traits. Therefore, ViTs generalize better than CNNs under DS. With the same or less amount of parameters, ViTs are ahead of corresponding CNNs by more than 5% in top-1 accuracy under most types of DS. 
2) As the model scale increases, ViTs strengthen these biases and thus gradually narrow the in-distribution and OOD performance gap. To further improve the generalization of ViTs, we design the Generalization-Enhanced ViTs (GE-ViTs) from the perspectives of adversarial learning, information theory, and self-supervised learning. By comprehensively investigating these GE-ViTs and comparing with their corresponding CNN models, we observe: 1) For the enhanced model, larger ViTs still benefit more for the OOD generalization. 2) GE-ViTs are more sensitive to the hyper-parameters than their corresponding CNN models. We design a smoother learning strategy to achieve a stable training process and obtain performance improvements on OOD data by 4% from vanilla ViTs. We hope our comprehensive study could shed light on the design of more generalizable learning architectures.</description><subject>Information theory</subject><subject>Parameter sensitivity</subject><subject>Supervised learning</subject><subject>Taxonomy</subject><subject>Training</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNqNissKwjAQRYMgWLT_MOC60Katurc-9la3EnFip9SkZhIXfr1V_AAXh3vgnpGIZJ5nyaqQciJi5jZNU7lYyrLMI3GssHuSuUGF2AMZb8E3CDs06FRHL-XJGrAaTsQfq50yrK27o2MI5ooOKmLv6BK-5aEh7Xkmxlp1jPFvp2K-3dTrfdI7-wjI_tza4MxwnWVZZJkcKPL_qjfrBkEs</recordid><startdate>20220308</startdate><enddate>20220308</enddate><creator>Zhang, Chongzhi</creator><creator>Zhang, Mingyuan</creator><creator>Zhang, Shanghang</creator><creator>Jin, Daisheng</creator><creator>Zhou, Qiang</creator><creator>Cai, Zhongang</creator><creator>Zhao, Haiyu</creator><creator>Liu, Xianglong</creator><creator>Liu, Ziwei</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20220308</creationdate><title>Delving Deep into the Generalization of Vision Transformers under Distribution Shifts</title><author>Zhang, Chongzhi ; Zhang, Mingyuan ; Zhang, Shanghang ; Jin, Daisheng ; Zhou, Qiang ; Cai, Zhongang ; Zhao, Haiyu ; Liu, Xianglong ; Liu, Ziwei</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_25411241143</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Information theory</topic><topic>Parameter sensitivity</topic><topic>Supervised learning</topic><topic>Taxonomy</topic><topic>Training</topic><toplevel>online_resources</toplevel><creatorcontrib>Zhang, Chongzhi</creatorcontrib><creatorcontrib>Zhang, Mingyuan</creatorcontrib><creatorcontrib>Zhang, Shanghang</creatorcontrib><creatorcontrib>Jin, Daisheng</creatorcontrib><creatorcontrib>Zhou, Qiang</creatorcontrib><creatorcontrib>Cai, Zhongang</creatorcontrib><creatorcontrib>Zhao, Haiyu</creatorcontrib><creatorcontrib>Liu, Xianglong</creatorcontrib><creatorcontrib>Liu, Ziwei</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community 
College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zhang, Chongzhi</au><au>Zhang, Mingyuan</au><au>Zhang, Shanghang</au><au>Jin, Daisheng</au><au>Zhou, Qiang</au><au>Cai, Zhongang</au><au>Zhao, Haiyu</au><au>Liu, Xianglong</au><au>Liu, Ziwei</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Delving Deep into the Generalization of Vision Transformers under Distribution Shifts</atitle><jtitle>arXiv.org</jtitle><date>2022-03-08</date><risdate>2022</risdate><eissn>2331-8422</eissn><abstract>Vision Transformers (ViTs) have achieved impressive performance on various vision tasks, yet their generalization under distribution shifts (DS) is rarely understood. In this work, we comprehensively study the out-of-distribution (OOD) generalization of ViTs. For systematic investigation, we first present a taxonomy of DS. We then perform extensive evaluations of ViT variants under different DS and compare their generalization with Convolutional Neural Network (CNN) models. Important observations are obtained: 1) ViTs learn weaker biases on backgrounds and textures, while they are equipped with stronger inductive biases towards shapes and structures, which is more consistent with human cognitive traits. Therefore, ViTs generalize better than CNNs under DS. 
With the same or less amount of parameters, ViTs are ahead of corresponding CNNs by more than 5% in top-1 accuracy under most types of DS. 2) As the model scale increases, ViTs strengthen these biases and thus gradually narrow the in-distribution and OOD performance gap. To further improve the generalization of ViTs, we design the Generalization-Enhanced ViTs (GE-ViTs) from the perspectives of adversarial learning, information theory, and self-supervised learning. By comprehensively investigating these GE-ViTs and comparing with their corresponding CNN models, we observe: 1) For the enhanced model, larger ViTs still benefit more for the OOD generalization. 2) GE-ViTs are more sensitive to the hyper-parameters than their corresponding CNN models. We design a smoother learning strategy to achieve a stable training process and obtain performance improvements on OOD data by 4% from vanilla ViTs. We hope our comprehensive study could shed light on the design of more generalizable learning architectures.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2022-03 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2541124114 |
source | Free E- Journals |
subjects | Information theory ; Parameter sensitivity ; Supervised learning ; Taxonomy ; Training |
title | Delving Deep into the Generalization of Vision Transformers under Distribution Shifts |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-09T19%3A25%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Delving%20Deep%20into%20the%20Generalization%20of%20Vision%20Transformers%20under%20Distribution%20Shifts&rft.jtitle=arXiv.org&rft.au=Zhang,%20Chongzhi&rft.date=2022-03-08&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2541124114%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2541124114&rft_id=info:pmid/&rfr_iscdi=true |