The Role of ViT Design and Training in Robustness Towards Common Corruptions

Vision Transformer (ViT) variants have made rapid advances in a variety of computer vision tasks. However, their performance on corrupted inputs, which are inevitable in realistic use cases due to variations in lighting and weather, has not been explored comprehensively. In this paper, we probe the robustness gap among ViT variants and ask just how these modern architectural developments affect performance under the common types of corruption. Through extensive and rigorous benchmarking, we demonstrate that simple architectural designs such as overlapping patch embedding and convolutional feed-forward networks can promote the robustness of ViTs. Moreover, since the de facto training of ViTs relies heavily on data augmentation, the exact augmentation strategies that make ViTs more robust are worth investigating. We survey the efficacy of previous methods and verify that adversarial noise training is powerful. On top of that, we introduce a novel conditional method of generating dynamic augmentation parameters conditioned on input images, offering state-of-the-art robustness towards common corruptions.
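The overlapping patch embedding mentioned in the abstract comes down to a small architectural change: the patch-projection convolution uses a kernel larger than its stride, so adjacent patches share pixels instead of tiling the image disjointly. A minimal sketch of the resulting token-count arithmetic (the kernel/stride/padding values below are illustrative, not taken from the paper):

```python
def num_patches(img: int, kernel: int, stride: int, pad: int) -> int:
    """Tokens along one axis produced by a conv-style patch embedding."""
    return (img + 2 * pad - kernel) // stride + 1

# Vanilla ViT: non-overlapping patches (kernel == stride, no pixel sharing)
vit_tokens = num_patches(224, kernel=16, stride=16, pad=0)      # 14 per axis

# Overlapping patch embedding: kernel > stride, so neighbouring patches overlap
overlap_tokens = num_patches(224, kernel=7, stride=4, pad=3)    # 56 per axis
```

With kernel > stride every image pixel contributes to several tokens, which is one intuition for why such embeddings can smooth over local corruption.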

Bibliographic Details

Published in: IEEE Transactions on Multimedia, 2024-12, pp. 1-13
Main authors: Tian, Rui; Wu, Zuxuan; Dai, Qi; Goldblum, Micah; Hu, Han; Jiang, Yu-Gang
Format: Article
Language: English
Description: Vision Transformer (ViT) variants have made rapid advances in a variety of computer vision tasks. However, their performance on corrupted inputs, which are inevitable in realistic use cases due to variations in lighting and weather, has not been explored comprehensively. In this paper, we probe the robustness gap among ViT variants and ask just how these modern architectural developments affect performance under the common types of corruption. Through extensive and rigorous benchmarking, we demonstrate that simple architectural designs such as overlapping patch embedding and convolutional feed-forward networks can promote the robustness of ViTs. Moreover, since the de facto training of ViTs relies heavily on data augmentation, the exact augmentation strategies that make ViTs more robust are worth investigating. We survey the efficacy of previous methods and verify that adversarial noise training is powerful. On top of that, we introduce a novel conditional method of generating dynamic augmentation parameters conditioned on input images, offering state-of-the-art robustness towards common corruptions.
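The description mentions generating dynamic augmentation parameters conditioned on the input image. The paper's actual mechanism is not reproduced here; as a toy sketch of the general idea only, a per-image statistic can be mapped to an augmentation magnitude (the brightness heuristic and the `[lo, hi]` range below are invented for illustration):

```python
def augmentation_magnitude(pixels, lo=0.1, hi=0.9):
    """Toy input-conditioned policy: brighter images get stronger augmentation.

    `pixels` is a flat list of intensities in [0, 1]. The mapping itself is a
    placeholder, not the conditional method proposed in the paper.
    """
    mean = sum(pixels) / len(pixels)
    return lo + (hi - lo) * mean
```

The point of such conditioning is that each image receives its own augmentation strength rather than one global setting, which is the property the paper's method exploits.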
DOI: 10.1109/TMM.2024.3521721
ISSN: 1520-9210
EISSN: 1941-0077
Source: IEEE Electronic Library (IEL)
Subjects: Accuracy; Benchmark testing; Common Corruptions; Computer vision; Data augmentation; Noise; Resilience; Robustness; Training; Transformers; Vision Transformer