The Role of ViT Design and Training in Robustness Towards Common Corruptions
Published in: | IEEE transactions on multimedia, 2024-12, p.1-13 |
---|---|
Authors: | Tian, Rui ; Wu, Zuxuan ; Dai, Qi ; Goldblum, Micah ; Hu, Han ; Jiang, Yu-Gang |
Format: | Article |
Language: | English |
Online access: | Order full text |
container_end_page | 13 |
---|---|
container_issue | |
container_start_page | 1 |
container_title | IEEE transactions on multimedia |
container_volume | |
creator | Tian, Rui ; Wu, Zuxuan ; Dai, Qi ; Goldblum, Micah ; Hu, Han ; Jiang, Yu-Gang |
description | Vision Transformer (ViT) variants have made rapid advances in a variety of computer vision tasks. However, their performance on corrupted inputs, which are inevitable in realistic use cases due to variations in lighting and weather, has not been explored comprehensively. In this paper, we probe the robustness gap among ViT variants and ask just how these modern architectural developments affect performance under the common types of corruption. Through extensive and rigorous benchmarking, we demonstrate that simple architectural designs such as overlapping patch embedding and convolutional feed-forward networks can promote the robustness of ViTs. Moreover, since the de facto training of ViTs relies heavily on data augmentation, the exact augmentation strategies that make ViTs more robust are worth investigating. We survey the efficacy of previous methods and verify that adversarial noise training is powerful. On top of that, we introduce a novel conditional method of generating dynamic augmentation parameters conditioned on input images, offering state-of-the-art robustness towards common corruptions. |
doi_str_mv | 10.1109/TMM.2024.3521721 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1520-9210 |
ispartof | IEEE transactions on multimedia, 2024-12, p.1-13 |
issn | 1520-9210 1941-0077 |
language | eng |
recordid | cdi_crossref_primary_10_1109_TMM_2024_3521721 |
source | IEEE Electronic Library (IEL) |
subjects | Accuracy ; Benchmark testing ; Common Corruptions ; Computer vision ; Data augmentation ; Noise ; Resilience ; Robustness ; Training ; Transformers ; Vision Transformer |
title | The Role of ViT Design and Training in Robustness Towards Common Corruptions |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-09T17%3A41%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20Role%20of%20ViT%20Design%20and%20Training%20in%20Robustness%20Towards%20Common%20Corruptions&rft.jtitle=IEEE%20transactions%20on%20multimedia&rft.au=Tian,%20Rui&rft.date=2024-12-21&rft.spage=1&rft.epage=13&rft.pages=1-13&rft.issn=1520-9210&rft.eissn=1941-0077&rft.coden=ITMUF8&rft_id=info:doi/10.1109/TMM.2024.3521721&rft_dat=%3Ccrossref_RIE%3E10_1109_TMM_2024_3521721%3C/crossref_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10812859&rfr_iscdi=true |
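The abstract credits overlapping patch embedding (a convolutional stem where neighbouring patches share pixels, i.e. stride smaller than patch size) with improving ViT robustness. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the patch size, stride, and embedding dimension are illustrative assumptions:

```python
import numpy as np

def overlapping_patch_embed(img, patch=7, stride=4, dim=64, rng=None):
    """Extract patches with stride < patch size (so they overlap) and
    linearly project each one to a token embedding.

    img: (H, W, C) array. Returns (num_tokens, dim)."""
    rng = np.random.default_rng(0) if rng is None else rng
    H, W, C = img.shape
    # Number of patch positions along each spatial axis.
    nh = (H - patch) // stride + 1
    nw = (W - patch) // stride + 1
    # Shared linear projection applied to every flattened patch.
    proj = rng.standard_normal((patch * patch * C, dim)) * 0.02
    tokens = np.empty((nh * nw, dim))
    for i in range(nh):
        for j in range(nw):
            p = img[i * stride:i * stride + patch,
                    j * stride:j * stride + patch, :]
            tokens[i * nw + j] = p.reshape(-1) @ proj
    return tokens

# A vanilla ViT stem uses stride == patch (disjoint patches); setting
# stride < patch makes adjacent tokens see overlapping pixel regions.
x = np.ones((56, 56, 3))
print(overlapping_patch_embed(x, patch=7, stride=4).shape)  # (169, 64)
```

With stride equal to the patch size the same function reduces to the standard non-overlapping ViT patch embedding, which makes the two stems easy to compare in ablations.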