The Role of ViT Design and Training in Robustness Towards Common Corruptions

Bibliographic details
Published in: IEEE Transactions on Multimedia 2024-12, p.1-13
Main authors: Tian, Rui; Wu, Zuxuan; Dai, Qi; Goldblum, Micah; Hu, Han; Jiang, Yu-Gang
Format: Article
Language: English
Description
Summary: Vision Transformer (ViT) variants have made rapid advances in a variety of computer vision tasks. However, their performance on corrupted inputs, which are inevitable in realistic use cases due to variations in lighting and weather, has not been explored comprehensively. In this paper, we probe the robustness gap among ViT variants and ask just how these modern architectural developments affect performance under the common types of corruption. Through extensive and rigorous benchmarking, we demonstrate that simple architectural designs such as overlapping patch embedding and convolutional feed-forward networks can promote the robustness of ViTs. Moreover, since the de facto training of ViTs relies heavily on data augmentation, the exact augmentation strategies that make ViTs more robust are worth investigating. We survey the efficacy of previous methods and verify that adversarial noise training is powerful. On top of that, we introduce a novel conditional method of generating dynamic augmentation parameters conditioned on input images, offering state-of-the-art robustness towards common corruptions.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2024.3521721