Delving Deep into the Generalization of Vision Transformers under Distribution Shifts
Main authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | Vision Transformers (ViTs) have achieved impressive performance on various
vision tasks, yet their generalization under distribution shifts (DS) is poorly
understood. In this work, we comprehensively study the out-of-distribution
(OOD) generalization of ViTs. For systematic investigation, we first present a
taxonomy of DS. We then perform extensive evaluations of ViT variants under
different DS and compare their generalization with Convolutional Neural Network
(CNN) models. Important observations are obtained: 1) ViTs learn weaker biases
on backgrounds and textures, while they are equipped with stronger inductive
biases towards shapes and structures, which is more consistent with human
cognitive traits. Therefore, ViTs generalize better than CNNs under DS. With
the same or fewer parameters, ViTs are ahead of corresponding CNNs by
more than 5% in top-1 accuracy under most types of DS. 2) As the model scale
increases, ViTs strengthen these biases and thus gradually narrow the
in-distribution and OOD performance gap. To further improve the generalization
of ViTs, we design the Generalization-Enhanced ViTs (GE-ViTs) from the
perspectives of adversarial learning, information theory, and self-supervised
learning. By comprehensively investigating these GE-ViTs and comparing with
their corresponding CNN models, we observe: 1) For the enhanced model, larger
ViTs still benefit more in OOD generalization. 2) GE-ViTs are more
sensitive to the hyper-parameters than their corresponding CNN models. We
design a smoother learning strategy to achieve a stable training process and
improve performance on OOD data by 4% over vanilla ViTs. We hope
our comprehensive study could shed light on the design of more generalizable
learning architectures. |
---|---|
DOI: | 10.48550/arxiv.2106.07617 |