CrossFormer++: A Versatile Vision Transformer Hinging on Cross-Scale Attention

While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short d...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	IEEE transactions on pattern analysis and machine intelligence 2024-05, Vol.46 (5), p.3123-3136
Hauptverfasser:	Wang, Wenxiao, Chen, Wei, Qiu, Qibo, Chen, Long, Wu, Boxi, Lin, Binbin, He, Xiaofei, Liu, Wei
Format:	Artikel
Sprache:	eng
Schlagworte:	Adaptation models Amplitudes Costs Feature extraction Image classification Image segmentation Instance segmentation Modules Object detection Object recognition Semantic segmentation Task analysis Transformers Vision vision transformer Visualization
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the tokens. Moreover, through experiments on CrossFormer, we observe another two issues that affect vision transformers' performance, i.e., the enlarging self-attention maps and amplitude explosion. Thus, we further propose a progressive group size (PGS) paradigm and an amplitude cooling layer (ACL) to alleviate the two issues, respectively. The CrossFormer incorporating with PGS and ACL is called CrossFormer++. Extensive experiments show that CrossFormer++ outperforms the other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks.
ISSN:	0162-8828 1939-3539 2160-9292
DOI:	10.1109/TPAMI.2023.3341806