SAVE: Encoding spatial interactions for vision transformers


Bibliographic details
Published in: Image and Vision Computing, 2024-12, Vol. 152, p. 105312, Article 105312
Authors: Ma, Xiao; Zhang, Zetian; Yu, Rong; Ji, Zexuan; Li, Mingchao; Zhang, Yuhan; Chen, Qiang
Format: Article
Language: English
Online access: Full text
Abstract: Transformers have achieved impressive performance in visual tasks. Position encoding, which equips vectors (elements of input tokens, queries, keys, or values) with sequence specificity, effectively compensates for the permutation invariance of transformers. In this work, we first clarify that both position encoding and additional position-specific operations introduce positional information when they participate in self-attention. On this basis, most existing position encoding methods are equivalent to special affine transformations. However, such encodings do not account for interactions between vector contents. We therefore propose Spatial Aggregation Vector Encoding (SAVE), which employs transition matrices to recombine vectors. We design two simple yet effective modes in which each vector serves as an anchor that merges the other vectors. The aggregated vectors control spatial contextual connections by establishing two-dimensional relationships. SAVE is plug-and-play in vision transformers and can be combined with other position encoding methods. Comparative results on three image classification datasets show that the proposed SAVE performs comparably to current position encoding methods, and experiments on detection tasks show that SAVE improves the downstream performance of transformer-based methods. Code is available at https://github.com/maxiao0234/SAVE.

Highlights:
• This study unifies positional information with special affine transformations.
• This study proposes SAVE, a new positional paradigm via matrix transformations.
• This study introduces two 2D spatial modes for vision transformers based on SAVE.
• The method can plug into various transformer structures, enhancing performance.
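The abstract gives no implementation details, but its core idea (each token acting as an anchor that aggregates its 2D spatial neighbors through a transition matrix, producing position-aware vectors before self-attention) can be sketched roughly as below. This is a minimal illustration only: the module name SpatialAggregation, the neighborhood-based construction of the transition matrix, and the per-offset learnable weights are assumptions for illustration, not the authors' exact formulation; see the linked repository for the reference implementation.

```python
import torch
import torch.nn as nn


class SpatialAggregation(nn.Module):
    """Illustrative sketch (not the paper's reference code): recombine token
    vectors with a 2D transition matrix. Each token is an anchor that
    aggregates tokens in its spatial neighborhood; the recombined vectors can
    then be used in place of (or alongside) the plain tokens in attention."""

    def __init__(self, grid_size: int, kernel: int = 3):
        super().__init__()
        n = grid_size * grid_size
        # One learnable weight per relative 2D offset inside the neighborhood.
        self.offset_weight = nn.Parameter(torch.zeros(kernel, kernel))
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
        ), dim=-1).reshape(n, 2)                        # (n, 2) token coordinates
        rel = coords[:, None, :] - coords[None, :, :]   # (n, n, 2) relative offsets
        half = kernel // 2
        inside = (rel.abs() <= half).all(dim=-1)        # (n, n) neighborhood mask
        self.register_buffer("rel_index", (rel + half).clamp(0, kernel - 1))
        self.register_buffer("mask", inside)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim). Build the transition matrix A (n, n),
        # zero outside each anchor's neighborhood, then recombine: x' = A @ x.
        w = self.offset_weight[self.rel_index[..., 0], self.rel_index[..., 1]]
        a = torch.where(self.mask, w, torch.zeros_like(w))
        a = a + torch.eye(a.size(0), device=a.device)   # keep the anchor itself
        a = a / a.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        return a @ x


# Example usage on a 14x14 token grid with 384-dimensional tokens.
tokens = torch.randn(2, 14 * 14, 384)
encoded = SpatialAggregation(grid_size=14)(tokens)
```

A module like this could be applied to the token sequence (or to queries and keys) before an attention block; because it only recombines existing vectors rather than adding a positional embedding, it can coexist with other position encodings, which is consistent with the plug-and-play claim in the abstract.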
ISSN: 0262-8856
DOI: 10.1016/j.imavis.2024.105312